In reinforcement learning, a policy that achieves the maximum expected cumulative reward is called:

The stochastic policy

The optimal policy

The greedy policy

The deterministic policy

Which of the following describes the interaction in the RL loop?

The agent chooses rewards and the environment selects actions

The environment chooses the next state and action

The agent interacts with the environment by taking actions and receiving rewards

The agent interacts with itself to optimize actions
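The agent-environment loop can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy environment (`Corridor`) with `reset()` and `step(action)` methods; these names are chosen here for illustration and are not taken from any particular library. At each step the agent picks an action, and the environment answers with the next state and a reward.

```python
class Corridor:
    """Hypothetical toy environment: a 1-D corridor with states 0..n-1.

    The agent starts at state 0; reaching state n-1 ends the episode with reward 1.
    """

    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the position
        self.state = min(max(self.state + action, 0), self.n - 1)
        done = self.state == self.n - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode of the agent-environment loop; return (total reward, steps taken)."""
    state = env.reset()
    total_reward = 0.0
    for t in range(1, max_steps + 1):
        action = policy(state)                  # the agent selects an action
        state, reward, done = env.step(action)  # the environment returns next state and reward
        total_reward += reward                  # the agent receives the reward
        if done:
            return total_reward, t
    return total_reward, max_steps


print(run_episode(Corridor(n=5), policy=lambda s: +1))  # (1.0, 4)
```

The `max_steps` cap is only there so that a policy which never reaches the goal (for example, always moving left) still terminates.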

What is the objective of policy iteration in reinforcement learning?

To find an optimal value function

To balance exploration and exploitation

To find the best action for each state

To find an optimal policy
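Policy iteration alternates two steps, policy evaluation (compute the value function of the current policy) and policy improvement (act greedily with respect to that value function), until the policy stops changing. Below is a minimal tabular sketch, assuming a finite MDP whose model is known and stored as `P[s][a]`, a list of `(probability, next_state, reward, done)` tuples. This layout and the 4-state corridor MDP are hypothetical choices made for illustration.

```python
def policy_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Tabular policy iteration.

    P[s][a] is a list of (probability, next_state, reward, done) tuples.
    Returns a deterministic policy (one action index per state) and its value function.
    """
    policy = [0] * n_states
    V = [0.0] * n_states

    def q_value(s, a):
        # Expected one-step return of taking action a in state s, then following V.
        return sum(p * (r + (0.0 if done else gamma * V[s2])) for p, s2, r, done in P[s][a])

    while True:
        # 1. Policy evaluation: sweep until V stops changing for the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = q_value(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in range(n_states):
            q = [q_value(s, a) for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            if q[best] > q[policy[s]] + tol:  # switch only on a clear improvement
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical 4-state corridor: actions 0 = left, 1 = right; state 3 is an absorbing goal.
# Every move costs -1, so the optimal policy reaches the goal as fast as possible.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3
P = []
for s in range(N_STATES):
    row = []
    for a in range(N_ACTIONS):
        if s == GOAL:
            row.append([(1.0, s, 0.0, True)])  # absorbing terminal state
        else:
            s2 = max(s - 1, 0) if a == 0 else s + 1
            row.append([(1.0, s2, -1.0, s2 == GOAL)])
    P.append(row)

policy, V = policy_iteration(P, N_STATES, N_ACTIONS, gamma=0.9)
print(policy)                    # [1, 1, 1, 0]
print([round(v, 2) for v in V])  # [-2.71, -1.9, -1.0, 0.0]
```

Switching actions only on a strict improvement keeps the loop from cycling between equally good actions (as in the goal state, where both actions have value 0).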

Suppose γ = 0.5 and the following sequence of rewards is received: R₁ = −1, R₂ = 2, R₃ = 6, R₄ = 3, and R₅ = 2, with T = 5. What is G₀?

Hint: Work backward.
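Working backward uses the recursion G_t = R_{t+1} + γ·G_{t+1}, starting from G_T = 0. A minimal sketch of that recursion follows; the function name `discounted_return` is chosen here for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 for rewards [R_1, ..., R_T] via G_t = R_{t+1} + gamma * G_{t+1}, G_T = 0."""
    G = 0.0
    for r in reversed(rewards):  # walk backward from R_T to R_1
        G = r + gamma * G
    return G


rewards = [-1, 2, 6, 3, 2]  # R_1 .. R_5 from the question
print(discounted_return(rewards, gamma=0.5))  # 2.0
```

The intermediate values are G₄ = 2, G₃ = 3 + 0.5·2 = 4, G₂ = 6 + 0.5·4 = 8, G₁ = 2 + 0.5·8 = 6, and G₀ = −1 + 0.5·6 = 2.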

Which element in reinforcement learning defines the behavior of the agent?

Model of the environment

Value function

Policy

Reward signal
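A policy maps states to actions, either to a single action (deterministic) or to a probability distribution over actions (stochastic). A minimal sketch, assuming hypothetical state and action names and a plain-dictionary representation chosen only for illustration:

```python
import random

# Hypothetical states and actions, for illustration only.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

stochastic_policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}


def act(policy, state, rng=random):
    """Return the action the policy prescribes in `state`.

    A deterministic policy maps a state to one action; a stochastic policy maps
    a state to a probability distribution over actions, which is sampled here.
    """
    choice = policy[state]
    if isinstance(choice, str):
        return choice
    actions = list(choice)
    return rng.choices(actions, weights=[choice[a] for a in actions], k=1)[0]


print(act(deterministic_policy, "low_battery"))  # recharge
print(act(stochastic_policy, "high_battery", random.Random(42)))
```

Passing a seeded `random.Random` instance makes the stochastic draw reproducible; the second call's result depends on the seed.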

In an MDP, what defines the probability of moving to a new state given a current state and action?

Reward function

Policy

Transition probability

State-value function
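The transition probability p(s′ | s, a) can be stored as a table keyed by the current state and action. This is a minimal sketch, assuming a hypothetical two-state MDP (battery levels "high" and "low") invented for illustration. Note that the sampler consults only the current state and action, not the earlier history, which is exactly the Markov property.

```python
import math
import random

# Hypothetical MDP: P[(s, a)] maps each next state s' to p(s' | s, a).
P = {
    ("high", "search"): {"high": 0.7, "low": 0.3},
    ("high", "wait"): {"high": 1.0},
    ("low", "wait"): {"low": 1.0},
    ("low", "recharge"): {"high": 1.0},
}


def is_valid(P):
    """Each p(. | s, a) must be a probability distribution: non-negative and summing to 1."""
    return all(
        all(p >= 0 for p in dist.values()) and math.isclose(sum(dist.values()), 1.0)
        for dist in P.values()
    )


def sample_next_state(P, state, action, rng=random):
    """Draw s' ~ p(. | s, a); only the current state and action are consulted."""
    dist = P[(state, action)]
    next_states = list(dist)
    return rng.choices(next_states, weights=list(dist.values()), k=1)[0]


print(is_valid(P))                                                # True
print(sample_next_state(P, "low", "recharge", random.Random(0)))  # high
```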

What is a key assumption behind the Markov decision process (MDP) model?

The agent's future state depends on accumulated rewards

The agent has perfect knowledge of future states

The agent's future state depends on the history of all previous states and actions

The agent's future state depends only on its current state and action