Crowdly

Додати до Chrome

Університети
elearning.aua.am
Reinforcement Learning - Fall 2025

Reinforcement Learning - Fall 2025

Шукаєте відповіді та рішення тестів для Reinforcement Learning - Fall 2025? Перегляньте нашу велику колекцію перевірених відповідей для Reinforcement Learning - Fall 2025 в elearning.aua.am.

Отримайте миттєвий доступ до точних відповідей та детальних пояснень для питань вашого курсу. Наша платформа, створена спільнотою, допомагає студентам досягати успіху!

What is the main idea behind multi-step bootstrapping in Reinforcement Learning?

To always use the next reward as the estimate for future returns

To interpolate between using a single-step TD update and the full Monte Carlo return

To update the value function using the entire return of an episode

To update the policy after every action

Переглянути це питання

TD(0) can be used for the solution of

Control task

Planning task

both Prediction and Control tasks

Prediction task

Переглянути це питання

Which one of these is a key feature of TD Learning?

It requires a model of the environment

It updates estimates based on observed transitions

It only works in episodic tasks

It can only be applied to deterministic environments

Переглянути це питання

Consider the undiscounted, episodic MDP below. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The right half of the figure shows the value of each state under the equiprobable random policy. If π is the equiprobable random policy, what is q(5, down)?

Grid Example

-20

-15

-14

-21

Переглянути це питання

-23

-25

-22

-24

-26

Переглянути це питання

When it is not possible to determine a policy that is greedy with respect to the value functions v_π, q_π(Select all that apply).

When action values q_πare available but no model is available

When state values v_πare available but no model is available

When action values q_πand a model are available

When state values v_πand model are available

Переглянути це питання

Which of the following is a requirement on the behavior policy b for using off-policy Monte Carlo policy evaluation? This is called the assumption of coverage.

For each state s and action a, if b(a∣s)>0 then π(a∣s)>0

❌

All actions have non-zero probabilities under π

❌

For each state s and action a, if π(a∣s)>0 then b(a∣s)>0

✅

Переглянути це питання

When does Monte Carlo prediction perform its first update?

After every state is visited at least once

❌

After the first time step

❌

All of the answers are correct

❌

At the end of the first episode

✅

Переглянути це питання

After 99 episodes, the estimated value of the state s is 5.8. For the next episode, for the state s the agent receives a Return G₁₀₀=7. What will be the new estimate of the state value of the state s?

Переглянути це питання

Which approach can not find an optimal deterministic policy? (Select all that apply)

Off-policy learning with an ε-soft behavior policy and a deterministic target policy

ε-greedy exploration

Exploring Starts

Переглянути це питання

Попередня
1
2
3
4
5
6
Наступна

Хочете миттєвий доступ до всіх перевірених відповідей на elearning.aua.am?

Отримайте необмежений доступ до відповідей на екзаменаційні питання - встановіть розширення Crowdly зараз!

Додати до Chrome

Telegram Instagram TikTok Question Bank

Умови використання Зв'яжіться з нами

Додати до Chrome