365.221/2/4/6/7/8/9/30/56/67/81/325/326/348/349, UE Hands-on AI II, Rainer Dangl et al., 2026S (moodle.jku.at): test questions
What does one row of a self-attention matrix represent?
In self-attention, what happens after the attention weights have been computed?
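As a rough illustration of the two questions above, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the queries, keys and values are all the raw input vectors (no learned projections), and the toy input `X` is made up for the example. Each row of `weights` holds one token's attention distribution over all tokens, and the step after computing the weights is a weighted sum of the value vectors.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # One row: how strongly this token attends to each token (sums to 1).
        weights.append(softmax(scores))
    outputs = []
    for row in weights:
        # After the weights are computed: weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return weights, outputs

# Toy input (assumption): three tokens with two-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```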
Switch to probability sampling and set the temperature to 3.0. What output do you get now? Add the text here. Why do you think it looks like this?
Also explain the temperature parameter and how it affects the probability distribution over the next generated token. What does a high or low temperature mean?
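The effect of temperature can be sketched with a softmax over temperature-scaled logits. The logits below are made-up values, not output from the course tool, and the function name is hypothetical. Dividing the logits by a temperature below 1 sharpens the distribution, while a temperature such as 3.0 flattens it, so unlikely tokens get sampled more often.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply a stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
plain = softmax_with_temperature(logits, 1.0)  # unchanged softmax
flat = softmax_with_temperature(logits, 3.0)   # high temperature: closer to uniform

for name, probs in (("T=0.5", sharp), ("T=1.0", plain), ("T=3.0", flat)):
    print(name, [round(p, 3) for p in probs])
```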
Set the following training hyperparameters:
Interpret your results:
Generate some text.
Generate three texts (probabilistic sampling, top-1 and top-k). List the three texts here. How do the texts differ? Do they capture the style and tone of Shakespearean language in your opinion? Which version manages to do that best?
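The three generation modes can be sketched as three ways of picking an index from a next-token distribution. The probabilities and function names below are illustrative assumptions and not the course tool's implementation. Probabilistic sampling draws from the full distribution, top-1 always takes the most likely token, and top-k samples only among the k most likely tokens.

```python
import random

def sample_full(probs, rng):
    # Probabilistic sampling: draw from the whole distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_top1(probs):
    # Greedy (top-1): always pick the most probable token.
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_top_k(probs, k, rng):
    # Top-k: keep the k most probable tokens and sample among them.
    # random.choices accepts unnormalized weights, so no renormalization is needed.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]

# Hypothetical next-token probabilities over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)  # fixed seed so runs are repeatable

print("full :", [sample_full(probs, rng) for _ in range(10)])
print("top-1:", [sample_top1(probs) for _ in range(10)])
print("top-2:", [sample_top_k(probs, 2, rng) for _ in range(10)])
```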
Switch to top-1 (greedy) generation and generate the text again. Then check out the next token probabilities. It is highly likely that the next predicted token after Romeo: is the <eos> token. Why do you think that is?
Enter just the word gleefully in the tokenization textbox. Which token(s) do you get? Why do you think this word is processed in this way?
Elaborate on the tradeoff between dictionary size and sequence length across tokenization schemes (word/char/BPE-subword level).
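One side of this tradeoff can be made concrete with a small sketch that tokenizes a made-up sentence at word and character level. The sentence is an assumption chosen for the example. On a text this short the vocabulary sizes are not representative: a character vocabulary stays bounded by the alphabet, while a word vocabulary keeps growing with the corpus. The sequence lengths, however, already show the other side of the tradeoff, and BPE-style subword schemes typically land between the two.

```python
# Toy text (assumption): lowercase letters and spaces only.
text = "to be or not to be that is the question"

word_tokens = text.split()  # word level: few tokens per text
char_tokens = list(text)    # character level: many tokens per text

word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

print("word level: sequence length", len(word_tokens), "vocabulary", len(word_vocab))
print("char level: sequence length", len(char_tokens), "vocabulary", len(char_vocab))
```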
Now work with the LSTM - Simple (custom forget bias) preset.
Train two models:
Note: don't forget to click on 'Apply Changes' when you modify the preset in the architecture editor.
Train for 20 epochs with a learning rate of 0.01.
Examine the development of the loss/accuracy values over the epochs. How does the initial forget gate bias affect the training here?
Keep in mind that here we only set an initial bias; gradient computation is not disabled, so in both cases the bias parameter will adapt during training.
In your analysis, also consider the gradient magnitude plots for both models. Show the gradients at initialization and after training and include both plots here.
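To build some intuition for why the initial forget bias matters, the following back-of-the-envelope sketch may help. It assumes that at initialization the weighted inputs to the forget gate are roughly zero, so the gate value is about sigmoid(bias), and it looks only at the gradient flowing along the cell-state path, which is the product of the forget gate values over time. The bias values and the function name are hypothetical and not taken from the preset.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_path_gradient(forget_bias, steps):
    # Assumption: input and recurrent contributions to the gate are ~0 at
    # initialization, so the forget gate is approximately sigmoid(bias).
    f = sigmoid(forget_bias)
    # Along the cell-state path, d c_T / d c_0 is the product of the gate values.
    return f ** steps

# Hypothetical initial biases, compared over a 20-step sequence.
for bias in (0.0, 1.0, 3.0):
    print("bias", bias,
          "gate", round(sigmoid(bias), 3),
          "gradient factor over 20 steps", cell_path_gradient(bias, 20))
```

Under these assumptions a larger initial bias keeps the gate closer to 1, so less of the gradient is lost over long sequences at the start of training.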
Now choose 'LSTM' as model type and load the preset LSTM Simple (with forget).
Train for 20 epochs and use learning rate = 0.01. Take a look at how the train/validation accuracies develop over the epochs and what the final test accuracy is. Also check out the gradient magnitude plot (initialization + after training, batches to sample = 10).
Do the same for the LSTM Simple (no forget) preset (20 epochs, lr = 0.01).
Elaborate on the differences between the two models. How does disabling the forget gate affect performance and the gradient magnitudes (initialization and after training)?
Attach both plots here.