Answer Key for Deep Learning Exam (Autumn 2017)
1. A - This is the "most obvious" answer, since it involves complex images (for
convolution) and sequences of those images (for recurrence). The caption production
adds a further need for recurrence. These three factors make it a more obvious
answer than three-dimensional checkers, which does involve complex patterns, but the
neural net itself would not normally process a long sequence of checker boards in one
pass, and no captions are needed.
2. Accuracy = 0.28
3. Precision = 0.11
4. Recall = 0.9
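The confusion-matrix counts behind answers 2-4 are not reproduced in this key, so here
is a minimal sketch of how the three metrics are defined, using hypothetical counts
(NOT the exam's data) chosen only to illustrate the formulas:

```python
# Hypothetical confusion-matrix counts (NOT the exam's data).
tp, fp, fn, tn = 9, 72, 1, 18

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                  # fraction of positive predictions that are correct
recall = tp / (tp + fn)                     # fraction of actual positives that are found
```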
5. G (all are true except for B)
6. C - Self-training from scratch is the key innovation of AlphaGo Zero.
7. 1 million
8. 50,000 - The key point here is that the gradients are computed for EVERY element of
the minibatch, but the weights are only updated at the end of each minibatch. Hence,
with a minibatch size of 20, there are 20 times more gradient calculations than
weight updates.
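The arithmetic connecting answers 7 and 8 is a one-liner: with 1 million gradient
computations (answer 7) and one weight update per minibatch of 20, we get the
update count in answer 8.

```python
gradient_computations = 1_000_000  # one gradient calculation per example processed
minibatch_size = 20
# Weights are updated only once per minibatch, so there are 20x fewer updates.
weight_updates = gradient_computations // minibatch_size
```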
9. A - (B,C and D are all false)
10. D - (the word "Jacobian" was in nearly every sentence of my lectures).
11. D - (all others are significant contributors)
12. G - It is the sigmoid activation for V, not the hyperbolic tangent for U, that is
important here, and its derivative is v(1 - v).
13. C - Again, it's V's activation function that matters here.
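The v(1 - v) shortcut used in answers 12 and 13 can be verified numerically: for a
sigmoid output v, the derivative with respect to the input equals v(1 - v). A quick
check against a finite difference (the test point x = 0.7 is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative_from_output(v):
    # For v = sigmoid(x), dv/dx = v * (1 - v); no need to know x itself.
    return v * (1.0 - v)

# Compare against a central finite difference at an arbitrary point.
x = 0.7
v = sigmoid(x)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
```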
14. I (options A and D are true)
15. E - This is the standard approach to introducing momentum into backpropagation, and
the Adam optimizer uses a (slightly more complex) approach to incorporating momentum.
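A sketch of the classical momentum update referred to in answer 15 (the learning
rate and momentum coefficient below are illustrative, not the lecture's values):

```python
def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    # Classical momentum: the velocity is a decaying sum of past gradient steps,
    # and the weight moves along the velocity rather than the raw gradient.
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Two steps on a constant gradient: the second step is larger because the
# velocity has accumulated.
w, v = 1.0, 0.0
w, v = momentum_step(w, grad=0.5, velocity=v)
w, v = momentum_step(w, grad=0.5, velocity=v)
```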
16. B - Given that the words have different SEMANTICS (i.e. meaning), a good embedding
will have good separation between the hidden vectors. The vectors in B have much
better separation than in the other collections.
17. F - All of these statements are true.
18. 0.00054 = -1 x (0.2) x (0.3)(0.9)(1 - 0.9)(0.9 - 1) - This is a lot simpler
than it looks, since we're only concerned with ONE weight and its effect upon the
error. Thus, only the first values in the X, Y and T vectors are required to calculate
the derivative, and hence the change in w.
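Spelling out the arithmetic in answer 18 (symbols follow the expression above:
learning rate 0.2, x = 0.3, y = 0.9, t = 1; the factor y(1 - y) is the sigmoid
derivative at the output, and (y - t) is the error term):

```python
lr = 0.2                 # learning rate
x, y, t = 0.3, 0.9, 1.0  # first elements of the X, Y, and T vectors
delta_w = -1 * lr * x * y * (1 - y) * (y - t)
# delta_w is 0.00054 (up to floating-point rounding)
```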
19. A - This is the key link between RL and Deep Learning.
20. H (A, C and D are all significant causes of vanishing gradients)
21. 39 (40 is an acceptable answer as well)
22. 39 (40 is an acceptable answer as well) - In both of these problems, the relevant
weights are those going OUT of H(k) and those going IN to H(i). There are 20 weights
in each of those two groups, so 40 in total. However, since this is a
recurrent net, one of the output weights of H(k) is also the input weight to H(i), so
there is one duplicate in the two groups of 20, giving a total of 39 different weights.
Whether i and k are equal or not does not affect this number.
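The counting argument in answers 21 and 22 reduces to inclusion-exclusion over the
two overlapping groups of weights:

```python
fan_out_Hk = 20  # weights going OUT of H(k), per the explanation above
fan_in_Hi = 20   # weights going IN to H(i)
duplicates = 1   # the recurrent weight from H(k) to H(i) appears in both groups
distinct_weights = fan_out_Hk + fan_in_Hi - duplicates
```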
***** Final Note *****
Although I've done my best to be objective in writing and grading these
questions, there is always some "wiggle room" when it comes to multiple-choice
tests. There are many arguments (some strange, but many reasonable) for alternate
answers for these questions based on different interpretations of the questions and
the answer options.
My best "solution" to this problem was to give you 22 questions, but you only needed to
answer 20 of them correctly to get a "perfect" score. As it turned out, the exam was
a bit tougher than I had assumed, so the "perfect" score ended up being 17, even
though a few (very sharp) students scored 18 or 19, and one even scored a 20. But
all final grades on this exam were based on 17 as the "gold standard". Your total
points, P, were then calculated from your number of correct answers, C, using the
following equation: P / 100 = C / 17. Any fractions for P were rounded up.
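The grading formula above can be sketched as follows; whether scores above 17 were
capped at 100 points is not stated in this note, so the cap below is an assumption:

```python
import math

def exam_points(correct, gold_standard=17):
    # P / 100 = C / 17, with any fraction for P rounded up.
    # Capping at 100 for scores above the gold standard is an assumption.
    return min(100, math.ceil(100 * correct / gold_standard))
```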