C5_W4.pdf
https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0#:~:text=The%20output%20of%20the%20first,attend%20on%20the%20decoder's%20input.
Transformers
Quiz
Programming Assignment
Transformer Subclass
What you should remember
- The combination of self-attention and convolutional network layers allows training to be parallelized, which makes training faster.
- Self-attention is calculated using the generated query Q, key K, and value V matrices (a scaled dot-product sketch follows this list).
- Adding positional encoding to word embeddings is an effective way of including sequence information in self-attention calculations (a positional-encoding sketch also follows the list).
- Multi-head attention can help detect multiple features in your sentence.
- Masking stops the model from 'looking ahead' during training, and from giving too much weight to padding zeros when processing padded (zero-filled) sentences; masking is included in the attention sketch below.
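
The following is a minimal NumPy sketch of scaled dot-product self-attention with an optional mask. It is illustrative only and not the assignment's TensorFlow implementation; the function name, argument shapes, and the mask convention (1 marks a position to hide) are assumptions made here.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k) + mask) V, computed with NumPy.

    q, k: (..., seq_len, d_k); v: (..., seq_len, d_v).
    mask: broadcastable to (..., len_q, len_k); 1 marks positions to hide.
    """
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)        # (..., len_q, len_k)
    if mask is not None:
        scores = scores + mask * -1e9                      # hidden positions get ~0 weight
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

# Example: a look-ahead mask hides future tokens during training.
q = k = v = np.random.rand(1, 4, 8)                        # (batch, seq_len, depth)
look_ahead = np.triu(np.ones((4, 4)), k=1)                  # 1 above the diagonal
out, attn = scaled_dot_product_attention(q, k, v, mask=look_ahead)
```

A padding mask is built the same way, by marking positions whose input token id is 0.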
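Positional encodings are typically the sinusoidal functions from "Attention Is All You Need", added to the word embeddings before the first encoder/decoder layer. Below is a small NumPy sketch; the function name and the even/odd sine-cosine layout follow that common convention and are assumed here rather than taken from these notes.

```python
def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / float(d_model))
    angles = pos * angle_rates                               # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # cosine on odd indices
    return pe
```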
Transformer Applications