C5_W4.pdf
https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0#:~:text=The%20output%20of%20the%20first,attend%20on%20the%20decoder's%20input.
Transformers
Quiz
Programming Assignment
Transformer Subclass
What you should remember
- The combination of self-attention and convolutional network layers allows training to be parallelized, which makes training faster.
- Self-attention is calculated using the generated query Q, key K, and value V matrices (a scaled dot-product sketch follows this list).
- Adding positional encoding to word embeddings is an effective way of including sequence information in self-attention calculations (a positional-encoding sketch also follows the list).
- Multi-head attention can help detect multiple features in your sentence.
- Masking stops the model from 'looking ahead' during training, and from giving too much weight to padding zeros when processing padded (zero-filled) sentences; masking is included in the attention sketch below.
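
The following is a minimal NumPy sketch of scaled dot-product self-attention with an optional mask. It is illustrative only and not the assignment's TensorFlow implementation; the function name, argument shapes, and the mask convention (1 marks a position to hide) are assumptions made here.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k) + mask) V, computed with NumPy.

    q, k: (..., seq_len, d_k); v: (..., seq_len, d_v).
    mask: broadcastable to (..., len_q, len_k); 1 marks positions to hide.
    """
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)        # (..., len_q, len_k)
    if mask is not None:
        scores = scores + mask * -1e9                      # hidden positions get ~0 weight
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

# Example: a look-ahead mask hides future tokens during training.
q = k = v = np.random.rand(1, 4, 8)                        # (batch, seq_len, depth)
look_ahead = np.triu(np.ones((4, 4)), k=1)                  # 1 above the diagonal
out, attn = scaled_dot_product_attention(q, k, v, mask=look_ahead)
```

A padding mask is built the same way, by marking positions whose input token id is 0.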
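Positional encodings are typically the sinusoidal functions from "Attention Is All You Need", added to the word embeddings before the first encoder/decoder layer. Below is a small NumPy sketch; the function name and the even/odd sine-cosine layout follow that common convention and are assumed here rather than taken from these notes.

```python
def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / float(d_model))
    angles = pos * angle_rates                               # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # cosine on odd indices
    return pe
```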
Transformer Applications