Details of how this works on the whiteboard
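The formula itself is not reproduced in these notes (it was presumably worked out on the whiteboard); assuming it is the standard scaled dot-product attention, it reads:

$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$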
where:
Self-attention mechanism. Source: Jeremy Jordan.
Element | Intuition | How it appears in practice |
---|---|---|
Queries (Q) | A question: each word in the sentence is “asking a question” about which other words it wants to hear from. | A vector representing the word itself, produced by multiplying its embedding by the matrix $W^Q$ (see the sketch after this table). |
Keys (K) | Keys to a cabinet: the other words carry “keys” that can be compared with the question. If a key matches the question, it “opens” the door to the relevant information. | A vector produced from the same word, but using $W^K$. |
Values (V) | Content stored behind the doors: when a door opens, what lies inside is the information that word wants to pass back in answer to the question. | A vector resulting from multiplying the embedding by $W^V$. |
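As a concrete illustration of the table above (and of the two steps described in the excerpt below), here is a minimal NumPy sketch of single-head self-attention. The names `W_Q`, `W_K`, `W_V`, the random initialization, and the two-token input are illustrative assumptions; only the 512/64 dimensions come from the excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 2 tokens, embedding size 512, projected down to 64 dimensions
# (dimensions taken from the excerpt below; values here are random placeholders).
d_model, d_k = 512, 64
X = rng.normal(size=(2, d_model))           # one embedding per token

# The three learned projection matrices (random here, just to show the shapes).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Step 1: create a Query, Key, and Value vector for every token.
Q = X @ W_Q                                 # shape (2, 64)
K = X @ W_K
V = X @ W_V

# Step 2: score every token against every other token with dot products
# (scores[0, 1] is q1 · k2), scale by sqrt(d_k), and normalize with softmax.
scores = Q @ K.T / np.sqrt(d_k)             # shape (2, 2)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output row is a weighted sum of the Value vectors.
output = weights @ V                        # shape (2, 64)
print(weights.round(3))
print(output.shape)
```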
Multi-Head Attention. Source: Jeremy Jordan.
# Self-Attention

Source: https://jalammar.github.io/illustrated-transformer/

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They don’t HAVE to be smaller; this is an architecture choice to make the computation of multi-headed attention (mostly) constant.

What are the “query”, “key”, and “value” vectors? They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

---

# Multi-head Attention
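This heading is not developed further in these notes. As a rough sketch of what it refers to: the model runs several attention “heads” in parallel, each with its own projection matrices, then concatenates their outputs and projects back to the model dimension. The code below is an illustrative assumption (8 heads, the 512/64 sizes from the excerpt, random weights), not the original notes’ implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=8, d_k=64, seed=0):
    """Toy multi-head self-attention: one (W_Q, W_K, W_V) triple per head,
    head outputs concatenated and projected back to the model dimension."""
    rng = np.random.default_rng(seed)
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(num_heads):
        W_Q = rng.normal(size=(d_model, d_k))
        W_K = rng.normal(size=(d_model, d_k))
        W_V = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        head_outputs.append(weights @ V)               # (tokens, d_k)
    concat = np.concatenate(head_outputs, axis=-1)     # (tokens, num_heads * d_k)
    W_O = rng.normal(size=(num_heads * d_k, d_model))  # final output projection
    return concat @ W_O                                # (tokens, d_model)

X = np.random.default_rng(1).normal(size=(2, 512))     # two token embeddings
print(multi_head_attention(X).shape)                   # (2, 512)
```

In practice the heads are computed with one batched matrix multiplication rather than a Python loop; the loop here is only for readability.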