Attention

For implementations, see: https://github.com/uzaymacar/attention-mechanisms

The key/value/query formulation of attention is from the paper Attention Is All You Need.

How should one understand the queries, keys, and values

The key/value/query concepts come from retrieval systems. For example, when you type a query to search for some video on YouTube, the search engine maps your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then presents you with the best-matched videos (values).

The attention operation, it turns out, can also be thought of as a retrieval process, so the key/value/query concepts apply here as well. (The example above is just a toy system for illustration; in practice, search engines and recommendation systems are much more complex.)

As described in the paper Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2015), attention is by definition just a weighted average of values,

\[c=\sum_{j}\alpha_j h_j,\]

where \(\sum_j \alpha_j=1\).

If we restrict \(\alpha\) to be a one-hot vector, this operation becomes the same as retrieving from a set of elements \(h\) with index \(\alpha\). With the restriction removed, the attention operation can be thought of as doing “proportional retrieval” according to the probability vector \(\alpha\).
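
As a minimal numpy sketch of this view (the values \(h\) and the weights \(\alpha\) below are made up for illustration): a one-hot \(\alpha\) reduces the weighted sum to plain retrieval of a single element, while a soft \(\alpha\) yields a proportional mix of the elements.

import numpy as np

# a made-up set of values h (one vector per element)
h = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 1.0, 0.0, 3.0],
              [0.0, 0.0, 1.0, 4.0]])

# one-hot alpha: the weighted sum reduces to plain retrieval of h[1]
alpha_onehot = np.array([0.0, 1.0, 0.0])
print(alpha_onehot @ h)  # [0. 1. 0. 3.] == h[1]

# soft alpha (sums to 1): "proportional retrieval", a convex mix of the rows
alpha_soft = np.array([0.1, 0.7, 0.2])
print(alpha_soft @ h)  # [0.1 0.7 0.2 3.1]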

It should be clear that \(h\) in this context is the value. The difference between the two papers lies in how the probability vector \(\alpha\) is calculated. The first paper (Bahdanau et al. 2015) computes the score through a neural network

\[e_{ij}=a(s_i,h_j), \qquad \alpha_{i,j}=\frac{\exp(e_{ij})}{\sum_k\exp(e_{ik})},\]

where \(h_j\) is from the encoder sequence and \(s_i\) is from the decoder sequence. One problem with this approach is that, if the encoder sequence has length \(m\) and the decoder sequence has length \(n\), we have to run the network \(m \times n\) times to obtain all the attention scores \(e_{ij}\).
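
Here is a rough sketch of this first formulation; the sizes and the toy alignment network a below are made up for illustration. The point is that the score network must be evaluated once per (decoder step, encoder step) pair, i.e. \(m \times n\) times.

import numpy as np

rng = np.random.default_rng(0)

m, n, d, d_att = 5, 3, 4, 8      # made-up encoder length, decoder length, hidden sizes
h = rng.normal(size=(m, d))      # encoder states h_j
s = rng.normal(size=(n, d))      # decoder states s_i

# a toy alignment network a(s_i, h_j): one hidden layer, scalar output
W_s = rng.normal(size=(d, d_att))
W_h = rng.normal(size=(d, d_att))
v = rng.normal(size=(d_att,))

def a(s_i, h_j):
    return v @ np.tanh(s_i @ W_s + h_j @ W_h)

# the network is evaluated once per (i, j) pair, i.e. m * n times in total
e = np.array([[a(s[i], h[j]) for j in range(m)] for i in range(n)])

# softmax over the encoder positions gives the attention weights alpha_{i,j}
alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
c = alpha @ h                    # one context vector per decoder step
print(c.shape)                   # (3, 4)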

A more efficient model would be to first project \(s\) and \(h\) onto a common space, then choose a similarity measure (e.g. dot product) as the attention score, like

\[e_{ij}=f(s_i)\,g(h_j)^T,\]

so we only have to compute \(g(h_j)\) \(m\) times and \(f(s_i)\) \(n\) times to get the projection vectors, and \(e_{ij}\) can then be computed efficiently by matrix multiplication.

This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)
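
For comparison, a sketch of the projected dot-product version with the same made-up sizes: the projections \(f(s_i)\) and \(g(h_j)\) are computed once per sequence element, and all \(m \times n\) scores then come from a single matrix multiplication.

import numpy as np

rng = np.random.default_rng(0)

m, n, d, d_k = 5, 3, 4, 8        # made-up sizes
h = rng.normal(size=(m, d))      # encoder states
s = rng.normal(size=(n, d))      # decoder states

# linear projections f and g onto a common d_k-dimensional space
W_f = rng.normal(size=(d, d_k))
W_g = rng.normal(size=(d, d_k))

Q = s @ W_f                      # f(s_i), computed n times -> the "queries"
K = h @ W_g                      # g(h_j), computed m times -> the "keys"

# all m * n scores from a single matrix multiplication
e = Q @ K.T
print(e.shape)                   # (3, 5)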

How are the queries, keys, and values obtained

The proposed multi-head attention operation alone doesn’t say much about how the queries, keys, and values are obtained; they can come from different sources depending on the application scenario.

\[\begin{split}\begin{align*}\text{MultiHead}(Q, K, V) & = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^{O} \\ \text{where head}_i & = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\end{align*}\end{split}\]

where the projections are parameter matrices:
\[\begin{split}\begin{align*} W_i^Q & \in \mathbb{R}^{d_\text{model} \times d_k}, \\ W_i^K & \in \mathbb{R}^{d_\text{model} \times d_k}, \\ W_i^V & \in \mathbb{R}^{d_\text{model} \times d_v}, \\ W^O & \in \mathbb{R}^{hd_v \times d_{\text{model}}}. \end{align*}\end{split}\]
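
Below is a minimal numpy sketch of the multi-head formula above, with small made-up dimensions and random projection matrices, just to make the shapes concrete.

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)

d_model, d_k, d_v, heads = 8, 4, 4, 2    # made-up dimensions
n_q, n_kv = 3, 5                         # query / key-value sequence lengths

Q = rng.normal(size=(n_q, d_model))
K = rng.normal(size=(n_kv, d_model))
V = rng.normal(size=(n_kv, d_model))

# per-head projections W_i^Q, W_i^K, W_i^V and the output projection W^O
W_Q = rng.normal(size=(heads, d_model, d_k))
W_K = rng.normal(size=(heads, d_model, d_k))
W_V = rng.normal(size=(heads, d_model, d_v))
W_O = rng.normal(size=(heads * d_v, d_model))

def attention(q, k, v):
    # scaled dot-product attention
    w = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    return w @ v

# each head attends in its own projected subspace
heads_out = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(heads)]

# concatenate the heads and project back to d_model
out = np.concatenate(heads_out, axis=-1) @ W_O
print(out.shape)                         # (3, 8)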

For unsupervised language model training such as GPT, \(Q, K, V\) usually come from the same source, so the operation is also called self-attention.

For the machine translation task in the second paper, the model first applies self-attention separately to the source and target sequences, then on top of that applies another attention in which \(Q\) comes from the target sequence and \(K, V\) come from the source sequence.

For recommendation systems, \(Q\) can come from the target items, while \(K, V\) can come from the user profile and history.
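
To make these scenarios concrete, here is a small sketch (random placeholder states, with the Q/K/V projections omitted) of how the same attention operation is fed in the self-attention and cross-attention cases.

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
d_model = 4

def attend(Q, K, V):
    # scaled dot-product attention
    w = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return w @ V

source = rng.normal(size=(5, d_model))   # e.g. encoder states or user history
target = rng.normal(size=(3, d_model))   # e.g. decoder states or target items

# self-attention: Q, K, V all come from the same sequence
self_att = attend(source, source, source)    # shape (5, 4)

# cross-attention: Q from the target sequence, K and V from the source sequence
cross_att = attend(target, source, source)   # shape (3, 4)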

[1]:
from numpy import array
from numpy import random
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax over the scores scaled by sqrt(d_k)
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)
[[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]