Attention Scores

Attention Scores are intermediate scalar values in the Attention Mechanism that quantify the alignment or importance of an input token relative to a specific Query token.

Calculation

Simplified (No Weights)

In the simplest form, the attention score is calculated as the Dot Product between the embedding vector of the query token and the embedding vector of the input token being attended to.
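
A minimal sketch of this computation in PyTorch, assuming toy 3-dimensional embeddings (the token labels and values are illustrative, not taken from any real model):

```python
import torch

# Hypothetical 3-dimensional embeddings for a short input sequence
# (toy values for illustration; real embeddings are learned).
inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
])

query = inputs[1]  # use the embedding of "journey" as the query token

# Attention score of each input token with respect to the query:
# a plain dot product between the two embedding vectors.
scores = inputs @ query  # shape: (3,)
print(scores)
```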

Trainable Self-Attention

In modern transformers with Trainable Self-Attention, attention scores are computed by taking the dot product between the Query vector and the Key vector for each token pair.
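
A sketch of the trainable version, assuming small random projection matrices `W_query` and `W_key` and placeholder token embeddings (dimensions chosen only for illustration):

```python
import torch

torch.manual_seed(0)

d_in, d_k = 3, 2  # input embedding size and query/key dimension (illustrative)
W_query = torch.nn.Parameter(torch.rand(d_in, d_k))
W_key = torch.nn.Parameter(torch.rand(d_in, d_k))

inputs = torch.rand(6, d_in)  # 6 token embeddings (placeholder values)

queries = inputs @ W_query  # shape: (6, d_k)
keys = inputs @ W_key       # shape: (6, d_k)

# Attention scores: dot product of every query with every key.
# Entry [i, j] scores how much token i's query aligns with token j's key.
scores = queries @ keys.T   # shape: (6, 6)
print(scores)
```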

Interpretation

Higher scores indicate greater similarity or “alignment” between the vectors. For example, if the query is “Journey”, a high score with “Step” implies that “Journey” should attend strongly to “Step”.

However, raw attention scores are not directly interpretable as probabilities: they can take any real value (positive or negative, large or small) and do not sum to one. They must therefore be normalized, typically via a softmax function, into Attention Weights.
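
A small self-contained sketch of that normalization step, assuming toy scores and the common 1/sqrt(d_k) scaling used in scaled dot-product attention:

```python
import torch

# Raw attention scores for one query against four keys (toy values).
scores = torch.tensor([0.9, 1.4, -0.3, 0.2])
d_k = 2  # key dimension, used for the common 1/sqrt(d_k) scaling

# Softmax maps arbitrary real scores to non-negative weights that sum to one.
attn_weights = torch.softmax(scores / d_k**0.5, dim=-1)
print(attn_weights, attn_weights.sum())  # weights and their sum (1.0)
```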
