Attention Scores are intermediate scalar values in the Attention Mechanism that quantify the alignment or importance of an input token relative to a specific Query token.
Calculation
Simplified (No Weights)
In the simplest form, the attention score is calculated as the Dot Product between the embedding vector of the query token and the embedding vector of the input token being attended to.
- Formula: $\omega_{q,i} = x^{(q)} \cdot x^{(i)}$, where $x^{(q)}$ is the embedding of the query token and $x^{(i)}$ is the embedding of input token $i$.
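As a minimal sketch of this step (using PyTorch, with made-up embedding values and token labels purely for illustration), the scores for a single query token are just dot products against every input embedding:

```python
import torch

# Toy embeddings for a 4-token sequence, embedding dim = 3 (values are illustrative).
inputs = torch.tensor([
    [0.43, 0.15, 0.89],   # "Your"
    [0.55, 0.87, 0.66],   # "journey"
    [0.57, 0.85, 0.64],   # "starts"
    [0.22, 0.58, 0.33],   # "with"
])

query = inputs[1]  # use "journey" as the query token

# Attention score for each input token = dot product with the query embedding.
attn_scores = inputs @ query   # shape: (4,), one score per input token
print(attn_scores)
```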
Trainable Self-Attention
In modern transformers with Trainable Self-Attention, attention scores are computed by taking the dot product between the Query vector and the Key vector for each token pair.
- Formula: $\omega_{q,i} = q^{(q)} \cdot k^{(i)}$, or in matrix form $\Omega = QK^{\top}$, where $Q = XW_Q$ and $K = XW_K$ are the projected query and key matrices.
- This operation efficiently computes the alignment between all queries and all keys simultaneously.
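A short sketch of the same computation with trainable projection matrices; the dimensions, random weights, and names (W_query, W_key) are illustrative assumptions rather than a reference implementation:

```python
import torch

torch.manual_seed(0)

d_in, d_out = 3, 2
W_query = torch.nn.Parameter(torch.rand(d_in, d_out))  # trainable query projection
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out))  # trainable key projection

inputs = torch.rand(4, d_in)       # 4 tokens, embedding dim 3 (placeholder values)

queries = inputs @ W_query         # (4, d_out)
keys    = inputs @ W_key           # (4, d_out)

# All pairwise attention scores in one matrix multiplication: Q @ K^T -> (4, 4)
attn_scores = queries @ keys.T
print(attn_scores.shape)           # torch.Size([4, 4])
```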
Interpretation
Higher scores indicate higher similarity or “alignment” between the vectors. For example, if the query is “Journey”, a high score with “Step” implies “Journey” should attend strongly to “Step”.
However, raw attention scores are not directly interpretable as probabilities because they can take any real value (positive, negative, large, small) and do not sum to one. They must be normalized into Attention Weights.
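For illustration, a sketch of the usual normalization step: a softmax over each row of scores, here scaled by the square root of the key dimension as is standard in transformer attention (the score values and d_k below are arbitrary assumptions):

```python
import torch

# Arbitrary raw attention scores for a 3-token sequence.
attn_scores = torch.tensor([[ 1.2, -0.3,  0.8],
                            [ 0.1,  2.0, -1.5],
                            [-0.7,  0.4,  0.9]])

d_k = 2  # key dimension in this toy example

# Scaled softmax turns raw scores into attention weights (each row sums to 1).
attn_weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)

print(attn_weights)
print(attn_weights.sum(dim=-1))  # tensor([1., 1., 1.])
```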
