Scaled Dot-Product Attention is a specific type of Self-Attention Mechanism used in the Transformer Architecture, where the Attention Scores are scaled by the inverse square root of the dimension of the keys ($d_k$) before applying the Softmax Normalization.
This scaling is critical because the dot products grow with the key dimension: if the components of the queries and keys have roughly unit variance, their dot product has variance on the order of $d_k$. Without scaling, these large scores push the softmax function into regions with extremely small gradients (vanishing gradients), making optimization difficult.
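The effect is easy to demonstrate: for random high-dimensional vectors, the unscaled dot products produce a nearly one-hot softmax, while the scaled scores keep the distribution diffuse. A minimal sketch (the dimension, seed, and tensor shapes are illustrative):

import torch

torch.manual_seed(0)
d_k = 512                             # illustrative key dimension

# Random query/key vectors whose components have roughly unit variance
q = torch.randn(1, d_k)
k = torch.randn(8, d_k)

scores = q @ k.T                      # unscaled dot products, variance ~ d_k
scaled = scores / d_k**0.5            # rescaled to roughly unit variance

print(scores.std())                   # large spread, on the order of sqrt(d_k)
print(torch.softmax(scores, dim=-1))  # nearly one-hot -> tiny gradients
print(torch.softmax(scaled, dim=-1))  # noticeably smoother distribution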
Mechanism
The process involves:
1. Computing Attention Scores by taking the dot product of the queries and keys (which are obtained from the input as sketched after this list).
2. Scaling these scores by $\frac{1}{\sqrt{d_k}}$.
3. Applying Softmax to obtain Attention Weights.
4. Computing the Context Vector as the weighted sum of the Values.
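The queries, keys, and values are themselves produced by projecting the input embeddings with learned weight matrices. Below is a minimal sketch of this projection step; the dimensions, seed, and variable names are illustrative, and the weights are randomly initialized rather than learned:

import torch

torch.manual_seed(123)

d_in, d_out = 6, 4              # illustrative embedding / head dimensions
inputs = torch.randn(5, d_in)   # five token embeddings

# Learnable projection matrices (randomly initialized here for illustration)
W_q = torch.nn.Parameter(torch.rand(d_in, d_out))
W_k = torch.nn.Parameter(torch.rand(d_in, d_out))
W_v = torch.nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_q
keys = inputs @ W_k
values = inputs @ W_v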
Mathematical Formulation
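With queries $Q$, keys $K$, values $V$, and key dimension $d_k$, the four steps above collapse into a single expression:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$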
Sample Code
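Putting the steps together, using the queries, keys, and values computed in the projection sketch above: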
# Calculate attention scores
attn_scores = queries @ keys.T
# Scale and apply softmax to get attention weights
attn_weights = torch.softmax(
attn_scores / keys.shape[-1]**0.5, dim=-1
)
# Compute context vector
context_vec = attn_weights @ values
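For comparison, PyTorch 2.0+ also provides a fused built-in, torch.nn.functional.scaled_dot_product_attention, which applies the same $\frac{1}{\sqrt{d_k}}$ scaling by default. The manual result above should match it up to numerical precision; a sketch, assuming the tensors defined earlier:

import torch.nn.functional as F

# Built-in equivalent (PyTorch >= 2.0): same scaling, no mask, no dropout.
# A batch dimension is added because the built-in expects batched inputs.
builtin = F.scaled_dot_product_attention(
    queries.unsqueeze(0), keys.unsqueeze(0), values.unsqueeze(0)
).squeeze(0)
print(torch.allclose(context_vec, builtin, atol=1e-5))  # expected: True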
