Softmax Function

For the final output layer, especially in classification tasks (like predicting the next token in an LLM), we want probabilities. We want to know: “What is the % chance that this token is next?”

The Softmax function takes raw numbers (logits) and converts them into a probability distribution that sums to 1.

\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

“Softmax of z sub i equals e to the power of z sub i, divided by the sum from j equals 1 to K of e to the power of z sub j.”
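As a minimal sketch of how the formula translates to code (using NumPy; the function name and example logits here are just for illustration):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert raw logits z into a probability distribution."""
    exp_z = np.exp(z)           # e^{z_j} for every score
    return exp_z / exp_z.sum()  # normalize so the outputs sum to 1

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```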


Softmax normalization is a mathematical function often used in machine learning (including in the attention mechanism) to convert a vector of raw scores (logits) into a probability distribution.

Why Use Softmax for Attention?

  1. Interpretability: It ensures all values sum to 1.0, allowing us to interpret them as probabilities or attention percentages.
  2. Handling Extreme Values: Compared to simple normalization (dividing each score by the sum), Softmax handles extreme values better.
    • If one score is significantly higher (e.g., 400 vs 1, 2, 3), simple normalization still gives non-zero weight to the small values (≈0.0025 each) and not quite 1.0 to the large one (≈0.985).
    • Softmax pushes the large value very close to 1.0 and the others very close to 0.0, which is often what we want when focusing attention.
  3. Numerical Stability: Implementations (like PyTorch’s) often subtract the maximum value before exponentiation (e^{z_i - \max_j z_j}) to prevent overflow errors with large numbers while mathematically yielding the same result. The sketch below demonstrates both this trick and the extreme-value behavior from point 2.
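
A hedged sketch of both points, again in NumPy (the scores mirror the 400 vs 1, 2, 3 example above; the function name is illustrative):

```python
import numpy as np

def softmax_stable(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max before exponentiating.

    e^{z_i - max} / sum_j e^{z_j - max} equals plain softmax exactly,
    but keeps every exponent <= 0, so np.exp cannot overflow.
    """
    shifted = z - z.max()
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

scores = np.array([400.0, 1.0, 2.0, 3.0])

# Simple normalization: divide each raw score by the sum (406 here).
print(scores / scores.sum())
# [0.985 0.0025 0.0049 0.0074] -- the small values keep visible weight

# Softmax: the large score takes essentially all the probability mass.
print(softmax_stable(scores))
# [1.e+000 5.e-174 1.e-173 4.e-173] -- effectively [1, 0, 0, 0]

# Without the max subtraction, big scores overflow float64:
# np.exp(1000.0)  ->  inf (with a RuntimeWarning)
```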

In Classification Output

In a classification neural network, the Softmax function is applied to the output layer so that the neuron outputs can be interpreted as confidence scores or probabilities for each class.
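As a small illustration (the logits below are invented for a hypothetical 4-class classifier, but `torch.nn.functional.softmax` is PyTorch’s actual softmax), turning an output layer into per-class probabilities looks like:

```python
import torch
import torch.nn.functional as F

# Suppose the final linear layer of a 4-class classifier produced
# these logits for one input (values made up for illustration).
logits = torch.tensor([2.1, -0.5, 0.8, 0.3])

probs = F.softmax(logits, dim=-1)  # PyTorch's built-in softmax
print(probs)                 # ~tensor([0.6613, 0.0491, 0.1802, 0.1093])
print(probs.sum())           # tensor(1.)
print(probs.argmax().item()) # 0 -> the predicted class
```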
