Fundamental Mechanics
Executive Summary
Attention
$\text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Quadratic complexity $O(L^2)$
Scaling Laws
Performance $\propto$ Compute, Data, Params
Chinchilla optimality ($20$ tokens/param)
Decoding
Greedy, Beam, Top-P, Top-K
Precision vs. Diversity trade-off
KV-Cache
Reuse past Keys/Values during inference
Cuts redundant K/V re-computation from $O(L)$ to $O(1)$ per token
1. The Transformer Architecture
The core of modern LLMs (GPT, Llama, Claude).
The Attention Mechanism
Allows the model to focus on relevant context $K$ for a given query $Q$.
Queries ($Q$): What I am looking for.
Keys ($K$): What I contain.
Values ($V$): The information I contribute.
Formula: $Attention(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Why $\sqrt{d_k}$?: To prevent the dot product from growing too large, which would push the Softmax into regions with near-zero gradients.
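A minimal single-head NumPy sketch of this formula (no masking or batching; the shapes and variable names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (L_q, L_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of values

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (3, 4)
```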
Positional Encodings
Since Transformers process tokens in parallel (unlike RNNs), they have no inherent sense of order. Position information is injected explicitly: sinusoidal encodings are added to the input embeddings, while rotary encodings (RoPE) rotate the queries and keys inside attention.
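A sketch of the sinusoidal variant, following the standard "Attention Is All You Need" formulation (RoPE is omitted here; the sequence length and dimension are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(seq_len)[:, None]          # (L, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model)) # (L, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

# Added to the token embeddings before the first Transformer layer
print(sinusoidal_positional_encoding(seq_len=8, d_model=16).shape)  # (8, 16)
```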
2. Training at Scale
Chinchilla Scaling Laws
DeepMind discovered that most models are "under-trained".
Rule of Thumb: For optimal performance, you should scale training data and parameters equally.
Ratio: Approximately 20 tokens per parameter (e.g., a 7B model needs ~140B tokens).
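The rule of thumb is simple arithmetic; a tiny helper makes the 7B → ~140B example explicit (the 20:1 ratio is the Chinchilla heuristic, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

print(f"{chinchilla_optimal_tokens(7e9) / 1e9:.0f}B tokens")  # ~140B for a 7B model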
Tokenization: BPE & WordPiece
LLMs don't read words; they read tokens.
BPE (Byte Pair Encoding): Starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair of symbols into a new token.
Advantage: Handles "out-of-vocabulary" words by breaking them into sub-units (e.g., un-happi-ness).
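A toy sketch of the BPE merge loop described above (character-level only, no byte fallback; the function name and word list are illustrative):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)   # each word as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()                  # apply the merge to every word
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["unhappiness", "happiness", "happy", "unhappy"], num_merges=5)
print(merges)
```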
3. Inference Optimization: KV-Caching
In autoregressive generation, each new token requires re-computing the attention for all previous tokens.
The Problem: $O(L^2)$ complexity.
The Solution: Store the Keys ($K$) and Values ($V$) of previous tokens in memory.
The Result: Only compute $K$ and $V$ for the newest token, cutting the cost of each generation step from $O(L^2)$ to $O(L)$.
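A toy NumPy sketch of the idea: random vectors stand in for a real model's hidden states and a single head replaces a full Transformer, but the cache logic is the same, i.e. compute $K$/$V$ once for the prompt and append one new row per generated token:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 16
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Prompt of 5 tokens (random stand-ins for real hidden states).
prompt = rng.normal(size=(5, d_model))
K_cache, V_cache = prompt @ W_k, prompt @ W_v        # computed once for the prompt

for step in range(3):
    # In a real model this would be the embedding of the token just generated.
    new_token = rng.normal(size=(1, d_model))
    q = new_token @ W_q                              # query for the newest token only
    K_cache = np.vstack([K_cache, new_token @ W_k])  # O(1) work: one new key...
    V_cache = np.vstack([V_cache, new_token @ W_v])  # ...and one new value appended
    scores = q @ K_cache.T / np.sqrt(d_k)            # attend over all cached keys: O(L)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attn_out = weights @ V_cache
    print(f"step {step}: cache length = {K_cache.shape[0]}, output shape = {attn_out.shape}")
```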
Interview Questions
1. "Explain the difference between Encoder-only, Decoder-only, and Encoder-Decoder architectures."
Encoder-only (BERT): Sees whole sentence at once. Best for NLU (Sentiment, NER).
Decoder-only (GPT): Causal masking (sees only past). Best for generation.
Encoder-Decoder (T5): Encoder for input, Decoder for output. Best for Translation or Summarization.
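The causal masking mentioned in the decoder-only answer is just a lower-triangular pattern over the attention scores; a toy sketch:

```python
import numpy as np

L = 5
# Decoder-only models use a causal mask: token i may attend only to positions <= i.
causal_mask = np.tril(np.ones((L, L), dtype=bool))
print(causal_mask.astype(int))
# Positions marked 0 are set to -inf in the attention scores before the Softmax,
# so they receive zero attention weight.
```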
2. "What is Perplexity ($PPL$) and how does it relate to Cross-Entropy?"
Perplexity is the exponentiated cross-entropy loss: $PPL = e^{H(p,q)}$. It represents the "weighted branching factor" of the model. A lower PPL means the model is less "confused" and more certain in its predictions.
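A small sketch of the relationship, using natural-log cross-entropy (the toy probabilities are illustrative):

```python
import numpy as np

def perplexity(probs_of_true_tokens):
    """PPL = exp(mean negative log-likelihood) = exp(cross-entropy in nats)."""
    nll = -np.log(probs_of_true_tokens)   # per-token negative log-likelihood
    return float(np.exp(nll.mean()))

# A model that assigns probability 0.25 to every correct token has PPL 4:
print(perplexity(np.array([0.25, 0.25, 0.25, 0.25])))  # 4.0
```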
3. "Why is Softmax temperature used in decoding?"
$P_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$.
$T < 1$: Makes the distribution "sharper" (more deterministic).
$T > 1$: Makes the distribution "flatter" (more creative/diverse, but higher risk of hallucinations).
Code Snippet: Minimal Softmax with Temperature
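A minimal NumPy sketch of the formula from question 3 (the max-subtraction is only for numerical stability; names are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """P_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                     # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax_with_temperature(logits, temperature=0.5))  # sharper (more deterministic)
print(softmax_with_temperature(logits, temperature=1.5))  # flatter (more diverse)
```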