
Scaling Laws & Data

Executive Summary

Two of the most influential results in modern AI, the scaling-law papers of Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla"), showed that model performance is a predictable function of three variables: Compute ($C$), Data ($D$), and Parameters ($N$).

| Concept | Key Finding | Real-world Impact |
| --- | --- | --- |
| Kaplan Laws | Over-emphasized parameters | Led to massive, under-trained models (GPT-3) |
| Chinchilla Laws | Data is just as important as size | Models should be "Compute Optimal" |
| Optimal Ratio | ~20 tokens per parameter | Llama 3 (70B) was trained on 15T tokens, deliberately far past this ratio |


1. The Chinchilla Optimality

DeepMind's research showed that for every doubling of the compute budget, model size and training tokens should both be increased in equal proportion.

  • Compute Budget $C$: measured in total training FLOPs (often quoted as petaflop/s-days).

  • Equation: $C \approx 6ND$, where $N$ is the parameter count and $D$ is the number of training tokens.

  • The Finding: most models (like GPT-3) were too big for the amount of data they were trained on. At ~20 tokens per parameter, a compute-optimal 175B model should have been trained on roughly 3.5T tokens, not 300B.
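
To make the split concrete, here is a minimal back-of-the-envelope sketch assuming only the two approximations above ($C \approx 6ND$ and ~20 tokens per parameter); the function name and the example budget are illustrative, not taken from the papers.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a fixed compute budget into a compute-optimal (N, D) pair.

    Uses the approximations from this section:
      C ~= 6 * N * D                 (training FLOPs)
      D ~= tokens_per_param * N      (Chinchilla rule of thumb)
    => C ~= 6 * tokens_per_param * N**2
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e24 FLOP budget (illustrative figure).
n, d = chinchilla_optimal(1e24)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```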


2. Data Mixture & Quality

"More data" is no longer enough; it must be "high-quality" data.

The Lifecycle of Data

  1. Deduplication: Removing identical or near-identical web pages so repeated documents do not dominate training or encourage memorization (sketched below).

  2. Quality Filtering: Uses a small classifier (e.g., fastText) to separate "Wikipedia-quality" text from "spam-quality" text.

  3. Synthetic Data: Building high-quality reasoning/math data using stronger models (e.g., a frontier model such as GPT-4 generating training examples for a smaller model).
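
As a concrete illustration of step 1, here is a minimal sketch of exact deduplication via normalized hashing; production pipelines also do near-duplicate removal (e.g., MinHash/LSH over n-gram shingles), which this sketch omits.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash the same.
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    """Drop exact (post-normalization) duplicate documents.

    Real pipelines also remove near-duplicates; this only catches identical copies.
    """
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The cat sat on the mat.", "the cat  sat on the mat.", "A different page."]
print(deduplicate(corpus))  # -> two documents remain
```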


Interview Questions

1. "If you have a fixed compute budget, would you train a larger model for fewer steps or a smaller model for more steps?"

According to the Chinchilla laws, there is a specific compute-optimal point (roughly 20 tokens per parameter). In practice, however, we often train smaller models for longer ("over-training") because a smaller model is cheaper to serve at inference time for millions of users.
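
A back-of-the-envelope comparison of the two options under a fixed budget, reusing $C \approx 6ND$; the budget and the smaller model's size are made-up numbers for illustration.

```python
C = 1e24  # total training FLOPs (illustrative budget)

# Option A: Chinchilla-optimal split (~20 tokens per parameter).
n_opt = (C / (6 * 20)) ** 0.5
d_opt = 20 * n_opt

# Option B: a model one third the size, over-trained on the same budget.
n_small = n_opt / 3
d_small = C / (6 * n_small)  # same budget -> 3x the tokens, 9x the tokens-per-parameter ratio

print(f"Optimal:      {n_opt / 1e9:5.0f}B params, {d_opt / 1e12:4.1f}T tokens "
      f"({d_opt / n_opt:.0f} tok/param)")
print(f"Over-trained: {n_small / 1e9:5.0f}B params, {d_small / 1e12:4.1f}T tokens "
      f"({d_small / n_small:.0f} tok/param) -> cheaper to serve")
```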

2. "What are the limits of scaling? Can we just keep adding data forever?"

We are approaching a "Token Crisis": most of the high-quality human-generated text on the internet has already been consumed by training. Future scaling depends on synthetic data, self-play, and multi-modal (video/audio) data.

3. "How does the 'Inverse Scaling' law work?"

While most tasks improve with scale, some exhibit "inverse scaling," where larger models perform worse because they lean more heavily on common patterns from the training data. Examples include certain logical traps and tasks where the model must ignore a frequent but irrelevant association.


Summary Table: Model Sizes

| Model | Params | Training Tokens | Status |
| --- | --- | --- | --- |
| GPT-3 | 175B | 300B | Under-trained |
| Llama 2 | 70B | 2T | Well-trained |
| Llama 3 | 70B | 15T | Heavily over-trained |
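
The Status column follows from the tokens-per-parameter ratio; a quick check against the ~20 rule of thumb, using only the numbers in the table:

```python
# Tokens-per-parameter ratios for the models above, vs. the ~20 rule of thumb.
models = {
    "GPT-3":   (175e9, 300e9),
    "Llama 2": (70e9, 2e12),
    "Llama 3": (70e9, 15e12),
}
for name, (params, tokens) in models.items():
    print(f"{name:8s} {tokens / params:6.1f} tokens/param")
# GPT-3 ~1.7 (under-trained), Llama 2 ~28.6, Llama 3 ~214.3 (heavily over-trained)
```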
