Study Notes: Chapter 1 — An Introduction to Large Language Models
1. Introduction & Significance of LLMs
Deep neural networks began rapidly advancing around 2012, laying the groundwork for modern Language AI and, later, LLMs.
A major early breakthrough: GPT-2 (2019), capable of writing human-like articles.
ChatGPT (2022) massively accelerated public adoption of AI, reaching:
1M users in 5 days
100M users in 2 months
These systems reshaped tasks such as translation, summarization, programming assistance, and creative writing.
2. What Is Language AI?
A subfield of AI focused on systems capable of understanding, processing, and generating human language.
Often used interchangeably with NLP, though “Language AI” is broader.
Not all language-related systems are “intelligent”—e.g., early game NPC logic.
Definition (John McCarthy, 2007):
AI is the science and engineering of making intelligent machines, especially intelligent computer programs; it does not have to confine itself to biologically observable methods.
3. A Brief History of Language AI
3.1 Challenges
Text is unstructured, unlike numerical data.
Goal over time: make text machine-understandable by creating structured representations.
3.2 Bag-of-Words (BoW)
One of the earliest methods (1950s; popular in 2000s).
Steps:
Tokenize text (split by whitespace).
Build a vocabulary of unique words.
Represent sentences using word counts → vectors.
Useful but limited: ignores meaning and word order.
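A minimal sketch of these steps in plain Python (the example sentences are illustrative, not taken from the book):

```python
from collections import Counter

# Toy corpus of two sentences to compare
docs = ["that is a cute dog", "my cat is cute"]

# 1. Tokenize by whitespace
tokenized = [doc.split() for doc in docs]

# 2. Build a vocabulary of unique words
vocab = sorted({word for doc in tokenized for word in doc})

# 3. Represent each sentence as a vector of word counts over the vocabulary
vectors = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print(vocab)    # ['a', 'cat', 'cute', 'dog', 'is', 'my', 'that']
print(vectors)  # [[1, 0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0]]
```

The two count vectors overlap on "cute" and "is", but nothing in them captures meaning or word order.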
4. Dense Vector Embeddings
4.1 word2vec (2013)
Learns embeddings (dense vectors) capturing semantic meaning.
Trains a neural network to predict if two words are neighbors in text.
Words with similar meanings get closer in embedding space.
Embedding dimensions loosely correspond to latent properties (figure analogy), though not literally.
4.2 Strengths of Embeddings
Enable semantic similarity between words.
Foundation for downstream tasks: clustering, search, classification, etc.
5. Recurrent Neural Networks (RNNs) and Attention
Before Transformers:
5.1 RNN Sequence-to-Sequence Models
Used for tasks like translation.
Had encoders (represent sentence) and decoders (generate output).
Autoregressive: processes tokens one by one.
Problem: the entire input had to be squeezed into a single context vector, which degrades quality on long sequences.
5.2 Attention Mechanism (2014)
Helps models “focus” on relevant parts of the input.
Allows decoder to look at each input token’s hidden state.
Greatly improves translation and contextual understanding.
6. The Transformer Revolution
6.1 “Attention is All You Need” (2017)
Removed recurrence; relied solely on self-attention.
Enabled parallel processing, faster training, and better performance.
Architecture:
Encoder blocks: self-attention + feedforward layers
Decoder blocks: masked self-attention + cross-attention + feedforward
Key idea: self-attention lets every token attend to other tokens in the sequence; encoder self-attention looks both forward and backward, while the decoder's masked self-attention looks only backward.
7. Representation Models (Encoder-Only)
BERT (2018)
Uses only the encoder part of the Transformer.
Trained with Masked Language Modeling (MLM).
Generates rich contextual embeddings.
Great for:
Classification
Clustering
Feature extraction
Uses a special [CLS] token to represent whole input sequences.
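A minimal sketch of pulling out that [CLS] representation with Hugging Face Transformers (the bert-base-uncased checkpoint and the input sentence are illustrative assumptions, not code from the book):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers changed Language AI.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token sits at position 0; its final hidden state is commonly
# used as an embedding of the whole input sequence.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased
```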
8. Generative Models (Decoder-Only)
GPT-1 → GPT-2 → GPT-3
Decoder-only Transformers.
Trained for next-token prediction.
Sizes:
GPT-1: 117M params
GPT-2: 1.5B
GPT-3: 175B
Large generative Transformer models became known as LLMs.
Why they matter
Act as completion models: continue text, answer questions, follow instructions (after fine-tuning).
Introduced the idea of context window: maximum tokens the model can consider.
9. The Year of Generative AI (2023)
Explosion of:
OpenAI GPT-3.5 / GPT-4
Open-source models (Llama, Mistral, Phi, Command R, etc.)
New architectures emerging:
Mamba
RWKV
Alternative architectures aimed to achieve Transformer-level capability with better efficiency.
10. Redefining the Term “LLM”
“Large” is a moving target.
Authors argue LLMs include:
Both generative (decoder-only) and representation (encoder-only) models.
Models with <1B parameters if impactful.
Focus should be on capability, not size.
11. How LLMs Are Trained
Two-Stage Process
Pretraining (Language Modeling)
Trained on massive corpora to learn general language patterns.
Output: base or foundation model.
Fine-Tuning
Adaptation to tasks:
Classification
Following instructions
Domain specialization
Pretraining is extremely expensive (e.g., Llama-2 cost $5m+ in compute).
12. Applications of LLMs
Examples discussed:
Sentiment classification
Topic modeling / clustering
Semantic search
Retrieval-augmented generation (RAG)
Tool-using chatbots
Multimodal reasoning (e.g., writing a recipe from a photo of a fridge's contents)
Real LLM systems often combine multiple components (retrievers, prompt engineering, fine-tuning) into complete applications.
13. Responsible LLM Use
Risks highlighted:
Bias & fairness
Transparency and accountability
Harmful / incorrect content
Intellectual property concerns
Emerging regulations (e.g., EU AI Act)
14. Compute Constraints & “GPU-Poor” Users
LLMs require significant VRAM.
Many people don’t have expensive GPUs; this book emphasizes:
Lightweight models
Google Colab usage
Efficient loading/inference techniques
15. Interacting with LLMs
Closed-source models
Accessed through APIs (e.g., OpenAI, Anthropic).
Pros: Easy, powerful, no GPU required.
Cons: Costly, cannot be self-hosted, limited transparency.
Open-source models
Fully downloadable and customizable (Llama, Phi, Mistral…).
Pros: Control, privacy, fine-tuning.
Cons: Require hardware + setup effort.
16. First Code Example: Text Generation
The chapter ends by demonstrating:
Loading Phi-3-mini-4k-instruct using Hugging Face Transformers.
Running a simple text-generation pipeline to produce a joke about chickens.
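A sketch along those lines, assuming a recent Hugging Face Transformers version that accepts chat-style message lists in the pipeline (the exact arguments in the book may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # only return the newly generated text
    max_new_tokens=100,
    do_sample=False,         # greedy decoding for a deterministic answer
)

messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
print(generator(messages)[0]["generated_text"])
```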
Summary of Key Takeaways
LLMs evolved from simple word counting to deep contextual models powered by attention.
Transformers revolutionized Language AI by enabling parallel processing and long-range understanding.
Representation models (like BERT) and generative models (like GPT) serve different purposes but both fall under “LLM.”
Open-source and proprietary LLMs each have strengths depending on user needs.
Ethical, computational, and regulatory factors play major roles today.
The chapter provides the conceptual foundation for all following chapters.
Chapter 2 — Tokens and Embeddings (Detailed Study Notes)
From: Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (2024)
1. Introduction: Why Tokens and Embeddings Matter
Chapter 2 explains two of the most fundamental concepts behind LLMs:
Tokens → the atomic text units LLMs read and generate
Embeddings → the numeric vectors representing tokens or entire sentences
LLMs cannot operate directly on characters or raw text—they convert text into structured tokens and then embeddings.
2. What Tokens Are
LLMs operate on small chunks of text called tokens.
Examples of token types shown in the text:
Start-of-text token (<s>)
End-of-text token
Padding token
Unknown token
Special tokens like [CLS] and [MASK]
Tokens may be:
Whole words
Subwords
Characters
Even punctuation or whitespace
The choice depends on the tokenizer’s training method.
3. Why Tokenization Is Necessary
Generative LLMs do not output whole sentences at once. Instead, they predict one token at a time, autoregressively.
At inference time:
Input text → tokenizer → tokens
Tokens fed to the model
Model predicts the next token repeatedly
Tokenization happens before the model sees any text.
4. How Tokenizers Prepare Inputs
Before being processed by the model, text is broken down by the tokenizer into small pieces that map to IDs.
Example from GPT-4’s tokenizer:
Each part of a sentence—even within a single word—is split into subword units.
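The book shows this with its own figures; as an assumption outside the book's stack, OpenAI's tiktoken library can reproduce GPT-4's tokenization for quick inspection:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Tokenization splits uncommon words into subword units."
token_ids = enc.encode(text)

print(token_ids)
# Decode each ID individually to see how the text was split into pieces
print([enc.decode([tid]) for tid in token_ids])
```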
5. Running a Tokenizer in Code
The chapter includes code demonstrating:
Loading the Phi-3 Mini model & tokenizer
Converting a prompt into tokens
Passing those tokens into the model
Generating 20 new tokens
This demonstrates the token → embedding → generation pipeline.
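A minimal sketch of that flow (the prompt and generation settings are illustrative assumptions, not the book's verbatim code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Write a short apology email for missing a meeting."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Inspect the tokens the model actually sees
print(tokenizer.convert_ids_to_tokens(input_ids[0]))

# Generate 20 new tokens (produced one at a time under the hood)
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```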
6. Factors Influencing Tokenizer Design
The chapter explains how tokenizer quality greatly affects model quality. The key considerations include:
6.1 Vocabulary Size
Larger vocabulary → text splits into fewer, more expressive tokens, but the embedding and output layers grow, spending more of the model's capacity on the vocabulary itself.
Smaller vocabulary → a more compact model, but common words may fragment into many subword tokens.
6.2 Handling Capitalization
Should uppercase versions be separate tokens?
Capitalization carries meaning (e.g., names), but storing all variants costs vocabulary.
6.3 Domain-Specific Tokenization
Tokenizer performance depends heavily on the dataset used to train it.
Example: code tokenization
A general text tokenizer might split indentation spaces in Python code into many tokens—bad for code models.
A code-aware tokenizer groups indentation intelligently, improving downstream performance.
7. Token Embeddings
Once text is tokenized, the model must convert tokens → numbers.
7.1 What Embeddings Are
Embeddings are dense numeric vectors representing tokens.
These vectors encode:
Meaning relationships
Context patterns
Semantic similarity
If training data contains lots of:
English → model becomes good at English
Wikipedia → model gains factual capabilities
8. Why Embeddings Matter
Embeddings allow LLMs to:
Capture complex language patterns
Measure similarity (distance in vector space)
Transform human text into computable numerical form
The chapter notes that patterns in embeddings reveal:
Coherence of generated text
Code-writing ability
Knowledge about facts
9. Word2Vec: The Origin of Modern Embeddings
word2vec (2013) was an early breakthrough system.
9.1 How word2vec Works
Assigns each word a vector (random at start)
Trains a small neural network
Predicts whether word pairs appear next to each other
If two words often appear in similar contexts → embeddings become close.
9.2 How Training Works
Steps:
Generate word pairs from text
Feed them into neural network
Update embeddings to reflect neighbor relationships
9.3 What Meaning in Embeddings Means
The chapter uses a conceptual example:
“baby” has high scores on “human”, “newborn”
“apple” has low scores on those dimensions
These dimensions are not literal human concepts but abstract latent patterns.
10. Semantic Similarity Using Embeddings
Embeddings allow measuring closeness:
Similar words → nearby in vector space
Different words → far apart
Example visualizations show clusters of similar meaning words.
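A small sketch in the same spirit, assuming pretrained GloVe vectors loaded through gensim's downloader (the specific embedding set is an assumption, not necessarily the one used in the book):

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
word_vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbors in embedding space tend to be semantically related
print(word_vectors.most_similar("king", topn=5))

# Cosine similarity between two specific words
print(word_vectors.similarity("king", "queen"))
```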
11. Types of Embeddings
The chapter distinguishes:
Word embeddings (single tokens)
Sentence/text embeddings (whole sentences)
Sentence embeddings enable:
Semantic search
Topic modeling
Document classification
These will be used in Part II of the book.
12. Modern Uses of Token & Sentence Embeddings
The chapter explains how embeddings are foundational in:
Recommender systems
Search engines
Document retrieval
Clustering topics
Thus embeddings power many applications beyond LLM chatbots.
13. Tokenization → Embeddings → Computation Pipeline
Overall pipeline:
Raw text
Tokenizer converts text → tokens
Tokens converted → embeddings
LLM processes embeddings
Outputs new tokens
This cycle allows all generative and analytical tasks.
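A sketch that peeks at steps 2–3 of this pipeline with Hugging Face Transformers (the model choice and prompt are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Step 2: the tokenizer converts text into token IDs
input_ids = tokenizer("Hello world", return_tensors="pt").input_ids

# Step 3: the embedding layer maps each token ID to a dense vector
embedding_layer = model.get_input_embeddings()
token_embeddings = embedding_layer(input_ids)
print(token_embeddings.shape)  # (batch, sequence_length, hidden_size)
```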
14. Domain-Specific Embeddings and RAG Context
A note in the chapter explains:
Good embeddings capture factual patterns
But relying solely on generation is unreliable
→ led to Retrieval-Augmented Generation (RAG)
RAG combines:
Search
Embeddings
LLM reasoning power
Complete Summary of Chapter 2
Chapter 2 teaches that:
Tokens are the basic units LLMs read and write.
Tokenizers heavily influence model performance and domain adaptation.
Embeddings numerically encode the meaning of tokens and sentences.
word2vec pioneered semantic vector representation.
Embeddings enable semantic search, clustering, recommendations, and RAG.
Understanding tokens + embeddings is foundational for understanding the Transformer architecture in Chapter 3.
Chapter 3 — Looking Inside Large Language Models (Detailed Notes)
Chapter 3 is an expanded, modernized version of Jay Alammar’s famous Illustrated Transformer, explaining how Transformer-based LLMs actually work inside. This chapter bridges Chapter 2’s token/embedding concepts with the Transformer architecture responsible for modern generative LLM behavior.
1. Purpose of the Chapter
Chapter 3 explains:
How generative LLMs process input tokens
How they select the next token
What happens inside the Transformer block
How modern improvements (RoPE, efficient attention) make state-of-the-art models possible
The chapter uses both conceptual intuition and code examples with the Phi-3 model to demonstrate the forward pass and token generation.
2. Loading a Transformer LLM (Code Setup)
The chapter begins with Hugging Face Transformers code to:
Load microsoft/Phi-3-mini-4k-instruct
Create a text-generation pipeline
This sets the stage for understanding the model’s internal computation when generating text.
3. High-Level Overview of Transformer Models
3.1 Inputs & Outputs of a Transformer LLM
LLMs are described as text-in → text-out systems.
Once pre-trained on huge datasets, they can generate high-quality output like emails, explanations, or stories.
A key figure shows:
User input
Tokenization
Model inference
Generated output (token by token)
4. The Forward Pass: How the Model Processes Tokens
The forward pass determines the next token.
The major components of the forward pass include:
Token embeddings
Positional embeddings
Transformer blocks
Self-attention
Feedforward layers
Language modeling (LM) head
Probability distribution over next tokens
Decoding strategy: Choose a token
5. Choosing the Next Token (Sampling / Decoding)
After generating probabilities for all tokens, a decoding strategy must choose the next output token.
The chapter highlights:
The model doesn’t output full sentences, only one token at a time
The chosen token is appended back into the prompt and fed again
Common strategies:
Greedy decoding (choose highest probability)
Sampling
Top-k
Top-p (nucleus)
This choice strongly affects creativity and coherence.
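A sketch of how these strategies map onto Hugging Face's generate() parameters (the model, prompt, and specific values are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
input_ids = tokenizer("The chicken crossed the road because", return_tensors="pt").input_ids.to(model.device)

# Greedy decoding: always pick the single most probable next token
greedy = model.generate(input_ids, max_new_tokens=40, do_sample=False)

# Sampling with temperature, top-k, and top-p (nucleus) adds controlled randomness
sampled = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,  # <1 sharpens the distribution, >1 flattens it
    top_k=50,         # sample only among the 50 most likely tokens
    top_p=0.9,        # ...further restricted to the smallest set covering 90% probability
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```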
6. Parallel Token Processing & Context Window
Transformers revolutionized NLP by letting models process input tokens in parallel, unlike RNNs.
Key concepts:
Each token flows through its own computation stream inside the model
Number of tokens that can be processed = context size
Models with larger context windows can handle longer documents and conversations.
7. Speeding Up Generation: Key/Value Caching
During generation, the model repeatedly performs the forward pass for each newly generated token.
To avoid recomputing attention over the entire history each time:
Transformers store the key and value matrices computed for previously processed tokens (the KV cache) and reuse them on later steps.
This greatly speeds up inference.
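A rough sketch of comparing generation speed with and without the cache via generate()'s use_cache flag (model, prompt, and token count are illustrative assumptions; absolute timings depend on hardware):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
input_ids = tokenizer("Write a short story about a lighthouse.", return_tensors="pt").input_ids.to(model.device)

for use_cache in (True, False):
    start = time.time()
    model.generate(input_ids, max_new_tokens=100, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.time() - start:.1f}s")
```

With the cache enabled, keys and values for earlier tokens are reused instead of recomputed, so each new token costs far less.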
8. Inside the Transformer Block
A Transformer block contains the core computation of LLMs.
Each block has two main components:
8.1 Multi-Head Self-Attention
Role:
Determines which previous tokens each token should focus on
Computes relevance scores using Queries, Keys, and Values
Modern LLMs use variants of self-attention to scale to long sequences.
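A toy NumPy sketch of the scaled dot-product attention underneath a single head (dimensions and random inputs are illustrative; real models add causal masking, many heads, and learned projection matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 4, 8                      # 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))  # queries
K = rng.standard_normal((seq_len, d_k))  # keys
V = rng.standard_normal((seq_len, d_k))  # values

scores = Q @ K.T / np.sqrt(d_k)          # relevance of every token to every other token
weights = softmax(scores, axis=-1)       # each row sums to 1
output = weights @ V                     # weighted mix of value vectors per token
print(weights.round(2))
```

A decoder-style model would additionally mask positions after the current token (setting their scores to -inf before the softmax) so attention only looks backward.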
8.2 Feedforward Layer
Role:
Stores and activates knowledge learned during pretraining
Performs nonlinear transformation on each token’s representation
This layer is crucial for storing factual and reasoning capabilities.
9. Recent Improvements to Transformer Architecture
The chapter outlines emerging techniques that have pushed LLMs forward.
9.1 More Efficient Attention
Transformers originally had quadratic attention complexity.
New approaches:
Linear-time attention
Sparse attention
Sliding windows
These aim to reduce cost and handle longer sequences.
9.2 Rotary Position Embeddings (RoPE)
RoPE has become the default positional encoding method in modern models like Llama, Phi, and Mistral.
Key features:
Encodes both absolute and relative position information
Based on rotating token embeddings in vector space
Applied inside the attention step, not added at input
Figures 3-32 and 3-33 show:
RoPE mixing with queries & keys
Rotation applied right before relevance scoring
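A toy sketch of the core rotation idea for a single pair of embedding dimensions (a heavy simplification of RoPE, not the book's code):

```python
import numpy as np

def rotate_pair(pair, position, theta):
    """Rotate an (even, odd) dimension pair by the angle position * theta."""
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x0, x1 = pair
    return np.array([x0 * cos - x1 * sin, x0 * sin + x1 * cos])

q_pair = np.array([1.0, 0.0])
# The same query content points in different directions depending on position;
# dot products between rotated queries and keys then depend on their relative
# distance, which is how RoPE injects position into the attention scores.
print(rotate_pair(q_pair, position=0, theta=0.1))
print(rotate_pair(q_pair, position=5, theta=0.1))
```

In full RoPE, each pair of dimensions gets its own angle scale (decreasing across pairs), so different pairs encode position at different frequencies.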
9.3 Other Experiments & Improvements
The chapter references multiple research directions:
Vision Transformers
Transformers in robotics
Transformers for time series
State-space models like Mamba
These highlight how Transformer ideas are spreading beyond text.
10. Full Summary of Chapter 3
A Transformer LLM generates one token at a time, looping with a forward pass.
Three major components: tokenizer, Transformer blocks, LM head.
Tokenization + embeddings are the entry point of the forward pass.
The LM head outputs probabilities for the next token, and decoding chooses which token to generate.
Parallel token processing gives Transformers scalability.
KV caching prevents redundant computation and speeds up inference.
Transformer blocks consist of:
Self-attention
Feedforward neural networks
Recent improvements include efficient attention, RoPE positional embeddings, and numerous architecture experiments.
Chapter 4 — Text Classification (Detailed Notes)
1. What Is Text Classification?
Text classification is a fundamental NLP task where a model assigns a label or class to a piece of text.
Figure 4-1 from the book illustrates this basic idea: a model receives text and outputs a class.
Examples include:
Sentiment analysis (positive/negative)
Intent detection
Entity extraction
Language detection
The authors emphasize that both generative (decoder-only) and representation (encoder-only) models have significantly improved text-classification capabilities.
2. Why Language Models Matter for Classification
LLMs offer two major advantages:
Understanding context rather than relying on bag-of-words.
Zero-shot and few-shot classification: assigning labels without explicitly training a classifier (few-shot adds only a handful of examples in the prompt).
Classification can be achieved through:
Generative prompting (LLMs classify by generating the label)
Embeddings-based similarity methods (encoder/embedding models)
3. Zero-Shot Classification with Embeddings
One of the core lessons of Chapter 4 is how embeddings can be used to classify text without training a classifier.
Overview of the method
Encode the documents using an embedding model.
Encode the labels as text descriptions using the same embedding model.
Use cosine similarity between document embeddings and label embeddings.
Assign the label with the highest similarity.
4. Embedding the Labels
Because labels like “positive” and “negative” are too abstract, the authors describe them using short phrases, such as:
“A negative review”
“A positive review”
These label descriptions are turned into embeddings using, for example, a sentence-transformers model (see the sketch at the end of the next section).
The idea is that embeddings of descriptive labels capture richer semantic meaning than the bare label names.
5. Cosine Similarity for Classification
The similarity between each document and each label is computed using cosine similarity.
Cosine similarity measures the angle between two vectors: cos(θ) = (a · b) / (‖a‖ ‖b‖).
Smaller angle (cosine closer to 1) → higher similarity
Illustrated in Figures 4-15 and 4-16.
The predicted label is the label with the highest similarity to the document embedding.
A minimal sketch of this nearest-label approach is shown below (assuming sentence-transformers and scikit-learn; not the book's verbatim code):
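```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative embedding model; the book may use a different checkpoint
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

docs = [
    "This movie was a waste of two hours.",
    "An absolute masterpiece, I loved every minute.",
]
label_descriptions = ["A negative review", "A positive review"]

doc_embeddings = model.encode(docs)
label_embeddings = model.encode(label_descriptions)

# Each document is assigned the label description it is most similar to
sims = cosine_similarity(doc_embeddings, label_embeddings)
predictions = np.argmax(sims, axis=1)
print(predictions)  # likely [0 1]: first review negative, second positive
```

Each row of sims compares one document against both label descriptions; the argmax picks the closer label, which becomes the zero-shot prediction.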
6. Performance of Embedding-Based Zero-Shot Classification
Despite using no labeled training data, the method achieves:
Accuracy: ~78%
F1 Score: ~0.78
This performance highlights the power and flexibility of embeddings as a general-purpose representation for text tasks.
TIP from the book:
Label descriptions can be improved for better classification.
Example improvement:
“A very negative movie review”
“A very positive movie review”
More specific label descriptions → embeddings carry more task-relevant information.
7. Key Takeaways from Chapter 4
A. LLMs enable effective classification even without training
Zero-shot classification shows strong performance using embeddings alone.
B. Label design matters
Clear, descriptive label phrases improve embedding-based methods.
C. Embeddings are highly versatile
They can be reused for clustering, search, semantic similarity, and more.
D. Cosine similarity is central
Simple but expressive metric for comparing embedding vectors.
E. Generative vs Representation models
Chapter sets the stage for seeing how both kinds of LLMs can be applied to NLP tasks.
8. How Chapter 4 Connects to Later Content
This chapter kicks off Part II: Using Pretrained Language Models.
It introduces general techniques (embeddings, cosine similarity, zero-shot reasoning) that become building blocks for:
Topic modeling
Semantic search
RAG systems
Clustering
Fine-tuning (discussed in Part III)