Study Notes: Chapter 1 — An Introduction to Large Language Models
1. Introduction & Significance of LLMs
Deep neural networks began rapidly advancing around 2012, laying the groundwork for modern Language AI and, later, LLMs.
A major early breakthrough: GPT-2 (2019), capable of writing human-like articles.
ChatGPT (2022) massively accelerated public adoption of AI, reaching:
1M users in 5 days
100M users in 2 months
These systems reshaped tasks such as translation, summarization, programming assistance, and creative writing.
2. What Is Language AI?
A subfield of AI focused on systems capable of understanding, processing, and generating human language.
Often used interchangeably with NLP, though “Language AI” is broader.
Not all language-related systems are “intelligent”—e.g., early game NPC logic.
Definition (John McCarthy, 2007):
AI is the science and engineering of making intelligent machines, especially intelligent computer programs; it does not have to confine itself to biologically observable methods.
3. A Brief History of Language AI
3.1 Challenges
Text is unstructured, unlike numerical data.
Goal over time: make text machine-understandable by creating structured representations.
3.2 Bag-of-Words (BoW)
One of the earliest methods (1950s; popular in 2000s).
Steps:
Tokenize text (split by whitespace).
Build a vocabulary of unique words.
Represent sentences using word counts → vectors.
Useful but limited: ignores meaning and word order.
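A minimal sketch of these steps in plain Python (the example sentences are illustrative, not taken from the book):

```python
from collections import Counter

# Toy corpus of two sentences to compare
docs = ["that is a cute dog", "my cat is cute"]

# 1. Tokenize by whitespace
tokenized = [doc.split() for doc in docs]

# 2. Build a vocabulary of unique words
vocab = sorted({word for doc in tokenized for word in doc})

# 3. Represent each sentence as a vector of word counts over the vocabulary
vectors = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print(vocab)    # ['a', 'cat', 'cute', 'dog', 'is', 'my', 'that']
print(vectors)  # [[1, 0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0]]
```

The two count vectors overlap on "cute" and "is", but nothing in them captures meaning or word order.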
4. Dense Vector Embeddings
4.1 word2vec (2013)
Learns embeddings (dense vectors) capturing semantic meaning.
Trains a neural network to predict if two words are neighbors in text.
Words with similar meanings get closer in embedding space.
Embedding dimensions loosely correspond to latent properties (figure analogy), though not literally.
4.2 Strengths of Embeddings
Enable semantic similarity between words.
Foundation for downstream tasks: clustering, search, classification, etc.
5. Recurrent Neural Networks (RNNs) and Attention
Before Transformers:
5.1 RNN Sequence-to-Sequence Models
Used for tasks like translation.
Had encoders (represent sentence) and decoders (generate output).
Autoregressive: processes tokens one by one.
Problem: the entire input had to be squeezed into a single context vector, which degrades quality on long sequences.
5.2 Attention Mechanism (2014)
Helps models “focus” on relevant parts of the input.
Allows decoder to look at each input token’s hidden state.
Greatly improves translation and contextual understanding.
6. The Transformer Revolution
6.1 “Attention is All You Need” (2017)
Removed recurrence; relied solely on self-attention.
Enabled parallel processing, faster training, and better performance.
Architecture:
Encoder blocks: self-attention + feedforward layers
Decoder blocks: masked self-attention + cross-attention + feedforward
Key idea: self-attention lets every token attend to other tokens in the sequence; encoder self-attention looks both forward and backward, while the decoder's masked self-attention looks only backward.
7. Representation Models (Encoder-Only)
BERT (2018)
Uses only the encoder part of the Transformer.
Trained with Masked Language Modeling (MLM).
Generates rich contextual embeddings.
Great for:
Classification
Clustering
Feature extraction
Uses a special [CLS] token to represent whole input sequences.
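A minimal sketch of pulling out that [CLS] representation with Hugging Face Transformers (the bert-base-uncased checkpoint and the input sentence are illustrative assumptions, not code from the book):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers changed Language AI.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token sits at position 0; its final hidden state is commonly
# used as an embedding of the whole input sequence.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased
```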
8. Generative Models (Decoder-Only)
GPT-1 → GPT-2 → GPT-3
Decoder-only Transformers.
Trained for next-token prediction.
Sizes:
GPT-1: 117M params
GPT-2: 1.5B
GPT-3: 175B
Large generative Transformer models became known as LLMs.
Why they matter
Act as completion models: continue text, answer questions, follow instructions (after fine-tuning).
Introduced the idea of context window: maximum tokens the model can consider.
9. The Year of Generative AI (2023)
Explosion of:
OpenAI GPT-3.5 / GPT-4
Open-source models (Llama, Mistral, Phi, Command R, etc.)
New architectures emerging:
Mamba
RWKV
Alternative architectures aimed to achieve Transformer-level capability with better efficiency.
10. Redefining the Term “LLM”
“Large” is a moving target.
Authors argue LLMs include:
Both generative (decoder-only) and representation (encoder-only) models.
Models with <1B parameters if impactful.
Focus should be on capability, not size.
11. How LLMs Are Trained
Two-Stage Process
Pretraining (Language Modeling)
Trained on massive corpora to learn general language patterns.
Output: base or foundation model.
Fine-Tuning
Adaptation to tasks:
Classification
Following instructions
Domain specialization
Pretraining is extremely expensive (e.g., Llama-2 cost $5m+ in compute).
12. Applications of LLMs
Examples discussed:
Sentiment classification
Topic modeling / clustering
Semantic search
Retrieval-augmented generation (RAG)
Tool-using chatbots
Multimodal reasoning (e.g., writing a recipe from a photo of a fridge's contents)
Real LLM systems often combine multiple components (retrievers, prompt engineering, fine-tuning) into complete applications.
13. Responsible LLM Use
Risks highlighted:
Bias & fairness
Transparency and accountability
Harmful / incorrect content
Intellectual property concerns
Emerging regulations (e.g., EU AI Act)
14. Compute Constraints & “GPU-Poor” Users
LLMs require significant VRAM.
Many people don’t have expensive GPUs; this book emphasizes:
Lightweight models
Google Colab usage
Efficient loading/inference techniques
15. Interacting with LLMs
Closed-source models
Accessed through APIs (e.g., OpenAI, Anthropic).
Pros: Easy, powerful, no GPU required.
Cons: Costly, cannot be self-hosted, limited transparency.
Open-source models
Fully downloadable and customizable (Llama, Phi, Mistral…).
Pros: Control, privacy, fine-tuning.
Cons: Require hardware + setup effort.
16. First Code Example: Text Generation
The chapter ends by demonstrating:
Loading Phi-3-mini-4k-instruct using Hugging Face Transformers.
Running a simple text-generation pipeline to produce a joke about chickens.
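A sketch along those lines, assuming a recent Hugging Face Transformers version that accepts chat-style message lists in the pipeline (the exact arguments in the book may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # only return the newly generated text
    max_new_tokens=100,
    do_sample=False,         # greedy decoding for a deterministic answer
)

messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
print(generator(messages)[0]["generated_text"])
```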
Summary of Key Takeaways
LLMs evolved from simple word counting to deep contextual models powered by attention.
Transformers revolutionized Language AI by enabling parallel processing and long-range understanding.
Representation models (like BERT) and generative models (like GPT) serve different purposes but both fall under “LLM.”
Open-source and proprietary LLMs each have strengths depending on user needs.
Ethical, computational, and regulatory factors play major roles today.
The chapter provides the conceptual foundation for all following chapters.
Chapter 2 — Tokens and Embeddings (Detailed Study Notes)
From: Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (2024)
1. Introduction: Why Tokens and Embeddings Matter
Chapter 2 explains two of the most fundamental concepts behind LLMs:
Tokens → the atomic text units LLMs read and generate
Embeddings → the numeric vectors representing tokens or entire sentences
LLMs cannot operate directly on characters or raw text—they convert text into structured tokens and then embeddings.
2. What Tokens Are
LLMs operate on small chunks of text called tokens.
Examples of token types shown in the text:
Start-of-text token (<s>)
End-of-text token
Padding token
Unknown token
Special tokens like [CLS] and [MASK]
Tokens may be:
Whole words
Subwords
Characters
Even punctuation or whitespace
The choice depends on the tokenizer’s training method.
3. Why Tokenization Is Necessary
Generative LLMs do not output whole sentences at once. Instead, they predict one token at a time, autoregressively.
At inference time:
Input text → tokenizer → tokens
Tokens fed to the model
Model predicts the next token repeatedly
Tokenization happens before the model sees any text.
4. How Tokenizers Prepare Inputs
Before being processed by the model, text is broken down by the tokenizer into small pieces that map to IDs.
Example from GPT-4’s tokenizer:
Each part of a sentence—even within a single word—is split into subword units.
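The book shows this with its own figures; as an assumption outside the book's stack, OpenAI's tiktoken library can reproduce GPT-4's tokenization for quick inspection:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Tokenization splits uncommon words into subword units."
token_ids = enc.encode(text)

print(token_ids)
# Decode each ID individually to see how the text was split into pieces
print([enc.decode([tid]) for tid in token_ids])
```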
5. Running a Tokenizer in Code
The chapter includes code demonstrating:
Loading the Phi-3 Mini model & tokenizer
Converting a prompt into tokens
Passing those tokens into the model
Generating 20 new tokens
This demonstrates the token → embedding → generation pipeline.
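A minimal sketch of that flow (the prompt and generation settings are illustrative assumptions, not the book's verbatim code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Write a short apology email for missing a meeting."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Inspect the tokens the model actually sees
print(tokenizer.convert_ids_to_tokens(input_ids[0]))

# Generate 20 new tokens (produced one at a time under the hood)
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```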
6. Factors Influencing Tokenizer Design
The chapter explains how tokenizer quality greatly affects model quality. The key considerations include:
6.1 Vocabulary Size
Larger vocabulary → text splits into fewer, more expressive tokens, but the embedding and output layers grow, spending more of the model's capacity on the vocabulary itself.
Smaller vocabulary → a more compact model, but common words may fragment into many subword tokens.
6.2 Handling Capitalization
Should uppercase versions be separate tokens?
Capitalization carries meaning (e.g., names), but storing all variants costs vocabulary.
6.3 Domain-Specific Tokenization
Tokenizer performance depends heavily on the dataset used to train it.
Example: code tokenization
A general text tokenizer might split indentation spaces in Python code into many tokens—bad for code models.
A code-aware tokenizer groups indentation intelligently, improving downstream performance.
7. Token Embeddings
Once text is tokenized, the model must convert tokens → numbers.
7.1 What Embeddings Are
Embeddings are dense numeric vectors representing tokens.
These vectors encode:
Meaning relationships
Context patterns
Semantic similarity
If training data contains lots of:
English → model becomes good at English
Wikipedia → model gains factual capabilities
8. Why Embeddings Matter
Embeddings allow LLMs to:
Capture complex language patterns
Measure similarity (distance in vector space)
Transform human text into computable numerical form
The chapter notes that patterns in embeddings reveal:
Coherence of generated text
Code-writing ability
Knowledge about facts
9. Word2Vec: The Origin of Modern Embeddings
word2vec (2013) was an early breakthrough system.
9.1 How word2vec Works
Assigns each word a vector (random at start)
Trains a small neural network
Predicts whether word pairs appear next to each other
If two words often appear in similar contexts → embeddings become close.
9.2 How Training Works
Steps:
Generate word pairs from text
Feed them into neural network
Update embeddings to reflect neighbor relationships
9.3 What Meaning in Embeddings Means
The chapter uses a conceptual example:
“baby” has high scores on “human”, “newborn”
“apple” has low scores on those dimensions
These dimensions are not literal human concepts but abstract latent patterns.
10. Semantic Similarity Using Embeddings
Embeddings allow measuring closeness:
Similar words → nearby in vector space
Different words → far apart
Example visualizations show clusters of similar meaning words.
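A small sketch in the same spirit, assuming pretrained GloVe vectors loaded through gensim's downloader (the specific embedding set is an assumption, not necessarily the one used in the book):

```python
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
word_vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbors in embedding space tend to be semantically related
print(word_vectors.most_similar("king", topn=5))

# Cosine similarity between two specific words
print(word_vectors.similarity("king", "queen"))
```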
11. Types of Embeddings
The chapter distinguishes:
Word embeddings (single tokens)
Sentence/text embeddings (whole sentences)
Sentence embeddings enable:
Semantic search
Topic modeling
Document classification
These will be used in Part II of the book.
12. Modern Uses of Token & Sentence Embeddings
The chapter explains how embeddings are foundational in:
Recommender systems
Search engines
Document retrieval
Clustering topics
Thus embeddings power many applications beyond LLM chatbots.
13. Tokenization → Embeddings → Computation Pipeline
Overall pipeline:
Raw text
Tokenizer converts text → tokens
Tokens converted → embeddings
LLM processes embeddings
Outputs new tokens
This cycle allows all generative and analytical tasks.
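A sketch that peeks at steps 2–3 of this pipeline with Hugging Face Transformers (the model choice and prompt are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Step 2: the tokenizer converts text into token IDs
input_ids = tokenizer("Hello world", return_tensors="pt").input_ids

# Step 3: the embedding layer maps each token ID to a dense vector
embedding_layer = model.get_input_embeddings()
token_embeddings = embedding_layer(input_ids)
print(token_embeddings.shape)  # (batch, sequence_length, hidden_size)
```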
14. Domain-Specific Embeddings and RAG Context
A note in the chapter explains:
Good embeddings capture factual patterns
But relying solely on generation is unreliable
→ led to Retrieval-Augmented Generation (RAG)
RAG combines:
Search
Embeddings
LLM reasoning power
Complete Summary of Chapter 2
Chapter 2 teaches that:
Tokens are the basic units LLMs read and write.
Tokenizers heavily influence model performance and domain adaptation.
Embeddings numerically encode the meaning of tokens and sentences.
word2vec pioneered semantic vector representation.
Embeddings enable semantic search, clustering, recommendations, and RAG.
Understanding tokens + embeddings is foundational for understanding the Transformer architecture in Chapter 3.
Chapter 3 — Looking Inside Large Language Models (Detailed Notes)
Chapter 3 is an expanded, modernized version of Jay Alammar’s famous Illustrated Transformer, explaining how Transformer-based LLMs actually work inside. This chapter bridges Chapter 2’s token/embedding concepts with the Transformer architecture responsible for modern generative LLM behavior.
1. Purpose of the Chapter
Chapter 3 explains:
How generative LLMs process input tokens
How they select the next token
What happens inside the Transformer block
How modern improvements (RoPE, efficient attention) make state-of-the-art models possible
The chapter uses both conceptual intuition and code examples with the Phi-3 model to demonstrate the forward pass and token generation.
2. Loading a Transformer LLM (Code Setup)
The chapter begins with Hugging Face Transformers code to:
Load microsoft/Phi-3-mini-4k-instruct
Create a text-generation pipeline
This sets the stage for understanding the model’s internal computation when generating text.
3. High-Level Overview of Transformer Models
3.1 Inputs & Outputs of a Transformer LLM
LLMs are described as text-in → text-out systems.
Once pre-trained on huge datasets, they can generate high-quality output like emails, explanations, or stories.
A key figure shows:
User input
Tokenization
Model inference
Generated output (token by token)
4. The Forward Pass: How the Model Processes Tokens
The forward pass determines the next token.
The major components of the forward pass include:
Token embeddings
Positional embeddings
Transformer blocks
Self-attention
Feedforward layers
Language modeling (LM) head
Probability distribution over next tokens
Decoding strategy: Choose a token
5. Choosing the Next Token (Sampling / Decoding)
After generating probabilities for all tokens, a decoding strategy must choose the next output token.
The chapter highlights:
The model doesn’t output full sentences, only one token at a time
The chosen token is appended back into the prompt and fed again
Common strategies:
Greedy decoding (choose highest probability)
Sampling
Top-k
Top-p (nucleus)
This choice strongly affects creativity and coherence.
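A sketch of how these strategies map onto Hugging Face's generate() parameters (the model, prompt, and specific values are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
input_ids = tokenizer("The chicken crossed the road because", return_tensors="pt").input_ids.to(model.device)

# Greedy decoding: always pick the single most probable next token
greedy = model.generate(input_ids, max_new_tokens=40, do_sample=False)

# Sampling with temperature, top-k, and top-p (nucleus) adds controlled randomness
sampled = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,  # <1 sharpens the distribution, >1 flattens it
    top_k=50,         # sample only among the 50 most likely tokens
    top_p=0.9,        # ...further restricted to the smallest set covering 90% probability
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```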
6. Parallel Token Processing & Context Window
Transformers revolutionized NLP by letting models process input tokens in parallel, unlike RNNs.
Key concepts:
Each token flows through its own computation stream inside the model
Number of tokens that can be processed = context size
Models with larger context windows can handle longer documents and conversations.
7. Speeding Up Generation: Key/Value Caching
During generation, the model repeatedly performs the forward pass for each newly generated token.
To avoid recomputing attention over the entire history each time:
Transformers store the key and value matrices computed for previously processed tokens (the KV cache) and reuse them on later steps.
This greatly speeds up inference.
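A rough sketch of comparing generation speed with and without the cache via generate()'s use_cache flag (model, prompt, and token count are illustrative assumptions; absolute timings depend on hardware):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
input_ids = tokenizer("Write a short story about a lighthouse.", return_tensors="pt").input_ids.to(model.device)

for use_cache in (True, False):
    start = time.time()
    model.generate(input_ids, max_new_tokens=100, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.time() - start:.1f}s")
```

With the cache enabled, keys and values for earlier tokens are reused instead of recomputed, so each new token costs far less.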
8. Inside the Transformer Block
A Transformer block contains the core computation of LLMs.
Each block has two main components:
8.1 Multi-Head Self-Attention
Role:
Determines which previous tokens each token should focus on
Computes relevance scores using Queries, Keys, and Values
Modern LLMs use variants of self-attention to scale to long sequences.
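A toy NumPy sketch of the scaled dot-product attention underneath a single head (dimensions and random inputs are illustrative; real models add causal masking, many heads, and learned projection matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 4, 8                      # 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))  # queries
K = rng.standard_normal((seq_len, d_k))  # keys
V = rng.standard_normal((seq_len, d_k))  # values

scores = Q @ K.T / np.sqrt(d_k)          # relevance of every token to every other token
weights = softmax(scores, axis=-1)       # each row sums to 1
output = weights @ V                     # weighted mix of value vectors per token
print(weights.round(2))
```

A decoder-style model would additionally mask positions after the current token (setting their scores to -inf before the softmax) so attention only looks backward.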
8.2 Feedforward Layer
Role:
Stores and activates knowledge learned during pretraining
Performs nonlinear transformation on each token’s representation
This layer is crucial for storing factual and reasoning capabilities.
9. Recent Improvements to Transformer Architecture
The chapter outlines emerging techniques that have pushed LLMs forward.
9.1 More Efficient Attention
Transformers originally had quadratic attention complexity.
New approaches:
Linear-time attention
Sparse attention
Sliding windows
These aim to reduce cost and handle longer sequences.
9.2 Rotary Position Embeddings (RoPE)
RoPE has become the default positional encoding method in modern models like Llama, Phi, and Mistral.
Key features:
Encodes both absolute and relative position information
Based on rotating token embeddings in vector space
Applied inside the attention step, not added at input
Figures 3-32 and 3-33 show:
RoPE mixing with queries & keys
Rotation applied right before relevance scoring
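A toy sketch of the core rotation idea for a single pair of embedding dimensions (a heavy simplification of RoPE, not the book's code):

```python
import numpy as np

def rotate_pair(pair, position, theta):
    """Rotate an (even, odd) dimension pair by the angle position * theta."""
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x0, x1 = pair
    return np.array([x0 * cos - x1 * sin, x0 * sin + x1 * cos])

q_pair = np.array([1.0, 0.0])
# The same query content points in different directions depending on position;
# dot products between rotated queries and keys then depend on their relative
# distance, which is how RoPE injects position into the attention scores.
print(rotate_pair(q_pair, position=0, theta=0.1))
print(rotate_pair(q_pair, position=5, theta=0.1))
```

In full RoPE, each pair of dimensions gets its own angle scale (decreasing across pairs), so different pairs encode position at different frequencies.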
9.3 Other Experiments & Improvements
The chapter references multiple research directions:
Vision Transformers
Transformers in robotics
Transformers for time series
State-space models like Mamba
These highlight how Transformer ideas are spreading beyond text.
10. Full Summary of Chapter 3
A Transformer LLM generates one token at a time, looping with a forward pass.
Three major components: tokenizer, Transformer blocks, LM head.
Tokenization + embeddings are the entry point of the forward pass.
The LM head outputs probabilities for the next token, and decoding chooses which token to generate.
Parallel token processing gives Transformers scalability.
KV caching prevents redundant computation and speeds up inference.
Transformer blocks consist of:
Self-attention
Feedforward neural networks
Recent improvements include efficient attention, RoPE positional embeddings, and numerous architecture experiments.
Chapter 4 — Text Classification (Detailed Notes)
1. What Is Text Classification?
Text classification is a fundamental NLP task where a model assigns a label or class to a piece of text.
Figure 4-1 from the book illustrates this basic idea: a model receives text and outputs a class.
Examples include:
Sentiment analysis (positive/negative)
Intent detection
Entity extraction
Language detection
The authors emphasize that both generative (decoder-only) and representation (encoder-only) models have significantly improved text-classification capabilities.
2. Why Language Models Matter for Classification
LLMs offer two major advantages:
Understanding context rather than relying on bag-of-words.
Zero-shot and few-shot classification: assigning labels without explicitly training a classifier (few-shot adds only a handful of examples in the prompt).
Classification can be achieved through:
Generative prompting (LLMs classify by generating the label)
Embeddings-based similarity methods (encoder/embedding models)
3. Zero-Shot Classification with Embeddings
One of the core lessons of Chapter 4 is how embeddings can be used to classify text without training a classifier.
Overview of the method
Encode the documents using an embedding model.
Encode the labels as text descriptions using the same embedding model.
Use cosine similarity between document embeddings and label embeddings.
Assign the label with the highest similarity.
4. Embedding the Labels
Because labels like “positive” and “negative” are too abstract, the authors describe them using short phrases, such as:
“A negative review”
“A positive review”
These label descriptions are turned into embeddings using, for example, a sentence-transformers model (see the sketch at the end of the next section).
The idea is that embeddings of descriptive labels capture richer semantic meaning than the bare label names.
5. Cosine Similarity for Classification
The similarity between each document and each label is computed using cosine similarity.
Cosine similarity measures the angle between two vectors: cos(θ) = (a · b) / (‖a‖ ‖b‖).
Smaller angle (cosine closer to 1) → higher similarity
Illustrated in Figures 4-15 and 4-16.
The predicted label is the label with the highest similarity to the document embedding.
A minimal sketch of this nearest-label approach is shown below (assuming sentence-transformers and scikit-learn; not the book's verbatim code):
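```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative embedding model; the book may use a different checkpoint
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

docs = [
    "This movie was a waste of two hours.",
    "An absolute masterpiece, I loved every minute.",
]
label_descriptions = ["A negative review", "A positive review"]

doc_embeddings = model.encode(docs)
label_embeddings = model.encode(label_descriptions)

# Each document is assigned the label description it is most similar to
sims = cosine_similarity(doc_embeddings, label_embeddings)
predictions = np.argmax(sims, axis=1)
print(predictions)  # likely [0 1]: first review negative, second positive
```

Each row of sims compares one document against both label descriptions; the argmax picks the closer label, which becomes the zero-shot prediction.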
6. Performance of Embedding-Based Zero-Shot Classification
Despite using no labeled training data, the method achieves:
Accuracy: ~78%
F1 Score: ~0.78
This performance highlights the power and flexibility of embeddings as a general-purpose representation for text tasks.
TIP from the book:
Label descriptions can be improved for better classification.
Example improvement:
“A very negative movie review”
“A very positive movie review”
More specific label descriptions → embeddings carry more task-relevant information.
7. Key Takeaways from Chapter 4
A. LLMs enable effective classification even without training
Zero-shot classification shows strong performance using embeddings alone.
B. Label design matters
Clear, descriptive label phrases improve embedding-based methods.
C. Embeddings are highly versatile
They can be reused for clustering, search, semantic similarity, and more.
D. Cosine similarity is central
Simple but expressive metric for comparing embedding vectors.
E. Generative vs Representation models
Chapter sets the stage for seeing how both kinds of LLMs can be applied to NLP tasks.
8. How Chapter 4 Connects to Later Content
This chapter kicks off Part II: Using Pretrained Language Models.
It introduces general techniques (embeddings, cosine similarity, zero-shot reasoning) that become building blocks for:
Topic modeling
Semantic search
RAG systems
Clustering
Fine-tuning (discussed in Part III)