
Transformers

Here are detailed notes on the Transformer architecture, its types, and its pros and cons.

What is a Transformer?

The Transformer is a deep learning architecture that has become the de facto standard for most tasks in Natural Language Processing (NLP) and, increasingly, in computer vision and other fields.

It was introduced in the 2017 paper "Attention Is All You Need."

Its main innovation was to completely get rid of Recurrent Neural Networks (RNNs) and instead rely entirely on self-attention mechanisms to process sequential data.


Why Was It a Revolution? The Problem it Solved

Before Transformers, models like RNNs (and their variants, LSTMs and GRUs) were the top choice for sequence data.

  • RNNs work sequentially: They read one word at a time, update a "hidden state" (a memory), and then move to the next word.

  • This created two huge problems:

    1. It's Slow: This sequential nature is inherently not parallelizable. You can't process the 10th word until you've finished with the 9th. This was a massive bottleneck for training on modern GPUs.

    2. It "Forgets": The "hidden state" was a bottleneck. By the time an RNN read a 100-word sentence, it had often "forgotten" the meaning of the first few words. It struggled to capture long-range dependencies.

The Transformer solved both problems.


The Core Architecture: 3 Key Ideas

The Transformer is an Encoder-Decoder model. The Encoder's job is to "read" and "understand" the input sentence. The Decoder's job is to "generate" the output sentence, one word at a time.

Its power comes from three key components (a small code sketch combining all three follows this list):

  1. Self-Attention (The Engine):

    Instead of a sequential hidden state, self-attention allows every word in a sequence to look at and compare itself to every other word in that same sequence, all at once. This builds a rich, contextual understanding. For example, in "The animal didn't cross the street because it was too tired," self-attention can learn to connect the word "it" directly back to "animal," no matter how far apart they are.

  2. Multi-Head Attention:

    The model doesn't just do self-attention once. It does it multiple times in parallel (e.g., 8 or 12 "heads"). Each "head" can learn a different kind of relationship. For example, one head might learn to connect pronouns to their nouns, while another head learns to connect verbs to their subjects. It's like having a committee of experts looking at the sentence simultaneously.

  3. Positional Encoding:

    A major "problem" with removing RNNs is that the model no longer knows the order of the words. "The dog bit the man" and "The man bit the dog" would look identical to a self-attention-only model. To fix this, a "timestamp" vector called Positional Encoding is added to each word's embedding. This gives the model a unique signal for "position 1," "position 2," etc.


Types of Transformers

While the original was an Encoder-Decoder model, the architecture has been broken down into three main "families" based on which parts are used.

1. Encoder-Only (e.g., BERT, RoBERTa)

  • What it is: A stack of just the Encoder blocks.

  • How it works: It's designed to read an entire text and build a deep, rich understanding of it. It's "bidirectional," meaning it looks at the context from both the left and right of a word.

  • Best for: Understanding tasks (Natural Language Understanding - NLU).

  • Typical Use Cases:

    • Text Classification: Is this movie review positive or negative? (See the sketch after this list.)

    • Named Entity Recognition (NER): Find all the "People" and "Organizations" in this text.

    • Question Answering: Given a paragraph, find the span of text that answers a question.
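
As a concrete illustration of these "understanding" use cases, here is a minimal sketch assuming the Hugging Face transformers library is installed; the pipelines download default encoder-style checkpoints on first use, so the exact labels and scores below are only examples.

```python
from transformers import pipeline

# Text classification: typically served by a BERT-style encoder under the hood.
classifier = pipeline("sentiment-analysis")
print(classifier("This movie was an absolute delight!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Named Entity Recognition: group sub-word tokens back into whole entities.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Ada Lovelace worked with Charles Babbage in London."))
# e.g. entities tagged as persons (Ada Lovelace, Charles Babbage) and a location (London)
```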

2. Decoder-Only (e.g., GPT series, LLaMA, Claude)

  • What it is: A stack of just the Decoder blocks.

  • How it works: This is an autoregressive model. It's designed to generate text. It takes a sequence of words (a "prompt") and predicts the very next word. This new word is then added to the input, and the model predicts the next word, and so on.

  • Key Feature: It uses Masked Self-Attention, which means that when predicting the word at position 10, it can only look at words in positions 1-9. It can't "see into the future." (See the sketch after this list.)

  • Best for: Generative tasks (Natural Language Generation - NLG).

  • Typical Use Cases:

    • Chatbots

    • Text Completion

    • Story Writing

    • Code Generation
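
Here is a minimal NumPy sketch of the masked (causal) self-attention described above; the sizes and random inputs are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out the upper triangle (future positions) with -inf so softmax gives them weight 0.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ V

Q = K = V = np.random.randn(5, 4)
out = causal_self_attention(Q, K, V)
# During generation, the model produces one token, appends it to the input,
# and repeats -- which is why decoder-only models are called autoregressive.
```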

3. Encoder-Decoder (e.g., T5, BART, the original Transformer)

  • What it is: The full, original architecture with both stacks.

  • How it works: It's designed specifically for sequence-to-sequence (Seq2Seq) tasks, where you need to transform an input sequence into a new output sequence.

  • Best for: Transformation tasks.

  • Typical Use Cases:

    • Translation: English (input) → French (output). (See the sketch after this list.)

    • Summarization: A long article (input) → a short summary (output).

    • Text Style Transfer: A formal sentence (input) → an informal sentence (output).
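
As a sketch of these transformation tasks, the following assumes the Hugging Face transformers library and the publicly available t5-small checkpoint, chosen here only because it is small; outputs are indicative, not exact.

```python
from transformers import pipeline

# Translation: the encoder reads English, the decoder generates French.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The Transformer replaced recurrence with attention."))

# Summarization: a long input sequence is transformed into a shorter one.
summarizer = pipeline("summarization", model="t5-small")
article = ("The Transformer architecture was introduced in 2017 and replaced "
           "recurrent networks for most sequence tasks. Its self-attention "
           "mechanism lets every token attend to every other token in parallel.")
print(summarizer(article, max_length=25, min_length=5))
```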


Pros and Cons

Pros

  1. Parallelization: This is its biggest advantage. Unlike RNNs, all calculations within a layer can be done simultaneously, making it dramatically faster to train on modern GPUs.

  2. Captures Long-Range Dependencies: Self-attention can (theoretically) connect the 1st word and the 1000th word in a sequence with equal ease. This solves the "forgetting" problem of RNNs.

  3. State-of-the-Art (SOTA) Performance: For almost every NLP benchmark, Transformers (or models based on them) hold the top score.

  4. Scalability (Pre-training): The architecture scales incredibly well. You can train a massive model (e.g., GPT-4) on massive amounts of text (the internet) and then "fine-tune" it on a small, specific task. This is the transfer learning paradigm that dominates AI today.

Cons

  1. Quadratic Complexity: This is its biggest weakness. "Self-attention" means every word compares itself to every other word. If your sequence length is n, the computation and memory required grow as O(n²).

    • This means doubling your sequence length (e.g., 2k → 4k tokens) doesn't just double the cost, it quadruples it (see the short sketch at the end of this list).

    • This makes it extremely difficult and expensive to use on very long sequences (like entire books or high-res images).

  2. Extremely Data-Hungry: To learn meaningful relationships from scratch, Transformers need to be pre-trained on massive datasets (e.g., a large portion of the internet).

  3. Massive Model Size: State-of-the-art models have billions (or even trillions) of parameters. This makes them very expensive to train and even to just run (inference).

  4. Less Interpretable: While "attention maps" (visualizations of what the model "looks at") are helpful, the complex interplay of dozens of layers and heads makes these models difficult "black boxes" to understand.

  5. Positional Encoding is a "Hack": The positional encoding system is an add-on, not an intrinsic part of the model. It feels less "elegant" than the natural sequential processing of an RNN.
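
To make the quadratic-cost point from item 1 concrete, here is a tiny sketch counting pairwise attention scores at a few sequence lengths. These are raw counts only; the real cost also multiplies by the number of heads, layers, and bytes per score.

```python
# Doubling the sequence length quadruples the size of each attention matrix.
for seq_len in (2_048, 4_096, 8_192):
    scores = seq_len * seq_len   # one (seq_len x seq_len) matrix, per head, per layer
    print(f"{seq_len:>6} tokens -> {scores:,} pairwise scores")
# 2,048 -> 4,194,304; 4,096 -> 16,777,216; 8,192 -> 67,108,864
```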
