Machine learning design patterns
[!TIP] High-Frequency Patterns for System Design Interviews
| Pattern | Problem | Solution |
| --- | --- | --- |
| Hashed Feature | High-cardinality categories | Use deterministic hashing to map into fixed buckets |
| Embeddings | Sparse, high-dimensional data | Learn dense, continuous vector representations |
| Feature Cross | Non-linear interactions | Create a new feature by concatenating/crossing two features |
| Multimodal Input | Mixed data (text + image) | Separate specialized encoders with a late-fusion head |
| Reframing | Hard classification | Change to regression (probability) or ranking |
| Rebalancing | Imbalanced datasets | Weighted loss or resampling (SMOTE / undersampling) |
| Ensemble | Weak signal from a single model | Bagging (variance), Boosting (bias), or Stacking |
| Cascade | Hierarchical decisions | Chain models sequentially (e.g., object detection -> OCR) |
| Transfer Learning | Low data / high compute | Leverage pretrained models and fine-tune on target data |
Chapter Summaries
Chapter 1: The Need for Machine Learning Design Patterns
1. Introduction
Engineering disciplines use design patterns to capture best practices and recurring solutions.
These patterns codify expert knowledge into reusable advice for practitioners.
The book is a catalog of ML design patterns—common strategies to handle challenges in data preparation, model building, and MLOps.
2. What Are Design Patterns?
Originated in architecture from Christopher Alexander’s book A Pattern Language (1977).
Later applied to software engineering in Design Patterns: Elements of Reusable Object-Oriented Software (Gamma et al., 1994).
A design pattern describes:
A recurring problem.
The core solution.
Guidelines for adapting it to local needs.
In ML, design patterns help standardize solutions for challenges like feature transformation, reproducibility, serving models, etc.
Example patterns:
Transform Pattern (Ch. 6): Separates inputs, features, and transformations for easier production deployment.
Keyed Predictions (Ch. 5): Distributes batch predictions efficiently across large datasets.
3. How to Use This Book
The book serves as a catalog rather than a linear tutorial.
You can:
Skim chapters.
Refer back to relevant patterns when facing a problem.
Each pattern includes:
Problem statement.
Canonical solution.
Why it works.
Trade-offs and alternatives.
Code examples are in SQL, scikit-learn, and TensorFlow/Keras.
Goal: Provide a common vocabulary for ML practitioners to discuss and implement these concepts.
4. Machine Learning Terminology
Models and Frameworks
ML replaces hand-coded rules with models that learn patterns from data.
Example: Predicting moving costs using a learned model instead of nested if statements.
Common model types:
Linear regression
Decision trees
Neural networks
Clustering models
Learning types:
Supervised Learning: Labeled data (classification or regression).
Unsupervised Learning: No labels (clustering, dimensionality reduction).
Frameworks:
TensorFlow (mainly via Keras API)
scikit-learn
PyTorch
XGBoost
BigQuery ML (SQL-based ML).
Data and Feature Engineering
Datasets:
Training data: Used to learn patterns.
Validation data: Used to tune hyperparameters and avoid overfitting.
Test data: Used for final evaluation; must be unseen by the model.
Structured data: Numeric/categorical (tabular).
Unstructured data: Text, image, audio, video.
Feature engineering: Converting raw inputs to model-understandable numerical features.
Example: Input = timestamp → Feature = day of week.
Terminology:
Input: Raw column.
Feature: Transformed column.
Instance: Single record used for prediction.
Label: The ground-truth value being predicted.
Data Validation: Checks data consistency, balance, and schema correctness to avoid bias or drift.
Machine Learning Process
Training: Model learns from data.
Evaluation: Measure performance on validation/test sets.
Serving: Deploy model for predictions.
Online prediction: Real-time, low-latency.
Batch prediction: Large-scale, offline predictions.
Inference: Process of making predictions (also called “prediction” in ML).
Pipelines: Automate multi-step processes for data preprocessing, training, evaluation, and deployment.
Streaming: Real-time data processing for continuous model updates.
Data and Model Tooling (Google Cloud Examples)
BigQuery: For scalable SQL-based data analysis.
BigQuery ML: For training and evaluating ML models directly in SQL.
AI Platform Training: Infrastructure for distributed training.
AI Platform Prediction: For deploying and serving ML models.
Explainable AI: For interpreting model predictions.
Roles in Machine Learning
Data Scientist: Focuses on data analysis, feature engineering, and model building.
Data Engineer: Builds data ingestion and processing pipelines.
ML Engineer: Manages model training, versioning, and deployment.
Research Scientist: Develops new ML algorithms (academic/research focus).
Data Analyst: Derives business insights from data.
Developer: Integrates ML models into end-user applications (APIs, web/mobile).
5. Common Challenges in Machine Learning
A. Data Quality
“Garbage in, garbage out.”
Four key aspects:
Accuracy: Are the features and labels correct?
Completeness: Does the dataset represent all relevant cases and diversity?
Consistency: Are collection and labeling standards uniform?
Timeliness: Is the data fresh and synchronized with real-world events?
Example: Inconsistent labeling in sentiment analysis → biased model.
B. Reproducibility
ML has randomness due to weight initialization and stochastic training.
Ensuring reproducibility requires:
Setting random seeds (tf.random.set_seed()).
Fixing data splits, hyperparameters, and batch sizes.
Version controlling dependencies and using containers.
Maintaining consistent training environments.
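A minimal sketch of what pinning these sources of randomness looks like with TensorFlow and scikit-learn; the seed value and split parameters are arbitrary placeholders:

```python
import random

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary, but fixed and checked into version control

# Pin the Python, NumPy, and TensorFlow random number generators.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Fix the data split too, so every run trains and validates on the same rows.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)
```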
C. Data Drift
Data distribution changes over time.
Causes:
Evolving environment (e.g., language, behavior).
Improved data collection methods.
Example: A news classification model trained on 20th-century articles may fail on modern terms like “smartphone”.
Solutions:
Continuous retraining.
Data versioning.
Monitoring for drift.
Bridged Schema Pattern (Ch. 6) handles evolving feature availability.
D. Scale
Scaling challenges occur in:
Data processing: Handling millions of rows efficiently.
Training: Large models and datasets require GPUs/TPUs.
Serving: Supporting millions of prediction requests.
Design patterns for scale are discussed in later chapters (e.g., Distribution Strategy, Stateless Serving Function).
E. Multiple Objectives
Different stakeholders optimize for different goals:
Data scientist → minimize loss.
Product manager → reduce false negatives.
Executives → increase ROI.
Trade-offs between precision and recall or accuracy and fairness must be balanced.
Communication and shared metrics across teams are crucial.
6. Summary
Design patterns are reusable best practices to standardize ML problem-solving.
Major ML challenges:
Data quality
Reproducibility
Data drift
Scale
Multiple objectives
Patterns apply across the ML life cycle — from data collection and training to deployment and maintenance.
Chapter 2: Data Representation Design Patterns
1. Introduction
At the heart of every machine learning model lies a mathematical function designed to operate on specific data types (e.g., booleans, numbers).
However, real-world data—text, images, categories—often doesn’t directly fit this mathematical form.
Hence, data needs to be represented or transformed into a form that the model can process.
This chapter explains various data representation patterns, i.e., methods for converting raw inputs into mathematical features suitable for ML models.
2. Key Concepts
a. Input vs. Feature
Input: The raw, real-world data fed into the model.
Example: Baby’s weight = 2.7 kg.
Feature: The transformed, numeric or boolean value used by the model.
Example: is_low_weight = (baby_weight < 3kg) → Boolean feature.
The process of converting inputs into features is called feature engineering.
b. Why Data Representation Matters
Models like decision trees or linear regressions can only operate on specific data types (e.g., booleans, numerics).
Example:
Decision tree core math operates on boolean conditions (AND, OR).
To use non-boolean inputs (like hospital name), we must first transform them into boolean conditions.
A good data representation makes learning:
Simpler (fewer nodes/features),
Faster (less computation),
More accurate (better generalization).
c. Learnable vs. Fixed Representations
Fixed (Engineered):
The transformation is manually designed by humans.
Example: (babyweight < 3kg) or one-hot encoding.
Learnable (Automatically Extracted):
The model learns its own representation during training.
Example: Embeddings in neural networks.
Hybrid:
Some representations sit in between: the mapping is fixed and deterministic rather than learned, but it is designed so the downstream model can still generalize from it.
Example: Hashed features (fixed mapping, no learning, but generalizable).
d. Relationship Between Representations
One-to-one: Each input → one feature (simple cases).
Many-to-one: Several inputs combined to make a new feature (e.g., feature crosses).
Many-to-many: Multi-modal data (text + image + tabular combined).
3. Overview of the Chapter’s Patterns
This chapter introduces four major data representation patterns:
1. Hashed Feature: represent high-cardinality categorical data efficiently.
2. Embeddings: learn dense, continuous representations for categorical or textual data.
3. Feature Cross: combine multiple features to capture interactions.
4. Multimodal Input: handle and combine multiple data types (text, image, etc.).
Before discussing these, the book explains simple data representations — basic techniques for handling numeric and categorical data.
4. Simple Data Representations
A. Numerical Inputs
Most ML models (linear regression, random forests, neural nets) operate on numeric features.
Why Scaling is Important
Optimisers (like gradient descent) converge faster when all features have comparable magnitudes.
Scaling to [-1, 1] or [0, 1] helps:
Reduces curvature of the loss surface.
Stabilizes gradients.
Improves numerical precision.
Example experiment using scikit-learn showed ~9% faster training when data was scaled.
Scaling Techniques
Min-Max Scaling
x' = \frac{2x - \text{max}_x - \text{min}_x}{\text{max}_x - \text{min}_x}
Maps values to [-1, 1].
Sensitive to outliers (max/min from training data).
Clipping with Min-Max Scaling
Clip outliers to a reasonable percentile (e.g., 1st and 99th) before scaling.
Z-Score (Standard) Normalization
x' = \frac{x - \mu}{\sigma}
Centers data around 0 with unit variance.
Useful for models assuming Gaussian distribution.
Log Transformation
Applies to positively skewed data (e.g., income, population).
Makes data more symmetric.
Bucketization
Converts continuous numeric variables into discrete ranges (bins).
Example: Age → {0–20, 21–40, 41–60, 61+}.
Useful for tree-based models.
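A short scikit-learn sketch of the scaling and bucketization techniques above; the toy arrays, percentiles, and bin count are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

ages = np.array([[22.0], [35.0], [47.0], [61.0], [80.0]])

# Min-max scaling to [-1, 1] (sensitive to outliers in the training data).
minmax_ages = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)

# Z-score normalization: zero mean, unit variance.
zscore_ages = StandardScaler().fit_transform(ages)

# Clipping before scaling: cap values at the 1st/99th percentiles first.
clipped = np.clip(ages, np.percentile(ages, 1), np.percentile(ages, 99))

# Log transform for positively skewed values such as income.
incomes = np.array([[30_000.0], [45_000.0], [1_200_000.0]])
log_incomes = np.log1p(incomes)

# Bucketization: continuous age -> four discrete bins.
age_buckets = KBinsDiscretizer(
    n_bins=4, encode="ordinal", strategy="uniform"
).fit_transform(ages)
```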
B. Categorical Inputs
Categorical data represents discrete, non-numeric values (e.g., gender, city, color).
Representation Methods
Label Encoding
Assigns integer IDs to each category.
Not suitable for linear models (implies ordering).
One-Hot Encoding
Creates binary columns for each category.
Example: Color = {Red, Blue, Green}
→ [1, 0, 0], [0, 1, 0], [0, 0, 1].
Works well for small cardinality but inefficient for large sets.
Multi-Hot Encoding
For multi-label features (e.g., movie genres: {Action, Comedy} → [1, 1, 0, 0]).
Ordinal Encoding
Assigns ordered numeric values when order matters (e.g., size: small=1, medium=2, large=3).
Target / Mean Encoding
Replaces a category with the average value of the target variable for that category.
Useful for high-cardinality features.
Risk: data leakage → must be applied on training folds only.
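A sketch of one-hot and leakage-safe target encoding with pandas and scikit-learn; the toy data frame and column names are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame(
    {"color": ["red", "blue", "green", "blue"], "label": [1, 0, 1, 1]}
)
test = pd.DataFrame({"color": ["green", "red", "purple"]})

# One-hot encoding: one binary column per category seen in training;
# unseen categories ("purple") become an all-zero row instead of an error.
ohe = OneHotEncoder(handle_unknown="ignore")
train_onehot = ohe.fit_transform(train[["color"]]).toarray()
test_onehot = ohe.transform(test[["color"]]).toarray()

# Target (mean) encoding: replace each category with the mean label,
# computed on the training split only to avoid leakage.
category_means = train.groupby("color")["label"].mean()
test_target_encoded = test["color"].map(category_means).fillna(train["label"].mean())
```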
5. Data Representation Design Patterns
Pattern 1 — Hashed Feature
Problem: High-cardinality categorical variables (e.g., ZIP code, user ID) → huge one-hot vectors.
Solution: Apply a hashing function to map categories into a fixed-size vector space.
Example:
\text{feature\_index} = \text{hash}(category) \% N
where N = number of hash buckets.
Why it Works:
Reduces memory usage and training time.
Collisions are tolerable when N is large enough.
Trade-offs:
Loses interpretability (collisions cause mixed signals).
Not reversible (cannot recover original category).
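A minimal, framework-free sketch of the hashing trick described above. The bucket count is an arbitrary placeholder, and a stable hash (MD5) is used because Python's built-in hash() is salted per process:

```python
import hashlib

NUM_BUCKETS = 100  # trade-off: more buckets = fewer collisions, more memory

def hashed_feature(category: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a high-cardinality category to a bucket index."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# The same ZIP code always lands in the same bucket, even across processes.
print(hashed_feature("94103"), hashed_feature("10001"))
```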
Pattern 2 — Embeddings
Problem: One-hot vectors are sparse and high-dimensional.
Solution: Learn dense, low-dimensional vectors that represent similarity between categories.
Example: Word embeddings (Word2Vec, GloVe, BERT).
How it Works:
Each category is assigned a learnable vector (e.g., 128 dimensions).
During training, embeddings adjust to minimize model loss.
Similar items end up close together in vector space.
Advantages:
Compact and efficient representation.
Captures semantic relationships (e.g., “king” – “man” + “woman” ≈ “queen”).
Trade-offs:
Requires large datasets.
Less interpretable than one-hot.
Risk of bias if training data is biased.
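A minimal Keras sketch: the embedding table is just another trainable layer, so the vectors are learned jointly with the rest of the model. Vocabulary size, embedding dimension, and the downstream head are placeholder choices:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000  # number of distinct tokens/categories (assumed)
EMBED_DIM = 128      # size of the learned dense vector

model = tf.keras.Sequential([
    # Maps each integer ID to a trainable 128-dimensional vector.
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),  # average the token vectors
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# After training, similar categories end up close together in the 128-dim space.
```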
Pattern 3 — Feature Cross
Problem: Models may fail to capture interactions between features.
Example: A model might not realize that “High Income + Young Age” implies a specific pattern.
Solution: Combine multiple features to form a new feature representing their interaction.
Example: cross_feature = city + "_" + job_type
Why it Works:
Introduces nonlinearity and context awareness in models like logistic regression.
Trade-offs:
Increases dimensionality.
May overfit if not regularized or pruned.
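A small pandas sketch of an explicit cross between two categorical columns (the column names are illustrative); the crossed column is then encoded like any other categorical feature:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi"],
    "job_type": ["engineer", "sales", "sales"],
})

# Cross the two features by concatenating their values.
df["city_x_job"] = df["city"] + "_" + df["job_type"]

# Encode the cross; for large cardinality, hash or embed it instead.
crossed_onehot = pd.get_dummies(df["city_x_job"])
print(crossed_onehot.columns.tolist())
```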
Pattern 4 — Multimodal Input
Problem: Real-world systems often use multiple data types (text, images, structured data).
Example: Product recommendation = image + text + numerical price.
Solution: Build models that accept and learn from multiple modalities.
Each data type passes through its specialized subnetwork:
CNN → for images
RNN/Transformer → for text
Dense layers → for tabular features
Outputs are concatenated and passed to a shared decision layer.
Trade-offs:
Complex architecture.
Requires more data and compute.
Harder to debug and interpret.
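A sketch of the late-fusion architecture described above, using the Keras functional API; the input shapes, encoder sizes, and task head are all placeholder choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Image branch: a tiny CNN encoder.
image_in = layers.Input(shape=(64, 64, 3), name="image")
img = layers.Conv2D(16, 3, activation="relu")(image_in)
img = layers.GlobalAveragePooling2D()(img)

# Text branch: embedding + pooling (an RNN or Transformer could be swapped in).
text_in = layers.Input(shape=(100,), name="text_token_ids")
txt = layers.Embedding(input_dim=20_000, output_dim=64)(text_in)
txt = layers.GlobalAveragePooling1D()(txt)

# Tabular branch: dense layer over numeric features such as price.
tab_in = layers.Input(shape=(5,), name="tabular")
tab = layers.Dense(16, activation="relu")(tab_in)

# Late fusion: concatenate the three encodings and add a shared decision head.
merged = layers.concatenate([img, txt, tab])
output = layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[image_in, text_in, tab_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```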
6. Conceptual Summary
| Technique | Purpose | Examples |
| --- | --- | --- |
| Scaling | Normalize numeric range | Min-max, z-score |
| Encoding | Convert categories to numbers | One-hot, hashing |
| Learnable Representations | Automatically learn features | Embeddings |
| Feature Interactions | Capture complex relationships | Feature Cross |
| Multi-Input Fusion | Combine heterogeneous data types | Multimodal Input |
7. Key Takeaways
ML models don’t understand raw data — they understand numerical representations.
The choice of representation heavily affects performance, generalization, and scalability.
Automatic representations (like embeddings) are powerful but less interpretable.
Feature crosses and multimodal designs improve accuracy for complex datasets.
Always validate and test data transformations to ensure consistency between training and production.
Chapter 3: Problem Representation Design Patterns
1. Introduction
After understanding how data is represented (Chapter 2), the next challenge is how to represent the problem itself so that it can be effectively solved using machine learning.
In other words:
Sometimes, it’s not the data or model that’s wrong — it’s the way we’ve framed the problem.
For instance, a problem that seems like classification might be easier to solve as regression, ranking, or clustering.
This chapter introduces problem representation design patterns — techniques to reframe or reformulate ML problems for better performance, interpretability, and deployment efficiency.
2. Why Problem Representation Matters
ML performance often depends less on algorithm choice and more on how the problem is structured.
Proper representation helps:
Simplify the learning task.
Align model outputs with business goals.
Reduce data or compute requirements.
Enable reusability of existing models.
3. Design Patterns Covered
| # | Pattern | Purpose |
| --- | --- | --- |
| 5 | Reframing | Reformulate the ML problem for better alignment with available data and business goals. |
| 6 | Multilabel | Handle cases where an instance can belong to multiple categories simultaneously. |
| 7 | Ensembles | Combine multiple models to achieve better accuracy and robustness. |
| 8 | Cascade | Chain models sequentially so one's output is another's input (hierarchical modeling). |
| 9 | Neutral Class | Introduce a "none of the above" or "uncertain" label to improve real-world reliability. |
| 10 | Rebalancing | Handle imbalanced datasets by resampling or weighting techniques. |
4. Pattern 5 — Reframing
Problem
Sometimes, the initial formulation of a problem is not ideal for ML.
Example: Instead of predicting “will a user buy or not,” predict “how likely is the user to buy” (a regression problem).
Solution
Reframe the task into an equivalent but more learnable or measurable ML problem.
Examples
Classification ↔ Regression:
Predicting click-through rate (CTR) can be treated as regression (probability) instead of binary classification.
Ranking instead of classification:
In recommendation systems, predicting order (rank) is more meaningful than binary relevance.
Generative → Discriminative (and vice versa):
Instead of generating all possibilities, classify the best one.
Why It Works
Reframing simplifies the target function.
Allows use of existing datasets or pretrained models.
Reduces overfitting or data inefficiency.
Trade-Offs
May lose interpretability.
Business stakeholders must agree on the reformulated output metric.
5. Pattern 6 — Multilabel
Problem
An instance can belong to multiple categories simultaneously.
Example:
An article can be both “sports” and “politics.”
A movie can be tagged “action”, “comedy”, “romance.”
Standard classifiers assume mutually exclusive labels — not valid here.
Solution
Use a multilabel classification setup:
Model outputs a vector of probabilities (one per label).
Apply a sigmoid activation on each output neuron (not softmax).
Threshold each output independently (e.g., > 0.5 → label present).
Implementation
Data: Labels represented as multi-hot vectors.
Example: [1, 0, 1, 0] → belongs to classes 1 and 3.
Evaluation: Use metrics like Hamming loss, precision@k, or F1 score per label.
Why It Works
Models learn shared features across labels.
Reflects real-world multi-dimensional tagging.
Trade-Offs
Harder to interpret or tune thresholds.
Labels may have dependency (co-occurrence) that is not modeled directly.
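A minimal Keras sketch of the multilabel setup described above: sigmoid outputs, binary cross-entropy applied per label, and independent thresholding. The input dimension, label count, and threshold are illustrative:

```python
import numpy as np
import tensorflow as tf

NUM_LABELS = 4  # e.g., action, comedy, romance, thriller (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    # Sigmoid per output neuron: each label is an independent yes/no decision.
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])
# Binary cross-entropy is applied to each label independently (not softmax).
model.compile(optimizer="adam", loss="binary_crossentropy")

# Multi-hot target: this sample belongs to classes 0 and 2.
y_example = np.array([[1, 0, 1, 0]])

# At prediction time, threshold each probability independently.
probs = model.predict(np.random.rand(1, 20))
predicted_labels = (probs > 0.5).astype(int)
```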
6. Pattern 7 — Ensembles
Problem
Single models have bias and variance limitations; one algorithm rarely captures all aspects of a complex dataset.
Solution
Combine multiple models to leverage their collective strengths.
Common Ensemble Techniques
Bagging (Bootstrap Aggregation):
Train several models on random samples of the data.
Example: Random Forest.
Reduces variance.
Boosting:
Sequentially train models, each correcting errors of the previous one.
Example: XGBoost, AdaBoost, LightGBM.
Reduces bias.
Stacking:
Train multiple base models, then use a meta-model to combine their predictions.
Why It Works
Reduces overfitting by averaging out individual model weaknesses.
Improves robustness and generalization.
Trade-Offs
Increased complexity and inference time.
Harder to interpret.
Maintenance overhead for multiple models.
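A scikit-learn sketch contrasting bagging, boosting, and stacking on a toy dataset; the estimators and hyperparameters are placeholder choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging (Random Forest) mainly reduces variance.
bagged = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting (gradient boosting) mainly reduces bias.
boosted = GradientBoostingClassifier(random_state=0)

# Stacking: a logistic-regression meta-model combines the base predictions.
stacked = StackingClassifier(
    estimators=[("rf", bagged), ("gb", boosted)],
    final_estimator=LogisticRegression(),
)
stacked.fit(X, y)
```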
7. Pattern 8 — Cascade
Problem
Some predictions are hierarchical or conditional — one model’s decision informs the next.
Example:
In object detection:
First model detects objects → second classifies object type.
Fraud detection:
First model flags suspicious transactions → second verifies fraud probability.
Solution
Chain models sequentially, where:
Output of model A becomes input (or trigger) for model B.
Each stage filters or enriches data.
Advantages
Improves efficiency (later models process fewer cases).
Allows specialized models for each sub-task.
Mimics human decision-making hierarchy.
Trade-Offs
Error propagation: mistakes in early stages affect later ones.
Hard to debug end-to-end.
Latency increases if cascaded synchronously.
Best Practices
Use confidence thresholds to decide when to trigger next model.
Log intermediate outputs for traceability.
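A hypothetical two-stage cascade for the fraud example above. `cheap_model` and `expensive_model` are assumed to be already-trained classifiers exposing a scikit-learn-style `predict_proba`, and both thresholds are placeholders to be tuned on validation data:

```python
import numpy as np

FLAG_THRESHOLD = 0.3   # stage-1 confidence needed to trigger stage 2
FRAUD_THRESHOLD = 0.8  # stage-2 confidence needed to label fraud

def cascade_predict(cheap_model, expensive_model, transactions: np.ndarray):
    # Stage 1: a cheap model screens every transaction.
    flag_scores = cheap_model.predict_proba(transactions)[:, 1]
    suspicious = flag_scores > FLAG_THRESHOLD

    decisions = np.zeros(len(transactions), dtype=int)
    if suspicious.any():
        # Stage 2: only flagged transactions reach the expensive model.
        fraud_scores = expensive_model.predict_proba(transactions[suspicious])[:, 1]
        decisions[suspicious] = (fraud_scores > FRAUD_THRESHOLD).astype(int)
    return decisions
```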
8. Pattern 9 — Neutral Class
Problem
Real-world data often includes uncertain or ambiguous examples.
Example: Image classifier forced to choose between “dog” or “cat” even if it’s neither.
Model gives overconfident wrong predictions → dangerous in production.
Solution
Introduce a neutral class (e.g., “none,” “other,” or “uncertain”) to capture ambiguous inputs.
Why It Works
Prevents forcing classification where confidence is low.
Reduces false positives and improves user trust.
Especially useful for open-world or safety-critical systems.
Techniques
Add “Other” class during training.
Use confidence thresholds — if model confidence < threshold → “neutral.”
Calibrate probabilities using temperature scaling or Platt scaling.
Trade-Offs
May increase false negatives.
Requires well-curated examples of neutral class.
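A small sketch of the confidence-threshold variant: route low-confidence predictions to a neutral class. The threshold and sentinel value are illustrative:

```python
import numpy as np

NEUTRAL = -1          # sentinel index for "uncertain / none of the above"
CONFIDENCE_MIN = 0.7  # illustrative; tune on validation data

def predict_with_neutral(probs: np.ndarray) -> np.ndarray:
    """probs: (n_samples, n_classes) softmax outputs from any classifier."""
    best_class = probs.argmax(axis=1)
    best_prob = probs.max(axis=1)
    # Fall back to the neutral class when the model is not confident enough.
    return np.where(best_prob >= CONFIDENCE_MIN, best_class, NEUTRAL)

print(predict_with_neutral(np.array([[0.90, 0.05, 0.05],
                                     [0.40, 0.35, 0.25]])))
# -> [ 0 -1]: the second example is routed to "uncertain".
```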
9. Pattern 10 — Rebalancing
Problem
In many ML problems, classes are imbalanced — one class has far more examples than others.
Examples:
Fraud detection: 0.1% fraud vs. 99.9% non-fraud.
Disease diagnosis: rare positive cases.
This leads to:
Biased model predictions toward majority class.
Misleading accuracy (model always predicts “non-fraud”).
Solution
Adjust data or loss to balance the learning process.
Techniques
Resampling
Oversampling minority class: Duplicate or synthesize samples (SMOTE).
Undersampling majority class: Randomly drop examples.
Hybrid approaches combine both.
Class Weighting
Assign higher loss weight to minority class during training.
Common in Keras: class_weight={0:1, 1:10}.
Threshold Adjustment
Tune probability cutoff (e.g., predict “fraud” if prob > 0.2).
Synthetic Data Generation
Use GANs or variational autoencoders to create realistic minority samples.
Evaluation Metrics
Precision, Recall, F1 score
ROC-AUC, PR-AUC
Avoid plain accuracy.
Why It Works
Forces the model to treat rare cases seriously.
Improves recall for minority classes.
Trade-Offs
Risk of overfitting with oversampling.
Removing too many majority samples may lose signal.
Weighted loss can make optimization unstable.
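A sketch of the class-weighting option in Keras, with weights derived from the label distribution; the toy data, model, and imbalance ratio are illustrative. A resampling approach (e.g., SMOTE from imbalanced-learn) would instead be applied to the training split before fitting:

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: roughly 1% positives.
y_train = np.random.binomial(1, 0.01, size=10_000)
X_train = np.random.rand(len(y_train), 8)

# The rare class gets a proportionally larger weight in the loss.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weight = {0: weights[0], 1: weights[1]}

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_train, y_train, epochs=1, class_weight=class_weight, verbose=0)
```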
10. Chapter Summary
| Pattern | What it does | Benefit | Example |
| --- | --- | --- | --- |
| Reframing | Reformulate problem | Easier or more effective learning | Convert classification → regression |
| Multilabel | Predict multiple categories per sample | Capture co-occurring labels | Movie genre tagging |
| Ensembles | Combine models | Improve accuracy, reduce variance | Kaggle competitions, tabular data |
| Cascade | Chain models hierarchically | Efficient multi-step predictions | Object detection pipelines |
| Neutral Class | Add "uncertain" class | Increase reliability | Open-set recognition |
| Rebalancing | Handle imbalanced data | Fairer model training | Fraud or anomaly detection |
11. Key Takeaways
Problem framing is as important as model choice.
Many ML problems benefit from creative reformulation.
Multiple models (ensembles, cascades) can outperform single-model approaches.
Introduce neutrality and balance to handle real-world uncertainty and bias.
Always evaluate models with metrics aligned to business objectives, not just accuracy.
Chapter 4: Model Training Design Patterns
1. Introduction
After understanding data representation (Ch. 2) and problem representation (Ch. 3), this chapter focuses on the model training phase — the process of teaching the model to recognize patterns and relationships in data.
Model training is where the model’s parameters (weights) are adjusted based on the loss function and optimizer. However, ML engineers face several practical challenges during this process — such as data leakage, distribution mismatch, instability, and lack of reproducibility.
This chapter introduces design patterns to make training:
More efficient
More stable
More reliable and generalizable
2. Overview of Model Training Design Patterns
| # | Pattern | Purpose |
| --- | --- | --- |
| 11 | Transform | Maintain consistent data preprocessing between training and serving |
| 12 | Multistage Training | Train models in multiple phases or levels for improved performance |
| 13 | Transfer Learning | Reuse pretrained models for new, related tasks |
| 14 | Distillation | Use a larger "teacher" model to train a smaller "student" model |
| 15 | Regularization | Prevent overfitting by adding constraints or penalties |
Pattern 11 — Transform
Problem
The training-serving skew problem:
Transformations applied during training are not replicated identically during inference (serving).
Example:
During training, you normalize age by (x - mean) / std using training stats.
During serving, an engineer recomputes or implements the transformation differently → prediction mismatch.
Solution
Create a shared transformation pipeline used identically in both training and serving.
Approaches
Reusable Code
Package preprocessing as reusable modules or functions.
Avoid duplicating logic in separate codebases (e.g., Python vs. Java).
Serialization
Serialize transformation logic (e.g., StandardScaler from scikit-learn → pickle file).
Store mean, std, vocabularies, bucket boundaries, etc.
Pipeline Frameworks
Use systems like TensorFlow Transform (TFT) or Apache Beam.
Compute statistics on training data; apply same transforms consistently in production.
Example
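The pipeline frameworks above (e.g., TensorFlow Transform) are one route; as a minimal stand-in, here is a scikit-learn sketch where the fitted transformation is bundled with the model and serialized, so serving cannot recompute it differently. The file name and toy data are placeholders:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training: the scaler's mean/std are learned from training data and are
# stored inside the same artifact as the model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
X_train = np.random.rand(200, 3)
y_train = np.random.randint(0, 2, size=200)
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model_with_transform.joblib")

# Serving: load the single artifact; the identical transform is applied
# before every prediction, eliminating training-serving skew.
served = joblib.load("model_with_transform.joblib")
served.predict(np.random.rand(5, 3))
```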
Why It Works
Eliminates skew between training and serving.
Improves reproducibility.
Reduces production bugs.
Trade-Offs
Slight overhead in maintaining shared pipelines.
Must ensure transformations are deterministic.
Pattern 12 — Multistage Training
Problem
Some problems are too complex to learn directly.
Models might:
Take too long to converge,
Overfit quickly,
Or fail to generalize when trained in one go.
Solution
Break training into multiple stages — each focusing on a specific sub-goal or learning task.
Each stage’s output (weights, embeddings, or models) is used as input to the next.
Examples
Pretraining + Fine-tuning
Pretrain on a large dataset (e.g., ImageNet).
Fine-tune on a smaller, domain-specific dataset.
Coarse-to-Fine
Stage 1: Learn to classify broad categories (animals, vehicles).
Stage 2: Refine within each category (dog breeds, car models).
Curriculum Learning
Start with easy examples → gradually introduce harder ones.
Multitask Training
Train on multiple related tasks sequentially or jointly.
Why It Works
Reduces learning complexity.
Speeds convergence.
Improves generalization and performance on small datasets.
Trade-Offs
More engineering complexity.
Requires well-planned transitions between stages.
Risk of catastrophic forgetting (later stages overwrite earlier learning).
Pattern 13 — Transfer Learning
Problem
Training models from scratch requires massive labeled data and computational power.
Solution
Leverage pretrained models (trained on large generic datasets) and fine-tune them for your specific task.
How It Works
Choose a pretrained model:
e.g., ResNet, BERT, Inception.
Freeze early layers:
Keep initial weights fixed — they capture general features.
Replace or retrain final layers:
Adapt model to the new dataset.
Example (Image Classification)
Pretrained on ImageNet (1.2M images, 1000 classes).
Fine-tune on medical X-ray dataset (10,000 images, 5 classes).
Why It Works
Reuses learned representations (edges, textures, words, etc.).
Reduces required training time and data.
Often achieves better performance.
Applications
NLP: BERT, GPT, T5
Vision: ResNet, EfficientNet
Audio: Wav2Vec
Trade-Offs
Must ensure source and target domains are related.
May transfer unwanted biases.
Fine-tuning too much can cause overfitting.
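A Keras sketch of the freeze-and-replace recipe above, using an ImageNet-pretrained ResNet50 backbone; the class count, input size, and choice of backbone are placeholders:

```python
import tensorflow as tf

NUM_CLASSES = 5  # e.g., five X-ray categories (illustrative)

# Pretrained backbone with its original ImageNet classification head removed.
base = tf.keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",
    input_shape=(224, 224, 3),
    pooling="avg",
)
base.trainable = False  # freeze early layers: keep generic features fixed

# New task-specific head trained on the small target dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Optionally unfreeze the top of `base` later and fine-tune with a very
# small learning rate, watching for overfitting.
```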
Pattern 14 — Distillation
Problem
Large, high-performing models (e.g., ensembles, deep networks) are too big or slow for deployment.
Solution
Train a smaller model (“student”) to mimic a larger one (“teacher”) — capturing its “knowledge.”
How It Works
Teacher model outputs soft probabilities (not just hard labels).
Student is trained to match teacher’s probability distribution.
Often done with a temperature parameter (T) to soften the logits.
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
Example
Train a large BERT model on 1B documents.
Use BERT’s predictions to train a small BiLSTM model (student).
Student is faster to serve with minimal accuracy loss.
Why It Works
Soft labels contain richer information (class similarity).
Student approximates teacher’s decision boundaries.
Trade-Offs
Requires training teacher first (extra cost).
Student may underperform if too small.
Best for inference efficiency, not raw accuracy.
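A sketch of a temperature-scaled distillation loss following the formula above; the temperature, mixing weight, and the assumption that both models expose raw logits are illustrative choices:

```python
import tensorflow as tf

T = 4.0      # temperature: softens both probability distributions
ALPHA = 0.9  # weight on the soft (teacher-matching) term vs. the hard-label term

def distillation_loss(y_true, student_logits, teacher_logits):
    # Soft targets: teacher and student probabilities at temperature T.
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    soft_student = tf.nn.softmax(student_logits / T)
    soft_loss = tf.keras.losses.categorical_crossentropy(soft_teacher, soft_student)

    # Ordinary loss against the ground-truth hard labels.
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, tf.nn.softmax(student_logits)
    )
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return ALPHA * (T ** 2) * soft_loss + (1.0 - ALPHA) * hard_loss
```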
Pattern 15 — Regularization
Problem
Overfitting — model learns noise instead of underlying signal.
Symptoms:
High training accuracy but poor validation accuracy.
Unstable predictions on new data.
Solution
Add constraints or penalties to discourage overly complex models.
Types of Regularization
| Technique | Mechanism | Effect |
| --- | --- | --- |
| Weight regularization | L1 / L2 penalties | Adds a penalty term to the loss function to limit large weights |
| Dropout | Randomly drop neurons during training | Prevents co-dependency among neurons |
| Early stopping | Stop training when validation loss stops improving | Avoids overfitting |
| Data augmentation | Randomly modify inputs (e.g., rotate, crop, noise) | Increases dataset diversity |
| Batch normalization | Normalize layer inputs | Stabilizes learning; acts as a mild regularizer |
Why It Works
Encourages simpler models.
Improves generalisation.
Makes training more stable.
Trade-Offs
Too much regularization → underfitting.
Some methods (dropout) slow training convergence.
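A Keras sketch combining several of the techniques above (L2 penalty, dropout, batch normalization, and early stopping); the layer sizes and coefficients are arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # L2 weight penalty: discourages large weights.
    tf.keras.layers.Dense(
        64,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        input_shape=(20,),
    ),
    tf.keras.layers.Dropout(0.3),          # randomly drop 30% of units
    tf.keras.layers.BatchNormalization(),  # stabilizes training, mild regularizer
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt when validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
# model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```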
6. Summary Table
| Pattern | Focus | Approach | Example |
| --- | --- | --- | --- |
| Transform | Consistency | Share preprocessing logic | Ensure identical scaling in training & serving |
| Multistage Training | Decomposition | Train in phases | Coarse-to-fine classification |
| Transfer Learning | Reuse knowledge | Fine-tune pretrained model | Use BERT for text sentiment |
| Distillation | Efficiency | Compress large models | Deploy smaller models on mobile |
| Regularization | Generalization | Prevent overfitting | Dropout, L2 regularization |
7. Practical Insights and Best Practices
Keep transformations consistent
→ Avoid “train/serve skew” by using shared pipelines or serialized transformers.
Monitor data drift between stages
→ Each stage in multistage training may see slightly different distributions.
Fine-tune with caution
→ Freeze lower layers when using transfer learning; overfitting is common.
Regularisation is your friend
→ Use dropout, L2, or early stopping as standard components in your training pipeline.
Use model distillation for production
→ Train a small, fast student for deployment while retaining the performance of large teacher models.
Always log and version control
→ Save hyperparameters, random seeds, and transformation states for reproducibility.
8. Key Takeaways
The training phase defines how effectively the model learns generalizable patterns.
The five design patterns in this chapter address reliability, efficiency, and robustness of training.
Consistency (Transform) and generalization (Regularization) are foundational to stable ML systems.
Transfer learning and distillation are critical for real-world deployment where compute and data are limited.
Chapter 5: Model Evaluation Design Patterns
1. Introduction
After designing data representations (Ch. 2), framing the problem (Ch. 3), and training models effectively (Ch. 4), the next critical step is evaluating the model — i.e., determining how well it actually works.
Model evaluation is not just about computing accuracy.
It’s about verifying:
Whether the model generalizes to new data,
Whether it meets business objectives, and
Whether it’s reliable, unbiased, and fair.
This chapter introduces evaluation design patterns — reusable strategies for robust, fair, and meaningful model assessment.
2. Overview of Evaluation Design Patterns
| # | Pattern | Purpose |
| --- | --- | --- |
| 16 | Evaluation Metrics | Select the right metrics that align with business goals |
| 17 | Slicing | Evaluate model performance across meaningful subgroups |
| 18 | Skew Detection | Detect mismatches between training and serving data distributions |
| 19 | Baseline Comparison | Always evaluate your model relative to a known reference (baseline) |
| 20 | Prediction Bias | Identify and mitigate systematic bias across groups |
Pattern 16 — Evaluation Metrics
Problem
Choosing the wrong metric can make a model look “good” while it fails in production or misaligns with business goals.
Example:
Fraud detection model with 99.9% accuracy, but it never predicts “fraud” because fraud cases are rare.
Solution
Choose metrics that:
Reflect business objectives,
Match the data distribution, and
Capture model trade-offs (precision vs. recall, etc.)
Types of Metrics
A. Regression
Mean Squared Error (MSE) – sensitive to large errors.
Mean Absolute Error (MAE) – robust to outliers.
R² (Coefficient of Determination) – proportion of variance explained.
B. Classification
Accuracy: (TP + TN) / Total
Precision: TP / (TP + FP) → How often positive predictions are correct.
Recall (Sensitivity): TP / (TP + FN) → How many actual positives are caught.
F1 Score: Harmonic mean of precision and recall.
ROC-AUC: Measures model’s ability to rank positives over negatives.
PR-AUC: Better for highly imbalanced data.
C. Ranking / Recommendation
Precision@k, Recall@k
Mean Average Precision (MAP)
Normalized Discounted Cumulative Gain (NDCG)
D. Probabilistic Outputs
Log Loss (Cross-Entropy) – penalizes overconfident incorrect predictions.
Brier Score – measures calibration of predicted probabilities.
Why It Works
Metrics provide quantitative feedback on model behavior.
Different tasks demand different metrics.
Composite metrics (like F1) balance conflicting goals.
Best Practices
Choose one primary metric tied to business success.
Monitor secondary metrics to detect trade-offs.
Use confidence intervals (via bootstrapping) for reliable estimates.
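A quick scikit-learn sketch computing several of the classification metrics above on toy predictions (the arrays and the 0.5 threshold are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])
y_pred = (y_prob > 0.5).astype(int)  # threshold the probabilities

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("roc_auc:  ", roc_auc_score(y_true, y_prob))  # ranks scores, not labels
```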
Pattern 17 — Slicing
Problem
A model might perform well overall but poorly for specific subgroups — leading to unfair or biased outcomes.
Example:
A credit scoring model performs well overall but discriminates against a certain age group or region.
Solution
Break down evaluation by data slices — subsets of data defined by key attributes (e.g., gender, region, device type).
How It Works
Partition test data into slices (e.g., by gender, region, or device type).
Compute metrics (accuracy, precision, recall, etc.) per slice.
Compare metrics across slices to detect weak areas.
Why It Works
Exposes hidden weaknesses masked by overall averages.
Encourages fairness and interpretability.
Helps prioritize retraining or data collection for weak segments.
Tools
TFMA (TensorFlow Model Analysis) supports slicing natively.
Google’s What-If Tool visualizes performance per subgroup.
Trade-Offs
Requires additional computation.
Slices must be meaningful and representative; too many slices just add noise.
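A small pandas sketch of per-slice evaluation; the evaluation frame, slice column, and chosen metrics are illustrative:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# One row per test example: the attribute to slice on, true label, prediction.
results = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "y_true": [1, 0, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 0],
})

# Metrics per slice; large gaps between slices point to weak subgroups.
per_slice = results.groupby("region").apply(
    lambda g: pd.Series({
        "n": len(g),
        "accuracy": accuracy_score(g["y_true"], g["y_pred"]),
        "recall": recall_score(g["y_true"], g["y_pred"], zero_division=0),
    })
)
print(per_slice)
```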
Pattern 18 — Skew Detection
Problem
Performance drop in production often happens because serving data differs from training data — known as data skew or drift.
Types of skew:
Training-Serving Skew:
Differences due to mismatched preprocessing or data pipelines.
Data Drift:
Real-world data changes over time.
Concept Drift:
The relationship between features and labels changes.
Solution
Continuously compare distributions of features and predictions between training and production data.
Techniques
Statistical Distance Metrics
Kullback-Leibler Divergence (KL)
Jensen-Shannon Divergence
Kolmogorov–Smirnov (KS) test
Visualization
Histograms, quantile plots.
Feature Monitoring
Compare mean, std, and missing value rates.
Prediction Monitoring
Compare predicted probabilities vs. ground truth when available.
Why It Works
Detects silent failures before they cause large-scale issues.
Enables model retraining triggers when drift crosses thresholds.
Trade-Offs
Requires real-time data logging and monitoring.
Hard to define “acceptable drift” threshold.
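A sketch of a per-feature drift check with the two-sample KS test from SciPy; the simulated distributions and decision thresholds are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

# Stand-ins for a logged feature: training-time vs. serving-time values.
train_age = np.random.normal(35, 10, size=5_000)
serving_age = np.random.normal(42, 12, size=5_000)  # drifted distribution

statistic, p_value = ks_2samp(train_age, serving_age)

# Thresholds are illustrative; in practice they are set per feature.
if p_value < 0.01 or statistic > 0.1:
    print(f"Possible drift: KS statistic={statistic:.3f}, p={p_value:.3g}")
```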
Pattern 19 — Baseline Comparison
Problem
A model’s performance number (e.g., 0.82 F1 score) is meaningless without context.
We must always ask: Better than what?
Solution
Always compare against baselines:
Simple models,
Human performance,
Or previous production models.
Types of Baselines
Heuristic Baseline:
Simple rule-based system (e.g., always predict majority class).
Statistical Baseline:
Mean/median predictor for regression.
Production Baseline:
The currently deployed model’s performance.
Human Baseline:
Expert or crowd-sourced accuracy for the same task.
Why It Works
Gives meaning to improvements.
Prevents over-engineering (you don’t need deep learning if logistic regression suffices).
Facilitates A/B testing for deployment.
Best Practices
Always store baseline performance in your ML metadata.
Re-evaluate baseline periodically as data evolves.
Trade-Offs
Baselines can become stale if not updated.
Comparing across different metrics can be misleading.
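A scikit-learn sketch of a heuristic baseline using DummyClassifier on a synthetic imbalanced dataset; note how high accuracy can coexist with a useless F1 score, which is exactly why the baseline gives context:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heuristic baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

# Any candidate model must clearly beat these numbers to justify deployment.
print("baseline accuracy:", accuracy_score(y_te, y_base))
print("baseline F1:      ", f1_score(y_te, y_base, zero_division=0))
```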
Pattern 20 — Prediction Bias
Problem
Models may systematically favor or disfavor certain groups, even if unintentionally.
Example:
A hiring model prefers candidates from certain universities.
A vision model underperforms for darker skin tones.
This is prediction bias — when errors are unevenly distributed across subpopulations.
Solution
Detect and mitigate bias during evaluation using fairness metrics and balanced data slices.
Fairness Metrics
| Metric | Definition |
| --- | --- |
| Demographic Parity | P(pred = positive \| group A) = P(pred = positive \| group B) |
| Equal Opportunity | P(pred = positive \| actual = positive, group A) = same for group B |
| Equalized Odds | Equal FP and FN rates across groups (balanced fairness) |
| Disparate Impact | Ratio of favorable outcome rates across groups; > 0.8 (80% rule) considered fair |
Bias Detection Process
Slice evaluation by sensitive attributes.
Compute metrics across slices.
Identify disparities > acceptable threshold.
Apply mitigation strategies.
Bias Mitigation Techniques
Data-level: Balance representation (resampling, reweighting).
Model-level: Add fairness constraints or adversarial debiasing.
Post-processing: Adjust thresholds per subgroup.
Why It Works
Promotes ethical, transparent, and socially acceptable AI systems.
Reduces reputational and legal risks.
Trade-Offs
May reduce raw accuracy.
Requires access to demographic data (which can be sensitive).
Fairness definitions can conflict (no single metric fits all).
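A minimal pandas sketch of a demographic-parity / disparate-impact check on model outputs; the groups and predictions are invented for illustration:

```python
import pandas as pd

# Hypothetical predictions with a sensitive attribute attached.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B"],
    "y_pred": [1, 1, 0, 1, 0, 0, 0],
})

# Positive-prediction rate per group (demographic parity check).
rates = df.groupby("group")["y_pred"].mean()

# Disparate impact: ratio of the lower rate to the higher rate;
# below 0.8 fails the "80% rule" used as a rough fairness screen.
disparate_impact = rates.min() / rates.max()
print(rates.to_dict())
print(f"disparate impact = {disparate_impact:.2f}")
```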
6. Chapter Summary Table
| Pattern | Goal | Approach | Example |
| --- | --- | --- | --- |
| Evaluation Metrics | Choose correct metric | Align metrics with objectives | Use AUC for imbalanced fraud detection |
| Slicing | Evaluate per subgroup | Uncover hidden weaknesses | Gender-based accuracy in NLP |
| Skew Detection | Detect data drift | Compare feature distributions | Model performance drops over time |
| Baseline Comparison | Contextualize performance | Compare to simple model or human | New vs. existing recommendation model |
| Prediction Bias | Detect and fix unfairness | Measure equality across groups | Hiring or lending models |
7. Best Practices
Use multiple complementary metrics
→ Example: F1 + ROC-AUC + calibration.
Monitor model post-deployment
→ Evaluation is continuous, not one-time.
Always slice by critical attributes
→ Age, gender, geography, device, etc.
Automate drift and skew detection
→ Use tools like TensorFlow Data Validation (TFDV) and ML Monitoring.
Compare against strong baselines
→ Prevent wasted effort on marginal improvements.
Assess fairness before deployment
→ Especially in regulated or high-impact domains (finance, health, justice).
8. Key Takeaways
Metrics define success — choose them wisely based on the problem.
Slicing and bias analysis ensure fairness and robustness.
Data drift is inevitable — plan for ongoing monitoring.
Baselines ground your evaluation in practical reality.
Evaluation must be holistic: technical + ethical + business dimensions.