
Machine learning design patterns

[!TIP] High-Frequency Patterns for System Design Interviews

| Pattern | Problem | Solution |
| --- | --- | --- |
| Hashed Feature | High-cardinality categories | Use deterministic hashing to map into fixed buckets. |
| Embeddings | Sparse, high-dimensional data | Learn dense, continuous vector representations. |
| Feature Cross | Non-linear interactions | Create a new feature by concatenating/crossing two features. |
| Multimodal Input | Mixed data (text + image) | Separate specialized encoders with a late-fusion head. |
| Reframing | Hard classification | Change to regression (probability) or ranking. |
| Rebalancing | Imbalanced datasets | Weighted loss or resampling (SMOTE / undersampling). |
| Ensemble | Weak signals from a single model | Bagging (variance), Boosting (bias), or Stacking. |
| Cascade | Hierarchical decisions | Chain models sequentially (e.g., object detection -> OCR). |
| Transfer Learning | Low data / high compute | Leverage pretrained models and fine-tune on target data. |


Chapter Summaries


Chapter 1: The Need for Machine Learning Design Patterns

1. Introduction

  • Engineering disciplines use design patterns to capture best practices and recurring solutions.

  • These patterns codify expert knowledge into reusable advice for practitioners.

  • The book is a catalog of ML design patterns—common strategies to handle challenges in data preparation, model building, and MLOps.


2. What Are Design Patterns?

  • Originated in architecture from Christopher Alexander’s book A Pattern Language (1977).

  • Later applied to software engineering in Design Patterns: Elements of Reusable Object-Oriented Software (Gamma et al., 1994).

  • A design pattern describes:

    1. A recurring problem.

    2. The core solution.

    3. Guidelines for adapting it to local needs.

  • In ML, design patterns help standardize solutions for challenges like feature transformation, reproducibility, serving models, etc.

  • Example patterns:

    • Transform Pattern (Ch. 6): Separates inputs, features, and transformations for easier production deployment.

    • Keyed Predictions (Ch. 5): Distributes batch predictions efficiently across large datasets.


3. How to Use This Book

  • The book serves as a catalog rather than a linear tutorial.

  • You can:

    • Skim chapters.

    • Refer back to relevant patterns when facing a problem.

  • Each pattern includes:

    • Problem statement.

    • Canonical solution.

    • Why it works.

    • Trade-offs and alternatives.

  • Code examples are in SQL, scikit-learn, and TensorFlow/Keras.

  • Goal: Provide a common vocabulary for ML practitioners to discuss and implement these concepts.


4. Machine Learning Terminology

Models and Frameworks

  • ML replaces hand-coded rules with models that learn patterns from data.

  • Example: Predicting moving costs using a learned model instead of nested if statements.

  • Common model types:

    • Linear regression

    • Decision trees

    • Neural networks

    • Clustering models

  • Learning types:

    • Supervised Learning: Labeled data (classification or regression).

    • Unsupervised Learning: No labels (clustering, dimensionality reduction).

  • Frameworks:

    • TensorFlow (mainly via Keras API)

    • scikit-learn

    • PyTorch

    • XGBoost

    • BigQuery ML (SQL-based ML).


Data and Feature Engineering

  • Datasets:

    • Training data: Used to learn patterns.

    • Validation data: Used to tune hyperparameters and avoid overfitting.

    • Test data: Used for final evaluation; must be unseen by the model.

  • Structured data: Numeric/categorical (tabular).

  • Unstructured data: Text, image, audio, video.

  • Feature engineering: Converting raw inputs to model-understandable numerical features.

    • Example: Input = timestamp → Feature = day of week.

  • Terminology:

    • Input: Raw column.

    • Feature: Transformed column.

    • Instance: Single record used for prediction.

    • Label: The ground-truth value being predicted.

  • Data Validation: Checks data consistency, balance, and schema correctness to avoid bias or drift.


Machine Learning Process

  1. Training: Model learns from data.

  2. Evaluation: Measure performance on validation/test sets.

  3. Serving: Deploy model for predictions.

    • Online prediction: Real-time, low-latency.

    • Batch prediction: Large-scale, offline predictions.

  4. Inference: Process of making predictions (also called “prediction” in ML).

  5. Pipelines: Automate multi-step processes for data preprocessing, training, evaluation, and deployment.

  6. Streaming: Real-time data processing for continuous model updates.


Data and Model Tooling (Google Cloud Examples)

  • BigQuery: For scalable SQL-based data analysis.

  • BigQuery ML: For training and evaluating ML models directly in SQL.

  • AI Platform Training: Infrastructure for distributed training.

  • AI Platform Prediction: For deploying and serving ML models.

  • Explainable AI: For interpreting model predictions.


Roles in Machine Learning

  • Data Scientist: Focuses on data analysis, feature engineering, and model building.

  • Data Engineer: Builds data ingestion and processing pipelines.

  • ML Engineer: Manages model training, versioning, and deployment.

  • Research Scientist: Develops new ML algorithms (academic/research focus).

  • Data Analyst: Derives business insights from data.

  • Developer: Integrates ML models into end-user applications (APIs, web/mobile).


5. Common Challenges in Machine Learning

A. Data Quality

“Garbage in, garbage out.”

  • Four key aspects:

    1. Accuracy: Are the features and labels correct?

    2. Completeness: Does the dataset represent all relevant cases and diversity?

    3. Consistency: Are collection and labeling standards uniform?

    4. Timeliness: Is the data fresh and synchronized with real-world events?

  • Example: Inconsistent labeling in sentiment analysis → biased model.


B. Reproducibility

  • ML has randomness due to weight initialization and stochastic training.

  • Ensuring reproducibility requires:

    • Setting random seeds (tf.random.set_seed()).

    • Fixing data splits, hyperparameters, and batch sizes.

    • Version controlling dependencies and using containers.

    • Maintaining consistent training environments.
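
A minimal sketch of pinning these sources of randomness, assuming a TensorFlow 2.x / NumPy environment (the seed value and split sizes are arbitrary placeholders):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # any fixed value works; what matters is that it never changes

# Fix every source of randomness we control.
os.environ["PYTHONHASHSEED"] = str(SEED)  # note: only fully effective if set before interpreter start
random.seed(SEED)                         # Python's built-in RNG
np.random.seed(SEED)                      # NumPy (shuffling, sampling)
tf.random.set_seed(SEED)                  # TensorFlow weight init, dropout masks

# Use a fixed split so train/validation membership never changes between runs.
data = np.arange(1000)
rng = np.random.default_rng(SEED)
indices = rng.permutation(len(data))
train_idx, val_idx = indices[:800], indices[800:]
```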


C. Data Drift

  • Data distribution changes over time.

  • Causes:

    • Evolving environment (e.g., language, behavior).

    • Improved data collection methods.

  • Example: A news classification model trained on 20th-century articles may fail on modern terms like “smartphone”.

  • Solutions:

    • Continuous retraining.

    • Data versioning.

    • Monitoring for drift.

    • Bridged Schema Pattern (Ch. 6) handles evolving feature availability.


D. Scale

  • Scaling challenges occur in:

    1. Data processing: Handling millions of rows efficiently.

    2. Training: Large models and datasets require GPUs/TPUs.

    3. Serving: Supporting millions of prediction requests.

  • Design patterns for scale are discussed in later chapters (e.g., Distribution Strategy, Stateless Serving Function).


E. Multiple Objectives

  • Different stakeholders optimize for different goals:

    • Data scientist → minimize loss.

    • Product manager → reduce false negatives.

    • Executives → increase ROI.

  • Trade-offs between precision and recall or accuracy and fairness must be balanced.

  • Communication and shared metrics across teams are crucial.


6. Summary

  • Design patterns are reusable best practices to standardize ML problem-solving.

  • Major ML challenges:

    1. Data quality

    2. Reproducibility

    3. Data drift

    4. Scale

    5. Multiple objectives

  • Patterns apply across the ML life cycle — from data collection and training to deployment and maintenance.





Chapter 2: Data Representation Design Patterns

1. Introduction

At the heart of every machine learning model lies a mathematical function designed to operate on specific data types (e.g., booleans, numbers).

However, real-world data—text, images, categories—often doesn’t directly fit this mathematical form.

Hence, data needs to be represented or transformed into a form that the model can process.

This chapter explains various data representation patterns, i.e., methods for converting raw inputs into mathematical features suitable for ML models.


2. Key Concepts

a. Input vs. Feature

  • Input: The raw, real-world data fed into the model.

    Example: Baby’s weight = 2.7 kg.

  • Feature: The transformed, numeric or boolean value used by the model.

    Example: is_low_weight = (baby_weight < 3kg) → Boolean feature.

The process of converting inputs into features is called feature engineering.


b. Why Data Representation Matters

  • Models like decision trees or linear regressions can only operate on specific data types (e.g., booleans, numerics).

  • Example:

    • Decision tree core math operates on boolean conditions (AND, OR).

    • To use non-boolean inputs (like hospital name), we must first transform them into boolean conditions (e.g., “was the baby delivered at hospital X?”).

  • A good data representation makes learning:

    • Simpler (fewer nodes/features),

    • Faster (less computation),

    • More accurate (better generalization).


c. Learnable vs. Fixed Representations

  1. Fixed (Engineered):

    The transformation is manually designed by humans.

    • Example: (babyweight < 3kg) or one-hot encoding.

  2. Learnable (Automatically Extracted):

    The model learns its own representation during training.

    • Example: Embeddings in neural networks.

  3. Hybrid:

    Some representations use a fixed, deterministic mapping that is not itself learned but is often combined with learned components.

    • Example: Hashed features — the hash mapping is fixed (no learned parameters), yet the resulting buckets generalize and are frequently fed into a learnable embedding.


d. Relationship Between Representations

  • One-to-one: Each input → one feature (simple cases).

  • Many-to-one: Several inputs combined to make a new feature (e.g., feature crosses).

  • Many-to-many: Multi-modal data (text + image + tabular combined).


3. Overview of the Chapter’s Patterns

This chapter introduces four major data representation patterns:

| Pattern Name | Purpose |
| --- | --- |
| 1. Hashed Feature | Represent high-cardinality categorical data efficiently |
| 2. Embeddings | Learn dense, continuous representations for categorical or textual data |
| 3. Feature Cross | Combine multiple features to capture interactions |
| 4. Multimodal Input | Handle and combine multiple data types (text, image, etc.) |

Before discussing these, the book explains simple data representations — basic techniques for handling numeric and categorical data.


4. Simple Data Representations

A. Numerical Inputs

Most ML models (linear regression, random forests, neural nets) operate on numeric features.

Why Scaling is Important

  • Optimisers (like gradient descent) converge faster when all features have comparable magnitudes.

  • Scaling to [-1, 1] or [0, 1] helps:

    • Reduces curvature of the loss surface.

    • Stabilizes gradients.

    • Improves numerical precision.

  • Example experiment using scikit-learn showed ~9% faster training when data was scaled.

Scaling Techniques

  1. Min-Max Scaling

    x' = \frac{2x - \max_x - \min_x}{\max_x - \min_x}

    • Maps values to [-1, 1].

    • Sensitive to outliers (max/min from training data).

  2. Clipping with Min-Max Scaling

    • Clip outliers to a reasonable percentile (e.g., 1st and 99th) before scaling.

  3. Z-Score (Standard) Normalization

    x' = \frac{x - \mu}{\sigma}

    • Centers data around 0 with unit variance.

    • Useful for models assuming Gaussian distribution.

  4. Log Transformation

    • Applies to positively skewed data (e.g., income, population).

    • Makes data more symmetric.

  5. Bucketization

    • Converts continuous numeric variables into discrete ranges (bins).

    • Example: Age → {0–20, 21–40, 41–60, 61+}.

    • Useful for tree-based models.
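
A short sketch of the scaling techniques above using NumPy and scikit-learn (the sample values and bucket boundaries are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical skewed numeric feature (e.g., income) with one outlier.
x = np.array([12.0, 15.0, 14.0, 18.0, 500.0]).reshape(-1, 1)

# 1. Min-max scaling to [-1, 1].
minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

# 2. Clip to the 1st/99th percentile before scaling to tame outliers.
lo, hi = np.percentile(x, [1, 99])
clipped_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(np.clip(x, lo, hi))

# 3. Z-score normalization (zero mean, unit variance).
zscore = StandardScaler().fit_transform(x)

# 4. Log transform for positively skewed data (log1p handles zeros safely).
logged = np.log1p(x)

# 5. Bucketization into discrete ranges.
bins = [0, 20, 40, 60]                   # bucket boundaries (illustrative)
bucketed = np.digitize(x.ravel(), bins)  # bucket index per value
```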


B. Categorical Inputs

Categorical data represents discrete, non-numeric values (e.g., gender, city, color).

Representation Methods

  1. Label Encoding

    • Assigns integer IDs to each category.

    • Not suitable for linear models (implies ordering).

  2. One-Hot Encoding

    • Creates binary columns for each category.

    • Example: Color = {Red, Blue, Green}

      → [1, 0, 0], [0, 1, 0], [0, 0, 1].

    • Works well for small cardinality but inefficient for large sets.

  3. Multi-Hot Encoding

    • For multi-label features (e.g., movie genres: {Action, Comedy} → [1, 1, 0, 0]).

  4. Ordinal Encoding

    • Assigns ordered numeric values when order matters (e.g., size: small=1, medium=2, large=3).

  5. Target / Mean Encoding

    • Replaces a category with the average value of the target variable for that category.

    • Useful for high-cardinality features.

    • Risk: data leakage → must be applied on training folds only.
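
A sketch of these encodings with pandas and scikit-learn (the toy DataFrame and column names are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, OrdinalEncoder

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "size": ["small", "large", "medium", "small"],
    "price": [10.0, 25.0, 18.0, 12.0],   # target used for mean encoding below
})

# 1. Label encoding: arbitrary integer IDs (fine for trees, misleading for linear models).
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# 2. One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# 3. Multi-hot encoding for multi-label fields (e.g., movie genres).
genres = [["Action", "Comedy"], ["Romance"], ["Action"], ["Comedy", "Romance"]]
multi_hot = MultiLabelBinarizer().fit_transform(genres)

# 4. Ordinal encoding: explicit order when it matters.
order = [["small", "medium", "large"]]
df["size_ordinal"] = OrdinalEncoder(categories=order).fit_transform(df[["size"]]).ravel()

# 5. Target (mean) encoding: replace the category with the mean target value.
#    Compute it on training folds only to avoid leakage.
df["color_target_enc"] = df.groupby("color")["price"].transform("mean")
```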


5. Data Representation Design Patterns

Pattern 1 — Hashed Feature

  • Problem: High-cardinality categorical variables (e.g., ZIP code, user ID) → huge one-hot vectors.

  • Solution: Apply a hashing function to map categories into a fixed-size vector space.

    • Example:

      \text{feature\_index} = \text{hash}(category) \% N

    • where N = number of hash buckets.

  • Why it Works:

    • Reduces memory usage and training time.

    • Collisions are tolerable when N is large enough.

  • Trade-offs:

    • Loses interpretability (collisions cause mixed signals).

    • Not reversible (cannot recover original category).
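
A minimal Python sketch of the idea — a stable hash modulo a fixed bucket count. This uses MD5 rather than the book's exact implementation, and NUM_BUCKETS is an illustrative choice:

```python
import hashlib

NUM_BUCKETS = 10  # N: trade off collisions vs. vector size

def hashed_feature(category: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    # A cryptographic hash keeps the mapping deterministic across runs and
    # machines (Python's built-in hash() is salted per process).
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

for zip_code in ["94043", "10011", "60601"]:
    print(zip_code, "->", hashed_feature(zip_code))
```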


Pattern 2 — Embeddings

  • Problem: One-hot vectors are sparse and high-dimensional.

  • Solution: Learn dense, low-dimensional vectors that represent similarity between categories.

    • Example: Word embeddings (Word2Vec, GloVe, BERT).

  • How it Works:

    • Each category is assigned a learnable vector (e.g., 128 dimensions).

    • During training, embeddings adjust to minimize model loss.

    • Similar items end up close together in vector space.

  • Advantages:

    • Compact and efficient representation.

    • Captures semantic relationships (e.g., “king” – “man” + “woman” ≈ “queen”).

  • Trade-offs:

    • Requires large datasets.

    • Less interpretable than one-hot.

    • Risk of bias if training data is biased.
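
A small Keras sketch of a learnable embedding over category IDs (vocabulary size, embedding dimension, and the toy data are placeholders):

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 1000   # number of distinct categories (e.g., word IDs)
EMBED_DIM = 16      # dense embedding dimension (a tunable hyperparameter)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),   # average embeddings over the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy data: sequences of 5 category IDs with binary labels.
x = np.random.randint(0, VOCAB_SIZE, size=(32, 5))
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, epochs=1, verbose=0)

# The learned embedding matrix: one dense vector per category.
embedding_matrix = model.layers[0].get_weights()[0]   # shape (VOCAB_SIZE, EMBED_DIM)
```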


Pattern 3 — Feature Cross

  • Problem: Models may fail to capture interactions between features.

    • Example: A model might not realize that “High Income + Young Age” implies a specific pattern.

  • Solution: Combine multiple features to form a new feature representing their interaction.

    • Example: cross_feature = city + "_" + job_type

  • Why it Works:

    • Introduces nonlinearity and context awareness in models like logistic regression.

  • Trade-offs:

    • Increases dimensionality.

    • May overfit if not regularized or pruned.
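
A minimal pandas sketch of crossing two categorical columns (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "SF", "NYC", "SF"],
    "job_type": ["eng", "eng", "sales", "sales"],
})

# Cross two categorical features into one combined feature...
df["city_x_job"] = df["city"] + "_" + df["job_type"]

# ...then encode it like any other categorical feature (one-hot here;
# hash or embed it instead when the cross has high cardinality).
crossed_one_hot = pd.get_dummies(df["city_x_job"], prefix="cross")
```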


Pattern 4 — Multimodal Input

  • Problem: Real-world systems often use multiple data types (text, images, structured data).

    • Example: Product recommendation = image + text + numerical price.

  • Solution: Build models that accept and learn from multiple modalities.

    • Each data type passes through its specialized subnetwork:

      • CNN → for images

      • RNN/Transformer → for text

      • Dense layers → for tabular features

    • Outputs are concatenated and passed to a shared decision layer.

  • Trade-offs:

    • Complex architecture.

    • Requires more data and compute.

    • Harder to debug and interpret.
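
A sketch of late fusion with the Keras functional API, assuming three illustrative inputs (a 64×64 image, 20 token IDs, and 4 tabular features):

```python
import tensorflow as tf

# Image branch: a small CNN encoder.
image_in = tf.keras.Input(shape=(64, 64, 3), name="image")
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(image_in)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# Text branch: embedding + pooling over token IDs.
text_in = tf.keras.Input(shape=(20,), name="text_ids")
t = tf.keras.layers.Embedding(input_dim=5000, output_dim=32)(text_in)
t = tf.keras.layers.GlobalAveragePooling1D()(t)

# Tabular branch: dense layer over numeric features (e.g., price).
tab_in = tf.keras.Input(shape=(4,), name="tabular")
s = tf.keras.layers.Dense(8, activation="relu")(tab_in)

# Late fusion: concatenate the three encodings and classify.
merged = tf.keras.layers.Concatenate()([x, t, s])
out = tf.keras.layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[image_in, text_in, tab_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```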


6. Conceptual Summary

| Concept | Purpose | Examples |
| --- | --- | --- |
| Scaling | Normalize numeric range | Min-max, z-score |
| Encoding | Convert categories to numbers | One-hot, hashing |
| Learnable Representations | Automatically learn features | Embeddings |
| Feature Interactions | Capture complex relationships | Feature Cross |
| Multi-Input Fusion | Combine heterogeneous data types | Multimodal Input |


7. Key Takeaways

  • ML models don’t understand raw data — they understand numerical representations.

  • The choice of representation heavily affects performance, generalization, and scalability.

  • Automatic representations (like embeddings) are powerful but less interpretable.

  • Feature crosses and multimodal designs improve accuracy for complex datasets.

  • Always validate and test data transformations to ensure consistency between training and production.






Chapter 3: Problem Representation Design Patterns

1. Introduction

After understanding how data is represented (Chapter 2), the next challenge is how to represent the problem itself so that it can be effectively solved using machine learning.

In other words:

Sometimes, it’s not the data or model that’s wrong — it’s the way we’ve framed the problem.

For instance, a problem that seems like classification might be easier to solve as regression, ranking, or clustering.

This chapter introduces problem representation design patterns — techniques to reframe or reformulate ML problems for better performance, interpretability, and deployment efficiency.


2. Why Problem Representation Matters

  • ML performance often depends less on algorithm choice and more on how the problem is structured.

  • Proper representation helps:

    • Simplify the learning task.

    • Align model outputs with business goals.

    • Reduce data or compute requirements.

    • Enable reusability of existing models.


3. Design Patterns Covered

| Pattern # | Pattern Name | Core Idea |
| --- | --- | --- |
| 5 | Reframing | Reformulate the ML problem for better alignment with available data and business goals. |
| 6 | Multilabel | Handle cases where an instance can belong to multiple categories simultaneously. |
| 7 | Ensembles | Combine multiple models to achieve better accuracy and robustness. |
| 8 | Cascade | Chain models sequentially so one’s output is another’s input (hierarchical modeling). |
| 9 | Neutral Class | Introduce a “none of the above” or “uncertain” label to improve real-world reliability. |
| 10 | Rebalancing | Handle imbalanced datasets by resampling or weighting techniques. |


4. Pattern 5 — Reframing

Problem

Sometimes, the initial formulation of a problem is not ideal for ML.

Example: Instead of predicting “will a user buy or not,” predict “how likely is the user to buy” (a regression problem).

Solution

Reframe the task into an equivalent but more learnable or measurable ML problem.

Examples

  • Classification ↔ Regression:

    Predicting click-through rate (CTR) can be treated as regression (probability) instead of binary classification.

  • Ranking instead of classification:

    In recommendation systems, predicting order (rank) is more meaningful than binary relevance.

  • Generative → Discriminative (and vice versa):

    Instead of generating all possibilities, classify the best one.

Why It Works

  • Reframing simplifies the target function.

  • Allows use of existing datasets or pretrained models.

  • Reduces overfitting or data inefficiency.

Trade-Offs

  • May lose interpretability.

  • Business stakeholders must agree on the reformulated output metric.


5. Pattern 6 — Multilabel

Problem

An instance can belong to multiple categories simultaneously.

Example:

  • An article can be both “sports” and “politics.”

  • A movie can be tagged “action”, “comedy”, “romance.”

Standard classifiers assume mutually exclusive labels — not valid here.

Solution

Use a multilabel classification setup:

  • Model outputs a vector of probabilities (one per label).

  • Apply a sigmoid activation on each output neuron (not softmax).

  • Threshold each output independently (e.g., > 0.5 → label present).

Implementation

  • Data: Labels represented as multi-hot vectors.

    Example: [1, 0, 1, 0] → belongs to classes 1 and 3.

  • Evaluation: Use metrics like Hamming loss, precision@k, or F1 score per label.

Why It Works

  • Models learn shared features across labels.

  • Reflects real-world multi-dimensional tagging.

Trade-Offs

  • Harder to interpret or tune thresholds.

  • Labels may have dependency (co-occurrence) that is not modeled directly.
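
A minimal Keras sketch of the multilabel setup (the input size, label count, and toy data are placeholders):

```python
import numpy as np
import tensorflow as tf

NUM_LABELS = 4   # e.g., four possible genres

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    # Sigmoid (not softmax): each label gets an independent probability.
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])
# Binary cross-entropy is applied per label.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy multi-hot labels: [1, 0, 1, 0] means classes 1 and 3 are present.
x = np.random.rand(16, 10)
y = np.random.randint(0, 2, size=(16, NUM_LABELS))
model.fit(x, y, epochs=1, verbose=0)

# Threshold each output independently at serving time.
probs = model.predict(x[:2], verbose=0)
predicted_labels = (probs > 0.5).astype(int)
```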


6. Pattern 7 — Ensembles

Problem

Single models have bias and variance limitations; one algorithm rarely captures all aspects of a complex dataset.

Solution

Combine multiple models to leverage their collective strengths.

Common Ensemble Techniques

  1. Bagging (Bootstrap Aggregation):

    • Train several models on random samples of the data.

    • Example: Random Forest.

    • Reduces variance.

  2. Boosting:

    • Sequentially train models, each correcting errors of the previous one.

    • Example: XGBoost, AdaBoost, LightGBM.

    • Reduces bias.

  3. Stacking:

    • Train multiple base models, then use a meta-model to combine their predictions.

Why It Works

  • Reduces overfitting by averaging out individual model weaknesses.

  • Improves robustness and generalization.

Trade-Offs

  • Increased complexity and inference time.

  • Harder to interpret.

  • Maintenance overhead for multiple models.


7. Pattern 8 — Cascade

Problem

Some predictions are hierarchical or conditional — one model’s decision informs the next.

Example:

  • In object detection:

    First model detects objects → second classifies object type.

  • Fraud detection:

    First model flags suspicious transactions → second verifies fraud probability.

Solution

Chain models sequentially, where:

  • Output of model A becomes input (or trigger) for model B.

  • Each stage filters or enriches data.

Advantages

  • Improves efficiency (later models process fewer cases).

  • Allows specialized models for each sub-task.

  • Mimics human decision-making hierarchy.

Trade-Offs

  • Error propagation: mistakes in early stages affect later ones.

  • Hard to debug end-to-end.

  • Latency increases if cascaded synchronously.

Best Practices

  • Use confidence thresholds to decide when to trigger next model.

  • Log intermediate outputs for traceability.
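
A sketch of a two-stage cascade with confidence gating; flag_model and fraud_model are hypothetical scikit-learn-style classifiers exposing predict_proba, and the thresholds are illustrative:

```python
def cascade_predict(transaction, flag_model, fraud_model, threshold=0.7):
    """Two-stage cascade: a cheap filter model gates an expensive verifier."""
    # Stage 1: cheap model flags potentially suspicious transactions.
    p_suspicious = flag_model.predict_proba([transaction])[0][1]
    if p_suspicious < threshold:
        return {"fraud": False, "stage": 1, "score": p_suspicious}

    # Stage 2: only flagged cases reach the heavier fraud verifier.
    p_fraud = fraud_model.predict_proba([transaction])[0][1]
    # Log intermediate outputs for traceability.
    print(f"stage1={p_suspicious:.2f} stage2={p_fraud:.2f}")
    return {"fraud": p_fraud > 0.5, "stage": 2, "score": p_fraud}
```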


8. Pattern 9 — Neutral Class

Problem

Real-world data often includes uncertain or ambiguous examples.

  • Example: Image classifier forced to choose between “dog” or “cat” even if it’s neither.

  • Model gives overconfident wrong predictions → dangerous in production.

Solution

Introduce a neutral class (e.g., “none,” “other,” or “uncertain”) to capture ambiguous inputs.

Why It Works

  • Prevents forcing classification where confidence is low.

  • Reduces false positives and improves user trust.

  • Especially useful for open-world or safety-critical systems.

Techniques

  • Add “Other” class during training.

  • Use confidence thresholds — if model confidence < threshold → “neutral.”

  • Calibrate probabilities using temperature scaling or Platt scaling.

Trade-Offs

  • May increase false negatives.

  • Requires well-curated examples of neutral class.


9. Pattern 10 — Rebalancing

Problem

In many ML problems, classes are imbalanced — one class has far more examples than others.

Examples:

  • Fraud detection: 0.1% fraud vs. 99.9% non-fraud.

  • Disease diagnosis: rare positive cases.

This leads to:

  • Biased model predictions toward majority class.

  • Misleading accuracy (model always predicts “non-fraud”).

Solution

Adjust data or loss to balance the learning process.

Techniques

  1. Resampling

    • Oversampling minority class: Duplicate or synthesize samples (SMOTE).

    • Undersampling the majority class: Randomly drop examples.

    • Hybrid approaches combine both.

  2. Class Weighting

    • Assign higher loss weight to minority class during training.

    • Common in Keras: class_weight={0:1, 1:10}.

  3. Threshold Adjustment

    • Tune probability cutoff (e.g., predict “fraud” if prob > 0.2).

  4. Synthetic Data Generation

    • Use GANs or variational autoencoders to create realistic minority samples.
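
A sketch of class weighting and threshold adjustment with scikit-learn (the synthetic data and the 0.2 threshold are illustrative; SMOTE itself lives in the separate imbalanced-learn package):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: roughly 1% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = (rng.random(5000) < 0.01).astype(int)

# Option 1: let scikit-learn weight classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: compute explicit weights (e.g., to pass as Keras class_weight).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
keras_class_weight = {0: weights[0], 1: weights[1]}

# Option 3: adjust the decision threshold instead of using the default 0.5.
probs = clf.predict_proba(X)[:, 1]
preds = (probs > 0.2).astype(int)   # more sensitive to the rare class
```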

Evaluation Metrics

  • Precision, Recall, F1 score

  • ROC-AUC, PR-AUC

  • Avoid plain accuracy.

Why It Works

  • Forces the model to treat rare cases seriously.

  • Improves recall for minority classes.

Trade-Offs

  • Risk of overfitting with oversampling.

  • Removing too many majority samples may lose signal.

  • Weighted loss can make optimization unstable.


10. Chapter Summary

| Pattern | Purpose | Core Benefit | Typical Use Case |
| --- | --- | --- | --- |
| Reframing | Reformulate problem | Easier or more effective learning | Convert classification → regression |
| Multilabel | Predict multiple categories per sample | Capture co-occurring labels | Movie genre tagging |
| Ensembles | Combine models | Improve accuracy, reduce variance | Kaggle competitions, tabular data |
| Cascade | Chain models hierarchically | Efficient multi-step predictions | Object detection pipelines |
| Neutral Class | Add “uncertain” class | Increase reliability | Open-set recognition |
| Rebalancing | Handle imbalanced data | Fairer model training | Fraud or anomaly detection |


11. Key Takeaways

  • Problem framing is as important as model choice.

  • Many ML problems benefit from creative reformulation.

  • Multiple models (ensembles, cascades) can outperform single-model approaches.

  • Introduce neutrality and balance to handle real-world uncertainty and bias.

  • Always evaluate models with metrics aligned to business objectives, not just accuracy.






Chapter 4: Model Training Design Patterns

1. Introduction

After understanding data representation (Ch. 2) and problem representation (Ch. 3), this chapter focuses on the model training phase — the process of teaching the model to recognize patterns and relationships in data.

Model training is where the model’s parameters (weights) are adjusted based on the loss function and optimizer. However, ML engineers face several practical challenges during this process — such as data leakage, distribution mismatch, instability, and lack of reproducibility.

This chapter introduces design patterns to make training:

  • More efficient

  • More stable

  • More reliable and generalizable


2. Overview of Model Training Design Patterns

| Pattern # | Pattern Name | Core Purpose |
| --- | --- | --- |
| 11 | Transform | Maintain consistent data preprocessing between training and serving |
| 12 | Multistage Training | Train models in multiple phases or levels for improved performance |
| 13 | Transfer Learning | Reuse pretrained models for new, related tasks |
| 14 | Distillation | Use a larger “teacher” model to train a smaller “student” model |
| 15 | Regularization | Prevent overfitting by adding constraints or penalties |


Pattern 11 — Transform

Problem

The training-serving skew problem:

Transformations applied during training are not replicated identically during inference (serving).

Example:

  • During training, you normalize age by (x - mean) / std using training stats.

  • During serving, an engineer recomputes or implements the transformation differently → prediction mismatch.

Solution

Create a shared transformation pipeline used identically in both training and serving.

Approaches

  1. Reusable Code

    • Package preprocessing as reusable modules or functions.

    • Avoid duplicating logic in separate codebases (e.g., Python vs. Java).

  2. Serialization

    • Serialize transformation logic (e.g., StandardScaler from scikit-learn → pickle file).

    • Store mean, std, vocabularies, bucket boundaries, etc.

  3. Pipeline Frameworks

    • Use systems like TensorFlow Transform (TFT) or Apache Beam.

    • Compute statistics on training data; apply same transforms consistently in production.

  4. Example (see the sketch below)
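
A minimal stand-in using a scikit-learn Pipeline, in the spirit of the serialization approach above: the scaler is fit on training data, saved together with the model, and reloaded at serving time so the exact same transformation is applied (the file name and toy data are placeholders):

```python
import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training time: fit the transform and the model together so the scaler's
# mean/std are learned from training data only.
X_train = np.random.rand(100, 3)
y_train = np.random.rand(100)

pipeline = Pipeline([
    ("scale", StandardScaler()),   # the transformation
    ("model", Ridge()),            # the model
])
pipeline.fit(X_train, y_train)

# Serialize the whole pipeline; serving loads the exact same transform + model.
joblib.dump(pipeline, "model_with_transform.joblib")

# Serving time: no re-implementation of preprocessing, so no skew.
serving_pipeline = joblib.load("model_with_transform.joblib")
prediction = serving_pipeline.predict(np.random.rand(1, 3))
```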

Why It Works

  • Eliminates skew between training and serving.

  • Improves reproducibility.

  • Reduces production bugs.

Trade-Offs

  • Slight overhead in maintaining shared pipelines.

  • Must ensure transformations are deterministic.


Pattern 12 — Multistage Training

Problem

Some problems are too complex to learn directly.

Models might:

  • Take too long to converge,

  • Overfit quickly,

  • Or fail to generalize when trained in one go.

Solution

Break training into multiple stages — each focusing on a specific sub-goal or learning task.

Each stage’s output (weights, embeddings, or models) is used as input to the next.

Examples

  1. Pretraining + Fine-tuning

    • Pretrain on a large dataset (e.g., ImageNet).

    • Fine-tune on a smaller, domain-specific dataset.

  2. Coarse-to-Fine

    • Stage 1: Learn to classify broad categories (animals, vehicles).

    • Stage 2: Refine within each category (dog breeds, car models).

  3. Curriculum Learning

    • Start with easy examples → gradually introduce harder ones.

  4. Multitask Training

    • Train on multiple related tasks sequentially or jointly.

Why It Works

  • Reduces learning complexity.

  • Speeds convergence.

  • Improves generalization and performance on small datasets.

Trade-Offs

  • More engineering complexity.

  • Requires well-planned transitions between stages.

  • Risk of catastrophic forgetting (later stages overwrite earlier learning).


Pattern 13 — Transfer Learning

Problem

Training models from scratch requires massive labeled data and computational power.

Solution

Leverage pretrained models (trained on large generic datasets) and fine-tune them for your specific task.

How It Works

  1. Choose a pretrained model:

    e.g., ResNet, BERT, Inception.

  2. Freeze early layers:

    Keep initial weights fixed — they capture general features.

  3. Replace or retrain final layers:

    Adapt model to the new dataset.

Example (Image Classification)

  • Pretrained on ImageNet (1.2M images, 1000 classes).

  • Fine-tune on medical X-ray dataset (10,000 images, 5 classes).
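
A Keras sketch of this workflow using a ResNet50 base pretrained on ImageNet (the class count and the commented-out dataset are placeholders for the target task):

```python
import tensorflow as tf

NUM_CLASSES = 5   # e.g., five X-ray findings (illustrative)

# 1. Load a model pretrained on ImageNet, without its classification head.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)

# 2. Freeze the pretrained layers: they already capture generic features.
base.trainable = False

# 3. Add and train a new head for the target task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(xray_train_ds, epochs=5)   # fine-tune on the small target dataset
```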

Why It Works

  • Reuses learned representations (edges, textures, words, etc.).

  • Reduces required training time and data.

  • Often achieves better performance.

Applications

  • NLP: BERT, GPT, T5

  • Vision: ResNet, EfficientNet

  • Audio: Wav2Vec

Trade-Offs

  • Must ensure source and target domains are related.

  • May transfer unwanted biases.

  • Fine-tuning too much can cause overfitting.


Pattern 14 — Distillation

Problem

Large, high-performing models (e.g., ensembles, deep networks) are too big or slow for deployment.

Solution

Train a smaller model (“student”) to mimic a larger one (“teacher”) — capturing its “knowledge.”

How It Works

  • Teacher model outputs soft probabilities (not just hard labels).

  • Student is trained to match teacher’s probability distribution.

  • Often done with a temperature parameter (T) to soften the logits.

p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
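
A NumPy sketch of the temperature-scaled softmax and a distillation loss built from it (the logits and temperature are illustrative; in practice this term is usually combined with the ordinary hard-label loss):

```python
import numpy as np

def soften(logits, temperature):
    """Temperature-scaled softmax: higher T spreads probability mass."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = soften(teacher_logits, temperature)
    p_student = soften(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()

teacher = np.array([[8.0, 2.0, 0.5]])   # confident teacher logits (illustrative)
student = np.array([[3.0, 1.5, 0.2]])
print(distillation_loss(student, teacher))
```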

Example

  1. Train a large BERT model on 1B documents.

  2. Use BERT’s predictions to train a small BiLSTM model (student).

  3. Student is faster to serve with minimal accuracy loss.

Why It Works

  • Soft labels contain richer information (class similarity).

  • Student approximates teacher’s decision boundaries.

Trade-Offs

  • Requires training teacher first (extra cost).

  • Student may underperform if too small.

  • Best for inference efficiency, not raw accuracy.


Pattern 15 — Regularization

Problem

Overfitting — model learns noise instead of underlying signal.

Symptoms:

  • High training accuracy but poor validation accuracy.

  • Unstable predictions on new data.

Solution

Add constraints or penalties to discourage overly complex models.

Types of Regularization

| Type | Technique | Description |
| --- | --- | --- |
| Weight Regularization | L1 / L2 penalties | Adds a penalty term to the loss function to limit large weights. |
| Dropout | Randomly drop neurons during training | Prevents co-dependency among neurons. |
| Early Stopping | Stop training when validation loss stops improving | Avoids overfitting. |
| Data Augmentation | Randomly modify inputs (e.g., rotate, crop, noise) | Increases dataset diversity. |
| Batch Normalization | Normalize layer inputs | Stabilizes learning, acts as mild regularizer. |
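
A Keras sketch combining several of these techniques in one model (layer sizes, the L2 coefficient, and the dropout rate are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    # L2 weight penalty on the dense layer.
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
    ),
    tf.keras.layers.BatchNormalization(),   # stabilizes training, mild regularizer
    tf.keras.layers.Dropout(0.3),           # randomly drop 30% of units
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt when validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])
```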

Why It Works

  • Encourages simpler models.

  • Improves generalisation.

  • Makes training more stable.

Trade-Offs

  • Too much regularization → underfitting.

  • Some methods (dropout) slow training convergence.


6. Summary Table

| Pattern | Goal | Core Idea | Example Use Case |
| --- | --- | --- | --- |
| Transform | Consistency | Share preprocessing logic | Ensure identical scaling in training & serving |
| Multistage Training | Decomposition | Train in phases | Coarse-to-fine classification |
| Transfer Learning | Reuse knowledge | Fine-tune pretrained model | Use BERT for text sentiment |
| Distillation | Efficiency | Compress large models | Deploy smaller models on mobile |
| Regularization | Generalization | Prevent overfitting | Dropout, L2 regularization |


7. Practical Insights and Best Practices

  1. Keep transformations consistent

    → Avoid “train/serve skew” by using shared pipelines or serialized transformers.

  2. Monitor data drift between stages

    → Each stage in multistage training may see slightly different distributions.

  3. Fine-tune with caution

    → Freeze lower layers when using transfer learning; overfitting is common.

  4. Regularisation is your friend

    → Use dropout, L2, or early stopping as standard components in your training pipeline.

  5. Use model distillation for production

    → Train a small, fast student for deployment while retaining the performance of large teacher models.

  6. Always log and version control

    → Save hyperparameters, random seeds, and transformation states for reproducibility.


8. Key Takeaways

  • The training phase defines how effectively the model learns generalizable patterns.

  • The five design patterns in this chapter address reliability, efficiency, and robustness of training.

  • Consistency (Transform) and generalization (Regularization) are foundational to stable ML systems.

  • Transfer learning and distillation are critical for real-world deployment where compute and data are limited.






Chapter 5: Model Evaluation Design Patterns

1. Introduction

After designing data representations (Ch. 2), framing the problem (Ch. 3), and training models effectively (Ch. 4), the next critical step is evaluating the model — i.e., determining how well it actually works.

Model evaluation is not just about computing accuracy.

It’s about verifying:

  • Whether the model generalizes to new data,

  • Whether it meets business objectives, and

  • Whether it’s reliable, unbiased, and fair.

This chapter introduces evaluation design patterns — reusable strategies for robust, fair, and meaningful model assessment.


2. Overview of Evaluation Design Patterns

| Pattern # | Pattern Name | Purpose |
| --- | --- | --- |
| 16 | Evaluation Metrics | Select the right metrics that align with business goals |
| 17 | Slicing | Evaluate model performance across meaningful subgroups |
| 18 | Skew Detection | Detect mismatches between training and serving data distributions |
| 19 | Baseline Comparison | Always evaluate your model relative to a known reference (baseline) |
| 20 | Prediction Bias | Identify and mitigate systematic bias across groups |


Pattern 16 — Evaluation Metrics

Problem

Choosing the wrong metric can make a model look “good” while it fails in production or misaligns with business goals.

Example:

  • Fraud detection model with 99.9% accuracy, but it never predicts “fraud” because fraud cases are rare.

Solution

Choose metrics that:

  • Reflect business objectives,

  • Match the data distribution, and

  • Capture model trade-offs (precision vs. recall, etc.)


Types of Metrics

A. Regression

  • Mean Squared Error (MSE) – sensitive to large errors.

  • Mean Absolute Error (MAE) – robust to outliers.

  • R² (Coefficient of Determination) – proportion of variance explained.

B. Classification

  • Accuracy: (TP + TN) / Total

  • Precision: TP / (TP + FP) → How often positive predictions are correct.

  • Recall (Sensitivity): TP / (TP + FN) → How many actual positives are caught.

  • F1 Score: Harmonic mean of precision and recall.

  • ROC-AUC: Measures model’s ability to rank positives over negatives.

  • PR-AUC: Better for highly imbalanced data.

C. Ranking / Recommendation

  • Precision@k, Recall@k

  • Mean Average Precision (MAP)

  • Normalized Discounted Cumulative Gain (NDCG)

D. Probabilistic Outputs

  • Log Loss (Cross-Entropy) – penalizes overconfident incorrect predictions.

  • Brier Score – measures calibration of predicted probabilities.
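
A scikit-learn sketch computing several of these metrics on toy labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, log_loss, brier_score_loss,
)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.2, 0.6, 0.9, 0.3, 0.05])
y_pred = (y_prob > 0.5).astype(int)   # default 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))
print("pr_auc   :", average_precision_score(y_true, y_prob))   # PR-AUC estimate
print("log_loss :", log_loss(y_true, y_prob))
print("brier    :", brier_score_loss(y_true, y_prob))
```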


Why It Works

  • Metrics provide quantitative feedback on model behavior.

  • Different tasks demand different metrics.

  • Composite metrics (like F1) balance conflicting goals.

Best Practices

  • Choose one primary metric tied to business success.

  • Monitor secondary metrics to detect trade-offs.

  • Use confidence intervals (via bootstrapping) for reliable estimates.


Pattern 17 — Slicing

Problem

A model might perform well overall but poorly for specific subgroups — leading to unfair or biased outcomes.

Example:

  • A credit scoring model performs well overall but discriminates against a certain age group or region.

Solution

Break down evaluation by data slices — subsets of data defined by key attributes (e.g., gender, region, device type).


How It Works

  • Partition test data into slices based on key attributes (e.g., gender, region, device type).

  • Compute metrics (accuracy, precision, recall, etc.) per slice.

  • Compare metrics across slices to detect weak areas.
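
A small pandas sketch of per-slice evaluation (the slice attribute and toy results are made up):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical evaluation results with a slicing attribute.
results = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "APAC", "APAC"],
    "y_true": [1, 0, 1, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 0],
})

# Compute the metric per slice and compare against the overall number.
per_slice = results.groupby("region").apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
overall = accuracy_score(results["y_true"], results["y_pred"])
print(per_slice)
print("overall:", overall)
```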


Why It Works

  • Exposes hidden weaknesses masked by overall averages.

  • Encourages fairness and interpretability.

  • Helps prioritize retraining or data collection for weak segments.


Tools

  • TFMA (TensorFlow Model Analysis) supports slicing natively.

  • Google’s What-If Tool visualizes performance per subgroup.


Trade-Offs

  • Requires additional computation.

  • Slices must be meaningful and representative — too many leads to noise.


Pattern 18 — Skew Detection

Problem

Performance drop in production often happens because serving data differs from training data — known as data skew or drift.

Types of skew:

  1. Training-Serving Skew:

    Differences due to mismatched preprocessing or data pipelines.

  2. Data Drift:

    Real-world data changes over time.

  3. Concept Drift:

    The relationship between features and labels changes.


Solution

Continuously compare distributions of features and predictions between training and production data.


Techniques

  1. Statistical Distance Metrics

    • Kullback-Leibler Divergence (KL)

    • Jensen-Shannon Divergence

    • Kolmogorov–Smirnov (KS) test

  2. Visualization

    • Histograms, quantile plots.

  3. Feature Monitoring

    • Compare mean, std, and missing value rates.

  4. Prediction Monitoring

    • Compare predicted probabilities vs. ground truth when available.
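
A sketch of drift detection on a single feature with a two-sample KS test from SciPy (the distributions and the p-value threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)    # training distribution
serving_feature = rng.normal(loc=0.3, scale=1.0, size=2_000)   # slightly shifted

# Two-sample Kolmogorov–Smirnov test: a small p-value suggests the distributions differ.
stat, p_value = ks_2samp(train_feature, serving_feature)
DRIFT_THRESHOLD = 0.01   # illustrative; pick per feature and business tolerance
if p_value < DRIFT_THRESHOLD:
    print(f"Drift suspected (KS={stat:.3f}, p={p_value:.2e}) — consider retraining.")
```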


Why It Works

  • Detects silent failures before they cause large-scale issues.

  • Enables model retraining triggers when drift crosses thresholds.


Trade-Offs

  • Requires real-time data logging and monitoring.

  • Hard to define “acceptable drift” threshold.


Pattern 19 — Baseline Comparison

Problem

A model’s performance number (e.g., 0.82 F1 score) is meaningless without context.

We must always ask: Better than what?

Solution

Always compare against baselines:

  • Simple models,

  • Human performance,

  • Or previous production models.


Types of Baselines

  1. Heuristic Baseline:

    • Simple rule-based system (e.g., always predict majority class).

  2. Statistical Baseline:

    • Mean/median predictor for regression.

  3. Production Baseline:

    • The currently deployed model’s performance.

  4. Human Baseline:

    • Expert or crowd-sourced accuracy for the same task.
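
A scikit-learn sketch comparing a candidate model against a majority-class heuristic baseline (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heuristic baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# Candidate model.
model = LogisticRegression().fit(X_tr, y_tr)

print("baseline F1:", f1_score(y_te, baseline.predict(X_te)))
print("model F1   :", f1_score(y_te, model.predict(X_te)))
```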


Why It Works

  • Gives meaning to improvements.

  • Prevents over-engineering (you don’t need deep learning if logistic regression suffices).

  • Facilitates A/B testing for deployment.


Best Practices

  • Always store baseline performance in your ML metadata.

  • Re-evaluate baseline periodically as data evolves.


Trade-Offs

  • Baselines can become stale if not updated.

  • Comparing across different metrics can be misleading.


Pattern 20 — Prediction Bias

Problem

Models may systematically favor or disfavor certain groups, even if unintentionally.

Example:

  • A hiring model prefers candidates from certain universities.

  • A vision model underperforms for darker skin tones.

This is prediction bias — when errors are unevenly distributed across subpopulations.


Solution

Detect and mitigate bias during evaluation using fairness metrics and balanced data slices.


Fairness Metrics

| Metric | Definition | Goal |
| --- | --- | --- |
| Demographic Parity | P(pred = positive \| group A) = P(pred = positive \| group B) | Equal positive-prediction rates across groups |
| Equal Opportunity | P(pred = positive \| actual = positive, group A) = same for group B | Equal true-positive rates across groups |
| Equalized Odds | Equal FP and FN rates across groups | Balanced fairness |
| Disparate Impact | Ratio of favorable outcome rates across groups | > 0.8 (80% rule) considered fair |


Bias Detection Process

  1. Slice evaluation by sensitive attributes.

  2. Compute metrics across slices.

  3. Identify disparities > acceptable threshold.

  4. Apply mitigation strategies.
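
A pandas sketch of steps 1–3 for a single sensitive attribute, computing per-group positive prediction rates and the disparate impact ratio (the toy data is made up):

```python
import pandas as pd

# Hypothetical predictions with a sensitive attribute.
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B", "B", "B"],
    "y_pred": [1,   0,   1,   0,   0,   1,   0,   0  ],
})

# Positive prediction rate per group (demographic parity check).
rates = df.groupby("group")["y_pred"].mean()
print(rates)

# Disparate impact: ratio of the lowest to the highest favorable-outcome rate.
disparate_impact = rates.min() / rates.max()
print("disparate impact:", round(disparate_impact, 2),
      "(>= 0.8 is commonly treated as fair under the 80% rule)")
```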


Bias Mitigation Techniques

  • Data-level: Balance representation (resampling, reweighting).

  • Model-level: Add fairness constraints or adversarial debiasing.

  • Post-processing: Adjust thresholds per subgroup.


Why It Works

  • Promotes ethical, transparent, and socially acceptable AI systems.

  • Reduces reputational and legal risks.


Trade-Offs

  • May reduce raw accuracy.

  • Requires access to demographic data (which can be sensitive).

  • Fairness definitions can conflict (no single metric fits all).


6. Chapter Summary Table

| Pattern | Goal | Key Idea | Example Use Case |
| --- | --- | --- | --- |
| Evaluation Metrics | Choose correct metric | Align metrics with objectives | Use AUC for imbalanced fraud detection |
| Slicing | Evaluate per subgroup | Uncover hidden weaknesses | Gender-based accuracy in NLP |
| Skew Detection | Detect data drift | Compare feature distributions | Model performance drops over time |
| Baseline Comparison | Contextualize performance | Compare to simple model or human | New vs. existing recommendation model |
| Prediction Bias | Detect and fix unfairness | Measure equality across groups | Hiring or lending models |


7. Best Practices

  1. Use multiple complementary metrics

    → Example: F1 + ROC-AUC + calibration.

  2. Monitor model post-deployment

    → Evaluation is continuous, not one-time.

  3. Always slice by critical attributes

    → Age, gender, geography, device, etc.

  4. Automate drift and skew detection

    → Use tools like TensorFlow Data Validation (TFDV) and ML Monitoring.

  5. Compare against strong baselines

    → Prevent wasted effort on marginal improvements.

  6. Assess fairness before deployment

    → Especially in regulated or high-impact domains (finance, health, justice).


8. Key Takeaways

  • Metrics define success — choose them wisely based on the problem.

  • Slicing and bias analysis ensure fairness and robustness.

  • Data drift is inevitable — plan for ongoing monitoring.

  • Baselines ground your evaluation in practical reality.

  • Evaluation must be holistic: technical + ethical + business dimensions.

