
Machine learning engineering

[!IMPORTANT] Executive Summary for ML Engineers

Production ML is 10% modeling and 90% engineering. This guide summarizes Andriy Burkov's Machine Learning Engineering, focusing on building reliable, scalable, and maintainable systems.

Core Engineering Principles:

  1. Data over Algorithms: A simple model with clean, high-quality data beats a complex model with noisy data.

  2. Baseline First: Never build a complex neural net without a simple heuristic or linear baseline to measure incremental value.

  3. Avoid Silent Failures: ML models don't "crash"; they degrade. Use distribution monitoring (PSI, K-S test) to detect data drift.

  4. Train-Serve Skew: Ensure identical feature engineering pipelines for training and inference using Feature Stores or shared serialization.

  5. Iterative Deployment: Use Shadow Mode (predict but don't serve) or Canary deployments to validate models on live traffic.

  6. Human-in-the-Loop: Bridge the gap between low-confidence predictions and high-stakes decisions with manual review tiers.


Table of Contents (Chapter Summaries)


Chapter 1 – Introduction

1. Overview

This chapter defines foundational machine learning (ML) concepts and sets the stage for what machine learning engineering entails. It focuses on ensuring consistent understanding of terms and introduces the ML project life cycle.


1.1 Notation and Definitions

Data Structures

  • Scalar: Single numeric value (e.g., 5, −2.3), denoted x or a.

  • Vector: Ordered list of scalars (attributes), denoted x, w.

    • Example: a = [2, 3]; b = [−2, 5].

    • Attribute j → x(j).

  • Matrix: 2D array (rows × columns), denoted A, W.

    • Columns or rows can be viewed as vectors.

  • Set: Unordered collection of unique elements, denoted calligraphically (𝒮).

    • Intersection: 𝒮₁ ∩ 𝒮₂

    • Union: 𝒮₁ ∪ 𝒮₂

    • |𝒮| = number of elements in the set.

Capital Sigma (Σ) Notation

Summation is expressed as:

\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n

  • Euclidean norm (‖x‖): Measures vector length

    ‖x‖ = \sqrt{\sum_j (x(j))^2}

  • Euclidean distance: Distance between two vectors

    ‖a - b‖ = \sqrt{\sum_j (a(j) - b(j))^2}
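
As a quick check of this notation, here is a minimal NumPy sketch using the example vectors a and b from above (NumPy itself is an assumption; the book does not prescribe a library):

```python
import numpy as np

a = np.array([2, 3])
b = np.array([-2, 5])

# Euclidean norm: square root of the sum of squared attributes
norm_a = np.sqrt(np.sum(a ** 2))      # same result as np.linalg.norm(a), ~3.606

# Euclidean distance: norm of the difference vector
dist_ab = np.linalg.norm(a - b)       # sqrt((2-(-2))^2 + (3-5)^2) = sqrt(20) ~ 4.472

print(norm_a, dist_ab)
```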


1.2 What is Machine Learning

Definition

Machine learning is a subfield of computer science concerned with building algorithms that rely on collections of examples of some phenomenon to solve practical problems.

Steps:

  1. Collect dataset.

  2. Train a statistical model algorithmically.

Types of Learning

  1. Supervised Learning

    • Dataset: labeled pairs (x, y)

    • Predicts class (classification) or value (regression).

    • Example: spam detection, disease prediction.

    • Goal: Build model f(x) → ŷ close to true y.

  2. Unsupervised Learning

    • Dataset: unlabeled examples {x₁, x₂, …}

    • Finds structure in data.

    • Tasks:

      • Clustering: groups similar objects.

      • Dimensionality reduction: reduces feature space.

      • Outlier detection: finds anomalies.

  3. Semi-Supervised Learning

    • Mix of labeled and unlabeled data.

    • Exploits large unlabeled datasets for better learning.

  4. Reinforcement Learning

    • Agent interacts with environment.

    • Learns policy mapping states → optimal actions.

    • Goal: maximize long-term reward.


1.3 Data and Machine Learning Terminology

Data Used Directly vs. Indirectly

  • Direct data: forms feature vectors (e.g., word sequences in NER).

  • Indirect data: enriches features (e.g., dictionaries, gazetteers).

Raw vs. Tidy Data

  • Raw data: unstructured (e.g., text, images).

  • Tidy data: each row = example, each column = attribute (like a spreadsheet).

    • Needed for ML algorithms.

    • Categorical data may need numerical encoding (feature engineering).

Training, Validation, and Test Sets

  • Training set: builds the model.

  • Validation set: tunes hyperparameters and algorithm choice.

  • Test set: final unbiased performance check.

  • Must prevent data leakage between these sets.
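
A minimal sketch of such a three-way split with scikit-learn; the 60/20/20 ratio, synthetic dataset, and variable names are illustrative choices, not prescribed by the book:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

# First carve out the test set, then split the remainder into train/validation.
# A fixed random_state keeps the split reproducible across experiments.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)  # 0.25 * 0.8 = 20%
```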

Baseline

A simple heuristic or trivial model used as a reference for comparison.

Machine Learning Pipeline

Sequential data flow from raw data → trained model.

Includes:

  • data partitioning

  • imputation

  • feature extraction

  • balancing

  • dimensionality reduction

  • model training

Parameters vs. Hyperparameters

  • Parameters: learned from data (e.g., weights w, bias b).

  • Hyperparameters: set before training (e.g., tree depth, learning rate).

Classification vs. Regression

  • Classification: predicts discrete labels.

    • Binary (spam/not spam) or multiclass.

  • Regression: predicts continuous numerical values.

Model-Based vs. Instance-Based Learning

  • Model-Based: trains parameters (e.g., SVM, Logistic Regression).

  • Instance-Based: uses training data directly (e.g., kNN).

Shallow vs. Deep Learning

  • Shallow: directly learns from features (e.g., SVM).

  • Deep: multi-layered neural networks that learn hierarchical features.

Training vs. Scoring

  • Training: building the model.

  • Scoring: applying the model to new examples for predictions.


1.4 When to Use Machine Learning

Use ML when:

  1. Problem too complex for explicit coding.

    • e.g., spam detection.

  2. Problem changes over time.

    • e.g., dynamic web data, evolving user behavior.

  3. Perceptive problems.

    • e.g., image, audio, or speech recognition.

  4. Unstudied phenomena.

    • e.g., genetic predictions, log anomaly detection.

  5. Simple objective.

    • e.g., yes/no, single-number outputs.

  6. Cost-effective.

    • Data, computation, and maintenance cost justify the outcome.


1.5 When Not to Use Machine Learning

Avoid ML when:

  • Full explainability is required.

  • Error cost is extremely high.

  • Traditional code can solve it more cheaply.

  • Data is scarce or expensive.

  • A lookup table or heuristic suffices.

  • The system logic is static and rarely changes.


1.6 What is Machine Learning Engineering (MLE)

Definition:

Application of ML and software engineering principles to build scalable, reliable, maintainable ML systems.

Key Responsibilities:

  • Data collection and preprocessing.

  • Feature programming.

  • Efficient model training and deployment.

  • Model monitoring and maintenance.

  • Preventing and handling silent failures (degradation due to data drift).

MLE bridges the gap between data analysis and software engineering.


1.7 Machine Learning Project Life Cycle

Stages:

  1. Goal definition

  2. Data collection & preparation

  3. Feature engineering

  4. Model training

  5. Model evaluation

  6. Model deployment

  7. Model serving

  8. Model monitoring

  9. Model maintenance

Loops exist to collect more data or re-engineer features if needed.


1.8 Summary of Key Points

  • ML learns from examples; supervised predicts, unsupervised organizes.

  • Tidy, numerical data is essential.

  • Split data into training, validation, and test sets.

  • Establish a baseline and design a full pipeline.

  • Understand when ML adds value and when it doesn’t.

  • MLE ensures production-readiness and stability of ML systems.

  • ML projects follow a life cycle from data collection to model maintenance.


Here are detailed notes for Chapter 2 – “Before the Project Starts” from Machine Learning Engineering by Andriy Burkov:


Chapter 2: Before the Project Starts

This chapter focuses on the planning and pre-execution stage of a machine learning (ML) project — before any data collection, model building, or deployment begins.

It discusses how to prioritize, define, and structure ML projects and teams to maximize their impact and minimize waste of time, cost, and effort.


2.1 Prioritization of Machine Learning Projects

Before starting, every ML project must be prioritized since:

  • Engineering and computational resources are limited.

  • The backlog of ideas and potential ML applications is often long.

  • ML projects have high uncertainty and hidden costs.

Key Factors in Prioritization

  1. Impact

  2. Cost


2.1.1 Impact of Machine Learning

High-impact ML projects are those where:

  1. Machine Learning replaces complex manual logic

    • If an existing rule-based system is overly complicated or difficult to maintain, ML can replace it effectively.

    • Example: replacing hundreds of nested business rules with a model that learns the patterns from data.

  2. Machine Learning adds significant benefit despite imperfection

    • Even imperfect predictions can lead to big savings.

    • Example:

      A dispatching system classifies incoming requests as “easy” or “difficult.”

      If 90% of “easy” requests are correctly automated, human time is saved significantly, even if 10% are misclassified.

Impact is high when:

  • ML simplifies a complex process.

  • ML improves scalability.

  • ML generates substantial business or user value.


2.1.2 Cost of Machine Learning

The cost of an ML project depends on three key factors:

  1. Problem Difficulty

    • Are there existing algorithms/libraries available?

    • How much compute power or training time is required?

  2. Data Availability and Cost

    • Can data be generated automatically?

    • What is the cost of labeling or annotation?

    • How much data is required for sufficient accuracy?

  3. Accuracy Requirements

    • Cost grows superlinearly with accuracy.

      Achieving 99% accuracy costs far more than 90%.

    • You must balance acceptable accuracy with cost.

    • Cost of errors must be compared with the cost of perfection.

Rule of thumb:

Going from 90% → 99% accuracy often multiplies the cost several times.


2.2 Estimating the Complexity of a Machine Learning Project

Unlike traditional software projects, ML projects have many unknowns. Estimating their complexity is hard.

2.2.1 The Unknowns

Before starting, you usually don’t know:

  • If the required model accuracy is achievable in practice.

  • How much data is needed.

  • What features are relevant or sufficient.

  • What model size or architecture is required.

  • How much time training and experimentation will take.

If required accuracy > 99%, expect significant complications — often due to lack of labeled data or data imbalance.

Benchmark:

Human-level performance can serve as a realistic upper limit for the model.


2.2.2 Simplifying the Problem

To make complexity estimation manageable:

  1. Start with a simpler version of the problem.

    • Example: If classifying into 1,000 topics, first solve it for 10 topics + “Other.”

    • Once solved, extrapolate cost, time, and feasibility for the full version.

  2. Use data slices.

    • Break down the data by location, product type, or age group.

    • Train models on subsets first to learn constraints.

  3. Use pilot projects.

    • Small-scale pilots reveal whether full-scale implementation is worth it.

Note: As the number of classes grows, the required data grows superlinearly.


2.2.3 Nonlinear Progress

ML project progress is nonlinear:

  • Rapid improvement early → plateau later.

  • Sometimes no progress until new data/features are added.

  • Progress may stagnate while waiting for labeling or new data pipelines.

Recommendation:

  • Log time and resources for every activity.

  • Keep stakeholders informed that progress will not be linear.

  • Use the 80/20 rule:

    80% of progress comes from the first 20% of effort.


2.3 Defining the Goal of a Machine Learning Project

An ML project’s goal is to build a model that contributes to solving a business problem.


2.3.1 What a Model Can Do

A model’s role in a system can take many forms. It can:

| Function | Example Use Case |
| --- | --- |
| Automate | Automatically approve or reject transactions |
| Alert / Prompt | Notify admin of suspicious activity |
| Organize | Rank or group content for users |
| Annotate | Highlight key text segments |
| Extract | Identify names, locations in text |
| Recommend | Suggest items, users, or products |
| Classify | Assign emails as "spam" or "not spam" |
| Quantify | Predict house prices |
| Synthesize | Generate text, images, or speech |
| Answer Questions | "Does this text describe that image?" |
| Transform | Summarize text or translate language |
| Detect Novelty/Anomaly | Identify abnormal behavior or data patterns |

If your problem doesn’t fit one of these categories, it might not be suitable for ML.


2.3.2 Properties of a Successful Model

A model is successful if it satisfies four criteria:

  1. Meets specifications:

    Input/output structure, accuracy, latency, etc.

  2. Brings business value:

    • Reduces cost, increases sales, improves user retention.

  3. Helps users:

    • Measured via engagement, satisfaction, or productivity.

  4. Is scientifically rigorous:

    • Predictable: behaves consistently for data from the same distribution.

    • Reproducible: can be rebuilt from same data and hyperparameters.


Defining the Right Goal

Poorly defined goals lead to wasted effort.

Example (Cat vs. Dog problem):

  • Goal: Allow the client’s own cat inside, block the dog.

  • Bad definition: Train model to distinguish cats vs. dogs → lets any cat in!

  • Correct definition: Identify the client’s cat specifically.

Lesson:

Define what the model must actually achieve, not what seems similar.


Multiple Stakeholders

Different teams have different objectives:

  • Product owner: maximize user engagement.

  • Executive: increase revenue.

  • Finance: reduce costs.

The ML engineer must balance these goals and translate them into:

  • Choice of inputs/outputs

  • Cost function

  • Performance metric


2.4 Structuring a Machine Learning Team

Two main organizational cultures exist for ML teams:


2.4.1 Two Cultures

  1. Specialized Roles (Collaborative Model)

    • Data analysts (scientists) handle modeling.

    • Software engineers handle deployment and scaling.

    • Pros:

      • Each expert focuses on their strength.

    • Cons:

      • Difficult handoffs; integration challenges.

  2. Hybrid Roles (Full-Stack ML Engineers)

    • Every member combines ML and software skills.

    • Pros:

      • Easier integration, faster iteration.

    • Cons:

      • Harder to hire; requires versatile talent.


2.4.2 Members of a Machine Learning Team

Typical Roles and Responsibilities:

| Role | Responsibilities |
| --- | --- |
| Data Analysts / Scientists | Analyze data, build models, test hypotheses. |
| Machine Learning Engineers | Build production-ready ML systems, handle scalability, optimization, and monitoring. |
| Data Engineers | Build data pipelines (ETL: Extract, Transform, Load), integrate data sources, design data architecture. |
| Labeling Experts / Annotators | Manually label data, validate labeling quality, build labeling tools, manage outsourced teams. |
| Domain Experts | Provide business and contextual knowledge for feature design and target definition. |
| DevOps Engineers | Automate deployment, CI/CD pipelines, scaling, and monitoring of ML models. |


Team Collaboration Tips

  • Work with domain experts early to define inputs, outputs, and success metrics.

  • Communicate trade-offs (accuracy vs. cost, latency vs. complexity).

  • Encourage shared understanding between business, data, and engineering teams.


2.5 Key Takeaways

  1. Prioritize ML projects by impact vs. cost; start with simpler pilot versions.

  2. Estimate complexity carefully—progress is nonlinear, data requirements unpredictable.

  3. Define clear, measurable goals that tie directly to business outcomes.

  4. Form the right team structure:

    • Specialists for large orgs.

    • Generalist ML engineers for smaller ones.

  5. Engage domain experts—they bridge business context and ML logic.

  6. Set realistic accuracy and cost expectations early.


2.6 Chapter Summary

  • Before starting, evaluate feasibility and business value.

  • High impact ML projects simplify complex or costly processes.

  • Cost grows superlinearly with desired accuracy.

  • Simplify problems into smaller experiments.

  • Define goals precisely and align them with stakeholders.

  • Build a multidisciplinary ML team with clear communication between roles.

  • Understand that project progress will be nonlinear and uncertain, requiring flexibility and iteration.


Here are detailed notes for Chapter 3 – “Framing the Problem and Project” from Machine Learning Engineering by Andriy Burkov:


Chapter 3: Framing the Problem and Project

This chapter explains how to convert a business or research problem into a concrete machine learning (ML) task.

It discusses framing, defining objectives, metrics, and the nature of predictions, while avoiding common pitfalls in the problem-definition stage.


3.1 Importance of Problem Framing

Before collecting data or building models, you must:

  • Define what exactly is being predicted.

  • Understand how predictions are used.

  • Specify measurable objectives.

The quality of an ML system is directly dependent on how well the problem is framed at the beginning.

A poorly framed problem leads to:

  • Wasted time and resources.

  • Misaligned goals.

  • Ineffective models.


3.2 From Business Problem to Machine Learning Problem

Step 1: Understand the Business Objective

  • Identify the desired business outcome (e.g., “increase retention”, “reduce churn”).

  • Define success in measurable terms.

    • Example: “Reduce customer churn by 15% over 6 months.”

Step 2: Define the Decision

  • What decision will be made based on the model output?

    • Example: “Which customers should receive retention offers?”

  • The decision determines what kind of prediction you need.

Step 3: Translate to Prediction

  • Turn the decision into a prediction problem.

    • Business: “Which users are likely to churn?”

    • ML: “Predict the probability a user will churn within 30 days.”


3.3 Target Variable (Label) Definition

The target variable (y) is what the model learns to predict.

Common Pitfalls

  1. Target Leakage

    • Occurs when the label includes data that would not be available at prediction time.

    • Example: predicting tomorrow’s stock price using tomorrow’s trading volume.

  2. Ill-Defined Targets

    • When labels are inconsistent or ambiguous.

    • Example: “Bad customer” — unclear definition.

  3. Changing Targets

    • When business objectives shift mid-project, forcing relabeling.

How to Define a Good Target

  • Based on data available at prediction time.

  • Consistent across samples.

  • Aligned with the business decision.


3.4 Features (Input Variables)

Features are input variables (x₁, x₂, …, xₙ) used for prediction.

  • Good features contain relevant, predictive information available before the event occurs.

  • Bad features include post-event data (causing leakage).

Example:

  • Predicting flight delays.

    • Use: weather forecasts, aircraft type, route.

    • Don’t use: actual arrival time.


3.5 Static vs. Dynamic Predictions

| Type | Definition | Example |
| --- | --- | --- |
| Static Prediction | Model makes one-time predictions. | Loan approval, image classification. |
| Dynamic Prediction | Model makes predictions continuously or repeatedly as new data arrives. | Fraud detection, recommendation systems. |

Static

  • One-off, fixed features, no feedback loop.

Dynamic

  • Needs real-time data ingestion, continuous training, and feedback monitoring.


3.6 The Prediction Horizon

Prediction horizon = the time gap between when input data is collected and when the event (label) happens.

Example:

  • Predict churn 30 days in advance → horizon = 30 days.

Choosing horizon affects:

  • Data availability: shorter horizons → more data, less uncertainty.

  • Business value: longer horizons → more time to act, but less accuracy.


3.7 Setting a Baseline

A baseline is the simplest possible model or heuristic that performs the task.

Examples:

  • Predicting the most frequent class (majority classifier).

  • Using mean or median for regression.

  • Rule-based thresholds.

Baselines are critical because:

  • They reveal whether your ML model truly adds value.

  • They act as a sanity check for initial experiments.

If your ML model doesn’t outperform the baseline, revisit feature design or problem framing.
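
A minimal baseline sketch using scikit-learn's DummyClassifier as the majority-class reference; the synthetic dataset and the logistic-regression challenger are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# The ML model only adds value if it clearly beats the trivial baseline.
print("baseline F1:", f1_score(y_test, baseline.predict(X_test)))
print("model F1:   ", f1_score(y_test, model.predict(X_test)))
```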


3.8 Metrics: How to Measure Success

Evaluation metrics quantify how well the model achieves the business goal.

Types of Metrics

  1. Classification Metrics

    • Accuracy, Precision, Recall, F1 Score, ROC-AUC, Log loss.

    • Choose based on imbalance and business costs.

  2. Regression Metrics

    • Mean Absolute Error (MAE), Mean Squared Error (MSE), R² Score.

  3. Ranking / Recommendation Metrics

    • Precision@k, Recall@k, Mean Average Precision (MAP), NDCG.

  4. Business KPIs

    • Actual real-world outcomes, e.g., revenue lift, reduced churn rate.

Cost-Based Metrics

Sometimes decisions depend on error costs:

  • False Positive (FP) cost ≠ False Negative (FN) cost.

  • Example:

    • FP: sending a retention offer unnecessarily.

    • FN: losing a valuable customer.

The metric must align with the business objective.
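
A small sketch of scoring predictions by expected business cost instead of accuracy; the per-error costs below are made-up numbers for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP = 5      # assumed price of an unnecessary retention offer
COST_FN = 200    # assumed value lost when a churning customer is missed

def expected_cost(y_true, y_pred):
    # Total cost = (false positives * FP cost) + (false negatives * FN cost)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * COST_FP + fn * COST_FN

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])
print(expected_cost(y_true, y_pred))  # 1 FP * 5 + 1 FN * 200 = 205
```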


3.9 Offline vs. Online Evaluation

| Type | Description | Example |
| --- | --- | --- |
| Offline Evaluation | Uses historical labeled data; faster and cheaper. | Cross-validation, holdout test sets. |
| Online Evaluation | Tests model in production; measures real-world effect. | A/B tests, live metrics. |

Offline

  • Good for experimentation.

  • Must ensure data distribution matches production.

Online

  • Exposes model to live users.

  • Measures impact on KPIs like conversion or engagement.

Best practice: use both — offline to validate technically, online to validate business value.


3.10 Framing Pitfalls

  1. Optimizing for the wrong metric

    • E.g., focusing on accuracy when recall matters more.

  2. Ignoring user experience

    • Even accurate models may hurt usability if their decisions seem random to users.

  3. Failing to consider the feedback loop

    • Example: a model that changes what data it later sees (self-influence).

  4. Poorly defined prediction timing

    • Predicting an event after it has already happened.

  5. Ignoring business constraints

    • A perfect model may be useless if it’s too slow, expensive, or non-interpretable.


3.11 Example: Credit Default Prediction

Business Goal

Reduce financial losses from unpaid loans.

Machine Learning Problem

Predict the probability a new loan applicant will default.

Key Components

  • Target (y): 1 = defaulted, 0 = repaid.

  • Features: income, credit history, number of loans, etc.

  • Prediction horizon: 6 months from loan approval.

  • Metric: ROC-AUC or cost-weighted accuracy.

  • Baseline: Rule-based scoring system.

Decisions

  • High-risk → reject loan.

  • Medium-risk → manual review.

  • Low-risk → approve.


3.12 Iterative Framing

Problem framing is not one-time; it’s iterative:

  1. Start with a draft problem definition.

  2. Run pilot experiments.

  3. Analyze results and feedback.

  4. Refine features, targets, and metrics.

This cycle continues until:

  • Predictions are actionable.

  • Model aligns with business value.


3.13 Human-in-the-Loop Design

In many real systems:

  • Models assist, not replace, humans.

  • Final decisions are hybrid (model + expert).

Benefits:

  • Reduces risk of model mistakes.

  • Increases interpretability and trust.

  • Enables incremental automation.


3.14 Chapter Summary

  • Proper problem framing bridges the gap between business goals and ML models.

  • Clearly define:

    • The decision to be made.

    • The target variable (label).

    • The prediction horizon.

    • The evaluation metric.

  • Always start with a baseline.

  • Avoid leakage, ill-defined labels, and misaligned metrics.

  • Evaluate both offline (for accuracy) and online (for business impact).

  • Keep framing iterative — refine based on outcomes.

  • Design systems with a human-in-the-loop where necessary.


Here are detailed notes for Chapter 4 – “Data Definition and Collection” from Machine Learning Engineering by Andriy Burkov:


Chapter 4: Data Definition and Collection

This chapter focuses on the data stage of the machine learning project lifecycle — defining, identifying, collecting, and labeling data.

It emphasizes that data quality is more critical than model complexity and explores practical methods for gathering reliable, representative, and unbiased data.


4.1 The Central Role of Data

  • Machine Learning ≈ Data + Algorithms

  • The algorithm defines how the model learns,

    but data defines what the model learns.

Garbage in → Garbage out.

Even the best algorithm cannot overcome bad data.

Key Principle:

Better data almost always beats a better model.


4.2 What Is Data in ML Projects

In ML, data is a collection of examples representing the phenomenon the model must learn.

Each example typically includes:

  • Input features (x₁, x₂, …, xₙ)

  • Target label (y)

The dataset should reflect the distribution of real-world data the model will see in production.


4.3 Defining the Data Requirements

Before data collection begins, you must define what data you need and why.

4.3.1 Identify the Unit of Observation

  • What is a single data point?

    • Example: one transaction, one user session, one product.

  • Must match how predictions are made later.

4.3.2 Define the Input Features

Ask:

  • What information is available before the prediction moment?

  • Which features are causal or correlated with the outcome?

Avoid including:

  • Information not available at prediction time (leads to data leakage).

  • Features that will change meaning over time (nonstationarity).

4.3.3 Define the Label (Target)

  • Clearly specify the event being predicted.

    • Example: “User churn = no login in the next 30 days.”

  • Ensure the label is objectively measurable and consistent across data points.

4.3.4 Define Sampling Strategy

  • Decide how examples are selected:

    • Random sampling

    • Stratified sampling (ensure representation of all classes)

    • Time-based sampling (for temporal data)


4.4 Data Quality Dimensions

Good data satisfies multiple dimensions of quality:

| Dimension | Description | Example |
| --- | --- | --- |
| Completeness | Missing values are minimal. | 95% of users have recorded ages. |
| Consistency | Same format across systems. | Date format unified as YYYY-MM-DD. |
| Validity | Values fall within acceptable range. | "Age" between 0–120. |
| Uniqueness | No duplicates. | One record per customer. |
| Timeliness | Data reflects current conditions. | Real-time transaction logs. |
| Representativeness | Matches the production environment. | Distribution of regions/cities is similar. |

Always measure and monitor these quality metrics before model training.


4.5 Data Collection Sources

Data can come from multiple sources, depending on the problem.

4.5.1 Internal Data

  • Company’s existing systems (databases, APIs, logs, CRMs, etc.)

  • Usually the easiest and cheapest to obtain.

  • Must check for bias and completeness.

4.5.2 External Data

  • Purchased or open-source datasets.

  • Public APIs, web scraping, government or academic data repositories.

  • Must ensure licensing and privacy compliance (e.g., GDPR).

4.5.3 Synthetic Data

  • Generated artificially to simulate rare or missing cases.

  • Useful for:

    • Balancing imbalanced datasets.

    • Privacy preservation.

    • Testing edge cases.

Synthetic data must resemble the real data distribution; otherwise, it can mislead the model.


4.6 Data Labeling

4.6.1 The Role of Labels

  • Labels are ground truth — the foundation of supervised learning.

  • Incorrect labels → incorrect learning.

4.6.2 Labeling Methods

  1. Manual Labeling

    • Humans label data via tools or platforms.

    • Example: Amazon Mechanical Turk, Labelbox.

    • Must have clear guidelines and quality checks.

  2. Programmatic Labeling

    • Automatically assign labels using rules, heuristics, or weak models.

    • Often used for initial labeling to save cost.

  3. Semi-Supervised Labeling

    • Combine small labeled data + large unlabeled data.

    • Use self-training or pseudo-labeling.

4.6.3 Labeling Quality Assurance

  • Consensus labeling: multiple annotators per item → use majority vote.

  • Inter-annotator agreement (IAA):

    • Measures label consistency across annotators.

    • Example: Cohen’s kappa (κ).

  • Spot-checking: randomly audit a subset of labels.
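
A minimal inter-annotator agreement check with scikit-learn's cohen_kappa_score; the two toy label lists are illustrative:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned to the same 8 items by two independent annotators.
annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
annotator_b = ["spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam"]

# Kappa near 1.0 means strong agreement; near 0 means chance-level agreement.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```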


4.7 Dealing with Class Imbalance

When one class appears much less frequently (e.g., 1% fraud, 99% normal).

Solutions:

  1. Data-Level Approaches:

    • Oversampling: duplicate minority class.

    • SMOTE (Synthetic Minority Oversampling Technique): generate synthetic samples.

    • Undersampling: reduce majority class size.

  2. Algorithm-Level Approaches:

    • Assign class weights inversely proportional to frequency.

    • Use metrics robust to imbalance (AUC, F1, Precision-Recall).

The chosen technique depends on the problem’s cost of false negatives vs. false positives.


4.8 Avoiding Data Leakage

Data leakage occurs when:

  • Information from the future or test set influences training.

  • The model indirectly learns from data it won’t have at prediction time.

Examples:

  • Using “time of resolution” when predicting issue occurrence.

  • Normalizing using global mean (instead of training mean).

Prevention:

  • Ensure strict train/validation/test split.

  • Apply transformations (scaling, encoding) after splitting.

  • Use pipelines to isolate data preprocessing from leakage.
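
A sketch of a leakage-safe setup: because the scaler lives inside a scikit-learn Pipeline, it is re-fit on each training fold only. Dataset and hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=0)

# Scaling happens inside the pipeline, so cross_val_score fits the scaler on
# each training fold only; validation folds never influence its statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1_000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```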


4.9 Temporal Data Considerations

For time-series or temporal data:

  • Always maintain chronological order in train/test splits.

  • Avoid using future data for past predictions.

Sliding Window Techniques:

  • Use data up to time t to predict outcomes at t + k.

  • Continuously retrain models as new data arrives.


4.10 Data Representativeness

Your dataset must represent the real-world population where the model will operate.

Common Issues

  • Selection Bias: training data differs from live data.

    • Example: only urban customers in training data → poor rural predictions.

  • Historical Bias: past data reflects outdated or unfair decisions.

    • Example: old hiring data biased against certain demographics.

Solutions

  • Collect diverse, up-to-date data.

  • Monitor data drift — changes in distribution between training and production.


4.11 Privacy, Ethics, and Compliance

Privacy Concerns

  • Personally Identifiable Information (PII) must be:

    • Protected (encryption, anonymization).

    • Used under consent (GDPR, HIPAA compliance).

Ethical Considerations

  • Bias in data leads to biased models.

  • Example: hiring algorithm favoring one gender due to biased training data.

Best Practices

  • Conduct data audits for bias and fairness.

  • Use differential privacy or federated learning when applicable.

  • Document data origin, purpose, and usage.


4.12 Data Storage and Access

  • Store data in a version-controlled, secure, and auditable manner.

  • Use metadata:

    • Source

    • Date collected

    • Preprocessing steps

    • Schema version

  • Enable reproducibility by tracking data versions along with model versions (DataOps).


4.13 Data Collection Pipeline

Typical stages in the data pipeline:

  1. Data Ingestion

    • Collect from databases, APIs, logs, or sensors.

  2. Data Validation

    • Check schema, missing values, outliers.

  3. Data Cleaning

    • Handle missing data, incorrect entries, duplicates.

  4. Data Transformation

    • Encode, normalize, extract features.

  5. Storage

    • Centralized data lake or warehouse.

  6. Access

    • Enable secure queries and sampling for model training.

Use automated validation and monitoring to detect anomalies early.


4.14 The Iterative Nature of Data Work

Data collection and cleaning are never one-time tasks:

  • New data changes the distribution.

  • Features evolve.

  • Systems and sensors introduce new errors.

Cycle:

→ Collect → Validate → Train → Deploy → Monitor → Recollect → Retrain


4.15 Key Takeaways

  1. Define data requirements clearly before collecting anything.

  2. Ensure data quality — completeness, consistency, validity, representativeness.

  3. Avoid leakage and respect the prediction moment (only use data available then).

  4. Handle imbalance through sampling or weighting.

  5. Labeling accuracy determines model accuracy — invest in quality assurance.

  6. Respect privacy and ethics — avoid collecting unnecessary personal data.

  7. Monitor data drift and continuously refresh datasets.

  8. Document everything — sources, transformations, schema, and collection logic.


4.16 Chapter Summary

  • The foundation of ML systems lies in high-quality, representative, and ethical data.

  • Define features, labels, and sampling carefully to align with the business use case.

  • Establish automated data pipelines with validation and versioning.

  • Balance data quality, cost, and compliance for sustainable ML development.

  • Remember: Data work is iterative, not linear.


Here are detailed notes for Chapter 5 – “Data Preparation” from Machine Learning Engineering by Andriy Burkov:


Chapter 5: Data Preparation

This chapter covers how to clean, transform, and structure raw data into a usable format for model training.

It emphasizes that data preparation is one of the most time-consuming and crucial phases of any machine learning (ML) project — often taking up 60–80% of the total effort.


5.1 The Purpose of Data Preparation

Machine learning algorithms require data in a numerical, structured, and consistent format.

However, raw data is often messy — with missing values, duplicates, outliers, or mixed formats.

“The quality of your model depends more on your data preparation than on the algorithm you choose.”

Goals of Data Preparation:

  1. Improve data quality (cleaning, fixing errors, removing noise).

  2. Ensure compatibility with ML algorithms.

  3. Enhance signal strength by transforming features.

  4. Avoid data leakage and maintain reproducibility.


5.2 Steps of Data Preparation

The process generally includes the following stages:

  1. Data Cleaning

  2. Data Transformation

  3. Feature Engineering

  4. Feature Selection

  5. Data Splitting

  6. Data Balancing

Each step transforms raw data toward a model-ready dataset.


5.3 Data Cleaning

Data cleaning fixes errors, inconsistencies, and missing values.

5.3.1 Detecting and Removing Duplicates

  • Remove identical or near-identical rows.

  • Example: duplicate user transactions or logs.

5.3.2 Handling Missing Values

  • Causes: human error, incomplete logs, faulty sensors, etc.

  • Strategies:

    1. Remove records: when missingness is random and rare.

    2. Impute values: fill using:

      • Mean/median (for numeric)

      • Mode (for categorical)

      • KNN or regression-based imputation

    3. Special category: assign “Unknown” for categorical features.

Imputation introduces bias if not done carefully — especially if data is not missing at random.
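
A minimal imputation sketch with pandas and scikit-learn's SimpleImputer; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [40_000, 52_000, np.nan, 61_000],
    "city":   ["Paris", None, "Lyon", "Paris"],
})

# Numeric columns: fill with the median; categorical: a special "Unknown" level.
# In a real project, fit the imputers on the training split only.
num_cols, cat_cols = ["age", "income"], ["city"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="constant", fill_value="Unknown").fit_transform(df[cat_cols])
print(df)
```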

5.3.3 Detecting Outliers

Outliers can distort model learning and skew statistics.

  • Detection Methods:

    • Z-score or IQR methods.

    • Isolation Forest or DBSCAN for high-dimensional data.

  • Handling:

    • Cap at a threshold (winsorization).

    • Remove or transform (e.g., log-transform skewed distributions).

5.3.4 Correcting Data Types

  • Convert text numbers (“10,000”) to numeric.

  • Standardize date/time formats.

  • Ensure categorical data uses consistent naming.

5.3.5 Resolving Inconsistencies

  • Example: “USA”, “United States”, and “U.S.” must map to one category.


5.4 Data Transformation

Raw data often needs to be reshaped or scaled before model ingestion.

5.4.1 Scaling and Normalization

ML models like linear regression, SVM, and neural networks are sensitive to feature scales.

Techniques:

  • Normalization (Min-Max Scaling):

    x' = \frac{x - x_{min}}{x_{max} - x_{min}}

    Scales features to [0, 1].

  • Standardization (Z-score Scaling):

    x' = \frac{x - \mu}{\sigma}

    Transforms to mean = 0, std = 1.

When to use:

  • Normalization → bounded models (e.g., neural nets with sigmoid/tanh).

  • Standardization → unbounded models (e.g., linear regression).

Always fit the scaler on the training set only to avoid leakage.
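
A minimal sketch of fitting a scaler on the training data only and reusing the learned statistics at inference time; the toy arrays are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [5.0], [10.0]])
X_new   = np.array([[7.0]])

scaler = StandardScaler().fit(X_train)      # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_new_scaled   = scaler.transform(X_new)    # reuse the same statistics at inference time

# MinMaxScaler works the same way and maps the training range onto [0, 1].
minmax = MinMaxScaler().fit(X_train)
print(minmax.transform(X_new))              # (7 - 1) / (10 - 1) ~ 0.667
```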


5.4.2 Encoding Categorical Variables

ML algorithms require numeric input, so categorical data must be encoded.

Types of encoding:

  1. One-Hot Encoding:

    • Create binary column for each category.

    • Suitable for low-cardinality features.

  2. Ordinal Encoding:

    • Assign integer ranks to ordered categories.

  3. Target / Mean Encoding:

    • Replace each category with the average of the target variable.

    • Useful for high-cardinality features.

Use with caution — target encoding can lead to overfitting; apply regularization or cross-fold encoding.


5.4.3 Text Data Transformation

For textual data:

  • Tokenization: split text into words or n-grams.

  • Stopword removal: eliminate common words (like “the”, “a”).

  • Stemming/Lemmatization: reduce words to root forms.

  • Vectorization:

    • Bag of Words (BoW)

    • TF-IDF (Term Frequency–Inverse Document Frequency)

    • Word Embeddings (Word2Vec, GloVe)

    • Contextual Embeddings (BERT)


5.4.4 Handling Date and Time Features

Dates and timestamps can be converted into:

  • Year, month, day, weekday.

  • Hour of day (cyclic encoding for periodic data).

  • Time differences (e.g., days since signup).

Use sin/cos encoding for cyclical features:

\sin(2\pi t/T), \quad \cos(2\pi t/T)
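
A short sketch of this encoding applied to an hour-of-day column; the column name hour and period of 24 are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 23]})
T = 24  # period of the cycle (hours in a day)

# Two coordinates on the unit circle, so hour 23 ends up close to hour 0.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / T)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / T)
print(df)
```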


5.4.5 Aggregation and Binning

  • Binning: convert continuous variables into discrete intervals.

    • Example: age → child, adult, senior.

  • Aggregation: summarize data (e.g., average purchases per week).

These help reduce noise and variance.


5.5 Feature Engineering

Feature engineering = creating new informative features that help the model learn patterns more easily.

5.5.1 Derived Features

  • Ratios: “price per unit”, “debt-to-income ratio”.

  • Interactions: “feature1 × feature2”.

  • Temporal trends: “number of logins in last 7 days”.

5.5.2 Domain Knowledge

  • Use domain expertise to craft meaningful signals.

  • Example (banking): combine “income” and “loan amount” → “loan-to-income ratio”.

5.5.3 Feature Reduction

  • Remove redundant or highly correlated features.

  • Use feature importance from models (Random Forests, XGBoost).

  • Dimensionality reduction (PCA, t-SNE, autoencoders) when needed.


5.6 Data Splitting

To evaluate model performance fairly, data is split into:

| Split | Purpose | Typical Ratio |
| --- | --- | --- |
| Training set | Model learning | 60–80% |
| Validation set | Hyperparameter tuning | 10–20% |
| Test set | Final evaluation | 10–20% |

5.6.1 Cross-Validation

  • K-Fold CV: split into k folds, rotate validation set.

  • Reduces variance and ensures robustness.

5.6.2 Temporal Splits

  • For time-series data, preserve chronological order (no random shuffling).

5.6.3 Stratified Sampling

  • Maintain same label distribution across splits (for classification).


5.7 Data Balancing

If dataset classes are imbalanced (e.g., fraud = 1%, non-fraud = 99%), balance them via:

  1. Resampling Techniques

    • Oversampling minority class (SMOTE, ADASYN).

    • Undersampling majority class.

  2. Algorithmic Solutions

    • Adjust class weights (e.g., class_weight='balanced').

  3. Threshold Tuning

    • Shift decision boundary to favor recall or precision.
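
A minimal sketch of the class-weight approach on an imbalanced synthetic dataset (SMOTE/ADASYN would require the separate imbalanced-learn package); numbers are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# ~1% positive class, mimicking a fraud-style imbalance.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' penalizes mistakes on the rare class more heavily.
clf = LogisticRegression(max_iter=1_000, class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```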


5.8 Data Leakage Prevention

Data leakage is a critical issue that invalidates model evaluation.

Avoid by:

  • Splitting data before preprocessing.

  • Fitting scalers and encoders only on the training set.

  • Avoiding target-derived features (e.g., average target per user).

  • Keeping future data separate in time-based tasks.


5.9 Automating Data Preparation

Data pipelines can automate cleaning and transformation.

Tools:

  • Python libraries: Pandas, scikit-learn Pipelines, Featuretools.

  • DataOps / MLOps tools: Apache Airflow, Kubeflow, MLflow, Prefect.

  • Feature Stores: centralized repositories for preprocessed features (e.g., Feast, Tecton).

Benefits:

  • Reproducibility

  • Consistency

  • Reduced leakage

  • Scalability


5.10 Data Versioning and Reproducibility

Reproducibility requires tracking:

  • Data versions

  • Preprocessing scripts

  • Random seeds

  • Train/validation splits

Tools:

DVC (Data Version Control), Git LFS, MLflow Tracking.


5.11 Feature Scaling in Production

The same transformation logic used during training must be applied consistently during inference.

Hence, the preprocessing pipeline must be serialized (saved) with the model.

Example: If you used a StandardScaler, the same mean and std must be applied to production data.


5.12 Key Takeaways

  1. Data preparation is the foundation of model quality.

  2. Clean, validate, and standardize data before modeling.

  3. Handle missing data, outliers, and categorical features carefully.

  4. Perform scaling, encoding, and feature engineering thoughtfully.

  5. Split data properly — avoid leakage and maintain representativeness.

  6. Automate preprocessing for consistency and reproducibility.

  7. Always validate transformations using the training set only.

  8. Document and version your datasets and pipelines.


5.13 Chapter Summary

  • Data preparation ensures that raw data is structured, complete, and consistent.

  • Major steps include cleaning, transformation, encoding, feature engineering, and splitting.

  • Leakage prevention and reproducibility are key quality pillars.

  • Automation tools and pipelines make the process scalable.

  • High-quality data preparation is what separates academic models from production-grade systems.


Here are detailed notes for Chapter 6 – “Feature Engineering and Selection” from Machine Learning Engineering by Andriy Burkov:


Chapter 6: Feature Engineering and Selection

This chapter explores how to design, construct, and select features that most effectively represent the underlying structure of your data.

Feature engineering is one of the most critical stages in the ML lifecycle — it often determines the ultimate success of the model.

“The quality of your features largely determines the quality of your model.”


6.1 The Importance of Features

A feature is a measurable property or characteristic of a phenomenon being observed.

In ML, features are the inputs (x₁, x₂, …, xₙ) the model uses to learn patterns and make predictions.

Why Feature Engineering Matters

  • Models learn patterns, not raw data.

  • Poor features → poor model performance, even with complex algorithms.

  • Good features can make even simple models perform competitively.

Algorithms amplify the information in data — they don’t create it.

Hence, feature engineering adds signal before training begins.


6.2 The Feature Engineering Process

Feature engineering is iterative and involves:

  1. Understanding the data and domain.

  2. Creating new features from existing ones.

  3. Transforming features to better reflect relationships.

  4. Evaluating their usefulness via experiments or metrics.

  5. Selecting the best subset.


6.3 Types of Features

| Type | Description | Examples |
| --- | --- | --- |
| Numerical | Quantitative, continuous or discrete. | Age, price, temperature. |
| Categorical | Qualitative, nominal or ordinal. | Country, education level. |
| Textual | Sequences of words, phrases. | Customer reviews, tweets. |
| Temporal | Time-based or sequential. | Timestamps, intervals. |
| Spatial | Geographic or positional. | GPS coordinates, pixel grids. |
| Derived / Engineered | Computed from existing data. | Ratios, interactions, aggregations. |


6.4 Domain Understanding and Feature Design

Feature engineering begins with understanding the domain:

  • Know what influences the target variable.

  • Understand data generation and collection context.

Example (Loan Default Prediction):

  • Domain knowledge reveals that “loan-to-income ratio” or “credit utilization” are better predictors than raw “income” or “loan amount.”

Key Questions:

  1. What features correlate with the target?

  2. What transformations might reveal hidden relationships?

  3. Are there nonlinear effects that need to be captured?


6.5 Feature Construction Techniques

6.5.1 Mathematical Transformations

Used to adjust feature distributions or linearize relationships.

| Transformation | Purpose |
| --- | --- |
| Log(x + 1) | Reduce skewness, handle large ranges |
| √x | Stabilize variance |
| x², x³ | Capture nonlinear patterns |
| 1/x | Highlight inverse relationships |
| Standardization / Normalization | Adjust scales |


6.5.2 Combining Features

Combine multiple raw features to create new, meaningful attributes.

Examples:

  • Ratio: price / income

  • Difference: max_temp - min_temp

  • Product: height × weight (for BMI)

  • Aggregation: avg_spent_per_week

These help models learn relationships directly rather than relying on complex nonlinearities.


6.5.3 Interaction Features

Capture combined effects of two or more features.

  • Example: Interaction between “education level” and “years of experience” for salary prediction.

  • In linear models, interactions can help simulate nonlinear patterns.


6.5.4 Temporal Features

Derived from timestamps:

  • Extract: hour, day, month, weekday.

  • Time since event: “days since last login.”

  • Rolling aggregates: “average purchase per week.”

  • Cyclical encoding for periodicity:

    x_{sin} = \sin\left(2\pi \frac{t}{T}\right), \quad x_{cos} = \cos\left(2\pi \frac{t}{T}\right)


6.5.5 Text Features

Transform unstructured text into numerical representations.

| Method | Description |
| --- | --- |
| Bag-of-Words (BoW) | Count word frequency per document. |
| TF-IDF | Weigh rare but important words higher. |
| Word Embeddings | Map words to dense vectors (Word2Vec, GloVe). |
| Contextual Embeddings | Dynamic representations (BERT, GPT, etc.). |


6.5.6 Encoding Categorical Variables

Different encodings for different use cases:

| Method | Use Case |
| --- | --- |
| One-Hot Encoding | Small number of categories. |
| Ordinal Encoding | Ordered categories. |
| Target Encoding | Large cardinality, with regularization. |
| Binary / Hash Encoding | Very high-cardinality features. |


6.5.7 Aggregation Features

Summarize grouped data (by user, session, region, etc.).

Examples:

  • average_purchase_per_user

  • max_rating_by_category

  • count_of_transactions_last_30_days

Aggregation captures behavioral or temporal context that individual events miss.


6.6 Handling Feature Correlations and Redundancy

Highly correlated features can:

  • Inflate model complexity.

  • Lead to unstable coefficients (in linear models).

  • Reduce interpretability.

Approaches:

  • Compute correlation matrix and remove duplicates (|r| > 0.9).

  • Use Variance Inflation Factor (VIF) for multicollinearity.

  • Apply Principal Component Analysis (PCA) for dimensionality reduction.


6.7 Feature Scaling

Many ML algorithms assume that all features are on comparable scales.

| Scaling Technique | Formula | Use Case |
| --- | --- | --- |
| Min-Max Scaling | (x – min)/(max – min) | Neural networks |
| Standardization | (x – μ)/σ | Linear models, SVM |
| Robust Scaling | (x – median)/IQR | Outlier-resistant models |

Always fit scalers on training data only — never include test data.


6.8 Feature Selection

After creating features, not all are useful.

Feature selection helps remove irrelevant, redundant, or noisy variables.

Goals:

  • Reduce overfitting.

  • Improve interpretability.

  • Decrease computation time.


6.8.1 Filter Methods (Statistical Tests)

Independent of any ML algorithm.

| Technique | Use Case |
| --- | --- |
| Correlation / Mutual Information | Continuous features |
| Chi-Square Test | Categorical features |
| ANOVA (F-test) | Group mean comparison |

Example:

  • Select top k features with highest information gain or correlation with target.


6.8.2 Wrapper Methods

Use a predictive model to evaluate feature subsets.

| Technique | Description |
| --- | --- |
| Recursive Feature Elimination (RFE) | Iteratively remove least important features using model feedback. |
| Forward Selection | Start with none, add best-performing one by one. |
| Backward Elimination | Start with all, remove least useful progressively. |

Computationally expensive, but often yields better results.


6.8.3 Embedded Methods

Feature selection occurs during model training.

| Method | Model Type | Logic |
| --- | --- | --- |
| L1 Regularization (Lasso) | Linear | Forces some coefficients to zero. |
| Tree-based Feature Importance | Decision Trees, Random Forests, XGBoost | Importance = reduction in impurity or information gain. |

Embedded methods are efficient and widely used in production.


6.9 Dimensionality Reduction

Used when there are hundreds or thousands of features.

| Technique | Purpose |
| --- | --- |
| PCA (Principal Component Analysis) | Project data onto lower-dimensional axes maximizing variance. |
| t-SNE / UMAP | Visualize high-dimensional structure. |
| Autoencoders | Learn compact representations in neural networks. |

These help eliminate redundancy and speed up model training.


6.10 Evaluating Feature Importance

Helps explain model decisions and select features intelligently.

Techniques:

  1. Permutation Importance:

    Randomly shuffle one feature and observe performance drop.

  2. Model-based Importance:

    Use feature_importances_ attribute from tree-based models.

  3. SHAP (SHapley Additive exPlanations):

    Quantifies each feature’s contribution to individual predictions.
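
A minimal permutation-importance sketch with scikit-learn (SHAP would need the separate shap package); the dataset and model are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and record how much the score drops.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
print(X.columns[top].tolist())
```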


6.11 Feature Drift and Monitoring

Once deployed, feature distributions may shift over time due to changing environments.

Types:

  • Covariate Drift: Input features change.

  • Label Drift: Relationship between features and labels changes.

  • Concept Drift: Target definition evolves.

Solution:

  • Continuously monitor feature statistics.

  • Retrain or recalibrate model when drift exceeds threshold.
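
A small sketch of the Population Stability Index (PSI) mentioned in the executive summary, used as a per-feature drift score; the 10-bin layout and the 0.2 alert threshold are common rules of thumb, not taken from the book:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and live data."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) on empty bins.
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

train_feature = np.random.normal(0.0, 1.0, 10_000)
live_feature  = np.random.normal(0.3, 1.0, 10_000)   # simulated shift
score = psi(train_feature, live_feature)
print(score, "-> investigate drift" if score > 0.2 else "-> stable")
```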


6.12 Automation in Feature Engineering

Modern ML systems use automated feature generation and management.

Tools:

  • Featuretools (Python) → automatic feature synthesis.

  • Feast, Tecton, Vertex AI Feature Store → store, version, and serve features in production.

Benefits:

  • Consistency across training and inference.

  • Shared, reusable features across teams.

  • Easier monitoring and debugging.


6.13 Key Takeaways

  1. Feature engineering adds domain-driven signal that algorithms cannot create.

  2. Combine, transform, and scale features to expose patterns.

  3. Use statistical and model-based methods for feature selection.

  4. Reduce redundancy via correlation checks and PCA.

  5. Evaluate feature importance for interpretability and fairness.

  6. Continuously monitor feature drift post-deployment.

  7. Leverage feature stores for automation and reproducibility.

  8. Remember: simple models with strong features outperform complex models with weak ones.


6.14 Chapter Summary

  • Feature engineering bridges raw data and model insight.

  • It involves domain understanding, transformation, interaction, and construction.

  • Feature selection ensures efficiency and interpretability.

  • Dimensionality reduction helps with high-dimensional datasets.

  • Automation tools are making feature management scalable and collaborative.

  • The best ML engineers are those who know what to add, what to remove, and what to monitor.


Here are detailed notes for Chapter 7 – “Model Training and Evaluation” from Machine Learning Engineering by Andriy Burkov:


Chapter 7: Model Training and Evaluation

This chapter explains how to train, evaluate, and optimize machine learning models.

It emphasizes best practices for achieving reliable generalization, fair performance comparison, and effective model tuning — all while avoiding common pitfalls like overfitting and data leakage.

“A model that performs well on training data but fails on unseen data is worse than no model at all.”


7.1 The Purpose of Model Training

Model training is the process of finding the best parameters (θ) that minimize prediction errors on known data while maintaining generalization to unseen data.

Objective:

\hat{\theta} = \arg\min_{\theta} L(y, f(x; \theta))

Where:

  • L = loss function

  • y = true labels

  • f(x; θ) = model prediction

The main goal is not just minimizing training error, but ensuring low generalization error — i.e., performing well on new, unseen examples.


7.2 Training, Validation, and Test Splits

Proper data splitting is fundamental to unbiased model evaluation.

| Dataset | Purpose | Notes |
| --- | --- | --- |
| Training Set | Fit model parameters (weights). | Used for learning. |
| Validation Set | Tune hyperparameters, prevent overfitting. | Helps model selection. |
| Test Set | Final unbiased evaluation. | Used only once at the end. |

Never use test data during model development — it invalidates your evaluation.


7.3 The Training Process

  1. Initialize parameters (randomly or heuristically).

  2. Compute predictions on training data.

  3. Measure error using a loss function.

  4. Adjust parameters to minimize loss (via optimization algorithm).

  5. Repeat until convergence or early stopping.
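
A minimal NumPy sketch of this loop for one-feature linear regression trained with batch gradient descent; the learning rate and iteration count are arbitrary:

```python
import numpy as np

# Toy data: y ~ 3x + 2 with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + 2 + rng.normal(0, 0.1, 200)

w, b, lr = 0.0, 0.0, 0.1               # 1. initialize parameters
for _ in range(2_000):
    y_hat = w * x + b                  # 2. compute predictions
    error = y_hat - y                  # 3. measure error (used by the MSE gradient)
    w -= lr * 2 * np.mean(error * x)   # 4. adjust parameters along the gradient
    b -= lr * 2 * np.mean(error)       # 5. repeat for a fixed number of steps
print(w, b)                            # should approach 3 and 2
```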


7.4 Optimization Algorithms

Optimization adjusts model parameters to minimize the loss function.

7.4.1 Batch Gradient Descent

  • Uses the entire training dataset for each update.

  • Slow but accurate.

7.4.2 Stochastic Gradient Descent (SGD)

  • Updates after each example.

  • Faster and enables online learning but noisier.

7.4.3 Mini-Batch Gradient Descent

  • Compromise between batch and SGD.

  • Most commonly used in practice.


7.5 Common Loss Functions

| Task | Loss Function | Formula / Purpose |
| --- | --- | --- |
| Regression | Mean Squared Error (MSE) | \frac{1}{n}\sum(y - \hat{y})^2 |
| Regression | Mean Absolute Error (MAE) | \frac{1}{n}\sum\lvert y - \hat{y}\rvert |
| Classification | Binary Cross-Entropy | -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})] |
| Classification | Multi-class Cross-Entropy | Penalizes wrong class probability. |
| Ranking | Pairwise loss / NDCG | For ordered outputs. |

The loss function defines what the model learns — always align it with your business goal.


7.6 Bias–Variance Tradeoff

Generalization error can be decomposed as:

E_{total} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}

| High Bias | High Variance |
| --- | --- |
| Underfitting | Overfitting |
| Too simple model | Too complex model |
| High training & test error | Low training, high test error |

Goal:

Find a balance between bias and variance.

Techniques to control variance:

  • Cross-validation

  • Regularization (L1, L2)

  • Early stopping

  • Dropout (for deep learning)

  • Ensemble methods


7.7 Regularization

Regularization penalizes large parameter weights to prevent overfitting.

| Type | Formula | Effect |
| --- | --- | --- |
| L1 (Lasso) | λΣ\|wᵢ\| | Forces some weights to exactly zero (sparsity) |
| L2 (Ridge) | λΣwᵢ² | Shrinks weights smoothly |
| Elastic Net | Combination of L1 and L2 | Balances both effects |

λ (lambda) controls regularization strength — tune it via validation.


7.8 Cross-Validation

A method for assessing model stability and generalization.

7.8.1 k-Fold Cross-Validation

  • Split data into k folds (e.g., 5 or 10).

  • Train on k−1 folds, test on the remaining.

  • Average results for final score.

7.8.2 Stratified k-Fold

  • Maintains class balance (important for classification).

7.8.3 Time-Series Cross-Validation

  • Respects chronological order (train → validate → test progressively).

Use cross-validation for model comparison and hyperparameter tuning.
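
A minimal k-fold cross-validation sketch with stratified folds; the dataset, model, and metric are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1_000, random_state=0)

# Stratified folds keep the class ratio identical in every split.
# For temporal data, swap in TimeSeriesSplit to preserve chronological order.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```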


7.9 Hyperparameter Tuning

Hyperparameters are not learned during training; they define how training occurs.

Examples:

  • Learning rate

  • Regularization strength

  • Number of layers, trees, neighbors, etc.

Tuning Techniques:

| Method | Description |
| --- | --- |
| Grid Search | Try all combinations exhaustively. |
| Random Search | Randomly sample combinations (often more efficient). |
| Bayesian Optimization | Uses prior results to guide next trials (e.g., Optuna, HyperOpt). |
| Early Stopping | Halt training when validation performance stops improving. |

Always use the validation set (or CV folds) for hyperparameter tuning — not the test set.
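
A short randomized-search sketch; the model, parameter range, and scoring choice are illustrative:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1_000, random_state=0)

# Hyperparameters are tuned on cross-validation folds, never on the test set.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=2_000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20, cv=5, scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```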


7.10 Evaluation Metrics

The right metric depends on your task type and business objective.

7.10.1 Classification Metrics

| Metric | Formula / Description |
| --- | --- |
| Accuracy | (TP + TN) / Total |
| Precision | TP / (TP + FP) |
| Recall (Sensitivity) | TP / (TP + FN) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) |
| ROC-AUC | Area under ROC curve — discrimination power |
| PR-AUC | Focuses on positive class (useful for imbalance) |

For imbalanced data, prefer Precision-Recall or AUC over Accuracy.


7.10.2 Regression Metrics

| Metric | Interpretation |
| --- | --- |
| MSE / RMSE | Penalizes large errors heavily. |
| MAE | Measures average absolute deviation. |
| R² Score | Fraction of variance explained by model. |


7.10.3 Ranking / Recommendation Metrics

  • Precision@k – How many of top-k results are relevant.

  • MAP (Mean Average Precision) – Overall ranking quality.

  • NDCG (Normalized Discounted Cumulative Gain) – Relevance weighted by position.


7.10.4 Business-Oriented Metrics

Sometimes, classical metrics don’t reflect real-world value.

Examples:

  • Click-through rate (CTR)

  • Revenue lift

  • False alarm cost

  • Customer churn rate reduction

Choose metrics that align with business impact.


7.11 Confusion Matrix and Error Analysis

A confusion matrix breaks down predictions into:

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

From this, all classification metrics are derived.

Error Analysis Steps:

  1. Analyze misclassified examples.

  2. Identify patterns or data issues.

  3. Adjust features, thresholds, or model complexity.


7.12 Handling Class Imbalance

Class imbalance arises when one class dominates the others (e.g., fraud = 1%, non-fraud = 99%).

Techniques:

  • Resampling: SMOTE, ADASYN, or undersampling.

  • Class Weights: Penalize mistakes on minority class more.

  • Threshold Adjustment: Shift decision boundary.

  • Metric Selection: Use Precision-Recall or F1 instead of accuracy.
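
A brief sketch combining class weights and threshold adjustment on a synthetic 1%-positive dataset (the imbalance ratio and the 0.3 threshold are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 1% positives
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes minority-class mistakes more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=2000).fit(X_tr, y_tr)

# Threshold adjustment: lowering the threshold trades precision for recall
proba = clf.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: F1={f1_score(y_te, preds):.3f}")
```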


7.13 Model Comparison

When comparing models, ensure:

  • Same train/validation/test splits.

  • Same preprocessing steps.

  • Same evaluation metrics.

Fair comparison = same data, same metric, same protocol.

Always include a baseline model for reference:

  • Random guesser

  • Logistic regression

  • Rule-based heuristic


7.14 Ensemble Methods

Combine multiple models to improve stability and performance.

| Technique | Description | Examples |
| --- | --- | --- |
| Bagging | Train models independently, average predictions. | Random Forest |
| Boosting | Sequentially correct errors of previous models. | XGBoost, AdaBoost, LightGBM |
| Stacking | Combine predictions of several models using a meta-model. | Blended ensembles |

Ensembles reduce variance but may sacrifice interpretability.
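
A compact sketch comparing bagging, boosting, and stacking with scikit-learn estimators (the specific models and dataset are illustrative).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging  = RandomForestClassifier(n_estimators=200, random_state=0)   # bagging of trees
boosting = GradientBoostingClassifier(random_state=0)                 # sequential boosting
stacking = StackingClassifier(                                        # meta-model on top
    estimators=[("rf", bagging), ("gb", boosting)],
    final_estimator=LogisticRegression(max_iter=2000),
)

for name, model in [("Bagging", bagging), ("Boosting", boosting), ("Stacking", stacking)]:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: ROC-AUC={score:.3f}")
```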


7.15 Avoiding Overfitting

Overfitting = low training error but high validation/test error.

Prevention Techniques:

  1. Cross-validation

  2. Regularization (L1/L2)

  3. Early stopping

  4. Dropout (for deep nets)

  5. Simplifying model complexity

  6. Collecting more data

  7. Data augmentation (especially for vision/text)


7.16 Reproducibility in Training

Ensure experiments are repeatable by fixing:

  • Random seeds

  • Data splits

  • Environment dependencies

  • Code versions

Use experiment-tracking tools:

  • MLflow

  • Weights & Biases

  • Comet.ml
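
A minimal seed-fixing sketch (deep-learning frameworks add their own seed calls, e.g., for GPU kernels; the seed value is arbitrary).

```python
import os
import random
import numpy as np
import sklearn

SEED = 42

# Fix the common sources of randomness
random.seed(SEED)
np.random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)

# Log library versions alongside experiment results
print({"numpy": np.__version__, "scikit-learn": sklearn.__version__})

# Also pass explicit seeds to estimators and splitters, e.g.:
# train_test_split(X, y, random_state=SEED)
# RandomForestClassifier(random_state=SEED)
```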


7.17 Model Interpretability

Interpretability builds trust and helps debugging.

Methods:

  • Feature importance (trees, coefficients)

  • Partial Dependence Plots (PDP)

  • LIME / SHAP values

  • Decision rules / surrogate models

Choose explainable models for high-stakes domains (healthcare, finance).


7.18 Continuous Evaluation and Monitoring

Model performance can drift in production due to:

  • Data distribution shifts

  • Concept drift

  • Label leakage

Monitor:

  • Input feature statistics

  • Prediction distributions

  • Live performance (A/B testing)

  • Business KPIs

Retrain or recalibrate when drift is detected.


7.19 Key Takeaways

  1. Always separate training, validation, and test data to avoid leakage.

  2. Align loss functions and metrics with the business objective.

  3. Manage bias–variance tradeoff via regularization and model tuning.

  4. Use cross-validation for reliable performance estimation.

  5. Track and tune hyperparameters systematically.

  6. Handle class imbalance carefully using weights or resampling.

  7. Maintain reproducibility with versioning and fixed seeds.

  8. Continuously monitor model performance post-deployment.


7.20 Chapter Summary

  • Model training involves optimizing parameters to minimize loss.

  • Evaluation focuses on generalization, not just training accuracy.

  • Regularization, CV, and early stopping prevent overfitting.

  • Use the right metrics and error analysis to understand model behavior.

  • Hyperparameter tuning and ensemble learning can boost performance.

  • Monitoring and interpretability are crucial for production models.

  • Ultimately, good training practice is about discipline, reproducibility, and clarity — not just accuracy.


Here are detailed notes for Chapter 8 – “Model Deployment and Prediction Service” from Machine Learning Engineering by Andriy Burkov:


Chapter 8: Model Deployment and Prediction Service

This chapter explains how to turn a trained model into a usable service — accessible to users or systems — in a reliable, scalable, and maintainable way.

It covers deployment strategies, prediction pipelines, system architecture, versioning, monitoring, and performance optimization.

“A model that stays in a notebook helps no one — deployment turns ideas into value.”


8.1 What Is Model Deployment?

Model deployment is the process of integrating a trained ML model into a production environment where it can make predictions on new, unseen data.

Goal:

To make the model’s predictions accessible through an interface (API, batch job, or streaming system) for real-world use.


8.2 Model Deployment Paradigms

Deployment depends on how predictions are used:

| Type | Description | Examples |
| --- | --- | --- |
| Batch Prediction | Model runs periodically on bulk data. | Daily credit risk scoring, nightly sales forecasts. |
| Online (Real-time) Prediction | Predictions served instantly via an API. | Fraud detection during payment, recommendation systems. |
| Streaming Prediction | Model ingests continuous data flow. | Predictive maintenance on IoT data, stock price alerts. |

Key Tradeoffs

  • Batch = high throughput, low immediacy.

  • Online = low latency, more complexity.

  • Streaming = high engineering overhead, but instant reaction.


8.3 Components of a Prediction Service

A prediction service typically includes:

  1. Model Artifacts

    • Serialized model file (e.g., .pkl, .onnx, .pt, .h5).

  2. Feature Pipeline

    • Transforms raw input data into model-ready features.

    • Must mirror the preprocessing used during training.

  3. Prediction Logic

    • Loads the model and computes outputs.

  4. Serving Interface

    • REST API, gRPC endpoint, or message queue consumer.

  5. Monitoring & Logging

    • Tracks requests, latency, and model performance drift.

Consistency rule: The data preprocessing used during training must be identical during inference.


8.4 Packaging and Serialization

To deploy a model, you first package it into a portable artifact.

Common Serialization Formats

| Framework | Serialization Method | File Type |
| --- | --- | --- |
| Scikit-learn | Pickle, Joblib | .pkl, .joblib |
| TensorFlow/Keras | SavedModel, HDF5 | .pb, .h5 |
| PyTorch | TorchScript | .pt, .pth |
| ONNX | Open Neural Network Exchange | .onnx (cross-framework) |

Save both the model and preprocessing pipeline together for reproducibility.
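
A minimal sketch of that practice: fit preprocessing and model as one scikit-learn Pipeline and serialize it as a single artifact (the file name and model choice are illustrative).

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing + model packaged as a single Pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipeline.fit(X, y)

# One artifact to version and deploy; inference reuses the exact same preprocessing
joblib.dump(pipeline, "pipeline.joblib")
loaded = joblib.load("pipeline.joblib")
print(loaded.predict(X[:3]))
```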


8.5 Containerization

Models are deployed as containers for consistency across environments.

Benefits of Containerization:

  • Reproducible environment (same dependencies, libraries, OS).

  • Scalability across clusters (e.g., Kubernetes).

  • Isolation from other services.

Typical Container Stack:

  • Dockerfile → Defines environment (Python version, dependencies).

  • Docker image → Portable build.

  • Container orchestration → Managed by Kubernetes, ECS, or Docker Swarm.

Example structure:
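
A hypothetical repository might contain a Dockerfile, requirements.txt, the serialized pipeline, and a small API app. The sketch below shows what that app's entry point could look like (FastAPI, the file paths, and the request schema are illustrative assumptions, not the book's prescription).

```python
# app/main.py: minimal prediction endpoint packaged inside the container image
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model/pipeline.joblib")  # preprocessing + model, loaded once at startup

class PredictRequest(BaseModel):
    features: list[float]  # raw feature values in the order used during training

@app.post("/predict")
def predict(request: PredictRequest):
    # The pipeline applies training-time preprocessing before predicting
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```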


8.6 Model Serving Architectures

8.6.1 Embedded (In-process) Serving

  • Model runs inside the same process as the main application.

  • Pros: Low latency, simple setup.

  • Cons: Harder to update model without redeploying the app.

8.6.2 Separate Microservice

  • Model is exposed via REST/gRPC API.

  • Pros: Independent scalability and updates.

  • Cons: Network latency and additional infrastructure.

8.6.3 Model Server Frameworks

  • Prebuilt serving solutions for ML workloads.

| Tool | Best For |
| --- | --- |
| TensorFlow Serving | TensorFlow models |
| TorchServe | PyTorch models |
| MLflow Models | Cross-framework |
| Seldon Core / KFServing | Kubernetes-native inference |
| Triton Inference Server | Multi-model GPU serving |


8.7 Deployment Environments

| Environment | Description | Use Case |
| --- | --- | --- |
| Cloud Deployment | Model hosted on AWS, Azure, or GCP. | Scalable production ML. |
| On-Premise | Local servers managed by company. | Data-sensitive industries (finance, healthcare). |
| Edge Deployment | Model runs on devices. | Mobile apps, IoT sensors, self-driving cars. |

Edge deployment prioritizes latency, privacy, and offline inference.


8.8 Model Versioning and Management

Every model version must be tracked along with:

  • Training data version

  • Code version

  • Hyperparameters

  • Evaluation metrics

  • Deployment metadata

Tools:

  • MLflow Model Registry

  • DVC (Data Version Control)

  • Weights & Biases

  • Git + Tagging

Version control ensures:

  • Reproducibility

  • Rollback ability (if performance degrades)

  • Auditability for compliance


8.9 CI/CD for Machine Learning (MLOps)

Continuous Integration / Continuous Deployment pipelines automate model lifecycle tasks.

CI/CD Stages:

  1. Data validation (schema checks)

  2. Model training (automated retraining pipeline)

  3. Model evaluation (compare vs. current production)

  4. Deployment (automated if new model passes threshold)

  5. Monitoring & rollback

MLOps = DevOps + DataOps + ModelOps.

It ensures automation, reliability, and scalability of ML systems.

Popular Tools:

  • Kubeflow

  • Airflow

  • MLflow

  • Jenkins

  • Argo Workflows


8.10 Model Performance Monitoring

After deployment, models must be continuously monitored for:

1. Data Drift

  • Distribution of inputs changes over time.

  • Detected via KS-test, population stability index (PSI).

2. Concept Drift

  • Relationship between features and target changes.

3. Latency and Throughput

  • Measure prediction response time and load capacity.

4. Prediction Quality

  • Compare real-world outcomes vs. model predictions (if labels become available).

5. Business Metrics

  • Track user engagement, conversions, fraud rate, etc.


8.11 Canary and Shadow Deployments

Safe rollout strategies for new models.

Canary Deployment

  • Deploy new model to small % of users → monitor performance.

  • If stable → gradually increase traffic.

Shadow Deployment

  • New model runs in parallel with production model.

  • Receives same inputs but its predictions are not used.

  • Used for testing before production rollout.

Both strategies prevent full-scale failures.


8.12 Model Scaling

Scalability ensures that your service handles increasing requests efficiently.

Scaling Strategies

  1. Vertical Scaling: Increase server resources (CPU/GPU/RAM).

  2. Horizontal Scaling: Add more replicas behind a load balancer.

  3. Autoscaling: Automatically adjust resources based on traffic (Kubernetes HPA, AWS Auto Scaling).

Use asynchronous queues (Kafka, RabbitMQ) for large batch or streaming workloads.


8.13 Security and Privacy in Deployment

  1. Authentication & Authorization

    • Restrict API access via keys or OAuth.

  2. Input Validation

    • Prevent malicious payloads or inference attacks.

  3. Data Encryption

    • Encrypt data in transit (TLS) and at rest.

  4. Model Protection

    • Use rate limits, model watermarking, or access control.

  5. Privacy Preservation

    • Techniques: differential privacy, federated learning, secure enclaves.

Inference APIs can unintentionally expose sensitive model behavior — protect them like other production assets.


8.14 Model Retraining and Lifecycle Management

Models degrade over time due to:

  • Data drift

  • Behavior change

  • Concept drift

Retraining Loop

  1. Collect new labeled data.

  2. Evaluate production model on updated data.

  3. Retrain if performance falls below threshold.

  4. Re-deploy and monitor.

Automation tip:

Integrate retraining pipelines in CI/CD workflow (MLOps).


8.15 A/B Testing for Model Performance

Compare two models live on real traffic.

| Component | Explanation |
| --- | --- |
| A model (Control) | Current production model |
| B model (Candidate) | New version under test |
| Metric | Business KPI (e.g., click-through rate, conversion) |
| Decision | Deploy B if statistically significant improvement |

Use A/B testing to ensure new models actually deliver business value, not just metric improvements.
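
A minimal sketch of the statistical check behind that decision, using a two-proportion z-test from statsmodels (the conversion counts, one-sided test, and 0.05 threshold are illustrative assumptions).

```python
from statsmodels.stats.proportion import proportions_ztest

# Conversions and traffic for control (A) and candidate (B)
conversions = [480, 540]
visitors    = [10000, 10000]

# One-sided test: is B's conversion rate greater than A's?
stat, p_value = proportions_ztest(count=conversions, nobs=visitors, alternative="smaller")
print(f"z={stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Deploy B: improvement is statistically significant.")
else:
    print("Keep A: no significant evidence of improvement.")
```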


8.16 Logging and Observability

Logging ensures traceability and debugging ability.

Key Logs:

  • Input features (with timestamps)

  • Model version used

  • Prediction outputs

  • Latency and request ID

  • System-level metrics (CPU, GPU, memory)

Observability Tools:

  • Prometheus + Grafana

  • ELK Stack (Elasticsearch, Logstash, Kibana)

  • Sentry, Datadog


8.17 Error Handling and Failover

  • Implement graceful degradation (return fallback prediction or cached result if service fails).

  • Use redundant instances or load balancing for high availability.

  • Monitor error rates and auto-alert on anomalies.


8.18 Key Takeaways

  1. Deployment turns ML research into real-world impact.

  2. Choose between batch, online, or streaming inference depending on latency needs.

  3. Package models with preprocessing logic — ensure consistency.

  4. Use containers and microservices for scalability and independence.

  5. Implement versioning, CI/CD, and MLOps for automation and reproducibility.

  6. Monitor for drift, latency, and business KPIs continuously.

  7. Use canary or shadow deployment to safely roll out updates.

  8. Secure your model endpoints and data — treat them as production software.

  9. Retrain periodically and validate improvements through A/B testing.

  10. Build observability into every stage — logging, metrics, alerts.


8.19 Chapter Summary

  • Model deployment converts a static model into an active service.

  • It involves packaging, serving, scaling, monitoring, and maintaining the model lifecycle.

  • Deployment can be batch, online, or streaming, depending on latency and complexity.

  • Tools like Docker, Kubernetes, MLflow, Seldon, and Airflow are standard in production MLOps.

  • Long-term success depends on robust pipelines, observability, and continuous retraining.

  • The key to production ML: consistency, automation, monitoring, and security.


Here are detailed notes for Chapter 9 – “Model Maintenance and Monitoring” from Machine Learning Engineering by Andriy Burkov:


Chapter 9: Model Maintenance and Monitoring

This chapter focuses on what happens after model deployment — the phase where most real-world challenges appear.

It explains how to monitor, maintain, and update machine learning models to ensure their performance remains consistent, fair, and reliable over time.

“In machine learning, deployment is not the end — it’s the beginning of a new lifecycle.”


9.1 Why Model Maintenance Is Essential

A model’s environment is dynamic, not static.

Once in production, it begins to interact with real-world data — which changes continuously.

Reasons for Degradation (a.k.a. Model Decay):

  1. Data Drift: Input data distribution changes over time.

  2. Concept Drift: Relationship between inputs and outputs evolves.

  3. Label Drift: Target variable definitions shift.

  4. Feature Drift: Important features lose predictive power or disappear.

  5. System Drift: Infrastructure or upstream changes (e.g., APIs, ETL jobs) alter inputs.

Even a perfectly trained model will fail if the world it represents changes.


9.2 Model Lifecycle Management

The ML lifecycle doesn’t stop at deployment; it’s a continuous loop:

  1. Monitor model performance in production.

  2. Detect drift or anomalies.

  3. Collect fresh labeled data.

  4. Retrain or recalibrate the model.

  5. Validate new version and redeploy.

This cycle repeats throughout the model’s operational lifespan — a concept called continuous ML (CML) or MLOps.


9.3 Key Aspects of Monitoring

Monitoring ensures the model performs as expected technically and business-wise.

Monitoring Dimensions:

| Category | Goal | Examples |
| --- | --- | --- |
| Data Quality Monitoring | Ensure input data integrity. | Missing values, schema mismatches. |
| Model Performance Monitoring | Track predictive accuracy and calibration. | Accuracy, F1, AUC, RMSE. |
| Drift Detection | Identify input/output distribution changes. | PSI, KL divergence, KS test. |
| Operational Metrics | Maintain system stability. | Latency, throughput, uptime. |
| Business KPIs | Measure real-world impact. | Conversion rate, churn reduction, fraud detection rate. |


9.4 Monitoring Data Quality

Data quality issues are among the most common causes of performance degradation.

What to Check:

  1. Schema Changes:

    • Feature added/removed/renamed.

    • Example: “user_age” becomes “age_years.”

  2. Missing Data Patterns:

    • Missing values increase suddenly.

  3. Feature Statistics:

    • Mean, median, standard deviation drift.

    • Sudden distribution shifts.

  4. Outlier Ratios:

    • Large deviation may signal corrupted data.

Tools:

  • Great Expectations

  • TFDV (TensorFlow Data Validation)

  • Deequ (AWS)

  • Pandera (Python)

Always log input data summaries with every prediction request.


9.5 Detecting Drift

9.5.1 Types of Drift

| Type | Definition | Example |
| --- | --- | --- |
| Covariate Drift (Feature Drift) | Change in input distribution P(X). | Customer age distribution shifts due to new user segment. |
| Prior Probability Drift (Label Drift) | Change in target label distribution P(Y). | Fraud rate drops from 10% to 2%. |
| Concept Drift | Change in relationship between inputs and labels P(Y\|X). | Spam patterns evolve, so the same features map to different labels. |

9.5.2 Drift Detection Methods

| Method | Purpose |
| --- | --- |
| Population Stability Index (PSI) | Measures difference between two distributions. |
| Kolmogorov–Smirnov Test (KS Test) | Detects differences in continuous variable distributions. |
| Jensen–Shannon or KL Divergence | Quantifies dissimilarity between probability distributions. |
| Chi-Square Test | For categorical features. |
| ADWIN / DDM (Online Algorithms) | Real-time drift detection in streaming data. |

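To make PSI concrete, here is a small sketch that bins a baseline feature, compares the live distribution against it, and flags drift (10 bins and the 0.2 alert threshold are common rules of thumb, used here as illustrative defaults).

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample (expected) and a live sample (actual)."""
    # Bin edges derived from the baseline distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 10000)
live_feature  = rng.normal(0.5, 1, 10000)   # shifted live distribution

psi = population_stability_index(train_feature, live_feature)
print(f"PSI={psi:.3f}  ->  {'drift alert' if psi > 0.2 else 'stable'}")
```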

9.6 Monitoring Model Performance

Measure how the model performs on live data compared to its original validation results.

Approaches:

  1. Delayed Ground Truth Monitoring:

    • Wait until actual labels become available (e.g., loan default after 6 months).

    • Compare new metrics to baseline validation results.

  2. Proxy Metrics:

    • Use surrogate indicators (like click rate, engagement) when labels are delayed.

  3. Continuous Evaluation:

    • Periodically run evaluation pipelines on new labeled batches.

Typical Metrics:

  • Classification: Precision, Recall, F1, AUC.

  • Regression: MAE, RMSE, R².

  • Business: Revenue lift, reduced churn, fraud savings.

Track both model metrics and business outcomes to assess true health.


9.7 Model Explainability and Accountability

Post-deployment interpretability helps detect unexpected model behavior or bias.

Explainability Tools:

  • Feature Importance (Tree-based models)

  • SHAP values

  • LIME (Local Interpretable Model-Agnostic Explanations)

  • Counterfactual explanations

Accountability Measures:

  • Document model lineage and decisions.

  • Store versioned model cards with:

    • Intended use case

    • Training data summary

    • Evaluation metrics

    • Known limitations and risks

This builds trust and auditability, especially in regulated industries.


9.8 Automating Model Maintenance (MLOps Loop)

Automation ensures models are updated without manual intervention.

Typical Automated Workflow:

  1. Detect Drift → Trigger Retraining Pipeline.

  2. Collect New Data → Validate Schema → Train Model.

  3. Evaluate → Compare with Baseline → If Better → Deploy.

  4. Monitor Performance → Repeat.

This can be implemented via tools like:

  • Kubeflow Pipelines

  • MLflow + Airflow

  • TFX (TensorFlow Extended)

  • AWS SageMaker Pipelines

Automation reduces human error and keeps models synchronized with reality.


9.9 Retraining Strategies

Retraining keeps models relevant but must be done strategically.

Retraining Triggers:

  1. Time-based: Retrain periodically (e.g., weekly, monthly).

  2. Performance-based: Retrain when performance drops below threshold.

  3. Data-based: Retrain when input distribution drifts beyond limit.

Retraining Types:

| Type | Description | Example |
| --- | --- | --- |
| Full Retraining | Train new model from scratch. | Monthly refresh with all data. |
| Incremental Training | Update model using recent data. | Online learning models (SGD, streaming). |
| Transfer Learning | Reuse part of old model with new domain data. | Adapting product recommendation for new region. |

Retraining frequency depends on how dynamic the data domain is.


9.10 Model Versioning and Rollback

Every new model must be:

  • Versioned (with training data, code, and parameters).

  • Tested for regression against current production.

  • Deployable with rollback support.

Model Registry Should Include:

  • Version ID

  • Training dataset hash

  • Model configuration & hyperparameters

  • Validation & production metrics

  • Deployment date and owner

Rollback:

If a new version performs worse or introduces bias, the system should automatically revert to the previous stable version.


9.11 Bias and Fairness Monitoring

Models can become biased due to:

  • Historical bias in training data.

  • Uneven drift across user groups.

  • Feedback loops reinforcing prior errors.

Monitoring Methods:

  • Fairness Metrics: Equal Opportunity, Demographic Parity, Disparate Impact.

  • Group-wise Performance Tracking: Compare F1 or AUC across demographics.

  • Bias Dashboards: Automated visualization tools (e.g., What-If Tool, Aequitas).

Fairness is not a one-time check — it must be continuously measured.


9.12 Shadow and Champion–Challenger Models

Safe maintenance requires testing new models before replacing old ones.

Shadow Mode:

  • New model runs in parallel, receives same inputs, but predictions are not used.

  • Compares outputs silently to production model.

Champion–Challenger Setup:

  • Champion = current production model.

  • Challenger = candidate model.

  • Challenger replaces Champion only if it performs better in real-world data.

This system prevents regressions and enables experimentation without risk.


9.13 Monitoring Tools and Infrastructure

Popular tools for end-to-end monitoring and maintenance:

| Purpose | Tools |
| --- | --- |
| Data Validation | Great Expectations, TFDV |
| Experiment Tracking | MLflow, Weights & Biases |
| Drift Detection | EvidentlyAI, WhyLabs, Arize AI |
| Model Registry | MLflow Registry, SageMaker Model Registry |
| Alerting & Dashboards | Prometheus + Grafana, Kibana, Datadog |


9.14 Model Governance and Compliance

For enterprise and regulated domains (finance, healthcare, insurance), compliance is critical.

Governance Elements:

  1. Auditability: Logs of all model decisions.

  2. Traceability: Links from prediction → model version → training data.

  3. Accountability: Who approved, trained, and deployed the model.

  4. Transparency: Clear documentation of data sources and decisions.

Tools and Standards:

  • Model Cards (Google)

  • Datasheets for Datasets

  • AI Fairness 360 (IBM)

  • Responsible AI frameworks (Azure, AWS, GCP)


9.15 Human-in-the-Loop Systems

Human oversight improves model robustness and accountability.

Examples:

  • Review edge-case predictions.

  • Provide feedback for retraining.

  • Override model outputs in critical cases.

The goal is not to replace humans but to augment them.


9.16 Key Challenges in Model Maintenance

| Challenge | Description |
| --- | --- |
| Data Drift | New data distributions differ from training data. |
| Label Availability | Delayed or missing true outcomes for feedback. |
| Automation Failures | Pipelines break due to dependency updates. |
| Cost Management | Retraining and monitoring can be expensive. |
| Governance Complexity | Ensuring compliance across changing models. |

Addressing these requires a combination of MLOps tools, alerting systems, and process discipline.


9.17 Key Takeaways

  1. Model performance decays naturally — monitor continuously.

  2. Track data drift, concept drift, and feature drift separately.

  3. Automate retraining pipelines but validate models before deployment.

  4. Use shadow/champion–challenger setups to test safely.

  5. Version models, data, and metrics for full traceability.

  6. Continuously monitor bias and fairness.

  7. Establish governance and compliance processes for accountability.

  8. Implement human-in-the-loop checks for critical decisions.

  9. Use dashboards, alerts, and logs to detect anomalies early.

  10. Maintenance is an ongoing process — success depends on iteration and observation.


9.18 Chapter Summary

  • Model maintenance ensures long-term reliability of ML systems.

  • Monitoring covers data quality, performance, drift, and fairness.

  • Automation (MLOps) is key to scaling retraining and redeployment.

  • Every model version should be tracked, explainable, and auditable.

  • Continuous monitoring bridges engineering, data science, and business.

  • In production, change is constant — the best ML engineers design for evolution, not perfection.


Here are detailed notes for Chapter 10 – “Ethics and Privacy” from Machine Learning Engineering by Andriy Burkov:


Chapter 10: Ethics and Privacy

This final chapter emphasizes the responsibility of ML engineers in developing and deploying machine learning systems that are ethical, fair, transparent, and respectful of privacy.

It discusses the moral, social, and legal implications of AI models and outlines strategies to build systems that benefit people without causing harm.

“A great machine learning engineer doesn’t just optimize models — they optimize for humanity.”


10.1 The Importance of Ethics in Machine Learning

Machine learning models influence critical decisions — in healthcare, finance, justice, hiring, and public policy.

Unethical design or misuse can lead to:

  • Discrimination and unfair outcomes

  • Invasion of privacy

  • Spread of misinformation

  • Erosion of trust in technology

Hence, ethical ML is not optional — it’s a core engineering responsibility.


10.2 Defining AI Ethics

AI Ethics is the practice of aligning machine learning systems with:

  1. Human values (fairness, autonomy, safety)

  2. Moral principles (justice, beneficence, non-maleficence)

  3. Legal and social norms

Key Objective:

Ensure ML systems help more than they harm, and their decisions can be understood and justified.


10.3 Core Ethical Principles

| Principle | Definition | Example Violation |
| --- | --- | --- |
| Fairness | Equal treatment for all individuals or groups. | Biased loan approval model. |
| Accountability | Humans remain responsible for AI decisions. | “The algorithm decided, not me.” |
| Transparency | Decision-making logic is explainable and accessible. | Black-box models in justice systems. |
| Privacy | Protecting personal and sensitive data. | Sharing identifiable health data. |
| Security | Prevent misuse and attacks. | Exposed API allowing data theft. |
| Non-maleficence | Do no harm to users or society. | Deepfake misinformation. |
| Beneficence | Actively promote well-being. | AI that improves healthcare access. |


10.4 Fairness in Machine Learning

ML models learn from historical data, which often carries systemic bias.

10.4.1 Sources of Bias

  1. Sampling Bias: Non-representative data collection.

  2. Label Bias: Inaccurate or biased ground truth labels.

  3. Measurement Bias: Errors in data capture (e.g., camera-based skin tone recognition).

  4. Algorithmic Bias: Model amplifies existing inequalities.

  5. Societal Bias: Data reflects human prejudice and discrimination.

10.4.2 Fairness Metrics

| Metric | Description |
| --- | --- |
| Demographic Parity | Equal positive prediction rates across groups. |
| Equal Opportunity | Equal true positive rates across groups. |
| Equalized Odds | Equal TPR and FPR across groups. |
| Disparate Impact | Ratio of favorable outcomes between groups ≥ 0.8. |
| Calibration within Groups | Predicted probabilities are consistent across demographics. |

Fairness cannot be achieved by accident — it must be measured and enforced deliberately.
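
A minimal sketch of measuring two of these metrics directly from predictions grouped by a sensitive attribute (the toy arrays are illustrative).

```python
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # model decisions (1 = favorable)
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

# Positive (favorable) prediction rate per group
rate_a = y_pred[group == "A"].mean()
rate_b = y_pred[group == "B"].mean()

demographic_parity_gap = abs(rate_a - rate_b)
disparate_impact = min(rate_a, rate_b) / max(rate_a, rate_b)  # "80% rule": should be >= 0.8

print(f"P(favorable | A)={rate_a:.2f}  P(favorable | B)={rate_b:.2f}")
print(f"Parity gap={demographic_parity_gap:.2f}  Disparate impact={disparate_impact:.2f}")
```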


10.5 Addressing Fairness

Approaches to Mitigate Bias:

1. Pre-Processing

  • Clean and rebalance data before training.

  • Techniques: reweighting, resampling (e.g., SMOTE for minority classes), removing sensitive features.

2. In-Processing

  • Modify training algorithms to include fairness constraints.

  • Example: adversarial debiasing.

3. Post-Processing

  • Adjust model outputs or thresholds for fairness.

  • Example: equalize prediction probabilities across subgroups.

Tools:

IBM AI Fairness 360, Google What-If Tool, Fairlearn (Microsoft).


10.6 Transparency and Explainability

Black-box models raise questions like:

  • Why did the model reject this loan?

  • How is this diagnosis justified?

  • Can we trust an algorithmic decision?

Explainability = Understanding model decisions.

Types of Explainability

| Type | Goal | Examples |
| --- | --- | --- |
| Global Explainability | Understand overall logic. | Feature importance, partial dependence plots. |
| Local Explainability | Explain single prediction. | SHAP, LIME, counterfactual explanations. |
| Model-Agnostic | Works for any model type. | LIME, SHAP. |
| Intrinsic | Built into model structure. | Linear regression, decision trees. |

Why Explainability Matters

  • Builds trust with users.

  • Helps detect bias and errors.

  • Required by law (e.g., GDPR “right to explanation”).

Choose the simplest model that meets performance requirements — simpler models are easier to explain.


10.7 Accountability in ML Systems

Humans must remain responsible for AI outcomes.

Accountability Components

  1. Traceability: Every prediction should be linked to:

    • Model version

    • Data used

    • Decision rationale

  2. Auditability: Logs and metrics must be available for external review.

  3. Human Oversight: Humans can override model decisions when necessary.

Documentation Tools

  • Model Cards (Google): Explain model’s purpose, performance, and limitations.

  • Datasheets for Datasets (Gebru et al.): Describe dataset origin, collection, and biases.

  • FactSheets (IBM): Summarize ethical risks and intended use.


10.8 Privacy in Machine Learning

ML often requires large amounts of data, much of it personal. Protecting privacy is both ethical and legal.

10.8.1 Types of Sensitive Data

  • Personally Identifiable Information (PII)

  • Financial or medical records

  • Biometric data (face, fingerprints)

  • Behavioral logs or user activity

10.8.2 Privacy Violations

  • Accidental data leaks

  • Model memorization (e.g., remembering patient names)

  • Inference attacks (extracting private info from model)

  • Re-identification from anonymized data


10.9 Techniques for Privacy Preservation

| Technique | Purpose | Example |
| --- | --- | --- |
| Data Anonymization | Remove identifiers like name, address. | Replace “John Doe” with “User123”. |
| Pseudonymization | Replace identifiers with pseudonyms. | Encrypt personal IDs. |
| Differential Privacy | Add statistical noise to prevent re-identification. | Used by Apple, Google in telemetry data. |
| Federated Learning | Train model locally on devices — only send updates, not data. | Google Keyboard personalization. |
| Homomorphic Encryption | Perform computations on encrypted data. | Privacy-preserving cloud inference. |
| Secure Multiparty Computation | Jointly compute results without revealing private data. | Collaborative medical AI research. |

Privacy-preserving ML enables innovation without compromising personal security.
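
To illustrate the differential-privacy idea, the sketch below releases an aggregate count with calibrated Laplace noise (the epsilon value and the query are illustrative; real systems should use vetted DP libraries rather than hand-rolled noise).

```python
import numpy as np

def dp_count(values, predicate, epsilon=1.0, seed=None):
    """Release a count with Laplace noise; the sensitivity of a counting query is 1."""
    rng = np.random.default_rng(seed)
    true_count = sum(predicate(v) for v in values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
    return true_count + noise

ages = [23, 35, 41, 29, 52, 61, 38, 47]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0, seed=0)
print(f"Noisy count of users aged 40+: {noisy:.1f}")
```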


10.10 Privacy Regulations and Compliance

Several global regulations and standards enforce responsible data and AI use.

| Regulation / Standard | Region | Focus |
| --- | --- | --- |
| GDPR (General Data Protection Regulation) | EU | Data protection, user consent, “right to explanation.” |
| CCPA (California Consumer Privacy Act) | USA | Data ownership and opt-out rights. |
| HIPAA (Health Insurance Portability and Accountability Act) | USA | Protects healthcare data. |
| OECD AI Principles | Global | Fairness, transparency, accountability. |
| EU AI Act (2024+) | EU | Risk-based regulation for AI systems. |

GDPR Key Rights:

  • Right to be informed

  • Right to access data

  • Right to rectification and erasure

  • Right to data portability

  • Right to explanation (for automated decisions)


10.11 Security in ML Systems

Security breaches can compromise both the data and the model.

Common Threats

  1. Data Poisoning: Injecting malicious samples during training.

  2. Model Inversion: Extracting sensitive data from model outputs.

  3. Adversarial Attacks: Slightly modifying inputs to fool the model.

  4. Model Stealing: Reverse-engineering a proprietary model via API queries.

Mitigation Techniques

  • Validate input data and use anomaly detection.

  • Apply differential privacy and encryption.

  • Limit API access and throttle queries.

  • Monitor unusual usage patterns or request spikes.


10.12 Social Impacts of Machine Learning

ML systems shape human behavior and societal structures.

Potential Negative Impacts

  • Job displacement due to automation.

  • Filter bubbles reinforcing one-sided content.

  • Surveillance capitalism via behavioral tracking.

  • Misinformation spread through deepfakes and algorithmic amplification.

Positive Impacts

  • Enhanced healthcare diagnostics.

  • Smarter resource allocation (energy, traffic, logistics).

  • Accessible education via adaptive learning.

  • Faster scientific discoveries.

ML should be designed to empower — not manipulate — humans.


10.13 Building Ethical ML Systems

Checklist for Ethical ML Development:

| Phase | Ethical Practices |
| --- | --- |
| Data Collection | Obtain consent, anonymize data, ensure diversity. |
| Model Design | Prioritize fairness, interpretability, and robustness. |
| Evaluation | Test for bias, drift, and ethical risk. |
| Deployment | Secure APIs, implement human-in-the-loop controls. |
| Maintenance | Monitor fairness, retrain when necessary. |

Best Practices:

  1. Use ethics review boards or internal audit teams.

  2. Create diverse development teams to reduce blind spots.

  3. Include user feedback loops in the design.

  4. Define “harm scenarios” during system design.


10.14 Responsible AI Frameworks

Leading tech organizations promote responsible AI principles:

| Organization | Framework / Initiative |
| --- | --- |
| Google | Responsible AI Principles |
| Microsoft | AI Fairness, Reliability, Privacy, Inclusiveness |
| IBM | Everyday Ethics for AI |
| OECD / UNESCO | Global AI Ethics Guidelines |
| EU AI Act | Risk-based compliance approach |

All emphasize:

  • Fairness

  • Transparency

  • Accountability

  • Privacy

  • Human oversight


10.15 The Role of the ML Engineer

An ethical ML engineer is:

  • A technologist: Writes efficient, scalable, accurate code.

  • A scientist: Validates assumptions with data.

  • A philosopher: Questions the social impact of their creations.

Key Responsibilities:

  1. Anticipate harm — ask “Who could be negatively affected?”

  2. Design for inclusion — ensure diverse representation in data.

  3. Maintain accountability — document and explain every design choice.

  4. Respect privacy — minimize data collection and use only necessary features.

  5. Communicate transparently — with both technical and non-technical stakeholders.

Ethical ML engineering is a discipline — not a checklist.


10.16 Key Takeaways

  1. Ethics and privacy are as critical as accuracy or scalability.

  2. Fairness must be quantified and enforced using metrics.

  3. Transparency and explainability build trust and accountability.

  4. Privacy-preserving techniques (e.g., differential privacy, federated learning) enable responsible innovation.

  5. Global regulations like GDPR and EU AI Act set the compliance baseline.

  6. Engineers must proactively identify and mitigate bias, drift, and misuse.

  7. Ethical AI = aligned intent + transparent execution + continuous oversight.


10.17 Chapter Summary

  • ML systems have moral and societal implications beyond technical boundaries.

  • Fairness, transparency, privacy, and accountability must guide every ML decision.

  • Use frameworks and tools to audit, monitor, and explain models.

  • Protect data and users from misuse through privacy-preserving computation and secure infrastructure.

  • Responsible AI development is not just a legal obligation — it’s a moral duty.

  • The best machine learning systems don’t just work well — they work right.

