Machine learning engineering
[!IMPORTANT] Executive Summary for ML Engineers
Production ML is 10% modeling and 90% engineering. This guide summarizes Andriy Burkov's Machine Learning Engineering, focusing on building reliable, scalable, and maintainable systems.
Core Engineering Principles:
Data over Algorithms: A simple model with clean, high-quality data beats a complex model with noisy data.
Baseline First: Never build a complex neural net without a simple heuristic or linear baseline to measure incremental value.
Avoid Silent Failures: ML models don't "crash"; they degrade. Use distribution monitoring (PSI, K-S test) to detect data drift.
Train-Serve Skew: Ensure identical feature engineering pipelines for training and inference using Feature Stores or shared serialization.
Iterative Deployment: Use Shadow Mode (predict but don't serve) or Canary deployments to validate models on live traffic.
Human-in-the-Loop: Bridge the gap between low-confidence predictions and high-stakes decisions with manual review tiers.
Table of Contents (Chapter Summaries)
Chapter 4: Data Definition and Collection ... (Full notes below)
Chapter 1 – Introduction
1. Overview
This chapter defines foundational machine learning (ML) concepts and sets the stage for what machine learning engineering entails. It focuses on ensuring consistent understanding of terms and introduces the ML project life cycle.
1.1 Notation and Definitions
Data Structures
Scalar: Single numeric value (e.g., 5, −2.3), denoted x or a.
Vector: Ordered list of scalars (attributes), denoted x, w.
Example: a = [2, 3]; b = [−2, 5].
Attribute j → x(j).
Matrix: 2D array (rows × columns), denoted A, W.
Columns or rows can be viewed as vectors.
Set: Unordered collection of unique elements, denoted calligraphically (𝒮).
Intersection: 𝒮₁ ∩ 𝒮₂
Union: 𝒮₁ ∪ 𝒮₂
|𝒮| = number of elements in the set.
Capital Sigma (Σ) Notation
Summation is expressed as:
\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n
Euclidean norm (‖x‖): measures vector length
‖x‖ = \sqrt{\sum_{j} (x(j))^2}
Euclidean distance: distance between two vectors
‖a - b‖ = \sqrt{\sum_{j} (a(j) - b(j))^2}
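As a quick numeric check of the two formulas above, here is a minimal NumPy sketch using the example vectors a and b from this section:

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([-2.0, 5.0])

# Euclidean norm: square root of the sum of squared attributes
norm_a = np.sqrt(np.sum(a ** 2))    # equivalent to np.linalg.norm(a) ≈ 3.606

# Euclidean distance between the two vectors
dist_ab = np.linalg.norm(a - b)     # sqrt(4^2 + (-2)^2) = sqrt(20) ≈ 4.472

print(norm_a, dist_ab)
```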
1.2 What is Machine Learning
Definition
Machine learning is the subfield of computer science concerned with building algorithms that rely on collections of examples of some phenomenon to solve practical problems.
Steps:
Collect a dataset.
Algorithmically train a statistical model on that dataset.
Types of Learning
Supervised Learning
Dataset: labeled pairs (x, y)
Predicts class (classification) or value (regression).
Example: spam detection, disease prediction.
Goal: Build model f(x) → ŷ close to true y.
Unsupervised Learning
Dataset: unlabeled examples {x₁, x₂, …}
Finds structure in data.
Tasks:
Clustering: groups similar objects.
Dimensionality reduction: reduces feature space.
Outlier detection: finds anomalies.
Semi-Supervised Learning
Mix of labeled and unlabeled data.
Exploits large unlabeled datasets for better learning.
Reinforcement Learning
Agent interacts with environment.
Learns policy mapping states → optimal actions.
Goal: maximize long-term reward.
1.3 Data and Machine Learning Terminology
Data Used Directly vs. Indirectly
Direct data: forms feature vectors (e.g., word sequences in NER).
Indirect data: enriches features (e.g., dictionaries, gazetteers).
Raw vs. Tidy Data
Raw data: unstructured (e.g., text, images).
Tidy data: each row = example, each column = attribute (like a spreadsheet).
Needed for ML algorithms.
Categorical data may need numerical encoding (feature engineering).
Training, Validation, and Test Sets
Training set: builds the model.
Validation set: tunes hyperparameters and algorithm choice.
Test set: final unbiased performance check.
Must prevent data leakage between these sets.
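A minimal sketch of producing the three splits with two successive random splits in scikit-learn; the toy dataset and the 70/15/15 ratio are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for your real feature matrix X and labels y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First split off the training set (70%), then divide the remainder
# equally into validation (15%) and test (15%).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```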
Baseline
A simple heuristic or trivial model used as a reference for comparison.
Machine Learning Pipeline
Sequential data flow from raw data → trained model.
Includes:
data partitioning
imputation
feature extraction
balancing
dimensionality reduction
model training
Parameters vs. Hyperparameters
Parameters: learned from data (e.g., weights w, bias b).
Hyperparameters: set before training (e.g., tree depth, learning rate).
Classification vs. Regression
Classification: predicts discrete labels.
Binary (spam/not spam) or multiclass.
Regression: predicts continuous numerical values.
Model-Based vs. Instance-Based Learning
Model-Based: trains parameters (e.g., SVM, Logistic Regression).
Instance-Based: uses training data directly (e.g., kNN).
Shallow vs. Deep Learning
Shallow: directly learns from features (e.g., SVM).
Deep: multi-layered neural networks that learn hierarchical features.
Training vs. Scoring
Training: building the model.
Scoring: applying the model to new examples for predictions.
1.4 When to Use Machine Learning
Use ML when:
Problem too complex for explicit coding.
e.g., spam detection.
Problem changes over time.
e.g., dynamic web data, evolving user behavior.
Perceptive problems.
e.g., image, audio, or speech recognition.
Unstudied phenomena.
e.g., genetic predictions, log anomaly detection.
Simple objective.
e.g., yes/no, single-number outputs.
Cost-effective.
Data, computation, and maintenance cost justify the outcome.
1.5 When Not to Use Machine Learning
Avoid ML when:
Full explainability is required.
Error cost is extremely high.
Traditional code can solve it more cheaply.
Data is scarce or expensive.
A lookup table or heuristic suffices.
The system logic is static and rarely changes.
1.6 What is Machine Learning Engineering (MLE)
Definition:
Application of ML and software engineering principles to build scalable, reliable, maintainable ML systems.
Key Responsibilities:
Data collection and preprocessing.
Feature engineering.
Efficient model training and deployment.
Model monitoring and maintenance.
Preventing and handling silent failures (degradation due to data drift).
MLE bridges the gap between data analysis and software engineering.
1.7 Machine Learning Project Life Cycle
Stages:
Goal definition
Data collection & preparation
Feature engineering
Model training
Model evaluation
Model deployment
Model serving
Model monitoring
Model maintenance
Loops exist to collect more data or re-engineer features if needed.
1.8 Summary of Key Points
ML learns from examples; supervised predicts, unsupervised organizes.
Tidy, numerical data is essential.
Split data into training, validation, and test sets.
Establish a baseline and design a full pipeline.
Understand when ML adds value and when it doesn’t.
MLE ensures production-readiness and stability of ML systems.
ML projects follow a life cycle from data collection to model maintenance.
Chapter 2: Before the Project Starts
This chapter focuses on the planning and pre-execution stage of a machine learning (ML) project — before any data collection, model building, or deployment begins.
It discusses how to prioritize, define, and structure ML projects and teams to maximize their impact and minimize waste of time, cost, and effort.
2.1 Prioritization of Machine Learning Projects
Before starting, every ML project must be prioritized since:
Engineering and computational resources are limited.
The backlog of ideas and potential ML applications is often long.
ML projects have high uncertainty and hidden costs.
Key Factors in Prioritization
Impact
Cost
2.1.1 Impact of Machine Learning
High-impact ML projects are those where:
Machine Learning replaces complex manual logic
If an existing rule-based system is overly complicated or difficult to maintain, ML can replace it effectively.
Example: replacing hundreds of nested business rules with a model that learns the patterns from data.
Machine Learning adds significant benefit despite imperfection
Even imperfect predictions can lead to big savings.
Example:
A dispatching system classifies incoming requests as “easy” or “difficult.”
If 90% of “easy” requests are correctly automated, human time is saved significantly, even if 10% are misclassified.
Impact is high when:
ML simplifies a complex process.
ML improves scalability.
ML generates substantial business or user value.
2.1.2 Cost of Machine Learning
The cost of an ML project depends on three key factors:
Problem Difficulty
Are there existing algorithms/libraries available?
How much compute power or training time is required?
Data Availability and Cost
Can data be generated automatically?
What is the cost of labeling or annotation?
How much data is required for sufficient accuracy?
Accuracy Requirements
Cost grows superlinearly with accuracy.
Achieving 99% accuracy costs far more than 90%.
You must balance acceptable accuracy with cost.
Cost of errors must be compared with the cost of perfection.
Rule of thumb:
Going from 90% → 99% accuracy often multiplies the cost several times.
2.2 Estimating the Complexity of a Machine Learning Project
Unlike traditional software projects, ML projects have many unknowns. Estimating their complexity is hard.
2.2.1 The Unknowns
Before starting, you usually don’t know:
If the required model accuracy is achievable in practice.
How much data is needed.
What features are relevant or sufficient.
What model size or architecture is required.
How much time training and experimentation will take.
If required accuracy > 99%, expect significant complications — often due to lack of labeled data or data imbalance.
Benchmark:
Human-level performance can serve as a realistic upper limit for the model.
2.2.2 Simplifying the Problem
To make complexity estimation manageable:
Start with a simpler version of the problem.
Example: If classifying into 1,000 topics, first solve it for 10 topics + “Other.”
Once solved, extrapolate cost, time, and feasibility for the full version.
Use data slices.
Break down the data by location, product type, or age group.
Train models on subsets first to learn constraints.
Use pilot projects.
Small-scale pilots reveal whether full-scale implementation is worth it.
Note: As the number of classes grows, the required data grows superlinearly.
2.2.3 Nonlinear Progress
ML project progress is nonlinear:
Rapid improvement early → plateau later.
Sometimes no progress until new data/features are added.
Progress may stagnate while waiting for labeling or new data pipelines.
Recommendation:
Log time and resources for every activity.
Keep stakeholders informed that progress will not be linear.
Use the 80/20 rule:
80% of progress comes from the first 20% of effort.
2.3 Defining the Goal of a Machine Learning Project
An ML project’s goal is to build a model that contributes to solving a business problem.
2.3.1 What a Model Can Do
A model’s role in a system can take many forms. It can:
Automate: automatically approve or reject transactions.
Alert / Prompt: notify an admin of suspicious activity.
Organize: rank or group content for users.
Annotate: highlight key text segments.
Extract: identify names and locations in text.
Recommend: suggest items, users, or products.
Classify: assign emails as "spam" or "not spam".
Quantify: predict house prices.
Synthesize: generate text, images, or speech.
Answer questions: "Does this text describe that image?"
Transform: summarize text or translate language.
Detect novelty/anomaly: identify abnormal behavior or data patterns.
If your problem doesn’t fit one of these categories, it might not be suitable for ML.
2.3.2 Properties of a Successful Model
A model is successful if it satisfies four criteria:
Meets specifications:
Input/output structure, accuracy, latency, etc.
Brings business value:
Reduces cost, increases sales, improves user retention.
Helps users:
Measured via engagement, satisfaction, or productivity.
Is scientifically rigorous:
Predictable: behaves consistently for data from the same distribution.
Reproducible: can be rebuilt from same data and hyperparameters.
Defining the Right Goal
Poorly defined goals lead to wasted effort.
Example (Cat vs. Dog problem):
Goal: Allow the client’s own cat inside, block the dog.
Bad definition: Train model to distinguish cats vs. dogs → lets any cat in!
Correct definition: Identify the client’s cat specifically.
Lesson:
Define what the model must actually achieve, not what seems similar.
Multiple Stakeholders
Different teams have different objectives:
Product owner: maximize user engagement.
Executive: increase revenue.
Finance: reduce costs.
The ML engineer must balance these goals and translate them into:
Choice of inputs/outputs
Cost function
Performance metric
2.4 Structuring a Machine Learning Team
Two main organizational cultures exist for ML teams:
2.4.1 Two Cultures
Specialized Roles (Collaborative Model)
Data analysts (scientists) handle modeling.
Software engineers handle deployment and scaling.
Pros:
Each expert focuses on their strength.
Cons:
Difficult handoffs; integration challenges.
Hybrid Roles (Full-Stack ML Engineers)
Every member combines ML and software skills.
Pros:
Easier integration, faster iteration.
Cons:
Harder to hire; requires versatile talent.
2.4.2 Members of a Machine Learning Team
Typical Roles and Responsibilities:
Data Analysts / Scientists
Analyze data, build models, test hypotheses.
Machine Learning Engineers
Build production-ready ML systems, handle scalability, optimization, and monitoring.
Data Engineers
Build data pipelines (ETL: Extract, Transform, Load), integrate data sources, design data architecture.
Labeling Experts / Annotators
Manually label data, validate labeling quality, build labeling tools, manage outsourced teams.
Domain Experts
Provide business and contextual knowledge for feature design and target definition.
DevOps Engineers
Automate deployment, CI/CD pipelines, scaling, and monitoring of ML models.
Team Collaboration Tips
Work with domain experts early to define inputs, outputs, and success metrics.
Communicate trade-offs (accuracy vs. cost, latency vs. complexity).
Encourage shared understanding between business, data, and engineering teams.
2.5 Key Takeaways
Prioritize ML projects by impact vs. cost; start with simpler pilot versions.
Estimate complexity carefully—progress is nonlinear, data requirements unpredictable.
Define clear, measurable goals that tie directly to business outcomes.
Form the right team structure:
Specialists for large orgs.
Generalist ML engineers for smaller ones.
Engage domain experts—they bridge business context and ML logic.
Set realistic accuracy and cost expectations early.
2.6 Chapter Summary
Before starting, evaluate feasibility and business value.
High impact ML projects simplify complex or costly processes.
Cost grows superlinearly with desired accuracy.
Simplify problems into smaller experiments.
Define goals precisely and align them with stakeholders.
Build a multidisciplinary ML team with clear communication between roles.
Understand that project progress will be nonlinear and uncertain, requiring flexibility and iteration.
Chapter 3: Framing the Problem and Project
This chapter explains how to convert a business or research problem into a concrete machine learning (ML) task.
It discusses framing, defining objectives, metrics, and the nature of predictions, while avoiding common pitfalls in the problem-definition stage.
3.1 Importance of Problem Framing
Before collecting data or building models, you must:
Define what exactly is being predicted.
Understand how predictions are used.
Specify measurable objectives.
The quality of an ML system is directly dependent on how well the problem is framed at the beginning.
A poorly framed problem leads to:
Wasted time and resources.
Misaligned goals.
Ineffective models.
3.2 From Business Problem to Machine Learning Problem
Step 1: Understand the Business Objective
Identify the desired business outcome (e.g., “increase retention”, “reduce churn”).
Define success in measurable terms.
Example: “Reduce customer churn by 15% over 6 months.”
Step 2: Define the Decision
What decision will be made based on the model output?
Example: “Which customers should receive retention offers?”
The decision determines what kind of prediction you need.
Step 3: Translate to Prediction
Turn the decision into a prediction problem.
Business: “Which users are likely to churn?”
ML: “Predict the probability a user will churn within 30 days.”
3.3 Target Variable (Label) Definition
The target variable (y) is what the model learns to predict.
Common Pitfalls
Target Leakage
Occurs when the label includes data that would not be available at prediction time.
Example: predicting tomorrow’s stock price using tomorrow’s trading volume.
Ill-Defined Targets
When labels are inconsistent or ambiguous.
Example: “Bad customer” — unclear definition.
Changing Targets
When business objectives shift mid-project, forcing relabeling.
How to Define a Good Target
Based on data available at prediction time.
Consistent across samples.
Aligned with the business decision.
3.4 Features (Input Variables)
Features are input variables (x₁, x₂, …, xₙ) used for prediction.
Good features contain relevant, predictive information available before the event occurs.
Bad features include post-event data (causing leakage).
Example:
Predicting flight delays.
Use: weather forecasts, aircraft type, route.
Don’t use: actual arrival time.
3.5 Static vs. Dynamic Predictions
Static Prediction
The model makes a one-time prediction per input (e.g., loan approval, image classification).
One-off, fixed features, no feedback loop.
Dynamic Prediction
The model predicts continuously or repeatedly as new data arrives (e.g., fraud detection, recommendation systems).
Needs real-time data ingestion, continuous training, and feedback monitoring.
3.6 The Prediction Horizon
Prediction horizon = the time gap between when input data is collected and when the event (label) happens.
Example:
Predict churn 30 days in advance → horizon = 30 days.
Choosing horizon affects:
Data availability: shorter horizons → more data, less uncertainty.
Business value: longer horizons → more time to act, but less accuracy.
3.7 Setting a Baseline
A baseline is the simplest possible model or heuristic that performs the task.
Examples:
Predicting the most frequent class (majority classifier).
Using mean or median for regression.
Rule-based thresholds.
Baselines are critical because:
They reveal whether your ML model truly adds value.
They act as a sanity check for initial experiments.
If your ML model doesn’t outperform the baseline, revisit feature design or problem framing.
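A minimal baseline sketch using scikit-learn's DummyClassifier (majority-class strategy) on a toy imbalanced dataset; any candidate model should clearly beat this reference before it is considered useful:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))  # majority-class heuristic
print("model accuracy:   ", model.score(X_test, y_test))     # incremental value over it
```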
3.8 Metrics: How to Measure Success
Evaluation metrics quantify how well the model achieves the business goal.
Types of Metrics
Classification Metrics
Accuracy, Precision, Recall, F1 Score, ROC-AUC, Log loss.
Choose based on imbalance and business costs.
Regression Metrics
Mean Absolute Error (MAE), Mean Squared Error (MSE), R² Score.
Ranking / Recommendation Metrics
Precision@k, Recall@k, Mean Average Precision (MAP), NDCG.
Business KPIs
Actual real-world outcomes, e.g., revenue lift, reduced churn rate.
Cost-Based Metrics
Sometimes decisions depend on error costs:
False Positive (FP) cost ≠ False Negative (FN) cost.
Example:
FP: sending a retention offer unnecessarily.
FN: losing a valuable customer.
The metric must align with the business objective.
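Because FP and FN costs differ, it can help to score models by expected business cost rather than accuracy. A small sketch with made-up cost values (the business_cost helper and the numbers are illustrative, not from the book):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-error costs: an unnecessary retention offer vs. a lost customer
COST_FP = 5.0     # cost of offering a discount to a non-churner
COST_FN = 100.0   # cost of missing a customer who then churns

def business_cost(y_true, y_pred):
    """Total cost of a prediction set under asymmetric error costs."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * COST_FP + fn * COST_FN

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])
print(business_cost(y_true, y_pred))  # 1 FP * 5 + 1 FN * 100 = 105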
3.9 Offline vs. Online Evaluation
Offline Evaluation
Uses historical labeled data; faster and cheaper (cross-validation, holdout test sets).
Good for experimentation, but the data distribution must match production.
Online Evaluation
Tests the model in production and measures real-world effect (A/B tests, live metrics).
Exposes the model to live users; measures impact on KPIs like conversion or engagement.
Best practice: use both — offline to validate technically, online to validate business value.
3.10 Framing Pitfalls
Optimizing for the wrong metric
E.g., focusing on accuracy when recall matters more.
Ignoring user experience
Even accurate models may hurt usability if their decisions seem random to users.
Failing to consider the feedback loop
Example: a model that changes what data it later sees (self-influence).
Poorly defined prediction timing
Predicting an event after it has already happened.
Ignoring business constraints
A perfect model may be useless if it’s too slow, expensive, or non-interpretable.
3.11 Example: Credit Default Prediction
Business Goal
Reduce financial losses from unpaid loans.
Machine Learning Problem
Predict the probability a new loan applicant will default.
Key Components
Target (y): 1 = defaulted, 0 = repaid.
Features: income, credit history, number of loans, etc.
Prediction horizon: 6 months from loan approval.
Metric: ROC-AUC or cost-weighted accuracy.
Baseline: Rule-based scoring system.
Decisions
High-risk → reject loan.
Medium-risk → manual review.
Low-risk → approve.
3.12 Iterative Framing
Problem framing is not one-time; it’s iterative:
Start with a draft problem definition.
Run pilot experiments.
Analyze results and feedback.
Refine features, targets, and metrics.
This cycle continues until:
Predictions are actionable.
Model aligns with business value.
3.13 Human-in-the-Loop Design
In many real systems:
Models assist, not replace, humans.
Final decisions are hybrid (model + expert).
Benefits:
Reduces risk of model mistakes.
Increases interpretability and trust.
Enables incremental automation.
3.14 Chapter Summary
Proper problem framing bridges the gap between business goals and ML models.
Clearly define:
The decision to be made.
The target variable (label).
The prediction horizon.
The evaluation metric.
Always start with a baseline.
Avoid leakage, ill-defined labels, and misaligned metrics.
Evaluate both offline (for accuracy) and online (for business impact).
Keep framing iterative — refine based on outcomes.
Design systems with a human-in-the-loop where necessary.
Chapter 4: Data Definition and Collection
This chapter focuses on the data stage of the machine learning project lifecycle — defining, identifying, collecting, and labeling data.
It emphasizes that data quality is more critical than model complexity and explores practical methods for gathering reliable, representative, and unbiased data.
4.1 The Central Role of Data
Machine Learning ≈ Data + Algorithms
The algorithm defines how the model learns,
but data defines what the model learns.
Garbage in → Garbage out.
Even the best algorithm cannot overcome bad data.
Key Principle:
Better data almost always beats a better model.
4.2 What Is Data in ML Projects
In ML, data is a collection of examples representing the phenomenon the model must learn.
Each example typically includes:
Input features (x₁, x₂, …, xₙ)
Target label (y)
The dataset should reflect the distribution of real-world data the model will see in production.
4.3 Defining the Data Requirements
Before data collection begins, you must define what data you need and why.
4.3.1 Identify the Unit of Observation
What is a single data point?
Example: one transaction, one user session, one product.
Must match how predictions are made later.
4.3.2 Define the Input Features
Ask:
What information is available before the prediction moment?
Which features are causal or correlated with the outcome?
Avoid including:
Information not available at prediction time (leads to data leakage).
Features that will change meaning over time (nonstationarity).
4.3.3 Define the Label (Target)
Clearly specify the event being predicted.
Example: “User churn = no login in the next 30 days.”
Ensure the label is objectively measurable and consistent across data points.
4.3.4 Define Sampling Strategy
Decide how examples are selected:
Random sampling
Stratified sampling (ensure representation of all classes)
Time-based sampling (for temporal data)
4.4 Data Quality Dimensions
Good data satisfies multiple dimensions of quality:
Completeness: missing values are minimal (e.g., 95% of users have recorded ages).
Consistency: same format across systems (e.g., dates unified as YYYY-MM-DD).
Validity: values fall within an acceptable range (e.g., "Age" between 0 and 120).
Uniqueness: no duplicates (e.g., one record per customer).
Timeliness: data reflects current conditions (e.g., real-time transaction logs).
Representativeness: matches the production environment (e.g., similar distribution of regions/cities).
Always measure and monitor these quality metrics before model training.
4.5 Data Collection Sources
Data can come from multiple sources, depending on the problem.
4.5.1 Internal Data
Company’s existing systems (databases, APIs, logs, CRMs, etc.)
Usually the easiest and cheapest to obtain.
Must check for bias and completeness.
4.5.2 External Data
Purchased or open-source datasets.
Public APIs, web scraping, government or academic data repositories.
Must ensure licensing and privacy compliance (e.g., GDPR).
4.5.3 Synthetic Data
Generated artificially to simulate rare or missing cases.
Useful for:
Balancing imbalanced datasets.
Privacy preservation.
Testing edge cases.
Synthetic data must resemble the real data distribution; otherwise, it can mislead the model.
4.6 Data Labeling
4.6.1 The Role of Labels
Labels are ground truth — the foundation of supervised learning.
Incorrect labels → incorrect learning.
4.6.2 Labeling Methods
Manual Labeling
Humans label data via tools or platforms.
Example: Amazon Mechanical Turk, Labelbox.
Must have clear guidelines and quality checks.
Programmatic Labeling
Automatically assign labels using rules, heuristics, or weak models.
Often used for initial labeling to save cost.
Semi-Supervised Labeling
Combine small labeled data + large unlabeled data.
Use self-training or pseudo-labeling.
4.6.3 Labeling Quality Assurance
Consensus labeling: multiple annotators per item → use majority vote.
Inter-annotator agreement (IAA):
Measures label consistency across annotators.
Example: Cohen’s kappa (κ).
Spot-checking: randomly audit a subset of labels.
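For two annotators, inter-annotator agreement can be computed directly with scikit-learn's cohen_kappa_score; a small sketch with toy labels:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two annotators to the same ten items (toy example)
annotator_1 = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```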
4.7 Dealing with Class Imbalance
When one class appears much less frequently (e.g., 1% fraud, 99% normal).
Solutions:
Data-Level Approaches:
Oversampling: duplicate minority class.
SMOTE (Synthetic Minority Oversampling Technique): generate synthetic samples.
Undersampling: reduce majority class size.
Algorithm-Level Approaches:
Assign class weights inversely proportional to frequency.
Use metrics robust to imbalance (AUC, F1, Precision-Recall).
The chosen technique depends on the problem’s cost of false negatives vs. false positives.
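An algorithm-level sketch: compute class weights inversely proportional to frequency and pass them to the estimator (scikit-learn supports this directly via class_weight); the toy dataset is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy data: roughly 1% positive class
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))   # the minority class gets a much larger weight

# Equivalent shortcut: let the estimator compute balanced weights itself
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```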
4.8 Avoiding Data Leakage
Data leakage occurs when:
Information from the future or test set influences training.
The model indirectly learns from data it won’t have at prediction time.
Examples:
Using “time of resolution” when predicting issue occurrence.
Normalizing using global mean (instead of training mean).
Prevention:
Ensure strict train/validation/test split.
Apply transformations (scaling, encoding) after splitting.
Use pipelines to isolate data preprocessing from leakage.
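A minimal scikit-learn Pipeline sketch: because the scaler is fit inside the pipeline, it only ever sees training data, which prevents the global-mean leakage described above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),            # statistics computed from training data only
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)                   # fit happens strictly on X_train
print(pipe.score(X_test, y_test))            # test data is only transformed, never fit
```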
4.9 Temporal Data Considerations
For time-series or temporal data:
Always maintain chronological order in train/test splits.
Avoid using future data for past predictions.
Sliding Window Techniques:
Use data up to time t to predict outcomes at t + k.
Continuously retrain models as new data arrives.
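For chronological evaluation, scikit-learn's TimeSeriesSplit always trains on earlier observations and validates on later ones; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # observations ordered by time
y = np.arange(20)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede test indices: no future data leaks into the past
    print(f"fold {fold}: train up to t={train_idx[-1]}, test t={test_idx[0]}..{test_idx[-1]}")
```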
4.10 Data Representativeness
Your dataset must represent the real-world population where the model will operate.
Common Issues
Selection Bias: training data differs from live data.
Example: only urban customers in training data → poor rural predictions.
Historical Bias: past data reflects outdated or unfair decisions.
Example: old hiring data biased against certain demographics.
Solutions
Collect diverse, up-to-date data.
Monitor data drift — changes in distribution between training and production.
4.11 Privacy, Ethics, and Compliance
Privacy Concerns
Personally Identifiable Information (PII) must be:
Protected (encryption, anonymization).
Used under consent (GDPR, HIPAA compliance).
Ethical Considerations
Bias in data leads to biased models.
Example: hiring algorithm favoring one gender due to biased training data.
Best Practices
Conduct data audits for bias and fairness.
Use differential privacy or federated learning when applicable.
Document data origin, purpose, and usage.
4.12 Data Storage and Access
Store data in a version-controlled, secure, and auditable manner.
Use metadata:
Source
Date collected
Preprocessing steps
Schema version
Enable reproducibility by tracking data versions along with model versions (DataOps).
4.13 Data Collection Pipeline
Typical stages in the data pipeline:
Data Ingestion
Collect from databases, APIs, logs, or sensors.
Data Validation
Check schema, missing values, outliers.
Data Cleaning
Handle missing data, incorrect entries, duplicates.
Data Transformation
Encode, normalize, extract features.
Storage
Centralized data lake or warehouse.
Access
Enable secure queries and sampling for model training.
Use automated validation and monitoring to detect anomalies early.
4.14 The Iterative Nature of Data Work
Data collection and cleaning are never one-time tasks:
New data changes the distribution.
Features evolve.
Systems and sensors introduce new errors.
Cycle:
→ Collect → Validate → Train → Deploy → Monitor → Recollect → Retrain
4.15 Key Takeaways
Define data requirements clearly before collecting anything.
Ensure data quality — completeness, consistency, validity, representativeness.
Avoid leakage and respect the prediction moment (only use data available then).
Handle imbalance through sampling or weighting.
Labeling accuracy determines model accuracy — invest in quality assurance.
Respect privacy and ethics — avoid collecting unnecessary personal data.
Monitor data drift and continuously refresh datasets.
Document everything — sources, transformations, schema, and collection logic.
4.16 Chapter Summary
The foundation of ML systems lies in high-quality, representative, and ethical data.
Define features, labels, and sampling carefully to align with the business use case.
Establish automated data pipelines with validation and versioning.
Balance data quality, cost, and compliance for sustainable ML development.
Remember: Data work is iterative, not linear.
Chapter 5: Data Preparation
This chapter covers how to clean, transform, and structure raw data into a usable format for model training.
It emphasizes that data preparation is one of the most time-consuming and crucial phases of any machine learning (ML) project — often taking up 60–80% of the total effort.
5.1 The Purpose of Data Preparation
Machine learning algorithms require data in a numerical, structured, and consistent format.
However, raw data is often messy — with missing values, duplicates, outliers, or mixed formats.
“The quality of your model depends more on your data preparation than on the algorithm you choose.”
Goals of Data Preparation:
Improve data quality (cleaning, fixing errors, removing noise).
Ensure compatibility with ML algorithms.
Enhance signal strength by transforming features.
Avoid data leakage and maintain reproducibility.
5.2 Steps of Data Preparation
The process generally includes the following stages:
Data Cleaning
Data Transformation
Feature Engineering
Feature Selection
Data Splitting
Data Balancing
Each step transforms raw data toward a model-ready dataset.
5.3 Data Cleaning
Data cleaning fixes errors, inconsistencies, and missing values.
5.3.1 Detecting and Removing Duplicates
Remove identical or near-identical rows.
Example: duplicate user transactions or logs.
5.3.2 Handling Missing Values
Causes: human error, incomplete logs, faulty sensors, etc.
Strategies:
Remove records: when missingness is random and rare.
Impute values: fill using:
Mean/median (for numeric)
Mode (for categorical)
KNN or regression-based imputation
Special category: assign “Unknown” for categorical features.
Imputation introduces bias if not done carefully — especially if data is not missing at random.
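A small imputation sketch with scikit-learn's SimpleImputer: median for a numeric column, a constant "Unknown" category for a categorical one; the tiny DataFrame is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Paris", np.nan, "Lyon", "Paris"],
})

num_imputer = SimpleImputer(strategy="median")                       # numeric: median fill
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")  # categorical: special category

df["age"] = num_imputer.fit_transform(df[["age"]]).ravel()
df["city"] = cat_imputer.fit_transform(df[["city"]]).ravel()
print(df)
```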
5.3.3 Detecting Outliers
Outliers can distort model learning and skew statistics.
Detection Methods:
Z-score or IQR methods.
Isolation Forest or DBSCAN for high-dimensional data.
Handling:
Cap at a threshold (winsorization).
Remove or transform (e.g., log-transform skewed distributions).
5.3.4 Correcting Data Types
Convert text numbers (“10,000”) to numeric.
Standardize date/time formats.
Ensure categorical data uses consistent naming.
5.3.5 Resolving Inconsistencies
Example: “USA”, “United States”, and “U.S.” must map to one category.
5.4 Data Transformation
Raw data often needs to be reshaped or scaled before model ingestion.
5.4.1 Scaling and Normalization
ML models like linear regression, SVM, and neural networks are sensitive to feature scales.
Techniques:
Normalization (Min-Max Scaling):
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
Scales features to [0, 1].
Standardization (Z-score Scaling):
x' = \frac{x - \mu}{\sigma}
Transforms to mean = 0, std = 1.
When to use:
Normalization → bounded models (e.g., neural nets with sigmoid/tanh).
Standardization → unbounded models (e.g., linear regression).
Always fit the scaler on the training set only to avoid leakage.
5.4.2 Encoding Categorical Variables
ML algorithms require numeric input, so categorical data must be encoded.
Types of encoding:
One-Hot Encoding:
Create binary column for each category.
Suitable for low-cardinality features.
Ordinal Encoding:
Assign integer ranks to ordered categories.
Target / Mean Encoding:
Replace each category with the average of the target variable.
Useful for high-cardinality features.
Use with caution — target encoding can lead to overfitting; apply regularization or cross-fold encoding.
5.4.3 Text Data Transformation
For textual data:
Tokenization: split text into words or n-grams.
Stopword removal: eliminate common words (like “the”, “a”).
Stemming/Lemmatization: reduce words to root forms.
Vectorization:
Bag of Words (BoW)
TF-IDF (Term Frequency–Inverse Document Frequency)
Word Embeddings (Word2Vec, GloVe)
Contextual Embeddings (BERT)
5.4.4 Handling Date and Time Features
Dates and timestamps can be converted into:
Year, month, day, weekday.
Hour of day (cyclic encoding for periodic data).
Time differences (e.g., days since signup).
Use sin/cos encoding for cyclical features:
\sin(2\pi t/T), \quad \cos(2\pi t/T)
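A small sketch of cyclic (sin/cos) encoding for hour of day (period T = 24), so that 23:00 and 00:00 end up close together in feature space:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})
T = 24  # period of the cycle

df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / T)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / T)
print(df)   # hour 23 and hour 0 are now neighbours on the unit circle
```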
5.4.5 Aggregation and Binning
Binning: convert continuous variables into discrete intervals.
Example: age → child, adult, senior.
Aggregation: summarize data (e.g., average purchases per week).
These help reduce noise and variance.
5.5 Feature Engineering
Feature engineering = creating new informative features that help the model learn patterns more easily.
5.5.1 Derived Features
Ratios: “price per unit”, “debt-to-income ratio”.
Interactions: “feature1 × feature2”.
Temporal trends: “number of logins in last 7 days”.
5.5.2 Domain Knowledge
Use domain expertise to craft meaningful signals.
Example (banking): combine “income” and “loan amount” → “loan-to-income ratio”.
5.5.3 Feature Reduction
Remove redundant or highly correlated features.
Use feature importance from models (Random Forests, XGBoost).
Dimensionality reduction (PCA, t-SNE, autoencoders) when needed.
5.6 Data Splitting
To evaluate model performance fairly, data is split into:
Training set (60–80%): model learning.
Validation set (10–20%): hyperparameter tuning.
Test set (10–20%): final evaluation.
5.6.1 Cross-Validation
K-Fold CV: split into k folds, rotate validation set.
Reduces variance and ensures robustness.
5.6.2 Temporal Splits
For time-series data, preserve chronological order (no random shuffling).
5.6.3 Stratified Sampling
Maintain same label distribution across splits (for classification).
5.7 Data Balancing
If dataset classes are imbalanced (e.g., fraud = 1%, non-fraud = 99%), balance them via:
Resampling Techniques
Oversampling minority class (SMOTE, ADASYN).
Undersampling majority class.
Algorithmic Solutions
Adjust class weights (e.g., class_weight='balanced').
Threshold Tuning
Shift decision boundary to favor recall or precision.
5.8 Data Leakage Prevention
Data leakage is a critical issue that invalidates model evaluation.
Avoid by:
Splitting data before preprocessing.
Fitting scalers and encoders only on the training set.
Avoiding target-derived features (e.g., average target per user).
Keeping future data separate in time-based tasks.
5.9 Automating Data Preparation
Data pipelines can automate cleaning and transformation.
Tools:
Python libraries: Pandas, scikit-learn Pipelines, Featuretools.
DataOps / MLOps tools: Apache Airflow, Kubeflow, MLflow, Prefect.
Feature Stores: centralized repositories for preprocessed features (e.g., Feast, Tecton).
Benefits:
Reproducibility
Consistency
Reduced leakage
Scalability
5.10 Data Versioning and Reproducibility
Reproducibility requires tracking:
Data versions
Preprocessing scripts
Random seeds
Train/validation splits
Tools:
DVC (Data Version Control), Git LFS, MLflow Tracking.
5.11 Feature Scaling in Production
The same transformation logic used during training must be applied consistently during inference.
Hence, the preprocessing pipeline must be serialized (saved) with the model.
Example: If you used a StandardScaler, the same mean and std must be applied to production data.
5.12 Key Takeaways
Data preparation is the foundation of model quality.
Clean, validate, and standardize data before modeling.
Handle missing data, outliers, and categorical features carefully.
Perform scaling, encoding, and feature engineering thoughtfully.
Split data properly — avoid leakage and maintain representativeness.
Automate preprocessing for consistency and reproducibility.
Always validate transformations using the training set only.
Document and version your datasets and pipelines.
5.13 Chapter Summary
Data preparation ensures that raw data is structured, complete, and consistent.
Major steps include cleaning, transformation, encoding, feature engineering, and splitting.
Leakage prevention and reproducibility are key quality pillars.
Automation tools and pipelines make the process scalable.
High-quality data preparation is what separates academic models from production-grade systems.
Chapter 6: Feature Engineering and Selection
This chapter explores how to design, construct, and select features that most effectively represent the underlying structure of your data.
Feature engineering is one of the most critical stages in the ML lifecycle — it often determines the ultimate success of the model.
“The quality of your features largely determines the quality of your model.”
6.1 The Importance of Features
A feature is a measurable property or characteristic of a phenomenon being observed.
In ML, features are the inputs (x₁, x₂, …, xₙ) the model uses to learn patterns and make predictions.
Why Feature Engineering Matters
Models learn patterns, not raw data.
Poor features → poor model performance, even with complex algorithms.
Good features can make even simple models perform competitively.
Algorithms amplify the information in data — they don’t create it.
Hence, feature engineering adds signal before training begins.
6.2 The Feature Engineering Process
Feature engineering is iterative and involves:
Understanding the data and domain.
Creating new features from existing ones.
Transforming features to better reflect relationships.
Evaluating their usefulness via experiments or metrics.
Selecting the best subset.
6.3 Types of Features
Numerical: quantitative, continuous or discrete (e.g., age, price, temperature).
Categorical: qualitative, nominal or ordinal (e.g., country, education level).
Textual: sequences of words or phrases (e.g., customer reviews, tweets).
Temporal: time-based or sequential (e.g., timestamps, intervals).
Spatial: geographic or positional (e.g., GPS coordinates, pixel grids).
Derived / Engineered: computed from existing data (e.g., ratios, interactions, aggregations).
6.4 Domain Understanding and Feature Design
Feature engineering begins with understanding the domain:
Know what influences the target variable.
Understand data generation and collection context.
Example (Loan Default Prediction):
Domain knowledge reveals that “loan-to-income ratio” or “credit utilization” are better predictors than raw “income” or “loan amount.”
Key Questions:
What features correlate with the target?
What transformations might reveal hidden relationships?
Are there nonlinear effects that need to be captured?
6.5 Feature Construction Techniques
6.5.1 Mathematical Transformations
Used to adjust feature distributions or linearize relationships.
log(x + 1): reduce skewness, handle large ranges.
√x: stabilize variance.
x², x³: capture nonlinear patterns.
1/x: highlight inverse relationships.
Standardization / Normalization: adjust scales.
6.5.2 Combining Features
Combine multiple raw features to create new, meaningful attributes.
Examples:
Ratio: price / income
Difference: max_temp - min_temp
Product: height × weight (for BMI)
Aggregation: avg_spent_per_week
These help models learn relationships directly rather than relying on complex nonlinearities.
6.5.3 Interaction Features
Capture combined effects of two or more features.
Example: Interaction between “education level” and “years of experience” for salary prediction.
In linear models, interactions can help simulate nonlinear patterns.
6.5.4 Temporal Features
Derived from timestamps:
Extract: hour, day, month, weekday.
Time since event: “days since last login.”
Rolling aggregates: “average purchase per week.”
Cyclical encoding for periodicity:
x_{sin} = \sin\left(2\pi \frac{t}{T}\right), \quad x_{cos} = \cos\left(2\pi \frac{t}{T}\right)
6.5.5 Text Features
Transform unstructured text into numerical representations.
Bag-of-Words (BoW)
Count word frequency per document.
TF-IDF
Weigh rare but important words higher.
Word Embeddings
Map words to dense vectors (Word2Vec, GloVe).
Contextual Embeddings
Dynamic representations (BERT, GPT, etc.).
6.5.6 Encoding Categorical Variables
Different encodings for different use cases:
One-Hot Encoding
Small number of categories.
Ordinal Encoding
Ordered categories.
Target Encoding
Large cardinality, with regularization.
Binary / Hash Encoding
Very high-cardinality features.
6.5.7 Aggregation Features
Summarize grouped data (by user, session, region, etc.).
Examples:
average_purchase_per_user
max_rating_by_category
count_of_transactions_last_30_days
Aggregation captures behavioral or temporal context that individual events miss.
6.6 Handling Feature Correlations and Redundancy
Highly correlated features can:
Inflate model complexity.
Lead to unstable coefficients (in linear models).
Reduce interpretability.
Approaches:
Compute correlation matrix and remove duplicates (|r| > 0.9).
Use Variance Inflation Factor (VIF) for multicollinearity.
Apply Principal Component Analysis (PCA) for dimensionality reduction.
6.7 Feature Scaling
Many ML algorithms assume that all features are on comparable scales.
Min-Max Scaling: x' = (x - min) / (max - min); common for neural networks.
Standardization: x' = (x - μ) / σ; linear models, SVM.
Robust Scaling: x' = (x - median) / IQR; outlier-resistant.
Always fit scalers on training data only — never include test data.
6.8 Feature Selection
After creating features, not all are useful.
Feature selection helps remove irrelevant, redundant, or noisy variables.
Goals:
Reduce overfitting.
Improve interpretability.
Decrease computation time.
6.8.1 Filter Methods (Statistical Tests)
Independent of any ML algorithm.
Correlation / Mutual Information: continuous features.
Chi-Square Test: categorical features.
ANOVA (F-test): group mean comparison.
Example:
Select top k features with highest information gain or correlation with target.
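A filter-method sketch: keep the top k features by mutual information with the target using scikit-learn's SelectKBest; the toy dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (500, 5): only the 5 highest-scoring features remain
print(selector.get_support(indices=True))  # indices of the retained features
```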
6.8.2 Wrapper Methods
Use a predictive model to evaluate feature subsets.
Recursive Feature Elimination (RFE)
Iteratively remove least important features using model feedback.
Forward Selection
Start with none, add best-performing one by one.
Backward Elimination
Start with all, remove least useful progressively.
Computationally expensive, but often yields better results.
6.8.3 Embedded Methods
Feature selection occurs during model training.
L1 Regularization (Lasso): linear models; forces some coefficients to zero.
Tree-based Feature Importance: decision trees, random forests, XGBoost; importance = reduction in impurity or information gain.
Embedded methods are efficient and widely used in production.
6.9 Dimensionality Reduction
Used when there are hundreds or thousands of features.
PCA (Principal Component Analysis)
Project data onto lower-dimensional axes maximizing variance.
t-SNE / UMAP
Visualize high-dimensional structure.
Autoencoders
Learn compact representations in neural networks.
These help eliminate redundancy and speed up model training.
6.10 Evaluating Feature Importance
Helps explain model decisions and select features intelligently.
Techniques:
Permutation Importance:
Randomly shuffle one feature and observe performance drop.
Model-based Importance:
Use feature_importances_ attribute from tree-based models.
SHAP (SHapley Additive exPlanations):
Quantifies each feature’s contribution to individual predictions.
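A permutation-importance sketch with scikit-learn: shuffle each feature on a held-out set and measure how much the score drops; the data and model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: mean score drop when shuffled = {imp:.3f}")
```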
6.11 Feature Drift and Monitoring
Once deployed, feature distributions may shift over time due to changing environments.
Types:
Covariate Drift: Input features change.
Label Drift: Relationship between features and labels changes.
Concept Drift: Target definition evolves.
Solution:
Continuously monitor feature statistics.
Retrain or recalibrate model when drift exceeds threshold.
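One common drift statistic is the Population Stability Index (PSI) mentioned in the executive summary. A minimal hand-rolled sketch; the psi helper and the 0.2 alert threshold are common conventions used here for illustration, not a library API:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a production sample."""
    # Bin edges come from the training (expected) distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids division by zero and log(0)
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
prod_feature = rng.normal(0.5, 1.0, 10_000)   # shifted production distribution

print(psi(train_feature, prod_feature))       # values above ~0.2 are often treated as drift
```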
6.12 Automation in Feature Engineering
Modern ML systems use automated feature generation and management.
Tools:
Featuretools (Python) → automatic feature synthesis.
Feast, Tecton, Vertex AI Feature Store → store, version, and serve features in production.
Benefits:
Consistency across training and inference.
Shared, reusable features across teams.
Easier monitoring and debugging.
6.13 Key Takeaways
Feature engineering adds domain-driven signal that algorithms cannot create.
Combine, transform, and scale features to expose patterns.
Use statistical and model-based methods for feature selection.
Reduce redundancy via correlation checks and PCA.
Evaluate feature importance for interpretability and fairness.
Continuously monitor feature drift post-deployment.
Leverage feature stores for automation and reproducibility.
Remember: simple models with strong features outperform complex models with weak ones.
6.14 Chapter Summary
Feature engineering bridges raw data and model insight.
It involves domain understanding, transformation, interaction, and construction.
Feature selection ensures efficiency and interpretability.
Dimensionality reduction helps with high-dimensional datasets.
Automation tools are making feature management scalable and collaborative.
The best ML engineers are those who know what to add, what to remove, and what to monitor.
Chapter 7: Model Training and Evaluation
This chapter explains how to train, evaluate, and optimize machine learning models.
It emphasizes best practices for achieving reliable generalization, fair performance comparison, and effective model tuning — all while avoiding common pitfalls like overfitting and data leakage.
“A model that performs well on training data but fails on unseen data is worse than no model at all.”
7.1 The Purpose of Model Training
Model training is the process of finding the best parameters (θ) that minimize prediction errors on known data while maintaining generalization to unseen data.
Objective:
\hat{\theta} = \arg\min_{\theta} L(y, f(x; \theta))
Where:
L = loss function
y = true labels
f(x; θ) = model prediction
The main goal is not just minimizing training error, but ensuring low generalization error — i.e., performing well on new, unseen examples.
7.2 Training, Validation, and Test Splits
Proper data splitting is fundamental to unbiased model evaluation.
Training Set: fits model parameters (weights); used for learning.
Validation Set: tunes hyperparameters and prevents overfitting; guides model selection.
Test Set: final unbiased evaluation; used only once at the end.
Never use test data during model development — it invalidates your evaluation.
7.3 The Training Process
Initialize parameters (randomly or heuristically).
Compute predictions on training data.
Measure error using a loss function.
Adjust parameters to minimize loss (via optimization algorithm).
Repeat until convergence or early stopping.
7.4 Optimization Algorithms
Optimization adjusts model parameters to minimize the loss function.
7.4.1 Batch Gradient Descent
Uses the entire training dataset for each update.
Slow but accurate.
7.4.2 Stochastic Gradient Descent (SGD)
Updates after each example.
Faster and enables online learning but noisier.
7.4.3 Mini-Batch Gradient Descent
Compromise between batch and SGD.
Most commonly used in practice.
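A minimal NumPy sketch of mini-batch gradient descent for linear regression with squared loss; purely illustrative, not how you would train a production model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(50):
    idx = rng.permutation(len(X))                      # reshuffle examples each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient of MSE on the mini-batch
        w -= lr * grad                                  # parameter update step

print(w)   # close to [2.0, -1.0, 0.5]
```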
7.5 Common Loss Functions
Regression
Mean Squared Error (MSE): \frac{1}{n}\sum(y - \hat{y})^2
Mean Absolute Error (MAE): \frac{1}{n}\sum|y - \hat{y}|
Classification
Binary Cross-Entropy: -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]
Multi-class Cross-Entropy: penalizes the probability assigned to the wrong class.
Ranking
Pairwise loss / NDCG: for ordered outputs.
The loss function defines what the model learns — always align it with your business goal.
7.6 Bias–Variance Tradeoff
Generalization error can be decomposed as:
E_{total} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
Underfitting: model too simple; high training and test error.
Overfitting: model too complex; low training error, high test error.
Goal:
Find a balance between bias and variance.
Techniques to control variance:
Cross-validation
Regularization (L1, L2)
Early stopping
Dropout (for deep learning)
Ensemble methods
7.7 Regularization
Regularization penalizes large parameter weights to prevent overfitting.
L1 (Lasso): λΣ|wᵢ|; drives some weights to exactly zero (sparse models).
L2 (Ridge): λΣwᵢ²; shrinks weights smoothly.
Elastic Net: combination of L1 and L2; balances both effects.
λ (lambda) controls regularization strength — tune it via validation.
7.8 Cross-Validation
A method for assessing model stability and generalization.
7.8.1 k-Fold Cross-Validation
Split data into k folds (e.g., 5 or 10).
Train on k−1 folds, test on the remaining.
Average results for final score.
7.8.2 Stratified k-Fold
Maintains class balance (important for classification).
7.8.3 Time-Series Cross-Validation
Respects chronological order (train → validate → test progressively).
Use cross-validation for model comparison and hyperparameter tuning.
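A cross-validation sketch: 5-fold stratified CV with scikit-learn's cross_val_score, reporting the mean and spread of the per-fold F1 scores; data and metric choice are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(scores)                        # one F1 score per fold
print(scores.mean(), scores.std())   # average performance and its variability
```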
7.9 Hyperparameter Tuning
Hyperparameters are not learned during training; they define how training occurs.
Examples:
Learning rate
Regularization strength
Number of layers, trees, neighbors, etc.
Tuning Techniques:
Grid Search
Try all combinations exhaustively.
Random Search
Randomly sample combinations (often more efficient).
Bayesian Optimization
Uses prior results to guide next trials (e.g., Optuna, HyperOpt).
Early Stopping
Halt training when validation performance stops improving.
Always use the validation set (or CV folds) for hyperparameter tuning — not the test set.
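A hyperparameter-tuning sketch with RandomizedSearchCV; note that the search is scored on CV folds of the training data, and the test set is touched only once at the end. The search space is illustrative:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # inverse regularization strength
    n_iter=20, cv=5, random_state=0)
search.fit(X_train, y_train)                            # tuning uses CV folds only

print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))                     # single final check on held-out data
```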
7.10 Evaluation Metrics
The right metric depends on your task type and business objective.
7.10.1 Classification Metrics
Accuracy: (TP + TN) / Total
Precision: TP / (TP + FP)
Recall (Sensitivity): TP / (TP + FN)
F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC: area under the ROC curve; measures discrimination power.
PR-AUC: focuses on the positive class (useful for imbalance).
For imbalanced data, prefer Precision-Recall or AUC over Accuracy.
7.10.2 Regression Metrics
MSE / RMSE: penalizes large errors heavily.
MAE: measures average absolute deviation.
R² Score: fraction of variance explained by the model.
7.10.3 Ranking / Recommendation Metrics
Precision@k – How many of top-k results are relevant.
MAP (Mean Average Precision) – Overall ranking quality.
NDCG (Normalized Discounted Cumulative Gain) – Relevance weighted by position.
7.10.4 Business-Oriented Metrics
Sometimes, classical metrics don’t reflect real-world value.
Examples:
Click-through rate (CTR)
Revenue lift
False alarm cost
Customer churn rate reduction
Choose metrics that align with business impact.
7.11 Confusion Matrix and Error Analysis
A confusion matrix breaks down predictions into:
Actual Positive: TP (predicted positive), FN (predicted negative).
Actual Negative: FP (predicted positive), TN (predicted negative).
From this, all classification metrics are derived.
Error Analysis Steps:
Analyze misclassified examples.
Identify patterns or data issues.
Adjust features, thresholds, or model complexity.
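An error-analysis starting point with scikit-learn: compute the confusion matrix and per-class metrics, then pull out the misclassified rows for manual inspection; the data and model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows = actual class, columns = predicted class
print(classification_report(y_test, y_pred))  # precision / recall / F1 per class

misclassified = X_test[y_test != y_pred]      # examples to inspect for patterns or data issues
print(len(misclassified), "misclassified examples")
```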
7.12 Handling Class Imbalance
When one class dominates others (e.g., fraud = 1%, non-fraud = 99%).
Techniques:
Resampling: SMOTE, ADASYN, or undersampling.
Class Weights: Penalize mistakes on minority class more.
Threshold Adjustment: Shift decision boundary.
Metric Selection: Use Precision-Recall or F1 instead of accuracy.
7.13 Model Comparison
When comparing models, ensure:
Same train/validation/test splits.
Same preprocessing steps.
Same evaluation metrics.
Fair comparison = same data, same metric, same protocol.
Always include a baseline model for reference:
Random guesser
Logistic regression
Rule-based heuristic
7.14 Ensemble Methods
Combine multiple models to improve stability and performance.
Bagging: train models independently and average their predictions (e.g., Random Forest).
Boosting: sequentially correct the errors of previous models (e.g., XGBoost, AdaBoost, LightGBM).
Stacking: combine the predictions of several models using a meta-model (blended ensembles).
Ensembles reduce variance but may sacrifice interpretability.
7.15 Avoiding Overfitting
Overfitting = low training error but high validation/test error.
Prevention Techniques:
Cross-validation
Regularization (L1/L2)
Early stopping
Dropout (for deep nets)
Simplifying model complexity
Collecting more data
Data augmentation (especially for vision/text)
7.16 Reproducibility in Training
Ensure experiments are repeatable by fixing:
Random seeds
Data splits
Environment dependencies
Code versions
Use experiment-tracking tools:
MLflow
Weights & Biases
Comet.ml
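A minimal seed-fixing sketch; extend it with framework-specific seeds (e.g., torch.manual_seed) if you use them:

```python
import os
import random

import numpy as np

SEED = 42

random.seed(SEED)                          # Python's built-in RNG
np.random.seed(SEED)                       # NumPy (and, indirectly, scikit-learn defaults)
os.environ["PYTHONHASHSEED"] = str(SEED)   # only affects hashing if set before interpreter start

# Also pass the same seed wherever an estimator or split accepts one,
# e.g. train_test_split(X, y, random_state=SEED)
```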
7.17 Model Interpretability
Interpretability builds trust and helps debugging.
Methods:
Feature importance (trees, coefficients)
Partial Dependence Plots (PDP)
LIME / SHAP values
Decision rules / surrogate models
Choose explainable models for high-stakes domains (healthcare, finance).
7.18 Continuous Evaluation and Monitoring
Model performance can drift in production due to:
Data distribution shifts
Concept drift
Label leakage
Monitor:
Input feature statistics
Prediction distributions
Live performance (A/B testing)
Business KPIs
Retrain or recalibrate when drift is detected.
7.19 Key Takeaways
Always separate training, validation, and test data to avoid leakage.
Align loss functions and metrics with the business objective.
Manage bias–variance tradeoff via regularization and model tuning.
Use cross-validation for reliable performance estimation.
Track and tune hyperparameters systematically.
Handle class imbalance carefully using weights or resampling.
Maintain reproducibility with versioning and fixed seeds.
Continuously monitor model performance post-deployment.
7.20 Chapter Summary
Model training involves optimizing parameters to minimize loss.
Evaluation focuses on generalization, not just training accuracy.
Regularization, CV, and early stopping prevent overfitting.
Use the right metrics and error analysis to understand model behavior.
Hyperparameter tuning and ensemble learning can boost performance.
Monitoring and interpretability are crucial for production models.
Ultimately, good training practice is about discipline, reproducibility, and clarity — not just accuracy.
Chapter 8: Model Deployment and Prediction Service
This chapter explains how to turn a trained model into a usable service — accessible to users or systems — in a reliable, scalable, and maintainable way.
It covers deployment strategies, prediction pipelines, system architecture, versioning, monitoring, and performance optimization.
“A model that stays in a notebook helps no one — deployment turns ideas into value.”
8.1 What Is Model Deployment?
Model deployment is the process of integrating a trained ML model into a production environment where it can make predictions on new, unseen data.
Goal:
To make the model’s predictions accessible through an interface (API, batch job, or streaming system) for real-world use.
8.2 Model Deployment Paradigms
Deployment depends on how predictions are used:
Batch Prediction: the model runs periodically on bulk data (e.g., daily credit risk scoring, nightly sales forecasts).
Online (Real-time) Prediction: predictions are served instantly via an API (e.g., fraud detection during payment, recommendation systems).
Streaming Prediction: the model ingests a continuous data flow (e.g., predictive maintenance on IoT data, stock price alerts).
Key Tradeoffs
Batch = high throughput, low immediacy.
Online = low latency, more complexity.
Streaming = high engineering overhead, but instant reaction.
8.3 Components of a Prediction Service
A prediction service typically includes:
Model Artifacts
Serialized model file (e.g., .pkl, .onnx, .pt, .h5).
Feature Pipeline
Transforms raw input data into model-ready features.
Must mirror the preprocessing used during training.
Prediction Logic
Loads the model and computes outputs.
Serving Interface
REST API, gRPC endpoint, or message queue consumer.
Monitoring & Logging
Tracks requests, latency, and model performance drift.
Consistency rule: the preprocessing applied at inference time must be identical to the preprocessing used during training.
8.4 Packaging and Serialization
To deploy a model, you first package it into a portable artifact.
Common Serialization Formats
| Framework | Serialization format | File extension(s) |
| --- | --- | --- |
| Scikit-learn | Pickle, Joblib | .pkl, .joblib |
| TensorFlow/Keras | SavedModel, HDF5 | .pb, .h5 |
| PyTorch | TorchScript | .pt, .pth |
| ONNX | Open Neural Network Exchange | .onnx (cross-framework) |
Save both the model and preprocessing pipeline together for reproducibility.
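One common way to do this (a sketch, not the book's prescribed recipe) is to wrap preprocessing and model in a single scikit-learn Pipeline and serialize it with joblib; the file name is arbitrary:

```python
# Serialize preprocessing + model as one artifact so inference matches training.
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # the same transform will run at inference time
    ("model", LogisticRegression()),
])
# pipeline.fit(X_train, y_train)    # assumes training data is available here

joblib.dump(pipeline, "model_v1.joblib")   # save
pipeline = joblib.load("model_v1.joblib")  # load in the prediction service
```

Loading the same artifact in the prediction service keeps the transform applied at inference identical to the one used during training.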
8.5 Containerization
Models are deployed as containers for consistency across environments.
Benefits of Containerization:
Reproducible environment (same dependencies, libraries, OS).
Scalability across clusters (e.g., Kubernetes).
Isolation from other services.
Typical Container Stack:
Dockerfile → Defines environment (Python version, dependencies).
Docker image → Portable build.
Container orchestration → Managed by Kubernetes, ECS, or Docker Swarm.
Example structure:
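A minimal, illustrative layout for a containerized prediction service (all file and directory names here are assumptions):

```
prediction-service/
├── Dockerfile            # base image, dependencies, entrypoint
├── requirements.txt      # pinned Python dependencies
├── app/
│   ├── main.py           # serving interface (REST API)
│   ├── features.py       # feature pipeline shared with training
│   └── model_v1.joblib   # serialized model artifact
└── tests/
    └── test_predict.py   # smoke tests for the endpoint
```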
8.6 Model Serving Architectures
8.6.1 Embedded (In-process) Serving
Model runs inside the same process as the main application.
Pros: Low latency, simple setup.
Cons: Harder to update model without redeploying the app.
8.6.2 Separate Microservice
Model is exposed via REST/gRPC API.
Pros: Independent scalability and updates.
Cons: Network latency and additional infrastructure.
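A minimal Flask sketch of the microservice pattern; the route, payload shape, and artifact path are assumptions, and a production service would add validation, batching, and authentication:

```python
# Minimal REST prediction microservice sketch (not production-hardened).
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model_v1.joblib")  # artifact saved together with its preprocessing

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                 # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = np.array(payload["features"])
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction, "model_version": "v1"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```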
8.6.3 Model Server Frameworks
Prebuilt serving solutions for ML workloads.
| Framework | Typical use |
| --- | --- |
| TensorFlow Serving | TensorFlow models |
| TorchServe | PyTorch models |
| MLflow Models | Cross-framework serving |
| Seldon Core / KFServing | Kubernetes-native inference |
| Triton Inference Server | Multi-model GPU serving |
8.7 Deployment Environments
| Environment | Description | Typical use cases |
| --- | --- | --- |
| Cloud | Model hosted on AWS, Azure, or GCP. | Scalable production ML |
| On-premise | Local servers managed by the company. | Data-sensitive industries (finance, healthcare) |
| Edge | Model runs directly on devices. | Mobile apps, IoT sensors, self-driving cars |
Edge deployment prioritizes latency, privacy, and offline inference.
8.8 Model Versioning and Management
Every model version must be tracked along with:
Training data version
Code version
Hyperparameters
Evaluation metrics
Deployment metadata
Tools:
MLflow Model Registry
DVC (Data Version Control)
Weights & Biases
Git + Tagging
Version control ensures:
Reproducibility
Rollback ability (if performance degrades)
Auditability for compliance
8.9 CI/CD for Machine Learning (MLOps)
Continuous Integration / Continuous Deployment pipelines automate model lifecycle tasks.
CI/CD Stages:
Data validation (schema checks)
Model training (automated retraining pipeline)
Model evaluation (compare vs. current production)
Deployment (automated if new model passes threshold)
Monitoring & rollback
MLOps = DevOps + DataOps + ModelOps.
It ensures automation, reliability, and scalability of ML systems.
Popular Tools:
Kubeflow
Airflow
MLflow
Jenkins
Argo Workflows
8.10 Model Performance Monitoring
After deployment, models must be continuously monitored for:
1. Data Drift
Distribution of inputs changes over time.
Detected via KS-test, population stability index (PSI).
2. Concept Drift
Relationship between features and target changes.
3. Latency and Throughput
Measure prediction response time and load capacity.
4. Prediction Quality
Compare real-world outcomes vs. model predictions (if labels become available).
5. Business Metrics
Track user engagement, conversions, fraud rate, etc.
8.11 Canary and Shadow Deployments
Safe rollout strategies for new models.
Canary Deployment
Deploy new model to small % of users → monitor performance.
If stable → gradually increase traffic.
Shadow Deployment
New model runs in parallel with production model.
Receives same inputs but its predictions are not used.
Used for testing before production rollout.
Both strategies prevent full-scale failures.
8.12 Model Scaling
Scalability ensures that the service handles growing request volume efficiently.
Scaling Strategies
Vertical Scaling: Increase server resources (CPU/GPU/RAM).
Horizontal Scaling: Add more replicas behind a load balancer.
Autoscaling: Automatically adjust resources based on traffic (Kubernetes HPA, AWS Auto Scaling).
Use asynchronous queues (Kafka, RabbitMQ) for large batch or streaming workloads.
8.13 Security and Privacy in Deployment
Authentication & Authorization
Restrict API access via keys or OAuth.
Input Validation
Prevent malicious payloads or inference attacks.
Data Encryption
Encrypt data in transit (TLS) and at rest.
Model Protection
Use rate limits, model watermarking, or access control.
Privacy Preservation
Techniques: differential privacy, federated learning, secure enclaves.
Inference APIs can unintentionally expose sensitive model behavior — protect them like other production assets.
8.14 Model Retraining and Lifecycle Management
Models degrade over time due to:
Data drift
Behavior change
Concept drift
Retraining Loop
Collect new labeled data.
Evaluate production model on updated data.
Retrain if performance falls below threshold.
Re-deploy and monitor.
Automation tip:
Integrate retraining pipelines into the CI/CD workflow (MLOps).
8.15 A/B Testing for Model Performance
Compare two models live on real traffic.
| Element | Description |
| --- | --- |
| A model (Control) | Current production model |
| B model (Candidate) | New version under test |
| Metric | Business KPI (e.g., click-through rate, conversion) |
| Decision | Deploy B if it shows a statistically significant improvement |
Use A/B testing to ensure new models actually deliver business value, not just metric improvements.
8.16 Logging and Observability
Logging provides traceability and makes debugging possible.
Key Logs:
Input features (with timestamps)
Model version used
Prediction outputs
Latency and request ID
System-level metrics (CPU, GPU, memory)
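A hedged sketch of structured, JSON-per-line prediction logging; the field names are assumptions chosen to match the list above:

```python
# Illustrative structured prediction log entry; field names are assumptions.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction-service")

def log_prediction(features: dict, prediction, latency_ms: float, model_version: str) -> None:
    """Emit one JSON log line per request so it can be indexed by ELK/Datadog."""
    entry = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(entry))

log_prediction({"amount": 120.5, "country": "DE"}, "not_fraud", 12.3, "v1")
```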
Observability Tools:
Prometheus + Grafana
ELK Stack (Elasticsearch, Logstash, Kibana)
Sentry, Datadog
8.17 Error Handling and Failover
Implement graceful degradation (return fallback prediction or cached result if service fails).
Use redundant instances or load balancing for high availability.
Monitor error rates and auto-alert on anomalies.
8.18 Key Takeaways
Deployment turns ML research into real-world impact.
Choose between batch, online, or streaming inference depending on latency needs.
Package models with preprocessing logic — ensure consistency.
Use containers and microservices for scalability and independence.
Implement versioning, CI/CD, and MLOps for automation and reproducibility.
Monitor for drift, latency, and business KPIs continuously.
Use canary or shadow deployment to safely roll out updates.
Secure your model endpoints and data — treat them as production software.
Retrain periodically and validate improvements through A/B testing.
Build observability into every stage — logging, metrics, alerts.
8.19 Chapter Summary
Model deployment converts a static model into an active service.
It involves packaging, serving, scaling, monitoring, and maintaining the model lifecycle.
Deployment can be batch, online, or streaming, depending on latency and complexity.
Tools like Docker, Kubernetes, MLflow, Seldon, and Airflow are standard in production MLOps.
Long-term success depends on robust pipelines, observability, and continuous retraining.
The key to production ML: consistency, automation, monitoring, and security.
Chapter 9: Model Maintenance and Monitoring
This chapter focuses on what happens after model deployment — the phase where most real-world challenges appear.
It explains how to monitor, maintain, and update machine learning models to ensure their performance remains consistent, fair, and reliable over time.
“In machine learning, deployment is not the end — it’s the beginning of a new lifecycle.”
9.1 Why Model Maintenance Is Essential
A model’s environment is dynamic, not static.
Once in production, it begins to interact with real-world data — which changes continuously.
Reasons for Degradation (a.k.a. Model Decay):
Data Drift: Input data distribution changes over time.
Concept Drift: Relationship between inputs and outputs evolves.
Label Drift: Target variable definitions shift.
Feature Drift: Important features lose predictive power or disappear.
System Drift: Infrastructure or upstream changes (e.g., APIs, ETL jobs) alter inputs.
Even a perfectly trained model will fail if the world it represents changes.
9.2 Model Lifecycle Management
The ML lifecycle doesn’t stop at deployment; it’s a continuous loop:
Monitor model performance in production.
Detect drift or anomalies.
Collect fresh labeled data.
Retrain or recalibrate the model.
Validate new version and redeploy.
This cycle repeats throughout the model’s operational lifespan — a concept called continuous ML (CML) or MLOps.
9.3 Key Aspects of Monitoring
Monitoring verifies that the model performs as expected, both technically and in business terms.
Monitoring Dimensions:
| Dimension | Purpose | Example checks / metrics |
| --- | --- | --- |
| Data quality monitoring | Ensure input data integrity. | Missing values, schema mismatches |
| Model performance monitoring | Track predictive accuracy and calibration. | Accuracy, F1, AUC, RMSE |
| Drift detection | Identify input/output distribution changes. | PSI, KL divergence, KS test |
| Operational metrics | Maintain system stability. | Latency, throughput, uptime |
| Business KPIs | Measure real-world impact. | Conversion rate, churn reduction, fraud detection rate |
9.4 Monitoring Data Quality
Data quality issues are among the most common causes of performance degradation.
What to Check:
Schema Changes:
Feature added/removed/renamed.
Example: “user_age” becomes “age_years.”
Missing Data Patterns:
Missing values increase suddenly.
Feature Statistics:
Mean, median, standard deviation drift.
Sudden distribution shifts.
Outlier Ratios:
Large deviation may signal corrupted data.
Tools:
Great Expectations
TFDV (TensorFlow Data Validation)
Deequ (AWS)
Pandera (Python)
Always log input data summaries with every prediction request.
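As an illustration, a minimal schema contract with Pandera (one of the tools listed above); the column names and bounds are assumptions:

```python
# Illustrative Pandera schema check; columns and constraints are assumptions.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "user_age": pa.Column(int, checks=pa.Check.in_range(0, 120)),
    "amount": pa.Column(float, checks=pa.Check.ge(0)),
    "country": pa.Column(str, nullable=False),
})

batch = pd.DataFrame({
    "user_age": [34, 51],
    "amount": [12.5, 230.0],
    "country": ["DE", "FR"],
})
schema.validate(batch)  # raises SchemaError if the incoming batch violates the contract
```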
9.5 Detecting Drift
9.5.1 Types of Drift
| Drift type | What changes | Example |
| --- | --- | --- |
| Covariate drift (feature drift) | Input distribution P(X). | Customer age distribution shifts due to a new user segment. |
| Prior probability drift (label drift) | Target label distribution P(Y). | Fraud rate drops from 10% to 2%. |
| Concept drift | Relationship between inputs and labels, P(Y\|X). | |
9.5.2 Drift Detection Methods
| Method | Use |
| --- | --- |
| Population Stability Index (PSI) | Measures the difference between two distributions. |
| Kolmogorov–Smirnov (KS) test | Detects differences in continuous variable distributions. |
| Jensen–Shannon or KL divergence | Quantifies dissimilarity between probability distributions. |
| Chi-square test | For categorical features. |
| ADWIN / DDM (online algorithms) | Real-time drift detection in streaming data. |
9.6 Monitoring Model Performance
Measure how the model performs on live data compared to its original validation results.
Approaches:
Delayed Ground Truth Monitoring:
Wait until actual labels become available (e.g., loan default after 6 months).
Compare new metrics to baseline validation results.
Proxy Metrics:
Use surrogate indicators (like click rate, engagement) when labels are delayed.
Continuous Evaluation:
Periodically run evaluation pipelines on new labeled batches.
Typical Metrics:
Classification: Precision, Recall, F1, AUC.
Regression: MAE, RMSE, R².
Business: Revenue lift, reduced churn, fraud savings.
Track both model metrics and business outcomes to assess true health.
9.7 Model Explainability and Accountability
Post-deployment interpretability helps detect unexpected model behavior or bias.
Explainability Tools:
Feature Importance (Tree-based models)
SHAP values
LIME (Local Interpretable Model-Agnostic Explanations)
Counterfactual explanations
Accountability Measures:
Document model lineage and decisions.
Store versioned model cards with:
Intended use case
Training data summary
Evaluation metrics
Known limitations and risks
This builds trust and auditability, especially in regulated industries.
9.8 Automating Model Maintenance (MLOps Loop)
Automation ensures models are updated without manual intervention.
Typical Automated Workflow:
Detect Drift → Trigger Retraining Pipeline.
Collect New Data → Validate Schema → Train Model.
Evaluate → Compare with Baseline → If Better → Deploy.
Monitor Performance → Repeat.
This can be implemented via tools like:
Kubeflow Pipelines
MLflow + Airflow
TFX (TensorFlow Extended)
AWS SageMaker Pipelines
Automation reduces human error and keeps models synchronized with reality.
9.9 Retraining Strategies
Retraining keeps models relevant but must be done strategically.
Retraining Triggers:
Time-based: Retrain periodically (e.g., weekly, monthly).
Performance-based: Retrain when performance drops below threshold.
Data-based: Retrain when input distribution drifts beyond limit.
Retraining Types:
| Type | Description | Example |
| --- | --- | --- |
| Full retraining | Train a new model from scratch. | Monthly refresh with all data |
| Incremental training | Update the model using recent data. | Online learning models (SGD, streaming) |
| Transfer learning | Reuse part of the old model with new domain data. | Adapting a product recommender for a new region |
Retraining frequency depends on how dynamic the data domain is.
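A hedged sketch of a performance-based trigger; the threshold, retrain(), and deploy() hooks are placeholders, not a real API:

```python
# Illustrative performance-based retraining trigger; retrain() and deploy() are placeholders.
from sklearn.metrics import f1_score

F1_THRESHOLD = 0.80  # example threshold, set from the baseline validation results

def maybe_retrain(y_true, y_pred, retrain, deploy) -> bool:
    """Retrain and redeploy when live F1 falls below the agreed threshold."""
    live_f1 = f1_score(y_true, y_pred)
    if live_f1 < F1_THRESHOLD:
        new_model = retrain()   # e.g. launch the training pipeline
        deploy(new_model)       # e.g. register and roll out via the model registry
        return True
    return False

triggered = maybe_retrain([1, 0, 1, 1], [1, 0, 0, 0],
                          retrain=lambda: "new_model", deploy=lambda m: None)
print("Retraining triggered:", triggered)
```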
9.10 Model Versioning and Rollback
Every new model must be:
Versioned (with training data, code, and parameters).
Tested for regression against current production.
Deployable with rollback support.
Model Registry Should Include:
Version ID
Training dataset hash
Model configuration & hyperparameters
Validation & production metrics
Deployment date and owner
Rollback:
If a new version performs worse or introduces bias, the system should automatically revert to the previous stable version.
9.11 Bias and Fairness Monitoring
Models can become biased due to:
Historical bias in training data.
Uneven drift across user groups.
Feedback loops reinforcing prior errors.
Monitoring Methods:
Fairness Metrics: Equal Opportunity, Demographic Parity, Disparate Impact.
Group-wise Performance Tracking: Compare F1 or AUC across demographics.
Bias Dashboards: Automated visualization tools (e.g., What-If Tool, Aequitas).
Fairness is not a one-time check — it must be continuously measured.
9.12 Shadow and Champion–Challenger Models
Safe maintenance requires testing new models before replacing old ones.
Shadow Mode:
New model runs in parallel, receives same inputs, but predictions are not used.
Compares outputs silently to production model.
Champion–Challenger Setup:
Champion = current production model.
Challenger = candidate model.
The Challenger replaces the Champion only if it performs better on real-world data.
This system prevents regressions and enables experimentation without risk.
9.13 Monitoring Tools and Infrastructure
Popular tools for end-to-end monitoring and maintenance:
| Purpose | Tools |
| --- | --- |
| Data validation | Great Expectations, TFDV |
| Experiment tracking | MLflow, Weights & Biases |
| Drift detection | EvidentlyAI, WhyLabs, Arize AI |
| Model registry | MLflow Registry, SageMaker Model Registry |
| Alerting & dashboards | Prometheus + Grafana, Kibana, Datadog |
9.14 Model Governance and Compliance
For enterprise and regulated domains (finance, healthcare, insurance), compliance is critical.
Governance Elements:
Auditability: Logs of all model decisions.
Traceability: Links from prediction → model version → training data.
Accountability: Who approved, trained, and deployed the model.
Transparency: Clear documentation of data sources and decisions.
Tools and Standards:
Model Cards (Google)
Datasheets for Datasets
AI Fairness 360 (IBM)
Responsible AI frameworks (Azure, AWS, GCP)
9.15 Human-in-the-Loop Systems
Human oversight improves model robustness and accountability.
Examples:
Review edge-case predictions.
Provide feedback for retraining.
Override model outputs in critical cases.
The goal is not to replace humans but to augment them.
9.16 Key Challenges in Model Maintenance
| Challenge | Description |
| --- | --- |
| Data drift | New data distributions differ from the training data. |
| Label availability | Delayed or missing true outcomes for feedback. |
| Automation failures | Pipelines break due to dependency updates. |
| Cost management | Retraining and monitoring can be expensive. |
| Governance complexity | Ensuring compliance across changing models. |
Addressing these requires a combination of MLOps tools, alerting systems, and process discipline.
9.17 Key Takeaways
Model performance decays naturally — monitor continuously.
Track data drift, concept drift, and feature drift separately.
Automate retraining pipelines but validate models before deployment.
Use shadow/champion–challenger setups to test safely.
Version models, data, and metrics for full traceability.
Continuously monitor bias and fairness.
Establish governance and compliance processes for accountability.
Implement human-in-the-loop checks for critical decisions.
Use dashboards, alerts, and logs to detect anomalies early.
Maintenance is an ongoing process — success depends on iteration and observation.
9.18 Chapter Summary
Model maintenance ensures long-term reliability of ML systems.
Monitoring covers data quality, performance, drift, and fairness.
Automation (MLOps) is key to scaling retraining and redeployment.
Every model version should be tracked, explainable, and auditable.
Continuous monitoring bridges engineering, data science, and business.
In production, change is constant — the best ML engineers design for evolution, not perfection.
Chapter 10: Ethics and Privacy
This final chapter emphasizes the responsibility of ML engineers in developing and deploying machine learning systems that are ethical, fair, transparent, and respectful of privacy.
It discusses the moral, social, and legal implications of AI models and outlines strategies to build systems that benefit people without causing harm.
“A great machine learning engineer doesn’t just optimize models — they optimize for humanity.”
10.1 The Importance of Ethics in Machine Learning
Machine learning models influence critical decisions — in healthcare, finance, justice, hiring, and public policy.
Unethical design or misuse can lead to:
Discrimination and unfair outcomes
Invasion of privacy
Spread of misinformation
Erosion of trust in technology
Hence, ethical ML is not optional — it’s a core engineering responsibility.
10.2 Defining AI Ethics
AI Ethics is the practice of aligning machine learning systems with:
Human values (fairness, autonomy, safety)
Moral principles (justice, beneficence, non-maleficence)
Legal and social norms
Key Objective:
Ensure ML systems help more than they harm, and their decisions can be understood and justified.
10.3 Core Ethical Principles
| Principle | Meaning | Example |
| --- | --- | --- |
| Fairness | Equal treatment for all individuals or groups. | Biased loan approval model |
| Accountability | Humans remain responsible for AI decisions. | "The algorithm decided, not me." |
| Transparency | Decision-making logic is explainable and accessible. | Black-box models in justice systems |
| Privacy | Protecting personal and sensitive data. | Sharing identifiable health data |
| Security | Prevent misuse and attacks. | Exposed API allowing data theft |
| Non-maleficence | Do no harm to users or society. | Deepfake misinformation |
| Beneficence | Actively promote well-being. | AI that improves healthcare access |
10.4 Fairness in Machine Learning
ML models learn from historical data, which often carries systemic bias.
10.4.1 Sources of Bias
Sampling Bias: Non-representative data collection.
Label Bias: Inaccurate or biased ground truth labels.
Measurement Bias: Errors in data capture (e.g., camera-based skin tone recognition).
Algorithmic Bias: Model amplifies existing inequalities.
Societal Bias: Data reflects human prejudice and discrimination.
10.4.2 Fairness Metrics
| Metric | Definition |
| --- | --- |
| Demographic parity | Equal positive prediction rates across groups. |
| Equal opportunity | Equal true positive rates across groups. |
| Equalized odds | Equal TPR and FPR across groups. |
| Disparate impact | Ratio of favorable outcomes between groups ≥ 0.8. |
| Calibration within groups | Predicted probabilities are consistent across demographics. |
Fairness cannot be achieved by accident — it must be measured and enforced deliberately.
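As an illustration, demographic parity and the disparate-impact ratio can be computed directly from predictions and a protected attribute; the data below is made up:

```python
# Illustrative group fairness check from raw predictions and a protected attribute.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # model decisions (1 = favorable)
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])  # protected attribute

rate_a = y_pred[group == "A"].mean()   # positive rate for group A
rate_b = y_pred[group == "B"].mean()   # positive rate for group B

demographic_parity_diff = abs(rate_a - rate_b)
disparate_impact = min(rate_a, rate_b) / max(rate_a, rate_b)

print(f"Positive rates: A={rate_a:.2f}, B={rate_b:.2f}")
print(f"Demographic parity difference: {demographic_parity_diff:.2f}")
print(f"Disparate impact ratio: {disparate_impact:.2f}  (rule of thumb: should be >= 0.8)")
```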
10.5 Addressing Fairness
Approaches to Mitigate Bias:
1. Pre-Processing
Clean and rebalance data before training.
Techniques: reweighting, resampling (e.g., SMOTE for minority classes), removing sensitive features.
2. In-Processing
Modify training algorithms to include fairness constraints.
Example: adversarial debiasing.
3. Post-Processing
Adjust model outputs or thresholds for fairness.
Example: equalize prediction probabilities across subgroups.
Tools:
IBM AI Fairness 360, Google What-If Tool, Fairlearn (Microsoft).
10.6 Transparency and Explainability
Black-box models raise questions like:
Why did the model reject this loan?
How is this diagnosis justified?
Can we trust an algorithmic decision?
Explainability = Understanding model decisions.
Types of Explainability
| Type | Scope | Example methods |
| --- | --- | --- |
| Global explainability | Understand the model's overall logic. | Feature importance, partial dependence plots |
| Local explainability | Explain a single prediction. | SHAP, LIME, counterfactual explanations |
| Model-agnostic | Works for any model type. | LIME, SHAP |
| Intrinsic | Built into the model structure. | Linear regression, decision trees |
Why Explainability Matters
Builds trust with users.
Helps detect bias and errors.
Required by law (e.g., GDPR “right to explanation”).
Choose the simplest model that meets performance requirements — simpler models are easier to explain.
10.7 Accountability in ML Systems
Humans must remain responsible for AI outcomes.
Accountability Components
Traceability: Every prediction should be linked to:
Model version
Data used
Decision rationale
Auditability: Logs and metrics must be available for external review.
Human Oversight: Humans can override model decisions when necessary.
Documentation Tools
Model Cards (Google): Explain model’s purpose, performance, and limitations.
Datasheets for Datasets (Gebru et al.): Describe dataset origin, collection, and biases.
FactSheets (IBM): Summarize ethical risks and intended use.
10.8 Privacy in Machine Learning
ML often requires large amounts of data, much of it personal. Protecting privacy is both ethical and legal.
10.8.1 Types of Sensitive Data
Personally Identifiable Information (PII)
Financial or medical records
Biometric data (face, fingerprints)
Behavioral logs or user activity
10.8.2 Privacy Violations
Accidental data leaks
Model memorization (e.g., remembering patient names)
Inference attacks (extracting private info from model)
Re-identification from anonymized data
10.9 Techniques for Privacy Preservation
| Technique | Description | Example |
| --- | --- | --- |
| Data anonymization | Remove identifiers such as name and address. | Replace "John Doe" with "User123" |
| Pseudonymization | Replace identifiers with pseudonyms. | Encrypt personal IDs |
| Differential privacy | Add statistical noise to prevent re-identification. | Used by Apple and Google in telemetry data |
| Federated learning | Train models locally on devices; only updates are shared, never raw data. | Google Keyboard personalization |
| Homomorphic encryption | Perform computations on encrypted data. | Privacy-preserving cloud inference |
| Secure multiparty computation | Jointly compute results without revealing private data. | Collaborative medical AI research |
Privacy-preserving ML enables innovation without compromising personal security.
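As a toy illustration of the differential-privacy idea (not a production-grade mechanism), the Laplace mechanism releases a count query with noise scaled to sensitivity/ε:

```python
# Toy Laplace mechanism: add noise scaled to sensitivity/epsilon to a count query.
import numpy as np

def private_count(values, epsilon: float = 1.0, rng=np.random.default_rng(0)) -> float:
    """Release a noisy count; the sensitivity of a count query is 1."""
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(values) + noise

users_over_40 = [1] * 137  # pretend 137 users match the query
print("True count:", len(users_over_40))
print("DP count (eps=1.0):", round(private_count(users_over_40, epsilon=1.0), 1))
```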
10.10 Legal and Regulatory Frameworks
Several global regulations enforce responsible data and AI use.
| Regulation | Region | Focus |
| --- | --- | --- |
| GDPR (General Data Protection Regulation) | EU | Data protection, user consent, "right to explanation" |
| CCPA (California Consumer Privacy Act) | USA | Data ownership and opt-out rights |
| HIPAA (Health Insurance Portability and Accountability Act) | USA | Protects healthcare data |
| OECD AI Principles | Global | Fairness, transparency, accountability |
| EU AI Act (2024+) | EU | Risk-based regulation for AI systems |
GDPR Key Rights:
Right to be informed
Right to access data
Right to rectification and erasure
Right to data portability
Right to explanation (for automated decisions)
10.11 Security in ML Systems
Security breaches can compromise both the data and the model.
Common Threats
Data Poisoning: Injecting malicious samples during training.
Model Inversion: Extracting sensitive data from model outputs.
Adversarial Attacks: Slightly modifying inputs to fool the model.
Model Stealing: Reverse-engineering a proprietary model via API queries.
Mitigation Techniques
Validate input data and use anomaly detection.
Apply differential privacy and encryption.
Limit API access and throttle queries.
Monitor unusual usage patterns or request spikes.
10.12 Social Impacts of Machine Learning
ML systems shape human behavior and societal structures.
Potential Negative Impacts
Job displacement due to automation.
Filter bubbles reinforcing one-sided content.
Surveillance capitalism via behavioral tracking.
Misinformation spread through deepfakes and algorithmic amplification.
Positive Impacts
Enhanced healthcare diagnostics.
Smarter resource allocation (energy, traffic, logistics).
Accessible education via adaptive learning.
Faster scientific discoveries.
ML should be designed to empower — not manipulate — humans.
10.13 Building Ethical ML Systems
Checklist for Ethical ML Development:
| Stage | Actions |
| --- | --- |
| Data collection | Obtain consent, anonymize data, ensure diversity. |
| Model design | Prioritize fairness, interpretability, and robustness. |
| Evaluation | Test for bias, drift, and ethical risk. |
| Deployment | Secure APIs, implement human-in-the-loop controls. |
| Maintenance | Monitor fairness, retrain when necessary. |
Best Practices:
Use ethics review boards or internal audit teams.
Create diverse development teams to reduce blind spots.
Include user feedback loops in the design.
Define “harm scenarios” during system design.
10.14 Responsible AI Frameworks
Leading tech organizations promote responsible AI principles:
| Organization | Framework / principles |
| --- | --- |
| Microsoft | Responsible AI Principles: fairness, reliability, privacy, inclusiveness |
| IBM | Everyday Ethics for AI |
| OECD / UNESCO | Global AI ethics guidelines |
| EU | AI Act: risk-based compliance approach |
All emphasize:
Fairness
Transparency
Accountability
Privacy
Human oversight
10.15 The Role of the ML Engineer
An ethical ML engineer is:
A technologist: Writes efficient, scalable, accurate code.
A scientist: Validates assumptions with data.
A philosopher: Questions the social impact of their creations.
Key Responsibilities:
Anticipate harm — ask “Who could be negatively affected?”
Design for inclusion — ensure diverse representation in data.
Maintain accountability — document and explain every design choice.
Respect privacy — minimize data collection and use only necessary features.
Communicate transparently — with both technical and non-technical stakeholders.
Ethical ML engineering is a discipline — not a checklist.
10.16 Key Takeaways
Ethics and privacy are as critical as accuracy or scalability.
Fairness must be quantified and enforced using metrics.
Transparency and explainability build trust and accountability.
Privacy-preserving techniques (e.g., differential privacy, federated learning) enable responsible innovation.
Global regulations like GDPR and EU AI Act set the compliance baseline.
Engineers must proactively identify and mitigate bias, drift, and misuse.
Ethical AI = aligned intent + transparent execution + continuous oversight.
10.17 Chapter Summary
ML systems have moral and societal implications beyond technical boundaries.
Fairness, transparency, privacy, and accountability must guide every ML decision.
Use frameworks and tools to audit, monitor, and explain models.
Protect data and users from misuse through privacy-preserving computation and secure infrastructure.
Responsible AI development is not just a legal obligation — it’s a moral duty.
The best machine learning systems don’t just work well — they work right.