MLOps
Comprehensive guide to deploying, monitoring, and maintaining ML systems in production.
MLOps Overview
Definition: MLOps is the practice of collaboration between Data Scientists and Operations teams to deploy, monitor, and maintain ML models in production reliably and efficiently.
Key Goals
Automation: Automate ML pipeline from data to deployment
Reproducibility: Ensure experiments and models are reproducible
Scalability: Scale training and inference efficiently
Reliability: Monitor and maintain model performance
Collaboration: Enable teams to work together effectively
MLOps vs DevOps
| Aspect | DevOps | MLOps |
| --- | --- | --- |
| Artifacts | Code, binaries | Code + data + models |
| Testing | Unit, integration tests | Data validation, model validation |
| Deployment | Continuous deployment | Gradual rollout, A/B testing |
| Monitoring | System metrics | Model performance, data drift |
| Versioning | Code versions | Code + data + model versions |
ML Lifecycle
1. Problem Definition
Define business problem
Set success metrics (business + ML metrics)
Establish baselines
Determine data availability
2. Data Collection & Preparation
Data validation
Missing value handling
Outlier detection
Feature transformations
Data versioning (DVC, Delta Lake)
3. Model Development
Experiment tracking (MLflow, Weights & Biases)
Model versioning
Hyperparameter optimization
Cross-validation
4. Model Deployment
Model packaging
Containerization (Docker)
Deployment strategy (canary, blue-green)
API development (REST, gRPC)
5. Monitoring & Maintenance
Model performance metrics
Data drift / concept drift detection
Alerting and incident response
Continuous retraining
Data Management
Data Versioning
Tools: DVC, Delta Lake, lakeFS
Why?
Reproduce experiments
Track data lineage
Rollback to previous versions
Example with DVC:
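A minimal sketch of reading a versioned dataset through DVC's Python API (assumes the file is already tracked by DVC in the current repo; the path and the `v1.0` tag are illustrative):

```python
import pandas as pd
import dvc.api

# Read a specific version of the dataset by git revision (tag, branch, or commit).
# "v1.0" is an illustrative tag; replace with a revision that exists in your repo.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    train_v1 = pd.read_csv(f)

# Resolve the remote storage URL of the tracked file (useful for lineage/debugging).
url = dvc.api.get_url("data/train.csv", rev="v1.0")
print(url)
```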
Data Quality Checks
Tools: Great Expectations, Evidently
Checks:
Schema validation (correct types, columns)
Range checks (min/max values)
Distribution checks (mean, std)
Missing value thresholds
Duplicate detection
Example:
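A sketch of the checks above using Great Expectations' classic pandas-backed API (available in older releases; newer versions use a different entry point). The dataframe, column names, and bounds are illustrative:

```python
import great_expectations as ge
import pandas as pd

df = pd.read_csv("data/train.csv")   # illustrative path
gdf = ge.from_pandas(df)             # classic pandas-backed API (older GE releases)

# Schema check: expected columns are present
gdf.expect_table_columns_to_match_set({"customer_id", "age", "income", "label"})

# Range check on a numeric feature
result = gdf.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result)   # includes a success flag and observed statistics

# Distribution check on the mean of a feature
gdf.expect_column_mean_to_be_between("income", min_value=20000, max_value=200000)

# Missing-value threshold: at least 95% of rows must have income populated
gdf.expect_column_values_to_not_be_null("income", mostly=0.95)

# Duplicate detection on an ID column
gdf.expect_column_values_to_be_unique("customer_id")
```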
Feature Stores
Tools: Feast, Tecton, AWS SageMaker Feature Store (a Feast lookup sketch follows below)
Benefits:
Share features across teams
Consistent features (training vs serving)
Low-latency serving
Point-in-time correctness (no data leakage)
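A minimal sketch of online feature retrieval with Feast (assumes a feature repo already defines a `driver_stats` feature view and a `driver_id` entity; all names are illustrative):

```python
from feast import FeatureStore

# Point at an existing Feast repository (feature definitions + registry).
store = FeatureStore(repo_path=".")

# Fetch the latest feature values for one entity at serving time.
features = store.get_online_features(
    features=[
        "driver_stats:avg_daily_trips",   # illustrative feature_view:feature names
        "driver_stats:conv_rate",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(features)
```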
Model Development
Experiment Tracking
Tools: MLflow, Weights & Biases, Neptune.ai
Track:
Hyperparameters
Metrics (accuracy, loss)
Artifacts (models, visualizations)
Code version (git commit)
Environment (dependencies)
MLflow Example:
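A minimal MLflow tracking sketch (the experiment name, hyperparameters, and model are illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-model")   # illustrative experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    mlflow.log_params(params)                                               # hyperparameters
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))  # metrics
    mlflow.sklearn.log_model(model, "model")                                # model artifact
```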
Model Registry
Purpose: Centralized model storage with versioning and lifecycle management
Stages:
Development: Experimental models
Staging: Models ready for testing
Production: Models serving live traffic
Archived: Deprecated models
MLflow Model Registry:
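A sketch of registering a logged model and promoting it to Staging (the run ID and model name are illustrative; recent MLflow versions favor version aliases over stages, so the stage API may differ):

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a previous run ("<RUN_ID>" is a placeholder).
model_uri = "runs:/<RUN_ID>/model"
result = mlflow.register_model(model_uri, "churn-model")

# Promote the new version to Staging (stage-based API; deprecated in newer
# MLflow releases in favor of model version aliases).
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=result.version,
    stage="Staging",
)
```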
Deployment Strategies
Batch Inference
When: Process large datasets offline (e.g., daily recommendations, weekly reports)
Typical architecture: data store → scheduled batch scoring job → predictions written to a database or warehouse
Pros: High throughput, can use complex models, cost-effective
Cons: Not real-time, stale predictions
Tools: Apache Airflow, AWS Batch, Google Cloud Dataflow
Real-time Inference
When: Low latency required (< 100ms), request-response pattern
Typical architecture: client → API gateway / load balancer → model serving endpoint → response
Serving Options:
| Option | Use Case | Latency |
| --- | --- | --- |
| REST API | General purpose | 50-200 ms |
| gRPC | Low latency | 10-50 ms |
| Serverless | Variable traffic | 100-500 ms |
| Edge | Ultra-low latency | < 10 ms |
Flask Example:
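A minimal Flask serving sketch (assumes a scikit-learn model serialized to `model.joblib`; the request schema and port are illustrative):

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")   # illustrative path to a trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    payload = request.get_json()
    X = np.array(payload["features"])
    preds = model.predict(X).tolist()
    return jsonify({"predictions": preds})

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```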
Deployment Patterns
1. Canary Deployment
Gradually route traffic to new model (5% → 25% → 50% → 100%)
Monitor performance at each stage
Rollback if problems detected
2. Blue-Green Deployment
Run old (blue) and new (green) models in parallel
Switch traffic instantly
Easy rollback
3. Shadow Mode
New model receives traffic but doesn't serve responses
Compare predictions offline
Zero risk deployment
4. A/B Testing
Split traffic between models (e.g., 50/50)
Measure business metrics
Choose winner based on data (see the traffic-splitting sketch below)
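Canary rollouts and A/B tests both come down to weighted traffic splitting between model versions. A minimal in-process sketch (the weights and model objects are illustrative; in practice the split usually happens at the load balancer or service mesh):

```python
import random

def route_request(features, models, weights):
    """Pick a model version by weight and return (version, prediction)."""
    version = random.choices(list(models.keys()), weights=weights, k=1)[0]
    prediction = models[version].predict([features])[0]
    return version, prediction

# Illustrative: send 5% of traffic to the canary, 95% to the current model.
# models = {"v1": current_model, "v2": canary_model}
# version, pred = route_request(x, models, weights=[0.95, 0.05])
```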
Monitoring & Observability
Model Performance Monitoring
Metrics to Track:
Accuracy, precision, recall, F1
Prediction latency (p50, p95, p99)
Throughput (requests/second)
Error rate
Tools: Prometheus, Grafana, DataDog
Example Alert:
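A sketch of instrumenting a prediction service with `prometheus_client`, so that latency and error-rate alerts can be defined in Prometheus/Grafana (metric names, the port, and the dummy prediction are illustrative):

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")
PREDICTION_ERRORS = Counter("prediction_errors_total", "Failed predictions")

def predict_with_metrics(features):
    start = time.time()
    try:
        # ... call the real model here; a random value stands in for a prediction
        return random.random()
    except Exception:
        PREDICTION_ERRORS.inc()
        raise
    finally:
        REQUEST_LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    while True:
        predict_with_metrics([1.0, 2.0])
        time.sleep(1)
```

An alert rule would then fire on these metrics, e.g. when p99 latency stays above 200 ms or the error rate exceeds a threshold for several minutes.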
Data Drift Detection
Data Drift: Input feature distributions change over time
Detection Methods:
PSI (Population Stability Index): Measures distribution change
KL Divergence: Statistical distance between distributions
Kolmogorov-Smirnov Test: Statistical test for distribution difference
Tools: Evidently AI, WhyLabs, Fiddler
Example with Evidently:
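A sketch of a data drift report using Evidently's Report/preset interface (API shown is from the 0.2+ line and differs in other versions; the dataframes and paths are illustrative):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference data: what the model was trained on; current data: recent production inputs.
reference = pd.read_csv("data/train.csv")                  # illustrative paths
current = pd.read_csv("data/production_last_week.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")   # per-feature drift tests + summary
```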
Concept Drift Detection
Concept Drift: Relationship between features and target changes
Detection:
Monitor model performance metrics over time
Compare predictions vs actuals
Alert when performance degrades
Solutions:
Retrain model with recent data
Online learning (incremental updates; see the sketch below)
Ensemble with recent models
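A minimal sketch of the online-learning option using scikit-learn's `partial_fit` (the model choice and the random batch data are illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A linear model that supports incremental (online) updates.
model = SGDClassifier(random_state=42)
classes = np.array([0, 1])   # all classes must be declared on the first call

def update_on_new_batch(model, X_batch, y_batch, first_batch=False):
    """Incrementally update the model as freshly labeled production data arrives."""
    if first_batch:
        model.partial_fit(X_batch, y_batch, classes=classes)
    else:
        model.partial_fit(X_batch, y_batch)
    return model

# Illustrative usage with random data standing in for a labeled production batch.
X0, y0 = np.random.randn(100, 5), np.random.randint(0, 2, 100)
model = update_on_new_batch(model, X0, y0, first_batch=True)
```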
CI/CD for ML
Continuous Integration
Steps:
Code commit (git push)
Run tests (unit, integration)
Data validation
Model training
Model validation
Register model if metrics pass threshold
Example with GitHub Actions:
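A sketch of the model-validation gate such a workflow could run as one of its steps (the `metrics.json` path, the 0.85 accuracy threshold, and the exit-code convention are assumptions):

```python
import json
import sys

ACCURACY_THRESHOLD = 0.85   # assumed threshold; tune per project

def main():
    # metrics.json is assumed to be written by the training step, e.g. {"accuracy": 0.91}
    with open("metrics.json") as f:
        metrics = json.load(f)

    accuracy = metrics["accuracy"]
    if accuracy < ACCURACY_THRESHOLD:
        print(f"FAIL: accuracy {accuracy:.3f} below threshold {ACCURACY_THRESHOLD}")
        sys.exit(1)   # non-zero exit fails the CI job and blocks registration

    print(f"PASS: accuracy {accuracy:.3f}")

if __name__ == "__main__":
    main()
```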
Continuous Deployment
Automated pipeline: new model registered → deploy to staging → automated safety checks → gradual production rollout
Safety Checks:
Minimum accuracy threshold
Performance regression tests
Shadow mode testing
Gradual rollout with monitoring
Infrastructure & Tools
Cloud Platforms
| Platform | ML Services | Strengths |
| --- | --- | --- |
| AWS | SageMaker, EC2, Lambda | Mature ecosystem, broad services |
| GCP | Vertex AI, AI Platform | TensorFlow integration, AutoML |
| Azure | Azure ML, Databricks | Enterprise integration |
Containerization
Docker for ML: package the model artifact, inference code, and pinned dependencies into a single image so training and serving environments match
Kubernetes for Orchestration:
Scaling (replicas based on load)
Load balancing
Health checks and auto-restart
Resource management
Model Serving Platforms
| Tool | Best For | Features |
| --- | --- | --- |
| TensorFlow Serving | TensorFlow models | High performance, versioning |
| TorchServe | PyTorch models | Multi-model serving |
| ONNX Runtime | Cross-framework | Optimized inference |
| Seldon Core | Kubernetes native | ML deployment on K8s |
| KFServing | Cloud-native | Serverless, autoscaling |
Best Practices
1. Reproducibility
Version everything: code, data, models, environment
Use Docker for consistent environments
Set random seeds
Document dependencies (requirements.txt, conda env)
2. Testing
Unit tests for data processing logic
Integration tests for pipelines
Model validation tests (accuracy thresholds)
Data quality checks
3. Monitoring
Track model performance metrics
Monitor data drift
Set up alerts for degradation
Log predictions for debugging
4. Security
Encrypt data at rest and in transit
Use secrets management (AWS Secrets Manager, Vault)
Implement access control (IAM, RBAC)
Audit logging
5. Cost Optimization
Use spot instances for training
Auto-scaling for inference
Model compression (quantization, pruning; see the sketch below)
Batch similar requests
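As an illustration of the model-compression point, dynamic quantization in PyTorch (the model is a toy stand-in; the API's module location varies slightly across torch versions):

```python
import torch
import torch.nn as nn

# Toy model standing in for a real network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Dynamic quantization: Linear weights stored as int8, activations quantized
# on the fly -> smaller model, often faster CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)
```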
MLOps Maturity Levels
Level 0: Manual Process
Manual data preparation
Notebooks for training
Manual deployment
No monitoring
Level 1: ML Pipeline Automation
Automated training pipeline
Experiment tracking
Model registry
Basic monitoring
Level 2: CI/CD Pipeline Automation
Automated testing
Automated deployment
Continuous monitoring
Automated retraining triggers
Level 3: Production-Grade MLOps
Advanced monitoring (drift detection)
Online learning
Multi-model management
Feature stores
Governance and compliance
Interview Topics
Common Questions:
"Explain the difference between model drift and data drift"
Data drift: input feature distributions change. Concept drift (often what "model drift" refers to): the relationship between X and y changes. Both typically call for retraining.
"How would you deploy a model to production?"
Containerize (Docker) → Model registry → Staging → A/B test → Gradual rollout → Monitor → Full deployment
"How do you ensure reproducibility?"
Version code (git), data (DVC), models (registry), environment (Docker), set random seeds
"How would you detect if a model is degrading?"
Monitor performance metrics, track data drift, compare predictions vs actuals, set up alerts
"Batch vs real-time inference trade-offs?"
Batch: Higher throughput, offline, complex models. Real-time: Low latency, online, simpler models.
Resources
Books:
Building Machine Learning Powered Applications (Emmanuel Ameisen)
Machine Learning Engineering (Andriy Burkov)
Designing Machine Learning Systems (Chip Huyen)
Courses:
Made With ML - MLOps
DeepLearning.AI - MLOps Specialization
Blogs:
Google Cloud MLOps
AWS Machine Learning Blog
Neptune.ai Blog