
MLOps

Comprehensive guide to deploying, monitoring, and maintaining ML systems in production.




MLOps Overview

Definition: MLOps is a set of practices that brings data science and operations teams together to deploy, monitor, and maintain ML models in production reliably and efficiently.

Key Goals

  • Automation: Automate ML pipeline from data to deployment

  • Reproducibility: Ensure experiments and models are reproducible

  • Scalability: Scale training and inference efficiently

  • Reliability: Monitor and maintain model performance

  • Collaboration: Enable teams to work together effectively

MLOps vs DevOps

| Aspect | DevOps | MLOps |
| --- | --- | --- |
| Artifacts | Code, binaries | Code + data + models |
| Testing | Unit, integration tests | Data validation, model validation |
| Deployment | Continuous deployment | Gradual rollout, A/B testing |
| Monitoring | System metrics | Model performance, data drift |
| Versioning | Code versions | Code + data + model versions |


ML Lifecycle

1. Problem Definition

  • Define business problem

  • Set success metrics (business + ML metrics)

  • Establish baselines

  • Determine data availability

2. Data Collection & Preparation

  • Data validation

  • Missing value handling

  • Outlier detection

  • Feature transformations

  • Data versioning (DVC, Delta Lake)

3. Model Development

  • Experiment tracking (MLflow, Weights & Biases)

  • Model versioning

  • Hyperparameter optimization

  • Cross-validation

4. Model Deployment

  • Model packaging

  • Containerization (Docker)

  • Deployment strategy (canary, blue-green)

  • API development (REST, gRPC)

5. Monitoring & Maintenance

  • Model performance metrics

  • Data drift / concept drift detection

  • Alerting and incident response

  • Continuous retraining


Data Management

Data Versioning

Tools: DVC, Delta Lake, lakeFS

Why?

  • Reproduce experiments

  • Track data lineage

  • Rollback to previous versions

Example with DVC:
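
A minimal sketch using DVC's Python API to load a pinned version of a dataset; it assumes the data was already tracked with `dvc add` and pushed with `dvc push`, and the repo URL, path, and tag below are placeholders.

```python
import pandas as pd
import dvc.api

# Read a dataset pinned to a specific Git revision (tag/commit).
# repo, path, and rev are placeholders for your own project.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2.0",
    mode="r",
) as f:
    df = pd.read_csv(f)

print(df.shape)
```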

Data Quality Checks

Tools: Great Expectations, Evidently

Checks:

  • Schema validation (correct types, columns)

  • Range checks (min/max values)

  • Distribution checks (mean, std)

  • Missing value thresholds

  • Duplicate detection

Example:
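
A minimal sketch using Great Expectations' legacy pandas-style API (0.x releases); the file path, column names, and bounds are illustrative.

```python
import pandas as pd
import great_expectations as ge

df = pd.read_csv("data/train.csv")      # placeholder path
gdf = ge.from_pandas(df)                # wrap the DataFrame with expectation methods

# Illustrative schema, completeness, range, and set-membership checks
gdf.expect_column_to_exist("user_id")
gdf.expect_column_values_to_not_be_null("user_id")
gdf.expect_column_values_to_be_between("age", min_value=0, max_value=120)
gdf.expect_column_values_to_be_in_set("country", ["US", "DE", "IN"])

results = gdf.validate()                # evaluate all expectations declared above
print(results.success)
```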

Feature Stores

Tools: Feast, Tecton, AWS SageMaker Feature Store

Benefits:

  • Share features across teams

  • Consistent features (training vs serving)

  • Low-latency serving

  • Point-in-time correctness (no data leakage)
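
As a concrete illustration of serving-time lookup, here is a minimal sketch with Feast; the repo path, feature view, feature names, and entity key are hypothetical.

```python
from feast import FeatureStore

# Assumes a Feast repo (feature_store.yaml + feature definitions) exists at this path.
store = FeatureStore(repo_path=".")

# Online lookup at serving time; feature view, feature names, and entity key are hypothetical.
features = store.get_online_features(
    features=[
        "user_stats:purchase_count_7d",
        "user_stats:avg_order_value",
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()

print(features)
```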


Model Development

Experiment Tracking

Tools: MLflow, Weights & Biases, Neptune.ai

Track:

  • Hyperparameters

  • Metrics (accuracy, loss)

  • Artifacts (models, visualizations)

  • Code version (git commit)

  • Environment (dependencies)

MLflow Example:
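
A minimal sketch of logging a run with MLflow; the experiment name, model, and hyperparameters are illustrative.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-model")    # experiment name is a placeholder

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                    # hyperparameters
    mlflow.log_metric("accuracy", acc)           # metrics
    mlflow.sklearn.log_model(model, "model")     # model artifact
```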

Model Registry

Purpose: Centralized model storage with versioning and lifecycle management

Stages:

  • Development: Experimental models

  • Staging: Models ready for testing

  • Production: Models serving live traffic

  • Archived: Deprecated models

MLflow Model Registry:
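
A minimal sketch of registering a trained model and promoting it through the classic stage-based lifecycle (newer MLflow versions favor version aliases); the model name and run ID are placeholders.

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "abc123"                         # placeholder run ID from a tracked training run
model_uri = f"runs:/{run_id}/model"

# Register the run's model artifact under a named registry entry.
registered = mlflow.register_model(model_uri, "churn-model")

# Promote the new version through lifecycle stages.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=registered.version,
    stage="Staging",                      # later: "Production" or "Archived"
)
```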


Deployment Strategies

Batch Inference

When: Process large datasets offline (e.g., daily recommendations, weekly reports)

Architecture: data warehouse / lake → scheduled pipeline (e.g., nightly) → batch scoring job → predictions written to a table or object store → consumed by downstream applications

Pros: High throughput, can use complex models, cost-effective

Cons: Not real-time, predictions can become stale between runs

Tools: Apache Airflow, AWS Batch, Google Cloud Dataflow
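
A minimal batch-scoring sketch, assuming a scikit-learn model serialized with joblib and Parquet input/output; all paths are placeholders, and in practice the script would be scheduled by an orchestrator such as Airflow.

```python
import joblib
import pandas as pd

# Placeholder paths; in production these would point at a warehouse/data-lake export.
MODEL_PATH = "artifacts/model.joblib"
INPUT_PATH = "data/daily_batch.parquet"
OUTPUT_PATH = "predictions/daily_scores.parquet"

def run_batch_job() -> None:
    model = joblib.load(MODEL_PATH)            # trained model artifact
    batch = pd.read_parquet(INPUT_PATH)        # offline feature snapshot

    scores = model.predict_proba(batch)[:, 1]  # assumes a binary classifier
    batch.assign(score=scores).to_parquet(OUTPUT_PATH, index=False)

if __name__ == "__main__":
    run_batch_job()
```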


Real-time Inference

When: Low latency required (< 100ms), request-response pattern

Architecture: client → load balancer / API gateway → model service (REST or gRPC) → prediction returned in the response, with online features fetched from a feature store or cache as needed

Serving Options:

| Option | Use Case | Latency |
| --- | --- | --- |
| REST API | General purpose | 50-200 ms |
| gRPC | Low latency | 10-50 ms |
| Serverless | Variable traffic | 100-500 ms |
| Edge | Ultra-low latency | < 10 ms |

Flask Example:
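
A minimal sketch of a REST prediction service with Flask, assuming a scikit-learn model serialized with joblib; the payload format and paths are illustrative.

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("artifacts/model.joblib")    # placeholder model path

@app.route("/health", methods=["GET"])
def health():
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)       # expects {"features": [[...], ...]}
    features = np.array(payload["features"])
    predictions = model.predict(features).tolist()
    return jsonify(predictions=predictions)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A client then POSTs JSON such as `{"features": [[0.1, 2.3, 4.5]]}` to `/predict` and receives the predictions as a JSON array.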


Deployment Patterns

1. Canary Deployment

  • Gradually route traffic to new model (5% → 25% → 50% → 100%)

  • Monitor performance at each stage

  • Rollback if problems detected

2. Blue-Green Deployment

  • Run old (blue) and new (green) models in parallel

  • Switch traffic instantly

  • Easy rollback

3. Shadow Mode

  • New model receives traffic but doesn't serve responses

  • Compare predictions offline

  • Zero risk deployment

4. A/B Testing

  • Split traffic between models (e.g., 50/50); see the routing sketch after this list

  • Measure business metrics

  • Choose winner based on data
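
Canary and A/B testing both come down to weighted routing between model versions at the request layer. A minimal routing sketch, assuming two already-loaded model objects (the names and weight are illustrative):

```python
import random

class WeightedRouter:
    """Route each request to the current or candidate model based on a traffic weight."""

    def __init__(self, current_model, candidate_model, candidate_share=0.05):
        self.current = current_model
        self.candidate = candidate_model
        self.candidate_share = candidate_share   # 0.05 = 5% canary, 0.5 = 50/50 A/B test

    def predict(self, features):
        # Record which variant served the request so outcomes can be compared later.
        if random.random() < self.candidate_share:
            return {"variant": "candidate", "prediction": self.candidate.predict(features)}
        return {"variant": "current", "prediction": self.current.predict(features)}

# Usage with placeholder models:
# router = WeightedRouter(model_v1, model_v2, candidate_share=0.05)
# result = router.predict(features)
```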


Monitoring & Observability

Model Performance Monitoring

Metrics to Track:

  • Accuracy, precision, recall, F1

  • Prediction latency (p50, p95, p99)

  • Throughput (requests/second)

  • Error rate

Tools: Prometheus, Grafana, DataDog
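
A minimal sketch of exposing latency and error metrics with the `prometheus_client` library so Prometheus can scrape them and Grafana can chart them; the metric names and port are placeholders.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Metric names are placeholders; align them with your alerting rules.
PREDICTION_LATENCY = Histogram(
    "prediction_latency_seconds", "Time spent producing a prediction"
)
PREDICTION_ERRORS = Counter(
    "prediction_errors_total", "Number of failed prediction requests"
)

def predict_with_metrics(model, features):
    """Wrap model.predict with latency and error instrumentation."""
    start = time.time()
    try:
        return model.predict(features)
    except Exception:
        PREDICTION_ERRORS.inc()
        raise
    finally:
        PREDICTION_LATENCY.observe(time.time() - start)

# Call once at service startup; Prometheus scrapes http://<host>:9100/metrics
start_http_server(9100)
```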

Example Alert: notify the on-call engineer when p95 prediction latency exceeds the SLO (e.g., 200 ms) or the error rate stays above 1% for more than 5 minutes.

Data Drift Detection

Data Drift: Input feature distributions change over time

Detection Methods:

  • PSI (Population Stability Index): Measures distribution change (see the sketch after this list)

  • KL Divergence: Statistical distance between distributions

  • Kolmogorov-Smirnov Test: Statistical test for distribution difference
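
A minimal sketch of the PSI and KS checks above using NumPy and SciPy; the bin count, sample data, and thresholds are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(reference, current, bins=10):
    """PSI between a reference (training) sample and a current (production) sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf            # capture out-of-range production values
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)           # avoid log(0) for empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)    # stand-in for the training distribution
current = rng.normal(0.3, 1.2, 10_000)      # stand-in for drifted production data

psi = population_stability_index(reference, current)
ks = ks_2samp(reference, current)

# Common rule of thumb: PSI > 0.2 indicates significant drift.
print(f"PSI={psi:.3f}, KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.4f}")
```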

Tools: Evidently AI, WhyLabs, Fiddler

Example with Evidently:
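
A minimal sketch using Evidently's `Report` API (0.x releases); the reference and current DataFrame sources are placeholders.

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference = data the model was trained on; current = recent production data (placeholder paths).
reference = pd.read_parquet("data/reference.parquet")
current = pd.read_parquet("data/current_week.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

report.save_html("drift_report.html")   # shareable drift dashboard
summary = report.as_dict()              # programmatic access, e.g. to drive alerts
```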

Concept Drift Detection

Concept Drift: Relationship between features and target changes

Detection:

  • Monitor model performance metrics over time

  • Compare predictions vs actuals

  • Alert when performance degrades

Solutions:

  • Retrain model with recent data

  • Online learning (incremental updates)

  • Ensemble with recent models


CI/CD for ML

Continuous Integration

Steps:

  1. Code commit (git push)

  2. Run tests (unit, integration)

  3. Data validation

  4. Model training

  5. Model validation

  6. Register model if metrics pass threshold (a minimal gate-script sketch follows this list)
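
Steps 5-6 are often implemented as a small gate script that the CI job runs after training; a minimal sketch, with the metrics path and threshold as placeholders:

```python
import json
import sys

METRICS_PATH = "artifacts/metrics.json"   # written by the training step (placeholder)
ACCURACY_THRESHOLD = 0.85                 # placeholder minimum acceptable accuracy

def main() -> int:
    with open(METRICS_PATH) as f:
        metrics = json.load(f)

    accuracy = metrics["accuracy"]
    if accuracy < ACCURACY_THRESHOLD:
        print(f"FAIL: accuracy {accuracy:.3f} is below threshold {ACCURACY_THRESHOLD}")
        return 1                          # non-zero exit code fails the CI job

    print(f"PASS: accuracy {accuracy:.3f} meets threshold; model can be registered")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```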

Example with GitHub Actions: a workflow triggered on each push that runs the test suite, validates the data, trains a candidate model, evaluates it against the metric threshold, and registers the model only if the checks pass.

Continuous Deployment

Automated Pipeline: new model version registered → automated validation in staging → shadow or canary deployment → gradual rollout to production with monitoring and automatic rollback

Safety Checks:

  • Minimum accuracy threshold

  • Performance regression tests

  • Shadow mode testing

  • Gradual rollout with monitoring


Infrastructure & Tools

Cloud Platforms

| Platform | ML Services | Strengths |
| --- | --- | --- |
| AWS | SageMaker, EC2, Lambda | Mature ecosystem, broad services |
| GCP | Vertex AI, AI Platform | TensorFlow integration, AutoML |
| Azure | Azure ML, Databricks | Enterprise integration |

Containerization

Docker for ML: package the model artifact, inference code, and pinned dependencies into a single image with a defined entry point (the serving process), so the same environment runs in development, CI, and production.

Kubernetes for Orchestration:

  • Scaling (replicas based on load)

  • Load balancing

  • Health checks and auto-restart

  • Resource management

Model Serving Platforms

| Tool | Best For | Features |
| --- | --- | --- |
| TensorFlow Serving | TensorFlow models | High performance, versioning |
| TorchServe | PyTorch models | Multi-model serving |
| ONNX Runtime | Cross-framework | Optimized inference |
| Seldon Core | Kubernetes native | ML deployment on K8s |
| KServe (formerly KFServing) | Cloud-native | Serverless, autoscaling |


Best Practices

1. Reproducibility

  • Version everything: code, data, models, environment

  • Use Docker for consistent environments

  • Set random seeds

  • Document dependencies (requirements.txt, conda env)

2. Testing

  • Unit tests for data processing logic

  • Integration tests for pipelines

  • Model validation tests (accuracy thresholds)

  • Data quality checks

3. Monitoring

  • Track model performance metrics

  • Monitor data drift

  • Set up alerts for degradation

  • Log predictions for debugging

4. Security

  • Encrypt data at rest and in transit

  • Use secrets management (AWS Secrets Manager, Vault)

  • Implement access control (IAM, RBAC)

  • Audit logging

5. Cost Optimization

  • Use spot instances for training

  • Auto-scaling for inference

  • Model compression (quantization, pruning)

  • Batch similar requests


MLOps Maturity Levels

Level 0: Manual Process

  • Manual data preparation

  • Notebooks for training

  • Manual deployment

  • No monitoring

Level 1: ML Pipeline Automation

  • Automated training pipeline

  • Experiment tracking

  • Model registry

  • Basic monitoring

Level 2: CI/CD Pipeline Automation

  • Automated testing

  • Automated deployment

  • Continuous monitoring

  • Automated retraining triggers

Level 3: Production-Grade MLOps

  • Advanced monitoring (drift detection)

  • Online learning

  • Multi-model management

  • Feature stores

  • Governance and compliance


Interview Topics

Common Questions:

  1. "Explain the difference between model drift and data drift"

    Data drift: input feature distributions change. Concept (model) drift: the X→y relationship changes, so performance degrades even when inputs look similar. Both typically require retraining.

  2. "How would you deploy a model to production?"

    Containerize (Docker) → Model registry → Staging → A/B test → Gradual rollout → Monitor → Full deployment

  3. "How do you ensure reproducibility?"

    Version code (git), data (DVC), models (registry), environment (Docker), set random seeds

  4. "How would you detect if a model is degrading?"

    Monitor performance metrics, track data drift, compare predictions vs actuals, set up alerts

  5. "Batch vs real-time inference trade-offs?"

    Batch: Higher throughput, offline, complex models. Real-time: Low latency, online, simpler models.


Resources

Books:

  • Building Machine Learning Powered Applications (Emmanuel Ameisen)

  • Machine Learning Engineering (Andriy Burkov)

  • Designing Machine Learning Systems (Chip Huyen)

Courses:

  • Made With ML - MLOps

  • DeepLearning.AI - MLOps Specialization

Tools Repository:

Blogs:

  • Google Cloud MLOps

  • AWS Machine Learning Blog

  • Neptune.ai Blog
