Introduction to AI

Artificial Intelligence

Definition: Artificial Intelligence (AI) is the field of computer science focused on creating systems capable of performing tasks that typically require human intelligence, such as learning, reasoning, problem-solving, perception, and language understanding.

Core Learning Paradigms

AI systems learn patterns and make decisions through various approaches:

  • Statistical Machine Learning - Uses mathematical and probabilistic methods to find patterns in data:

    • Creates decision boundaries, hyperplanes, or hierarchical splits to divide data

    • Examples: Linear/logistic regression, SVMs, decision trees, random forests

  • Deep Learning - Mimics human neural networks to learn complex patterns through iterative optimization:

    • Learns hierarchical feature representations automatically

    • Excels at unstructured data (images, text, audio)

    • Examples: CNNs, RNNs, Transformers

  • Generative AI - Creates new data samples after learning from existing data:

    • Examples: GANs, VAEs, Diffusion Models (DALL-E, Stable Diffusion)

  • Reinforcement Learning - Learns through interaction with an environment via rewards and penalties:

    • Agent learns optimal actions through trial and error

    • Examples: Game playing (AlphaGo), robotics, recommendation systems

Key Challenges in Machine Learning

  • Data Representation - How to encode and feed data to ML models (feature engineering, embeddings)

  • Performance Monitoring - Tracking model progress during training (loss curves, validation metrics)

  • Generalization - Ensuring models learn the right patterns and perform well on unseen data

  • Interpretability - Understanding how models make decisions

  • Scalability - Handling large datasets and deploying models in production

Machine Learning vs Deep Learning

| Aspect | Machine Learning | Deep Learning |
| --- | --- | --- |
| Approach | Statistical and probabilistic methods | Neural networks with multiple hidden layers |
| Data Requirements | Works well with smaller datasets (1K-100K samples) | Requires large datasets (100K-1M+ samples) |
| Computation | Lower computational cost, can run on CPUs | High computational cost, requires GPUs/TPUs |
| Feature Engineering | Manual feature engineering required | Automatic feature learning |
| Training Time | Minutes to hours | Hours to days/weeks |
| Interpretability | High (e.g., decision trees, linear models) | Low (black-box models) |
| Use Cases | Tabular data, structured problems, limited data | Images, text, audio, unstructured data |
| Performance Scaling | Plateaus with more data | Improves with more data and model size |
| Hardware Needs | Standard CPU sufficient | GPUs/TPUs often necessary |

When to Use Machine Learning:

  • Small to medium-sized datasets

  • Structured/tabular data

  • Need for model interpretability

  • Limited computational resources

  • Quick iteration and deployment needed

When to Use Deep Learning:

  • Large datasets available

  • Unstructured data (images, text, audio, video)

  • Complex pattern recognition required

  • Performance is priority over interpretability

  • Sufficient computational resources available

Machine Learning Fundamentals

Bias-Variance Trade-off

One of the most important concepts in machine learning - understanding this is critical for interviews.

Mathematical Foundation:

The expected prediction error can be decomposed as:

E[(y − ŷ)²] = Bias(ŷ)² + Var(ŷ) + σ²

Definitions:

  • Bias - Error from incorrect assumptions in the learning algorithm

    • Measures how far off the average model prediction is from the true value

    • High Bias → Underfitting - Model is too simple, fails to capture underlying patterns

    • Example: Using linear regression for non-linear data

  • Variance - Error from sensitivity to small fluctuations in training data

    • Measures how much predictions vary across different training sets

    • High Variance → Overfitting - Model learns noise in training data

    • Example: Deep decision tree memorizing training data

  • Irreducible Error - Noise in the data that cannot be reduced (σ²)

The Trade-off:

  • Reducing bias typically increases variance (more complex model)

  • Reducing variance typically increases bias (simpler model)

  • Goal: Find the sweet spot that minimizes total error

Factors Affecting Bias and Variance:

| Factor | Effect on Bias | Effect on Variance |
| --- | --- | --- |
| More training data | No change | Decreases ↓ |
| More features | Decreases ↓ | Increases ↑ |
| More complex model | Decreases ↓ | Increases ↑ |
| Regularization (↑λ) | Increases ↑ | Decreases ↓ |
| Feature selection | May increase ↑ | Decreases ↓ |
| Ensemble methods | Decreases ↓ | Decreases ↓ |

How to Reduce High Bias (Underfitting):

  • Use more complex model (e.g., polynomial features, deeper network)

  • Add more relevant features

  • Decrease regularization (reduce λ)

  • Remove noise from data

  • Increase model capacity

How to Reduce High Variance (Overfitting):

  • Collect more training data

  • Feature selection / dimensionality reduction

  • Increase regularization (increase λ)

  • Use ensemble methods (bagging)

  • Cross-validation

  • Early stopping (for neural networks)

  • Dropout (for neural networks)

Interview Tip: Be ready to explain with a concrete example: "If I use linear regression on data with a quadratic relationship, I'll have high bias because the model can't capture the curve. If I use a 20-degree polynomial, I'll have high variance because it'll fit every noise point in my training data."
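A minimal scikit-learn sketch of exactly this example (synthetic data, illustrative degrees): degree 1 underfits the quadratic signal, degree 20 fits the noise, and cross-validated error exposes both.

```python
# Underfitting vs overfitting on quadratic data, shown via cross-validated MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=100)  # quadratic signal + noise

for degree in (1, 2, 20):  # high bias, sweet spot, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree={degree:2d}  cross-validated MSE={mse:.2f}")
```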

Regularization Techniques

Regularization adds a penalty term to the loss function to prevent overfitting:

  • L1 Regularization (Lasso):

    • Produces sparse models (some weights become exactly 0)

    • Useful for feature selection

    • Less stable with correlated features

  • L2 Regularization (Ridge):

    • Shrinks all weights but doesn't zero them out

    • More stable with correlated features

    • Preferred when all features are relevant

  • Elastic Net:

    • Combines L1 and L2

    • Handles correlated features better than Lasso

Interview Question: "When would you use L1 vs L2 regularization?"

  • L1: When you suspect many features are irrelevant and want feature selection

  • L2: When you believe most features contribute and want to avoid instability
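A short sketch of the difference, assuming scikit-learn; the synthetic data has only 3 informative features out of 20, so Lasso should zero most weights while Ridge merely shrinks them:

```python
# L1 drives irrelevant weights exactly to zero; L2 only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 3 actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso weights at zero:", int(np.sum(lasso.coef_ == 0)), "/ 20")  # sparse
print("Ridge weights at zero:", int(np.sum(ridge.coef_ == 0)), "/ 20")  # typically 0
```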

Train/Validation/Test Split

Purpose: Properly evaluate model performance and prevent overfitting

Standard Split Ratios:

  • Training Set (60-80%): Used to fit the model

  • Validation Set (10-20%): Used for hyperparameter tuning and model selection

  • Test Set (10-20%): Used ONLY for final evaluation

Best Practices:

  • Stratification: Maintain class distribution in splits for imbalanced data

  • Time-based split: For time series, always split chronologically

  • Never use test data during development
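A sketch of a stratified 60/20/20 split with scikit-learn (dataset and ratios are illustrative):

```python
# Stratified 60/20/20 split: both splits preserve the 90/10 class ratio.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Carve out the test set first, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)
# 0.25 of the remaining 80% = 20% validation, leaving 60% for training.
```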

Cross-Validation

Purpose: Better utilize limited data and get more reliable performance estimates

K-Fold Cross-Validation:

  • Split data into K folds

  • Train on K-1 folds, validate on 1 fold

  • Repeat K times, each fold used as validation once

  • Average results across all folds

Common Variants:

  • Stratified K-Fold: Maintains class distribution (use for classification)

  • Time Series Split: Respects temporal order

  • Leave-One-Out (LOO): K = number of samples (expensive but thorough)

When to Use:

  • Small datasets (to maximize training data usage)

  • Model comparison (more reliable than single split)

  • Hyperparameter tuning (use nested CV)

Interview Tip: Mention that CV is computationally expensive (K times the cost) but gives more robust estimates.
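A minimal stratified K-fold sketch, assuming scikit-learn (model and scoring choices are illustrative):

```python
# Stratified 5-fold CV: each fold keeps the 80/20 class ratio.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```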

Ensemble Methods

Combine multiple models to improve performance:

  • Bagging (Bootstrap Aggregating):

    • Reduces variance

    • Trains multiple models on random subsets of data

    • Averages predictions (regression) or votes (classification)

    • Example: Random Forest

  • Boosting:

    • Reduces bias

    • Sequentially trains weak learners, each focusing on previous errors

    • Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost

  • Stacking:

    • Trains a meta-model on predictions of base models

    • Can combine diverse model types
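A compact scikit-learn sketch of all three flavors (model choices are illustrative):

```python
# Bagging (Random Forest), boosting (Gradient Boosting), and a stack of both.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[("bagging", RandomForestClassifier(random_state=0)),
                ("boosting", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression())  # meta-model over base predictions
print(f"stacked accuracy: {cross_val_score(stack, X, y, cv=5).mean():.3f}")
```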

-------------------------------------------------------------------------------------------------------

Machine Learning Algorithms

Supervised Learning

Supervised learning uses labeled data (input-output pairs) to learn a mapping function.

Regression Algorithms (Continuous Output)

| Algorithm | Time Complexity | Key Assumptions | When to Use |
| --- | --- | --- | --- |
| Linear Regression | O(n·p²) | Linear relationship, independent features | Quick baseline, interpretability needed |
| Ridge/Lasso | O(n·p²) | Same as linear + many features | High-dimensional data, feature selection |
| Decision Tree | O(n·log(n)·p) | None | Non-linear relationships, interpretability |
| Random Forest | O(n·log(n)·p·t) | None | General purpose, handles non-linearity |
| XGBoost/LightGBM | O(n·log(n)·p·t) | None | Structured data, need highest accuracy |
| SVR | O(n²·p) to O(n³·p) | Kernel-dependent | Small-medium datasets, non-linear |
| KNN | O(n·p) per prediction | Similar instances close together | Small datasets, simple baseline |

n = samples, p = features, t = trees

Classification Algorithms (Categorical Output)

| Algorithm | Time Complexity | Key Strengths | Common Issues |
| --- | --- | --- | --- |
| Logistic Regression | O(n·p) | Fast, interpretable, probability outputs | Assumes linear decision boundary |
| Naive Bayes | O(n·p) | Fast, works well with small data | Assumes feature independence |
| Decision Tree | O(n·log(n)·p) | Interpretable, handles non-linearity | Prone to overfitting |
| Random Forest | O(n·log(n)·p·t) | Robust, handles imbalanced data | Less interpretable, slower |
| XGBoost | O(n·log(n)·p·t) | State-of-art on tabular data | Requires tuning, can overfit |
| LightGBM | O(n·p·t) | Faster than XGBoost, good for large data | Can overfit on small data |
| CatBoost | O(n·log(n)·p·t) | Handles categorical features well | Slower training |
| SVM | O(n²·p) to O(n³·p) | Effective in high dimensions | Slow on large datasets |
| KNN | O(n·p) | Simple, no training time | Slow predictions, curse of dimensionality |

Ensemble Methods:

  • Bagging (Random Forest): Reduces variance, parallel training

  • Boosting (XGBoost, AdaBoost, GBM): Reduces bias, sequential training

  • Stacking: Combines diverse models with meta-learner

Algorithm Selection Guide

Start with these baselines:

  1. Regression: Linear Regression → Random Forest → XGBoost

  2. Classification: Logistic Regression → Random Forest → XGBoost

Choose based on constraints:

  • Need interpretability: Linear/Logistic Regression, Decision Tree

  • Have high-dimensional data: Lasso, Ridge, Random Forest

  • Have categorical features: CatBoost, LightGBM

  • Limited data: Naive Bayes, Regularized Linear Models

  • Need fast predictions: Linear models, Naive Bayes

  • Maximum accuracy on tabular data: XGBoost, LightGBM, CatBoost

Unsupervised Learning

Unsupervised learning finds patterns in unlabeled data.

Clustering Algorithms

| Algorithm | Time Complexity | Best For | Limitations |
| --- | --- | --- | --- |
| K-Means | O(n·k·i·p) | Spherical clusters, large datasets | Must specify k, sensitive to outliers |
| DBSCAN | O(n·log(n)) | Arbitrary shapes, outlier detection | Struggles with varying densities |
| Hierarchical | O(n²·log(n)) | Dendrograms, unknown k | Slow on large data |
| GMM | O(n·k·i·p) | Soft clustering, probabilistic | Assumes Gaussian distributions |

k = clusters, i = iterations

Dimensionality Reduction

| Algorithm | Type | Preserves | Use Case |
| --- | --- | --- | --- |
| PCA | Linear | Global structure, variance | Visualization, noise reduction |
| t-SNE | Non-linear | Local structure | Visualization (2D/3D) |
| UMAP | Non-linear | Local + global structure | Visualization, faster than t-SNE |
| LDA | Supervised | Class separability | Feature extraction for classification |
| Autoencoders | Non-linear | Learned features | Complex non-linear reductions |

Evaluation Metrics

Regression Metrics

| Metric | Formula | Range | When to Use |
| --- | --- | --- | --- |
| MAE | (1/n) Σ\|y − ŷ\| | [0, ∞) | Robust to outliers |
| MSE | (1/n) Σ(y − ŷ)² | [0, ∞) | Penalizes large errors |
| RMSE | √MSE | [0, ∞) | Same units as target |
| R² | 1 − (SS_res / SS_tot) | (−∞, 1] | Explains variance, 1 is perfect |
| Adjusted R² | 1 − [(1−R²)(n−1)/(n−p−1)] | (−∞, 1] | Accounts for # of features |
| MAPE | (100/n) Σ\|(y − ŷ)/y\| | [0, ∞) | Scale-independent percentage error |

Key Insights:

  • MAE: Less sensitive to outliers (linear penalty)

  • MSE/RMSE: More sensitive to outliers (quadratic penalty)

  • R²: Can be negative if model is worse than mean baseline

  • MAPE: Not suitable when y can be close to 0

Interview Tip: "I'd use RMSE when large errors are particularly bad (e.g., predicting hospital demand), and MAE when all errors matter equally (e.g., pricing)."

Classification Metrics

Confusion Matrix Foundation:

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |

| Metric | Formula | Range | Focus |
| --- | --- | --- | --- |
| Accuracy | (TP + TN) / Total | [0, 1] | Overall correctness |
| Precision | TP / (TP + FP) | [0, 1] | Of predicted positives, how many correct? |
| Recall (Sensitivity) | TP / (TP + FN) | [0, 1] | Of actual positives, how many caught? |
| Specificity | TN / (TN + FP) | [0, 1] | Of actual negatives, how many caught? |
| F1 Score | 2 · (P · R) / (P + R) | [0, 1] | Harmonic mean of precision & recall |
| F-beta | (1+β²) · (P·R) / (β²·P + R) | [0, 1] | Weighted F1 (β>1 favors recall) |
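A sketch computing the core metrics above with scikit-learn on toy labels:

```python
# Confusion matrix plus the headline metrics for a toy binary problem.
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))
```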

Advanced Metrics:

| Metric | Purpose | When to Use |
| --- | --- | --- |
| ROC-AUC | Area under ROC curve | Binary classification, balanced classes |
| PR-AUC | Area under Precision-Recall curve | Imbalanced classes (better than ROC-AUC) |
| Log Loss | −Σ(y·log(ŷ) + (1−y)·log(1−ŷ)) | Probabilistic predictions |
| Cohen's Kappa | Accounts for chance agreement | Inter-rater reliability |
| MCC | Matthews Correlation Coefficient | Balanced measure for imbalanced data |

Precision vs Recall Trade-off: Raising the classification threshold makes the model more conservative about predicting positive, which increases precision but lowers recall; lowering the threshold does the opposite. The precision-recall curve traces this trade-off across all thresholds.

Metric Selection Guide:

| Scenario | Preferred Metric | Reason |
| --- | --- | --- |
| Balanced classes | Accuracy, F1, ROC-AUC | All classes equally represented |
| Imbalanced classes | Precision, Recall, PR-AUC, F1 | Accuracy is misleading |
| False positives costly | Precision | Minimize FP (e.g., spam filter) |
| False negatives costly | Recall | Minimize FN (e.g., cancer detection) |
| Need probability calibration | Log Loss, Brier Score | Evaluate probability quality |
| Multi-class | Macro/Micro F1, Weighted F1 | Handles multiple classes |

Common Interview Questions:

  1. "When would you use accuracy?"

    • Only when classes are balanced. Otherwise, it's misleading (e.g., 99% accuracy on 99% negative class).

  2. "Precision vs Recall - which is more important?"

    • Depends on cost of errors:

      • Precision: When false positives are expensive (spam detection)

      • Recall: When false negatives are expensive (disease screening)

      • F1: When you need balance

  3. "Why use ROC-AUC vs PR-AUC?"

    • ROC-AUC: Balanced datasets

    • PR-AUC: Imbalanced datasets (focuses on positive class performance)

Handling Imbalanced Data

Techniques:

  1. Resampling:

    • Oversample minority class (SMOTE, ADASYN)

    • Undersample majority class

    • Combined approaches

  2. Algorithmic:

    • Class weights (penalize minority misclassification more)

    • Anomaly detection (treat minority as anomaly)

    • Ensemble methods (balanced bagging)

  3. Metric Selection:

    • Use Precision, Recall, F1, PR-AUC instead of Accuracy

    • Focus on confusion matrix
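A sketch of the first two technique families, assuming scikit-learn and, for SMOTE, the imbalanced-learn package:

```python
# Two imbalance fixes: class weights (no data change) and SMOTE oversampling.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Option 1: penalize minority-class mistakes more heavily during fitting.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: synthesize new minority samples, then train on the balanced set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(f"positives before: {y.mean():.1%}, after SMOTE: {y_res.mean():.1%}")
```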

Deep Learning

Deep learning uses neural networks with multiple layers to learn hierarchical feature representations from data.

Core Components

1. Neural Network Layers

| Layer Type | Purpose | Common Use |
| --- | --- | --- |
| Dense (Fully Connected) | Learns complex non-linear relationships | MLPs, final classification layers |
| Convolutional (Conv) | Extracts spatial features | Image processing, CNNs |
| Recurrent (RNN, LSTM, GRU) | Handles sequential data | Time series, text (legacy) |
| Attention/Transformer | Captures long-range dependencies | Modern NLP, vision transformers |
| Pooling | Downsamples feature maps | Dimensionality reduction in CNNs |
| Dropout | Regularization via random neuron deactivation | Preventing overfitting |
| Batch/Layer Normalization | Stabilizes training | Faster convergence, better performance |
| Embedding | Converts discrete tokens to vectors | NLP, categorical features |

2. Activation Functions

| Function | Formula | Range | Pros | Cons |
| --- | --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Fast, no vanishing gradient | Dead ReLU problem |
| Leaky ReLU | max(αx, x) | (−∞, ∞) | Fixes dead ReLU | Needs tuning α |
| GELU | x·Φ(x) | (−∞, ∞) | Smooth, state-of-art | Slower computation |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Output probabilities | Vanishing gradient |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (−1, 1) | Zero-centered | Vanishing gradient |
| Softmax | eˣⁱ/Σeˣʲ | (0, 1) | Multi-class probabilities | Used only in output layer |

Interview Question: "Why is ReLU preferred over sigmoid/tanh?"

  • Mitigates vanishing gradient problem

  • Faster computation (simple thresholding)

  • Encourages sparse activations (biological plausibility)

Dead ReLU Problem: When a ReLU unit's input is always negative, it outputs 0 and its gradient is 0, so it stops learning permanently. Solutions: Leaky ReLU, careful weight initialization, lower learning rate.

3. Loss Functions

| Loss Function | Formula | Use Case |
| --- | --- | --- |
| MSE | (1/n) Σ(y−ŷ)² | Regression |
| MAE | (1/n) Σ\|y−ŷ\| | Regression (robust to outliers) |
| Binary Cross-Entropy | −[y·log(ŷ) + (1−y)·log(1−ŷ)] | Binary classification |
| Categorical Cross-Entropy | −Σy·log(ŷ) | Multi-class classification |
| Sparse Categorical CE | Same, but with integer labels | Multi-class with many classes |
| Hinge Loss | max(0, 1−y·ŷ) | SVMs, margin-based learning |
| Huber Loss | Hybrid MSE/MAE | Robust regression |

4. Optimizers

| Optimizer | Key Feature | When to Use |
| --- | --- | --- |
| SGD | Basic gradient descent | Simple problems, with momentum |
| SGD + Momentum | Accumulates gradients | Faster convergence, escapes local minima |
| Adam | Adaptive learning rates per parameter | Default choice, works well generally |
| AdamW | Adam with decoupled weight decay | Better regularization than Adam |
| RMSprop | Adapts learning rate using moving average | RNNs, non-stationary problems |
| AdaGrad | Per-parameter learning rates | Sparse data |

Interview Tip: "I'd start with Adam or AdamW for most problems. For fine-tuning, SGD with momentum often gives better final performance."

Modern Architectures

Computer Vision

| Architecture | Year | Key Innovation | Use Case |
| --- | --- | --- | --- |
| LeNet | 1998 | First CNN | Digit recognition |
| AlexNet | 2012 | Deep CNN, ReLU, Dropout | ImageNet breakthrough |
| VGG | 2014 | Very deep (16-19 layers) | Feature extraction |
| ResNet | 2015 | Skip connections, 152 layers | Solves vanishing gradient |
| Inception | 2015 | Multiple filter sizes in parallel | Efficient multi-scale features |
| MobileNet | 2017 | Depthwise separable convolutions | Mobile/edge devices |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | SOTA efficiency |
| Vision Transformer (ViT) | 2020 | Transformers for images | Current SOTA, large datasets |

Natural Language Processing

| Architecture | Year | Key Innovation | Use Case |
| --- | --- | --- | --- |
| Word2Vec | 2013 | Word embeddings | Pre-trained embeddings |
| GloVe | 2014 | Global word vectors | Pre-trained embeddings |
| LSTM/GRU | 1997/2014 | Handles long sequences | Seq-to-seq (legacy) |
| Transformer | 2017 | Self-attention, parallel processing | Foundation of modern NLP |
| BERT | 2018 | Bidirectional transformer encoder | Text understanding, classification |
| GPT | 2018 | Autoregressive transformer decoder | Text generation |
| T5 | 2019 | Text-to-text framework | Unified NLP tasks |
| GPT-3/4 | 2020/2023 | Massive scale (175B+ params) | Few-shot learning, general tasks |

Generative Models

| Model Type | How It Works | Use Case |
| --- | --- | --- |
| GAN | Generator vs Discriminator | Image generation, style transfer |
| VAE | Encoder-decoder with latent space | Dimensionality reduction, generation |
| Diffusion Models | Iterative denoising process | DALL-E, Stable Diffusion, Midjourney |
| Autoregressive | Predicts next token sequentially | GPT, language models |

Training Techniques

Preventing Overfitting

  1. Dropout (p=0.2-0.5)

    • Randomly deactivate neurons during training

    • Forces network to learn robust features

  2. Batch Normalization

    • Normalizes layer inputs

    • Allows higher learning rates, faster training

  3. Layer Normalization

    • Normalizes across features (better for transformers)

  4. Weight Decay / L2 Regularization

    • Adds penalty to large weights

  5. Early Stopping

    • Stop training when validation loss stops improving

  6. Data Augmentation

    • Vision: rotation, flipping, cropping, color jitter

    • NLP: back-translation, synonym replacement

Optimization Techniques

  1. Learning Rate Scheduling

    • Step decay: Reduce LR at intervals

    • Exponential decay: Gradual reduction

    • Cosine annealing: Smooth periodic reduction

    • Warm-up: Start low, increase, then decay

  2. Gradient Clipping

    • Prevents exploding gradients (critical for RNNs)

    • Clip by value or by norm (see the PyTorch sketch after this list)

  3. Mixed Precision Training

    • Use FP16 for speed, FP32 for stability

    • Reduces memory, speeds up training
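A minimal PyTorch sketch of learning-rate scheduling plus gradient clipping (PyTorch assumed installed; the model and batches are placeholders):

```python
# Cosine-annealed learning rate plus gradient clipping inside a training loop.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip by norm
    opt.step()
    sched.step()  # smoothly decays the learning rate toward zero
```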

Transfer Learning & Fine-tuning

Transfer Learning:
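A minimal sketch with PyTorch/torchvision (assumed installed; the 10-class head and the weights tag are illustrative):

```python
# Freeze a pre-trained backbone, replace the classification head.
import torch.nn as nn
from torchvision import models

base = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained on ImageNet
for param in base.parameters():
    param.requires_grad = False                   # freeze the whole backbone
base.fc = nn.Linear(base.fc.in_features, 10)      # new 10-class head, trainable
# Step 1: train only base.fc; steps 2-3: unfreeze top blocks, use a very low LR.
```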

Fine-tuning Strategy:

  1. Train custom head with frozen base (few epochs)

  2. Unfreeze top layers of base

  3. Train end-to-end with very low learning rate

When to Use:

  • Limited training data

  • Similar domain to pre-trained model

  • Faster training than from scratch

Common Interview Topics

Vanishing/Exploding Gradients

Vanishing Gradients:

  • Problem: Gradients become very small in early layers → no learning

  • Causes: Deep networks with sigmoid/tanh activations

  • Solutions:

    • Use ReLU activations

    • Batch normalization

    • ResNet skip connections

    • Better weight initialization (Xavier, He)

Exploding Gradients:

  • Problem: Gradients become very large → unstable training

  • Causes: Deep networks, especially RNNs

  • Solutions:

    • Gradient clipping

    • Lower learning rate

    • Batch normalization

Batch vs Layer Normalization

| Aspect | Batch Norm | Layer Norm |
| --- | --- | --- |
| Normalizes | Across batch dimension | Across feature dimension |
| Best For | CNNs, large batches | RNNs, Transformers, small batches |
| Training/Inference | Different behavior | Same behavior |
| Batch Size Dependency | Yes (needs large batches) | No (works with batch=1) |

Common Interview Questions

  1. "Why do we need activation functions?"

    • Without them, stacked linear layers = single linear layer (no expressiveness)

    • Introduce non-linearity to learn complex patterns

  2. "Explain backpropagation in simple terms"

    • Forward pass: compute predictions

    • Compute loss

    • Backward pass: use chain rule to compute gradients

    • Update weights using optimizer

  3. "How does attention work?"

    • Learns to focus on relevant parts of input

    • Query, Key, Value mechanism

    • Attention(Q,K,V) = softmax(QKᵀ/√d)V (see the NumPy sketch after this list)

  4. "Why are transformers better than RNNs?"

    • Parallelizable (RNNs are sequential)

    • Better at capturing long-range dependencies

    • No vanishing gradient problem

    • Scales better with data and compute
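A single-head NumPy sketch of the attention formula from question 3 (shapes are illustrative):

```python
# Scaled dot-product attention: softmax(QKᵀ/√d)V for a single head.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]                                   # query/key dimension
    scores = Q @ K.T / np.sqrt(d)                     # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

Q = np.random.rand(4, 8)    # 4 query positions, dimension 8
K = np.random.rand(6, 8)    # 6 key positions
V = np.random.rand(6, 16)   # one value per key
print(attention(Q, K, V).shape)  # -> (4, 16)
```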

Hardware & Scalability

Training Considerations:

  • GPUs: Parallel matrix operations (NVIDIA A100, H100)

  • TPUs: Google's custom chips for tensor operations

  • Batch Size: Limited by GPU memory (use gradient accumulation for large effective batches)

  • Mixed Precision: FP16 training with FP32 master weights

  • Distributed Training: Data parallelism, model parallelism

Inference Optimization:

  • Model quantization (INT8, INT4)

  • Pruning (remove unnecessary weights)

  • Knowledge distillation (train smaller model from large model)

  • ONNX runtime, TensorRT for fast inference

Model Development Best Practices

Model Debugging Checklist

Model Underperforming:

  1. Check Data Quality

    • Missing values handled correctly?

    • Outliers detected and addressed?

    • Data leakage (test data bleeding into training)?

    • Feature scaling applied consistently?

    • Class imbalance addressed?

  2. Feature Engineering

    • Relevant features included?

    • Feature interactions captured?

    • Domain knowledge incorporated?

    • Feature importance analysis done?

  3. Model Complexity

    • Is model too simple (high bias)?

    • Is model too complex (high variance)?

    • Try different algorithm families

  4. Hyperparameters

    • Learning rate appropriate?

    • Regularization strength tuned?

    • Tree depth / number of estimators optimized?

    • Use grid search or random search

Model Overfitting (High Variance):

  • Collect more training data

  • Reduce model complexity

  • Increase regularization (L1/L2, dropout)

  • Feature selection / dimensionality reduction

  • Cross-validation

  • Early stopping

  • Data augmentation

  • Ensemble methods (bagging)

Model Underfitting (High Bias):

  • Use more complex model

  • Add more features / feature engineering

  • Reduce regularization

  • Train longer

  • Remove noise from data

  • Ensemble methods (boosting)

Production Considerations

Model Serving

Batch Inference:

  • Process large datasets offline

  • Higher throughput; relaxed latency requirements

  • Can use complex models

  • Examples: Daily recommendations, weekly reports

Real-time Inference:

  • Low latency required (ms to seconds)

  • Request-response pattern

  • Model optimization critical

  • Examples: Search ranking, fraud detection

Serving Infrastructure:
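A minimal real-time serving sketch using FastAPI with a pickled scikit-learn model (both assumed available; the artifact path and request schema are hypothetical):

```python
# Minimal request-response model server; run with: uvicorn serve:app
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("model.pkl", "rb") as f:  # hypothetical serialized model artifact
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]             # flat feature vector for one instance

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```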

Model Monitoring & Drift Detection

Data Drift:

  • Input feature distributions change over time

  • Detection: PSI (Population Stability Index), KL divergence (PSI sketch below)

  • Solution: Retrain model with recent data
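A NumPy sketch of PSI, as referenced above (the 10-bin setup and the 0.2 alert threshold are common conventions, not universal standards):

```python
# PSI between a training (expected) and live (actual) feature distribution.
import numpy as np

def psi(expected, actual, bins=10):
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid division by / log of zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

train_feature = np.random.normal(0.0, 1.0, 10_000)
live_feature = np.random.normal(0.3, 1.0, 10_000)   # drifted distribution
print(f"PSI = {psi(train_feature, live_feature):.3f}")  # > 0.2 often flags drift
```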

Concept Drift:

  • Relationship between features and target changes

  • Detection: Monitor model performance metrics

  • Solution: Retrain model, feature engineering

Monitoring Metrics:

  • Model accuracy/precision/recall (decay over time?)

  • Prediction distribution

  • Feature distributions

  • Latency and throughput

  • Error rates

Alerting Thresholds:

  • Accuracy drops >5%

  • Prediction drift >10%

  • Latency increases >2x

  • Error rate >1%

A/B Testing

Purpose: Validate new model performs better than current production model

Setup:

  1. Split traffic (e.g., 90% control, 10% treatment)

  2. Monitor key metrics (conversion rate, CTR, revenue)

  3. Statistical significance testing

  4. Gradual rollout if successful
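A sketch of step 3 using a two-proportion z-test (assumes statsmodels; counts are illustrative):

```python
# Two-proportion z-test on conversion counts from control vs treatment.
from statsmodels.stats.proportion import proportions_ztest

conversions = [450, 62]    # successes in control, treatment
traffic = [9000, 1000]     # 90/10 traffic split
z_stat, p_value = proportions_ztest(conversions, traffic)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 → promote, then roll out gradually
```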

Key Metrics:

  • Business metrics (revenue, engagement)

  • Model metrics (accuracy, latency)

  • User experience metrics

Model Versioning & Reproducibility

Essential Practices:

  • Version control for code (Git)

  • Track data versions (DVC, Delta Lake)

  • Log hyperparameters (MLflow, Weights & Biases)

  • Save model artifacts with metadata

  • Docker containers for reproducible environments

  • Random seeds for reproducibility

MLflow Example:
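A minimal tracking sketch, assuming MLflow is installed (run name, parameters, and dataset are illustrative):

```python
# Log parameters, a metric, and the model artifact for one training run.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 8, "random_state": 42}
    model = RandomForestClassifier(**params).fit(X_tr, y_tr)
    mlflow.log_params(params)                            # hyperparameters
    mlflow.log_metric("test_accuracy", model.score(X_te, y_te))
    mlflow.sklearn.log_model(model, "model")             # versioned artifact
```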

Common Interview Questions & Answers

Conceptual Questions

1. "Explain bias-variance trade-off with an example"

"Bias-variance trade-off is about balancing model complexity. For example, if I use linear regression to fit a quadratic relationship, I'll have high bias because the model is too simple to capture the curve—it will consistently underfit. If I use a 20-degree polynomial, I'll have high variance because the model is too flexible and will fit noise in the training data—predictions will vary drastically with different training sets. The goal is to find the sweet spot, perhaps a quadratic model, that minimizes total error."

2. "How would you handle imbalanced data?"

"I'd approach it in three ways: First, use appropriate metrics like precision, recall, F1, and PR-AUC instead of accuracy. Second, apply resampling techniques like SMOTE for oversampling the minority class or undersampling the majority class. Third, use algorithmic approaches like class weights to penalize misclassification of the minority class more heavily. The choice depends on whether I have enough data and the cost of false positives vs false negatives."

3. "When would you use Random Forest vs XGBoost?"

"Random Forest is my go-to for a robust baseline—it's less prone to overfitting, requires minimal tuning, and handles outliers well. XGBoost is what I'd use when I need the highest possible accuracy on tabular data and am willing to invest time in hyperparameter tuning. XGBoost is generally more accurate but can overfit with poor tuning. Random Forest is more forgiving and often good enough. For production, I'd also consider LightGBM for faster training and inference."

4. "Explain L1 vs L2 regularization"

"Both add penalties to the loss function to prevent overfitting, but differ in how. L2 (Ridge) adds the sum of squared weights, which shrinks all weights but doesn't zero them out—good when all features contribute. L1 (Lasso) adds the sum of absolute weights, which can drive some weights exactly to zero, effectively performing feature selection—ideal when you suspect many features are irrelevant. For correlated features, L2 is more stable, while L1 is better for interpretability."

5. "How does dropout work and why is it effective?"

"Dropout randomly deactivates a percentage of neurons during each training iteration. This prevents the network from relying too heavily on any specific neuron, forcing it to learn robust, distributed representations. It's effectively like training an ensemble of different network architectures simultaneously. At inference, we use all neurons but scale their outputs appropriately. It's particularly effective for preventing overfitting in deep neural networks."

Practical Questions

6. "Your model has 99% accuracy but stakeholders are unhappy. What's wrong?"

"This is likely a class imbalance problem. If 99% of data belongs to the negative class, a model that always predicts negative gets 99% accuracy but provides zero value. I'd check the confusion matrix and look at precision, recall, and F1 score for each class. I'd then address the imbalance using techniques like class weights, SMOTE, or anomaly detection, and optimize for the metric that matters to the business—likely recall if false negatives are costly, or precision if false positives are costly."

7. "How would you detect and handle overfitting in production?"

"I'd implement monitoring to track model performance metrics over time. If validation/test accuracy was high during development but production performance degrades, that's a sign. I'd monitor prediction distributions, feature distributions, and error rates. To handle it, I'd first check for data drift. Then I'd consider retraining with more recent data, increasing regularization, collecting more training data, or simplifying the model. I'd also implement A/B testing before fully deploying any changes."

8. "Walk me through how you'd approach a new ML problem"

"First, I'd understand the business problem and define success metrics. Second, I'd do exploratory data analysis to understand distributions, missing values, and relationships. Third, I'd establish a simple baseline (like mean prediction or logistic regression). Fourth, I'd engineer relevant features and try progressively complex models (e.g., Linear → Random Forest → XGBoost). Fifth, I'd use cross-validation for model selection and tune hyperparameters. Finally, I'd evaluate on a hold-out test set and, if satisfactory, deploy with monitoring and A/B testing."

9. "How do you choose between deep learning and traditional ML?"

"I consider four factors: data size, data type, interpretability needs, and resources. For tabular data with < 100K samples, traditional ML (XGBoost, Random Forest) usually wins—it's faster, interpretable, and performs well. For unstructured data (images, text, audio) or when I have millions of samples, deep learning excels. If interpretability is critical (healthcare, finance), I'd prefer traditional ML or use interpretability techniques. Finally, if resources are limited (compute, time, expertise), traditional ML is more practical."

10. "Explain how you'd improve a model that's already performing well"

"I'd look at several areas: First, error analysis—examine misclassified examples to identify patterns and engineer targeted features. Second, ensemble methods—combine multiple models or try stacking. Third, advanced feature engineering—create interaction terms, polynomial features, or domain-specific features. Fourth, hyperparameter optimization—use Bayesian optimization or genetic algorithms. Fifth, get more data, especially for edge cases. Finally, try neural architecture search or AutoML if I have the resources. I'd always balance improvement against complexity and maintenance costs."
