Introduction to AI
Artificial Intelligence
Definition: Artificial Intelligence (AI) is the field of computer science focused on creating systems capable of performing tasks that typically require human intelligence, such as learning, reasoning, problem-solving, perception, and language understanding.
Core Learning Paradigms
AI systems learn patterns and make decisions through various approaches:
Statistical Machine Learning - Uses mathematical and probabilistic methods to find patterns in data:
Creates decision boundaries, hyperplanes, or hierarchical splits to divide data
Examples: Linear/logistic regression, SVMs, decision trees, random forests
Deep Learning - Uses multi-layered artificial neural networks, loosely inspired by the brain, to learn complex patterns through iterative optimization:
Learns hierarchical feature representations automatically
Excels at unstructured data (images, text, audio)
Examples: CNNs, RNNs, Transformers
Generative AI - Creates new data samples after learning from existing data:
Examples: GANs, VAEs, Diffusion Models (DALL-E, Stable Diffusion)
Reinforcement Learning - Learns through interaction with an environment via rewards and penalties:
Agent learns optimal actions through trial and error
Examples: Game playing (AlphaGo), robotics, recommendation systems
Key Challenges in Machine Learning
Data Representation - How to encode and feed data to ML models (feature engineering, embeddings)
Performance Monitoring - Tracking model progress during training (loss curves, validation metrics)
Generalization - Ensuring models learn the right patterns and perform well on unseen data
Interpretability - Understanding how models make decisions
Scalability - Handling large datasets and deploying models in production
Machine Learning vs Deep Learning
| Aspect | Machine Learning | Deep Learning |
| --- | --- | --- |
| Approach | Statistical and probabilistic methods | Neural networks with multiple hidden layers |
| Data Requirements | Works well with smaller datasets (1K-100K samples) | Requires large datasets (100K-1M+ samples) |
| Computation | Lower computational cost, can run on CPUs | High computational cost, requires GPUs/TPUs |
| Feature Engineering | Manual feature engineering required | Automatic feature learning |
| Training Time | Minutes to hours | Hours to days/weeks |
| Interpretability | High (e.g., decision trees, linear models) | Low (black-box models) |
| Use Cases | Tabular data, structured problems, limited data | Images, text, audio, unstructured data |
| Performance Scaling | Plateaus with more data | Improves with more data and model size |
| Hardware Needs | Standard CPU sufficient | GPUs/TPUs often necessary |
When to Use Machine Learning:
Small to medium-sized datasets
Structured/tabular data
Need for model interpretability
Limited computational resources
Quick iteration and deployment needed
When to Use Deep Learning:
Large datasets available
Unstructured data (images, text, audio, video)
Complex pattern recognition required
Performance is priority over interpretability
Sufficient computational resources available
Machine Learning Fundamentals
Bias-Variance Trade-off
One of the most important concepts in machine learning; understanding it is critical for interviews.
Mathematical Foundation:
The expected prediction error can be decomposed as:
Expected Error = Bias² + Variance + Irreducible Error (σ²)
Definitions:
Bias - Error from incorrect assumptions in the learning algorithm
Measures how far off the average model prediction is from the true value
High Bias → Underfitting - Model is too simple, fails to capture underlying patterns
Example: Using linear regression for non-linear data
Variance - Error from sensitivity to small fluctuations in training data
Measures how much predictions vary across different training sets
High Variance → Overfitting - Model learns noise in training data
Example: Deep decision tree memorizing training data
Irreducible Error - Noise in the data that cannot be reduced (σ²)
The Trade-off:
Reducing bias typically increases variance (more complex model)
Reducing variance typically increases bias (simpler model)
Goal: Find the sweet spot that minimizes total error
Factors Affecting Bias and Variance:
| Factor | Effect on Bias | Effect on Variance |
| --- | --- | --- |
| More training data | No change | Decreases ↓ |
| More features | Decreases ↓ | Increases ↑ |
| More complex model | Decreases ↓ | Increases ↑ |
| Regularization (↑λ) | Increases ↑ | Decreases ↓ |
| Feature selection | May increase ↑ | Decreases ↓ |
| Ensemble methods | Decreases ↓ | Decreases ↓ |
How to Reduce High Bias (Underfitting):
Use more complex model (e.g., polynomial features, deeper network)
Add more relevant features
Decrease regularization (reduce λ)
Remove noise from data
Increase model capacity
How to Reduce High Variance (Overfitting):
Collect more training data
Feature selection / dimensionality reduction
Increase regularization (increase λ)
Use ensemble methods (bagging)
Cross-validation
Early stopping (for neural networks)
Dropout (for neural networks)
Interview Tip: Be ready to explain with a concrete example: "If I use linear regression on data with a quadratic relationship, I'll have high bias because the model can't capture the curve. If I use a 20-degree polynomial, I'll have high variance because it'll fit every noise point in my training data."
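To make that example concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the synthetic dataset and the polynomial degrees are illustrative) contrasting an underfit, a well-fit, and an overfit model:

```python
# Hedged sketch: compare train/test error for degree 1 (high bias), 2 (good fit), 20 (high variance).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=200)          # quadratic signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 20):                               # underfit, good fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_test, model.predict(X_test)))
```

The degree-20 model typically shows a much lower training error than test error, which is the overfitting signature described above.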
Regularization Techniques
Regularization adds a penalty term to the loss function to prevent overfitting:
L1 Regularization (Lasso): adds λ Σ|wᵢ| to the loss
Produces sparse models (some weights become exactly 0)
Useful for feature selection
Less stable with correlated features
L2 Regularization (Ridge): adds λ Σwᵢ² to the loss
Shrinks all weights but doesn't zero them out
More stable with correlated features
Preferred when all features are relevant
Elastic Net: adds a weighted mix of the L1 and L2 penalties
Combines L1 and L2
Handles correlated features better than Lasso
Interview Question: "When would you use L1 vs L2 regularization?"
L1: When you suspect many features are irrelevant and want feature selection
L2: When you believe most features contribute and want to avoid instability
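A minimal sketch of the three penalties in scikit-learn (the alpha and l1_ratio values are illustrative, not tuned):

```python
# Hedged sketch: Ridge (L2), Lasso (L1), and Elastic Net on a synthetic regression problem.
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=50, n_informative=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                      # shrinks all weights
lasso = Lasso(alpha=1.0).fit(X, y)                      # drives some weights exactly to 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # mix of both penalties

print("non-zero weights:",
      (ridge.coef_ != 0).sum(), (lasso.coef_ != 0).sum(), (enet.coef_ != 0).sum())
```

On data with many uninformative features, Lasso usually keeps far fewer non-zero weights than Ridge, which is the feature-selection effect noted above.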
Train/Validation/Test Split
Purpose: Properly evaluate model performance and prevent overfitting
Standard Split Ratios:
Training Set (60-80%): Used to fit the model
Validation Set (10-20%): Used for hyperparameter tuning and model selection
Test Set (10-20%): Used ONLY for final evaluation
Best Practices:
Stratification: Maintain class distribution in splits for imbalanced data
Time-based split: For time series, always split chronologically
Never use test data during development
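A minimal sketch of a stratified 60/20/20 split with scikit-learn (the two-step split is one common way to carve out a validation set; the dataset is illustrative):

```python
# Hedged sketch: train/validation/test split that preserves class proportions.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# 80% train+val, 20% test, preserving class proportions
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# split the remaining 80% into 60% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)
```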
Cross-Validation
Purpose: Better utilize limited data and get more reliable performance estimates
K-Fold Cross-Validation:
Split data into K folds
Train on K-1 folds, validate on 1 fold
Repeat K times, each fold used as validation once
Average results across all folds
Common Variants:
Stratified K-Fold: Maintains class distribution (use for classification)
Time Series Split: Respects temporal order
Leave-One-Out (LOO): K = number of samples (expensive but thorough)
When to Use:
Small datasets (to maximize training data usage)
Model comparison (more reliable than single split)
Hyperparameter tuning (use nested CV)
Interview Tip: Mention that CV is computationally expensive (K times the cost) but gives more robust estimates.
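A minimal sketch of stratified 5-fold cross-validation (model, scoring metric, and dataset are illustrative):

```python
# Hedged sketch: stratified K-fold CV returns one score per fold; report mean and spread.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())
```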
Ensemble Methods
Combine multiple models to improve performance:
Bagging (Bootstrap Aggregating):
Reduces variance
Trains multiple models on random subsets of data
Averages predictions (regression) or votes (classification)
Example: Random Forest
Boosting:
Reduces bias
Sequentially trains weak learners, each focusing on previous errors
Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost
Stacking:
Trains a meta-model on predictions of base models
Can combine diverse model types
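A minimal sketch of all three ideas with scikit-learn estimators (hyperparameters are defaults or illustrative values):

```python
# Hedged sketch: bagging (Random Forest), boosting (Gradient Boosting), and stacking with a meta-learner.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

bagging = RandomForestClassifier(n_estimators=200, random_state=0)      # variance reduction
boosting = GradientBoostingClassifier(random_state=0)                   # bias reduction
stacking = StackingClassifier(
    estimators=[("rf", bagging), ("gb", boosting)],
    final_estimator=LogisticRegression(max_iter=1000))                  # meta-learner on base predictions

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```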
---
Machine Learning Algorithms
Supervised Learning
Supervised learning uses labeled data (input-output pairs) to learn a mapping function.
Regression Algorithms (Continuous Output)
| Algorithm | Time Complexity | Key Assumptions | When to Use |
| --- | --- | --- | --- |
| Linear Regression | O(n·p²) | Linear relationship, independent features | Quick baseline, interpretability needed |
| Ridge/Lasso | O(n·p²) | Same as linear + many features | High-dimensional data, feature selection |
| Decision Tree | O(n·log(n)·p) | None | Non-linear relationships, interpretability |
| Random Forest | O(n·log(n)·p·t) | None | General purpose, handles non-linearity |
| XGBoost/LightGBM | O(n·log(n)·p·t) | None | Structured data, need highest accuracy |
| SVR | O(n²·p) to O(n³·p) | Kernel-dependent | Small-medium datasets, non-linear |
| KNN | O(n·p) per prediction | Similar instances close together | Small datasets, simple baseline |

n = samples, p = features, t = trees
Classification Algorithms (Categorical Output)
| Algorithm | Time Complexity | Key Strengths | Common Issues |
| --- | --- | --- | --- |
| Logistic Regression | O(n·p) | Fast, interpretable, probability outputs | Assumes linear decision boundary |
| Naive Bayes | O(n·p) | Fast, works well with small data | Assumes feature independence |
| Decision Tree | O(n·log(n)·p) | Interpretable, handles non-linearity | Prone to overfitting |
| Random Forest | O(n·log(n)·p·t) | Robust, handles imbalanced data | Less interpretable, slower |
| XGBoost | O(n·log(n)·p·t) | State-of-the-art on tabular data | Requires tuning, can overfit |
| LightGBM | O(n·p·t) | Faster than XGBoost, good for large data | Can overfit on small data |
| CatBoost | O(n·log(n)·p·t) | Handles categorical features well | Slower training |
| SVM | O(n²·p) to O(n³·p) | Effective in high dimensions | Slow on large datasets |
| KNN | O(n·p) | Simple, no training time | Slow predictions, curse of dimensionality |
Ensemble Methods:
Bagging (Random Forest): Reduces variance, parallel training
Boosting (XGBoost, AdaBoost, GBM): Reduces bias, sequential training
Stacking: Combines diverse models with meta-learner
Algorithm Selection Guide
Start with these baselines:
Regression: Linear Regression → Random Forest → XGBoost
Classification: Logistic Regression → Random Forest → XGBoost
Choose based on constraints:
Need interpretability: Linear/Logistic Regression, Decision Tree
Have high-dimensional data: Lasso, Ridge, Random Forest
Have categorical features: CatBoost, LightGBM
Limited data: Naive Bayes, Regularized Linear Models
Need fast predictions: Linear models, Naive Bayes
Maximum accuracy on tabular data: XGBoost, LightGBM, CatBoost
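A minimal sketch of the baseline progression above (logistic regression, then random forest, then XGBoost). xgboost is a separate third-party package; if it is unavailable, GradientBoostingClassifier is a reasonable stand-in, and all hyperparameters here are illustrative:

```python
# Hedged sketch: compare the three standard baselines with cross-validated ROC-AUC.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "xgboost": XGBClassifier(n_estimators=300, learning_rate=0.1),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```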
Unsupervised Learning
Unsupervised learning finds patterns in unlabeled data.
Clustering Algorithms
| Algorithm | Time Complexity | Best For | Limitations |
| --- | --- | --- | --- |
| K-Means | O(n·k·i·p) | Spherical clusters, large datasets | Must specify k, sensitive to outliers |
| DBSCAN | O(n·log(n)) | Arbitrary shapes, outlier detection | Struggles with varying densities |
| Hierarchical | O(n²·log(n)) | Dendrograms, unknown k | Slow on large data |
| GMM | O(n·k·i·p) | Soft clustering, probabilistic | Assumes Gaussian distributions |

k = clusters, i = iterations
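A minimal sketch of K-Means and DBSCAN on synthetic data (cluster count, eps, and min_samples are illustrative choices, not recommendations):

```python
# Hedged sketch: fit two clustering algorithms and score the K-Means result with silhouette.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # label -1 marks outliers

print("k-means silhouette:", silhouette_score(X, kmeans_labels))
```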
Dimensionality Reduction
| Algorithm | Type | Preserves | Use Case |
| --- | --- | --- | --- |
| PCA | Linear | Global structure, variance | Visualization, noise reduction |
| t-SNE | Non-linear | Local structure | Visualization (2D/3D) |
| UMAP | Non-linear | Local + global structure | Visualization, faster than t-SNE |
| LDA | Supervised | Class separability | Feature extraction for classification |
| Autoencoders | Non-linear | Learned features | Complex non-linear reductions |
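A minimal PCA sketch (the 95% variance target is a common but arbitrary choice; the dataset is illustrative):

```python
# Hedged sketch: standardize, then keep enough principal components to explain 95% of the variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)        # PCA is scale-sensitive

pca = PCA(n_components=0.95)                        # fraction -> choose components automatically
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], "->", X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```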
Evaluation Metrics
Regression Metrics
| Metric | Formula | Range | When to Use |
| --- | --- | --- | --- |
| MAE | (1/n) Σ\|y − ŷ\| | [0, ∞) | Robust to outliers |
| MSE | (1/n) Σ(y − ŷ)² | [0, ∞) | Penalizes large errors |
| RMSE | √MSE | [0, ∞) | Same units as target |
| R² | 1 − (SS_res / SS_tot) | (−∞, 1] | Explains variance, 1 is perfect |
| Adjusted R² | 1 − [(1 − R²)(n − 1)/(n − p − 1)] | (−∞, 1] | Accounts for # of features |
| MAPE | (100/n) Σ\|(y − ŷ)/y\| | [0, ∞) | Percentage errors; avoid when y is near 0 |
Key Insights:
MAE: Less sensitive to outliers (linear penalty)
MSE/RMSE: More sensitive to outliers (quadratic penalty)
R²: Can be negative if model is worse than mean baseline
MAPE: Not suitable when y can be close to 0
Interview Tip: "I'd use RMSE when large errors are particularly bad (e.g., predicting hospital demand), and MAE when all errors matter equally (e.g., pricing)."
Classification Metrics
Confusion Matrix Foundation:

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
| Metric | Formula | Range | Focus |
| --- | --- | --- | --- |
| Accuracy | (TP + TN) / Total | [0, 1] | Overall correctness |
| Precision | TP / (TP + FP) | [0, 1] | Of predicted positives, how many correct? |
| Recall (Sensitivity) | TP / (TP + FN) | [0, 1] | Of actual positives, how many caught? |
| Specificity | TN / (TN + FP) | [0, 1] | Of actual negatives, how many caught? |
| F1 Score | 2 · (P · R) / (P + R) | [0, 1] | Harmonic mean of precision & recall |
| F-beta | (1 + β²) · (P·R) / (β²·P + R) | [0, 1] | Weighted F1 (β > 1 favors recall) |
Advanced Metrics:
| Metric | Purpose | When to Use |
| --- | --- | --- |
| ROC-AUC | Area under ROC curve | Binary classification, balanced classes |
| PR-AUC | Area under Precision-Recall curve | Imbalanced classes (better than ROC-AUC) |
| Log Loss | −Σ(y·log(ŷ) + (1 − y)·log(1 − ŷ)) | Probabilistic predictions |
| Cohen's Kappa | Accounts for chance agreement | Inter-rater reliability |
| MCC | Matthews Correlation Coefficient | Balanced measure for imbalanced data |
Precision vs Recall Trade-off: raising the classification threshold generally increases precision but lowers recall, and lowering it does the opposite; pick the operating point based on the relative cost of false positives and false negatives.
Metric Selection Guide:
| Scenario | Preferred Metric | Reason |
| --- | --- | --- |
| Balanced classes | Accuracy, F1, ROC-AUC | All classes equally represented |
| Imbalanced classes | Precision, Recall, PR-AUC, F1 | Accuracy is misleading |
| False positives costly | Precision | Minimize FP (e.g., spam filter) |
| False negatives costly | Recall | Minimize FN (e.g., cancer detection) |
| Need probability calibration | Log Loss, Brier Score | Evaluate probability quality |
| Multi-class | Macro/Micro F1, Weighted F1 | Handles multiple classes |
Common Interview Questions:
"When would you use accuracy?"
Only when classes are balanced. Otherwise, it's misleading (e.g., 99% accuracy on 99% negative class).
"Precision vs Recall - which is more important?"
Depends on cost of errors:
Precision: When false positives are expensive (spam detection)
Recall: When false negatives are expensive (disease screening)
F1: When you need balance
"Why use ROC-AUC vs PR-AUC?"
ROC-AUC: Balanced datasets
PR-AUC: Imbalanced datasets (focuses on positive class performance)
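A minimal sketch pulling these metrics together (model, dataset, and split are illustrative):

```python
# Hedged sketch: confusion matrix, per-class precision/recall/F1, and both AUC variants.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, average_precision_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred))              # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_te, proba))
print("PR-AUC:", average_precision_score(y_te, proba))
```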
Handling Imbalanced Data
Techniques:
Resampling:
Oversample minority class (SMOTE, ADASYN)
Undersample majority class
Combined approaches
Algorithmic:
Class weights (penalize minority misclassification more)
Anomaly detection (treat minority as anomaly)
Ensemble methods (balanced bagging)
Metric Selection:
Use Precision, Recall, F1, PR-AUC instead of Accuracy
Focus on confusion matrix
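A minimal sketch of two of the techniques above, class weights and SMOTE oversampling. SMOTE comes from the separate imbalanced-learn package, and the 95/5 class split is illustrative:

```python
# Hedged sketch: handle a 95/5 imbalance with class weights or synthetic oversampling.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Option 1: penalize minority-class mistakes more heavily
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=5000).fit(X, y)

# Option 2: synthesize new minority samples before fitting
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=5000).fit(X_res, y_res)
```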
Deep Learning
Deep learning uses neural networks with multiple layers to learn hierarchical feature representations from data.
Core Components
1. Neural Network Layers
| Layer Type | Purpose | Common Use |
| --- | --- | --- |
| Dense (Fully Connected) | Learns complex non-linear relationships | MLPs, final classification layers |
| Convolutional (Conv) | Extracts spatial features | Image processing, CNNs |
| Recurrent (RNN, LSTM, GRU) | Handles sequential data | Time series, text (legacy) |
| Attention/Transformer | Captures long-range dependencies | Modern NLP, vision transformers |
| Pooling | Downsamples feature maps | Dimensionality reduction in CNNs |
| Dropout | Regularization via random neuron deactivation | Preventing overfitting |
| Batch/Layer Normalization | Stabilizes training | Faster convergence, better performance |
| Embedding | Converts discrete tokens to vectors | NLP, categorical features |
2. Activation Functions
| Function | Formula | Range | Pros | Cons |
| --- | --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Fast, no vanishing gradient | Dead ReLU problem |
| Leaky ReLU | max(αx, x) | (−∞, ∞) | Fixes dead ReLU | Needs tuning of α |
| GELU | x·Φ(x) | (−∞, ∞) | Smooth, state-of-the-art | Slower computation |
| Sigmoid | 1/(1 + e⁻ˣ) | (0, 1) | Outputs probabilities | Vanishing gradient |
| Tanh | (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | (−1, 1) | Zero-centered | Vanishing gradient |
| Softmax | e^(xᵢ) / Σⱼ e^(xⱼ) | (0, 1) | Multi-class probabilities | Used only in output layer |
Interview Question: "Why is ReLU preferred over sigmoid/tanh?"
Mitigates vanishing gradient problem
Faster computation (simple thresholding)
Encourages sparse activations (biological plausibility)
Dead ReLU Problem: When ReLU units always output 0 (negative input), they stop learning. Solutions: Leaky ReLU, careful weight initialization, lower learning rate.
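A minimal NumPy sketch of the activations above and their gradients, showing why sigmoid saturates and why negative inputs "kill" a ReLU unit (values are illustrative):

```python
# Hedged sketch: activation values and gradients at a few inputs.
import numpy as np

def relu(x):                    return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01):  return np.where(x > 0, x, alpha * x)
def sigmoid(x):                 return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x), leaky_relu(x), sigmoid(x))

# gradients: sigmoid'(x) -> ~0 for large |x| (vanishing); relu'(x) = 0 for x < 0 (dead unit)
print(sigmoid(x) * (1 - sigmoid(x)), (x > 0).astype(float))
```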
3. Loss Functions
| Loss Function | Formula | Use Case |
| --- | --- | --- |
| MSE | (1/n) Σ(y − ŷ)² | Regression |
| MAE | (1/n) Σ\|y − ŷ\| | Regression (robust to outliers) |
| Binary Cross-Entropy | −[y·log(ŷ) + (1 − y)·log(1 − ŷ)] | Binary classification |
| Categorical Cross-Entropy | −Σ y·log(ŷ) | Multi-class classification |
| Sparse Categorical CE | Same, but with integer labels | Multi-class with many classes |
| Hinge Loss | max(0, 1 − y·ŷ) | SVMs, margin-based learning |
| Huber Loss | Hybrid of MSE and MAE | Robust regression |
4. Optimizers
| Optimizer | Key Feature | When to Use |
| --- | --- | --- |
| SGD | Basic gradient descent | Simple problems, with momentum |
| SGD + Momentum | Accumulates past gradients | Faster convergence, escapes local minima |
| Adam | Adaptive learning rates per parameter | Default choice, works well generally |
| AdamW | Adam with decoupled weight decay | Better regularization than Adam |
| RMSprop | Adapts learning rate using a moving average | RNNs, non-stationary problems |
| AdaGrad | Per-parameter learning rates | Sparse data |
Interview Tip: "I'd start with Adam or AdamW for most problems. For fine-tuning, SGD with momentum often gives better final performance."
Modern Architectures
Computer Vision
| Architecture | Year | Key Innovation | Use Case |
| --- | --- | --- | --- |
| LeNet | 1998 | First practical CNN | Digit recognition |
| AlexNet | 2012 | Deep CNN, ReLU, Dropout | ImageNet breakthrough |
| VGG | 2014 | Very deep (16-19 layers) | Feature extraction |
| ResNet | 2015 | Skip connections, 152 layers | Solves vanishing gradient |
| Inception | 2015 | Multiple filter sizes in parallel | Efficient multi-scale features |
| MobileNet | 2017 | Depthwise separable convolutions | Mobile/edge devices |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | SOTA efficiency |
| Vision Transformer (ViT) | 2020 | Transformers for images | Current SOTA, large datasets |
Natural Language Processing
| Architecture | Year | Key Innovation | Use Case |
| --- | --- | --- | --- |
| Word2Vec | 2013 | Word embeddings | Pre-trained embeddings |
| GloVe | 2014 | Global word vectors | Pre-trained embeddings |
| LSTM/GRU | 1997/2014 | Handles long sequences | Seq-to-seq (legacy) |
| Transformer | 2017 | Self-attention, parallel processing | Foundation of modern NLP |
| BERT | 2018 | Bidirectional transformer encoder | Text understanding, classification |
| GPT | 2018 | Autoregressive transformer decoder | Text generation |
| T5 | 2019 | Text-to-text framework | Unified NLP tasks |
| GPT-3/4 | 2020/2023 | Massive scale (175B+ params) | Few-shot learning, general tasks |
Generative Models
| Model Type | How It Works | Use Case |
| --- | --- | --- |
| GAN | Generator vs discriminator trained adversarially | Image generation, style transfer |
| VAE | Encoder-decoder with latent space | Dimensionality reduction, generation |
| Diffusion Models | Iterative denoising process | DALL-E, Stable Diffusion, Midjourney |
| Autoregressive | Predicts next token sequentially | GPT, language models |
Training Techniques
Preventing Overfitting
Dropout (p=0.2-0.5)
Randomly deactivate neurons during training
Forces network to learn robust features
Batch Normalization
Normalizes layer inputs
Allows higher learning rates, faster training
Layer Normalization
Normalizes across features (better for transformers)
Weight Decay / L2 Regularization
Adds penalty to large weights
Early Stopping
Stop training when validation loss stops improving
Data Augmentation
Vision: rotation, flipping, cropping, color jitter
NLP: back-translation, synonym replacement
Optimization Techniques
Learning Rate Scheduling
Step decay: Reduce LR at intervals
Exponential decay: Gradual reduction
Cosine annealing: Smooth periodic reduction
Warm-up: Start low, increase, then decay
Gradient Clipping
Prevents exploding gradients (critical for RNNs)
Clip by value or by norm
Mixed Precision Training
Use FP16 for speed, FP32 for stability
Reduces memory, speeds up training
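A minimal PyTorch sketch of two of the techniques above, cosine learning-rate annealing and gradient clipping, as they would appear inside a training loop (the LSTM, shapes, placeholder loss, and hyperparameters are illustrative):

```python
# Hedged sketch: scheduler.step() after each epoch, clip_grad_norm_ before each optimizer step.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    x = torch.randn(8, 20, 16)                       # (batch, seq_len, features)
    out, _ = model(x)
    loss = out.pow(2).mean()                         # placeholder loss for the sketch
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # guard against exploding gradients
    optimizer.step()
    scheduler.step()                                 # smoothly decay the learning rate
```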
Transfer Learning & Fine-tuning
Transfer Learning: reuse a model pre-trained on a large dataset (e.g., ImageNet or a large text corpus) as the starting point for a related task, rather than training from scratch.
Fine-tuning Strategy:
Train custom head with frozen base (few epochs)
Unfreeze top layers of base
Train end-to-end with very low learning rate
When to Use:
Limited training data
Similar domain to pre-trained model
Faster training than from scratch
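A minimal sketch of the fine-tuning strategy above, assuming a recent torchvision with pretrained weights available; the ResNet-18 base, the 10-class head, and the learning rates are illustrative:

```python
# Hedged sketch: freeze the pretrained base, train a new head, then unfreeze top layers with a low LR.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights="DEFAULT")           # pretrained base

for param in model.parameters():                     # step 1: freeze the base
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)       # new task-specific head (trainable)
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

# steps 2-3: later, unfreeze the top block and fine-tune end-to-end with a very low LR
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-5)
```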
Common Interview Topics
Vanishing/Exploding Gradients
Vanishing Gradients:
Problem: Gradients become very small in early layers → no learning
Causes: Deep networks with sigmoid/tanh activations
Solutions:
Use ReLU activations
Batch normalization
ResNet skip connections
Better weight initialization (Xavier, He)
Exploding Gradients:
Problem: Gradients become very large → unstable training
Causes: Deep networks, especially RNNs
Solutions:
Gradient clipping
Lower learning rate
Batch normalization
Batch vs Layer Normalization
| Aspect | Batch Norm | Layer Norm |
| --- | --- | --- |
| Normalizes | Across the batch dimension | Across the feature dimension |
| Best For | CNNs, large batches | RNNs, Transformers, small batches |
| Training/Inference | Different behavior | Same behavior |
| Batch Size Dependency | Yes (needs large batches) | No (works with batch size = 1) |
Common Interview Questions
"Why do we need activation functions?"
Without them, stacked linear layers = single linear layer (no expressiveness)
Introduce non-linearity to learn complex patterns
"Explain backpropagation in simple terms"
Forward pass: compute predictions
Compute loss
Backward pass: use chain rule to compute gradients
Update weights using optimizer
"How does attention work?"
Learns to focus on relevant parts of input
Query, Key, Value mechanism
Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V
"Why are transformers better than RNNs?"
Parallelizable (RNNs are sequential)
Better at capturing long-range dependencies
No vanishing gradient problem
Scales better with data and compute
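To make the attention formula above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking; shapes are illustrative):

```python
# Hedged sketch: attention weights are softmax-normalized query-key similarities applied to the values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity between queries and keys
    weights = softmax(scores, axis=-1)     # how much each query attends to each key
    return weights @ V                     # weighted sum of values

seq_len, d_k = 5, 8
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
print(attention(Q, K, V).shape)            # (5, 8)
```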
Hardware & Scalability
Training Considerations:
GPUs: Parallel matrix operations (NVIDIA A100, H100)
TPUs: Google's custom chips for tensor operations
Batch Size: Limited by GPU memory (use gradient accumulation for large effective batches)
Mixed Precision: FP16 training with FP32 master weights
Distributed Training: Data parallelism, model parallelism
Inference Optimization:
Model quantization (INT8, INT4)
Pruning (remove unnecessary weights)
Knowledge distillation (train smaller model from large model)
ONNX runtime, TensorRT for fast inference
Model Development Best Practices
Model Debugging Checklist
Model Underperforming:
Check Data Quality
Missing values handled correctly?
Outliers detected and addressed?
Data leakage (test data bleeding into training)?
Feature scaling applied consistently?
Class imbalance addressed?
Feature Engineering
Relevant features included?
Feature interactions captured?
Domain knowledge incorporated?
Feature importance analysis done?
Model Complexity
Is model too simple (high bias)?
Is model too complex (high variance)?
Try different algorithm families
Hyperparameters
Learning rate appropriate?
Regularization strength tuned?
Tree depth / number of estimators optimized?
Use grid search or random search
Model Overfitting (High Variance):
Collect more training data
Reduce model complexity
Increase regularization (L1/L2, dropout)
Feature selection / dimensionality reduction
Cross-validation
Early stopping
Data augmentation
Ensemble methods (bagging)
Model Underfitting (High Bias):
Use more complex model
Add more features / feature engineering
Reduce regularization
Train longer
Remove noise from data
Ensemble methods (boosting)
Production Considerations
Model Serving
Batch Inference:
Process large datasets offline
Higher throughput; latency requirements are relaxed
Can use complex models
Examples: Daily recommendations, weekly reports
Real-time Inference:
Low latency required (ms to seconds)
Request-response pattern
Model optimization critical
Examples: Search ranking, fraud detection
Serving Infrastructure:
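The infrastructure details are not reproduced here; as one minimal, hedged illustration (assuming FastAPI and uvicorn are installed and a scikit-learn model has been pickled to model.pkl, both of which are assumptions for this sketch), a real-time prediction endpoint might look like:

```python
# Hedged sketch: a single-endpoint model server; real serving stacks vary widely.
import pickle
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("model.pkl", "rb") as f:          # hypothetical artifact path
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}

# run with, e.g.: uvicorn serve:app --host 0.0.0.0 --port 8000  (assumes this file is serve.py)
```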
Model Monitoring & Drift Detection
Data Drift:
Input feature distributions change over time
Detection: PSI (Population Stability Index), KL divergence
Solution: Retrain model with recent data
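A minimal sketch of the PSI calculation mentioned above (the 10-bin layout and the 0.2 alert threshold are common conventions, not universal rules; the data is synthetic):

```python
# Hedged sketch: Population Stability Index between a training feature and recent production data.
import numpy as np

def psi(expected, actual, bins=10):
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, 1e-6, None)   # avoid log(0)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

train_feature = np.random.normal(0, 1, 10_000)
prod_feature = np.random.normal(0.3, 1.2, 10_000)      # shifted distribution
print(psi(train_feature, prod_feature))                 # > 0.2 is often treated as significant drift
```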
Concept Drift:
Relationship between features and target changes
Detection: Monitor model performance metrics
Solution: Retrain model, feature engineering
Monitoring Metrics:
Model accuracy/precision/recall (decay over time?)
Prediction distribution
Feature distributions
Latency and throughput
Error rates
Alerting Thresholds:
Accuracy drops >5%
Prediction drift >10%
Latency increases >2x
Error rate >1%
A/B Testing
Purpose: Validate new model performs better than current production model
Setup:
Split traffic (e.g., 90% control, 10% treatment)
Monitor key metrics (conversion rate, CTR, revenue)
Statistical significance testing
Gradual rollout if successful
Key Metrics:
Business metrics (revenue, engagement)
Model metrics (accuracy, latency)
User experience metrics
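For the significance-testing step, one common approach is a two-proportion z-test on conversion counts. A minimal sketch, assuming statsmodels is available (the counts and the 90/10 split are made up for illustration):

```python
# Hedged sketch: compare control vs treatment conversion rates.
from statsmodels.stats.proportion import proportions_ztest

conversions = [4500, 560]          # control, treatment successes
visitors = [90_000, 10_000]        # 90/10 traffic split

stat, p_value = proportions_ztest(conversions, visitors)
print(p_value)                     # p < 0.05 is a common (but not absolute) significance cutoff
```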
Model Versioning & Reproducibility
Essential Practices:
Version control for code (Git)
Track data versions (DVC, Delta Lake)
Log hyperparameters (MLflow, Weights & Biases)
Save model artifacts with metadata
Docker containers for reproducible environments
Random seeds for reproducibility
MLflow Example:
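The original example is not shown here; the following is a hedged sketch of typical MLflow tracking calls around a scikit-learn fit (the experiment name, parameters, and metric are illustrative):

```python
# Hedged sketch: log parameters, a metric, and the model artifact for one run.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

mlflow.set_experiment("example-experiment")          # illustrative name
with mlflow.start_run():
    params = {"n_estimators": 300, "max_depth": 8, "random_state": 42}
    model = RandomForestClassifier(**params).fit(X_tr, y_tr)

    mlflow.log_params(params)
    mlflow.log_metric("f1", f1_score(y_te, model.predict(X_te)))
    mlflow.sklearn.log_model(model, "model")         # saves the model artifact with the run
```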
Common Interview Questions & Answers
Conceptual Questions
1. "Explain bias-variance trade-off with an example"
"Bias-variance trade-off is about balancing model complexity. For example, if I use linear regression to fit a quadratic relationship, I'll have high bias because the model is too simple to capture the curve—it will consistently underfit. If I use a 20-degree polynomial, I'll have high variance because the model is too flexible and will fit noise in the training data—predictions will vary drastically with different training sets. The goal is to find the sweet spot, perhaps a quadratic model, that minimizes total error."
2. "How would you handle imbalanced data?"
"I'd approach it in three ways: First, use appropriate metrics like precision, recall, F1, and PR-AUC instead of accuracy. Second, apply resampling techniques like SMOTE for oversampling the minority class or undersampling the majority class. Third, use algorithmic approaches like class weights to penalize misclassification of the minority class more heavily. The choice depends on whether I have enough data and the cost of false positives vs false negatives."
3. "When would you use Random Forest vs XGBoost?"
"Random Forest is my go-to for a robust baseline—it's less prone to overfitting, requires minimal tuning, and handles outliers well. XGBoost is what I'd use when I need the highest possible accuracy on tabular data and am willing to invest time in hyperparameter tuning. XGBoost is generally more accurate but can overfit with poor tuning. Random Forest is more forgiving and often good enough. For production, I'd also consider LightGBM for faster training and inference."
4. "Explain L1 vs L2 regularization"
"Both add penalties to the loss function to prevent overfitting, but differ in how. L2 (Ridge) adds the sum of squared weights, which shrinks all weights but doesn't zero them out—good when all features contribute. L1 (Lasso) adds the sum of absolute weights, which can drive some weights exactly to zero, effectively performing feature selection—ideal when you suspect many features are irrelevant. For correlated features, L2 is more stable, while L1 is better for interpretability."
5. "How does dropout work and why is it effective?"
"Dropout randomly deactivates a percentage of neurons during each training iteration. This prevents the network from relying too heavily on any specific neuron, forcing it to learn robust, distributed representations. It's effectively like training an ensemble of different network architectures simultaneously. At inference, we use all neurons but scale their outputs appropriately. It's particularly effective for preventing overfitting in deep neural networks."
Practical Questions
6. "Your model has 99% accuracy but stakeholders are unhappy. What's wrong?"
"This is likely a class imbalance problem. If 99% of data belongs to the negative class, a model that always predicts negative gets 99% accuracy but provides zero value. I'd check the confusion matrix and look at precision, recall, and F1 score for each class. I'd then address the imbalance using techniques like class weights, SMOTE, or anomaly detection, and optimize for the metric that matters to the business—likely recall if false negatives are costly, or precision if false positives are costly."
7. "How would you detect and handle overfitting in production?"
"I'd implement monitoring to track model performance metrics over time. If validation/test accuracy was high during development but production performance degrades, that's a sign. I'd monitor prediction distributions, feature distributions, and error rates. To handle it, I'd first check for data drift. Then I'd consider retraining with more recent data, increasing regularization, collecting more training data, or simplifying the model. I'd also implement A/B testing before fully deploying any changes."
8. "Walk me through how you'd approach a new ML problem"
"First, I'd understand the business problem and define success metrics. Second, I'd do exploratory data analysis to understand distributions, missing values, and relationships. Third, I'd establish a simple baseline (like mean prediction or logistic regression). Fourth, I'd engineer relevant features and try progressively complex models (e.g., Linear → Random Forest → XGBoost). Fifth, I'd use cross-validation for model selection and tune hyperparameters. Finally, I'd evaluate on a hold-out test set and, if satisfactory, deploy with monitoring and A/B testing."
9. "How do you choose between deep learning and traditional ML?"
"I consider four factors: data size, data type, interpretability needs, and resources. For tabular data with < 100K samples, traditional ML (XGBoost, Random Forest) usually wins—it's faster, interpretable, and performs well. For unstructured data (images, text, audio) or when I have millions of samples, deep learning excels. If interpretability is critical (healthcare, finance), I'd prefer traditional ML or use interpretability techniques. Finally, if resources are limited (compute, time, expertise), traditional ML is more practical."
10. "Explain how you'd improve a model that's already performing well"
"I'd look at several areas: First, error analysis—examine misclassified examples to identify patterns and engineer targeted features. Second, ensemble methods—combine multiple models or try stacking. Third, advanced feature engineering—create interaction terms, polynomial features, or domain-specific features. Fourth, hyperparameter optimization—use Bayesian optimization or genetic algorithms. Fifth, get more data, especially for edge cases. Finally, try neural architecture search or AutoML if I have the resources. I'd always balance improvement against complexity and maintenance costs."