# Data Preprocessing
> [!IMPORTANT]
> **Executive Summary for ML Engineers**
>
> - **Data leakage:** the #1 killer of production models. Never fit transformers on test data.
> - **Imputation:** use the median for skewed data (robust to outliers). For categorical features, consider whether missingness is itself a signal.
> - **Scaling:** `StandardScaler` is the default. Use `RobustScaler` if you have extreme outliers.
> - **Encoding:** one-hot for low cardinality; target encoding or embeddings for high cardinality.
> - **Drift monitoring:** preprocessing isn't a one-time step. Monitor distribution shift (e.g., with the K-S test) in production.
## 1. The Cardinal Rule: Preventing Data Leakage
Before any transformation, you must perform the train-test split.
Why?
Data Leakage occurs when information from outside the training dataset is used to create the model.
Example: If you standardize using the global mean, your training data now "knows" something about the distribution of the test set.
**The correct workflow:**

1. Split into `X_train` and `X_test`.
2. `fit()` the scaler/encoder ONLY on `X_train`.
3. `transform()` both `X_train` and `X_test` using the fitted parameters (see the sketch below).
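A minimal sketch of this workflow with scikit-learn, assuming a numerical feature matrix `X` and target `y` are already loaded (names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, so the test set never influences the fitted parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those training statistics on the test data
```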
## 2. Handling Missing Data

### Strategy Comparison
| Method | Technique | When to Use | Interview Insight |
|---|---|---|---|
| Deletion | Drop rows/columns | Feature is >50% missing, or the affected rows are insignificant | Use sparingly; can introduce bias if data is not MCAR (Missing Completely At Random). |
| Imputation | Mean/Median/Mode | Numerical/categorical features | Median is preferred for skewed data to avoid outlier influence. |
| Prediction | KNN/Iterative Imputer | Complex dependencies between features | More accurate, but computationally expensive and prone to overfitting. |
| Constant | Fill with "Unknown" | Categorical features | Preserves the fact that the data was missing, which can be a valuable signal. |
**Python code (scikit-learn `SimpleImputer`):**
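A minimal sketch, assuming a pandas DataFrame `df` with an illustrative numerical column `age` and categorical column `city` (in practice, fit the imputers on the training split only, per Section 1):

```python
from sklearn.impute import SimpleImputer

# Median imputation for a skewed numerical column
num_imputer = SimpleImputer(strategy="median")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Constant fill for a categorical column, preserving "missingness" as a signal
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])
```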
## 3. Numerical Data Transformations

### Scaling Techniques
| Technique | Formula | When to Use | Impact |
|---|---|---|---|
| Standardization | $(x - \mu) / \sigma$ | Most models (SVM, Logistic Regression, PCA) | Centers at 0 with unit variance. Sensitive to extreme outliers. |
| Normalization (Min-Max) | $(x - \min) / (\max - \min)$ | Neural networks, image pixels | Bounds data to [0, 1]. Extremely sensitive to outliers. |
| Robust Scaling | $(x - Q_2) / (Q_3 - Q_1)$ | Data with many outliers | Scales based on the interquartile range (IQR); ignores extremes. |
### Handling Outliers (the IQR Method)

**Formula:**

$$\text{Lower Bound} = Q_1 - 1.5 \times \text{IQR}, \qquad \text{Upper Bound} = Q_3 + 1.5 \times \text{IQR}, \qquad \text{where } \text{IQR} = Q_3 - Q_1$$
**Strategies** (both are sketched in the snippet below):

- **Trimming:** remove values outside the bounds.
- **Capping (Winsorization):** replace outliers with the upper/lower bound values.
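A minimal sketch of both strategies with pandas, assuming a DataFrame `df` with an illustrative numerical column `income`:

```python
# Quartiles should be computed on the training data in practice
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: drop rows that fall outside the bounds
df_trimmed = df[df["income"].between(lower, upper)]

# Capping (Winsorization): clip outliers to the bounds instead of dropping them
df["income_capped"] = df["income"].clip(lower=lower, upper=upper)
```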
**Code example (scaling):**
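A minimal sketch comparing the three scalers, assuming the `X_train`/`X_test` arrays from Section 1:

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Fit each scaler on the training data only, then reuse it on the test data
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print(type(scaler).__name__,
          X_train_scaled.mean(axis=0).round(2),
          X_train_scaled.std(axis=0).round(2))
```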
## 4. Categorical Data Encoding

**Advanced techniques:**

- **Target Encoding:** replaces each category with the mean of the target. Danger: overfitting. Use smoothing or out-of-fold (CV) encoding (see the sketch below).
- **Feature Hashing (the hashing trick):** converts categories to fixed-size indices with a hash function. Memory use is fixed and there is no fitted mapping to leak, but collisions are possible and the encoding is not reversible. Used in high-speed online learning (e.g., Vowpal Wabbit).
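A minimal sketch of smoothed target encoding with pandas, assuming a training DataFrame `train` with an illustrative categorical column `city` and target column `y`; the smoothing weight `m` is also illustrative:

```python
import pandas as pd

def smoothed_target_encode(train: pd.DataFrame, col: str, target: str, m: float = 10.0) -> pd.Series:
    """Blend each category's target mean with the global mean, weighted by category count."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Rare categories are pulled toward the global mean, which limits overfitting
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return train[col].map(smoothed)

# Fit the mapping on training folds only and apply it to validation/test data
# train["city_encoded"] = smoothed_target_encode(train, col="city", target="y")
```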
## 5. Feature Engineering & Selection

Feature engineering is the process of using domain knowledge to create new features.

- **Polynomial Features:** creating interactions (e.g., $x_1 \times x_2$, $x_1^2$) to capture non-linearities.
- **Binning:** converting numerical features into categorical ones (e.g., age 0-18 → "Child").
- **Geometric/Temporal:** distance to landmarks, "is it a weekend?", "time since last purchase".

**Feature selection:**

- **Filter methods:** correlation, Chi-Square, mutual information.
- **Wrapper methods:** Recursive Feature Elimination (RFE).
- **Embedded methods:** L1 regularization (Lasso), which shrinks uninformative coefficients to exactly zero (see the sketch after this list).
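A minimal sketch combining polynomial feature creation with Lasso-based embedded selection, assuming numerical `X_train` and `y_train`; the regularization strength is illustrative:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Expand features with squares and pairwise interactions (x1^2, x1*x2, ...)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train)

# Embedded selection: Lasso drives uninformative coefficients to exactly zero
selector = SelectFromModel(Lasso(alpha=0.01))
X_selected = selector.fit_transform(X_poly, y_train)
print(f"Kept {X_selected.shape[1]} of {X_poly.shape[1]} features")
```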
## 6. Preprocessing in Production: Data Drift
In real-world systems, data distributions change over time (Covariate Shift).
**Detection strategies** (PSI and the K-S test are sketched below):

- **Population Stability Index (PSI):** measures the shift in a feature's distribution between two time periods.
- **Kolmogorov-Smirnov (K-S) test:** a non-parametric test for the equality of two one-dimensional distributions.
- **Monitoring tooling:** Prometheus/Grafana, or specialized ML tools (WhyLabs, Arize).
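A minimal sketch of both checks, assuming `reference` and `current` are 1-D NumPy arrays of the same feature from two time periods; the bin count is illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of the same feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid division by zero / log(0) when a bin is empty
    ref_pct, cur_pct = np.clip(ref_pct, 1e-6, None), np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# K-S test: a small p-value suggests the two distributions differ
statistic, p_value = ks_2samp(reference, current)
print(f"PSI = {psi(reference, current):.3f}, K-S p-value = {p_value:.4f}")
```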
## 7. Preprocessing for Images & Text

### Image Data (Computer Vision)

- **Mandatory: Resizing.** All images must have the same shape (e.g., 224×224).
- **Mandatory: Scaling.** Divide pixel values by 255 to map into [0, 1], or apply ImageNet normalization.
- **Optional: Augmentation.** Flips, rotations, etc., applied only during training (a transform sketch follows below).
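A minimal sketch using torchvision transforms, assuming a PyTorch workflow; the normalization constants are the standard published ImageNet statistics:

```python
from torchvision import transforms

# Training pipeline: resize + augmentation + scale to [0, 1] + ImageNet normalization
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),  # augmentation: training only
    transforms.ToTensor(),              # converts to a float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Evaluation pipeline: identical, but without augmentation
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```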
### Text Data (NLP)

- **Cleaning:** lowercasing, removing punctuation/special characters.
- **Normalization:** stemming (crude, rule-based) vs. lemmatization (dictionary-based).
- **Vectorization:**
  - **TF-IDF:** down-weights common words ("the", "is"); see the sketch below.
  - **Embeddings (Word2Vec):** learn semantic meaning from context.
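A minimal sketch of TF-IDF vectorization with scikit-learn; the corpus is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The model is trained on the training set",
    "The model is evaluated on the test set",
]

# lowercase=True handles basic cleaning; stop_words="english" drops very common words,
# and the remaining terms are weighted by TF-IDF
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X_tfidf.toarray().round(2))
```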
## Common Interview Questions
1. "When would you choose Normalization (Min-Max) over Standardization?"
I choose Normalization when the distribution is not Gaussian or when the algorithm requires inputs in a fixed [0, 1] range, such as in Neural Networks or Image Processing. I use Standardization for most other cases, especially when the algorithm assumes Gaussian data (e.g., PCA, Logistic Regression).
2. "How do you handle categorical features with 10,000+ unique values?"
One-Hot encoding would create 10,000 columns, causing the "Curse of Dimensionality." Instead, I would use Target Encoding, Count Encoding, or Entity Embeddings (learned vectors, common in Deep Learning) to represent the classes in a lower-dimensional space.
3. "What happens if you scale the entire dataset before splitting?"
This leads to Data Leakage. The training set's statistics would be influenced by the test set's values. The model would yield overly optimistic performance metrics that won't hold up in production.
4. "Is it necessary to scale features for Decision Trees?"
No. Tree-based models (Random Forest, XGBoost) are scale-invariant: they split on threshold comparisons of raw feature values and don't rely on distance metrics or on gradient updates whose step sizes depend on feature scale.
## Python Implementation: A Full Pipeline
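A minimal sketch of an end-to-end preprocessing-plus-model pipeline with `ColumnTransformer`, assuming a DataFrame `X` and target `y`; column names, the model, and hyperparameters are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

numeric_features = ["age", "income"]        # illustrative column names
categorical_features = ["city", "device"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Wrapping preprocessing and the model in one Pipeline ensures all transformers
# are fitted on training data only, preventing leakage (and works inside CV)
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```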