
Data preprocessing

> [!IMPORTANT]
> Executive Summary for ML Engineers
>
> 1. Data Leakage: The #1 killer of production models. Never fit transformers on test data.
> 2. Imputation: Use the median for skewed data (robust). For categorical features, consider whether "missingness" is itself a signal.
> 3. Scaling: StandardScaler is the default. Use RobustScaler if you have extreme outliers.
> 4. Encoding: One-Hot for low cardinality; Target Encoding or Embeddings for high cardinality.
> 5. Drift Monitoring: Preprocessing isn't a one-time task. Monitor distribution shift (K-S test) in production.


1. The Cardinal Rule: Prevention of Data Leakage

Before any transformation, you must perform the train-test split.

Why?

  • Data Leakage occurs when information from outside the training dataset is used to create the model.

  • Example: If you standardize using the global mean, your training data now "knows" something about the distribution of the test set.

The Correct Workflow:

  1. Split into X_train and X_test.

  2. fit() the scaler/encoder ONLY on X_train.

  3. transform() both X_train and X_test using the fitted parameters.
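The workflow above can be sketched with scikit-learn (toy data; the values are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and labels (illustrative values).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# 1. Split first, before any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Fit the scaler on the training fold only.
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Transform both folds using the training-fold statistics.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The scaler's mean comes from X_train alone, not the full dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```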


2. Handling Missing Data

Strategies Comparison

| Method | Technique | When to Use | Interview Insight |
| --- | --- | --- | --- |
| Deletion | Drop rows/cols | Missing >50% or insignificant rows | Use sparingly; can introduce bias if data is not MCAR (Missing Completely At Random). |
| Imputation | Mean/Median/Mode | Numerical/Categorical features | Median is preferred for skewed data to avoid outlier influence. |
| Prediction | KNN/Iterative Imputer | Complex dependencies | More accurate, but computationally expensive and prone to overfitting. |
| Constant | Fill with "Unknown" | Categorical features | Preserves the fact that the data was missing, which can be a valuable signal. |

Python Code (Sklearn SimpleImputer):


3. Numerical Data Transformations

Scaling Techniques

| Technique | Formula | When to Use | Impact |
| --- | --- | --- | --- |
| Standardization | (x - μ) / σ | Most models (SVM, Logistic, PCA) | Centers at 0, unit variance. Sensitive to extreme outliers. |
| Normalization | (x - min) / (max - min) | Neural Networks, Image pixels | Bounds data between [0, 1]. Extremely sensitive to outliers. |
| Robust Scaling | (x - Q2) / (Q3 - Q1) | Data with many outliers | Scales based on the Interquartile Range (IQR); ignores extremes. |

Handling Outliers (The IQR Method)

Formula:

  • Lower Bound = Q1 - 1.5 * IQR

  • Upper Bound = Q3 + 1.5 * IQR

where IQR = Q3 - Q1.

Strategies:

  1. Trimming: Remove values outside bounds.

  2. Capping (Winsorization): Replace outliers with the upper/lower bound values.
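Capping can be sketched in a few lines of NumPy (the data values are illustrative):

```python
import numpy as np

def winsorize_iqr(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] (Winsorization)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper)

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # 95 is an outlier
capped = winsorize_iqr(data)
print(capped)  # 95.0 is replaced by the upper bound; inliers are untouched
```

For trimming instead of capping, a boolean mask (`values[(values >= lower) & (values <= upper)]`) removes the rows rather than clipping them.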

Code Example (Scaling):
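A sketch comparing the three scalers on a feature with one extreme outlier (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)
minmax = MinMaxScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# Min-Max squeezes the inliers toward 0 because of the outlier;
# RobustScaler, based on the median and IQR, keeps them spread out.
print(minmax.ravel())  # inliers crowded near 0, outlier at 1.0
print(robust.ravel())  # inliers at [-1, -0.5, 0, 0.5], outlier far away
```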


4. Categorical Data Encoding

| Method | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| One-Hot | Low cardinality | Simple; no ordinal assumption | Dimensionality explodes with many categories. |
| Target Encoding | High cardinality | Compact; captures target signal | Prone to overfitting; needs smoothing or CV-folds. |
| Hashing | High cardinality | Fixed memory, no leakage | Collision risk, non-reversible. |

Advanced Techniques:

  • Target Encoding: Replaces category with mean of target. Danger: Overfitting. Use smoothing or CV-folds.

  • Feature Hashing (The Hashing Trick): Converts categories to indices using a hash function. Used in high-speed online learning (e.g., Vowpal Wabbit).
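The hashing trick can be sketched with scikit-learn's `FeatureHasher` (the category strings are illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

# Map arbitrary category strings to a fixed number of columns.
# There is no fitted vocabulary to store, so nothing can leak and
# unseen categories at inference time are handled for free.
hasher = FeatureHasher(n_features=8, input_type="string")
rows = [["city=NY"], ["city=SF"], ["city=a_brand_new_city"]]
X = hasher.transform(rows)
print(X.shape)  # (3, 8) regardless of how many distinct categories exist
```

The trade-off from the table applies: two different categories may hash to the same column (collision), and the original category cannot be recovered from the encoded vector.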


5. Feature Engineering & Selection

The process of using domain knowledge to create new features.

  • Polynomial Features: Creating interactions (e.g., $x_1 \times x_2$, $x_1^2$) to capture non-linearities.

  • Binning: Converting numerical features to categorical (e.g., Age 0-18 → "Child").

  • Geometric/Temporal: Distance to landmarks, "Is Weekend?", "Time since last purchase".
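Polynomial features and binning can be sketched as follows (toy values; the age cut-offs are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [1.0, 5.0]])

# Interactions and squares: [x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly[0])  # [2. 3. 4. 6. 9.]

# Binning: numerical age -> ordinal bucket (0="Child", 1="Adult", 2="Senior")
ages = np.array([5, 17, 30, 70])
buckets = np.digitize(ages, bins=[18, 65])
print(buckets)  # [0 0 1 2]
```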

Feature Selection:

  • Filter Methods: Correlation, Chi-Square, Mutual Information.

  • Wrapper Methods: Recursive Feature Elimination (RFE).

  • Embedded Methods: L1 Regularization (Lasso) - coefficients shrink to zero.
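The embedded-method behavior can be demonstrated on synthetic data where only two of five features are informative (the coefficients and noise scale are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 actually drive the target.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# L1 regularization shrinks the coefficients of the three
# uninformative features to (essentially) zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # large weight on feature 0, tiny/zero on features 2-4
```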


6. Preprocessing in Production: Data Drift

In real-world systems, data distributions change over time (Covariate Shift).

Detection Strategies:

  • Population Stability Index (PSI): Measures shift in distribution between two time periods.

  • Kolmogorov-Smirnov (K-S) Test: Non-parametric test for equality of 1D distributions.

  • Monitoring Tooling: Use Prometheus/Grafana or specialized ML tools (WhyLabs, Arize).
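Both detection strategies fit in a few lines; a sketch with SciPy and a hand-rolled PSI (the drift magnitude is illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, size=2000)  # training-time distribution
drifted = rng.normal(loc=0.5, size=2000)   # production window, shifted by +0.5

# K-S test: small p-value => the two samples come from different distributions.
stat, p_value = ks_2samp(baseline, drifted)
print(p_value < 0.01)  # True: drift detected

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index over quantile bins of the expected data."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip so values outside the expected range fall into the edge bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

print(psi(baseline, drifted))  # well above 0: distribution has shifted
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.2 as significant shift.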


7. Preprocessing for Image & Text

Image Data (Computer Vision)

  • Mandatory: Resizing (all images must be the same shape, e.g., 224x224).

  • Mandatory: Scaling (Divide by 255 for [0, 1] or use ImageNet normalization).

  • Optional: Augmentation (Flips, rotations) - only during training.

Text Data (NLP)

  • Cleaning: Lowercasing, removing punctuation/special chars.

  • Normalization: Stemming (crude) vs Lemmatization (smart/dictionary-based).

  • Vectorization:

    • TF-IDF: Down-weights common words ("the", "is").

    • Embeddings (Word2Vec): Learns semantic meaning.


Common Interview Questions

1. "When would you choose Normalization (Min-Max) over Standardization?"

I choose Normalization when the distribution is not Gaussian or when the algorithm requires inputs in a fixed [0, 1] range, such as in Neural Networks or Image Processing. I use Standardization for most other cases, especially when the algorithm assumes Gaussian data (e.g., PCA, Logistic Regression).

2. "How do you handle categorical features with 10,000+ unique values?"

One-Hot encoding would create 10,000 columns, causing the "Curse of Dimensionality." Instead, I would use Target Encoding, Count Encoding, or Entity Embeddings (learned vectors, common in Deep Learning) to represent the classes in a lower-dimensional space.

3. "What happens if you scale the entire dataset before splitting?"

This leads to Data Leakage. The training set's statistics would be influenced by the test set's values. The model would yield overly optimistic performance metrics that won't hold up in production.

4. "Is it necessary to scale features for Decision Trees?"

No. Tree-based models (Random Forest, XGBoost) are scale-invariant. They split based on raw values and don't rely on distance metrics or gradient updates across features simultaneously.


Python Implementation: A Full Pipeline
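A sketch tying the sections together with `Pipeline` and `ColumnTransformer`: split first, then let the pipeline fit every transformer on the training fold only. The dataset, column names, and values are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with missing values (illustrative column names).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 45, 38, 29, 51, np.nan, 41, 36],
    "income": [40e3, 60e3, 55e3, np.nan, 70e3, 48e3, 90e3, 52e3, 65e3, 58e3],
    "city": ["NY", "SF", "NY", None, "LA", "SF", "NY", "LA", "SF", "NY"],
    "churn": [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="churn"), df["churn"]

# Numerical branch: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: constant "Unknown" sentinel, then one-hot.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# Split BEFORE fitting: all imputers/scalers/encoders see only X_train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds)
```

Because the transformers live inside the pipeline, calling `model.fit` can never leak test-set statistics, and the exact same preprocessing is replayed at inference time.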
