
ML Glossary

A high-signal, "cheat sheet" style glossary for rapid technical revision. Each term includes definitions, formulas, practical applications, and interview key points.


A

Accuracy

  • Definition: The ratio of correct predictions to total predictions.

  • Formula: $\frac{TP + TN}{TP + TN + FP + FN}$

  • Practical Application: Use only for roughly balanced datasets (e.g., handwritten digit classification).

  • Interview Note: Never use accuracy for fraud detection or medical diagnosis (imbalanced classes). Use F1 or PR-AUC instead (see the sketch below).
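
A quick sketch, assuming scikit-learn and made-up labels, of how accuracy masks a useless majority-class model while F1 exposes it:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # A "model" that always predicts the majority class.

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- exposes the failure
```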

Activation Function

  • Definition: Non-linear transformation applied to neuron outputs to enable learning complex patterns.

  • Key Examples:

    • ReLU: $max(0, x)$ (the most common choice).

    • Sigmoid: $\frac{1}{1+e^{-x}}$ (Binary output).

    • Softmax: Multi-class probabilities.

  • Practical Application: ReLU for hidden layers (avoids vanishing gradient), Sigmoid/Softmax for output layers.
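
A minimal NumPy sketch of the three activations above; the input vector is arbitrary:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # max(0, x), element-wise

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes to (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()             # probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), softmax(x), sep="\n")
```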

AdaBoost (Adaptive Boosting)

  • Definition: An ensemble method that trains weak learners sequentially, correcting the mistakes of previous predictors.

  • Key Concept: Assigns higher weights to misclassified training instances.

  • Practical Application: Tabular classification tasks where data is clean but hard to separate.

  • Interview Note: Sensitive to noisy data and outliers, because it keeps upweighting them and trying to fit them.

Adam (Adaptive Moment Estimation)

  • Definition: An adaptive learning rate optimization algorithm.

  • Key Concept: Combines Momentum (keeps moving in the average gradient direction) and RMSprop (scales the learning rate by a running average of squared gradients).

  • Practical Application: The default optimizer for training Deep Learning models (LLMs, CNNs).

  • Interview Note: A common default learning rate is $\alpha = 3e-4$ (Andrej Karpathy's "safe bet"); see the update sketch below.
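
A single Adam update sketched in NumPy with the standard defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$); the function and its argument names are illustrative, not a specific library's API:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * grad        # 1st moment: momentum
    v = b2 * v + (1 - b2) * grad ** 2   # 2nd moment: RMSprop-style scaling
    m_hat = m / (1 - b1 ** t)           # bias correction (matters early on)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```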

AUC-ROC (Area Under the ROC Curve)

  • Definition: Performance metric measuring the ability to distinguish between classes at all threshold settings.

  • Key Range: 0.5 (Random Guessing) to 1.0 (Perfect).

  • Practical Application: Comparing models for credit scoring or ad-click prediction.

  • Interview Note: Unlike Accuracy, AUC is threshold-invariant and scale-invariant.

Attention Mechanism

  • Definition: Allows a model to focus on specific parts of the input sequence when generating output.

  • Formula: $Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$

  • Practical Application: The core of Transformers (ChatGPT, BERT). Enables long-range dependency modeling in translation.
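
A minimal NumPy sketch of the formula above (single head, no masking; the shapes are arbitrary examples):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_q, seq_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V                                     # weighted sum of values

Q = np.random.randn(4, 8)   # 4 query positions, d_k = 8
K = np.random.randn(6, 8)   # 6 key positions
V = np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```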


B

Backpropagation

  • Definition: The algorithm for computing gradients of the loss function with respect to weights using the chain rule.

  • Key Concept: "Credit assignment" — figuring out which weight contributed how much to the error.

  • Practical Application: The engine of training Neural Networks.

Bagging (Bootstrap Aggregating)

  • Definition: Training multiple models in parallel on random subsets (with replacement) of training data.

  • Key Example: Random Forest.

  • Practical Application: Reducing Variance (overfitting). Great for high-variance models like Decision Trees.

Batch Normalization

  • Definition: Normalizing layer inputs to have mean 0 and variance 1 for each mini-batch.

  • Formula: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

  • Practical Application: Accelerates training, allows higher learning rates, and acts as a weak regularizer.

  • Interview Note: During Inference, use the moving average of mean/var calculated during training, not the batch statistics.
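
A simplified NumPy sketch of the formula above, omitting the learnable scale/shift parameters ($\gamma$, $\beta$); it also tracks the moving averages used at inference, per the interview note:

```python
import numpy as np

def batchnorm_train(x, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Training-mode forward pass; x has shape (batch, features)."""
    mu, var = x.mean(axis=0), x.var(axis=0)          # batch statistics
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Update moving averages for later use at inference time.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return x_hat, running_mean, running_var

def batchnorm_infer(x, running_mean, running_var, eps=1e-5):
    # Inference uses the stored averages, never the batch statistics.
    return (x - running_mean) / np.sqrt(running_var + eps)
```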

Bias-Variance Tradeoff

  • Definition: The tension between error from overly simplistic assumptions (Bias) and error from sensitivity to the specific training data (Variance).

  • Key Insight:

    • High Bias: Underfitting (Linear Regression on nonlinear data).

    • High Variance: Overfitting (a depth-100 Decision Tree).

  • Goal: Find the "Sweet Spot".

Binary Cross-Entropy (Log Loss)

  • Definition: Loss function for binary classification.

  • Formula: $-\frac{1}{N} \sum [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$

  • Practical Application: Evaluating probabilities in spam detection or churn prediction.
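
A minimal NumPy sketch of the formula above, assuming the predictions are probabilities; clipping avoids $\log(0)$:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    p = np.clip(y_prob, eps, 1 - eps)   # keep log() finite
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.4])   # made-up model outputs
print(binary_cross_entropy(y_true, y_prob))  # ~0.34
```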


C

Confusion Matrix

  • Definition: A table layout that visualizes the performance of a supervised learning algorithm.

  • Components:

    • TP: Hit.

    • TN: Correct Rejection.

    • FP: False Alarm (Type I Error).

    • FN: Miss (Type II Error).

Cosine Similarity

  • Definition: Measure of similarity between two non-zero vectors using the cosine of the angle between them.

  • Formula: $\frac{A \cdot B}{||A|| ||B||}$

  • Practical Application: Semantic Search, RAG (Retrieval Augmented Generation), Document Similarity.

  • Interview Note: Range is [-1, 1]. In high-dimensional spaces (embeddings), usually [0, 1].
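
A minimal NumPy sketch; the vectors are arbitrary (the second is a scaled copy of the first, so similarity is exactly 1):

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different magnitude
print(cosine_similarity(a, b))  # 1.0
```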

Cross-Validation (K-Fold)

  • Definition: Resampling procedure used to evaluate ML models on a limited data sample.

  • Practical Application: Validating that your model isn't just memorizing the specific train-test split.

  • Interview Note: For Time Series, use TimeSeriesSplit (Walk-Forward validation), never random K-Fold.
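
A minimal scikit-learn sketch contrasting shuffled K-Fold with walk-forward splits; the synthetic data is purely illustrative (for a real time series, rows must already be in temporal order):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# Standard 5-fold CV for i.i.d. data.
print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)))

# Walk-forward splits: the train fold always precedes the test fold.
print(cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)))
```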


D

Data Leakage

  • Definition: When information from outside the training dataset (or from the future) is used to create the model.

  • Examples: Using 'Target' in feature engineering, scaling data before splitting, future timestamps.

  • Interview Note: If you see 99.9% accuracy, suspect leakage immediately.
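
A minimal scikit-learn sketch of the leakage-safe pattern: split first, then let a Pipeline fit the scaler on training data only (the synthetic data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# Split FIRST; scaling before splitting would leak test statistics.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler inside fit(), on the training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```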

Dimensionality Reduction

  • Definition: Transformation of data from a high-dimensional space into a low-dimensional space.

  • Techniques:

    • PCA (Linear): Preserves variance.

    • t-SNE / UMAP (Non-linear): Preserves local structure/clusters.

  • Practical Application: Visualizing embeddings, reducing noise, speeding up training.

Dropout

  • Definition: Regularization technique where randomly selected neurons are ignored during training.

  • Key Concept: Prevents neurons from co-adapting too much (relying on specific peers).

  • Practical Application: Standard in fully connected layers of deep networks; used more sparingly in convolutional layers.


E

Eigenvalue / Eigenvector

  • Definition: For a matrix $A$, $Av = \lambda v$. $v$ is the eigenvector (direction), $\lambda$ is eigenvalue (magnitude).

  • Practical Application: PCA (Principal Component Analysis) projects data onto the eigenvectors with largest eigenvalues.

Embedding

  • Definition: A relatively low-dimensional space into which high-dimensional vectors can be translated.

  • Key Concept: Semantic meaning. "King" - "Man" + "Woman" $\approx$ "Queen".

  • Practical Application: Word2Vec, BERT embeddings, Recommender Systems users/items.

Entropy (Shannon)

  • Definition: Measure of uncertainty or impurity in a dataset.

  • Formula: $H(X) = - \sum p(x) \log p(x)$

  • Practical Application: Decision Trees use this (Information Gain) to decide split points.
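
A minimal NumPy sketch of the formula above, using base-2 logs so the result is in bits:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))     # bits

print(entropy([0.5, 0.5]))   # 1.0  -- maximum uncertainty for 2 classes
print(entropy([0.9, 0.1]))   # ~0.47 -- a nearly pure node
```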


F

F1 Score

  • Definition: The harmonic mean of Precision and Recall.

  • Formula: $2 \times \frac{Precision \times Recall}{Precision + Recall}$

  • Practical Application: The "single number" metric for imbalanced classification.

  • Interview Note: Harmonic mean punishes extreme values more than arithmetic mean (if Recall=0, F1=0).

Fine-Tuning

  • Definition: Taking a pre-trained model (e.g., Llama-2) and training it further on a specific dataset.

  • Types:

    • Full Fine-Tuning: Update all weights.

    • PEFT (LoRA): Freeze the base weights and train only small adapter matrices.

  • Practical Application: Customizing an LLM for legal document analysis.


G

Gradient Descent

  • Definition: An iterative optimization algorithm for finding the local minimum of a function.

  • Formula: $\theta_{new} = \theta_{old} - \alpha \nabla J(\theta)$ (step = learning rate $\times$ gradient).

  • Practical Application: The fundamental way nearly all loss functions are minimized in ML (a minimal loop is sketched below).
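
A minimal sketch of the update rule on a toy one-dimensional objective; the function and learning rate are made up for illustration:

```python
# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, lr = 0.0, 0.1
for step in range(100):
    grad = 2 * (theta - 3)
    theta = theta - lr * grad   # theta_new = theta_old - alpha * gradient
print(theta)                    # ~3.0, the minimum
```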

GAN (Generative Adversarial Network)

  • Definition: Two neural networks that compete with each other in a game.

    • Generator: Creates fakes.

    • Discriminator: Detects fakes.

  • Practical Application: DeepFakes, Image Super-resolution, Style Transfer.


H

Hyperparameter Tuning

  • Definition: Choosing the optimal set of parameters that govern the training process (not learned by the model).

  • Methods:

    • Grid Search: Brute force.

    • Random Search: Surprisingly effective.

    • Bayesian Optimization: Smarter, probabilistic search.


I

Imbalanced Data

  • Definition: A dataset with a skewed class distribution (e.g., 1000 : 1).

  • Solutions:

    • Resampling (SMOTE, Undersampling).

    • Class Weights (Change loss function).

    • Metric Choice (Use F1/AUC, not Accuracy).

IOU (Intersection over Union)

  • Definition: Metric used to measure the accuracy of an object detector.

  • Formula: $\frac{Area(Overlap)}{Area(Union)}$

  • Practical Application: Evaluating bounding boxes in YOLO / R-CNN.
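
A minimal sketch in plain Python, assuming boxes in $(x_1, y_1, x_2, y_2)$ corner format:

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
```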


K

K-Means Clustering

  • Definition: Iterative algorithm that partitions data into $K$ clusters.

  • Key Concept: Minimizes variance within clusters.

  • Interview Note: Requires specifying $K$ beforehand (use Elbow Method to find optimal K).

KL Divergence (Kullback-Leibler)

  • Definition: Measure of how one probability distribution differs from a second, reference probability distribution.

  • Formula: $D_{KL}(P || Q) = \sum P(x) \log \frac{P(x)}{Q(x)}$

  • Practical Application: Loss function in VAEs (Variational Autoencoders) and t-SNE.
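
A minimal NumPy sketch of the formula above, assuming both distributions are strictly positive everywhere:

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))   # assumes p > 0 and q > 0

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.51 nats; note KL(P||Q) != KL(Q||P)
```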


L

Learning Rate

  • Definition: Hyperparameter controlling how much we change the model in response to the estimated error each time the model weights are updated.

  • Practical Application: Too high = diverge. Too low = slow convergence.

  • Tip: Use a Scheduler (Cosine Decay, Warmup).

LSTM (Long Short-Term Memory)

  • Definition: A type of RNN capable of learning long-term dependencies.

  • Key Mechanics: Input Gate, Forget Gate, Output Gate.

  • Practical Application: Time-series forecasting, older NLP translation models.


M

MSE / MAE / RMSE

  • MSE (Mean Squared Error): Penalizes large errors heavily. Differentiable.

  • MAE (Mean Absolute Error): Robust to outliers. Not differentiable at 0.

  • RMSE: Root of MSE. Interpretable in the same units as the target (see the sketch below).
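
A minimal NumPy sketch computing all three on made-up values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])

mse = np.mean((y_true - y_pred) ** 2)    # ~0.83 -- squaring amplifies big misses
mae = np.mean(np.abs(y_true - y_pred))   # ~0.67 -- robust to outliers
rmse = np.sqrt(mse)                      # ~0.91 -- back in target units
print(mse, mae, rmse)
```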

Momentum

  • Definition: Technique to accelerate Gradient Descent by accumulating a velocity vector in directions of persistent reduction in the objective function.

  • Analogy: Heavy ball rolling down a hill (it gains speed).


N

Normalization vs. Standardization

  • Normalization (Min-Max): Scales data to [0, 1]. Best for image data / bounded ranges.

  • Standardization (Z-Score): Scales data to $\mu=0, \sigma=1$. Best for algorithms assuming Gaussian distribution (SVM, Logistic Regression).
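
A minimal scikit-learn sketch contrasting the two scalers on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # [0.    0.444 1.   ] -- in [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, std 1
```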


O

Overfitting

  • Definition: When a model learns the detail and noise in the training data to the extent that it negatively impacts performance on new data.

  • The Fix:

    • More data.

    • Regularization (L1/L2/Dropout).

    • Simpler Model.

    • Early Stopping.


P

PCA (Principal Component Analysis)

  • Definition: Linear dimensionality reduction.

  • Key Concept: Finds axes (Principal Components) that maximize variance.

  • Interview Note: Sensitive to scale (must standardize data first; see the sketch below).
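
A minimal scikit-learn sketch of the standardize-then-project pattern from the interview note, using the Iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first -- PCA is sensitive to feature scale.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipe.fit_transform(X)
print(X_2d.shape)                                         # (150, 2)
print(pipe.named_steps["pca"].explained_variance_ratio_)  # variance kept per axis
```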

Precision & Recall

  • Precision: $\frac{TP}{TP + FP}$ (Quality). "Of all the spam we detected, how much was actually spam?"

  • Recall: $\frac{TP}{TP + FN}$ (Quantity). "Of all the actual spam, how much did we find?"

  • Tradeoff: Increasing threshold increases Precision but decreases Recall.


R

RAG (Retrieval-Augmented Generation)

  • Definition: Enhancing LLMs by retrieving relevant documents from an external knowledge base before generating an answer.

  • Components: Vector DB + Embedding Model + LLM.

  • Practical Application: "Chat with PDF", Enterprise Search.

Regularization

  • L1 (Lasso): Adds a penalty proportional to the absolute value of the weights. Can shrink coefficients to exactly zero (Feature Selection).

  • L2 (Ridge): Adds a penalty proportional to the squared weights. Shrinks coefficients towards zero but not to zero (see the sketch below).
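
A minimal scikit-learn sketch of the zeroing-vs-shrinking contrast above; the synthetic regression data and alpha values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 typically drives many coefficients exactly to zero; L2 only shrinks them.
print(np.sum(lasso.coef_ == 0))   # usually many exact zeros
print(np.sum(ridge.coef_ == 0))   # usually none
```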

ReLU (Rectified Linear Unit)

  • Definition: Activation function $f(x) = max(0, x)$.

  • Why? Computationally cheap, solves vanishing gradient in positive domain.


S

Softmax

  • Definition: Function that turns a vector of $K$ real values into a vector of $K$ probabilities that sum to 1.

  • Formula: $\sigma(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$

  • Practical Application: Final layer of Multi-Class Classification.

SGD (Stochastic Gradient Descent)

  • Definition: Gradient descent using only a single sample (or mini-batch) to calculate the gradient.

  • Why? Faster per step, adds noise which helps escape local minima.

SVM (Support Vector Machine)

  • Definition: Finds the hyperplane that maximizes the margin between two classes.

  • Key Concept: Kernel Trick (maps non-linear data to higher dimensions where it becomes linear).


T

Transformer

  • Definition: Deep learning architecture based entirely on the Attention mechanism.

  • Key Benefit: Parallelizable (unlike RNNs) and captures long-range dependencies far more effectively.

Transfer Learning

  • Definition: Storing knowledge gained while solving one problem and applying it to a different but related problem.

  • Practical Application: Using ResNet trained on ImageNet to classify medical X-rays.


V

Vanishing Gradient

  • Definition: In deep networks, gradients can shrink exponentially as they backpropagate, effectively stopping early layers from training.

  • Solution: ResNet (Skip connections), ReLU, BatchNorm.

Vector Database

  • Definition: A database optimized for storing and querying high-dimensional vectors.

  • Examples: Pinecone, Milvus, Chroma.

  • Practical Application: The "Long Term Memory" for LLM Agents.


Z

Zero-Shot Learning

  • Definition: The ability of a model to recognize objects or perform tasks it has not seen during training.

  • Example: Asking GPT-4 to categorize text into "Happy/Sad" without giving it examples, simply by describing the task.
