ML Glossary
A high-signal, "cheat sheet" style glossary for rapid technical revision. Each term includes definitions, formulas, practical applications, and interview key points.
A
Accuracy
Definition: The ratio of correct predictions to total predictions.
Formula: $\frac{TP + TN}{TP + TN + FP + FN}$
Practical Application: Appropriate mainly for balanced datasets (e.g., handwritten digit classification).
Interview Note: Never rely on accuracy for fraud detection or medical diagnosis (imbalanced classes). Use F1 or PR-AUC instead (see the sketch below).
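A quick illustration, assuming scikit-learn is installed, of why accuracy flatters a useless majority-class predictor while F1 exposes it (labels here are hypothetical):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))              # 0.95, looks great but is useless
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0, exposes the failure
```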
Activation Function
Definition: Non-linear transformation applied to neuron outputs to enable learning complex patterns.
Key Examples:
ReLU: $max(0, x)$ (the most common choice).
Sigmoid: $\frac{1}{1+e^{-x}}$ (Binary output).
Softmax: Multi-class probabilities.
Practical Application: ReLU for hidden layers (avoids vanishing gradient), Sigmoid/Softmax for output layers.
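A minimal NumPy sketch of the three activations above; the max-subtraction in softmax is a standard numerical-stability trick:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])
print(relu(z), sigmoid(z), softmax(z))  # softmax output sums to 1
```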
AdaBoost (Adaptive Boosting)
Definition: An ensemble method that trains weak learners sequentially, correcting the mistakes of previous predictors.
Key Concept: Assigns higher weights to misclassified training instances.
Practical Application: Tabular classification tasks where data is clean but hard to separate.
Interview Note: Sensitive to noisy data and outliers because it keeps upweighting and chasing them.
Adam (Adaptive Moment Estimation)
Definition: An adaptive learning rate optimization algorithm.
Key Concept: Combines Momentum (keeps moving in the average gradient direction) and RMSprop (scales the learning rate by a moving average of squared gradients).
Practical Application: The default optimizer for training Deep Learning models (LLMs, CNNs).
Interview Note: Often set learning rate $\alpha = 3e-4$ (Andrej Karpathy's "safe bet").
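A minimal sketch of one Adam update in NumPy, following the standard update rule with the usual default hyperparameters ($t$ starts at 1 for the bias correction); the toy objective below is an assumption for illustration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m/v are running first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad       # momentum: moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2  # RMSprop-style: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * (theta - 5)             # gradient of the toy objective (theta - 5)^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)                           # climbs toward the minimum at 5
```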
AUC-ROC (Area Under the ROC Curve)
Definition: Performance metric measuring the ability to distinguish between classes at all threshold settings.
Key Range: 0.5 (Random Guessing) to 1.0 (Perfect).
Practical Application: Comparing models for credit scoring or ad-click prediction.
Interview Note: Unlike Accuracy, AUC is threshold-invariant and scale-invariant.
Attention Mechanism
Definition: Allows a model to focus on specific parts of the input sequence when generating output.
Formula: $Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
Practical Application: The core of Transformers (ChatGPT, BERT). Enables long-range dependency modeling in translation.
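A NumPy sketch of the formula above, applied as self-attention to a hypothetical batch of 4 tokens with $d_k = 8$:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

Q = K = V = np.random.randn(4, 8)  # self-attention: 4 tokens, d_k = 8
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```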
B
Backpropagation
Definition: The algorithm for computing gradients of the loss function with respect to weights using the chain rule.
Key Concept: "Credit assignment" — figuring out which weight contributed how much to the error.
Practical Application: The engine of training Neural Networks.
Bagging (Bootstrap Aggregating)
Definition: Training multiple models in parallel on random subsets (with replacement) of training data.
Key Example: Random Forest.
Practical Application: Reducing Variance (overfitting). Great for high-variance models like Decision Trees.
Batch Normalization
Definition: Normalizing layer inputs to have mean 0 and variance 1 for each mini-batch.
Formula: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
Practical Application: Accelerates training, allows higher learning rates, and acts as a weak regularizer.
Interview Note: During Inference, use the moving average of mean/var calculated during training, not the batch statistics.
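A NumPy sketch of the forward pass that makes the train-vs-inference distinction from the Interview Note explicit; the momentum-style running-average update shown is one common convention, not the only one:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)       # batch statistics
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var           # inference: use moving averages
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.randn(32, 4)                            # batch of 32, 4 features
gamma, beta = np.ones(4), np.zeros(4)                 # learnable scale and shift
out, rm, rv = batchnorm_forward(x, gamma, beta, np.zeros(4), np.ones(4))
```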
Bias-Variance Tradeoff
Definition: The tension between error from overly simplistic assumptions (Bias) and error from over-sensitivity to the training data (Variance).
Key Insight:
High Bias: Underfitting (Linear Regression on nonlinear data).
High Variance: Overfitting (100-depth Decision Tree).
Goal: Find the "Sweet Spot".
Binary Cross-Entropy (Log Loss)
Definition: Loss function for binary classification.
Formula: $-\frac{1}{N} \sum [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$
Practical Application: Evaluating probabilities in spam detection or churn prediction.
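The formula in NumPy; the clipping is an implementation guard against $\log(0)$, not part of the definition:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
print(binary_cross_entropy(y, np.array([0.9, 0.1, 0.8])))  # low loss: confident and correct
print(binary_cross_entropy(y, np.array([0.1, 0.9, 0.2])))  # high loss: confident and wrong
```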
C
Confusion Matrix
Definition: A table layout that visualizes the performance of a supervised learning algorithm.
Components:
TP: Hit.
TN: Correct Rejection.
FP: False Alarm (Type I Error).
FN: Miss (Type II Error).
Cosine Similarity
Definition: Measure of similarity between two non-zero vectors using the cosine of the angle between them.
Formula: $\frac{A \cdot B}{||A|| ||B||}$
Practical Application: Semantic Search, RAG (Retrieval Augmented Generation), Document Similarity.
Interview Note: Range is [-1, 1]. In high-dimensional spaces (embeddings), usually [0, 1].
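The formula in NumPy; note that scaling a vector changes its magnitude but not the similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # ~1.0: same direction despite different magnitudes
```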
Cross-Validation (K-Fold)
Definition: Resampling procedure used to evaluate ML models on a limited data sample.
Practical Application: Validating that your model isn't just memorizing the specific train-test split.
Interview Note: For Time Series, use TimeSeriesSplit (Walk-Forward validation), never random K-Fold.
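A scikit-learn sketch contrasting the two regimes on toy synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # toy data
model = LogisticRegression(max_iter=1000)

# Standard 5-fold CV: every sample is used for validation exactly once.
print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())

# Time series: folds must respect temporal order (train always precedes test).
print(cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)).mean())
```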
D
Data Leakage
Definition: When information from outside the training dataset (or from the future) is used to create the model.
Examples: Using 'Target' in feature engineering, scaling data before splitting, future timestamps.
Interview Note: If you see 99.9% accuracy, suspect leakage immediately.
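A classic trap is fitting a scaler on the full dataset before splitting. A scikit-learn Pipeline avoids this by fitting preprocessing on training data only (toy data, illustrative sketch):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fit on the training split only, so test-set statistics
# never leak into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```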
Dimensionality Reduction
Definition: Transformation of data from a high-dimensional space into a low-dimensional space.
Techniques:
PCA (Linear): Preserves variance.
t-SNE / UMAP (Non-linear): Preserves local structure/clusters.
Practical Application: Visualizing embeddings, reducing noise, speeding up training.
Dropout
Definition: Regularization technique where randomly selected neurons are ignored during training.
Key Concept: Prevents neurons from co-adapting too much (relying on specific peers).
Practical Application: Common in fully connected layers; used more sparingly in convolutional layers.
E
Eigenvalue / Eigenvector
Definition: For a matrix $A$, $Av = \lambda v$. $v$ is the eigenvector (direction), $\lambda$ is the eigenvalue (scaling factor).
Practical Application: PCA (Principal Component Analysis) projects data onto the eigenvectors with largest eigenvalues.
Embedding
Definition: A relatively low-dimensional space into which high-dimensional vectors can be translated.
Key Concept: Semantic meaning. "King" - "Man" + "Woman" $\approx$ "Queen".
Practical Application: Word2Vec, BERT embeddings, Recommender Systems users/items.
Entropy (Shannon)
Definition: Measure of uncertainty or impurity in a dataset.
Formula: $H(X) = - \sum p(x) \log p(x)$
Practical Application: Decision Trees use this (Information Gain) to decide split points.
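A NumPy sketch using log base 2, so the result is in bits:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                    # 0 * log(0) is taken as 0 by convention
    return -np.sum(p * np.log2(p))  # bits when using log base 2

print(entropy([0.5, 0.5]))  # 1.0 bit: maximum uncertainty for two classes
print(entropy([1.0, 0.0]))  # 0.0: a pure node in a Decision Tree
```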
F
F1 Score
Definition: The harmonic mean of Precision and Recall.
Formula: $2 \times \frac{Precision \times Recall}{Precision + Recall}$
Practical Application: The "single number" metric for imbalanced classification.
Interview Note: Harmonic mean punishes extreme values more than arithmetic mean (if Recall=0, F1=0).
Fine-Tuning
Definition: Taking a pre-trained model (e.g., Llama-2) and training it further on a specific dataset.
Types:
Full Fine-Tuning: Update all weights.
PEFT (LoRA): Freeze the base weights; train only small adapter matrices.
Practical Application: Customizing an LLM for legal document analysis.
G
Gradient Descent
Definition: An iterative optimization algorithm for finding the local minimum of a function.
Formula: $\theta_{new} = \theta_{old} - \alpha \nabla J(\theta)$, where $\alpha$ is the learning rate and $\nabla J(\theta)$ the gradient.
Practical Application: The fundamental way nearly all loss functions are minimized in ML (see the sketch below).
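A minimal pure-Python sketch minimizing the toy objective $J(\theta) = (\theta - 3)^2$:

```python
def grad_descent(grad_fn, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)  # step against the gradient
    return theta

# Gradient of (theta - 3)^2 is 2 * (theta - 3).
theta_opt = grad_descent(lambda t: 2 * (t - 3), theta=0.0)
print(theta_opt)  # converges toward 3.0
```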
GAN (Generative Adversarial Network)
Definition: Two neural networks trained in competition with each other.
Generator: Creates fakes.
Discriminator: Detects fakes.
Practical Application: DeepFakes, Image Super-resolution, Style Transfer.
H
Hyperparameter Tuning
Definition: Choosing the optimal set of parameters that govern the training process (not learned by the model).
Methods:
Grid Search: Brute force.
Random Search: Surprisingly effective.
Bayesian Optimization: Smarter, probabilistic search.
I
Imbalanced Data
Definition: A dataset with a skewed class distribution (e.g., 1000 : 1).
Solutions:
Resampling (SMOTE, Undersampling).
Class Weights (Change loss function).
Metric Choice (Use F1/AUC, not Accuracy).
IoU (Intersection over Union)
Definition: Metric used to measure the accuracy of an object detector.
Formula: $\frac{Area(Overlap)}{Area(Union)}$
Practical Application: Evaluating bounding boxes in YOLO / R-CNN.
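A plain-Python sketch for axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ format (a common convention, but check your framework's):

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2). Returns intersection-over-union in [0, 1]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143: small overlap, large union
```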
K
K-Means Clustering
Definition: Iterative algorithm that partitions data into $K$ clusters.
Key Concept: Minimizes variance within clusters.
Interview Note: Requires specifying $K$ beforehand (use Elbow Method to find optimal K).
KL Divergence (Kullback-Leibler)
Definition: Measure of how one probability distribution differs from a second, reference distribution.
Formula: $D_{KL}(P || Q) = \sum P(x) \log \frac{P(x)}{Q(x)}$
Practical Application: Loss function in VAEs (Variational Autoencoders) and t-SNE.
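A NumPy sketch using the natural log; the epsilon is an implementation guard, and the example shows that KL is not symmetric:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = [0.5, 0.5]
print(kl_divergence(p, [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence(p, [0.9, 0.1]))  # > 0, and != KL(q || p): not symmetric
```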
L
Learning Rate
Definition: Hyperparameter controlling how much we change the model in response to the estimated error each time the model weights are updated.
Practical Application: Too high = diverge. Too low = slow convergence.
Tip: Use a Scheduler (Cosine Decay, Warmup).
LSTM (Long Short-Term Memory)
Definition: A type of RNN capable of learning long-term dependencies.
Key Mechanics: Input Gate, Forget Gate, Output Gate.
Practical Application: Time-series forecasting, older NLP translation models.
M
MSE / MAE / RMSE
MSE (Mean Squared Error): Penalizes large errors heavily. Differentiable.
MAE (Mean Absolute Error): Robust to outliers. Not differentiable at 0.
RMSE: Root of MSE. Interpretable: same units as the target.
Momentum
Definition: Technique to accelerate Gradient Descent by accumulating a velocity vector in directions of persistent reduction in the objective function.
Analogy: A heavy ball rolling down a hill (it gains speed).
N
Normalization vs. Standardization
Normalization (Min-Max): Scales data to [0, 1]. Best for image data / bounded ranges.
Standardization (Z-Score): Scales data to $\mu=0, \sigma=1$. Best for algorithms assuming Gaussian distribution (SVM, Logistic Regression).
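Both transforms in NumPy; note how the outlier (100) squashes the min-max result:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 100.0])

x_minmax = (x - x.min()) / (x.max() - x.min())  # normalization -> [0, 1]
x_zscore = (x - x.mean()) / x.std()             # standardization -> mean 0, std 1

print(x_minmax)  # [0. 0.111 0.222 1.]: first three values crammed near 0
print(x_zscore)  # mean ~0, std ~1
```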
O
Overfitting
Definition: When a model learns the detail and noise in the training data to the extent that it negatively impacts performance on new data.
The Fix:
More data.
Regularization (L1/L2/Dropout).
Simpler Model.
Early Stopping.
P
PCA (Principal Component Analysis)
Definition: Linear dimensionality reduction.
Key Concept: Finds axes (Principal Components) that maximize variance.
Interview Note: Sensitive to scale (must standardize data first!).
Precision & Recall
Precision: $\frac{TP}{TP + FP}$ (Quality). "Of all the spam we detected, how much was actually spam?"
Recall: $\frac{TP}{TP + FN}$ (Quantity). "Of all the actual spam, how much did we find?"
Tradeoff: Increasing threshold increases Precision but decreases Recall.
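A small worked sketch on hypothetical scores, sweeping the threshold to show the tradeoff directly:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    # Precision climbs as the threshold rises; recall drops or holds.
    print(threshold, tp / (tp + fp), tp / (tp + fn))
```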
R
RAG (Retrieval-Augmented Generation)
Definition: Enhancing LLMs by retrieving relevant documents from an external knowledge base before generating an answer.
Components: Vector DB + Embedding Model + LLM.
Practical Application: "Chat with PDF", Enterprise Search.
Regularization
L1 (Lasso): Adds the sum of absolute coefficient values to the loss. Can shrink coefficients to exactly zero (Feature Selection).
L2 (Ridge): Adds the sum of squared coefficient values to the loss. Shrinks coefficients towards zero but not to zero.
ReLU (Rectified Linear Unit)
Definition: Activation function $f(x) = max(0, x)$.
Why? Computationally cheap, solves vanishing gradient in positive domain.
S
Softmax
Definition: Function that turns a vector of $K$ real values into $K$ probabilities (positive values that sum to 1).
Formula: $\sigma(z)_i = \frac{e^{z_i}}{\sum e^{z_j}}$
Practical Application: Final layer of Multi-Class Classification.
SGD (Stochastic Gradient Descent)
Definition: Gradient descent using only a single sample (or mini-batch) to calculate the gradient.
Why? Faster per step, adds noise which helps escape local minima.
SVM (Support Vector Machine)
Definition: Finds the hyperplane that maximizes the margin between two classes.
Key Concept: Kernel Trick (maps non-linear data to higher dimensions where it becomes linear).
T
Transformer
Definition: Deep learning architecture based entirely on the Attention mechanism.
Key Benefit: Parallelizable (unlike RNNs) and captures long-range dependencies far more effectively.
Transfer Learning
Definition: Storing knowledge gained while solving one problem and applying it to a different but related problem.
Practical Application: Using ResNet trained on ImageNet to classify medical X-rays.
V
Vanishing Gradient
Definition: In deep networks, gradients can shrink exponentially as they backpropagate, effectively stopping early layers from training.
Solution: ResNet (Skip connections), ReLU, BatchNorm.
Vector Database
Definition: A database optimized for storing and querying high-dimensional vectors.
Examples: Pinecone, Milvus, Chroma.
Practical Application: The "Long Term Memory" for LLM Agents.
Z
Zero-Shot Learning
Definition: The ability of a model to recognize objects or perform tasks it has not seen during training.
Example: Asking GPT-4 to categorize text into "Happy/Sad" without giving it examples, simply by describing the task.