Deep Learning
Overview
Deep learning is a subset of machine learning that uses neural networks with multiple layers to learn hierarchical representations of data. This directory contains comprehensive notes on deep learning concepts, architectures, and applications.
Table of Contents
Core Components - Neural network building blocks
Layers - Dense, Conv, RNN, Attention
Activation Functions - ReLU, GELU, Sigmoid, Softmax
Loss Functions - MSE, Cross-Entropy, Hinge
Optimizers - SGD, Adam, AdamW
Methods & Applications - Modern architectures
Computer Vision - CNNs, ResNet, ViT
Natural Language Processing - Transformers, BERT, GPT
Generative Models - GANs, VAEs, Diffusion
Advanced Topics - Production and scaling
When to Use Deep Learning
Choose Deep Learning When:
Large Dataset: 100K+ samples available
Unstructured Data: Images, text, audio, video
Complex Patterns: Non-linear relationships
Performance Priority: Accuracy over interpretability
Computational Resources: Access to GPUs/TPUs
Choose Traditional ML When:
Small Dataset: < 10K samples
Tabular Data: Structured features
Need Interpretability: Medical, financial domains
Limited Resources: CPU-only environment
Quick Iteration: Rapid prototyping needed
Quick Comparison: Traditional ML vs Deep Learning
| Aspect | Traditional ML | Deep Learning |
|---|---|---|
| Data Size | 1K-100K samples | 100K-1M+ samples |
| Feature Engineering | Manual (critical) | Automatic (learned) |
| Training Time | Minutes to hours | Hours to days |
| Hardware | CPU sufficient | GPU/TPU required |
| Interpretability | High | Low (black box) |
| Performance Ceiling | Plateaus with more data | Improves with scale |
| Best For | Tabular, structured | Images, text, audio |
Modern Deep Learning Timeline
Core Architecture Families
1. Convolutional Neural Networks (CNNs)
Purpose: Extract spatial features from images
Key Components:
Conv Layers: Learn local patterns (edges, textures, shapes)
Pooling: Downsample feature maps (Max, Average)
Fully Connected: Final classification/regression
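A minimal sketch of how these pieces compose in PyTorch (layer sizes and the 32x32 input are illustrative, not from any specific architecture):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Conv -> Pool -> Conv -> Pool -> Linear, for 32x32 RGB images (e.g., CIFAR-10)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # learn local patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)  # final classification

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SmallCNN()(torch.randn(1, 3, 32, 32))  # shape: (1, 10)
```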
Modern Architectures:
ResNet: Skip connections solve vanishing gradients
EfficientNet: Compound scaling (depth + width + resolution)
Vision Transformer (ViT): Transformers replacing convolutions
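The ResNet idea in code, as a hedged sketch of a basic residual block (channel counts are illustrative): the identity path lets gradients bypass the convolutions, which is what mitigates vanishing gradients.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection: gradient flows through the identity
```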
Applications:
Image classification
Object detection (YOLO, Faster R-CNN)
Semantic segmentation (U-Net, Mask R-CNN)
Face recognition
2. Recurrent Neural Networks (RNNs)
Purpose: Process sequential data (time series, text)
Variants:
Vanilla RNN: Simple recurrence (vanishing gradient issues)
LSTM: Long Short-Term Memory (gates to control information flow)
GRU: Gated Recurrent Unit (simplified LSTM)
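A quick sketch of running a batch of sequences through an LSTM in PyTorch (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 50, 16)      # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)    # output: (4, 50, 32); h_n, c_n: (1, 4, 32)
# output holds the hidden state at every time step; h_n is the final hidden state.
```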
Limitations:
Sequential processing (slow, can't parallelize)
Difficulty capturing long-range dependencies
Replaced by: Transformers for most NLP tasks
Still Used For:
Time series forecasting
Audio processing
Video analysis
3. Transformers
Purpose: Capture long-range dependencies via self-attention
Key Innovation: Self-attention. Every position attends to every other position via scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T/√d_k)V, where d_k is the key dimension.
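A minimal implementation of that formula (shapes are illustrative; real Transformers add learned Q/K/V projections, multiple heads, and masking):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)  # self-attention: Q, K, V come from the same sequence
out = attention(q, k, v)            # (2, 10, 64)
```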
Advantages:
Fully parallelizable (faster training than RNNs)
Captures global dependencies
Scales extremely well with data and compute
Architecture Types:
| Type | Example | Use Case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Text understanding, classification |
| Decoder-only | GPT, LLaMA | Text generation, completion |
| Encoder-Decoder | T5, BART | Translation, summarization |
Modern Applications:
NLP: GPT-4, BERT, T5
Vision: ViT, CLIP, Swin Transformer
Multimodal: CLIP, Flamingo, GPT-4V
4. Generative Models
Purpose: Create new data samples
Architecture Types:
| Model | How It Works | Best For |
|---|---|---|
| GAN | Generator vs Discriminator (adversarial) | Image generation, style transfer |
| VAE | Encoder-decoder with latent space | Smooth interpolation, data augmentation |
| Diffusion | Iterative denoising process | High-quality image generation (DALL-E, Midjourney) |
| Autoregressive | Predict next token sequentially | Text generation (GPT), image generation (PixelCNN) |
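To make the adversarial setup concrete, here is a hedged sketch of one GAN training step on toy 2-D data (model sizes and the "real" distribution are made up for illustration):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))  # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 2) * 0.5 + 2.0   # toy "real" data
fake = G(torch.randn(32, 16))

# Discriminator step: push real -> 1, fake -> 0 (detach so G is not updated here)
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: make D classify fresh fakes as real
g_loss = bce(D(G(torch.randn(32, 16))), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```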
Applications:
Image generation (Stable Diffusion, Midjourney)
Text generation (GPT, Claude)
Code generation (GitHub Copilot)
Drug discovery
Data augmentation
Training Best Practices
Preventing Overfitting
Regularization:
Dropout (0.2-0.5)
Weight decay / L2 regularization
Batch normalization
Early stopping
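A hedged sketch combining these in PyTorch (sizes and rates are illustrative; `validate` is a hypothetical helper returning validation loss):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Dropout(p=0.3),   # randomly zero 30% of activations during training
    nn.Linear(256, 10),
)
# AdamW applies decoupled weight decay (an L2-style penalty on the weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Early stopping: track the best validation loss, stop after `patience` bad epochs
best, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(100):
    val_loss = validate(model)  # hypothetical helper
    if val_loss < best:
        best, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break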
Data Augmentation:
Images: Rotation, flipping, cropping, color jitter
Text: Back-translation, synonym replacement
MixUp/CutMix: Mix training examples
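For images, a typical torchvision pipeline looks like this (the transform choices and magnitudes are illustrative):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # flip half the images
    transforms.RandomCrop(32, padding=4),                  # random crop with padding
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color jitter
    transforms.ToTensor(),
])
# Pass as `transform=train_tf` when constructing a torchvision dataset, e.g. CIFAR-10.
```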
Architecture Choices:
Smaller networks
Reduce layer depth/width
Use pretrained models (transfer learning)
Optimization Techniques
Learning Rate Strategies:
Warm-up: Start low, gradually increase
Decay: Reduce over time (step, exponential, cosine)
Cyclical: Oscillate between bounds
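A sketch of warm-up followed by cosine decay using PyTorch's built-in schedulers (the epoch counts and rates are illustrative; `train_one_epoch` is a hypothetical helper):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=95)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]  # warm up 5 epochs, then decay
)

for epoch in range(100):
    train_one_epoch(model, optimizer)  # hypothetical training loop
    scheduler.step()
```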
Gradient Management:
Gradient clipping (prevent exploding gradients)
Gradient accumulation (simulate larger batches)
Mixed precision training (FP16 + FP32)
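These three techniques combine naturally in one training loop. A hedged sketch (assumes `model`, `criterion`, `optimizer`, and `loader` are already defined and a CUDA GPU is available):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow
accum_steps = 4                       # simulate a 4x larger batch

for step, (x, y) in enumerate(loader):
    with torch.autocast("cuda", dtype=torch.float16):  # mixed precision forward pass
        loss = criterion(model(x), y) / accum_steps    # normalize for accumulation
    scaler.scale(loss).backward()                      # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                     # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```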
Advanced Optimizers:
Adam/AdamW (default choice)
SGD + Momentum (often generalizes better; common for fine-tuning)
LAMB, RAdam (large batch training)
Transfer Learning & Fine-Tuning
Why Transfer Learning?
Leverage pretrained models (trained on millions of images/text)
Requires less data (can work with 100s of samples)
Faster training (hours vs days)
Better performance (pretrained features)
Strategy: Train a new head on top of the frozen pretrained base first; then, if needed, unfreeze some or all layers and fine-tune end-to-end at a much lower learning rate.
Learning Rate Rules:
Frozen base: Regular LR (1e-3)
Fine-tuning: Very low LR (1e-5 to 1e-6)
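A sketch of the freeze-then-fine-tune recipe with a torchvision ResNet (the 10-class head is an illustrative assumption):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False                        # freeze the pretrained base

model.fc = nn.Linear(model.fc.in_features, 10)     # new head, trainable by default
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)  # regular LR for the head

# Later, to fine-tune end-to-end: unfreeze and drop the LR (e.g., 1e-5)
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```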
Popular Pretrained Models:
Vision: ResNet, EfficientNet, ViT, CLIP
NLP: BERT, RoBERTa, T5, GPT-2
Multimodal: CLIP, ALIGN
Hardware & Scaling
Training Considerations
| Aspect | Small Scale | Large Scale |
|---|---|---|
| Hardware | Single GPU (RTX 3090, A100) | Multi-GPU cluster, TPU pods |
| Batch Size | 16-128 | 1024-4096+ |
| Dataset Size | 10K-1M samples | 10M-1B+ samples |
| Training Time | Hours to days | Days to weeks |
| Techniques | Standard backprop, data augmentation | Distributed training, gradient accumulation, mixed precision |
Distributed Training Strategies
Data Parallelism: Split batch across GPUs
Model Parallelism: Split model across GPUs (for very large models)
Pipeline Parallelism: Split layers across GPUs
ZeRO (Zero Redundancy Optimizer): Optimize memory usage
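A minimal data-parallelism sketch with PyTorch DDP (assumes launch via `torchrun` with one process per GPU; `MyModel` is a hypothetical module):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # one process per GPU, set up by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])
# Each rank computes gradients on its shard of the batch; DDP all-reduces them so every
# replica applies the same update. Use DistributedSampler in the DataLoader to shard data.
```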
Inference Optimization
Quantization: INT8/INT4 (reduce precision)
Pruning: Remove unnecessary weights
Knowledge Distillation: Train smaller student model
Model Compression: ONNX, TensorRT, TFLite
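As one concrete example, PyTorch's dynamic quantization converts Linear weights to INT8 in a single call (the model here is a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # INT8 weights for all Linear layers
)
# Weights are stored in INT8; activations are quantized on the fly at inference time.
```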
Common Interview Topics
Conceptual Questions
1. "Explain vanishing/exploding gradients"
Vanishing: Gradients become very small in early layers, so those layers stop learning. Caused by deep networks with sigmoid/tanh activations. Fixes: ReLU, batch norm, residual (ResNet) connections.
Exploding: Gradients become very large, destabilizing training. Common in deep RNNs. Fixes: gradient clipping, lower learning rate.
2. "Why batch normalization helps?"
Normalizes layer inputs, reducing internal covariate shift. Allows higher learning rates, faster convergence, regularization effect. Used in CNNs.
3. "Transformer vs RNN?"
Transformer: parallel processing, better at long-range dependencies, scales well with data and compute. RNN: sequential processing, slower, prone to vanishing gradients. Transformers now dominate NLP.
4. "How does attention work?"
Learns to focus on the relevant parts of the input via the Query-Key-Value mechanism: Attention(Q, K, V) = softmax(QK^T/√d_k)V
5. "Transfer learning vs training from scratch?"
Transfer: Use pretrained weights, faster, less data needed, better for small datasets. Scratch: More data needed, longer training, full control.
Practical Questions
6. "Model overfits on training data. What to do?"
1. Get more data
2. Regularization (dropout, weight decay)
3. Data augmentation
4. Reduce model size
5. Early stopping
6. Batch normalization
7. "How to choose batch size?"
Larger batches: faster training and better GPU utilization, but they need more memory and may hurt generalization. Start with the largest batch that fits in memory (e.g., 32, 64, 128). Use gradient accumulation if memory is the constraint.
8. "CNN vs Vision Transformer?"
CNN: Better for small-medium datasets, inductive bias (locality), less data needed. ViT: Better for large datasets (>1M images), more scalable, SOTA performance.
Modern Tools & Frameworks
Deep Learning Frameworks
PyTorch: Research-friendly, dynamic graphs, most popular
TensorFlow/Keras: Production-ready, static graphs, Google ecosystem
JAX: High-performance, functional, research (Google)
Pretrained Model Hubs
Hugging Face: NLP models (BERT, GPT, T5)
timm (PyTorch Image Models): Vision models (ResNet, EfficientNet, ViT)
TensorFlow Hub: Google's model repository
Training Tools
Weights & Biases: Experiment tracking
MLflow: Model versioning, tracking
TensorBoard: Visualization
DeepSpeed: Distributed training optimization
Lightning: PyTorch wrapper for cleaner code
Learning Path
Beginner
Understand neural network basics (forward/backward prop)
Implement simple MLP from scratch
Learn one framework (PyTorch recommended)
Build CNN for image classification (MNIST, CIFAR-10)
Understand training process (loss, optimization, regularization)
Intermediate
Transfer learning with pretrained models
Build RNN/LSTM for sequence tasks
Understand attention mechanism
Implement simple Transformer
Work with real datasets (ImageNet, COCO)
Learn hyperparameter tuning
Advanced
Read research papers (Attention Is All You Need, ResNet, BERT)
Implement SOTA architectures from scratch
Distributed training on multiple GPUs
Fine-tune large language models
Contribute to open-source models
Deploy models to production
Further Reading
Books:
Deep Learning (Goodfellow, Bengio, Courville)
Dive into Deep Learning (d2l.ai)
Courses:
CS231n: CNNs for Visual Recognition (Stanford)
CS224n: NLP with Deep Learning (Stanford)
Fast.ai Practical Deep Learning
Papers:
ImageNet Classification with Deep CNNs (AlexNet)
Deep Residual Learning (ResNet)
Attention Is All You Need (Transformer)
BERT, GPT, T5
Note: For more detailed topic coverage, see subdirectories:
parts-of-deep-learning/ - Detailed component explanations
deep-learning-methods/ - Architecture-specific guides
mcp.md - Advanced production topics