Deep Learning & NLP — Complete Study Guide

8 volumes in correct learning sequence. Enhanced with AI engineering notes, deduplication, and production insights.

v2 Enhancements: Redundant vanishing-gradient explanations merged into one definitive section. Duplicate activation function coverage consolidated. Added: Weight Initialization table, Batch Normalization, Dropout math, Learning Rate Scheduling, CNN vs ViT comparison, BERT/GPT architecture comparison, Transformer scaling laws, and production deployment checklists.
I
Volume 1 · Start Here
Neural Network Foundations
01 — What is a Neural Network?
🧠 The Big Picture — Squiggle Fitting Machines

Neural networks map complex inputs to accurate outputs by fitting mathematical curves to data. They are not mysterious black boxes — they use calculus, statistics, and linear algebra to iteratively minimise error.

Input Layer
Ingests features (X₁, X₂, X₃). Raw data converted to numbers first.
Hidden Layers
Feature learning. Weights + biases transform signals. Activation functions add non-linearity.
Output Layer
Final prediction ŷ. Sigmoid for binary, Softmax for multiclass, Linear for regression.
Neuron output: y = Σ(wᵢ · xᵢ) + b → activation(y) → ŷ

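Weights multiply inputs to stretch/flip curves. Biases shift them. With enough hidden units, a network can approximate any continuous function on a bounded input range to arbitrary accuracy (Universal Approximation Theorem); stacking layers builds that capacity out of simple units.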
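The neuron formula above can be sketched in a few lines of plain Python. The inputs, weights, and bias are made-up numbers for illustration, and sigmoid is only one possible activation.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical values, chosen only for illustration
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
y_hat = neuron(x, w, b)  # z = 0.5 - 0.5 + 0.0 = 0.0, so y_hat = sigmoid(0) = 0.5
```

A layer is just many such neurons sharing the same inputs, each with its own weights and bias.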

🏗️ Architecture Diagram
[Diagram: inputs X₁, X₂, X₃ → Hidden 1 → Hidden 2 (units H₁, H₂, H₃) → outputs ŷ₁, ŷ₂. The forward pass runs left to right; the backward pass (backprop) runs right to left.]
The learning loop: Forward → compute loss → Backward (chain rule) → update weights → repeat until convergence.
02 — The Vanishing Gradient Problem (Definitive Reference)
⚠️ Why Gradients Vanish — The Math

During backpropagation, gradients are multiplied through every layer using the chain rule. The Sigmoid derivative never exceeds 0.25; the Tanh derivative peaks at 1 but falls toward 0 as soon as the unit saturates. In a deep network the product of these factors decays exponentially.

[Figure: gradient magnitude per layer in a 10-layer sigmoid network. Strong at the output layer (L10), ≈ 0 by the first layer (L1), so early layers stop learning.]
0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 ≈ 0.000015 → early weights barely update (best case for eight sigmoid layers)
Root Cause
Sigmoid derivative ≤ 0.25 (Tanh ≤ 1, and far smaller once saturated). Chain rule multiplies these across every layer. Deep networks → gradient ≈ 0.
Fix for FFNNs
Use ReLU (derivative = 0 or 1, never shrinks). Batch Normalisation. Residual connections (skip layers).
Fix for RNNs
LSTM / GRU (additive cell state updates, not multiplicative). Gradient clipping for exploding gradients.
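A quick numeric check of the decay described above. This is a minimal sketch that only looks at the best case, where every sigmoid sits at its steepest point; real networks also multiply by the weights, which this ignores.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def best_case_gradient(num_layers):
    """Upper bound on the chained sigmoid derivatives across num_layers layers."""
    return sigmoid_grad(0.0) ** num_layers

for depth in (2, 4, 8, 16):
    print(depth, best_case_gradient(depth))
```

Even in this best case the factor shrinks by 4× per layer, which is why the fixes listed above matter for deep stacks.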
03 — Backpropagation & Chain Rule
🔄 The 4-Step Learning Loop
1
Forward Pass
Compute all layer outputs. Cache intermediate values (needed for gradients). Output = ŷ.
2
Compute Loss
Compare ŷ vs y. Get one scalar error value. e.g., MSE = (y−ŷ)², Cross-Entropy = −Σy·log(ŷ).
3
Backward Pass (Chain Rule)
Propagate error backward. Multiply partial derivatives: ∂L/∂W = ∂L/∂O_out × ∂O_out/∂O_hidden × ∂O_hidden/∂W
4
Update Weights
W_new = W_old − η × ∂L/∂W. Repeat until convergence.
W_new = W_old − η × (∂L / ∂W)
📐 Weight Update Intuition
[Figure: loss L plotted against weight W. Steep slope → big step; flat slope near the minimum → baby step; η too high → overshoot.]

Step size = slope × η. Naturally large near steep regions, tiny near minimum. Learning rate η too high → oscillates. Too low → painfully slow.
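The update rule and the effect of η can be tried on a toy problem. The loss L(w) = (w − 3)² and both learning rates below are assumptions picked for illustration, not values from this guide.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr, steps):
    """Repeatedly apply W_new = W_old - lr * dL/dW."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

w_good = gradient_descent(w=0.0, lr=0.1, steps=100)   # settles near the minimum at 3
w_bad = gradient_descent(w=0.0, lr=1.1, steps=100)    # each step overshoots further
```

With lr=0.1 the distance to the minimum shrinks by a constant factor every step; with lr=1.1 it grows, which is the overshoot case.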

04 — Weight Initialization (Often Missed)
Weight Initialization Strategies — Critical for Training Stability

Poor initialization can cause vanishing or exploding gradients before training even starts. This is one of the most commonly overlooked topics.

All Zeros ❌
All neurons learn the same gradient. Network never learns. Symmetry problem — all weights stay identical.
Xavier / Glorot ✓
W ~ U[−√(6/(nᵢₙ+nₒᵤₜ)), +√(6/(nᵢₙ+nₒᵤₜ))]
Best for Sigmoid/Tanh. Keeps variance stable through layers.
He Initialization ✓
W ~ N(0, σ²) with σ = √(2/nᵢₙ)
Best for ReLU activations. Accounts for ReLU zeroing half the inputs.
import torch
torch.nn.init.xavier_uniform_(layer.weight)  # Sigmoid/Tanh
torch.nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')  # ReLU
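To see what scale these schemes actually produce, the bounds can be computed by hand. This is a framework-free sketch; the layer sizes are hypothetical and the helper names are not from any library.

```python
import math
import random

def xavier_uniform_limit(fan_in, fan_out):
    """Bound a of the uniform range [-a, +a] used by Xavier/Glorot init."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Standard deviation of the zero-mean normal used by He init."""
    return math.sqrt(2.0 / fan_in)

def he_normal_weights(fan_in, fan_out, seed=0):
    """Sample a fan_out x fan_in weight matrix (nested lists) with He init."""
    rng = random.Random(seed)
    std = he_normal_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

limit = xavier_uniform_limit(256, 128)  # sqrt(6 / 384) = 0.125
std = he_normal_std(200)                # sqrt(2 / 200) = 0.1
W = he_normal_weights(200, 50)
```

Wider layers get smaller initial weights, which keeps the variance of each layer's output roughly constant.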
II
Volume 2
Activation Functions — Consolidated Reference
01 — Complete Activation Function Reference
Sigmoid
σ(x) = 1/(1+e⁻ˣ)
Output: 0–1. Binary output layer. NOT zero-centred. Max derivative 0.25 → vanishing gradient. Avoid in hidden layers.
✗ Vanishing ∇ · Binary output
Tanh
(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)
Zero-centred (output −1 to +1) → faster convergence. Still saturates. Better than Sigmoid in hidden layers. Common in RNNs, where it produces the candidate values in LSTM/GRU cells (the gates themselves use Sigmoid).
✓ Zero-centred · LSTM candidate values
ReLU
max(0, x)
Derivative 0 or 1, so active units pass gradients through unshrunk. Default hidden layer choice. Risk: Dying ReLU (a unit whose input stays negative outputs 0, receives zero gradient, and stops learning). Use He init.
✓ Default hidden · ✗ Dying ReLU
Leaky ReLU
max(0.01x, x)
Fixes dying ReLU. Negative slope = 0.01, so the gradient is never exactly 0. The slope is a fixed constant; PReLU learns it instead (slope α).
✓ No dead neurons
ELU
x for x ≥ 0 · α(eˣ−1) for x<0
Mean output closer to zero than ReLU, smooth at 0 (for α = 1). Computationally expensive (exponential). Use when zero-centering matters more than speed.
✓ Zero-centred · ✗ Slow
Swish
x · σ(x) — Google Brain
Self-gating. Smooth, non-monotonic. Reported to outperform ReLU mainly in very deep networks (roughly 40+ layers). Costs more compute than ReLU, so reserve it for those cases.
>40 layers only
Softmax
eᶻʲ / Σeᶻᵏ
Output layer for multiclass. Converts raw logits to probability distribution summing to 1. Paired with Categorical Cross-Entropy.
Multiclass output
PReLU
max(αx, x) — α learned
α updated by backprop. α=0 → ReLU. α=0.01 → Leaky ReLU. Best of both: learns the optimal negative slope for your data.
✓ Adaptive α
02 — ArgMax vs SoftMax — The Output Decision
SoftMax — During Training

Converts raw logits into a probability distribution — all values between 0 and 1, summing to 1. Used during training so the loss function (Cross-Entropy) can compare smooth probabilities against true labels.

SoftMax([2.1, 0.5, 1.3]) → [0.61, 0.12, 0.27]   (sum = 1.0)
Training output layer · Differentiable · ✓ Smooth gradients
ArgMax — During Inference

Simply picks the index of the highest value. No probabilities — just the winning class. Used at inference time when you only need the final predicted label, not a probability.

ArgMax([2.1, 0.5, 1.3]) → 0   (index of 2.1, the largest)
Inference / prediction · Not differentiable · ✓ Fastest — no exp()
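Both operations fit in a few lines of plain Python. This is a minimal sketch: subtracting the maximum logit before exponentiating is a standard stability trick that the guide does not mention, and it does not change the result.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.1, 0.5, 1.3]
probs = softmax(logits)
label = argmax(logits)          # index of 2.1
same_label = argmax(probs)      # softmax preserves the ordering, so the index matches
```

Because softmax never reorders its inputs, argmax can be applied to the raw logits at inference and the exponentials skipped entirely.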
03 — Activation Function Decision Guide
If your situation is… | Use this | Why
Hidden layers (default) | ReLU | Fast, no vanishing gradient, industry default. Use He init.
Dead neurons / stuck training | Leaky ReLU or PReLU | Keeps gradient non-zero for negatives. PReLU learns optimal α.
Very deep network (>40 layers) | Swish | Outperforms ReLU in deep nets. Accept compute cost.
Binary classification output | Sigmoid | Outputs 0–1 probability. Pair with Binary CE loss.
Multiclass output | Softmax | Probabilities sum to 1. Pair with Categorical CE.
Regression output | Linear (none) | Unbounded real number output. Pair with MSE/MAE.
RNN/LSTM gates | Sigmoid + Tanh | Sigmoid for gating (0–1). Tanh for values (−1 to +1).
Training is very slow (ELU) | Leaky ReLU | ELU's exponential is expensive. Leaky ReLU is faster with similar benefit.
04 — Batch Normalisation (Added)
Batch Normalisation — The Modern Solution to Internal Covariate Shift

BN normalises layer inputs to mean=0, variance=1 per mini-batch. It can be applied before or after the activation function. Dramatically stabilises and speeds up training, and makes training less sensitive to the exact learning rate.

μ = (1/m) Σxᵢ     σ² = (1/m) Σ(xᵢ−μ)²
x̂ᵢ = (xᵢ−μ)/√(σ²+ε)    yᵢ = γ·x̂ᵢ + β
γ and β are learnable parameters — the network learns optimal scale and shift. ε = small constant for numerical stability.
Allows much higher learning rates
Reduces sensitivity to initialization
Acts as mild regulariser (reduces need for Dropout)
Doesn't work well with small batch sizes → use Layer Norm for RNNs/Transformers
model.add(BatchNormalization())  # after Dense, before or after activation
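The BN equations above, written out for a single feature across one mini-batch. This sketch covers the training-time forward pass only; the running statistics that frameworks keep for inference are left out, and the batch values are hypothetical.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

batch = [2.0, 4.0, 6.0, 8.0]      # hypothetical activations of one feature
out = batch_norm(batch)
out_mean = sum(out) / len(out)                                 # close to 0
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)     # close to 1
shifted = batch_norm(batch, gamma=2.0, beta=1.0)               # learned scale and shift
```

With γ=2 and β=1 the output has mean 1 and standard deviation close to 2, which shows how the learnable parameters can undo the normalisation if that helps the loss.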
III
Volume 3
Loss Functions · Optimizers · Regularization
01 — Loss vs Cost — Quick Distinction
Loss Function

Error for a single data point. L = error(ŷ, y). What you minimise conceptually. Calculated on one record passing through the network.

Cost Function

Average loss across an entire batch or dataset. J = (1/n) Σ L. What gradient descent actually minimises — more stable gradient direction.

02 — Regression Loss Functions
MSE / L2 Loss
L = (y − ŷ)²   J = Σ(y − ŷ)²/n
Quadratic bowl in the prediction error: one minimum and a smooth gradient everywhere. Heavily penalises outliers (squaring).
✓ Single global min · ✗ Outlier sensitive
MAE / L1 Loss
L = |y − ŷ|
Robust to outliers (linear, not squared). Sharp bend at 0 → derivative undefined there. Elsewhere the gradient has constant magnitude, so steps do not shrink near the minimum.
✓ Outlier robust · ✗ Harder to optimise
Huber Loss — Best of Both
|err| ≤ δ → ½·err² (MSE-like)   |err| > δ → δ·(|err| − ½δ) (MAE-like)
Smooth convergence (MSE center) + outlier robust (MAE tails). Hyperparameter δ controls transition.
✓ Best of both
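The three regression losses side by side, per data point. The Huber form below is the standard definition with a ½ factor on the quadratic part; the sample values are hypothetical.

```python
def mse_loss(y, y_hat):
    return (y - y_hat) ** 2

def mae_loss(y, y_hat):
    return abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

small = huber_loss(1.0, 0.5)      # inside delta: 0.5 * 0.5**2 = 0.125
outlier = huber_loss(10.0, 0.0)   # outside delta: 1.0 * (10 - 0.5) = 9.5
squared = mse_loss(10.0, 0.0)     # 100.0, the outlier dominates under MSE
```

The outlier costs 100 under MSE but only 9.5 under Huber, so one bad point pulls the fit far less.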
03 — Classification Loss Functions
Function | Classes | Label Format | Formula | When to Use
Binary CE | Exactly 2 | 0 or 1 | −y·log(ŷ) − (1−y)·log(1−ŷ) | Cat vs Dog, spam detection
Categorical CE | > 2 | One-hot array [0,1,0] | −Σ yᵢ·log(ŷᵢ) | Image classification, NLU
Sparse Cat. CE | > 2 | Integer labels (0, 1, 2…) | Same as Cat. CE, auto one-hot | Many classes, saves memory
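The formulas in the table translate directly to code. This is a per-example sketch with a made-up softmax output; real implementations clip probabilities away from 0 and 1 to avoid log(0), which is omitted here.

```python
import math

def binary_cross_entropy(y, y_hat):
    """y is 0 or 1; y_hat is the predicted probability of class 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(one_hot, probs):
    """one_hot is a list like [0, 1, 0]; probs is the softmax output."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_cross_entropy(label, probs):
    """label is the integer class index, so no one-hot vector is needed."""
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]                                # hypothetical softmax output
cce = categorical_cross_entropy([0, 1, 0], probs)
sparse = sparse_categorical_cross_entropy(1, probs)    # same value as cce
bce = binary_cross_entropy(1, 0.9)
```

The categorical and sparse versions return the same number; they differ only in how the label is stored.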
07 — Virtual Environments — Project Best Practice
🐍 Why Virtual Environments Matter in Deep Learning

ML libraries (TensorFlow, PyTorch, CUDA) update frequently and break older code. TensorFlow 1.x vs 2.x are fundamentally incompatible. A virtual environment isolates each project's exact dependency versions.

Create with Conda
conda create -n myproject python=3.10
conda activate myproject
pip install tensorflow keras
Lock Versions with requirements.txt
pip freeze > requirements.txt
# On another machine:
pip install -r requirements.txt
Rule: Create a new Conda environment for every new project. Never install ML libraries into your base Python environment. Use requirements.txt to reproduce the environment on any machine or server.
08 — Loss Function — SSR (Sum of Squared Residuals)
📐 Sum of Squared Residuals (SSR)

The foundational loss for regression. Core of MSE. Squaring residuals ensures all errors are positive and penalises large errors much more than small ones.

SSR = Σ (y − ŷ)²   →   MSE = SSR / n   (average version)
Why square? Ensures positive values. Amplifies large errors (3² = 9 vs 3). Creates smooth differentiable parabola — single global minimum for gradient descent.
When NOT to use SSR/MSE: Dataset with outliers — squaring inflates outlier errors massively, pulling the model away from the majority of normal data. Use MAE or Huber instead.
04 — Optimizers Compared
Optimizer | Mechanism | Strength | Weakness | Use When
Gradient Descent | Entire dataset per update | Stable direction | Catastrophically slow on large data | Never for production
SGD | 1 sample per update | Fast iterations | Very noisy, zigzag path | With momentum for vision
Mini-batch SGD | k samples per update | Best balance | Still some noise | Industry standard baseline
SGD + Momentum | EWA of past gradients. V = β·V_prev + (1−β)·dW | Smoother path | Fixed LR still needed | CV tasks, ResNet training
Adagrad | Divides LR by √(sum of squared grads) | Adaptive per-param LR | LR → 0 over time (fatal) | Sparse data only
RMSprop | EWA of squared grads in denominator | Fixes Adagrad decay | No bias correction | RNNs (historical)
Adam ⭐ | Momentum + RMSprop + Bias Correction | Fast, stable, adaptive | May overfit on small data (try AdamW) | Default for everything
AdamW | Adam + decoupled weight decay | Better generalisation | Extra hyperparameter | Transformers, fine-tuning LLMs
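The "Momentum + RMSprop + Bias Correction" recipe for Adam can be written out for a single parameter. This is a sketch on an assumed toy loss L(w) = (w − 3)²; the learning rate and step count are illustrative choices, while β₁, β₂ and ε are the commonly used defaults.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a one-parameter function with the Adam update rule."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # momentum: EWA of gradients
        v = beta2 * v + (1 - beta2) * g * g      # RMSprop: EWA of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0)   # approaches the minimum at 3
```

Dividing by √v̂ makes the early steps roughly lr in size regardless of the gradient's scale, which is what "adaptive" means in the table.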
05 — Learning Rate Scheduling (Added)
Learning Rate Scheduling — Critical for Production Training

A fixed LR is rarely optimal. Scheduling reduces LR over time — start large (fast progress) then shrink (precise convergence). Industry standard in all serious training runs.

Step Decay
LR drops by factor every N epochs. Simple but abrupt transitions.
Cosine Annealing
LR follows cosine curve. Smooth, widely used for vision and NLP.
Warmup + Decay
Start with tiny LR, ramp up, then decay. Standard for Transformers and LLMs.
ReduceLROnPlateau
Automatically reduces LR when validation loss stops improving. Easy win.
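from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
# OneCycleLR is stepped once per batch, not once per epoch
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=10)
# in the training loop: optimizer.step() then scheduler.step()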
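The "Warmup + Decay" idea can be expressed as a plain function of the step number. This is a hand-rolled sketch, not the PyTorch scheduler API; the function name and default values are assumptions.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=0.01, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(s, total_steps=1000) for s in range(1001)]
```

Here the LR climbs for the first 100 steps, peaks at base_lr, and then follows a half cosine down to min_lr at the final step.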
06 — Regularization — Dropout, L1, L2
🎲 Dropout

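Randomly deactivates neurons during training (drop probability p = 0.5 is typical). Forces redundant representations, similar in effect to ensemble learning. All neurons are active at inference. To keep expected activations equal, classic dropout scales activations by the keep probability (1 − p) at inference; modern frameworks (Keras, PyTorch) use inverted dropout instead, scaling the surviving activations by 1/(1 − p) during training so inference needs no change.

[Figure: network with active and dropped neurons]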
model.add(Dropout(0.5)) # after LSTM/Dense
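A minimal sketch of inverted dropout, the variant Keras and PyTorch implement, where p is the drop probability and the survivors are rescaled during training. The activations and the helper name are hypothetical.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: drop each unit with probability p during training."""
    if not training or p == 0.0:
        return list(activations)          # inference: every unit stays active
    rng = rng or random.Random()
    keep = 1.0 - p
    # Survivors are scaled by 1/keep so the expected activation is unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, p=0.5, rng=random.Random(0))
eval_out = dropout(acts, training=False)
```

Each training call draws a fresh mask, so the network cannot rely on any single neuron being present.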
📏 L1 vs L2 Regularization
L2 (Ridge / Weight Decay)
J_total = J + λ Σw²
Shrinks all weights toward 0. Most common. Usually exposed as the optimizer's "weight decay" setting (identical to L2 for plain SGD; AdamW applies it in decoupled form). Smooth, differentiable.
L1 (Lasso)
J_total = J + λ Σ|w|
Pushes some weights exactly to 0 — produces sparse models. Good for feature selection.
🎯 Zero-Centred Activations — Why It Matters

If activation output is not centred around zero (like Sigmoid: outputs 0–1 only), gradients during backprop are all positive or all negative for a neuron's weights. This forces a zigzag path to convergence — slower training.

Not Zero-Centred
Sigmoid: output 0–1 only. Gradients always same sign. Zigzag optimisation path.
Zero-Centred ✓
Tanh: output −1 to +1. Gradients can be positive or negative. Faster convergence.
📋 Overfitting Diagnosis
Train acc ↑, Val acc ↓ → overfitting. Add Dropout, reduce model size, add more data, early stopping.
Train acc and Val acc both low → underfitting. Increase model capacity, train longer, reduce regularisation.
Loss oscillates → LR too high. Reduce by 10×. Switch to Adam if on SGD.
Loss barely moves → LR too low, dying neurons, or wrong initialisation.
IV
Volume 4
Convolutional Neural Networks
01 — CNN Architecture & Pipeline
Input: Image (H×W×C) → Step 1: Conv Layer → Step 2: ReLU Activate → Step 3: Max Pooling → Step 4: Conv + ReLU → Step 5: Flatten (1D) → Step 6: Fully Connected → Output: Softmax Classes
02 — Core Operations
🔲 Convolution

Filter slides over image performing element-wise multiply + sum → one output pixel. Stacked filters = feature maps. Filters are learned by backprop.

Output size = ⌊(n + 2p − f) / s⌋ + 1
n=input size · p=padding · f=filter size · s=stride
n=28, f=3, p=1, s=1
→ 28 (same size)
n=28, f=3, p=0, s=2
→ 13 (halved)
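The output-size formula as a helper function, checked against the two cases above. Integer division supplies the floor; the third call is an extra case (2×2 pooling with stride 2) added for illustration.

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - f) // s + 1

same = conv_output_size(n=28, f=3, p=1, s=1)     # 28
strided = conv_output_size(n=28, f=3, p=0, s=2)  # 13
pooled = conv_output_size(n=4, f=2, p=0, s=2)    # 2
```

Running this for every layer in turn is a quick way to find the size of the tensor that reaches the Flatten step.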
📏 Padding (Same vs Valid)
Same (p = ⌊f/2⌋)
Output = Input size. Adds zeros around border. Preserves spatial dimensions. Use in deep nets.
Valid (p = 0)
Output shrinks by (f−1) per layer. Edge pixels under-represented. Use only for small networks.
keras: padding='same' or padding='valid'
⬇️ Max Pooling

Sliding window picks maximum value. Reduces spatial dimensions while retaining strongest activations. Adds a degree of translation invariance: a feature still registers if it shifts by a few pixels.

[Figure: 2×2 max pooling over a 4×4 input → 2×2 output ✓]
Average / Mean Pooling (same operation — two names)
Takes the mean of the region instead of the maximum. Smoother, less sharp features. Less common than Max Pooling for object detection. Used in Global Average Pooling (GAP) before final classifier in modern CNNs like ResNet.
Max Pool → object detection · Avg Pool → smooth features, GAP layers
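Max and average pooling differ only in the function applied to each window, which a small sketch makes concrete. The 4×4 feature map below is made up for illustration.

```python
def pool2d(image, size=2, stride=2, reduce=max):
    """Slide a size x size window over a 2D list and reduce each window."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [image[r * stride + i][c * stride + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

# Hypothetical 4x4 feature map
fmap = [[1, 9, 3, 4],
        [2, 5, 7, 1],
        [0, 2, 8, 6],
        [4, 3, 1, 2]]
max_pooled = pool2d(fmap)                 # [[9, 7], [4, 8]]
avg_pooled = pool2d(fmap, reduce=mean)    # [[4.25, 3.75], [2.25, 4.25]]
```

Global Average Pooling is the same operation with the window set to the full feature map, leaving one number per channel.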
🔄 Data Augmentation

Artificially expands training data by transforming existing images. Same label, different appearance — teaches CNN to be robust to variations.

↔️ Flip — horizontal/vertical mirror
🔄 Rotate — random angle ±15°
🔍 Zoom — random crop + resize
🌫️ Noise — Gaussian pixel noise
☀️ Brightness — colour jitter
🎭 Cutout/Mixup — advanced blend
from torchvision import transforms
transform = transforms.Compose([
  transforms.RandomHorizontalFlip(),
  transforms.RandomRotation(15),
  transforms.ColorJitter(brightness=0.2)
])
CNN vs Vision Transformer (ViT) — When to Use Which
Dimension | CNN | ViT
Data need | Works with small datasets (10K+) | Needs large data (100K+) or pretraining
Inductive bias | Built-in: spatial locality, translation invariance | None — learns all from data
Long-range deps | Weak (needs many layers) | Excellent (global attention)
Compute | Roughly linear in the number of pixels (efficient) | O(n²) in the number of patches (heavier)
Use when | Limited data, edge deployment, real-time | Large dataset, SOTA accuracy, multimodal
V
Volume 5
Natural Language Processing — Fundamentals
01 — NLP Pipeline Overview
Input: Raw Text → Step 1: Tokenize → Step 2: Stop Words (⚠ keep "not") → Step 3: Stem or Lemmatize → Step 4: Vectorize (BoW/TF-IDF) → Step 5: Embed (Word2Vec) → Model: LSTM / TF
02 — Text Preprocessing — Stemming, Lemmatization & Stop Words
🌿 Stemming vs Lemmatization
Stemming — PorterStemmer
Aggressively chops word endings to find a root stem. Fast but often produces non-words. Uses algorithm, not a dictionary.
"history" → histori ❌ (not a real word)   "running" → run ✓
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("history") # → histori
✓ Very fast · ✗ Fake words possible · Best for: spam, toxic classifier
Lemmatization — WordNetLemmatizer
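Uses a dictionary (WordNet) to find the true base form (lemma). Slower, but returns a real dictionary word. Pass the part of speech: the default is noun, and a word not found under that part of speech comes back unchanged.
"better" → good ✓ (as adjective)   "running" → run ✓ (as verb)
from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet') once
wl = WordNetLemmatizer()
wl.lemmatize("better", pos="a") # → good
✓ Real dictionary words · ✗ Slower · Best for: chatbots, translation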
⚠️ Stop Words — The "not" Problem

Removing common low-value words saves computation. But never blindly remove "not" — it completely reverses sentiment meaning.

❌ Removing "not" destroys sentiment
Original: "Food is not good"
After stop-word removal: "Food good" → meaning completely flipped!
Rule: Always use a custom stop word list for your task. Remove "not" from the default NLTK stop words list when doing sentiment analysis.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
stop.remove('not') # critical for sentiment!
03 — N-grams, CountVectorizer & max_features
📊 N-grams — Capturing Word Sequences

Instead of single words (unigrams), N-grams capture sequences of N words. Bigrams and trigrams preserve local context that BoW loses entirely.

Unigram (1,1) — default BoW
"Indian politician" → ["Indian", "politician"] — loses relationship between words
Bigram (2,2) or (1,2)
"Indian politician" → ["Indian politician"] — single feature preserving meaning
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2)) # unigrams + bigrams
cv = CountVectorizer(ngram_range=(2,3)) # bigrams + trigrams only
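What ngram_range produces can be reproduced by hand. This is a plain-Python sketch of the idea, not scikit-learn's implementation; the three-word sentence is an assumed example.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range(tokens, lo, hi):
    """Features for every n from lo to hi, mirroring the ngram_range idea."""
    feats = []
    for n in range(lo, hi + 1):
        feats.extend(ngrams(tokens, n))
    return feats

tokens = "indian politician spoke".split()
unigrams = ngrams(tokens, 1)          # ['indian', 'politician', 'spoke']
bigrams = ngrams(tokens, 2)           # ['indian politician', 'politician spoke']
both = ngram_range(tokens, 1, 2)      # unigrams followed by bigrams
```

Each extra n adds almost as many features as there are tokens, which is why larger ranges inflate the vocabulary quickly.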
⚙️ max_features — Fighting Sparsity

max_features restricts the model to only the top N most frequent words, discarding rare words. Manual way to control vector dimensions and reduce sparse matrix size.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer(max_features=1000)
# Only top 1000 words become features
# Reduces 50,000-dim → 1,000-dim matrix

tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
Practical starting point: max_features=1000–5000 for most classification tasks. Increase if accuracy is low.
04 — Vectorization Methods Compared
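Method | Dimension | Semantic? | OOV? | Sparsity | Best for
Bag of Words | Vocab size (50k+) | None | Fails | ~99% zeros | Simple baseline
TF-IDF | Vocab size | Partial | Fails | Still sparse | Search, ranking
Word2Vec CBOW | 100–300 fixed | Rich | Fails (no vector for unseen words; subword models such as FastText handle them) | Dense (no zeros) | Frequent words
Word2Vec Skip-gram | 100–300 fixed | Rich | Fails (same limitation) | Dense | Rare words, small data
BERT Embeddings | 768 / 1024 | Contextual | Handled (subword tokens) | Dense | Production NLP tasks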
05 — TF-IDF & Word2Vec
📊 TF-IDF — Smarter Weighting
TF-IDF = TF × IDF = count/total × log(N/df)

Rare words → high IDF → high weight. Common words → low IDF → near zero. Surfaces the words that actually matter.

✓ Weights rare words higher ✓ Cancels common words ✗ Still sparse matrix ✗ No semantic meaning
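The formula above, applied to a tiny made-up corpus. This sketch uses the plain definition; scikit-learn's TfidfVectorizer applies a smoothed IDF and normalisation by default, so its numbers will differ.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF = (count / total words in doc) * log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Hypothetical three-document corpus, already tokenised
corpus = [
    "the food is good".split(),
    "the food is bad".split(),
    "the service is slow".split(),
]
common = tf_idf("the", corpus[0], corpus)    # log(3/3) = 0, so the weight is 0
rare = tf_idf("good", corpus[0], corpus)     # (1/4) * log(3/1)
```

"the" appears in every document and scores zero, while "good" appears in only one and gets the highest weight in its document.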
👑 Word2Vec Magic
Vector(King) − Vector(Man) + Vector(Woman) ≈ Vector(Queen)
Captures gender, royalty relationships from raw text — no human labels
✓ Dense 100–300 dim ✓ Semantic relationships ✓ Cosine similarity works
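The analogy arithmetic and the cosine similarity check can be shown with toy vectors. The 2-D "embeddings" below are invented so the arithmetic is easy to follow by hand; real Word2Vec vectors are learned from text and have 100–300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-D "embeddings" (axes: royalty, gender), for illustration only
vec = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
best = max(vec, key=lambda word: cosine_similarity(analogy, vec[word]))
```

Subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis alone, so the nearest word by cosine similarity is "queen".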
VI
Volume 6
LSTM — Long Short-Term Memory Deep Dive
00 — Why RNNs? Human Memory vs Standard ANNs
🔁 The Problem RNNs Solve

When you read "The cat sat on the mat — it was tired", you understand "it" refers to "cat" because you remember earlier words. Standard ANNs process each input independently — no memory of previous inputs. RNNs add a loop that passes the previous output back as input, creating short-term memory.

Standard ANN ❌
Each word processed independently. No memory. Can't understand "it" without remembering "cat" from 6 words ago.
RNN ✓
Hidden state hₜ carries context from previous steps. Output at step t depends on all previous inputs. Natural for sequences: text, audio, time series.
01 — GRU — Gated Recurrent Unit
🔀 GRU — Lightweight LSTM Alternative

GRU combines long-term and short-term memory into a single hidden state using only 2 gates (vs LSTM's 3 gates and 2 states). Faster to train, often matches LSTM performance.

Update Gate
zₜ = σ(Wz·[hₜ₋₁, xₜ])
Decides how much of the past hidden state to retain. Output near 0 = ignore past (overwrite with the new candidate). Near 1 = keep the past state. Values in between blend the two. Controls long-term memory.
0 = overwrite · 1 = keep past
Reset Gate
rₜ = σ(Wr·[hₜ₋₁, xₜ])
Decides how much old context feeds into the new candidate. Example: subject switches from "Mr. Watson" to "Mrs. Watson" — reset gate drops toward 0 to erase old context and make room for the new subject.
Final Update
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])   hₜ = zₜ⊙hₜ₋₁ + (1−zₜ)⊙h̃ₜ
(Convention: zₜ multiplies the past state, as in PyTorch's nn.GRU. Some texts swap the roles of zₜ and 1−zₜ.)
Feature | GRU | LSTM
Gates | 2 (Update + Reset) | 3 (Input, Forget, Output)
States | 1 hidden state | 2 (hidden + cell)
Speed | Faster, fewer params | Slower, more expressive
Vanishing ∇ | Largely mitigated | Largely mitigated
Best for | Smaller data, faster training | Complex long-range deps
02 — LSTM Cell State & Three Gates
🚂 Cell State — The Memory Conveyor Belt
[Diagram: the cell state Cₜ runs through the cell as a long-term memory highway, updated additively rather than multiplicatively. Forget gate σ(h,x) × Cₜ₋₁ erases old context → input gate σ(h,x) with candidate tanh(h,x) writes new context → cell update fₜ×Cₜ₋₁ + iₜ×C̃ₜ → output gate σ(h,x) × tanh(Cₜ) outputs hₜ.]
🗑️ Forget Gate
fₜ = σ(Wf·[hₜ₋₁,xₜ]+bf)
0 = erase / 1 = keep. Context switch: "Krish" → "his friend" → forget gate fires near 0, erases old subject from cell state.
📥 Input Gate
iₜ = σ(…)   C̃ₜ = tanh(…)
Cₜ = fₜ×Cₜ₋₁ + iₜ×C̃ₜ
σ decides WHICH values update. tanh scales candidates to ±1. Together: write new subject into cell state.
📤 Output Gate
oₜ = σ(Wo·[hₜ₋₁,xₜ]+bo)
hₜ = oₜ × tanh(Cₜ)
Filters cell state → hidden state hₜ. Passes singular/plural info so next verb conjugates correctly.
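The three gate equations above combine into one time step. This sketch uses scalar state so each line maps to one equation; the weights are hypothetical, hand-picked to show the "keep memory, write nothing" case, and are not trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar state, following the gate equations above."""
    def gate(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x + b)

    f = gate("forget", sigmoid)          # how much of the old cell state to keep
    i = gate("input", sigmoid)           # how much of the candidate to write
    c_tilde = gate("candidate", math.tanh)
    o = gate("output", sigmoid)          # how much of the cell state to expose
    c = f * c_prev + i * c_tilde         # additive cell-state update
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights (w_h, w_x, b) per gate
params = {
    "forget": (0.0, 0.0, 10.0),      # large bias: forget gate ~1, keep memory
    "input": (0.0, 0.0, -10.0),      # large negative bias: input gate ~0, write nothing
    "candidate": (0.5, 0.5, 0.0),
    "output": (0.0, 0.0, 10.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, params=params)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through almost unchanged, which is the mechanism that lets gradients survive many time steps.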
VII
Volume 7
RNN · LSTM · Bi-LSTM Applied to NLP
01 — RNN Architecture Types
One → One
Single input → Single output. Image classification. Standard feedforward.
One → Many
Single input → Sequence output. Text generation, music generation, image captioning.
Many → One
Sequence → Single output. Sentiment analysis, fake news detection, next-day sales prediction.
Many → Many
Sequence → Sequence. Language translation, chatbots, question-answering.
02 — Forward & Backward Propagation in RNNs
➡️ Forward Propagation

At each time step t, the network receives current input xₜ and the previous hidden state hₜ₋₁. Both are multiplied by their weight matrices, added together with a bias, and passed through an activation function (typically Tanh) to give the new hidden state. A Softmax over the hidden state produces the output.

hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = softmax(Wᵧ·hₜ + bᵧ)
The hidden state hₜ carries forward the context from all previous time steps into the next step — this is the RNN's short-term memory mechanism.
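The recurrence for hₜ can be unrolled with scalar weights to show how an early input lingers in later states. The weights and sequences are hypothetical, and the softmax output layer is left out to keep the sketch short.

```python
import math

def rnn_forward(xs, w_h, w_x, b, h0=0.0):
    """Run a scalar RNN over a sequence: h_t = tanh(w_h*h_prev + w_x*x_t + b)."""
    h = h0
    states = []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x + b)
        states.append(h)
    return states

# Hypothetical weights; the same weights are reused at every time step
states_a = rnn_forward([1.0, 0.0, 0.0], w_h=0.5, w_x=1.0, b=0.0)
states_b = rnn_forward([0.0, 0.0, 0.0], w_h=0.5, w_x=1.0, b=0.0)
```

The two sequences differ only in their first input, yet the final states differ too. The trace of that first input fades at every step, which previews the vanishing problem in long sequences.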
⬅️ Backward Propagation Through Time (BPTT)

Gradients flow backward through all time steps using the chain rule — called Backpropagation Through Time (BPTT). Each step multiplies the gradient by the weight matrix and activation derivative.

Problem: Each step multiplies by the activation derivative (Sigmoid ≤ 0.25; Tanh ≤ 1 and near 0 when saturated) and by the recurrent weight matrix. Across many time steps the product shrinks toward zero, so the gradient vanishes. Early time steps receive near-zero gradient updates — the network forgets long-range context.
✗ Vanishing gradient over long sequences · Fix: LSTM / GRU gates
↔️ Bidirectional LSTM

Standard LSTM only reads left→right. "Bull is going ___" — can't predict without "high" that follows. Bi-LSTM solves this by reading both directions and concatenating outputs.

[Diagram: "I am a [?] expert in Python". The forward LSTM reads "I am a" left→right, the backward LSTM reads "expert in Python" right→left. CONCAT [h_fwd ‖ h_bwd] → full context ✓]
model.add(Bidirectional(LSTM(128)))
🏗️ Full NLP Keras Pipeline
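# Imports assume TensorFlow 2.x with the Keras 2-style preprocessing API
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Preprocess (texts = list of raw strings; keep "not" for sentiment)
ps = PorterStemmer()
stop = set(stopwords.words('english')) - {'not'}
sentences = [' '.join(ps.stem(w) for w in nltk.word_tokenize(t.lower()) if w not in stop)
             for t in texts]

# 2. Encode + Pad
encoded = [one_hot(s, 5000) for s in sentences]  # hash each word to an index below 5000
X = pad_sequences(encoded, maxlen=50, padding='pre')

# 3. Model
model = Sequential([
  Embedding(5000, 40, input_length=50),
  Bidirectional(LSTM(100, dropout=0.2)),
  Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')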
04 — Pre-Padding vs Post-Padding
📏 Pre-Padding (Recommended for LSTM)
[0, 0, 0, 42, 8, 11, 7, 3]

Zeros at the front. The actual text sits at the end. For an LSTM reading left→right, the meaningful content arrives last — it sits fresh in the final hidden state used for prediction. Generally better for standard LSTMs.

pad_sequences(X, padding='pre') # default
📏 Post-Padding
[42, 8, 11, 7, 3, 0, 0, 0]

Zeros at the end. The LSTM processes real content first, then multiple zeros. The trailing zeros can dilute the context in the final hidden state, reducing prediction quality. Use only when the architecture requires it.

pad_sequences(X, padding='post')
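The two padding modes reproduced in plain Python for a single sequence. This is a sketch of the idea rather than the Keras function; it assumes the same default as Keras of dropping tokens from the front when a sequence is too long.

```python
def pad_sequence(seq, maxlen, padding="pre", value=0):
    """Pad (or truncate) one integer sequence to exactly maxlen items."""
    if len(seq) >= maxlen:
        # Assumption: keep the most recent tokens when the sequence is too long
        return list(seq[-maxlen:])
    pad = [value] * (maxlen - len(seq))
    return pad + list(seq) if padding == "pre" else list(seq) + pad

tokens = [42, 8, 11, 7, 3]
pre = pad_sequence(tokens, maxlen=8, padding="pre")     # [0, 0, 0, 42, 8, 11, 7, 3]
post = pad_sequence(tokens, maxlen=8, padding="post")   # [42, 8, 11, 7, 3, 0, 0, 0]
```

Index 0 is reserved for padding, which is why word indices in the encoded sequences start at 1.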
05 — Tools, Resources & What to Learn Next
🛠️ Tools & Resources
☁️
Google Colab
Free GPU for sequence model training. Recommended for all LSTM/Transformer experiments.
📖
Colah's LSTM Blog
Chris Olah's blog (colah.github.io) — the definitive visual breakdown of LSTM gates. Essential companion reading. Highly recommended by practitioners worldwide.
📊
Kaggle / UCI Repository
Fake News Classifier, IMDb review corpus, sentiment datasets. Practical project data.
🧰
Libraries: TensorFlow / Keras, NLTK, Gensim, scikit-learn
one_hot, pad_sequences, Embedding, LSTM, Bidirectional in Keras. word_tokenize, PorterStemmer, stopwords in NLTK.
🚀 What to Learn Next (After This Guide)
1. Encoder-Decoder Architectures
Logical evolution from Bi-LSTMs. Seq2Seq for translation — sequence in, sequence out with attention.
2. Transformers & Self-Attention
Completely replaces RNN-based constraints. All tokens in parallel. Foundation of BERT and GPT. (Covered in Vol VIII)
3. Pre-trained LLMs — BERT & Hugging Face
Fine-tune BERT for classification, NER, QA without training from scratch. HuggingFace transformers library.
VIII
Volume 8 · State of the Art
RNN → Transformers — "Attention is All You Need"
01 — Architecture Evolution
Problem: Simple RNN (✗ Vanishing ∇) → Fix A: LSTM/GRU (✓ Gated memory) → Fix B: Bi-LSTM (✓ Both contexts) → Architecture: Seq2Seq (✗ Bottleneck) → Breakthrough: Attention (✓ All states visible) → SOTA: Transformer (✓ Parallel, ✓ Self-attention)
01b — GRU Gates — Context Within the Evolution
🔀 GRU Update Gate — How It Handles Context Switching

The update gate decides how much past hidden state to carry forward. When output is 0, the old state is entirely replaced. When output is 1, it is fully retained.

Example: Subject switches mid-sentence
"Mr. Watson was tired. His friend arrived."
→ Update gate drops toward 0: the "Mr. Watson" context is overwritten. Reset gate drops toward 0: the new candidate is built mostly from the current input, making room for "his friend".
zₜ (update) = σ(Wz·[hₜ₋₁, xₜ])   rₜ (reset) = σ(Wr·[hₜ₋₁, xₜ])
hₜ = zₜ ⊙ hₜ₋₁ + (1−zₜ) ⊙ tanh(W·[rₜ⊙hₜ₋₁, xₜ])
zₜ≈0: ignore past, use new · zₜ≈1: retain past state · Lighter than LSTM (2 gates, 1 state)
🔀 Bi-LSTM — "Bull is Going High"

Standard LSTM reading "Bull is going ___" cannot determine if this is financial context without seeing "high" that comes after. Bi-LSTM reads both directions and concatenates.

[Diagram: "Bull is going high", read → forward and ← backward; concatenated outputs → finance context ✓]
Predicts every word using both past AND future words. Critical for NER, fill-in-blank, QA tasks.
02 — Encoder-Decoder (Seq2Seq) Architecture
🔄 Seq2Seq: How It Works & Why It Breaks on Long Sentences
[Diagram: INPUT (English) "I am a student" → ENCODER (processes input, ignores intermediate outputs) → Context Vector (⚠ bottleneck: fixed size → info loss) → DECODER (generates output one token at a time) → OUTPUT (French) "Je suis un étudiant". Inset plot: BLEU score vs sentence length; accuracy drops from short to long sentences.]
Encoder Role
Ingests input sequence. Ignores individual step outputs. Creates one final context vector summarising the whole input.
⚠ Context Bottleneck
One fixed-size vector cannot hold all info from 100+ words. BLEU score drops sharply with longer sentences — information is simply lost.
Decoder Role
Takes context vector. Predicts output one token at a time until an end-of-sequence (EOS) token is generated. Auto-regressive.
BLEU Score (Bilingual Evaluation Understudy) — standard metric for translation quality. Measures how closely machine output matches human reference translations. Range 0–1 (higher = better). Sharp drop on long sentences = bottleneck problem.
03 — Attention Mechanism — Fixing the Bottleneck
👁️ Attention: Give the Decoder Access to ALL Encoder States

Instead of squeezing everything into one context vector, Attention lets the decoder look back at every encoder hidden state at each decode step — creating a dynamic weighted context that focuses on the most relevant input words.

[Diagram: all encoder hidden states h₁ h₂ h₃ h₄ stay available to the decoder (no bottleneck). Attention weights (softmax → sum to 1), e.g. 0.1, 0.7, 0.1, 0.1 → weighted context → decoder.]
Key insight: at every decode step, attention creates a NEW context by weighting all encoder states differently.
Standard Attention: encoder ↔ decoder (cross-sequence relevance)
Self-Attention: words within the SAME sequence (within encoder or within decoder)
Standard Encoder-Decoder Attention
Q comes from the decoder. K and V come from the encoder. Lets the decoder ask "which input word should I focus on right now?" at each generation step.
Self-Attention (Transformer)
Q, K, and V all come from the same sequence. Lets every word attend to every other word within the same sentence — understanding "it" refers to "animal".
How Attention Fixes the Bottleneck
Without attention: 100-word sentence → 1 fixed vector → decoder has no idea which input words matter for each output token.
With attention: Decoder dynamically computes a new weighted context at every step, attending most to the relevant encoder states.
Result: BLEU scores hold up far better on long sentences. Long-range dependencies preserved.
04 — Self-Attention: Step-by-Step
👁️ Self-Attention Formula & Steps
1
Generate Q,K,V
Embed × W^Q, W^K, W^V → 3 vectors per word
2
Score = Q·Kᵀ
Q of one word × K of all words → relevance scores
3
Scale ÷ √dk
÷√64=8. Prevents Softmax saturation, stabilises gradients
4
Softmax
Scores → probabilities summing to 1
5
× Values
Softmax × V → weighted sum = final output
Self-Attention Formula
Attention(Q,K,V) = softmax( Q·Kᵀ / √dk ) × V
dk = 64 dimensions · √dk = 8 · Prevents gradient saturation before Softmax
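Steps 2 to 5 above can be traced in plain Python on a tiny example. The Q, K and V values are hypothetical and dk is 2 rather than 64 so the numbers stay readable; step 1 (the learned projections) is assumed to have already produced them.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """softmax(Q·Kᵀ / √dk) × V for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 2-3: dot product with every key, then scale by sqrt(dk)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                      # step 4
        # Step 5: weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical Q, K, V for a 3-token sequence with dk = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = self_attention(Q, K, V)
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors. Production code computes all rows at once as matrix products.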
05 — Key Transformer Components
🔀 Multi-Head Attention

Run 8 parallel attention heads with different Q/K/V weight matrices. Each learns a different relationship. Concat + linear → richer representation. "it" → simultaneously links to "animal" AND "tired".

MultiHead(Q,K,V) = Concat(head₁,…,head₈) × W^O
📍 Positional Encoding

Parallel processing loses word order. PE adds sin/cos vectors to embeddings encoding position and relative distance. Without it: "dog bites man" = "man bites dog" to the Transformer.

Input = WordEmbed + PositionVector(sin/cos)
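A sketch of the sin/cos position vectors. The base constant 10000 and the even/odd layout follow the original Transformer paper and are assumptions here, since this guide does not state them.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each sin/cos pair shares one frequency
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=4, d_model=4)
# Input to the first layer = word embedding + pe[position], added element-wise
```

Every position gets a distinct vector, so the same word at two positions reaches the attention layers with different inputs.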
↩️ Residual + Layer Norm

Bypass shortcut around each sublayer. If self-attention isn't useful, data skips it. Prevents vanishing gradients in deep 6-layer stacks. Input + SubLayer(Input) → LayerNorm.

Output = LayerNorm(x + SubLayer(x))
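The residual formula can be checked with a sublayer that contributes nothing. This sketch omits LayerNorm's learnable gain and bias, and the input vector is hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one token's feature vector to mean 0 and variance 1."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Output = LayerNorm(x + SubLayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
useless_sublayer = lambda v: [0.0] * len(v)       # hypothetical sublayer that adds nothing
out = residual_block(x, useless_sublayer)         # the input still passes through
```

Unlike Batch Norm, the statistics come from the features of one token, so the result does not depend on batch size.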
06 — BERT vs GPT — Understanding Modern Architectures (Added)
BERT vs GPT — The Two Dominant Transformer Paradigms
Dimension | BERT (Encoder-only) | GPT (Decoder-only)
Architecture | Bidirectional encoder only (no decoder) | Causal decoder only (left-to-right masked)
Training task | Masked Language Modelling (fill in [MASK] tokens) | Next-token prediction (autoregressive)
Context direction | Sees all tokens (past + future) simultaneously | Only past tokens (masked future)
Best at | Classification, NER, QA, embeddings | Text generation, completion, coding
Fine-tune for | Sentiment, intent detection, information extraction | Chatbots, summarisation, code generation
Examples | BERT · RoBERTa · DistilBERT | GPT-2/3/4 · LLaMA · Mistral
Practical rule: Need to understand text → BERT. Need to generate text → GPT. Need both → Encoder-Decoder (T5, BART). For RAG retrieval → BERT embeddings. For RAG generation → GPT-family LLM.
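The "context direction" row comes down to one mask applied before the Softmax. The sketch below uses all-zero scores (an assumption, so that every visible token gets equal weight) to make the difference easy to read.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    if causal:
        n = scores.shape[0]
        # True above the diagonal marks future positions.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)   # exp(-inf) = 0 after softmax
    return softmax(scores)

scores = np.zeros((4, 4))                            # uniform scores for clarity
bert_style = attention_weights(scores, causal=False) # every token sees all 4 tokens
gpt_style = attention_weights(scores, causal=True)   # token i sees tokens 0..i only
```

In `bert_style` every entry is 0.25. In `gpt_style` the first row is [1, 0, 0, 0] and everything above the diagonal is zero, so no token can look at its own future.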
07 — Why Transformers Dominate — Scaling Laws (Added)
Scaling Laws & Why Transformers Replaced RNNs in Production
Parallelism
RNNs process sequentially (each step depends on the previous one). During training, Transformers process all tokens of a sequence simultaneously, so GPU utilisation is far higher.
Scaling
Kaplan et al. (2020): Model performance scales as a power law with compute, data, and parameters. RNNs don't benefit nearly as much from scale.
Long-range Dependencies
Every token attends to every other token in O(1) steps. RNNs need O(n) steps. Critical for long documents and code.
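To make "scales as a power law" concrete, the sketch below uses the functional form L(N) = (N_c / N)^α for loss as a function of parameter count N. The constants `ALPHA` and `N_C` are hypothetical values chosen for illustration, not fitted results. The property that matters is that each doubling of N multiplies the loss by the same factor, 2^(−α).

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss under a power law in parameter count: (n_c / n_params) ** alpha."""
    return (n_c / n_params) ** alpha

ALPHA = 0.08   # hypothetical exponent, for illustration only
N_C = 1e14     # hypothetical scale constant, for illustration only

# Doubling the parameter count changes the loss by a constant factor.
ratio_small = power_law_loss(2e9, N_C, ALPHA) / power_law_loss(1e9, N_C, ALPHA)
ratio_large = power_law_loss(2e11, N_C, ALPHA) / power_law_loss(1e11, N_C, ALPHA)
```

Both ratios equal 2^(−α), which is why a power law shows up as a straight line on a log-log plot.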
When to still use RNN/LSTM
✓ Edge deployment (Transformer too large)
✓ Real-time streaming (token-by-token inference with a fixed-size state)
✓ Very limited data (<10K sequences)
✓ Fixed-length time series with no long-range deps
✓ Teaching / understanding sequence models
When to use Transformers
✓ Most production NLP tasks in 2024+
✓ Long documents, code, multimodal
✓ When GPU compute is available
✓ Transfer learning from pretrained models
✓ Building LLM-powered applications
Master Quick-Reference — One Page to Rule Them All
Complete Deep Learning & NLP — Instant Reference
Vanishing Gradient
σ' ≤ 0.25 → 0.25ⁿ → 0
Fix: ReLU for FFNNs, LSTM/GRU for RNNs. Residual connections for Transformers.
Weight Init
Xavier (Sigmoid) · He (ReLU)
Never all-zeros. He (normal): std = √(2/nᵢₙ). Xavier (uniform): limit = √(6/(nᵢₙ+nₒᵤₜ)).
Activation Quick Rule
ReLU hidden · Softmax out
Sigmoid only binary output. Never Sigmoid in hidden. Leaky/PReLU if neurons die.
Loss Function
MSE=reg · BCE=binary · CCE=multi
Huber if outliers. Sparse CCE skips one-hot encoding. Always match loss to output type.
Optimizer
Adam by default · AdamW for LLMs
Never vanilla GD. β₁=0.9, β₂=0.999, ε=1e-8. Add LR schedule for best results.
CNN Output Size
⌊(n+2p−f)/s⌋ + 1
Same padding: p=⌊f/2⌋. Always ReLU after conv. Max pooling for location invariance.
NLP Vectorization
BoW→TF-IDF→Word2Vec→BERT
Production: use BERT embeddings. BoW/TF-IDF for prototypes. Word2Vec when BERT is too heavy.
LSTM 5 Steps
Forget→Input→C̃→Update→Output
Cₜ = fₜ×Cₜ₋₁ + iₜ×C̃ₜ. hₜ = oₜ×tanh(Cₜ). Pre-pad sequences for LSTM.
Self-Attention
softmax(QKᵀ/√dk) × V
dk=64, √dk=8. Q/K/V from Wq,Wk,Wv matrices. Softmax ensures scores sum to 1.
Transformer Stack
6 Enc + 6 Dec · 8 heads
Parallel processing. Positional encoding for order. Residual + LayerNorm every sublayer.
BERT vs GPT
Encoder=understand · Decoder=generate
BERT: masked LM, bidirectional. GPT: causal, autoregressive. T5/BART: both.
Production Checklist
Init → BN → Dropout → Schedule
He init → Batch Norm → Dropout(0.2–0.5) → LR schedule → Early stopping → Monitor val loss.
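Several formulas in this quick reference are easy to check numerically. The helper names below are made up for this sketch, and the example inputs in the note after the code are arbitrary.

```python
import math

def conv_output_size(n, f, p=0, s=1):
    # CNN output size: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

def he_std(n_in):
    # He initialisation: standard deviation sqrt(2 / n_in)
    return math.sqrt(2.0 / n_in)

def xavier_uniform_limit(n_in, n_out):
    # Xavier initialisation (uniform variant): limit sqrt(6 / (n_in + n_out))
    return math.sqrt(6.0 / (n_in + n_out))

def max_sigmoid_gradient_factor(n_layers):
    # Upper bound on the product of sigmoid derivatives across n layers: 0.25 ** n
    return 0.25 ** n_layers
```

For example, `conv_output_size(28, 3, p=1, s=1)` returns 28, confirming that p = ⌊f/2⌋ preserves the input size at stride 1, and `max_sigmoid_gradient_factor(10)` is below one millionth, which is the vanishing-gradient effect in a 10-layer sigmoid network.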
IX
Volume 9
Transformers, BERT & GPT
Attention Mechanisms · Encoder vs Decoder · Transfer Learning · Meta-Learning
01 — Full Transformer Architecture (Vaswani et al. 2017)
The diagram below shows the full Encoder–Decoder Transformer architecture from "Attention is All You Need" (Vaswani et al., 2017), drawn in the style of Jay Alammar's illustrated guide. The left stack is the Encoder. The right stack is the Decoder. Each contains Self-Attention → Add & Normalize → Feed-Forward → Add & Normalize. The Decoder adds a third sublayer, Encoder–Decoder Attention, which attends over the encoder's output. A Linear + Softmax layer on top converts the decoder output to a probability distribution over the vocabulary.
[Figure: Full Transformer architecture (Vaswani et al., 2017). Left: Encoder #1 and #2, each Self-Attention → Add & Normalize → Feed Forward → Add & Normalize, fed by the embeddings x₁ "Thinking" and x₂ "Machines" plus positional encoding. Right: Decoder #1 and #2, each Self-Attention → Add & Normalize → Encoder-Decoder Attention → Add & Normalize → Feed Forward → Add & Normalize, receiving K and V from the encoder. Top: Linear → Softmax → Output.]
Encoder (Left Stack)
Each encoder has 2 sublayers: Self-Attention → Add&Norm, then Feed-Forward → Add&Norm. Residual connections (dashed arrows) bypass each sublayer — gradient highway. Output is a rich contextual representation of the input sequence. BERT uses encoder-only.
Decoder (Right Stack)
Each decoder has 3 sublayers: Masked Self-Attention (prevents future-token peeking) → Add&Norm, then Encoder-Decoder Attention (attends over encoder output) → Add&Norm, then Feed-Forward → Add&Norm. GPT uses decoder-only.
Enc-Dec Attention (Pink)
The critical bridge. Queries come from the decoder; Keys and Values come from the final encoder output. This is how the decoder "reads" the encoded input while generating each output token — context flows from encoder to decoder through this layer.
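The bridge described above is ordinary attention with the inputs drawn from two different places. A minimal sketch, assuming a 6-token source sentence, 3 decoder positions, and random stand-in weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = decoder_states @ W_q        # queries come from the decoder
    K = encoder_output @ W_k        # keys come from the final encoder output
    V = encoder_output @ W_v        # values come from the final encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 6, 3, 16   # assumed sizes
encoder_output = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
```

The weight matrix has shape (target length, source length): one row per output position, spread over the input tokens.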
02 — The Evolution of NLP
Legacy Era (2013–2016)
Word2Vec, n-grams, RNNs, and LSTMs dominated NLP. Models processed tokens one at a time, which made them inherently sequential and slow. Static embeddings such as Word2Vec gave “bank” in “riverbank” and “bank robber” the same vector, with no contextual awareness. BiLSTMs attempted bidirectionality by concatenating a forward and a backward pass, but each pass still saw only one side of the context.
Transformer Breakthrough (2017)
“Attention is All You Need” replaced recurrence entirely. All tokens processed simultaneously — massive parallelization. Self-attention computes contextual relationships between every pair of words in one matrix operation. Training time dropped dramatically.
Timeline: LSTM-based NLP (2013; the LSTM itself dates to 1997) → Seq2Seq (2014) → Transformer (2017, ★ "Attention is All You Need") → BERT (2018) → GPT-2 (2019) → GPT-3 (2020)
03 — Self-Attention Mechanism
[Figure: Self-attention flow. Inputs "The bank", "by the", "river" → Q, K, V → Q·Kᵀ / √dₖ → Softmax → × V → Context Vector.] Irrelevant words get a softmax score of ~0.001, so multiplying by their Value contributes almost nothing to the output. Relevant words score close to 1.0 and are almost fully preserved.
04 — BERT (Encoder-Only) & GPT (Decoder-Only)
BERT — Encoder Stack
Stacks encoder blocks only. Bi-directional context. Pre-trained with MLM (mask 15% of tokens, predict them) + NSP (predict if sentence B follows A). Fine-tune by adding a task-specific output layer and continuing training on labelled task data.
Input: Token + Segment + Position Embeddings
BERT Base: 12 layers, 110M params
BERT Large: 24 layers, 340M params
Max: 512 tokens (hard limit)
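A simplified sketch of how MLM training pairs are built. It replaces every selected token with [MASK]. The original BERT recipe is slightly more involved (of the selected tokens, 80% become [MASK], 10% become a random token, 10% stay unchanged), which this sketch deliberately omits. The function name, the example sentence, and whitespace tokenisation are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}                       # position -> original token to predict
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels

sentence = ("the animal did not cross the street because it was too tired "
            "and the road was very wide and long")
tokens = sentence.split()             # 20 whitespace tokens
masked, labels = mask_tokens(tokens)  # 15% of 20 tokens = 3 masked positions
```

The model sees `masked` and is trained to recover the entries in `labels`, using context from both sides of each mask.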
GPT — Decoder Stack
Stacks decoder blocks only. Uni-directional (left-to-right). Pre-trained by predicting the next word. Evolved from fine-tuning (GPT-1) → zero-shot (GPT-2, 1.5B) → few-shot meta-learning (GPT-3, 175B).
GPT-1: 117M — fine-tune per task
GPT-2: 1.5B — zero-shot learning
GPT-3: 175B — few-shot (10-100 examples)
No weight updates at inference time
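Zero-shot and few-shot use differ only in what goes into the prompt, since the model's weights stay fixed. The sketch below assembles such a prompt as plain text. The "Input:/Output:" layout, the function name, and the sentiment examples are assumptions for illustration, not a format any particular model requires.

```python
def build_few_shot_prompt(instruction, examples, query):
    lines = [instruction, ""]
    for text, label in examples:
        # Each solved example is shown to the model inside the context window.
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")           # the model continues from here
    return "\n".join(lines)

examples = [("I loved this film", "positive"), ("Terrible service", "negative")]
few_shot = build_few_shot_prompt("Classify the sentiment.", examples, "The food was great")
zero_shot = build_few_shot_prompt("Classify the sentiment.", [], "The food was great")
```

Passing an empty example list gives the zero-shot case: the instruction and the query alone.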
Dimension | BERT | GPT
Architecture | Encoder-only | Decoder-only
Direction | Bi-directional | Uni-directional
Pre-training | MLM + NSP | Next-word prediction
Strength | Understanding / classifying | Generating / completing
Adaptation | Fine-tune output layer | Prompt / meta-learning
Token limit | 512 (strict) | 2k–128k+ (model-dependent)
Cheat Summary
Transformer
Encoder + Decoder, parallel
2017. Attention only. No recurrence. Q/K/V matrices. BERT=Encoder. GPT=Decoder.
Self-Attention
softmax(QKᵀ/√dk) × V
Irrelevant words → score 0.001 → drowned out. Relevant → score ~1 → preserved.
BERT Pre-training
MLM 15% + NSP binary
Mask 15%: balance between cost and context. NSP: sentence coherence. Both run simultaneously.
GPT Meta-learning
Zero/Few-shot → no weight updates
GPT-2: 0 examples. GPT-3: 10-100 examples in context window. 175B params needed.
Add & Normalize
Residual + LayerNorm
Every sublayer. Residual = gradient highway. LayerNorm = stabilizes activations. Together they mitigate vanishing gradients.
Enc-Dec Attention
Q=decoder, K/V=encoder
Bridge between stacks. Decoder queries the encoder's output for context on each output token.
Positional Encoding
sin/cos waves by position
Tokens enter all at once → no inherent order. PE injects order. The original Transformer uses fixed sin/cos waves; BERT and GPT learn their position embeddings. BERT: max 512. GPT-3: 2048.
Transfer Learning
Pre-train → save → fine-tune
Rarely train from scratch. HuggingFace: bert-base-uncased. Fine-tune = swap in a task-specific output layer, then train on task data (usually all weights update).
Q&A Flashcards — Exam & Interview Prep
Why can Transformers parallelize but LSTMs cannot?
LSTMs: step t needs step t−1 output — inherently sequential. Transformers: attention computed on all tokens simultaneously in one matrix operation → full GPU/TPU utilization.
Remove Position Embeddings from BERT — what breaks?
All tokens enter without order. “Dog bites man” = “Man bites dog”. Grammar and syntax collapse. Transformer becomes a bag-of-words model.
Why mask exactly 15% in MLM?
Too low (5%) → training is computationally expensive for little signal. Too high (50%) → destroys surrounding context needed to predict the mask. 15% balances both.
Fine-tuning vs meta-learning in GPT?
Fine-tuning: gradients flow, weights update, typically needs thousands or more labelled examples. Meta-learning: weights frozen, instructions + examples fed as tokens into the context window at inference only.
Feed a 1000-word essay into BERT — what happens?
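A 1000-word essay usually tokenizes to more than 512 tokens, and BERT's position embeddings are defined only up to position 512. Depending on the tooling, the overflow is silently truncated (the end of the essay is lost) or the model raises an error. Use Longformer, BigBird, or chunk the document into 512-token segments.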
Why does Enc-Dec Attention use Q from decoder and K/V from encoder?
The decoder is generating output tokens. It needs to query the encoder's understanding of the full input. Q = “what am I looking for?”, K/V = “what did the encoder understand?”.
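The chunking workaround from the 512-token flashcard can be sketched as a sliding window over token IDs. The function name and the `stride` overlap parameter are assumptions. A real BERT pipeline also has to reserve room for the [CLS] and [SEP] special tokens, so the usable budget per chunk is slightly below 512.

```python
def chunk_tokens(token_ids, max_len=512, stride=0):
    """Split token_ids into windows of at most max_len, overlapping by stride."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break                      # this window already reaches the end
    return chunks

token_ids = list(range(1300))          # stand-in for a tokenized essay
chunks = chunk_tokens(token_ids, max_len=512, stride=64)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one window. Per-chunk predictions then need to be merged, for example by averaging or taking the maximum.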
Sources & Further Reading
🔗 Original Paper
PAPER
"Attention is All You Need" — Vaswani et al. (2017)
The original Transformer paper introducing the full Encoder–Decoder architecture with multi-head self-attention. Google Brain / Google Research.
arxiv.org/abs/1706.03762 →
🎥 Video Resources
YT
Illustrated Guide to Transformers — Michael Phi
Step-by-step visual walkthrough of the Transformer architecture including the encoder-decoder stack, attention mechanism, and how words flow through each layer.
youtube.com/watch?v=4Bdc55j80l8 →
YT
BERT Neural Network — Computerphile
Clear explanation of BERT's masked language modelling, next sentence prediction, and how fine-tuning works for downstream NLP tasks.
youtube.com/watch?v=7kLi8u2dJz0 →
YT
GPT, GPT-2, GPT-3 Explained — Yannic Kilcher
Deep technical walk-through of all three GPT generations covering the shift from fine-tuning to in-context few-shot learning and the scaling hypothesis.
youtube.com/watch?v=SY5PvZrJhLE →
📄 Illustrated Blog Posts
BLOG
The Illustrated Transformer — Jay Alammar
The gold-standard illustrated walkthrough of the Transformer. The architecture diagram in this volume is based on Jay Alammar's visuals. Covers every component step-by-step with animations.
jalammar.github.io/illustrated-transformer →
BLOG
The Illustrated BERT, ELMo — Jay Alammar
Companion post to the Transformer guide. Explains BERT's pre-training tasks, token/segment/position embeddings, and fine-tuning in detail with illustrated examples.
jalammar.github.io/illustrated-bert →
BLOG
Attention? Attention! — Lilian Weng (OpenAI)
Comprehensive mathematical survey of all attention variants — self-attention, multi-head, cross-attention, and beyond. Essential for understanding the math behind every attention mechanism.
lilianweng.github.io/posts/attention →