Deep Learning & NLP — Complete Enhanced Study Guide (v2)

Volume 1 · Start Here

Neural Network Foundations

01 — What is a Neural Network?

🧠 The Big Picture — Squiggle Fitting Machines

Neural networks map complex inputs to accurate outputs by fitting mathematical curves to data. They are not mysterious black boxes — they use calculus, statistics, and linear algebra to iteratively minimise error.

Input Layer

Ingests features (X₁, X₂, X₃). Raw data converted to numbers first.

Hidden Layers

Feature learning. Weights + biases transform signals. Activation functions add non-linearity.

Output Layer

Final prediction ŷ. Sigmoid for binary, Softmax for multiclass, Linear for regression.

Neuron output: z = Σ(wᵢ · xᵢ) + b → ŷ = activation(z)

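Weights multiply inputs to stretch/flip curves. Biases shift them. With a non-linear activation, even one hidden layer with enough neurons can approximate any continuous function on a bounded input range (Universal Approximation Theorem); in practice, stacking layers tends to reach the same capacity more efficiently.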
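A minimal sketch of a single neuron's forward pass, assuming a sigmoid activation. The function name `neuron_forward` and the example numbers are illustrative and not taken from the guide.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, followed by the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical feature values and parameters, chosen so the weighted sum is 0.
y_hat = neuron_forward(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.0], bias=0.0)
print(y_hat)  # 0.5, because sigmoid(0) = 0.5
```

A full layer repeats this computation once per neuron, each with its own weights and bias.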

🏗️ Architecture Diagram

The learning loop: Forward → compute loss → Backward (chain rule) → update weights → repeat until convergence.

02 — The Vanishing Gradient Problem (Definitive Reference)

⚠️ Why Gradients Vanish — The Math

During backpropagation, gradients are multiplied through every layer using the chain rule. The Sigmoid derivative never exceeds 0.25; the Tanh derivative peaks at 1 but falls toward 0 as the unit saturates. In a deep network these repeated factors below 1 cause exponential decay.

0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 = 0.000015 → early weights never update
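The arithmetic above can be checked with a short sketch. It assumes the best case for sigmoid (every unit sitting exactly at z = 0) and ignores the weight matrices, which also enter the real product; the helper names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); peaks at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def best_case_gradient_factor(num_layers):
    """Product of the largest possible sigmoid derivative across num_layers layers."""
    return sigmoid_derivative(0.0) ** num_layers

print(sigmoid_derivative(0.0))       # 0.25
print(best_case_gradient_factor(8))  # 1.52587890625e-05
```

Away from z = 0 the derivative is smaller still, so real networks usually decay faster than this best case.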

Root Cause

Sigmoid derivative ≤ 0.25 (Tanh ≤ 1, but near 0 when saturated). Chain rule multiplies these factors across every layer. Deep networks → gradient ≈ 0.

Fix for FFNNs

Use ReLU (derivative = 1 for positive inputs, 0 otherwise, so active paths do not shrink the gradient). Batch Normalisation. Residual connections (skip layers).

Fix for RNNs

LSTM / GRU (additive cell state updates, not multiplicative). Gradient clipping for exploding gradients.

03 — Backpropagation & Chain Rule

🔄 The 4-Step Learning Loop

Forward Pass

Compute all layer outputs. Cache intermediate values (needed for gradients). Output = ŷ.

Compute Loss

Compare ŷ vs y. Get one scalar error value. e.g., MSE = (y−ŷ)², Cross-Entropy = −Σy·log(ŷ).

Backward Pass (Chain Rule)

Propagate error backward. Multiply partial derivatives: ∂L/∂W = ∂L/∂O_out × ∂O_out/∂O_hidden × ∂O_hidden/∂W

Update Weights

W_new = W_old − η × ∂L/∂W. Repeat until convergence.

📐 Weight Update Intuition

Step size = slope × η. Naturally large near steep regions, tiny near minimum. Learning rate η too high → oscillates. Too low → painfully slow.
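A small sketch of how the learning rate changes the outcome, assuming a toy loss L(w) = w² that is not part of the guide. The function name and the three learning rates are illustrative choices.

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimise the toy loss L(w) = w**2 (gradient 2*w) starting from w."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w

print(abs(gradient_descent(lr=0.1)))    # shrinks toward 0: steady convergence
print(abs(gradient_descent(lr=0.001)))  # still close to 5: too slow
print(abs(gradient_descent(lr=1.1)))    # grows past 5: each step overshoots
```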

04 — Weight Initialization (Often Missed)

Weight Initialization Strategies — Critical for Training Stability

Poor initialization can cause vanishing or exploding gradients before training even starts. This is one of the most commonly overlooked topics.

All Zeros ❌

All neurons in a layer receive the same gradient, so their weights stay identical (symmetry problem). The layer behaves like a single neuron and the network cannot learn distinct features.

Xavier / Glorot ✓

W ~ U[−√(6/(nᵢₙ+nₒᵤₜ)), +√(6/(nᵢₙ+nₒᵤₜ))]

Best for Sigmoid/Tanh. Keeps variance stable through layers.

He Initialization ✓

W ~ N(0, σ²) with σ = √(2/nᵢₙ)

Best for ReLU activations. Accounts for ReLU zeroing half the inputs.

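import torch
layer = torch.nn.Linear(128, 64) # example layer; sizes are arbitrary
torch.nn.init.xavier_uniform_(layer.weight) # Sigmoid/Tanh
torch.nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu') # ReLU (He normal)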
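To make the two formulas concrete, here is a framework-free sketch that computes the Xavier uniform limit and the He standard deviation for given layer sizes. The function names and layer sizes are hypothetical, and the sampling helper is only an illustration of the idea, not how any library implements it.

```python
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Limit a of U[-a, a] for Xavier/Glorot uniform initialisation."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Standard deviation of the zero-mean normal used by He initialisation."""
    return math.sqrt(2.0 / fan_in)

def sample_xavier(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix as nested lists."""
    a = xavier_uniform_bound(fan_in, fan_out)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = sample_xavier(fan_in=256, fan_out=128, rng=rng)
print(xavier_uniform_bound(256, 128))  # 0.125
print(he_normal_std(512))              # 0.0625
```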

Volume 2

Activation Functions — Consolidated Reference

01 — Complete Activation Function Reference

Sigmoid

σ(x) = 1/(1+e⁻ˣ)

Output: 0–1. Binary output layer. NOT zero-centred. Max derivative 0.25 → vanishing gradient. Avoid in hidden layers.

✗ Vanish ∇ · Binary output

Tanh

(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)

Zero-centred (output −1 to +1) → faster convergence. Still saturates. Better than Sigmoid in hidden layers. Common in RNNs and LSTMs for the hidden and candidate values (the gates themselves use Sigmoid).

✓ Zero-centred · LSTM candidate values

ReLU

max(0, x)

Derivative 0 or 1 — no vanishing. Default hidden layer choice. Risk: Dying ReLU (stuck at 0 for negative inputs). Use He init.

✓ Default hidden · ✗ Dying ReLU

Leaky ReLU

max(0.01x, x)

Fixes dying ReLU. Negative slope = 0.01, gradient never exactly 0. Fixed constant. PReLU = learned slope α.

✓ No dead neurons

ELU

x for x ≥ 0 · α(eˣ−1) for x<0

Zero-centred, smooth at 0. Computationally expensive (exponential). Use when zero-centering matters more than speed.

✓ Zero-centred · ✗ Slow

Swish

x · σ(x) — Google Brain

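Self-gating. Smooth, non-monotonic. Reported to match or outperform ReLU in deeper networks. Costs more than ReLU because of the sigmoid; this guide's rule of thumb is to reserve it for networks deeper than about 40 layers.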

>40 layers only

Softmax

eᶻʲ / Σeᶻᵏ

Output layer for multiclass. Converts raw logits to probability distribution summing to 1. Paired with Categorical Cross-Entropy.

Multiclass output

PReLU

max(αx, x) — α learned

α updated by backprop. α=0 → ReLU. α=0.01 → Leaky ReLU. Best of both: learns the optimal negative slope for your data.

✓ Adaptive α

02 — ArgMax vs SoftMax — The Output Decision

SoftMax — During Training

Converts raw logits into a probability distribution — all values between 0 and 1, summing to 1. Used during training so the loss function (Cross-Entropy) can compare smooth probabilities against true labels.

SoftMax([2.1, 0.5, 1.3]) → [0.61, 0.12, 0.27] (sum = 1.0)

Training output layer · Differentiable · ✓ Smooth gradients

ArgMax — During Inference

Simply picks the index of the highest value. No probabilities — just the winning class. Used at inference time when you only need the final predicted label, not a probability.

ArgMax([2.1, 0.5, 1.3]) → 0 (index of 2.1, the largest)

Inference / prediction · Not differentiable · ✓ Fastest: no exp()
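A plain-Python sketch of both operations on the same logits. Subtracting the maximum before exponentiating is a common numerical-stability trick added here, not something the guide specifies.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.1, 0.5, 1.3]
probs = softmax(logits)
print([round(p, 2) for p in probs])   # [0.61, 0.12, 0.27]
print(argmax(logits), argmax(probs))  # 0 0  (softmax preserves the ranking)
```

Because softmax preserves the ordering of the logits, ArgMax can be applied directly to the raw logits at inference time.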

03 — Activation Function Decision Guide

If your situation is…	Use this	Why
Hidden layers (default)	ReLU	Fast, no vanishing gradient, industry default. Use He init.
Dead neurons / stuck training	Leaky ReLU or PReLU	Keeps gradient non-zero for negatives. PReLU learns optimal α.
Very deep network (>40 layers)	Swish	Outperforms ReLU in deep nets. Accept compute cost.
Binary classification output	Sigmoid	Outputs 0–1 probability. Pair with Binary CE loss.
Multiclass output	Softmax	Probabilities sum to 1. Pair with Categorical CE.
Regression output	Linear (none)	Unbounded real number output. Pair with MSE/MAE.
RNN/LSTM gates	Sigmoid + Tanh	Sigmoid for gating (0–1). Tanh for values (−1 to +1).
Training is very slow (ELU)	Leaky ReLU	ELU's exponential is expensive. Leaky ReLU is faster with similar benefit.

03 — Batch Normalisation (Added)

Batch Normalisation — The Modern Solution to Internal Covariate Shift

BN normalises layer inputs to mean=0, variance=1 per mini-batch. It can be applied either before or after the activation function. Dramatically stabilises and speeds up training, and makes training less sensitive to the learning rate.

μ = (1/m) Σxᵢ
σ² = (1/m) Σ(xᵢ−μ)²
x̂ᵢ = (xᵢ−μ)/√(σ²+ε)
yᵢ = γ·x̂ᵢ + β

γ and β are learnable parameters — the network learns optimal scale and shift. ε = small constant for numerical stability.

✓ Allows much higher learning rates

✓ Reduces sensitivity to initialization

✓ Acts as mild regulariser (reduces need for Dropout)

✗ Doesn't work well with small batch sizes → use Layer Norm for RNNs/Transformers

model.add(BatchNormalization()) # after Dense, before or after activation
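The four formulas can be sketched for one feature across a mini-batch as follows. This covers only the training-time computation; at inference, frameworks use running averages of μ and σ² instead of batch statistics. The function name and sample values are illustrative.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m  # biased (1/m) variance, as in the formula
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one unit for 4 samples
out = batch_norm(batch)
print(out)  # values centred on 0 with variance close to 1
```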

Volume 3

Loss Functions · Optimizers · Regularization

01 — Loss vs Cost — Quick Distinction

Loss Function

Error for a single data point. L = error(ŷ, y). What you minimise conceptually. Calculated on one record passing through the network.

Cost Function

Average loss across an entire batch or dataset. J = (1/n) Σ L. What gradient descent actually minimises — more stable gradient direction.

02 — Regression Loss Functions

MSE / L2 Loss

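L = (y − ŷ)²
J = Σ(y−ŷ)²/n

Quadratic bowl in the prediction error → smooth gradients that shrink near the minimum. A single global minimum is guaranteed only for linear models; a neural network's loss surface in weight space is still non-convex. Heavily penalises outliers (squaring).
✓ Smooth, convex in ŷ · ✗ Outlier sensitive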

MAE / L1 Loss

L = |y − ŷ|

Robust to outliers (linear, not squared). Sharp bend at 0 → derivative undefined there, and the gradient magnitude is constant, so steps do not shrink near the minimum.
✓ Outlier robust · ✗ Harder to optimise

Huber Loss — Best of Both

|err| ≤ δ → ½·err² (MSE-like) · |err| > δ → δ·(|err| − ½δ) (MAE-like)

Smooth convergence (MSE center) + outlier robust (MAE tails). Hyperparameter δ controls transition.
✓ Best of both
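A sketch comparing the three regression losses on data with one outlier. Huber is written in its common form (½·err² inside δ, δ·(|err| − ½δ) outside); the data and δ = 1.0 are made up for illustration.

```python
def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def huber(y, y_hat, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for a, b in zip(y, y_hat):
        err = abs(a - b)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]  # last prediction is a hypothetical outlier, off by 10
print(mse(y_true, y_pred))    # 25.0
print(mae(y_true, y_pred))    # 2.5
print(huber(y_true, y_pred))  # 2.375
```

The single outlier dominates MSE, while MAE and Huber grow only linearly with it.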

03 — Classification Loss Functions

Function	Classes	Label Format	Formula	When to Use
Binary CE	Exactly 2	0 or 1	−y·log(ŷ)−(1−y)·log(1−ŷ)	Cat vs Dog, spam detection
Categorical CE	> 2	One-hot array [0,1,0]	−Σ yᵢ·log(ŷᵢ)	Image classification, NLU
Sparse Cat. CE	> 2	Integer labels (0, 1, 2…)	Same as Cat. CE, auto one-hot	Many classes, saves memory
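The three formulas in the table can be sketched directly. The clipping constant `eps` is an assumption added to avoid log(0); the probabilities are made-up values.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """y is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """one_hot is the label vector; probs is the softmax output."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

def sparse_categorical_cross_entropy(label, probs, eps=1e-12):
    """Same loss, but the label is an integer class index."""
    return -math.log(max(probs[label], eps))

probs = [0.1, 0.7, 0.2]
print(categorical_cross_entropy([0, 1, 0], probs))
print(sparse_categorical_cross_entropy(1, probs))  # same value as the line above
print(binary_cross_entropy(1, 0.5))                # log(2), about 0.693
```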

07 — Virtual Environments — Project Best Practice

🐍 Why Virtual Environments Matter in Deep Learning

ML libraries (TensorFlow, PyTorch, CUDA) update frequently and break older code. TensorFlow 1.x vs 2.x are fundamentally incompatible. A virtual environment isolates each project's exact dependency versions.

Create with Conda

conda create -n myproject python=3.10
conda activate myproject
pip install tensorflow keras

Lock Versions with requirements.txt

pip freeze > requirements.txt
# On another machine:
pip install -r requirements.txt

Rule: Create a new Conda environment for every new project. Never install ML libraries into your base Python environment. Use requirements.txt to reproduce the environment on any machine or server.

08 — Loss Function — SSR (Sum of Squared Residuals)

📐 Sum of Squared Residuals (SSR)

The foundational loss for regression. Core of MSE. Squaring residuals ensures all errors are positive and penalises large errors much more than small ones.

SSR = Σ (y − ŷ)² → MSE = SSR / n (average version)

Why square? Ensures positive values. Amplifies large errors (3² = 9 vs 3). Creates a smooth, differentiable parabola in the error; for linear regression this gives gradient descent a single global minimum.

When NOT to use SSR/MSE: Dataset with outliers — squaring inflates outlier errors massively, pulling the model away from the majority of normal data. Use MAE or Huber instead.

Optimizer	Mechanism	Strength	Weakness	Use When
Gradient Descent	Entire dataset per update	Stable direction	Catastrophically slow on large data	Never for production
SGD	1 sample per update	Fast iterations	Very noisy, zigzag path	With momentum for vision
Mini-batch SGD	k samples per update	Best balance	Still some noise	Industry standard baseline
SGD + Momentum	EWA of past gradients. V = β·V_prev + (1−β)·dW	Smoother path	Fixed LR still needed	CV tasks, ResNet training
Adagrad	Divides LR by sum of squared grads	Adaptive per-param LR	LR → 0 over time (fatal)	Sparse data only
RMSprop	EWA of squared grads in denominator	Fixes Adagrad decay	No bias correction	RNNs (historical)
Adam ⭐	Momentum + RMSprop + Bias Correction	Fast, stable, adaptive	May overfit on small data (try AdamW)	Default for everything
AdamW	Adam + decoupled weight decay	Better generalisation	Extra hyperparameter	Transformers, fine-tuning LLMs
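To show how the "Momentum + RMSprop + Bias Correction" row fits together, here is a sketch of Adam on a single scalar parameter. The toy loss, learning rate and step count are illustrative assumptions; real optimizers apply the same update element-wise to every parameter.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam on a single scalar parameter w."""
    m = 0.0  # EWA of gradients (momentum term)
    v = 0.0  # EWA of squared gradients (RMSprop term)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss L(w) = (w - 3)**2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # ends close to 3
```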

05 — Learning Rate Scheduling (Added)

Learning Rate Scheduling — Critical for Production Training

A fixed LR is rarely optimal. Scheduling reduces LR over time: start large (fast progress), then shrink (precise convergence). Used in most large-scale training runs.

Step Decay

LR drops by factor every N epochs. Simple but abrupt transitions.

Cosine Annealing

LR follows cosine curve. Smooth, widely used for vision and NLP.

Warmup + Decay

Start with tiny LR, ramp up, then decay. Standard for Transformers and LLMs.

ReduceLROnPlateau

Automatically reduces LR when validation loss stops improving. Easy win.

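from torch.optim.lr_scheduler import OneCycleLR # CosineAnnealingLR lives in the same module
# Warm up, then anneal in one cycle; `optimizer` and `train_loader` are defined elsewhere
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=10)
# Call scheduler.step() after every optimizer.step() (once per batch)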
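The schedules described in this section reduce to simple functions of the step count. The sketch below is framework-free; the function names, drop factor and step counts are illustrative assumptions rather than any library's defaults.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the LR by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def cosine_annealing(base_lr, step, total_steps, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def warmup_then_cosine(base_lr, step, total_steps, warmup_steps):
    """Linear ramp from 0 to base_lr, then cosine decay for the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return cosine_annealing(base_lr, step - warmup_steps, total_steps - warmup_steps)

print(step_decay(0.1, epoch=25))                        # 0.025
print(cosine_annealing(0.1, step=50, total_steps=100))  # halfway: about half of base_lr
print(warmup_then_cosine(0.1, step=5, total_steps=100, warmup_steps=10))  # 0.05
```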

06 — Regularization — Dropout, L1, L2

🎲 Dropout

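Randomly deactivates neurons during training (drop rate p = 0.5 is typical for dense layers). Forces redundant representations, like ensemble learning. All neurons are active at inference. Classic dropout scales activations by the keep probability (1 − p) at inference; Keras and PyTorch use inverted dropout instead, scaling the kept activations by 1/(1 − p) during training so inference needs no change.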

model.add(Dropout(0.5)) # after LSTM/Dense
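A minimal sketch of inverted dropout, the variant Keras and PyTorch implement: survivors are scaled up during training so that nothing needs rescaling at inference. The function signature and the seeded generator are illustrative.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop during training
    and scale the survivors by 1 / (1 - p_drop); do nothing at inference."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)  # fixed seed, illustrative only
acts = [1.0] * 10
print(dropout(acts, p_drop=0.5, training=True, rng=rng))   # each entry is 0.0 or 2.0
print(dropout(acts, p_drop=0.5, training=False, rng=rng))  # unchanged at inference
```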

📏 L1 vs L2 Regularization

L2 (Ridge / Weight Decay)

J_total = J + λ Σw²

Shrinks all weights toward 0. Most common. Smooth, differentiable. Equivalent to weight decay under plain SGD; with Adam the two differ, which is why AdamW applies the decay separately.

L1 (Lasso)

J_total = J + λ Σ|w|

Pushes some weights exactly to 0 — produces sparse models. Good for feature selection.

🎯 Zero-Centred Activations — Why It Matters

If activation output is not centred around zero (like Sigmoid: outputs 0–1 only), gradients during backprop are all positive or all negative for a neuron's weights. This forces a zigzag path to convergence — slower training.

Not Zero-Centred

Sigmoid: output 0–1 only. Gradients always same sign. Zigzag optimisation path.

Zero-Centred ✓

Tanh: output −1 to +1. Gradients can be positive or negative. Faster convergence.

📋 Overfitting Diagnosis

Train acc ↑, Val acc ↓ → overfitting. Add Dropout, reduce model size, add more data, early stopping.

Train acc and Val acc both low → underfitting. Increase model capacity, train longer, reduce regularisation.

Loss oscillates → LR too high. Reduce by 10×. Switch to Adam if on SGD.

Loss barely moves → LR too low, dying neurons, or wrong initialisation.

Volume 4

Convolutional Neural Networks

01 — CNN Architecture & Pipeline

Input: Image (H×W×C) → Step 1: Conv Layer → Step 2: ReLU Activate → Step 3: Max Pooling → Step 4: Conv + ReLU → Step 5: Flatten 1D → Step 6: Fully Connected → Output: Softmax Classes

02 — Core Operations

🔲 Convolution

Filter slides over image performing element-wise multiply + sum → one output pixel. Stacked filters = feature maps. Filters are learned by backprop.

Output size = ⌊(n + 2p − f) / s⌋ + 1

n=input size · p=padding · f=filter size · s=stride

n=28, f=3, p=1, s=1
→ 28 (same size)

n=28, f=3, p=0, s=2
→ 13 (halved)
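The output-size formula as a one-line helper, checked against the two worked examples. The function name is illustrative; the third call assumes a 2×2 pooling window with stride 2, which follows the same formula.

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size for input n, filter f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(28, 3, p=1, s=1))  # 28
print(conv_output_size(28, 3, p=0, s=2))  # 13
print(conv_output_size(13, 2, p=0, s=2))  # 6  (a 2x2 pooling window follows the same formula)
```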

📏 Padding (Same vs Valid)

Same (p = ⌊f/2⌋)

Output = Input size. Adds zeros around border. Preserves spatial dimensions. Use in deep nets.

Valid (p = 0)

Output shrinks by (f−1) per layer. Edge pixels under-represented. Use only for small networks.

keras: padding='same' or padding='valid'

⬇️ Max Pooling

Sliding window picks the maximum value. Reduces spatial dimensions while retaining the strongest activations. Gives a degree of translation invariance: a feature is still detected if it shifts slightly within the window.

Average / Mean Pooling (same operation — two names)

Takes the mean of the region instead of the maximum. Smoother, less sharp features. Less common than Max Pooling for object detection. Used in Global Average Pooling (GAP) before final classifier in modern CNNs like ResNet.

Max Pool → object detection · Avg Pool → smooth features, GAP layers
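Both pooling types differ only in the reduction applied to each window, which the sketch below makes explicit. It works on a plain 2D list; the helper names and the 4×4 feature map are made up for illustration.

```python
def pool2d(image, size=2, stride=2, reduce=max):
    """Slide a size x size window over a 2D list and reduce each window."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [image[r * stride + i][c * stride + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]
print(pool2d(feature_map))               # [[7, 2], [2, 8]]
print(pool2d(feature_map, reduce=mean))  # [[4.0, 1.0], [1.0, 5.0]]
```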

🔄 Data Augmentation

Artificially expands training data by transforming existing images. Same label, different appearance — teaches CNN to be robust to variations.

↔️ Flip — horizontal/vertical mirror

🔄 Rotate — random angle ±15°

🔍 Zoom — random crop + resize

🌫️ Noise — Gaussian pixel noise

☀️ Brightness — colour jitter

🎭 Cutout/Mixup — advanced blend

from torchvision import transforms
transform = transforms.Compose([
  transforms.RandomHorizontalFlip(),
  transforms.RandomRotation(15),
  transforms.ColorJitter(brightness=0.2)
])

CNN vs Vision Transformer (ViT) — When to Use Which

Dimension	CNN	ViT
Data need	Works with small datasets (10K+)	Needs large data (100K+) or pretraining
Inductive bias	Built-in: spatial locality, translation invariance	None — learns all from data
Long-range deps	Weak (needs many layers)	Excellent (global attention)
Compute	Linear in pixel count for a fixed kernel (efficient)	Quadratic in number of patches for self-attention (heavier)
Use when	Limited data, edge deployment, real-time	Large dataset, SOTA accuracy, multimodal

Volume 5

Natural Language Processing — Fundamentals

01 — NLP Pipeline Overview

Input: Raw Text → Step 1: Tokenize → Step 2: Stop Words (⚠ keep "not") → Step 3: Stem or Lemmatize → Step 4: Vectorize (BoW/TF-IDF) → Step 5: Embed (Word2Vec) → Model: LSTM / TF

02 — Text Preprocessing — Stemming, Lemmatization & Stop Words

🌿 Stemming vs Lemmatization

Stemming — PorterStemmer

Aggressively chops word endings to find a root stem. Fast but often produces non-words. Uses algorithm, not a dictionary.

"history" → histori ❌ (not a real word) · "running" → run ✓

from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("history") # → histori

✓ Very fast · ✗ Fake words possible · Best for: spam, toxic classifier

Lemmatization — WordNetLemmatizer

Uses a dictionary (WordNet) to find the true base form. Slower, but returns a real dictionary word; the part of speech matters, and a word that is not found comes back unchanged.

"churches" → church ✓ · "better" → good ✓ (with pos="a")

from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()
wl.lemmatize("better", pos="a") # → good (pos defaults to noun)

✓ Real dictionary words · ✗ Slower · Best for: chatbots, translation

⚠️ Stop Words — The "not" Problem

Removing common low-value words saves computation. But never blindly remove "not" — it completely reverses sentiment meaning.

❌ Removing "not" destroys sentiment

Original: "Food is not good"
After stop-word removal: "Food good" ← completely flipped!

Rule: Always use a custom stop word list for your task. Remove "not" from the default NLTK stop words list when doing sentiment analysis.

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
stop.remove('not') # critical for sentiment!

03 — N-grams, CountVectorizer & max_features

📊 N-grams — Capturing Word Sequences

Instead of single words (unigrams), N-grams capture sequences of N words. Bigrams and trigrams preserve local context that BoW loses entirely.

Unigram (1,1) — default BoW

"Indian politician" → ["Indian", "politician"] — loses relationship between words

Bigram (2,2) or (1,2)

"Indian politician" → ["Indian politician"] — single feature preserving meaning

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2)) # unigrams + bigrams
cv = CountVectorizer(ngram_range=(2,3)) # bigrams + trigrams only
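What `ngram_range` produces can be sketched without scikit-learn. This is only an illustration of the idea; CountVectorizer additionally tokenises, lowercases and counts the features. The helper names and the sample sentence are made up.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range(tokens, low, high):
    """Mimics the idea of ngram_range=(low, high): every n from low to high."""
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = "the indian politician spoke".split()
print(ngrams(tokens, 2))          # ['the indian', 'indian politician', 'politician spoke']
print(ngram_range(tokens, 1, 2))  # 4 unigrams followed by 3 bigrams
```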

⚙️ max_features — Fighting Sparsity

max_features restricts the model to only the top N most frequent words, discarding rare words. Manual way to control vector dimensions and reduce sparse matrix size.

cv = CountVectorizer(max_features=1000)
# Only top 1000 words become features
# Reduces 50,000-dim → 1,000-dim matrix

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

Practical starting point: max_features=1000–5000 for most classification tasks. Increase if accuracy is low.

04 — Vectorization Methods Compared

Method	Dimension	Semantic?	OOV?	Sparsity	Best for
Bag of Words	Vocab size (50k+)	None	Fails	~99% zeros	Simple baseline
TF-IDF	Vocab size	None (importance weights only)	Fails	Still sparse	Search, ranking
Word2Vec CBOW	100–300 fixed	Rich	Fails (no vector for unseen words)	Dense (no zeros)	Frequent words
Word2Vec Skip-gram	100–300 fixed	Rich	Fails (no vector for unseen words)	Dense	Rare words, small data
BERT Embeddings	768 / 1024	Contextual	Handled (subword tokens)	Dense	Production NLP tasks

03 — TF-IDF & Word2Vec

📊 TF-IDF — Smarter Weighting

TF-IDF = TF × IDF = count/total × log(N/df)

Rare words → high IDF → high weight. Common words → low IDF → near zero. Surfaces the words that actually matter.

✓ Weights rare words higher ✓ Cancels common words ✗ Still sparse matrix ✗ No semantic meaning
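The textbook formula can be sketched by hand on a tiny made-up corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row, so its numbers will differ from this plain version.

```python
import math

def tf_idf(term, doc, corpus):
    """TF = count / total terms in doc; IDF = log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    "the food is good".split(),
    "the food is bad".split(),
    "the service is slow".split(),
]
print(tf_idf("the", corpus[0], corpus))   # 0.0, appears in every document
print(tf_idf("good", corpus[0], corpus))  # about 0.275, rare so it scores high
```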

👑 Word2Vec Magic

Vector(King) − Vector(Man) + Vector(Woman) ≈ Vector(Queen)

Captures gender, royalty relationships from raw text — no human labels

✓ Dense 100–300 dim ✓ Semantic relationships ✓ Cosine similarity works
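The analogy arithmetic can be illustrated with hand-made 2-D vectors. These are not real Word2Vec embeddings: the two axes (royalty, gender) are an assumption chosen so the example is easy to follow, whereas trained vectors have 100–300 learned dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made 2-D "embeddings": axis 0 = royalty, axis 1 = gender. Not real Word2Vec output.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max(vectors, key=lambda word: cosine_similarity(target, vectors[word]))
print(target, best)  # [1.0, -1.0] queen
```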

Volume 6

LSTM — Long Short-Term Memory Deep Dive

00 — Why RNNs? Human Memory vs Standard ANNs

🔁 The Problem RNNs Solve

When you read "The cat sat on the mat — it was tired", you understand "it" refers to "cat" because you remember earlier words. Standard ANNs process each input independently — no memory of previous inputs. RNNs add a loop that passes the previous output back as input, creating short-term memory.

Standard ANN ❌

Each word processed independently. No memory. Can't understand "it" without remembering "cat" from 6 words ago.

RNN ✓

Hidden state hₜ carries context from previous steps. Output at step t depends on all previous inputs. Natural for sequences: text, audio, time series.

01 — GRU — Gated Recurrent Unit

🔀 GRU — Lightweight LSTM Alternative

GRU combines long-term and short-term memory into a single hidden state using only 2 gates (vs LSTM's 3 gates and 2 states). Faster to train, often matches LSTM performance.

Update Gate

zₜ = σ(Wz·[hₜ₋₁, xₜ])

Decides how much of the past hidden state to retain. Output near 0 = ignore past (overwrite with new candidate). Near 1 = keep past state and blend with new info. Controls long-term memory.

0 = overwrite · 1 = keep past

Reset Gate

rₜ = σ(Wr·[hₜ₋₁, xₜ])

Decides what irrelevant old context to forget. Example: subject switches from "Mr. Watson" to "Mrs. Watson" — reset gate fires to erase old context and make room for new subject.

Final Update

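h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
hₜ = zₜ⊙hₜ₋₁ + (1−zₜ)⊙h̃ₜ (zₜ weights the past state, matching the gate description above; some papers swap the roles of zₜ and 1−zₜ)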

Feature	GRU	LSTM
Gates	2 (Update + Reset)	3 (Input, Forget, Output)
States	1 hidden state	2 (hidden + cell)
Speed	Faster, fewer params	Slower, more expressive
Vanishing ∇	Mitigated	Mitigated
Best for	Smaller data, faster training	Complex long-range deps

02 — LSTM Cell State & Three Gates

🚂 Cell State — The Memory Conveyor Belt

🗑️ Forget Gate

fₜ = σ(Wf·[hₜ₋₁,xₜ]+bf)

0 = erase / 1 = keep. Context switch: "Krish" → "his friend" → forget gate fires near 0, erases old subject from cell state.

📥 Input Gate

iₜ = σ(…)
C̃ₜ = tanh(…)
Cₜ = fₜ×Cₜ₋₁ + iₜ×C̃ₜ

σ decides WHICH values update. tanh scales candidates to ±1. Together: write new subject into cell state.

📤 Output Gate

oₜ = σ(Wo·[hₜ₋₁,xₜ]+bo)
hₜ = oₜ × tanh(Cₜ)

Filters cell state → hidden state hₜ. Passes singular/plural info so next verb conjugates correctly.
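The three gates can be traced in a scalar sketch of one LSTM time step. Real layers use weight matrices over the concatenated [hₜ₋₁, xₜ]; here each gate gets a hypothetical (w_h, w_x, b) triple, and the extreme bias values are chosen only to make the forget behaviour visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state and cell state.
    params maps each gate name to (w_h, w_x, b); real layers use matrices."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    c_tilde = math.tanh(pre("cand"))  # candidate values in (-1, 1)
    c = f * c_prev + i * c_tilde      # additive cell-state update
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights: a strongly negative forget bias wipes the old cell state.
params = {"forget": (0.0, 0.0, -20.0), "input": (0.0, 0.0, 20.0),
          "cand": (0.0, 1.0, 0.0), "output": (0.0, 0.0, 20.0)}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=3.0, params=params)
print(h, c)  # c is close to tanh(0.5): the old value 3.0 has been forgotten
```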

Volume 7

RNN · LSTM · Bi-LSTM Applied to NLP

01 — RNN Architecture Types

One → One

Single input → Single output. Image classification. Standard feedforward.

One → Many

Single input → Sequence output. Text generation, music generation, image captioning.

Many → One

Sequence → Single output. Sentiment analysis, fake news detection, next-day sales prediction.

Many → Many

Sequence → Sequence. Language translation, chatbots, question-answering.

02 — Forward & Backward Propagation in RNNs

➡️ Forward Propagation

At each time step t, the network receives the current input xₜ and the previous hidden state hₜ₋₁. Both are multiplied by their weight matrices, added together, and passed through an activation function (typically Tanh for the hidden state; Softmax is applied only to the output).

hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = softmax(Wᵧ·hₜ + bᵧ)

The hidden state hₜ carries forward the context from all previous time steps into the next step — this is the RNN's short-term memory mechanism.
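A scalar sketch of the hidden-state recurrence, unrolled over a short sequence. The weights are hypothetical, and the output projection yₜ is left out to keep the focus on how hₜ carries context forward.

```python
import math

def rnn_forward(xs, w_h, w_x, b, h0=0.0):
    """Run h_t = tanh(w_h * h_{t-1} + w_x * x_t + b) over a sequence of scalars."""
    h = h0
    states = []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x + b)
        states.append(h)
    return states

# Hypothetical weights; the same w_h, w_x and b are reused at every time step.
states = rnn_forward([1.0, 0.0, 0.0], w_h=0.5, w_x=1.0, b=0.0)
print(states)  # the first input still influences the later states through h
```

The influence of the first input fades at each step, which is the short-term memory limitation discussed next.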

⬅️ Backward Propagation Through Time (BPTT)

Gradients flow backward through all time steps using the chain rule — called Backpropagation Through Time (BPTT). Each step multiplies the gradient by the weight matrix and activation derivative.

Problem: With Sigmoid (derivative ≤ 0.25) or saturated Tanh (derivative near 0), multiplying across many time steps causes the gradient to vanish. Early time steps receive near-zero gradient updates, so the network forgets long-range context.

✗ Vanishing gradient over long sequences · Fix: LSTM / GRU gates

↔️ Bidirectional LSTM

Standard LSTM only reads left→right. "Bull is going ___" — can't predict without "high" that follows. Bi-LSTM solves this by reading both directions and concatenating outputs.

model.add(Bidirectional(LSTM(128)))

🏗️ Full NLP Keras Pipeline

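# Assumes legacy tf.keras (TensorFlow 2.x before Keras 3); `text` and `sentences` are your own data
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

# 1. Preprocess
ps = PorterStemmer()
stop = set(stopwords.words('english')) - {'not'}  # keep "not" for sentiment
tokens = nltk.word_tokenize(text.lower())
clean = [ps.stem(w) for w in tokens if w not in stop]

# 2. Encode + Pad
encoded = [one_hot(s, 5000) for s in sentences]  # second argument n = vocabulary size
X = pad_sequences(encoded, maxlen=50, padding='pre')

# 3. Model
model = Sequential([
    Embedding(5000, 40, input_length=50),
    Bidirectional(LSTM(100, dropout=0.2)),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')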

04 — Pre-Padding vs Post-Padding

📏 Pre-Padding (Recommended for LSTM)

[0, 0, 0, 42, 8, 11, 7, 3]

Zeros at the front. The actual text sits at the end. For an LSTM reading left→right, the meaningful content arrives last — it sits fresh in the final hidden state used for prediction. Generally better for standard LSTMs.

pad_sequences(X, padding='pre') # default

📏 Post-Padding

[42, 8, 11, 7, 3, 0, 0, 0]

Zeros at the end. The LSTM processes real content first, then multiple zeros. The trailing zeros can dilute the context in the final hidden state, reducing prediction quality. Use only when the architecture requires it.

pad_sequences(X, padding='post')
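The two padding modes can be sketched for a single sequence in plain Python. This is an illustration of the behaviour, not the Keras implementation; like the Keras default, it truncates over-long sequences from the front.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad (or truncate from the front) a list of token ids to exactly maxlen."""
    seq = sequence[-maxlen:]  # keep the last maxlen tokens if the input is too long
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

ids = [42, 8, 11, 7, 3]
print(pad(ids, 8, "pre"))   # [0, 0, 0, 42, 8, 11, 7, 3]
print(pad(ids, 8, "post"))  # [42, 8, 11, 7, 3, 0, 0, 0]
```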

05 — Tools, Resources & What to Learn Next

🛠️ Tools & Resources

☁️

Google Colab
Free GPU for sequence model training. Recommended for all LSTM/Transformer experiments.

📖

Colah's LSTM Blog
Chris Olah's blog (colah.github.io) — the definitive visual breakdown of LSTM gates. Essential companion reading. Highly recommended by practitioners worldwide.

📊

Kaggle / UCI Repository
Fake News Classifier, IMDb review corpus, sentiment datasets. Practical project data.

🧰

Libraries: TensorFlow / Keras, NLTK, Gensim, scikit-learn
one_hot, pad_sequences, Embedding, LSTM, Bidirectional in Keras. word_tokenize, PorterStemmer, stopwords in NLTK.

🚀 What to Learn Next (After This Guide)

1. Encoder-Decoder Architectures

Logical evolution from Bi-LSTMs. Seq2Seq for translation — sequence in, sequence out with attention.

2. Transformers & Self-Attention

Completely replaces RNN-based constraints. All tokens in parallel. Foundation of BERT and GPT. (Covered in Vol VIII)

3. Pre-trained LLMs — BERT & Hugging Face

Fine-tune BERT for classification, NER, QA without training from scratch. HuggingFace transformers library.

Volume 8 · State of the Art

RNN → Transformers — "Attention is All You Need"

01 — Architecture Evolution

Problem: Simple RNN (✗ Vanish ∇) → Fix A: LSTM/GRU (✓ Gated memory) → Fix B: Bi-LSTM (✓ Both contexts) → Architecture: Seq2Seq (✗ Bottleneck) → Breakthrough: Attention (✓ All states visible) → SOTA: Transformer (✓ Parallel, ✓ Self-attention)

01b — GRU Gates — Context Within the Evolution

🔀 GRU Update Gate — How It Handles Context Switching

The update gate decides how much past hidden state to carry forward. When output is 0, the old state is entirely replaced. When output is 1, it is fully retained.

Example: Subject switches mid-sentence

"Mr. Watson was tired. His friend arrived."
→ Update gate output drops toward 0: the old "Mr. Watson" state is overwritten. Reset gate output drops toward 0: the new candidate is built mainly from the current input, making room for "his friend".

zₜ (update) = σ(Wz·[hₜ₋₁, xₜ])
rₜ (reset) = σ(Wr·[hₜ₋₁, xₜ])
hₜ = zₜ ⊙ hₜ₋₁ + (1−zₜ) ⊙ tanh(W·[rₜ⊙hₜ₋₁, xₜ])

zₜ≈0: ignore past, use new · zₜ≈1: retain past state · Lighter than LSTM (2 gates, 1 state)

🔀 Bi-LSTM — "Bull is Going High"

Standard LSTM reading "Bull is going ___" cannot determine if this is financial context without seeing "high" that comes after. Bi-LSTM reads both directions and concatenates.

Predicts every word using both past AND future words. Critical for NER, fill-in-blank, QA tasks.

02 — Encoder-Decoder (Seq2Seq) Architecture

🔄 Seq2Seq: How It Works & Why It Breaks on Long Sentences

Encoder Role

Ingests input sequence. Ignores individual step outputs. Creates one final context vector summarising the whole input.

⚠ Context Bottleneck

One fixed-size vector cannot hold all info from 100+ words. BLEU score drops sharply with longer sentences — information is simply lost.

Decoder Role

Takes context vector. Predicts output one token at a time until an end-of-sequence (EOS) token is generated. Auto-regressive.

BLEU Score (Bilingual Evaluation Understudy) — standard metric for translation quality. Measures how closely machine output matches human reference translations. Range 0–1 (higher = better). Sharp drop on long sentences = bottleneck problem.

03 — Attention Mechanism — Fixing the Bottleneck

👁️ Attention: Give the Decoder Access to ALL Encoder States

Instead of squeezing everything into one context vector, Attention lets the decoder look back at every encoder hidden state at each decode step — creating a dynamic weighted context that focuses on the most relevant input words.

Standard Encoder-Decoder Attention

Q comes from the decoder. K and V come from the encoder. Lets the decoder ask "which input word should I focus on right now?" at each generation step.

Self-Attention (Transformer)

Q, K, and V all come from the same sequence. Lets every word attend to every other word within the same sentence — understanding "it" refers to "animal".

How Attention Fixes the Bottleneck

❌ Without attention: 100-word sentence → 1 fixed vector → decoder has no idea which input words matter for each output token.

✅ With attention: Decoder dynamically computes a new weighted context at every step, attending most to the relevant encoder states.

✅ Result: BLEU scores stay high even on very long sentences. Long-range dependencies preserved.

04 — Self-Attention: Step-by-Step

👁️ Self-Attention Formula & Steps

Generate Q,K,V

Embed × W^Q, W^K, W^V → 3 vectors per word

Score = Q·Kᵀ

Q of one word × K of all words → relevance scores

Scale ÷ √dk

÷√64=8. Prevents Softmax saturation, stabilises gradients

Softmax

Scores → probabilities summing to 1

× Values

Softmax × V → weighted sum = final output

Self-Attention Formula

Attention(Q,K,V) = softmax( Q·Kᵀ / √dk ) × V

dk = 64 dimensions · √dk = 8 · Prevents gradient saturation before Softmax
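The five steps can be sketched in plain Python for a toy sequence. The Q, K and V rows are made-up numbers with d_k = 2 rather than 64, and the learned projections W^Q, W^K, W^V are assumed to have been applied already.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        output.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return output, weights

# Toy example: 3 tokens, d_k = 2 (the guide's d_k = 64 works the same way).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, w = attention(Q, K, V)
print([round(x, 3) for x in w[0]])  # attention weights of token 0; they sum to 1
```

Each output row is a weighted average of the value rows, with the largest weights on the keys most similar to that token's query.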

05 — Key Transformer Components

🔀 Multi-Head Attention

Run 8 parallel attention heads with different Q/K/V weight matrices. Each learns a different relationship. Concat + linear → richer representation. "it" → simultaneously links to "animal" AND "tired".

MultiHead(Q,K,V) = Concat(head₁,…,head₈) × W^O

📍 Positional Encoding

Parallel processing loses word order. PE adds sin/cos vectors to embeddings encoding position and relative distance. Without it: "dog bites man" = "man bites dog" to the Transformer.

Input = WordEmbed + PositionVector(sin/cos)
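A sketch of the sinusoidal encoding for one position. The base 10000 and the even/odd sin/cos layout follow the original Transformer paper rather than anything stated in this guide; the function name and d_model = 4 are illustrative.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: even indices use sin, odd indices use cos."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))  # a different vector for the next position
```

The resulting vector is added element-wise to the word embedding at the same position.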

↩️ Residual + Layer Norm

Bypass shortcut around each sublayer. If self-attention isn't useful, data skips it. Prevents vanishing gradients in deep 6-layer stacks. Input + SubLayer(Input) → LayerNorm.

Output = LayerNorm(x + SubLayer(x))

06 — BERT vs GPT — Understanding Modern Architectures (Added)

BERT vs GPT — The Two Dominant Transformer Paradigms

Dimension	BERT (Encoder-only)	GPT (Decoder-only)
Architecture	Bidirectional encoder only (no decoder)	Causal decoder only (left-to-right masked)
Training task	Masked Language Modelling (fill in [MASK] tokens)	Next-token prediction (autoregressive)
Context direction	Sees all tokens (past + future) simultaneously	Only past tokens (masked future)
Best at	Classification, NER, QA, embeddings	Text generation, completion, coding
Fine-tune for	Sentiment, intent detection, information extraction	Chatbots, summarisation, code generation
Examples	BERT · RoBERTa · DistilBERT	GPT-2/3/4 · LLaMA · Mistral

Practical rule: Need to understand text → BERT. Need to generate text → GPT. Need both → Encoder-Decoder (T5, BART). For RAG retrieval → BERT embeddings. For RAG generation → GPT-family LLM.

07 — Why Transformers Dominate — Scaling Laws (Added)

Scaling Laws & Why Transformers Replaced RNNs in Production

Parallelism

RNNs process sequentially (each step depends on the previous one). Transformers process all tokens of a sequence simultaneously during training → far better GPU utilisation.

Scaling

Kaplan et al. (2020): Model performance scales as a power law with compute, data, and parameters. RNNs don't benefit nearly as much from scale.

Long-range Dependencies

Every token attends to every other token in O(1) steps. RNNs need O(n) steps. Critical for long documents and code.

When to still use RNN/LSTM

✓ Edge deployment (Transformer too large)
✓ Real-time streaming (online learning)
✓ Very limited data (<10K sequences)
✓ Fixed-length time series with no long-range deps
✓ Teaching / understanding sequence models

When to use Transformers

✓ Any production NLP task in 2024+
✓ Long documents, code, multimodal
✓ When GPU compute is available
✓ Transfer learning from pretrained models
✓ Building LLM-powered applications

Master Quick-Reference — One Page to Rule Them All

Complete Deep Learning & NLP — Instant Reference

Vanishing Gradient

σ' ≤ 0.25 → 0.25ⁿ → 0

Fix: ReLU for FFNNs, LSTM/GRU for RNNs. Residual connections for Transformers.

Weight Init

Xavier (Sigmoid) · He (ReLU)

Never all-zeros. He (normal): std = √(2/nᵢₙ). Xavier (uniform): limit = √(6/(nᵢₙ+nₒᵤₜ)).
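
A minimal sketch of the two schemes follows. It assumes the usual convention: the He expression is the standard deviation of a zero-mean normal, and the Xavier expression is the bound of a symmetric uniform distribution. `init_weights` and its arguments are hypothetical names, not a library API.

```python
import math
import random

def he_std(n_in: int) -> float:
    """Standard deviation for He (normal) initialisation."""
    return math.sqrt(2.0 / n_in)

def xavier_limit(n_in: int, n_out: int) -> float:
    """Bound for Xavier (uniform) initialisation: weights ~ U(-limit, +limit)."""
    return math.sqrt(6.0 / (n_in + n_out))

def init_weights(n_in: int, n_out: int, scheme: str = "he", seed: int = 0):
    """Return an n_in x n_out weight matrix as nested lists."""
    rng = random.Random(seed)
    if scheme == "he":
        std = he_std(n_in)
        return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]
    if scheme == "xavier":
        limit = xavier_limit(n_in, n_out)
        return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]
    raise ValueError(f"unknown scheme: {scheme}")

W_relu = init_weights(8, 4, scheme="he")         # for ReLU layers
W_sigmoid = init_weights(3, 3, scheme="xavier")  # for Sigmoid/Tanh layers
```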

Activation Quick Rule

ReLU hidden · Softmax out

Sigmoid for binary outputs (and LSTM/GRU gates). Avoid Sigmoid in deep hidden layers. Leaky ReLU/PReLU if neurons die.

Loss Function

MSE=reg · BCE=binary · CCE=multi

Huber if outliers. Sparse CCE skips one-hot encoding. Always match loss to output type.

Optimizer

Adam by default · AdamW for LLMs

Never vanilla GD. β₁=0.9, β₂=0.999, ε=1e-8. Add LR schedule for best results.
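
To show where β₁, β₂ and ε enter, here is a minimal single-parameter sketch of one Adam update with bias correction. The function name, the learning rate of 0.1, and the toy objective f(x) = x² are assumptions chosen for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage: minimise f(x) = x^2, whose gradient is 2x.
x0 = 1.0
x, m, v = x0, 0.0, 0.0
for t in range(1, 101):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.1)
```

On the first step the bias-corrected ratio is close to ±1, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.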

CNN Output Size

⌊(n+2p−f)/s⌋ + 1

Same padding: p=⌊f/2⌋ (odd f, stride 1). Always ReLU after conv. Max pooling for approximate translation invariance.
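
The formula translates directly into integer arithmetic. A minimal sketch, with n the input size, f the filter size, p the padding and s the stride, as in the card (the function names are hypothetical):

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f: int) -> int:
    """Padding that preserves the input size for odd f at stride 1."""
    return f // 2

print(conv_output_size(28, 5))                      # no padding, stride 1
print(conv_output_size(32, 3, p=same_padding(3)))   # same padding keeps 32
print(conv_output_size(224, 7, p=3, s=2))           # strided conv roughly halves the size
```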

NLP Vectorization

BoW→TF-IDF→Word2Vec→BERT

Production: use BERT embeddings. BoW/TF-IDF for prototypes. Word2Vec when BERT is too heavy.

LSTM 5 Steps

Forget→Input→C̃→Update→Output

Cₜ = fₜ×Cₜ₋₁ + iₜ×C̃ₜ. hₜ = oₜ×tanh(Cₜ). Pre-pad sequences for LSTM.
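
The five steps can be sketched as a single NumPy time step. This is an illustration under assumptions, not a reference implementation: all four gates share one stacked weight matrix `W`, the gate order inside `W` (forget, input, candidate, output) is arbitrary, and × in the card is read as element-wise multiplication.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H + D), b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])              # 1. forget gate
    i_t = sigmoid(z[H:2 * H])          # 2. input gate
    c_tilde = np.tanh(z[2 * H:3 * H])  # 3. candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde # 4. additive cell update
    o_t = sigmoid(z[3 * H:4 * H])      # 5. output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Usage with random parameters: hidden size 3, input size 2.
rng = np.random.default_rng(0)
H, D = 3, 2
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h_t, c_t = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```

The cell update is a sum rather than a repeated product, which is why gradients along Cₜ decay more slowly than in a plain RNN.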

Self-Attention

softmax(QKᵀ/√dk) × V

dk=64, √dk=8 (original paper: d_model=512 split across 8 heads). Q/K/V from Wq,Wk,Wv matrices. Softmax makes each query's attention weights sum to 1.
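
A minimal single-head sketch of the formula in NumPy. The shapes (5 tokens, d_model=512, dk=64) follow the original paper's base configuration; the random inputs and the 0.05 weight scale are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head. X: (n_tokens, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 512))
Wq, Wk, Wv = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Dividing by √dk keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.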

Transformer Stack

6 Enc + 6 Dec · 8 heads

Parallel processing. Positional encoding for order. Residual + LayerNorm every sublayer.

BERT vs GPT

Encoder=understand · Decoder=generate

BERT: masked LM, bidirectional. GPT: causal, autoregressive. T5/BART: both.

Production Checklist

Init → BN → Dropout → Schedule

He init → Batch Norm → Dropout(0.2–0.5) → LR schedule → Early stopping → Monitor val loss.

Volume 9

Transformers, BERT & GPT

Attention Mechanisms · Encoder vs Decoder · Transfer Learning · Meta-Learning

01 — Full Transformer Architecture (Vaswani et al. 2017)

The diagram below reproduces the full Encoder–Decoder Transformer architecture from "Attention is All You Need" (Vaswani et al., 2017) — the same diagram from Jay Alammar's famous illustrated guide. The left stack is the Encoder. The right stack is the Decoder. Each contains Self-Attention → Add & Normalize → Feed-Forward → Add & Normalize. The Decoder adds a third sublayer: Encoder–Decoder Attention which attends over the encoder's output. A Linear + Softmax layer on top converts decoder output to a probability distribution over the vocabulary.

Encoder (Left Stack)

Each encoder has 2 sublayers: Self-Attention → Add&Norm, then Feed-Forward → Add&Norm. Residual connections (dashed arrows) bypass each sublayer — gradient highway. Output is a rich contextual representation of the input sequence. BERT uses encoder-only.

Decoder (Right Stack)

Each decoder has 3 sublayers: Masked Self-Attention (prevents future-token peeking) → Add&Norm, then Encoder-Decoder Attention (attends over encoder output) → Add&Norm, then Feed-Forward → Add&Norm. GPT uses decoder-only.
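
The "no future-token peeking" rule is usually implemented as a lower-triangular mask applied before the softmax. A minimal NumPy sketch, with uniform scores chosen only so the effect of the mask is easy to see (helper names are hypothetical):

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Boolean mask: position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis with disallowed positions set to -inf."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) is 0, so masked positions get zero weight
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores for illustration
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, token 0 attends only to itself, and token 3 spreads its weight equally over tokens 0 to 3.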

Enc-Dec Attention (Pink)

The critical bridge. Queries come from the decoder; Keys and Values come from the final encoder output. This is how the decoder "reads" the encoded input while generating each output token — context flows from encoder to decoder through this layer.
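
The only difference from self-attention is where the three projections read from. A minimal single-head sketch in NumPy; the sequence lengths (3 decoder positions, 7 encoder positions) and the small dimensions are assumptions for illustration.

```python
import numpy as np

def cross_attention(dec_states, enc_output, Wq, Wk, Wv):
    """Encoder-decoder attention: Q from the decoder, K and V from the encoder."""
    Q = dec_states @ Wq                        # (T_dec, d_k)
    K = enc_output @ Wk                        # (T_enc, d_k)
    V = enc_output @ Wv                        # (T_enc, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (T_dec, T_enc)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
dec_states = rng.normal(size=(3, d_model))   # 3 target positions generated so far
enc_output = rng.normal(size=(7, d_model))   # 7 encoded source tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3))
context, weights = cross_attention(dec_states, enc_output, Wq, Wk, Wv)
```

Each row of `weights` is one decoder position's distribution over the source tokens, so no causal mask is needed here: the whole input is already known.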

02 — The Evolution of NLP

Legacy Era (2013–2016)

Word2Vec, n-grams, RNNs, LSTMs dominated NLP. Models processed tokens one at a time, which made them inherently sequential and slow. Static embeddings such as Word2Vec gave “bank” in “river bank” and “bank robber” the same vector, with no contextual awareness. BiLSTMs approximated bidirectionality by concatenating a forward and a backward pass, but each pass still saw only one direction.

Transformer Breakthrough (2017)

“Attention is All You Need” replaced recurrence entirely. All tokens processed simultaneously — massive parallelization. Self-attention computes contextual relationships between every pair of words in one matrix operation. Training time dropped dramatically.

03 — Self-Attention Mechanism

04 — BERT (Encoder-Only) & GPT (Decoder-Only)

BERT — Encoder Stack

Stacks encoder blocks only. Bi-directional context. Pre-trained with MLM (mask 15% of tokens, predict them) + NSP (predict if sentence B follows A). Fine-tune by adding a task-specific output layer and continuing training, usually on all weights.

      Input: Token + Segment + Position Embeddings

      BERT Base: 12 layers, 110M params

      BERT Large: 24 layers, 340M params

      Max: 512 tokens (hard limit)
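
The MLM data preparation can be sketched in a few lines. This is a deliberately simplified illustration: every selected position is replaced with `[MASK]`, whereas the original BERT recipe replaces only 80% of the selected positions with `[MASK]`, swaps 10% for a random token and leaves 10% unchanged. `mask_tokens` is a hypothetical helper.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask about mask_prob of the tokens; return the masked list and the targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}  # position -> original token the model must predict
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, labels

sentence = "the cat sat on the mat because it was warm".split()
masked, labels = mask_tokens(sentence)
print(masked)
print(labels)
```

The loss is computed only at the positions in `labels`; the unmasked tokens serve as context from both directions.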

GPT — Decoder Stack

Stacks decoder blocks only. Uni-directional (left-to-right). Pre-trained by predicting the next word. Evolved from fine-tuning (GPT-1) → zero-shot (GPT-2, 1.5B) → few-shot meta-learning (GPT-3, 175B).

      GPT-1: 117M — fine-tune per task

      GPT-2: 1.5B — zero-shot learning

      GPT-3: 175B — few-shot (10-100 examples)

      No weight updates at inference time

Dimension	BERT	GPT
Architecture	Encoder-only	Decoder-only
Direction	Bi-directional	Uni-directional
Pre-training	MLM + NSP	Next-word prediction
Strength	Understanding / classifying	Generating / completing
Adaptation	Fine-tune (add task head)	Prompt / meta-learning
Token limit	512 (strict)	1k–128k+ (model-dependent)

Cheat Summary

Transformer

Encoder + Decoder, parallel

2017. Attention only. No recurrence. Q/K/V matrices. BERT=Encoder. GPT=Decoder.

Self-Attention

softmax(QKᵀ/√dk) × V

Irrelevant words get attention weights near 0 (e.g. 0.001) and are drowned out. Relevant words get weights near 1 and are preserved.

BERT Pre-training

MLM 15% + NSP binary

Mask 15%: balance between cost and context. NSP: sentence coherence. Both run simultaneously.

GPT Meta-learning

Zero/Few-shot → no weight updates

GPT-2: 0 examples. GPT-3: 10-100 examples in context window. Few-shot accuracy improves with scale; GPT-3 used 175B params.

Add & Normalize

Residual + LayerNorm

Every sublayer. Residual = gradient highway. LayerNorm = stabilizes activations. Together they mitigate vanishing gradients in deep stacks.
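
A minimal NumPy sketch of one Add & Normalize step in the post-norm arrangement of the original paper, LayerNorm(x + Sublayer(x)). The learned gain and bias of LayerNorm are omitted for brevity, and the lambda is a stand-in for a real attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                 # 4 tokens, 16 features each
out = add_and_norm(x, lambda h: 0.5 * h)     # placeholder sublayer
```

Because the input x is added back unchanged, the gradient always has a direct path around the sublayer.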

Enc-Dec Attention

Q=decoder, K/V=encoder

Bridge between stacks. Decoder queries the encoder's output for context on each output token.

Positional Encoding

sin/cos waves by position

Tokens enter all at once → no inherent order. PE injects order. Original Transformer: fixed sin/cos. BERT and GPT: learned position embeddings. Max length: BERT 512, GPT-3 2048.
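
The sin/cos scheme from the original Transformer paper can be sketched as below: even feature indices use sine, odd indices use cosine, with wavelengths growing geometrically up to a base of 10000. The function name is hypothetical and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed positional encodings."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # feature indices 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i + 1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# In the original design, pe[:n] is added to the n token embeddings before the first layer.
```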

Transfer Learning

Pre-train → save → fine-tune

Rarely train from scratch. HuggingFace: bert-base-uncased. Fine-tune = add a task head, then train all weights (or only the head when data or compute is scarce).

Q&A Flashcards — Exam & Interview Prep

Why can Transformers parallelize but LSTMs cannot?

LSTMs: step t needs step t−1 output — inherently sequential. Transformers: attention computed on all tokens simultaneously in one matrix operation → full GPU/TPU utilization.

Remove Position Embeddings from BERT — what breaks?

All tokens enter without order. “Dog bites man” = “Man bites dog”. Grammar and syntax collapse. Transformer becomes a bag-of-words model.

Why mask exactly 15% in MLM?

Too low (5%) → training is computationally expensive for little signal. Too high (50%) → destroys surrounding context needed to predict the mask. 15% balances both.

Fine-tuning vs meta-learning in GPT?

Fine-tuning: gradients flow, weights update, typically needs thousands to hundreds of thousands of labelled examples. Meta-learning (in-context learning): weights frozen, instructions + examples placed as tokens in the context window at inference only.

Feed a 1000-word essay into BERT — what happens?

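A 1000-word essay usually exceeds 512 tokens once tokenized. BERT's position embeddings are only defined up to position 512, so the input is either truncated (the tail is lost) or, if passed through untruncated, fails with an index error. Use Longformer, BigBird, or chunk the document into 512-token segments.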
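
The chunking option can be sketched as a sliding window over token ids. This is a minimal illustration under assumptions: the ids are taken as already produced by a tokenizer, two slots per chunk are reserved for [CLS] and [SEP], and the 128-token overlap is an arbitrary choice. `chunk_token_ids` is a hypothetical helper, not a library function.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, n_special=2):
    """Split token ids into overlapping windows that fit a max_len model input."""
    window = max_len - n_special   # room left after special tokens
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    step = window - stride         # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks

ids = list(range(1300))            # stand-in for a tokenized long essay
chunks = chunk_token_ids(ids)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk is then encoded separately, and the per-chunk outputs are combined downstream, for example by averaging embeddings or taking the highest-scoring answer span.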

Why does Enc-Dec Attention use Q from decoder and K/V from encoder?

The decoder is generating output tokens. It needs to query the encoder's understanding of the full input. Q = “what am I looking for?”, K/V = “what did the encoder understand?”.

Sources & Further Reading

🔗 Original Paper

PAPER

"Attention is All You Need" — Vaswani et al. (2017)

The original Transformer paper introducing the full Encoder–Decoder architecture with multi-head self-attention. Google Brain / Google Research.

arxiv.org/abs/1706.03762 →

🎥 Video Resources

YT

Illustrated Guide to Transformers — Michael Phi

Step-by-step visual walkthrough of the Transformer architecture including the encoder-decoder stack, attention mechanism, and how words flow through each layer.

youtube.com/watch?v=4Bdc55j80l8 →

YT

BERT Neural Network — Computerphile

Clear explanation of BERT's masked language modelling, next sentence prediction, and how fine-tuning works for downstream NLP tasks.

youtube.com/watch?v=7kLi8u2dJz0 →

YT

GPT, GPT-2, GPT-3 Explained — Yannic Kilcher

Deep technical walk-through of all three GPT generations covering the shift from fine-tuning to in-context few-shot learning and the scaling hypothesis.

youtube.com/watch?v=SY5PvZrJhLE →

📄 Illustrated Blog Posts

BLOG

The Illustrated Transformer — Jay Alammar

The gold-standard illustrated walkthrough of the Transformer. The architecture diagram in this volume is based on Jay Alammar's visuals. Covers every component step-by-step with animations.

jalammar.github.io/illustrated-transformer →

BLOG

The Illustrated BERT, ELMo — Jay Alammar

Companion post to the Transformer guide. Explains BERT's pre-training tasks, token/segment/position embeddings, and fine-tuning in detail with illustrated examples.

jalammar.github.io/illustrated-bert →

BLOG

Attention? Attention! — Lilian Weng (OpenAI)

Comprehensive mathematical survey of all attention variants — self-attention, multi-head, cross-attention, and beyond. Essential for understanding the math behind every attention mechanism.

lilianweng.github.io/posts/attention →
