Neural networks map complex inputs to accurate outputs by fitting mathematical curves to data. They are not mysterious black boxes — they use calculus, statistics, and linear algebra to iteratively minimise error.
Input Layer
Ingests features (X₁, X₂, X₃). Raw data converted to numbers first.
Final prediction ŷ. Sigmoid for binary, Softmax for multiclass, Linear for regression.
Neuron output: y = Σ(wᵢ · xᵢ) + b → activation(y) → ŷ
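Weights multiply inputs to stretch/flip curves. Biases shift them. Stacking layers with non-linear activations gives the network the capacity to approximate any continuous function on a bounded input range, given enough hidden units (Universal Approximation Theorem).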
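The neuron formula above can be sketched in a few lines of plain Python. The input values, weights, and bias below are hypothetical, chosen so that the weighted sum comes out to exactly zero; the sigmoid activation is an assumption for illustration.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values: 0.5*1 - 1.0*2 + 0.5*3 + 0 = 0
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
b = 0.0
y_hat = neuron(x, w, b)
print(y_hat)  # sigmoid(0) = 0.5
```

Changing a weight stretches or flips the neuron's response to that input; changing `b` shifts the point at which the activation switches on.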
🏗️ Architecture Diagram
The learning loop: Forward → compute loss → Backward (chain rule) → update weights → repeat until convergence.
02 — The Vanishing Gradient Problem (Definitive Reference)
⚠️ Why Gradients Vanish — The Math
During backpropagation, gradients are multiplied through every layer using the chain rule. The Sigmoid derivative is at most 0.25; the Tanh derivative is at most 1 and falls below 1 everywhere except at 0. In a deep network the repeated multiplication causes exponential decay.
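A short numeric check makes the decay concrete. The sketch below assumes the best case for sigmoid (every pre-activation sits at 0, where the derivative peaks) and ignores the weight terms, so real networks can be worse.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Largest value the sigmoid derivative can take (reached at z = 0)
best_case = sigmoid_derivative(0.0)

for depth in (5, 10, 20):
    # Upper bound on the product of `depth` sigmoid derivatives
    print(depth, best_case ** depth)
```

At depth 10 the bound is already below one in a million, which is why early layers in a deep sigmoid network barely update.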
Zero-centred, smooth at 0. Computationally expensive (exponential). Use when zero-centering matters more than speed.
✓ Zero-centred · ✗ Slow
Swish
x · σ(x) — Google Brain
Self-gating. Smooth, non-monotonic. Outperforms ReLU in very deep networks. Only use >40 layers — expensive.
>40 layers only
Softmax
eᶻʲ / Σeᶻᵏ
Output layer for multiclass. Converts raw logits to probability distribution summing to 1. Paired with Categorical Cross-Entropy.
Multiclass output
PReLU
max(αx, x) — α learned
α updated by backprop. α=0 → ReLU. α=0.01 → Leaky ReLU. Best of both: learns the optimal negative slope for your data.
✓ Adaptive α
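The formulas on these cards translate directly into code. This is a minimal scalar sketch; in a real network α would be a trainable parameter updated by backprop, whereas here it is just an argument.

```python
import math

def relu(x):
    return max(0.0, x)

def prelu(x, alpha):
    """max(alpha*x, x) for 0 <= alpha < 1; alpha is learned in a real network."""
    return x if x > 0 else alpha * x

def swish(x):
    """x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

print(prelu(-2.0, 0.0))    # alpha = 0 behaves like ReLU
print(prelu(-2.0, 0.01))   # alpha = 0.01 behaves like Leaky ReLU
print(swish(-1.0))         # small negative output: Swish is non-monotonic
```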
02 — ArgMax vs SoftMax — The Output Decision
SoftMax — During Training
Converts raw logits into a probability distribution — all values between 0 and 1, summing to 1. Used during training so the loss function (Cross-Entropy) can compare smooth probabilities against true labels.
Training output layer · Differentiable · ✓ Smooth gradients
ArgMax — During Inference
Simply picks the index of the highest value. No probabilities — just the winning class. Used at inference time when you only need the final predicted label, not a probability.
ArgMax([2.1, 0.5, 1.3]) → 0 (index of 2.1, the largest)
Inference / prediction · Not differentiable · ✓ Fastest (no exp())
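Both operations are easy to sketch by hand, reusing the logits from the ArgMax example above. The max-subtraction inside `softmax` is a standard numerical-stability trick, not something the text above prescribes.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.1, 0.5, 1.3]
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(argmax(logits), argmax(probs))  # same index: softmax preserves ordering
```

Because softmax never changes which entry is largest, inference can skip it and apply ArgMax to the logits directly.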
03 — Activation Function Decision Guide
If your situation is… | Use this | Why
Hidden layers (default) | ReLU | Fast, mitigates vanishing gradients, industry default. Use He init.
Dead neurons / stuck training | Leaky ReLU or PReLU | Keeps gradient non-zero for negatives. PReLU learns optimal α.
Very deep network (>40 layers) | Swish | Outperforms ReLU in deep nets. Accept compute cost.
Binary classification output | Sigmoid | Outputs 0–1 probability. Pair with Binary CE loss.
Multiclass output | Softmax | Probabilities sum to 1. Pair with Categorical CE.
Regression output | Linear (none) | Unbounded real number output. Pair with MSE/MAE.
RNN/LSTM gates | Sigmoid + Tanh | Sigmoid for gating (0–1). Tanh for values (−1 to +1).
Training is very slow (ELU) | Leaky ReLU | ELU's exponential is expensive. Leaky ReLU is faster with similar benefit.
03 — Batch Normalisation (Added)
Batch Normalisation — The Modern Solution to Internal Covariate Shift
BN normalises layer inputs to mean=0, variance=1 per mini-batch. It can be applied BEFORE or AFTER the activation function. Dramatically stabilises and speeds up training, and makes training less sensitive to the learning rate.
γ and β are learnable parameters — the network learns optimal scale and shift. ε = small constant for numerical stability.
✓ Allows much higher learning rates
✓ Reduces sensitivity to initialization
✓ Acts as mild regulariser (reduces need for Dropout)
✗ Doesn't work well with small batch sizes → use Layer Norm for RNNs/Transformers
model.add(BatchNormalization()) # after Dense, before or after activation
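To show what the layer computes during training, here is a minimal sketch for a single feature across one mini-batch, using the standard formulation x̂ = (x − μ) / √(σ² + ε), y = γ·x̂ + β. The sample activations are hypothetical, and running statistics for inference are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n      # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]     # hypothetical values for one feature
normalised = batch_norm(activations)
print(normalised)
```

With γ=1 and β=0 the output has mean 0 and variance close to 1; the network is free to learn other values of γ and β if a different scale suits the next layer better.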
III
Volume 3
Loss Functions · Optimizers · Regularization
01 — Loss vs Cost — Quick Distinction
Loss Function
Error for a single data point. L = error(ŷ, y). What you minimise conceptually. Calculated on one record passing through the network.
Cost Function
Average loss across an entire batch or dataset. J = (1/n) Σ L. What gradient descent actually minimises — more stable gradient direction.
02 — Regression Loss Functions
MSE / L2 Loss
L = (y − ŷ)² J = Σ(y−ŷ)²/n
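Quadratic bowl in the prediction error → smooth gradients that shrink near the minimum (convex for linear models; a deep network's loss surface is still non-convex in its weights). Heavily penalises outliers (squaring). ✓ Single global min (in ŷ) · ✗ Outlier sensitive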
MAE / L1 Loss
L = |y − ŷ|
Robust to outliers (linear, not squared). Sharp bend at 0 → derivative undefined there, and the gradient magnitude stays constant, so convergence near the minimum is less smooth. ✓ Outlier robust · ✗ Harder to optimise
Huber Loss — Best of Both
|err| ≤ δ → ½·err² (MSE-like) · |err| > δ → δ·(|err| − ½δ) (MAE-like)
Smooth convergence (MSE center) + outlier robust (MAE tails). Hyperparameter δ controls transition. ✓ Best of both
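The piecewise behaviour is easiest to see in code. This sketch uses the standard Huber definition (quadratic inside δ, linear outside) with δ = 1.0 as an assumed default.

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for a single prediction."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2                 # quadratic (MSE-like) region
    return delta * (err - 0.5 * delta)        # linear (MAE-like) region

print(huber(3.0, 2.5))    # small error: 0.5 * 0.5**2 = 0.125
print(huber(3.0, 13.0))   # large error: 1.0 * (10 - 0.5) = 9.5
```

A squared loss would charge 100 for the second case; Huber charges 9.5, so a single outlier pulls on the model far less.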
03 — Classification Loss Functions
Function | Classes | Label Format | Formula | When to Use
Binary CE | Exactly 2 | 0 or 1 | −y·log(ŷ)−(1−y)·log(1−ŷ) | Cat vs Dog, spam detection
Categorical CE | > 2 | One-hot array [0,1,0] | −Σ yᵢ·log(ŷᵢ) | Image classification, NLU
Sparse Cat. CE | > 2 | Integer labels (0, 1, 2…) | Same value as Cat. CE; looks up the true class by index, no one-hot needed | Many classes, saves memory
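The three losses in the table can be written out directly. The probabilities below are hypothetical model outputs; the sketch also shows that the sparse and one-hot variants give the same number for the same example.

```python
import math

def binary_ce(y, y_hat):
    """Binary cross-entropy for one example; y is 0 or 1, y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_ce(one_hot, probs):
    """Categorical cross-entropy with a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_ce(label, probs):
    """Same loss, but the label is an integer class index."""
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]                       # hypothetical softmax output
print(categorical_ce([0, 1, 0], probs))
print(sparse_categorical_ce(1, probs))        # identical value, no one-hot array
print(binary_ce(1, 0.9))
```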
07 — Virtual Environments — Project Best Practice
🐍 Why Virtual Environments Matter in Deep Learning
ML libraries (TensorFlow, PyTorch, CUDA) update frequently and break older code. TensorFlow 1.x vs 2.x are fundamentally incompatible. A virtual environment isolates each project's exact dependency versions.
pip freeze > requirements.txt
# On another machine:
pip install -r requirements.txt
Rule: Create a new Conda environment for every new project. Never install ML libraries into your base Python environment. Use requirements.txt to reproduce the environment on any machine or server.
08 — Loss Function — SSR (Sum of Squared Residuals)
📐 Sum of Squared Residuals (SSR)
The foundational loss for regression. Core of MSE. Squaring residuals ensures all errors are positive and penalises large errors much more than small ones.
SSR = Σ (y − ŷ)² → MSE = SSR / n (average version)
Why square? Ensures positive values. Amplifies large errors (3² = 9 vs 3). Creates smooth differentiable parabola — single global minimum for gradient descent.
When NOT to use SSR/MSE: Dataset with outliers — squaring inflates outlier errors massively, pulling the model away from the majority of normal data. Use MAE or Huber instead.
Optimizer | Mechanism | Strength | Weakness | Use When
Gradient Descent | Entire dataset per update | Stable direction | Catastrophically slow on large data | Small datasets only
SGD | 1 sample per update | Fast iterations | Very noisy, zigzag path | With momentum for vision
Mini-batch SGD | k samples per update | Best balance | Still some noise | Industry standard baseline
SGD + Momentum | EWA of past gradients. V = β·V_prev + (1−β)·dW | Smoother path | Fixed LR still needed | CV tasks, ResNet training
Adagrad | Divides LR by √(sum of squared grads) | Adaptive per-param LR | LR → 0 over time (fatal) | Sparse data only
RMSprop | EWA of squared grads in denominator | Fixes Adagrad decay | No bias correction | RNNs (historical)
Adam ⭐ | Momentum + RMSprop + Bias Correction | Fast, stable, adaptive | May overfit on small data (try AdamW) | Sensible default for most tasks
AdamW | Adam + decoupled weight decay | Better generalisation | Extra hyperparameter | Transformers, fine-tuning LLMs
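The Adam row combines three ideas from the rows above it, and a scalar sketch shows how they fit together. The toy objective f(w) = (w − 3)² and the learning rate of 0.1 are assumptions chosen so the loop converges quickly; they are not recommended settings.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a one-parameter function with the Adam update rule."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # momentum: EWA of gradients
        v = beta2 * v + (1 - beta2) * g * g      # RMSprop: EWA of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2 has gradient 2(w - 3) and its minimum at w = 3.
w_final = adam_minimise(lambda w: 2 * (w - 3), w=0.0)
print(w_final)
```

Dropping the two bias-correction lines gives an RMSprop-with-momentum style update; dropping `v` as well leaves SGD with momentum.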
05 — Learning Rate Scheduling (Added)
Learning Rate Scheduling — Critical for Production Training
A fixed LR is rarely optimal. Scheduling changes the LR over time: start large (fast progress), then shrink (precise convergence). Standard practice in most large training runs.
Step Decay
LR drops by factor every N epochs. Simple but abrupt transitions.
Cosine Annealing
LR follows cosine curve. Smooth, widely used for vision and NLP.
Warmup + Decay
Start with tiny LR, ramp up, then decay. Standard for Transformers and LLMs.
ReduceLROnPlateau
Automatically reduces LR when validation loss stops improving. Easy win.
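from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
# optimizer and train_loader are assumed to be defined earlier in the training script
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=10)
# call scheduler.step() after every batch (OneCycleLR is stepped per batch, not per epoch)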
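The schedules listed above are just functions from a step or epoch number to a learning rate. This framework-free sketch shows three of them; the drop factor, interval, and warmup length are hypothetical defaults, and the function names are illustrative rather than any library's API.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the LR by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cos)

def warmup_then_cosine(base_lr, step, warmup_steps, total_steps):
    """Linear ramp-up for warmup_steps, then cosine decay for the rest."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_annealing(base_lr, step - warmup_steps, total_steps - warmup_steps)

for epoch in (0, 10, 25, 50, 99):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(0.1, epoch, 100))
```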
06 — Regularization — Dropout, L1, L2
🎲 Dropout
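Randomly deactivates neurons with probability p (p=0.5 typical) during training. Forces redundant representations, similar in effect to ensemble learning. All neurons are active at inference. Classic dropout scales activations by the keep probability (1−p) at inference; Keras and PyTorch instead use inverted dropout, scaling the surviving activations by 1/(1−p) during training so inference needs no change.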
model.add(Dropout(0.5)) # after LSTM/Dense
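What the layer does internally can be sketched as follows. This assumes the inverted-dropout variant used by Keras and PyTorch, where survivors are rescaled during training and inference is a plain pass-through; the `rng` argument is a hypothetical hook for reproducibility.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: drop each unit with probability p, rescale survivors."""
    if not training:
        return list(activations)                 # inference: identity, no scaling
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                           # fixed seed so runs are repeatable
out = dropout([1.0] * 10, p=0.5, rng=rng)
print(out)                                       # a mix of 0.0 and 2.0
print(dropout([1.0] * 10, training=False))       # unchanged at inference
```

Rescaling by 1/(1−p) keeps the expected activation the same in training and inference, which is why no weight adjustment is needed afterwards.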
📏 L1 vs L2 Regularization
L2 (Ridge / Weight Decay)
J_total = J + λ Σw²
Shrinks all weights toward 0. Most common. Equivalent to "weight decay" under plain SGD; with Adam the two differ, which is why AdamW decouples the decay. Smooth, differentiable.
L1 (Lasso)
J_total = J + λ Σ|w|
Pushes some weights exactly to 0 — produces sparse models. Good for feature selection.
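The different behaviour of the two penalties shows up in a single update step. The sketch below looks only at the penalty term (no data loss) and, for L1, uses a soft-thresholding step, which is one common way to handle the non-differentiable point at 0; that choice is an assumption of this example.

```python
def l2_step(w, lr, lam):
    """Gradient step on lam * w^2: shrinks w proportionally, never exactly to 0."""
    return w - lr * 2 * lam * w

def l1_step(w, lr, lam):
    """Soft-thresholding step on lam * |w|: small weights land exactly on 0."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0
    return w - shrink if w > 0 else w + shrink

w = 0.005                      # hypothetical small weight
print(l2_step(w, lr=0.1, lam=0.1))   # slightly smaller, still non-zero
print(l1_step(w, lr=0.1, lam=0.1))   # exactly 0.0
```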
🎯 Zero-Centred Activations — Why It Matters
If activation output is not centred around zero (like Sigmoid: outputs 0–1 only), gradients during backprop are all positive or all negative for a neuron's weights. This forces a zigzag path to convergence — slower training.
Output = Input size. Adds zeros around border. Preserves spatial dimensions. Use in deep nets.
Valid (p = 0)
Output shrinks by (f−1) per layer. Edge pixels under-represented. Use only for small networks.
keras: padding='same' or padding='valid'
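The effect of each padding mode on the feature-map size follows from the output-size formula ⌊(n + 2p − f)/s⌋ + 1. A small helper makes the two cases easy to compare; the 28-pixel input and 3×3 filter are hypothetical.

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size for input n, filter f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(28, 3))          # valid: 28 - 3 + 1 = 26
print(conv_output_size(28, 3, p=1))     # same (stride 1): 28
print(conv_output_size(28, 3, p=1, s=2))
```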
⬇️ Max Pooling
Sliding window picks maximum value. Reduces spatial dimensions while retaining strongest activations. Gives a degree of translation invariance: a feature still registers if it shifts by a few pixels within the window.
Average / Mean Pooling (same operation — two names)
Takes the mean of the region instead of the maximum. Smoother, less sharp features. Less common than Max Pooling for object detection. Used in Global Average Pooling (GAP) before final classifier in modern CNNs like ResNet.
Max Pool → object detection · Avg Pool → smooth features, GAP layers
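Both pooling types are the same sliding-window loop with a different reduction. This sketch assumes non-overlapping windows (stride equal to window size) and a hypothetical 4×4 feature map.

```python
def pool2d(image, size=2, op=max):
    """Non-overlapping pooling over size x size windows (stride = size)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(op(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 1, 7, 8],
]
print(pool2d(feature_map, op=max))    # [[4, 2], [2, 8]]
print(pool2d(feature_map, op=mean))   # [[2.5, 1.0], [1.0, 6.5]]
```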
🔄 Data Augmentation
Artificially expands training data by transforming existing images. Same label, different appearance — teaches CNN to be robust to variations.
02 — Text Preprocessing — Stemming, Lemmatization & Stop Words
🌿 Stemming vs Lemmatization
Stemming — PorterStemmer
Aggressively chops word endings to find a root stem. Fast but often produces non-words. Uses a rule-based algorithm, not a dictionary.
"history" → histori ❌ (not a real word) · "running" → run ✓
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("history") # → histori
✓ Very fast · ✗ Fake words possible · Best for: spam, toxic classifier
Lemmatization — WordNetLemmatizer
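Uses a dictionary (WordNet) to find the true base form. Slower, but returns a real dictionary word when it finds one (unknown words come back unchanged). Pass the part of speech; the default is noun.
"histories" → history ✓ · "better" → good ✓ (as adjective)
from nltk.stem import WordNetLemmatizer # requires nltk.download('wordnet')
wl = WordNetLemmatizer()
wl.lemmatize("histories") # → history
wl.lemmatize("better", pos="a") # → good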
✓ Real dictionary words · ✗ Slower · Best for: chatbots, translation
⚠️ Stop Words — The "not" Problem
Removing common low-value words saves computation. But never blindly remove "not" — it completely reverses sentiment meaning.
❌ Removing "not" destroys sentiment
Original: "Food is not good" → After stop-word removal: "Food good" ← completely flipped!
Rule: Always use a custom stop word list for your task. Remove "not" from the default NLTK stop words list when doing sentiment analysis.
from nltk.corpus import stopwords # requires nltk.download('stopwords')
stop = set(stopwords.words('english'))
stop.remove('not') # critical for sentiment!
03 — N-grams, CountVectorizer & max_features
📊 N-grams — Capturing Word Sequences
Instead of single words (unigrams), N-grams capture sequences of N words. Bigrams and trigrams preserve local context that BoW loses entirely.
Unigram (1,1) — default BoW
"Indian politician" → ["Indian", "politician"] — loses relationship between words
Bigram (2,2) or (1,2)
"Indian politician" → ["Indian politician"] — single feature preserving meaning
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2)) # unigrams + bigrams
cv = CountVectorizer(ngram_range=(2,3)) # bigrams + trigrams only
⚙️ max_features — Fighting Sparsity
max_features restricts the model to only the top N most frequent words, discarding rare words. Manual way to control vector dimensions and reduce sparse matrix size.
cv = CountVectorizer(max_features=1000)
# Only top 1000 words become features
# Reduces 50,000-dim → 1,000-dim matrix
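To see what ngram_range and max_features do conceptually, here is a pure-Python sketch that builds n-grams and keeps only the most frequent ones. It is an illustration of the idea, not scikit-learn's implementation; the helper names and the two sample documents are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_features(docs, ngram_range=(1, 2), max_features=None):
    """Count every n-gram in the range and keep the most frequent ones."""
    counts = Counter()
    lo, hi = ngram_range
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(lo, hi + 1):
            counts.update(ngrams(tokens, n))
    return [term for term, _ in counts.most_common(max_features)]

docs = ["Indian politician speaks", "Indian politician travels"]
print(ngrams("indian politician".split(), 2))    # ['indian politician']
print(top_features(docs, ngram_range=(1, 2), max_features=3))
```

Widening the n-gram range grows the vocabulary quickly, which is exactly the sparsity that max_features is there to cap.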
Captures gender, royalty relationships from raw text — no human labels
✓ Dense 100–300 dim · ✓ Semantic relationships · ✓ Cosine similarity works
VI
Volume 6
LSTM — Long Short-Term Memory Deep Dive
00 — Why RNNs? Human Memory vs Standard ANNs
🔁 The Problem RNNs Solve
When you read "The cat sat on the mat — it was tired", you understand "it" refers to "cat" because you remember earlier words. Standard ANNs process each input independently — no memory of previous inputs. RNNs add a loop that passes the previous output back as input, creating short-term memory.
Standard ANN ❌
Each word processed independently. No memory. Can't understand "it" without remembering "cat" from 6 words ago.
RNN ✓
Hidden state hₜ carries context from previous steps. Output at step t depends on all previous inputs. Natural for sequences: text, audio, time series.
01 — GRU — Gated Recurrent Unit
🔀 GRU — Lightweight LSTM Alternative
GRU combines long-term and short-term memory into a single hidden state using only 2 gates (vs LSTM's 3 gates and 2 states). Faster to train, often matches LSTM performance.
Update Gate
zₜ = σ(Wz·[hₜ₋₁, xₜ])
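Decides how much of the past hidden state to retain: hₜ = zₜ·hₜ₋₁ + (1−zₜ)·h̃ₜ, where h̃ₜ is the new candidate state. Output near 0 = ignore past (overwrite with new candidate). Near 1 = keep the past state and let little new information in. Controls long-term memory. (Some references swap the roles of zₜ and 1−zₜ.)
0 = overwrite · 1 = keep past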
Reset Gate
rₜ = σ(Wr·[hₜ₋₁, xₜ])
Decides what irrelevant old context to forget. Example: subject switches from "Mr. Watson" to "Mrs. Watson" — reset gate fires to erase old context and make room for new subject.
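The two gates can be traced through one time step with a single scalar unit. This sketch assumes the convention used on the update-gate card (z near 1 keeps the past), omits biases, and uses a dictionary of hypothetical weights; the candidate state applies the reset gate to the previous state before mixing in the new input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, p):
    """One GRU step for a single scalar unit; p holds hypothetical weights."""
    z = sigmoid(p["wz_h"] * h_prev + p["wz_x"] * x)                # update gate
    r = sigmoid(p["wr_h"] * h_prev + p["wr_x"] * x)                # reset gate
    h_cand = math.tanh(p["wh_h"] * (r * h_prev) + p["wh_x"] * x)   # candidate state
    return z * h_prev + (1.0 - z) * h_cand                         # z near 1 keeps the past

keep_past = dict(wz_h=0.0, wz_x=20.0, wr_h=0.0, wr_x=0.0, wh_h=1.0, wh_x=1.0)
overwrite = dict(keep_past, wz_x=-20.0)

print(gru_step(0.7, 1.0, keep_past))   # stays close to the previous state 0.7
print(gru_step(0.7, 1.0, overwrite))   # jumps to the candidate state
```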
Sequence → Sequence. Language translation, chatbots, question-answering.
02 — Forward & Backward Propagation in RNNs
➡️ Forward Propagation
At each time step t, the network receives current input xₜ and the previous hidden state hₜ₋₁. Both are multiplied by their weight matrices, added together, and passed through an activation function (typically Tanh).
The hidden state hₜ carries forward the context from all previous time steps into the next step — this is the RNN's short-term memory mechanism.
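The forward pass described above reduces to one line inside a loop. This sketch uses a single hidden unit with scalar weights and assumes tanh as the activation, a common choice for the hidden state; the weights and inputs are hypothetical.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a single-unit RNN over a sequence and return every hidden state."""
    h, states = h0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)   # new state mixes input and previous state
        states.append(h)
    return states

with_signal = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
without = rnn_forward([0.0, 0.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
print(with_signal)   # the first input still influences the last state
print(without)       # all zeros
```

The final state differs between the two runs even though only the first input changed, which is the short-term memory carried by hₜ.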
⬅️ Backward Propagation Through Time (BPTT)
Gradients flow backward through all time steps using the chain rule — called Backpropagation Through Time (BPTT). Each step multiplies the gradient by the weight matrix and activation derivative.
Problem: Sigmoid's derivative is at most 0.25 and Tanh's is at most 1 (usually well below), so multiplying across many time steps causes the gradient to vanish. Early time steps receive near-zero gradient updates, so the network forgets long-range context.
✗ Vanishing gradient over long sequences · Fix: LSTM / GRU gates
↔️ Bidirectional LSTM
Standard LSTM only reads left→right. "Bull is going ___" — can't predict without "high" that follows. Bi-LSTM solves this by reading both directions and concatenating outputs.
model.add(Bidirectional(LSTM(128)))
🏗️ Full NLP Keras Pipeline
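# Legacy TF 2.x Keras API; the NLTK tokenizer and stopwords data must be downloaded once
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
# 1. Preprocess (text: one raw document string)
ps = PorterStemmer()
stop = set(stopwords.words('english')) - {'not'}
tokens = nltk.word_tokenize(text.lower())
clean = [ps.stem(w) for w in tokens if w not in stop]
# 2. Encode + Pad (sentences: list of cleaned strings)
encoded = [one_hot(s, 5000) for s in sentences]  # 5000 = vocabulary size
X = pad_sequences(encoded, maxlen=50, padding='pre')
# 3. Model
model = Sequential([Embedding(5000, 40, input_length=50), Bidirectional(LSTM(100, dropout=0.2)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')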
04 — Pre-Padding vs Post-Padding
📏 Pre-Padding (Recommended for LSTM)
[0, 0, 0, 42, 8, 11, 7, 3]
Zeros at the front. The actual text sits at the end. For an LSTM reading left→right, the meaningful content arrives last — it sits fresh in the final hidden state used for prediction. Generally better for standard LSTMs.
pad_sequences(X, padding='pre') # default
📏 Post-Padding
[42, 8, 11, 7, 3, 0, 0, 0]
Zeros at the end. The LSTM processes real content first, then multiple zeros. The trailing zeros can dilute the context in the final hidden state, reducing prediction quality. Use only when the architecture requires it.
pad_sequences(X, padding='post')
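The two padding modes are simple enough to reproduce by hand, which helps when checking what the model actually receives. This is a sketch of the behaviour for a single sequence, not the Keras implementation; like the Keras default, it keeps the most recent tokens when a sequence is too long.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad (or truncate) a list of token ids to exactly maxlen items."""
    seq = sequence[-maxlen:]                    # keep the most recent tokens if too long
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

print(pad([42, 8, 11, 7, 3], 8, "pre"))    # [0, 0, 0, 42, 8, 11, 7, 3]
print(pad([42, 8, 11, 7, 3], 8, "post"))   # [42, 8, 11, 7, 3, 0, 0, 0]
```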
05 — Tools, Resources & What to Learn Next
🛠️ Tools & Resources
☁️ Google Colab: Free GPU for sequence model training. Recommended for all LSTM/Transformer experiments.
📖 Colah's LSTM Blog: Chris Olah's blog (colah.github.io), the definitive visual breakdown of LSTM gates. Essential companion reading. Highly recommended by practitioners worldwide.
Libraries: TensorFlow / Keras, NLTK, Gensim, scikit-learn. Keras: one_hot, pad_sequences, Embedding, LSTM, Bidirectional. NLTK: word_tokenize, PorterStemmer, stopwords.
🚀 What to Learn Next (After This Guide)
1. Encoder-Decoder Architectures
Logical evolution from Bi-LSTMs. Seq2Seq for translation — sequence in, sequence out with attention.
2. Transformers & Self-Attention
Completely replaces RNN-based constraints. All tokens in parallel. Foundation of BERT and GPT. (Covered in Vol VIII)
3. Pre-trained LLMs — BERT & Hugging Face
Fine-tune BERT for classification, NER, QA without training from scratch. HuggingFace transformers library.
VIII
Volume 8 · State of the Art
RNN → Transformers — "Attention is All You Need"
01 — Architecture Evolution
🔁 Problem: Simple RNN · ✗ Vanishing ∇
→ 🔀 Fix A: LSTM/GRU · ✓ Gated memory
→ ↔️ Fix B: Bi-LSTM · ✓ Both contexts
→ 🔄 Architecture: Seq2Seq · ✗ Bottleneck
→ 👁️ Breakthrough: Attention · ✓ All states visible
→ ⚡ SOTA: Transformer · ✓ Parallel · ✓ Self-attention
01b — GRU Gates — Context Within the Evolution
🔀 GRU Update Gate — How It Handles Context Switching
The update gate decides how much past hidden state to carry forward. When output is 0, the old state is entirely replaced. When output is 1, it is fully retained.
Example: Subject switches mid-sentence
"Mr. Watson was tired. His friend arrived." → Update gate drops toward 0: the "Mr. Watson" context is overwritten by the new candidate state. Reset gate drops toward 0: the candidate is built mostly from the new input, making room for "his friend".
zₜ≈0: ignore past, use new · zₜ≈1: retain past state · Lighter than LSTM (2 gates, 1 state)
🔀 Bi-LSTM — "Bull is Going High"
Standard LSTM reading "Bull is going ___" cannot determine if this is financial context without seeing "high" that comes after. Bi-LSTM reads both directions and concatenates.
Predicts every word using both past AND future words. Critical for NER, fill-in-blank, QA tasks.
02 — Encoder-Decoder (Seq2Seq) Architecture
🔄 Seq2Seq: How It Works & Why It Breaks on Long Sentences
Encoder Role
Ingests input sequence. Ignores individual step outputs. Creates one final context vector summarising the whole input.
⚠ Context Bottleneck
One fixed-size vector cannot hold all info from 100+ words. BLEU score drops sharply with longer sentences — information is simply lost.
Decoder Role
Takes context vector. Predicts output one token at a time until an end-of-sequence (EOS) token is generated. Auto-regressive.
BLEU Score (Bilingual Evaluation Understudy) — standard metric for translation quality. Measures how closely machine output matches human reference translations. Range 0–1 (higher = better). Sharp drop on long sentences = bottleneck problem.
03 — Attention Mechanism — Fixing the Bottleneck
👁️ Attention: Give the Decoder Access to ALL Encoder States
Instead of squeezing everything into one context vector, Attention lets the decoder look back at every encoder hidden state at each decode step — creating a dynamic weighted context that focuses on the most relevant input words.
Standard Encoder-Decoder Attention
Q comes from the decoder. K and V come from the encoder. Lets the decoder ask "which input word should I focus on right now?" at each generation step.
Self-Attention (Transformer)
Q, K, and V all come from the same sequence. Lets every word attend to every other word within the same sentence — understanding "it" refers to "animal".
How Attention Fixes the Bottleneck
❌ Without attention: 100-word sentence → 1 fixed vector → decoder has no idea which input words matter for each output token.
✅ With attention: Decoder dynamically computes a new weighted context at every step, attending most to the relevant encoder states.
✅ Result: BLEU scores stay high even on very long sentences. Long-range dependencies preserved.
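The "dynamic weighted context" can be computed in a few lines. This sketch uses scaled dot-product scoring for a single decoder query against three hypothetical 2-dimensional encoder states, with keys and values set equal for simplicity; real models derive Q, K, and V through learned projection matrices.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)                                   # one weight per encoder state
    context = [sum(w * v[i] for w, v in zip(weights, values))   # weighted sum of values
               for i in range(len(values[0]))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical hidden states
weights, context = attention([10.0, 0.0], encoder_states, encoder_states)
print([round(w, 4) for w in weights])
print([round(c, 4) for c in context])
```

The query points along the first dimension, so almost all the weight goes to the first and third states and the second is nearly ignored. A different query at the next decode step would produce a different context vector.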
Run 8 parallel attention heads with different Q/K/V weight matrices. Each learns a different relationship. Concat + linear → richer representation. "it" → simultaneously links to "animal" AND "tired".
MultiHead(Q,K,V) = Concat(head₁,…,head₈) × W^O
📍 Positional Encoding
Parallel processing loses word order. PE adds sin/cos vectors to embeddings encoding position and relative distance. Without it: "dog bites man" = "man bites dog" to the Transformer.
Input = WordEmbed + PositionVector(sin/cos)
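The sin/cos position vector follows the scheme in Vaswani et al. (2017): even dimensions use sine, odd dimensions use cosine, and the wavelength grows with the dimension index. A minimal sketch for one position, with a hypothetical small d_model:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position, as a list of d_model values."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))       # even dimension
        if i + 1 < d_model:
            pe.append(math.cos(angle))   # odd dimension
    return pe

print(positional_encoding(0, 4))   # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
print(positional_encoding(2, 4))
```

Each position gets a distinct vector, and adding it to the word embedding is what lets the model tell "dog bites man" from "man bites dog".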
↩️ Residual + Layer Norm
Bypass shortcut around each sublayer. If self-attention isn't useful, data skips it. Prevents vanishing gradients in deep 6-layer stacks. Input + SubLayer(Input) → LayerNorm.
Output = LayerNorm(x + SubLayer(x))
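The formula above composes two small pieces. This sketch normalises across the features of one token vector and leaves out LayerNorm's learnable gain and bias for brevity; `sublayer` stands in for self-attention or the feed-forward block and is a hypothetical callable.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector to mean 0 and variance ~1 across its features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Post-norm residual: LayerNorm(x + SubLayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
useless_sublayer = lambda v: [0.0] * len(v)      # contributes nothing
print(residual_block(x, useless_sublayer))       # same as layer_norm(x): input passes through
```

When the sublayer outputs zeros, the input still reaches the next layer through the shortcut, which is the bypass behaviour described above.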
06 — BERT vs GPT — Understanding Modern Architectures (Added)
BERT vs GPT — The Two Dominant Transformer Paradigms
Dimension | BERT (Encoder-only) | GPT (Decoder-only)
Architecture | Bidirectional encoder only (no decoder) | Causal decoder only (left-to-right masked)
Training task | Masked Language Modelling (fill in [MASK] tokens) | Next-token prediction (autoregressive)
Context direction | Sees all tokens (past + future) simultaneously | Only past tokens (masked future)
Best at | Classification, NER, QA, embeddings | Text generation, completion, coding
Fine-tune for | Sentiment, intent detection, information extraction | Chatbots, summarisation, code generation
Examples | BERT · RoBERTa · DistilBERT | GPT-2/3/4 · LLaMA · Mistral
Practical rule: Need to understand text → BERT. Need to generate text → GPT. Need both → Encoder-Decoder (T5, BART). For RAG retrieval → BERT embeddings. For RAG generation → GPT-family LLM.
Scaling Laws & Why Transformers Replaced RNNs in Production
Parallelism
RNNs process sequentially (each step depends on previous). Transformers process ALL tokens simultaneously → far better GPU utilisation during training.
Scaling
Kaplan et al. (2020): Language-model loss falls as a power law in compute, data, and parameters. RNNs don't benefit nearly as much from scale.
Long-range Dependencies
Every token attends to every other token in O(1) steps. RNNs need O(n) steps. Critical for long documents and code.
When to still use RNN/LSTM
✓ Edge deployment (Transformer too large)
✓ Real-time streaming (step-by-step inference with a fixed-size state)
✓ Very limited data (<10K sequences)
✓ Fixed-length time series with no long-range deps
✓ Teaching / understanding sequence models
When to use Transformers
✓ Any production NLP task in 2024+
✓ Long documents, code, multimodal
✓ When GPU compute is available
✓ Transfer learning from pretrained models
✓ Building LLM-powered applications
Master Quick-Reference — One Page to Rule Them All
Complete Deep Learning & NLP — Instant Reference
Vanishing Gradient
σ' ≤ 0.25 → 0.25ⁿ → 0
Fix: ReLU for FFNNs, LSTM/GRU for RNNs. Residual connections for Transformers.
Weight Init
Xavier (Sigmoid) · He (ReLU)
Never all-zeros. He (normal): std = √(2/nᵢₙ). Xavier (uniform): limit = √(6/(nᵢₙ+nₒᵤₜ)).
Activation Quick Rule
ReLU hidden · Softmax out
Sigmoid only binary output. Avoid Sigmoid in feed-forward hidden layers. Leaky/PReLU if neurons die.
Loss Function
MSE=reg · BCE=binary · CCE=multi
Huber if outliers. Sparse CCE skips one-hot encoding. Always match loss to output type.
Optimizer
Adam by default · AdamW for LLMs
Never vanilla GD. β₁=0.9, β₂=0.999, ε=1e-8. Add LR schedule for best results.
CNN Output Size
⌊(n+2p−f)/s⌋ + 1
Same padding: p=⌊f/2⌋. Always ReLU after conv. Max pooling for location invariance.
NLP Vectorization
BoW→TF-IDF→Word2Vec→BERT
Production: use BERT embeddings. BoW/TF-IDF for prototypes. Word2Vec when BERT is too heavy.
He init → Batch Norm → Dropout(0.2–0.5) → LR schedule → Early stopping → Monitor val loss.
IX
Volume 9
Transformers, BERT & GPT
Attention Mechanisms · Encoder vs Decoder · Transfer Learning · Meta-Learning
01 — Full Transformer Architecture (Vaswani et al. 2017)
The diagram below reproduces the full Encoder–Decoder Transformer architecture from "Attention is All You Need" (Vaswani et al., 2017) — the same diagram from Jay Alammar's famous illustrated guide. The left stack is the Encoder. The right stack is the Decoder. Each contains Self-Attention → Add & Normalize → Feed-Forward → Add & Normalize. The Decoder adds a third sublayer: Encoder–Decoder Attention which attends over the encoder's output. A Linear + Softmax layer on top converts decoder output to a probability distribution over the vocabulary.
Encoder (Left Stack)
Each encoder has 2 sublayers: Self-Attention → Add&Norm, then Feed-Forward → Add&Norm. Residual connections (dashed arrows) bypass each sublayer — gradient highway. Output is a rich contextual representation of the input sequence. BERT uses encoder-only.
Decoder (Right Stack)
Each decoder has 3 sublayers: Masked Self-Attention (prevents future-token peeking) → Add&Norm, then Encoder-Decoder Attention (attends over encoder output) → Add&Norm, then Feed-Forward → Add&Norm. GPT uses decoder-only.
Enc-Dec Attention (Pink)
The critical bridge. Queries come from the decoder; Keys and Values come from the final encoder output. This is how the decoder "reads" the encoded input while generating each output token — context flows from encoder to decoder through this layer.
02 — The Evolution of NLP
Legacy Era (2013–2016)
Word2Vec, n-grams, RNNs, LSTMs dominated NLP. Models processed tokens one at a time, which is inherently sequential and slow. BiLSTMs attempted bidirectionality by concatenating passes, but static embeddings such as Word2Vec gave “bank” in “riverbank” and “bank robber” the same vector, with no contextual awareness.
Transformer Breakthrough (2017)
“Attention is All You Need” replaced recurrence entirely. All tokens processed simultaneously — massive parallelization. Self-attention computes contextual relationships between every pair of words in one matrix operation. Training time dropped dramatically.
03 — Self-Attention Mechanism
04 — BERT (Encoder-Only) & GPT (Decoder-Only)
BERT — Encoder Stack
Stacks encoder blocks only. Bi-directional context. Pre-trained with MLM (mask 15% of tokens, predict them) + NSP (predict if sentence B follows A). Fine-tune by replacing output layer for specific tasks.
Stacks decoder blocks only. Uni-directional (left-to-right). Pre-trained by predicting the next word. Evolved from fine-tuning (GPT-1) → zero-shot (GPT-2, 1.5B) → few-shot meta-learning (GPT-3, 175B).
GPT-1: 117M — fine-tune per task
GPT-2: 1.5B — zero-shot learning
GPT-3: 175B — few-shot (10-100 examples)
No weight updates at inference time
Dimension | BERT | GPT
Architecture | Encoder-only | Decoder-only
Direction | Bi-directional | Uni-directional
Pre-training | MLM + NSP | Next-word prediction
Strength | Understanding / classifying | Generating / completing
Adaptation | Fine-tune with a task-specific output layer | Prompt / meta-learning
Token limit | 512 (strict) | 2k–128k+ model-dep.
Cheat Summary
Transformer
Encoder + Decoder, parallel
2017. Attention only. No recurrence. Q/K/V matrices. BERT=Encoder. GPT=Decoder.
Every sublayer. Residual = gradient highway. LayerNorm = stabilizes activations. No vanishing gradients.
Enc-Dec Attention
Q=decoder, K/V=encoder
Bridge between stacks. Decoder queries the encoder's output for context on each output token.
Positional Encoding
sin/cos waves by position
Tokens enter all at once → no inherent order. PE injects order. The original Transformer uses fixed sin/cos waves; BERT and GPT learn their position embeddings. BERT: max 512. GPT-3: 2048.
Transfer Learning
Pre-train → save → fine-tune
Rarely train from scratch. HuggingFace: bert-base-uncased. Fine-tune = add a task-specific output layer, then train on labelled data (usually updating all weights at a small learning rate).
Q&A Flashcards — Exam & Interview Prep
Why can Transformers parallelize but LSTMs cannot?
LSTMs: step t needs step t−1 output — inherently sequential. Transformers: attention computed on all tokens simultaneously in one matrix operation → full GPU/TPU utilization.
Remove Position Embeddings from BERT — what breaks?
All tokens enter without order. “Dog bites man” = “Man bites dog”. Grammar and syntax collapse. Transformer becomes a bag-of-words model.
Why mask exactly 15% in MLM?
Too low (5%) → training is computationally expensive for little signal. Too high (50%) → destroys surrounding context needed to predict the mask. 15% balances both.
Fine-tuning vs meta-learning in GPT?
Fine-tuning: gradients flow, weights update, typically needs thousands of labelled examples or more. Meta-learning (in-context learning): weights frozen, instructions + examples supplied as tokens in the context window at inference only.
Feed a 1000-word essay into BERT — what happens?
BERT cannot process it in one pass. Position embeddings are only defined up to token 512, so the tokenizer must truncate the overflow or the model raises an error. Use Longformer, BigBird, or chunk the document into 512-token segments.
Why does Enc-Dec Attention use Q from decoder and K/V from encoder?
The decoder is generating output tokens. It needs to query the encoder's understanding of the full input. Q = “what am I looking for?”, K/V = “what did the encoder understand?”.
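The chunking workaround from the long-essay flashcard can be sketched as a small helper. The function name and the optional overlap are hypothetical; the default of 510 assumes two positions are reserved for the [CLS] and [SEP] special tokens that BERT adds to each segment.

```python
def chunk_tokens(token_ids, max_len=510, overlap=0):
    """Split a long list of token ids into windows of at most max_len items.

    overlap (must be smaller than max_len) repeats that many tokens between
    neighbouring chunks so context is not cut abruptly at a boundary.
    """
    step = max_len - overlap
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), step)]

essay = list(range(1300))                 # stand-in for 1300 token ids
chunks = chunk_tokens(essay)
print([len(c) for c in chunks])           # [510, 510, 280]
```

Each chunk is then encoded separately and the per-chunk predictions are combined, for example by averaging or taking the maximum score.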
Sources & Further Reading
🔗 Original Paper
PAPER
"Attention is All You Need" — Vaswani et al. (2017)
The original Transformer paper introducing the full Encoder–Decoder architecture with multi-head self-attention. Google Brain / Google Research.
Step-by-step visual walkthrough of the Transformer architecture including the encoder-decoder stack, attention mechanism, and how words flow through each layer.
Deep technical walk-through of all three GPT generations covering the shift from fine-tuning to in-context few-shot learning and the scaling hypothesis.
The gold-standard illustrated walkthrough of the Transformer. The architecture diagram in this volume is based on Jay Alammar's visuals. Covers every component step-by-step with animations.
Companion post to the Transformer guide. Explains BERT's pre-training tasks, token/segment/position embeddings, and fine-tuning in detail with illustrated examples.
Comprehensive mathematical survey of all attention variants — self-attention, multi-head, cross-attention, and beyond. Essential for understanding the math behind every attention mechanism.