Transformers, end-to-end from scratch
The fastest way to lose intuition for transformers is to start from nn.MultiheadAttention. The fastest way to get it back is to write one without it. This post walks through the version I built — not the cleanest implementation, but the one where every line earned its place.
Why bother
I work on agents day-to-day. The models underneath are transformer-shaped, but I had been treating "attention" as a black box for longer than I'd like to admit. Specifically: I could derive softmax-attention on a whiteboard, but I couldn't tell you why the KV cache lives where it does in a production inference stack, or what changes about gradients when you swap RoPE for ALiBi.
Building one from scratch fixes that — not because the math is hard, but because every shortcut you'd normally take forces you to confront a tradeoff.
The pieces, in order
- Tokenization — byte-pair encoding, and why "just use a bigger vocab" stops working.
- Embeddings + positional encoding — sinusoidal vs learned vs RoPE, and what each one breaks at long context.
- Scaled dot-product attention — the √d_k term has a real reason, and it's not "stability."
- Multi-head attention — the projection trick, and where the parameter count actually comes from.
- Feed-forward block — the 4× expansion ratio is a hyperparameter, not a law.
- Layer norm placement — pre-norm vs post-norm, and why every modern model picked pre.
- Causal masking and the KV cache — the part I most wanted to internalize.
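To make the last item concrete, here is a minimal single-head sketch in plain Python (standard library only, no PyTorch) of why a KV cache gives the same answer as masked attention. It is an illustration under stated assumptions, not the code from the repo: the names `causal_attention_full`, `KVCache`, and `rand_matrix` are hypothetical, and the inputs are random vectors standing in for already-projected queries, keys, and values.

```python
import math
import random

NEG_INF = float("-inf")

def softmax(xs):
    m = max(xs)  # subtract the max so exp() never sees a large positive number
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, values):
    # Combine value vectors using the attention weights.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def causal_attention_full(Q, K, V):
    """Score every (query, key) pair, then mask out keys from the future."""
    d_k = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        # Position i may only attend to positions 0..i.
        masked = [s if j <= i else NEG_INF for j, s in enumerate(scores)]
        out.append(weighted_sum(softmax(masked), V))
    return out

class KVCache:
    """Hypothetical helper: stores keys and values of tokens seen so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append the new token's key/value, then attend over the whole cache.
        # No mask is needed: the cache only ever contains the past.
        self.keys.append(k)
        self.values.append(v)
        d_k = len(q)
        scores = [dot(q, key) / math.sqrt(d_k) for key in self.keys]
        return weighted_sum(softmax(scores), self.values)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
T, d = 6, 4  # sequence length and head dimension, chosen arbitrarily
Q, K, V = rand_matrix(T, d), rand_matrix(T, d), rand_matrix(T, d)

full = causal_attention_full(Q, K, V)
cache = KVCache()
incremental = [cache.step(q, k, v) for q, k, v in zip(Q, K, V)]

max_diff = max(abs(a - b)
               for row_full, row_inc in zip(full, incremental)
               for a, b in zip(row_full, row_inc))
print(f"max |full - cached| = {max_diff:.2e}")
```

The sketch caches keys and values but not queries because a past token's query is never needed again, while its key and value are read by every later token. Each decoding step therefore does work proportional to the current length instead of recomputing the full score matrix.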
What I'd do differently next time
- Spend more time on numerical stability, not just correctness. My first pass was correct on toy inputs and silently wrong at sequence lengths > 512.
- Profile memory before reaching for torch.compile. Two-thirds of the wins I attributed to compilation were really just from getting the attention mask right.
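The post does not say what went wrong past 512 tokens, so the following is not a reconstruction of that bug. It is a generic sketch, in plain Python, of how "correct on toy inputs" and "numerically stable" can diverge in softmax. The function names and the choice to return zeros for a fully masked row are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def naive_softmax(xs):
    # Matches the textbook formula, but exp() overflows for large logits.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    m = max(xs)
    if m == NEG_INF:
        # Every position is masked. Without this guard, -inf - (-inf) is NaN
        # and the NaN propagates silently. Returning zeros is one possible
        # convention, assumed here for the sketch.
        return [0.0] * len(xs)
    exps = [math.exp(x - m) for x in xs]  # largest exponent is exactly 0
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same gaps between logits, shifted up

# On small logits the two versions agree, so a toy test passes either way.
agree_on_small = all(abs(a - b) < 1e-12
                     for a, b in zip(naive_softmax(small), stable_softmax(small)))

# On large logits the naive version fails; the stable one is shift-invariant.
try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_large = stable_softmax(large)
fully_masked = stable_softmax([NEG_INF, NEG_INF])
```

Pure Python raises `OverflowError` here, which is at least loud. Array libraries typically return `inf` instead, and the following division turns that into NaN without raising an exception, which is how this class of bug stays silent.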
Code
The repo is intentionally messy — I wanted the commit history to show the order in which things clicked. Linked at the bottom.
Next up in this series: harness vs harness, where the same model gets very different results on coding tasks. That post is where this foundations work starts paying rent.