NanoGPT - Shakespeare
A character-level GPT transformer trained from scratch on the complete works of Shakespeare.
Built as part of a mini AI lab that demonstrates the full lifecycle of an AI model: Train → Deploy → Chat → Evaluate
What this is
This is a minimal transformer language model with the same architecture as GPT, just smaller. It reads text one character at a time and learns to predict the next character. After training on Shakespeare, it generates text that sounds like Shakespeare.
Not because it memorized lines. Because it learned the patterns.
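
A minimal sketch of what "one character at a time" means in practice. The names `stoi`, `itos`, `encode`, and `decode` are illustrative, not necessarily those used in the training code, and the short string stands in for the full training text.

```python
# Stand-in for the full Shakespeare text.
text = "To be, or not to be"

# The vocabulary is every unique character, sorted for a stable ordering.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    """Turn a string into a list of integer ids, one per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("to be")

# Each training example pairs a character with the one that follows it.
pairs = list(zip(ids[:-1], ids[1:]))
```

On the real dataset the same construction yields the 65-entry vocabulary listed below.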
Architecture decisions
| Setting | Value | Why |
|---|---|---|
| Embedding dim | 256 | Small enough to train on a laptop. Large enough to learn vocabulary and style. |
| Attention heads | 8 | Heads can specialize: some may track character names, others meter and rhythm. |
| Transformer layers | 6 | Depth gives compositional power. Early layers tend to learn character-level patterns, later layers tend to learn structure. |
| Context window | 256 characters | Long enough for a full speech. Short enough to fit in memory. |
| Vocab size | 65 | Every unique character in Shakespeare. Character-level, with no subword tokenization. |
| Parameters | ~4.8M | Deliberately small. The goal is understanding, not scale. |
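
The ~4.8M figure can be roughly reproduced from the settings above. This is a sketch under stated assumptions, not the author's code: the `GPTConfig` field names are hypothetical, and the count assumes learned position embeddings, a 4x MLP expansion, and an untied output head, and ignores biases and layer norms.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values from the table above; field names are hypothetical.
    n_embd: int = 256
    n_head: int = 8
    n_layer: int = 6
    block_size: int = 256
    vocab_size: int = 65

def count_weights(cfg):
    """Count weight-matrix entries only (assumed layout, no biases or layer norms)."""
    d = cfg.n_embd
    tok_emb = cfg.vocab_size * d      # token embedding table
    pos_emb = cfg.block_size * d      # learned position embedding table
    attn = 4 * d * d                  # Q, K, V and output projections
    mlp = 8 * d * d                   # d -> 4d -> d feed-forward (assumed 4x)
    lm_head = d * cfg.vocab_size      # output projection (assumed untied)
    return tok_emb + pos_emb + cfg.n_layer * (attn + mlp) + lm_head

cfg = GPTConfig()
head_dim = cfg.n_embd // cfg.n_head   # 32 dimensions per head
total = count_weights(cfg)            # 4_817_408, i.e. about 4.82M
```

Almost all of the weights sit in the six transformer blocks. The embeddings and output head add under 100K.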
Training
| Setting | Value |
|---|---|
| Dataset | Complete works of Shakespeare (~1.1MB, ~1M characters) |
| Split | 90% train / 10% validation |
| Optimizer | AdamW (lr=3e-4) |
| Batch size | 32 sequences × 256 characters |
| Iterations | 5,000 |
| Final train loss | 1.1152 |
| Final val loss | 1.4818 |
| Hardware | Apple Silicon (MPS) |
| Time | ~25 minutes |
The gap between train and validation loss (1.1152 vs. 1.4818) is modest: there is some overfitting, but the model mostly learned generalizable patterns rather than memorizing lines.
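
The split and batch shape from the table can be sketched in plain Python. The function names `split_data` and `get_batch` are illustrative, and the real training code presumably works on tensors. The point here is only the 90/10 split and the one-character shift between inputs and targets.

```python
import random

def split_data(ids, train_frac=0.9):
    """First 90% of the encoded text for training, the rest for validation."""
    n = int(train_frac * len(ids))
    return ids[:n], ids[n:]

def get_batch(ids, batch_size=32, block_size=256, rng=random):
    """Sample random windows; targets are the inputs shifted by one position."""
    xs, ys = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(ids) - block_size)
        xs.append(ids[start:start + block_size])
        ys.append(ids[start + 1:start + block_size + 1])
    return xs, ys

data = list(range(10_000))            # stand-in for the encoded text
train_ids, val_ids = split_data(data)
x, y = get_batch(train_ids)           # 32 rows of 256 ids each
```

Every position in a row is a training example: the model sees characters up to position `t` and is scored on its prediction for position `t + 1`.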
Loss curve
The curve shows the model going from random guessing (loss ~4.2) to genuine pattern recognition (loss ~1.1). Each step: predict the next character → measure how wrong → adjust all 4.82M numbers slightly → repeat.
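
Where the ~4.2 comes from: assuming the loss is the mean cross-entropy in nats (which the starting value suggests), a model that guesses uniformly over 65 characters scores ln(65). Exponentiating a loss gives the perplexity, the effective number of characters the model is choosing between. A small sketch:

```python
import math

vocab_size = 65

# Uniform guessing assigns probability 1/65 to the right character.
random_guess_loss = math.log(vocab_size)   # about 4.17

def perplexity(loss):
    """Effective number of equally likely choices implied by a loss in nats."""
    return math.exp(loss)

train_ppl = perplexity(1.1152)   # about 3.05
val_ppl = perplexity(1.4818)     # about 4.40
```

So training takes the model from choosing among 65 characters to choosing among roughly three to four.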
Sample output
The way to be the gentleman king stones,
And then to be dull of the clouds of your mouth:
I'll give your daughter with you outrage,
When your slower to the cur o' the house, which shall
Follows on me.
FLORIZEL:
His good hence, I do call you thrice;
And will you not be proved with great forfeit corse,
No more to it? my noble papers lord, but he was gone.
LADY CAPULET:
And that propulate more, when many cheek,
Were it to call not foul would steal at it;
But I shall not know that I were dance
Not perfect English. But recognizably Shakespeare: character names, iambic rhythm, stage dialogue structure, period vocabulary. All learned from predicting one character at a time.
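
Generation is the same prediction run in a loop: sample a character, append it, feed the result back in. The sketch below assumes temperature sampling, which may differ from the actual generation script. `toy_logits` is a made-up stand-in for the trained model that always prefers "previous id plus one".

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def generate(next_logits, prompt_ids, n_new, block_size=256, rng=random):
    """Autoregressive loop: each new id is sampled from the model's output."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        context = ids[-block_size:]                 # crop to the context window
        ids.append(sample_next(next_logits(context), rng=rng))
    return ids

VOCAB = 5

def toy_logits(context):
    """Stand-in model: puts essentially all probability on (last id + 1) % VOCAB."""
    target = (context[-1] + 1) % VOCAB
    return [0.0 if i == target else -1e9 for i in range(VOCAB)]

out = generate(toy_logits, [0], n_new=6)            # [0, 1, 2, 3, 4, 0, 1]
```

With the real model, `next_logits` would be a forward pass returning 65 scores, and the sampled ids would be decoded back to characters.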
The key insight
GPT-3 has 175 billion parameters. This model has 4.82 million. The core architecture is the same. The same attention mechanism. The same residual connections. The same training loop. The difference is almost entirely scale: larger embeddings, more layers, more data, more compute.
When someone says "scaling laws", this is roughly what they mean: loss falls predictably as parameters, data, and compute grow together. The architecture doesn't change. Just the dials.
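
To make "just the dials" concrete: in a standard GPT block, attention holds about 4·d² weights and a 4x MLP about 8·d², so the blocks hold roughly 12 · layers · d². The sketch applies that one formula to this model and to the published GPT-3 175B configuration (96 layers, width 12288). It ignores embeddings, biases, and layer norms, and it assumes this model also uses a 4x MLP.

```python
def block_weights(n_layer, d_model):
    """Approximate weights in the transformer blocks: 12 * layers * width^2."""
    return 12 * n_layer * d_model ** 2

nano = block_weights(n_layer=6, d_model=256)       # 4_718_592
gpt3 = block_weights(n_layer=96, d_model=12288)    # 173_946_175_488

ratio = gpt3 // nano                               # 36_864
```

Sixteen times the depth and 48 times the width give about 37,000 times the weights, from the same formula.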
Files
| File | Description |
|---|---|
| `model.pt` | Trained checkpoint: weights + config + vocab mappings |
| `config.json` | Architecture configuration |
| `loss_curve.png` | Training and validation loss over 5,000 steps |
| `sample_output.txt` | Text generated by the trained model |
Code
Full training code available on GitHub. Includes:
- `model.py`: transformer architecture with detailed comments explaining every primitive
- `train.py`: training loop with explanations of every concept
- `upload_to_hf.py`: this upload script
