NanoGPT — Shakespeare

A character-level GPT transformer trained from scratch on the complete works of Shakespeare.

Built as part of a mini AI lab — demonstrating the full lifecycle of an AI model: Train → Deploy → Chat → Evaluate

What this is

This is a minimal transformer language model — the same architecture as GPT, just smaller. It reads text one character at a time and learns to predict the next character. After training on Shakespeare, it generates text that sounds like Shakespeare.

Not because it memorized lines. Because it learned the patterns.
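
To make "one character at a time" concrete, here is a minimal sketch of character-level tokenization and how next-character training pairs are formed. The names `stoi`, `itos`, `encode`, and `decode` are illustrative assumptions, not necessarily the names used in this repository, and the short string stands in for the real corpus.

```python
# Character-level tokenization: every distinct character gets an integer ID.
text = "To be, or not to be"  # stand-in for the full Shakespeare corpus

chars = sorted(set(text))                     # vocabulary (65 entries for the real corpus)
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> character


def encode(s):
    """Turn a string into a list of character IDs."""
    return [stoi[c] for c in s]


def decode(ids):
    """Turn a list of character IDs back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)

# Next-character prediction: the target sequence is the input shifted by one,
# so position t of `targets` is the character that follows position t of `inputs`.
inputs, targets = ids[:-1], ids[1:]
```

Because the vocabulary is just the set of characters seen in the data, there is no separate tokenizer to train or ship; the two dictionaries are the whole mapping.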

Architecture decisions

Setting	Value	Why
Embedding dim	256	Small enough to train on a laptop. Large enough to learn vocabulary and style.
Attention heads	8	Each head works on a 32-dimensional slice (256 / 8) and can specialize, for example one tracking speaker names and another tracking meter and rhythm.
Transformer layers	6	Depth gives compositional power. Early layers tend to capture character-level patterns, later layers tend to capture structure.
Context window	256 characters	Long enough for a full speech. Short enough to fit in memory.
Vocab size	65	Every unique character in Shakespeare. Character-level — no subword tokenization.
Parameters	~4.8M	Deliberately small. The goal is understanding, not scale.
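
As a rough check on the parameter count, the settings in the table can be plugged into a standard GPT-style layout. This is a sketch under stated assumptions: learned position embeddings, a 4x-wide MLP, biases on every linear layer, two LayerNorms per block, and an output head that shares weights with the token embedding. The `GPTConfig` field names are hypothetical and may not match `config.json`.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values from the table above; field names are illustrative.
    n_embd: int = 256
    n_head: int = 8
    n_layer: int = 6
    block_size: int = 256
    vocab_size: int = 65


def count_params(cfg):
    """Count parameters for an assumed GPT-style layout (see lead-in)."""
    d = cfg.n_embd
    tok_emb = cfg.vocab_size * d   # token embedding, assumed tied with the output head
    pos_emb = cfg.block_size * d   # learned position embedding
    attn = (d * 3 * d + 3 * d) + (d * d + d)     # fused q/k/v projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4d, project back to d
    norms = 2 * (2 * d)                          # two LayerNorms (scale + shift) per block
    block = attn + mlp + norms
    final_norm = 2 * d
    return tok_emb + pos_emb + cfg.n_layer * block + final_norm


cfg = GPTConfig()
head_dim = cfg.n_embd // cfg.n_head  # width of each attention head
total = count_params(cfg)
print(head_dim, total)
```

Under these assumptions the total is 4,821,248, in line with the ~4.8M in the table. A layout without biases or with an untied output head would shift the number slightly.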

Training


Dataset	Complete works of Shakespeare (~1.1MB, ~1M characters)
Split	90% train / 10% validation
Optimizer	AdamW (lr=3e-4)
Batch size	32 sequences × 256 characters
Iterations	5,000
Final train loss	1.1152
Final val loss	1.4818
Hardware	Apple Silicon (MPS)
Time	~25 minutes

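The validation loss sits about 0.37 above the train loss. That gap points to mild overfitting, which is unsurprising for a ~4.8M-parameter model on ~1M characters. Still, a validation loss of 1.48 is far below the ~4.2 of random guessing, so the model learned patterns that carry over to text it never saw rather than only memorizing lines.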
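
The split and batch shape listed in the table can be sketched in plain Python as follows. This is an illustration only: `split_data` and `get_batch` are hypothetical helper names, the real `train.py` presumably works on tensors, and the synthetic `ids` list stands in for the encoded corpus.

```python
import random


def split_data(ids, train_frac=0.9):
    """First 90% of the encoded text for training, the rest for validation."""
    n = int(train_frac * len(ids))
    return ids[:n], ids[n:]


def get_batch(data, batch_size=32, block_size=256, rng=random):
    """Sample `batch_size` random windows; targets are inputs shifted by one."""
    xs, ys = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(data) - block_size)
        xs.append(data[start : start + block_size])
        ys.append(data[start + 1 : start + block_size + 1])
    return xs, ys


ids = [i % 65 for i in range(10_000)]  # stand-in for the encoded corpus
train_ids, val_ids = split_data(ids)
x, y = get_batch(train_ids)
```

Each step therefore covers 32 × 256 = 8,192 character positions, so 5,000 iterations amount to roughly 41 million predictions drawn from about 1 million training characters.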

Loss curve

The curve shows the model going from random guessing (loss ~4.2) to genuine pattern recognition (loss ~1.1). Each step: predict the next character → measure how wrong → adjust all 4.82M numbers slightly → repeat.
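
The starting point of the curve can be derived rather than observed: a model that spreads probability evenly over 65 characters has cross-entropy ln(65). The sketch below checks that figure and converts the reported losses to perplexity (the effective number of characters the model is choosing between). The helper names are illustrative.

```python
import math

vocab_size = 65

# Loss of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct next character."""
    return -math.log(probs[target])


def perplexity(loss):
    """Convert a cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


uniform = [1 / vocab_size] * vocab_size
print(round(cross_entropy(uniform, 0), 3))  # matches uniform_loss
print(round(perplexity(1.1152), 2))         # final train loss
print(round(perplexity(1.4818), 2))         # final val loss
```

ln(65) is about 4.17, which matches the ~4.2 at the start of the curve. The final losses correspond to perplexities of roughly 3.05 (train) and 4.40 (validation): on held-out text the model is, on average, about as uncertain as a choice between four or five characters instead of 65.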

Sample output

The way to be the gentleman king stones,
And then to be dull of the clouds of your mouth:
I'll give your daughter with you outrage,
When your slower to the cur o' the house, which shall
Follows on me.

FLORIZEL:
His good hence, I do call you thrice;
And will you not be proved with great forfeit corse,
No more to it? my noble papers lord, but he was gone.

LADY CAPULET:
And that propulate more, when many cheek,
Were it to call not foul would steal at it;
But I shall not know that I were dance

Not perfect English. But recognizably Shakespeare — character names, iambic rhythm, stage dialogue structure, period vocabulary. All learned from predicting one character at a time.
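
Text like the sample above comes from running next-character prediction in a loop: score every character, sample one, append it, repeat. Here is a minimal sketch of that loop. `toy_logits` is a hypothetical stand-in for the trained model, and the temperature parameter is an assumption about how sampling might be configured, not a documented setting of this model.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Sample one character ID from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits_fn, context, max_new_tokens, block_size=256,
             temperature=1.0, rng=random):
    """Autoregressive generation: feed the growing sequence back into the model."""
    ids = list(context)
    for _ in range(max_new_tokens):
        window = ids[-block_size:]  # the model only sees the last 256 characters
        ids.append(sample_next(logits_fn(window), temperature, rng))
    return ids


def toy_logits(window, vocab_size=65):
    """Hypothetical stand-in model: strongly prefers (last ID + 1)."""
    logits = [0.0] * vocab_size
    logits[(window[-1] + 1) % vocab_size] = 10.0
    return logits


out = generate(toy_logits, context=[0], max_new_tokens=20, rng=random.Random(0))
```

Lower temperatures make the output more repetitive and conservative; higher temperatures produce more invented words like "propulate".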

The key insight

GPT-3 has 175 billion parameters. This model has 4.82 million. The core architecture is the same: the same attention mechanism, the same residual connections, the same training loop. The difference is almost entirely scale: larger embeddings, more layers, more data, more compute (plus subword tokens instead of single characters).

When someone says "scaling laws", this is the idea behind them: loss falls predictably as parameters, data, and compute grow together. The architecture doesn't change. Just the dials.
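
The "same architecture, different dials" point can be made with a common rule of thumb: a transformer block holds about 12 × width² weights (4 × width² for attention, 8 × width² for the MLP), so the blocks hold roughly 12 × layers × width² in total. The sketch below applies it to this model and to the published 175B GPT-3 configuration (96 layers, width 12288). It ignores embeddings, biases, and normalization parameters, so it is an estimate, not an exact count.

```python
def approx_block_params(n_layer, n_embd):
    """Rule-of-thumb weight count for the transformer blocks only."""
    return 12 * n_layer * n_embd ** 2


this_model = approx_block_params(n_layer=6, n_embd=256)
gpt3 = approx_block_params(n_layer=96, n_embd=12288)

print(f"this model: {this_model / 1e6:.2f}M")
print(f"GPT-3:      {gpt3 / 1e9:.1f}B")
print(f"ratio:      {gpt3 // this_model}x")
```

The same formula gives about 4.72M for this model and about 174B for GPT-3. The ratio of 36,864 comes from 16 times the layers and 48 times the width, squared.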

Files

File	Description
`model.pt`	Trained checkpoint — weights + config + vocab mappings
`config.json`	Architecture configuration
`loss_curve.png`	Training and validation loss over 5,000 steps
`sample_output.txt`	Text generated by the trained model

Code

Full training code available on GitHub. Includes:

model.py — transformer architecture with detailed comments explaining every primitive
train.py — training loop with explanations of every concept
upload_to_hf.py — this upload script
