NanoGPT β€” GPT-2 124M Speedrun

GPT-2 (124M) trained on FineWeb-10B in ~78 minutes on a single NVIDIA A100 80GB.

Validation loss: inf (target ≀ 3.28 on FineWeb)


Model

Architecture GPT-2 decoder-only transformer
Parameters 162,201,600 (162.2M)
Layers / Heads / Dim 12 / 12 / 768
Context length 2048 tokens
Vocabulary 50,304 (GPT-2 BPE, padded to multiple of 128)
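The 162.2M parameter count can be reproduced from the table above under a few assumptions consistent with the architecture notes below: RoPE instead of learned position embeddings, an untied output projection, and no biases. This is a back-of-the-envelope sketch, not the exact repo accounting.

```python
# Parameter-count sketch for the config above.
# Assumptions (not confirmed by the repo): no learned position embeddings
# (RoPE), untied lm_head, bias-free linears, 4x MLP expansion.
d, L, V = 768, 12, 50304

embed    = V * d                       # token embedding
per_blk  = 4 * d * d                   # attention: Wq, Wk, Wv, Wo
per_blk += 2 * d * (4 * d)             # MLP: up- and down-projection
lm_head  = V * d                       # untied output projection

total = embed + L * per_blk + lm_head
print(f"{total:,}")  # 162,201,600
```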

Architecture Improvements over Vanilla GPT-2

Muon optimizer Newton-Schulz orthogonalization for 2D weights β€” ~40% faster convergence vs AdamW
RoPE Rotary position embeddings stored as bf16 β€” zero runtime dtype casts
QK-Norm F.rms_norm on Q and K β€” stable attention at 2048-token sequences
F.rms_norm Single fused kernel β€” faster than a custom RMSNorm class
ReLUΒ² relu(x).square() β€” activation sparsity, ~5% faster than GELU
U-Net skips Encoder outputs skip-connected to mirrored decoder blocks
Logit soft-cap tanh(logits/30)*30 β€” prevents logit explosion during training
Scaled c_proj init std = 0.02/sqrt(2L) β€” matches the expected ~10.8 initial loss (β‰ˆ ln 50304)
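The core of the Muon step is a Newton-Schulz iteration that approximately orthogonalizes each 2D gradient (or momentum) matrix before the update. The sketch below uses the quintic coefficients from the public Muon implementation; treat it as an illustration of the technique, not the exact code in this repo.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix via Newton-Schulz iteration.

    Coefficients are the quintic iteration used by the public Muon
    optimizer; it drives singular values toward ~1 rather than to an
    exactly orthogonal matrix.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)          # bound the spectral norm by <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                    # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X
```

In the optimizer, this orthogonalized matrix replaces the raw momentum buffer as the update direction for each 2D weight; embeddings and other non-2D parameters stay on AdamW, as listed in the Training table below.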

Training

Dataset    : FineWeb-10B (HuggingFaceFW/fineweb, sample-10BT)
Steps      : 1,500
Batch      : 524,288 tokens/step  (32 seqs Γ— 2048 tokens Γ— 8 grad accum)
Muon LR    : 0.02  (weight matrices)
AdamW LR   : 0.006 (embeddings)
Schedule   : Trapezoidal β€” 100 warmup β†’ hold β†’ 60% cosine cooldown to 15%
Throughput : ~195,000 tokens/second
Hardware   : 1Γ— NVIDIA A100 SXM4 80GB
Time       : ~78 minutes
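The trapezoidal schedule above can be sketched as a single LR multiplier function. The exact fractions are assumptions read off the table (cosine cooldown over the final 60% of steps, decaying to 15% of the peak LR); the repo's scheduler may differ in detail.

```python
import math

def lr_scale(step: int, total_steps: int = 1500, warmup: int = 100,
             cooldown_frac: float = 0.6, final_frac: float = 0.15) -> float:
    """Trapezoidal schedule sketch: linear warmup, flat hold, then a
    cosine cooldown to `final_frac` of peak LR over the last
    `cooldown_frac` of training. Returns a multiplier for the peak LR."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup:
        return (step + 1) / warmup            # linear warmup
    if step < cooldown_start:
        return 1.0                            # hold at peak
    t = (step - cooldown_start) / (total_steps - cooldown_start)
    return final_frac + (1 - final_frac) * 0.5 * (1 + math.cos(math.pi * t))
```

Apply the multiplier to both the Muon and AdamW base LRs each step (e.g. `lr = 0.02 * lr_scale(step)` for the weight matrices).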

Quickstart

import torch, tiktoken
from huggingface_hub import hf_hub_download
import sys
sys.path.append("path/to/cloned/repo")
from model.gpt import GPT, GPTConfig

config = GPTConfig(
    vocab_size=50304, num_layers=12, num_heads=12,
    model_dim=768, max_seq_len=2048
)
model = GPT(config)
weights = torch.load(
    hf_hub_download("ashrafs1/nanogpt", "pytorch_model.pt"),
    map_location="cpu"
)
model.load_state_dict(weights)
model.eval()

enc = tiktoken.get_encoding("gpt2")
prompt = "The meaning of life is"
idx = torch.tensor([enc.encode(prompt)])

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=200, temperature=0.7, top_k=40)

print(enc.decode(out[0].tolist()))

Repository

nanogpt/
β”œβ”€β”€ model/
β”‚   β”œβ”€β”€ gpt.py          # GPT-2 architecture with speedrun improvements
β”‚   └── muon.py         # Muon optimizer + LR scheduler
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ dataloader.py   # Document-packed token dataloader
β”‚   └── cached_fineweb10B.py
β”œβ”€β”€ train.py            # Training script (W&B logging, MFU tracking)
β”œβ”€β”€ push_to_hub.py      # Uploads weights and this model card to the Hub
└── scripts/run.sh      # One-command launcher
