# NanoGPT: GPT-2 124M Speedrun

GPT-2 (124M) trained on FineWeb-10B in ~78 minutes on a single NVIDIA A100 80GB.

Validation loss: inf (target ≤ 3.28 on FineWeb)
## Model

| Property | Value |
|---|---|
| Architecture | GPT-2 decoder-only transformer |
| Parameters | 162,201,600 (162.2M) |
| Layers / Heads / Dim | 12 / 12 / 768 |
| Context length | 2048 tokens |
| Vocabulary | 50,304 (GPT-2 BPE, padded to a multiple of 128) |
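The 162.2M figure can be reproduced with a back-of-the-envelope count. The exact total only works out if we assume an untied LM head, no bias terms, and no learnable norm parameters (assumptions inferred from the number, consistent with the `F.rms_norm` usage below, not stated explicitly in the source):

```python
d, L, V = 768, 12, 50304          # model_dim, num_layers, vocab_size

embed   = V * d                   # token embeddings (RoPE, so no learned positions)
attn    = 4 * d * d               # Wq, Wk, Wv and the output projection
mlp     = 2 * (4 * d) * d         # up-projection and down-projection
blocks  = L * (attn + mlp)        # 12 transformer blocks
lm_head = V * d                   # untied output head (assumption)

total = embed + blocks + lm_head
print(f"{total:,}")               # 162,201,600
```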
## Architecture Improvements over Vanilla GPT-2

| Improvement | Effect |
|---|---|
| Muon optimizer | Newton-Schulz orthogonalization for 2D weights → ~40% faster convergence vs AdamW |
| RoPE | Rotary position embeddings stored as bf16 → zero runtime dtype casts |
| QK-Norm | `F.rms_norm` on Q and K → stable attention at 2048-token sequences |
| `F.rms_norm` | Single fused kernel → faster than a custom RMSNorm class |
| ReLU² | `relu(x).square()` → activation sparsity, ~5% faster than GELU |
| U-Net skips | Encoder outputs skip-connected to mirrored decoder blocks |
| Logit soft-cap | `tanh(logits/30)*30` → prevents logit explosion during training |
| Scaled c_proj init | std = 0.02/sqrt(2L) → yields the expected ~10.8 initial loss |
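Muon's core step, the Newton-Schulz orthogonalization mentioned above, can be sketched as follows. The quintic coefficients follow Keller Jordan's modded-nanogpt; this NumPy version is for illustration only, since the real optimizer runs the iteration in bf16 on the GPU inside the update step:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration that pushes the singular values of G
    toward 1, approximating the nearest semi-orthogonal matrix without an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from modded-nanogpt
    X = G / (np.linalg.norm(G) + eps)   # normalize so all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

Because the update is an odd polynomial in `X`, it acts on each singular value independently, driving them all toward 1 in a few matmul-only steps.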
## Training

```
Dataset    : FineWeb-10B (HuggingFaceFW/fineweb, sample-10BT)
Steps      : 1,500
Batch      : 524,288 tokens/step (32 seqs × 2048 tokens × 8 grad accum)
Muon LR    : 0.02 (weight matrices)
AdamW LR   : 0.006 (embeddings)
Schedule   : Trapezoidal: 100-step warmup, hold at peak, then cosine cooldown to 15% over the final 60% of steps
Throughput : ~195,000 tokens/second
Hardware   : 1× NVIDIA A100 SXM4 80GB
Time       : ~78 minutes
```
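The batch arithmetic and the trapezoidal schedule can be written out explicitly. The exact warmup/cooldown boundaries below are my reading of the numbers above, not confirmed against the training script:

```python
import math

seqs, seq_len, grad_accum = 32, 2048, 8
tokens_per_step = seqs * seq_len * grad_accum   # 524,288 tokens/step
total_tokens = tokens_per_step * 1500           # ~786M tokens over the run

def lr_scale(step, total=1500, warmup=100, cooldown_frac=0.6, floor=0.15):
    """Trapezoidal schedule: linear warmup, flat top at 1.0, then cosine
    decay to `floor` of peak over the final `cooldown_frac` of training."""
    cooldown_start = int(total * (1 - cooldown_frac))
    if step < warmup:
        return step / warmup
    if step < cooldown_start:
        return 1.0
    t = (step - cooldown_start) / (total - cooldown_start)
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * t))
```

Multiply `lr_scale(step)` by the peak LR (0.02 for Muon, 0.006 for AdamW) to get the per-step learning rate.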
## Quickstart

```python
import sys

import torch
import tiktoken
from huggingface_hub import hf_hub_download

sys.path.append("path/to/cloned/repo")  # model code lives in the repo, not the checkpoint
from model.gpt import GPT, GPTConfig

config = GPTConfig(
    vocab_size=50304, num_layers=12, num_heads=12,
    model_dim=768, max_seq_len=2048,
)
model = GPT(config)
weights = torch.load(
    hf_hub_download("ashrafs1/nanogpt", "pytorch_model.pt"),
    map_location="cpu",
)
model.load_state_dict(weights)
model.eval()

enc = tiktoken.get_encoding("gpt2")
prompt = "The meaning of life is"
idx = torch.tensor([enc.encode(prompt)])
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=200, temperature=0.7, top_k=40)
print(enc.decode(out[0].tolist()))
```
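Some of the tricks from the improvements table are one-liners. Here is a scalar sketch of the logit soft-cap and the ReLU² activation (the model applies both elementwise to tensors; these stdlib versions are just to show the math):

```python
import math

def softcap(logit, cap=30.0):
    """tanh soft-cap: squashes any logit into (-cap, cap) while staying
    near-identity for |logit| << cap."""
    return math.tanh(logit / cap) * cap

def relu_squared(x):
    """ReLU² activation: zero for negative inputs, x² otherwise."""
    return max(x, 0.0) ** 2
```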
## Repository

```
nanogpt/
├── model/
│   ├── gpt.py               # GPT-2 architecture with speedrun improvements
│   └── muon.py              # Muon optimizer + LR scheduler
├── data/
│   ├── dataloader.py        # Document-packed token dataloader
│   └── cached_fineweb10B.py
├── train.py                 # Training script (W&B logging, MFU tracking)
├── push_to_hub.py           # This script
└── scripts/run.sh           # One-command launcher
```
## References

- modded-nanogpt by Keller Jordan
- nanogpt-speedrun by Tyler Romero
- nanoGPT by Andrej Karpathy