NanoGPT β€” GPT-2 124M Speedrun

GPT-2 (124M) trained on FineWeb-10B in ~78 minutes on a single NVIDIA A100 80GB.

Validation loss: inf (target ≀ 3.28 on FineWeb)


Model

Architecture GPT-2 decoder-only transformer
Parameters 162,201,600 (162.2M)
Layers / Heads / Dim 12 / 12 / 768
Context length 2048 tokens
Vocabulary 50,304 (GPT-2 BPE, padded to multiple of 128)
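The 162.2M parameter count can be reproduced from the table above under a few assumptions consistent with the architecture notes below: RoPE instead of learned position embeddings, an untied output projection, and no biases. This is a back-of-the-envelope sketch, not the exact repo accounting.

```python
# Parameter-count sketch for the config above.
# Assumptions (not confirmed by the repo): no learned position embeddings
# (RoPE), untied lm_head, bias-free linears, 4x MLP expansion.
d, L, V = 768, 12, 50304

embed    = V * d                       # token embedding
per_blk  = 4 * d * d                   # attention: Wq, Wk, Wv, Wo
per_blk += 2 * d * (4 * d)             # MLP: up- and down-projection
lm_head  = V * d                       # untied output projection

total = embed + L * per_blk + lm_head
print(f"{total:,}")  # 162,201,600
```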

Architecture Improvements over Vanilla GPT-2

Muon optimizer Newton-Schulz orthogonalization for 2D weights β€” ~40% faster convergence vs AdamW
RoPE Rotary position embeddings stored as bf16 β€” zero runtime dtype casts
QK-Norm F.rms_norm on Q and K β€” stable attention at 2048-token sequences
F.rms_norm Single fused kernel β€” faster than a custom RMSNorm class
ReLUΒ² relu(x).square() β€” activation sparsity, ~5% faster than GELU
U-Net skips Encoder outputs skip-connected to mirrored decoder blocks
Logit soft-cap tanh(logits/30)*30 β€” prevents logit explosion during training
Scaled c_proj init std = 0.02/sqrt(2L) β€” matches the expected ~10.8 initial loss (β‰ˆ ln 50304)
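The core of the Muon step is a Newton-Schulz iteration that approximately orthogonalizes each 2D gradient (or momentum) matrix before the update. The sketch below uses the quintic coefficients from the public Muon implementation; treat it as an illustration of the technique, not the exact code in this repo.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix via Newton-Schulz iteration.

    Coefficients are the quintic iteration used by the public Muon
    optimizer; it drives singular values toward ~1 rather than to an
    exactly orthogonal matrix.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)          # bound the spectral norm by <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                    # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X
```

In the optimizer, this orthogonalized matrix replaces the raw momentum buffer as the update direction for each 2D weight; embeddings and other non-2D parameters stay on AdamW, as listed in the Training table below.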

Training

Dataset    : FineWeb-10B (HuggingFaceFW/fineweb, sample-10BT)
Steps      : 1,500
Batch      : 524,288 tokens/step  (32 seqs Γ— 2048 tokens Γ— 8 grad accum)
Muon LR    : 0.02  (weight matrices)
AdamW LR   : 0.006 (embeddings)
Schedule   : Trapezoidal β€” 100 warmup β†’ hold β†’ 60% cosine cooldown to 15%
Throughput : ~195,000 tokens/second
Hardware   : 1Γ— NVIDIA A100 SXM4 80GB
Time       : ~78 minutes
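The trapezoidal schedule above can be sketched as a single LR multiplier function. The exact fractions are assumptions read off the table (cosine cooldown over the final 60% of steps, decaying to 15% of the peak LR); the repo's scheduler may differ in detail.

```python
import math

def lr_scale(step: int, total_steps: int = 1500, warmup: int = 100,
             cooldown_frac: float = 0.6, final_frac: float = 0.15) -> float:
    """Trapezoidal schedule sketch: linear warmup, flat hold, then a
    cosine cooldown to `final_frac` of peak LR over the last
    `cooldown_frac` of training. Returns a multiplier for the peak LR."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup:
        return (step + 1) / warmup            # linear warmup
    if step < cooldown_start:
        return 1.0                            # hold at peak
    t = (step - cooldown_start) / (total_steps - cooldown_start)
    return final_frac + (1 - final_frac) * 0.5 * (1 + math.cos(math.pi * t))
```

Apply the multiplier to both the Muon and AdamW base LRs each step (e.g. `lr = 0.02 * lr_scale(step)` for the weight matrices).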

Quickstart

import torch, tiktoken
from huggingface_hub import hf_hub_download
import sys
sys.path.append("path/to/cloned/repo")
from model.gpt import GPT, GPTConfig

config = GPTConfig(
    vocab_size=50304, num_layers=12, num_heads=12,
    model_dim=768, max_seq_len=2048
)
model = GPT(config)
weights = torch.load(
    hf_hub_download("ashrafs1/nanogpt", "pytorch_model.pt"),
    map_location="cpu"
)
model.load_state_dict(weights)
model.eval()

enc = tiktoken.get_encoding("gpt2")
prompt = "The meaning of life is"
idx = torch.tensor([enc.encode(prompt)])

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=200, temperature=0.7, top_k=40)

print(enc.decode(out[0].tolist()))

Repository

nanogpt/
β”œβ”€β”€ model/
β”‚   β”œβ”€β”€ gpt.py          # GPT-2 architecture with speedrun improvements
β”‚   └── muon.py         # Muon optimizer + LR scheduler
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ dataloader.py   # Document-packed token dataloader
β”‚   └── cached_fineweb10B.py
β”œβ”€β”€ train.py            # Training script (W&B logging, MFU tracking)
β”œβ”€β”€ push_to_hub.py      # Uploads weights and this model card to the Hub
└── scripts/run.sh      # One-command launcher
