mrinaal-124m-base
A 123.6M-parameter decoder-only causal language model trained from scratch on a 2B-token FineWeb-Edu slice. It is GPT-2-small scale and uses GPT-2 tokenization, but the architecture is not GPT-2: it uses RoPE, RMSNorm, SwiGLU, bias-free linear layers, and tied input/output embeddings.
model config
| param | value |
|---|---|
| parameters | 123,551,232 |
| layers | 12 |
| hidden size | 768 |
| attention heads | 12 |
| context length | 1024 tokens |
| vocab size | 50257 |
| positional encoding | RoPE |
| norm | RMSNorm |
| activation | SwiGLU |
| dropout | 0.0 |
| tokenizer | GPT-2 tokenizer |
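
The parameter total in the table can be reproduced from the other rows with a little arithmetic. The sketch below is a sanity check, not a description of the actual model code: the SwiGLU hidden width (`ffn_hidden = 2048`), the two RMSNorm gains per block, and the single final RMSNorm are assumptions that this card does not state. They were chosen because they reproduce the published count exactly. RoPE contributes no learned parameters, and the tied embedding is counted once.

```python
# Values taken from the model config table.
vocab_size = 50257
n_layer = 12
n_embd = 768

# ASSUMPTION: SwiGLU hidden width is not stated in this card.
ffn_hidden = 2048

embedding = vocab_size * n_embd       # tied input/output embedding, counted once
attention = 4 * n_embd * n_embd       # q, k, v and output projections, no biases
swiglu = 3 * n_embd * ffn_hidden      # gate, up and down projections, no biases
norms_per_layer = 2 * n_embd          # ASSUMPTION: two RMSNorm gain vectors per block
final_norm = n_embd                   # ASSUMPTION: one RMSNorm before the output head

total = embedding + n_layer * (attention + swiglu + norms_per_layer) + final_norm
print(f"{total:,}")  # 123,551,232
```

Of that total, 38,597,376 parameters sit in the embedding matrix, so roughly 85M are in the transformer blocks.
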
training
- dataset: FineWeb-Edu, pretokenized with the GPT-2 tokenizer
- training tokens: 2,000,000,000 configured; 1,998,848,000 tokens seen
- validation tokens: 20,000,000
- batch size: 8 sequences
- sequence length: 1024 tokens
- optimizer steps: 244,000
- learning rate: 3e-4 max, 3e-5 min, 2,000 warmup steps
- weight decay: 0.1
- gradient clipping: 1.0
- hardware: NVIDIA H100
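
The gap between the configured and seen token counts follows from the step count. A minimal check, assuming each optimizer step consumes exactly one batch of 8 sequences (that is, no gradient accumulation, which this card does not mention):

```python
# Values taken from the training list above.
batch_size = 8                 # sequences per optimizer step
seq_len = 1024                 # tokens per sequence
optimizer_steps = 244_000
configured_tokens = 2_000_000_000

# ASSUMPTION: one batch per optimizer step, no gradient accumulation.
tokens_per_step = batch_size * seq_len
tokens_seen = tokens_per_step * optimizer_steps
shortfall = configured_tokens - tokens_seen

print(f"tokens per step: {tokens_per_step:,}")  # 8,192
print(f"tokens seen:     {tokens_seen:,}")      # 1,998,848,000
print(f"shortfall:       {shortfall:,}")        # 1,152,000
```

The shortfall arises because 2,000,000,000 is not a multiple of 8,192 × 1,000; the run stops at a round 244,000 steps rather than partway through the next thousand.
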
metrics
This repo currently publishes model_best.safetensors, converted from /vol/checkpoints/124m_main_2b/best.pt.
| metric | value |
|---|---|
| best validation loss | 3.4553542232513426 |
| final train loss | 3.706049680709839 |
| final validation loss | 3.5147283601760866 |
| final validation perplexity | 33.60679773929784 |
| elapsed training time | 21,466.73 seconds |
The full run reached step 244,000. The published checkpoint is the one with the lowest validation loss seen during training (3.455), not the last one, which is why its validation loss is lower than the final validation loss (3.515).
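
The perplexity row in the table is the exponential of the final validation loss, assuming the loss is the mean per-token cross-entropy in nats (the usual convention, though not stated here). The sketch below recomputes it, applies the same conversion to the best validation loss, and derives an average throughput from the elapsed time and the 1,998,848,000 tokens seen.

```python
import math

# Values taken from the metrics table and the training section.
best_val_loss = 3.4553542232513426
final_val_loss = 3.5147283601760866
reported_final_ppl = 33.60679773929784
elapsed_seconds = 21_466.73
tokens_seen = 1_998_848_000

# ASSUMPTION: losses are mean per-token cross-entropy in nats.
final_ppl = math.exp(final_val_loss)
best_ppl = math.exp(best_val_loss)

elapsed_hours = elapsed_seconds / 3600
tokens_per_second = tokens_seen / elapsed_seconds

print(f"final perplexity: {final_ppl:.4f} (reported {reported_final_ppl:.4f})")
print(f"best perplexity:  {best_ppl:.4f}")
print(f"elapsed:          {elapsed_hours:.2f} h")
print(f"throughput:       {tokens_per_second:,.0f} tokens/s")
```

The run took just under six hours, which works out to roughly 93,000 tokens per second on average.
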
files
- model_best.safetensors: checkpoint with the lowest validation loss during training
- run_summary.json: full training run metadata
- model_last.safetensors is not currently uploaded; this repo intentionally publishes the best checkpoint only.
loading
from safetensors.torch import load_file
state_dict = load_file("model_best.safetensors")
To load the weights into the original model class, clone the training repo and run:
from safetensors.torch import load_file
from first_llm_pretrain.model import DecoderOnlyTransformer, ModelConfig

config = ModelConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=12,
    n_head=12,
    n_embd=768,
)
model = DecoderOnlyTransformer(config)
model.load_state_dict(load_file("model_best.safetensors"), strict=False)
model.eval()  # switch to inference mode
strict=False is used because the safetensors conversion removes the duplicate lm_head.weight tensor and keeps token_embedding.weight; the original model class ties those weights.
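
The effect of `strict=False` with tied weights can be seen on a toy module. This is a sketch only: `TinyTiedLM` is a hypothetical stand-in, not the repo's `DecoderOnlyTransformer`, and it borrows just the two parameter names mentioned above. It drops `lm_head.weight` from a state dict, loads the rest non-strictly, and checks that the only missing key is the tied one and that the output head still shares storage with the embedding.

```python
import torch
import torch.nn as nn


class TinyTiedLM(nn.Module):
    """Hypothetical toy model with a tied embedding and output head."""

    def __init__(self, vocab_size: int = 16, n_embd: int = 8):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # Tie the output projection to the embedding matrix.
        self.lm_head.weight = self.token_embedding.weight


source = TinyTiedLM()

# Mimic the export: keep token_embedding.weight, drop the duplicate lm_head.weight.
exported = {
    name: tensor.clone()
    for name, tensor in source.state_dict().items()
    if name != "lm_head.weight"
}

target = TinyTiedLM()
result = target.load_state_dict(exported, strict=False)

# The only key the checkpoint lacks is the tied one.
print(result.missing_keys)
print(result.unexpected_keys)

# The head still shares storage with the embedding, so it received the loaded values.
still_tied = target.lm_head.weight.data_ptr() == target.token_embedding.weight.data_ptr()
values_match = torch.equal(target.lm_head.weight, source.token_embedding.weight)
print(still_tied, values_match)
```

When loading the real checkpoint, inspecting the `missing_keys` and `unexpected_keys` returned by `load_state_dict` in the same way is a cheap guard: anything beyond `lm_head.weight` in `missing_keys` would suggest a config mismatch rather than the expected tied-weight omission.
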