mrinaal-124m-base

A 123.6M-parameter decoder-only causal language model trained from scratch on a 2B-token FineWeb-Edu slice. It is GPT-2-small scale and uses GPT-2 tokenization, but the architecture is not GPT-2: it uses RoPE, RMSNorm, SwiGLU, bias-free linear layers, and tied input/output embeddings.
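For readers less familiar with the non-GPT-2 components, here is a minimal pure-Python sketch of RMSNorm and the elementwise core of SwiGLU. It is not the training repo's implementation: the function names and the `eps` value are illustrative assumptions, and a real model would apply these to tensors rather than lists.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is ~1, then apply a per-channel gain.

    eps=1e-6 is an assumed value; the card does not state the one used.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, up):
    """Elementwise core of SwiGLU: silu(gate) * up.

    In a full feed-forward block, `gate` and `up` are two separate linear
    projections of the input, and a third projection maps the product back
    to the hidden size.
    """
    return [silu(g) * u for g, u in zip(gate, up)]

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
print(swiglu_gate([0.0, 2.0], [5.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term, which fits the bias-free design described above.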

model config

| parameter | value |
| --- | --- |
| parameters | 123,551,232 |
| layers | 12 |
| hidden size | 768 |
| attention heads | 12 |
| context length | 1024 tokens |
| vocab size | 50257 |
| positional encoding | RoPE |
| norm | RMSNorm |
| activation | SwiGLU |
| dropout | 0.0 |
| tokenizer | GPT-2 tokenizer |
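The parameter count can be reproduced from the config with a little arithmetic. The sketch below assumes a SwiGLU hidden width of 2048, which the card does not list; it is chosen here only because it reproduces the published total. It also assumes two RMSNorm gains per block plus one final norm, no biases, and no learned parameters for RoPE.

```python
def count_parameters(vocab_size=50257, n_layer=12, n_embd=768, ffn_hidden=2048):
    # ffn_hidden=2048 is an ASSUMPTION; the card does not state the SwiGLU width.
    embedding = vocab_size * n_embd      # counted once: input and output weights are tied
    attention = 4 * n_embd * n_embd      # q, k, v and output projections, no biases
    swiglu = 3 * n_embd * ffn_hidden     # gate, up and down projections, no biases
    norms = 2 * n_embd                   # two RMSNorm gain vectors per block (assumed)
    final_norm = n_embd                  # one RMSNorm before the output head (assumed)
    return embedding + n_layer * (attention + swiglu + norms) + final_norm

print(f"{count_parameters():,}")  # 123,551,232
```

About 38.6M of the parameters sit in the tied embedding matrix; the remaining ~85M are in the 12 transformer blocks.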

training

  • dataset: FineWeb-Edu, pretokenized with the GPT-2 tokenizer
  • training tokens: 2,000,000,000 configured; 1,998,848,000 tokens seen
  • validation tokens: 20,000,000
  • batch size: 8 sequences
  • sequence length: 1024 tokens
  • optimizer steps: 244,000
  • learning rate: 3e-4 max, 3e-5 min, 2,000 warmup steps
  • weight decay: 0.1
  • gradient clipping: 1.0
  • hardware: NVIDIA H100
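The token counts in the list above follow from the step count, batch size, and sequence length. This short check assumes one batch of 8 sequences per optimizer step, with no gradient accumulation; the card does not say so explicitly, but the numbers match under that assumption.

```python
batch_size = 8              # sequences per optimizer step
seq_len = 1024              # tokens per sequence
optimizer_steps = 244_000
configured_tokens = 2_000_000_000

tokens_per_step = batch_size * seq_len            # 8,192 tokens
tokens_seen = optimizer_steps * tokens_per_step
shortfall = configured_tokens - tokens_seen

print(f"tokens seen: {tokens_seen:,}")            # 1,998,848,000
print(f"short of configured budget by: {shortfall:,}")  # 1,152,000
```

The configured budget of 2,000,000,000 tokens is not a whole multiple of 244,000 steps of 8,192 tokens, which is why the tokens-seen figure is slightly lower.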

metrics

This repo currently publishes model_best.safetensors, converted from /vol/checkpoints/124m_main_2b/best.pt.

| metric | value |
| --- | --- |
| best validation loss | 3.4553542232513426 |
| final train loss | 3.706049680709839 |
| final validation loss | 3.5147283601760866 |
| final validation perplexity | 33.60679773929784 |
| elapsed training time | 21,466.73 seconds |

The full run reached step 244,000. The published best checkpoint is the one with the lowest validation loss seen during training, so its validation loss (about 3.455) is lower than the final validation loss (about 3.515).
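The reported perplexity is the exponential of the validation loss, and the elapsed time gives a rough throughput figure. The sketch below assumes the losses are mean cross-entropy in nats per token and that the elapsed time covers the whole run, including evaluation; both are assumptions, though the first is consistent with the published perplexity.

```python
import math

final_val_loss = 3.5147283601760866
best_val_loss = 3.4553542232513426
elapsed_seconds = 21_466.73
tokens_seen = 1_998_848_000

final_ppl = math.exp(final_val_loss)   # should match the published 33.6068
best_ppl = math.exp(best_val_loss)     # perplexity of the published checkpoint
tokens_per_second = tokens_seen / elapsed_seconds

print(f"final validation perplexity: {final_ppl:.4f}")
print(f"best validation perplexity:  {best_ppl:.4f}")
print(f"average throughput: {tokens_per_second:,.0f} tokens/s")
print(f"elapsed: {elapsed_seconds / 3600:.2f} hours")
```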

files

  • model_best.safetensors — checkpoint with the lowest validation loss during training
  • run_summary.json — full training run metadata
  • model_last.safetensors is not currently uploaded; this repo intentionally publishes the best checkpoint only.

loading

from safetensors.torch import load_file

state_dict = load_file("model_best.safetensors")
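Once the state dict is loaded, a quick way to sanity-check it is to sum the tensor sizes and look at the largest entries. The helper below is a hypothetical convenience, not part of the training repo; it works on a plain mapping of tensor name to shape so it can run without the checkpoint file. The second tensor name in the stand-in mapping is made up for illustration.

```python
import math

def summarize_shapes(shapes):
    """Return (total element count, tensor names sorted largest first)."""
    sizes = {name: math.prod(shape) for name, shape in shapes.items()}
    total = sum(sizes.values())
    largest = sorted(sizes, key=sizes.get, reverse=True)
    return total, largest

# With the real checkpoint (requires `safetensors` and `torch`):
#   from safetensors.torch import load_file
#   state_dict = load_file("model_best.safetensors")
#   shapes = {name: tuple(t.shape) for name, t in state_dict.items()}
# A stand-in mapping is used here so the sketch runs without the file.
shapes = {
    "token_embedding.weight": (50257, 768),
    "example_block.norm.weight": (768,),  # hypothetical tensor name
}

total, largest = summarize_shapes(shapes)
print(f"{total:,} elements; largest tensor: {largest[0]}")
```

Run against the real file, the total should equal the 123,551,232 parameters listed in the model config, provided the conversion stores every tensor exactly once.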

To use the weights with the original model class, clone the training repo and run:

from safetensors.torch import load_file
from first_llm_pretrain.model import DecoderOnlyTransformer, ModelConfig

config = ModelConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=12,
    n_head=12,
    n_embd=768,
)
model = DecoderOnlyTransformer(config)
model.load_state_dict(load_file("model_best.safetensors"), strict=False)
model.eval()

strict=False is needed because the safetensors conversion drops the duplicate lm_head.weight tensor and keeps only token_embedding.weight. The original model class ties those two weights, so loading token_embedding.weight also sets the output head.
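Because strict=False silences every key mismatch, not just the expected one, it can be worth confirming that lm_head.weight is the only missing key. PyTorch's load_state_dict returns an object with missing_keys and unexpected_keys fields; the helper below is a hypothetical check over those two lists and is not part of the training repo.

```python
def check_load_result(missing_keys, unexpected_keys,
                      allowed_missing=("lm_head.weight",)):
    """Raise if the load skipped anything other than the tied output head."""
    truly_missing = sorted(set(missing_keys) - set(allowed_missing))
    extra = sorted(unexpected_keys)
    if truly_missing or extra:
        raise RuntimeError(
            f"unexpected load result: missing={truly_missing}, unexpected={extra}"
        )

# With the real model (requires the training repo, `torch`, and `safetensors`):
#   result = model.load_state_dict(load_file("model_best.safetensors"), strict=False)
#   check_load_result(result.missing_keys, result.unexpected_keys)
# Stand-in call so the sketch runs on its own:
check_load_result(["lm_head.weight"], [])
print("only the tied lm_head.weight was missing")
```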
