mrinaal-124m-base

A 123.6M-parameter decoder-only causal language model trained from scratch on a 2B-token FineWeb-Edu slice. It is GPT-2-small scale and uses GPT-2 tokenization, but the architecture is not GPT-2: it uses RoPE, RMSNorm, SwiGLU, bias-free linear layers, and tied input/output embeddings.

model config

param	value
parameters	123,551,232
layers	12
hidden size	768
attention heads	12
context length	1024 tokens
vocab size	50257
positional encoding	RoPE
norm	RMSNorm
activation	SwiGLU
dropout	0.0
tokenizer	GPT-2 tokenizer
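
The parameter total above can be reproduced from the listed dimensions. The following is a minimal sketch, not the training repo's code. It assumes four bias-free 768×768 attention projections per block, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per block plus one final norm, and no learned positional parameters (RoPE has none). The MLP hidden width of 2048 is an assumption: the card does not state it, but it is the value that makes the arithmetic match the published count. The function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab_size=50257, n_layer=12, n_embd=768, mlp_hidden=2048):
    """Count weights for a bias-free decoder with tied embeddings (sketch).

    mlp_hidden=2048 is an assumption; the model card does not state it.
    """
    embedding = vocab_size * n_embd      # shared with the output head (tied)
    attention = 4 * n_embd * n_embd      # query, key, value, output projections
    swiglu = 3 * n_embd * mlp_hidden     # gate, up, and down projections
    norms = 2 * n_embd                   # two RMSNorm weight vectors per block
    final_norm = n_embd                  # one RMSNorm before the output head
    return embedding + n_layer * (attention + swiglu + norms) + final_norm


total = count_parameters()
print(f"{total:,}")  # 123,551,232
```

Because the embeddings are tied, the 50257×768 matrix is counted once; an untied output head would add another 38,597,376 parameters.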

training

dataset: FineWeb-Edu, pretokenized with the GPT-2 tokenizer
training tokens: 2,000,000,000 configured; 1,998,848,000 tokens seen
validation tokens: 20,000,000
batch size: 8 sequences
sequence length: 1024 tokens
optimizer steps: 244,000
learning rate: 3e-4 max, 3e-5 min, 2,000 warmup steps
weight decay: 0.1
gradient clipping: 1.0
hardware: NVIDIA H100
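
The "tokens seen" figure follows directly from the step count, batch size, and sequence length. A short sketch of the arithmetic, assuming one batch of 8 sequences per optimizer step (the card lists no gradient accumulation, so none is assumed here):

```python
batch_size = 8              # sequences per optimizer step
seq_len = 1024              # tokens per sequence
steps = 244_000             # optimizer steps completed
configured = 2_000_000_000  # configured token budget

tokens_per_step = batch_size * seq_len    # 8,192
tokens_seen = steps * tokens_per_step     # 1,998,848,000
shortfall = configured - tokens_seen      # 1,152,000 tokens under budget

print(f"{tokens_per_step:,} tokens/step, {tokens_seen:,} seen, {shortfall:,} short")
```

The configured budget is not an exact multiple of 8,192 tokens, which is why the seen count lands slightly under 2B.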

metrics

This repo currently publishes model_best.safetensors, converted from /vol/checkpoints/124m_main_2b/best.pt.

metric	value
best validation loss	3.4553542232513426
final train loss	3.706049680709839
final validation loss	3.5147283601760866
final validation perplexity	33.60679773929784
elapsed training time	21,466.73 seconds (about 5.96 hours)

The full run reached step 244,000. The published checkpoint is the one with the lowest validation loss seen during training (3.4554), which is why its validation loss is lower than the final validation loss (3.5147).
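
The reported perplexity is the exponential of the final validation loss, and an average throughput can be derived from the token count and elapsed time. A small sketch, assuming the losses are mean per-token cross-entropy in nats and that the elapsed time covers the whole run (including any evaluation passes):

```python
import math

final_val_loss = 3.5147283601760866
best_val_loss = 3.4553542232513426
elapsed_seconds = 21_466.73
tokens_seen = 1_998_848_000

final_ppl = math.exp(final_val_loss)   # matches the reported final perplexity
best_ppl = math.exp(best_val_loss)     # perplexity of the published checkpoint
tokens_per_second = tokens_seen / elapsed_seconds

print(f"final ppl {final_ppl:.2f}, best ppl {best_ppl:.2f}")
print(f"average throughput {tokens_per_second:,.0f} tokens/s")
```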

files

model_best.safetensors — checkpoint with the lowest validation loss during training
run_summary.json — full training run metadata
model_last.safetensors is not currently uploaded; this repo intentionally publishes the best checkpoint only.

loading

from safetensors.torch import load_file

state_dict = load_file("model_best.safetensors")

To use the weights with the original model class, clone the training repo and run:

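from safetensors.torch import load_file

from first_llm_pretrain.model import DecoderOnlyTransformer, ModelConfig

# Dimensions must match the published checkpoint.
config = ModelConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=12,
    n_head=12,
    n_embd=768,
)
model = DecoderOnlyTransformer(config)
# strict=False tolerates the missing lm_head.weight (tied to the embedding).
model.load_state_dict(load_file("model_best.safetensors"), strict=False)
model.eval()  # inference mode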

strict=False is used because the safetensors conversion removes the duplicate lm_head.weight tensor and keeps token_embedding.weight; the original model class ties those weights.
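
Because strict=False also hides genuine mismatches, it can help to confirm that the tied head is the only tensor that was skipped. The helper below is a hypothetical sketch, not part of the training repo. It relies on PyTorch's `load_state_dict` returning an object with `missing_keys` and `unexpected_keys`, and it assumes the model reports the skipped tensor under the name `lm_head.weight` given above.

```python
def check_tied_load(missing_keys, unexpected_keys, tied_key="lm_head.weight"):
    """Raise if a strict=False load skipped anything other than the tied head."""
    extra_missing = sorted(set(missing_keys) - {tied_key})
    if extra_missing or unexpected_keys:
        raise RuntimeError(
            f"load mismatch: missing={extra_missing}, "
            f"unexpected={sorted(unexpected_keys)}"
        )
    return True


# Usage with the loading snippet above (not run here):
# result = model.load_state_dict(load_file("model_best.safetensors"), strict=False)
# check_tied_load(result.missing_keys, result.unexpected_keys)
```

If the check raises, the config passed to the model class probably does not match the checkpoint.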
