NanoGPT 53M - Pre-LN Transformer

A 53-million-parameter GPT model trained from scratch on the TinyStories dataset. The model implements a Pre-LayerNorm (Pre-LN) transformer architecture and serves as a demonstration of efficient training on Apple Silicon using the MLX framework.

Model Format: PyTorch (cross-platform compatible)
Training Framework: Apple MLX (exported to PyTorch for universal compatibility)
Best for: Educational demonstrations, research, and fine-tuning on specific domains

Model Details

Architecture

  • Model Type: GPT (Decoder-only Transformer)
  • Parameters: 53M (52,990,464 total; ~33.7M unique with weight tying, re-derived in the sketch after this list)
  • Architecture Pattern: Pre-LayerNorm (Pre-LN)
  • Layers: 8 transformer blocks
  • Hidden Size: 384
  • Attention Heads: 8
  • Feedforward Dimension: 1536
  • Context Length: 512 tokens
  • Vocabulary Size: 50,257 (GPT-2 tokenizer)
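
For reference, the total above can be re-derived from these dimensions. The sketch below assumes a bias on every Linear layer and a weight/bias pair per LayerNorm (an assumption, not taken from the released code); with those assumptions the arithmetic reproduces 52,990,464 exactly and leaves ~33.7M unique weights once the tied lm_head copy is removed.

# Re-deriving the parameter count from the stated dimensions (assumptions noted above)
vocab, d_model, n_layers, d_ff, ctx = 50257, 384, 8, 1536, 512

tok_emb = vocab * d_model                      # token_embedding: 19,298,688
pos_emb = ctx * d_model                        # position_embedding: 196,608
qkv     = d_model * 3 * d_model + 3 * d_model  # qkv_proj (weight + bias)
out     = d_model * d_model + d_model          # out_proj
ff      = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
lns     = 2 * 2 * d_model                      # ln1 + ln2
block   = qkv + out + ff + lns                 # 1,774,464 per block
ln_f    = 2 * d_model
lm_head = d_model * vocab                      # tied with token_embedding

total  = tok_emb + pos_emb + n_layers * block + ln_f + lm_head
unique = total - lm_head                       # weight tying stores one copy
print(total, unique)                           # 52990464 33691776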

Training

  • Framework: Apple MLX (training), PyTorch (export)
  • Dataset: TinyStories - Simple children's stories for language learning
  • Training Hardware: Apple M2 Pro (16GB unified memory)
  • Checkpoint: 20,000 iterations
  • Training Method: Base pretraining from scratch

Architecture Highlights

This model uses a Pre-LayerNorm architecture, in contrast to the Post-LN used in the original Transformer and GPT-1:

# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs Post-LN (original Transformer / GPT-1)
x = ln(x + attn(x))
x = ln(x + ff(x))

Pre-LN provides better training stability and is used in modern transformers (GPT-2, GPT-3, PaLM, LLaMA).
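
For concreteness, here is a minimal PyTorch sketch of one Pre-LN block. Module names mirror the architecture printout further down, but nn.MultiheadAttention and GELU are stand-ins chosen for brevity; the exported model implements its own qkv_proj/out_proj attention, so treat this as an illustration rather than the actual modeling code.

import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN transformer block (not the exact exported module)."""
    def __init__(self, d_model=384, n_heads=8, d_ff=1536, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, attn_mask=None):
        # Normalize *before* each sub-layer; the residual stream stays unnormalized.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x

x = torch.randn(1, 16, 384)
print(PreLNBlock()(x).shape)  # torch.Size([1, 16, 384])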

Training Details

  • Dataset: TinyStories (simple children's stories)
  • Training Tokens: ~2M
  • Total Iterations: 20,000
  • Batch Size: 12 sequences/batch
  • Sequence Length: 512 tokens
  • Learning Rate: 3e-4 with cosine decay schedule (see the sketch after this list)
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
  • Final Training Loss: 0.7583
  • Training Time: ~4 hours on Apple M2 Pro
  • Gradient Accumulation: None (direct updates)
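
A rough PyTorch equivalent of the optimizer and schedule listed above is sketched below. Training actually ran in MLX; the placeholder model and dummy loss are there only to make the snippet self-contained, and no warmup is shown because none is documented here.

import torch
import torch.nn as nn

model = nn.Linear(384, 384)  # placeholder standing in for the 53M-parameter model

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
# Cosine decay of the learning rate over the full 20,000-iteration run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20_000)

for step in range(3):  # same per-iteration update pattern, truncated to 3 steps
    loss = model(torch.randn(12, 384)).pow(2).mean()  # dummy loss, batch of 12
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()       # direct update, no gradient accumulation
    scheduler.step()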

Performance Benchmarks

Measured on Apple M2 Pro (16GB unified memory):

  • Model Size: 53.0M parameters
  • Memory (fp32): 202.1 MB
  • Memory (fp16): 101.1 MB
  • Training Throughput: 27,355 tokens/sec
  • Batch Processing: 13.36 batches/sec (batch=4, seq=512)
  • Inference Speed: 169.9 tokens/sec
  • Generation Latency: ~0.59 s per 100 tokens
  • Activation Memory: 843 MB (batch=4, seq=512)

Note: All benchmarks measured at checkpoint 20000 (this release).
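
The weight-memory figures follow directly from the parameter count (4 bytes per fp32 weight, 2 per fp16), and the generation speed can be spot-checked with a simple timing loop like the one below. Numbers will vary by hardware; the model/tokenizer loading mirrors the Usage section.

import time
from transformers import AutoTokenizer, AutoModelForCausalLM

params = 52_990_464
print(f"fp32 weights: {params * 4 / 2**20:.1f} MiB")  # 202.1, as in the list above
print(f"fp16 weights: {params * 2 / 2**20:.1f} MiB")  # 101.1

tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/tinystories")
model = AutoModelForCausalLM.from_pretrained("jacksuuuu/tinystories",
                                             trust_remote_code=True)

inputs = tokenizer("Once upon a time", return_tensors="pt")
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                     pad_token_id=tokenizer.eos_token_id)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} tokens/sec generated")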

Usage

Basic Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (requires trust_remote_code for custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/tinystories")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/tinystories",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Example Output

Prompt: "Once upon a time"

Generated:

Once upon a time, the boy named Lily and his dog named Max went for a walk. 
They ran and ran, but they kept each and got very tired. Suddenly the way, 
Max saw something shiny on the ground. He pointed the shiny to his owner and 
explained, "What does this?"

Max meowed and said, "I don't sign, Max. The sign is too small and it's 
important to learn."

Note: This model generates short children's stories with a simple narrative structure. While grammatically imperfect due to its small size (53M parameters), it demonstrates the narrative patterns and vocabulary learned from the TinyStories dataset.

Model Architecture

NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)

Note: token_embedding and lm_head weights are tied (shared), reducing the effective parameter count from 53M total to ~33.7M unique weights.
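
In generic PyTorch, tying looks like the sketch below: the output projection simply reuses the embedding matrix. Attribute names here are illustrative; the custom modeling code may wire the tie up differently.

import torch.nn as nn

d_model, vocab = 384, 50257
token_embedding = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)

# After tying, both modules reference the same 50257 x 384 parameter tensor,
# so it is stored (and counted) only once.
lm_head.weight = token_embedding.weight
assert lm_head.weight.data_ptr() == token_embedding.weight.data_ptr()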

Training Configuration

{
  "vocab_size": 50257,
  "d_model": 384,
  "n_layers": 8,
  "n_heads": 8,
  "d_ff": 1536,
  "context_length": 512,
  "dropout": 0.1,
  "batch_size": 12,
  "learning_rate": 3e-4,
  "weight_decay": 0.1,
  "max_iters": 20000
}
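
From these values, each iteration consumes batch_size × context_length tokens, so the 20,000-iteration run processes roughly 123M token positions, which implies many passes over the ~2M tokens reported above:

batch_size, context_length, max_iters = 12, 512, 20_000
tokens_per_iter = batch_size * context_length      # 6,144
tokens_processed = tokens_per_iter * max_iters     # 122,880,000 (~123M)
print(tokens_per_iter, tokens_processed)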

Limitations

  • Context length: Limited to 512 tokens (cannot process longer documents)
  • Domain: Trained only on simple children's stories (TinyStories)
  • Model size: 53M parameters - significantly smaller than modern LLMs (1B+)
  • Generation quality: Produces coherent narratives but with occasional grammatical errors
  • Factual accuracy: Limited by small model size and training data
  • No instruction tuning: Base language model - cannot follow instructions or engage in dialogue
  • Training data: Only ~2M tokens (modern models use trillions)

Intended Use

Primary use cases:

  • Educational demonstrations of transformer training
  • Resource-constrained inference on Apple Silicon
  • Base model for fine-tuning on specific domains
  • Research and experimentation with Pre-LN architectures

Not recommended for:

  • Production applications requiring factual accuracy
  • Long-form content generation (>512 tokens)
  • Instruction following or chat applications (not instruction-tuned)

Ethical Considerations

This model was trained on TinyStories, a dataset of simple, synthetically generated children's stories. Users should:

  • Be aware of potential biases in generated content
  • Validate outputs for factual accuracy
  • Not use for applications requiring high reliability
  • Consider fine-tuning on domain-specific data for production use

Citation

If you use this model, please cite:

@software{nanogpt_mlx_2025,
  author = {JackSu},
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year = {2025},
  url = {https://huggingface.co/jacksuuuu/tinystories}
}

License

MIT License - See repository for details.
