# Trellis-506M
A 506M parameter LLaMA-style language model pretrained from scratch on 20 billion tokens of curated data, optimized for structured output tasks (JSON generation, function calling, schema compliance).
## Research Question
Does multi-dimensional data curation during pretraining -- selecting training data for topic relevance and reasoning complexity -- measurably improve structured output capabilities beyond what supervised fine-tuning alone achieves?
The thesis: structural patterns (JSON, schemas, type systems, nested hierarchies) embedded deeply during pretraining produce better generalization to unseen schemas than SFT format-following alone.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | LLaMA (LlamaForCausalLM) |
| Parameters | 506,328,320 (~506M) |
| Hidden size | 1,280 |
| Intermediate size (FFN) | 2,816 |
| Layers | 24 |
| Attention heads | 20 |
| Key-value heads | 10 (GQA, 2:1 ratio) |
| Context length | 2,048 |
| Vocab size | 50,304 |
| Activation | SiLU (SwiGLU) |
| Normalization | RMSNorm (eps=1e-5) |
| Position encoding | RoPE (theta=10000) |
| Tied embeddings | No |
| Precision | bfloat16 |
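The 506,328,320 figure follows directly from the table above. As a sanity check, here is a minimal pure-Python sketch, assuming standard LLaMA weight shapes (bias-free Q/K/V/O projections with GQA-sized K/V, a SwiGLU MLP with gate/up/down projections, and an untied output head, consistent with the "Tied embeddings: No" row):

```python
# Reproduce the parameter count from the architecture table.
hidden = 1280
ffn = 2816
layers = 24
heads = 20
kv_heads = 10
vocab = 50304

head_dim = hidden // heads      # 64
kv_dim = kv_heads * head_dim    # 640 (GQA: 10 K/V heads)

# Attention: Q and O are hidden x hidden; K and V are hidden x kv_dim.
attn = 2 * hidden * hidden + 2 * hidden * kv_dim

# SwiGLU MLP: gate and up are hidden x ffn; down is ffn x hidden.
mlp = 3 * hidden * ffn

# Two RMSNorm weight vectors per layer.
norms = 2 * hidden

per_layer = attn + mlp + norms
embed = vocab * hidden          # input embeddings
lm_head = vocab * hidden        # untied output head
final_norm = hidden

total = layers * per_layer + embed + lm_head + final_norm
print(total)  # 506328320
```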
## Pretraining Data
20 billion tokens assembled from curated sources, with deliberate over-weighting of structured and reasoning-heavy content:
| Source | Tokens | Share | Notes |
|---|---|---|---|
| FineWeb-Edu (curated) | 4.3B | 21.5% | Topic + complexity filtered via BERT classifiers trained on frontier LLM labels |
| StarCoderData (curated) | 3.9B | 19.5% | Quality + structured-relevance filtered code |
| FineMath-4+ | 3.2B | 16.0% | Pre-filtered mathematical text |
| peS2o (CS/math/ML) | 2.2B | 11.0% | Academic papers in technical domains |
| FineWeb-Edu (random) | 2.0B | 10.0% | Uncurated web text baseline |
| Structured Wikipedia | 1.5B | 7.5% | JSON infoboxes paired with article prose |
| SQaLe Text-to-SQL | 1.0B | 5.0% | Schema-to-query pairs |
| StackExchange (technical) | 1.0B | 5.0% | High-score technical Q&A |
| Wikipedia EN (plain) | 0.5B | 2.5% | General knowledge |
| UltraChat | 0.4B | 2.0% | Instruction-following diversity |
The curation pipeline uses BERT-based classifiers (trained on GPT-4 labels) to score documents along two axes -- topic relevance to structured tasks, and reasoning complexity -- then filters and ranks billions of candidate tokens.
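The two-axis filter-and-rank step can be sketched as follows. This is a hypothetical illustration only: the `topic`/`complexity` values, thresholds, and combined-score ranking are assumptions for demonstration, not the actual classifier outputs or pipeline code.

```python
# Hypothetical sketch of two-axis curation: each document carries a
# topic-relevance score and a reasoning-complexity score in [0, 1]
# (in the real pipeline these come from BERT classifiers trained on
# frontier-LLM labels). Documents must clear both thresholds, then
# are ranked by combined score until the token budget is filled.
def curate(docs, topic_min=0.5, complexity_min=0.5, token_budget=1_000):
    kept = [d for d in docs
            if d["topic"] >= topic_min and d["complexity"] >= complexity_min]
    kept.sort(key=lambda d: d["topic"] + d["complexity"], reverse=True)
    selected, used = [], 0
    for d in kept:
        if used + d["tokens"] > token_budget:
            break
        selected.append(d)
        used += d["tokens"]
    return selected

docs = [
    {"id": "a", "topic": 0.9, "complexity": 0.8, "tokens": 400},
    {"id": "b", "topic": 0.3, "complexity": 0.9, "tokens": 300},  # off-topic
    {"id": "c", "topic": 0.7, "complexity": 0.6, "tokens": 500},
]
print([d["id"] for d in curate(docs)])  # ['a', 'c']
```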
## Training Configuration
| Parameter | Value |
|---|---|
| Total tokens | 20B |
| Effective batch size | ~128k tokens (micro_batch=4 x seq_len=2048 x grad_accum=16) |
| Max learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| LR schedule | Cosine decay |
| Warmup steps | 2,000 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Gradient checkpointing | Yes |
| torch.compile | Yes |
Tokenizer: GPT-NeoX (same as EleutherAI/pythia-410m, 50,304 vocab).
Hardware: Single NVIDIA RTX 4090 (24GB VRAM). Estimated throughput: ~45-55k tokens/sec.
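The batch and schedule numbers in the table imply the following step counts. A minimal sketch, assuming linear warmup to the max LR followed by cosine decay to the min LR (a standard shape; the card does not specify the exact implementation):

```python
import math

# Effective batch: micro_batch x seq_len x grad_accum.
tokens_per_step = 4 * 2048 * 16
print(tokens_per_step)   # 131072 (~128k tokens)

# Optimizer steps implied by the 20B-token budget.
total_steps = 20_000_000_000 // tokens_per_step
print(total_steps)       # 152587

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total=total_steps):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(2000))        # 0.0003 (max_lr reached at end of warmup)
print(lr_at(total_steps)) # 3e-05  (min_lr at end of training)
```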
## Intended Use
This is a base model (no instruction tuning or RLHF). It is intended for:
- Research into the effects of curated pretraining on downstream structured output tasks
- Fine-tuning for JSON generation, function calling, schema compliance, and structured extraction
- Comparison against similarly-sized models (Pythia-410M, Pythia-1B) trained on uncurated data
This model is not intended for direct use in production applications without further fine-tuning and safety evaluation.
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mdonigian/trellis-pretraining",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mdonigian/trellis-pretraining")

inputs = tokenizer("The JSON schema for a user profile is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
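Since the model targets structured output, it is useful to check completions for JSON validity and required keys. A minimal standard-library sketch (the `check_json_output` helper and the example completion string are illustrative assumptions, not actual model output; a real evaluation would use a full JSON Schema validator):

```python
import json

def check_json_output(text, required_keys):
    """Return (is_valid_json, missing_keys) for a model completion."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False, list(required_keys)
    if not isinstance(obj, dict):
        return False, list(required_keys)
    missing = [k for k in required_keys if k not in obj]
    return True, missing

# Illustrative completion (not actual model output).
completion = '{"name": "Ada", "age": 36}'
valid, missing = check_json_output(completion, ["name", "age", "email"])
print(valid, missing)  # True ['email']
```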
## Experimental Design
This model is one component of a controlled experiment:
- Trellis-506M (this model) -- pretrained on curated data
- Pythia-410M-deduped -- pretrained on The Pile (uncurated), similar parameter count
- Pythia-1B-deduped -- pretrained on The Pile (uncurated), 2x parameter count
All three models undergo identical SFT on the same dataset, with identical hyperparameters. Post-SFT evaluation on structured output benchmarks determines whether curated pretraining provides a lasting advantage.
## Limitations
- At 506M parameters, the model is small by modern standards and has limited general knowledge and reasoning ability
- Context length is limited to 2,048 tokens
- This is a base model with no safety training or alignment
- Benchmark results are pending post-SFT evaluation
## Citation
```bibtex
@misc{trellis2026,
  title={Trellis-506M: Curated Pretraining for Structured Output},
  author={Donigian, Matt},
  year={2026},
  url={https://huggingface.co/mdonigian/trellis-pretraining}
}
```