Trellis-506M

A 506M parameter LLaMA-style language model pretrained from scratch on 20 billion tokens of curated data, optimized for structured output tasks (JSON generation, function calling, schema compliance).

Research Question

Does multi-dimensional data curation during pretraining -- selecting training data for topic relevance and reasoning complexity -- measurably improve structured output capabilities beyond what supervised fine-tuning alone achieves?

The thesis: structural patterns (JSON, schemas, type systems, nested hierarchies) embedded deeply during pretraining produce better generalization to unseen schemas than SFT format-following alone.

Model Details

| Parameter | Value |
|---|---|
| Architecture | LLaMA (LlamaForCausalLM) |
| Parameters | 506,328,320 (~506M) |
| Hidden size | 1,280 |
| Intermediate size (FFN) | 2,816 |
| Layers | 24 |
| Attention heads | 20 |
| Key-value heads | 10 (GQA, 2:1 ratio) |
| Context length | 2,048 |
| Vocab size | 50,304 |
| Activation | SiLU (SwiGLU) |
| Normalization | RMSNorm (eps=1e-5) |
| Position encoding | RoPE (theta=10000) |
| Tied embeddings | No |
| Precision | bfloat16 |
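The parameter count in the table can be verified directly from the architecture dimensions. A minimal sketch in pure Python, assuming the standard LLaMA layout (untied embeddings, bias-free projections, SwiGLU MLP, GQA attention, RMSNorm):

```python
# Recompute Trellis-506M's parameter count from the table above.
# Assumes standard LLaMA layout: untied embeddings, no biases,
# SwiGLU MLP (gate/up/down), GQA attention, one weight vector per RMSNorm.

d_model, d_ffn, n_layers = 1280, 2816, 24
n_heads, n_kv_heads, vocab = 20, 10, 50_304
head_dim = d_model // n_heads          # 64

kv_dim = n_kv_heads * head_dim         # 640 (GQA: 10 KV heads)
attn = d_model * d_model * 2           # q_proj + o_proj
attn += d_model * kv_dim * 2           # k_proj + v_proj
mlp = d_model * d_ffn * 3              # gate_proj + up_proj + down_proj
norms = 2 * d_model                    # input + post-attention RMSNorm
per_layer = attn + mlp + norms

total = (vocab * d_model * 2           # embed_tokens + lm_head (untied)
         + n_layers * per_layer
         + d_model)                    # final RMSNorm

print(total)  # -> 506328320
```

The total matches the table exactly, which is a useful sanity check that the listed dimensions are internally consistent.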

Pretraining Data

20 billion tokens assembled from curated sources, with deliberate over-weighting of structured and reasoning-heavy content:

| Source | Tokens | Share | Notes |
|---|---|---|---|
| FineWeb-Edu (curated) | 4.3B | 21.5% | Topic + complexity filtered via BERT classifiers trained on frontier LLM labels |
| StarCoderData (curated) | 3.9B | 19.5% | Quality + structured-relevance filtered code |
| FineMath-4+ | 3.2B | 16.0% | Pre-filtered mathematical text |
| peS2o (CS/math/ML) | 2.2B | 11.0% | Academic papers in technical domains |
| FineWeb-Edu (random) | 2.0B | 10.0% | Uncurated web text baseline |
| Structured Wikipedia | 1.5B | 7.5% | JSON infoboxes paired with article prose |
| SQaLe Text-to-SQL | 1.0B | 5.0% | Schema-to-query pairs |
| StackExchange (technical) | 1.0B | 5.0% | High-score technical Q&A |
| Wikipedia EN (plain) | 0.5B | 2.5% | General knowledge |
| UltraChat | 0.4B | 2.0% | Instruction-following diversity |

The curation pipeline uses BERT-based classifiers (trained on GPT-4 labels) to score documents along two axes -- topic relevance to structured tasks, and reasoning complexity -- then filters and ranks billions of candidate tokens.
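The classifier thresholds and ranking rule are not published here, but the filter-and-rank step can be sketched as follows. Everything below — the score fields, cutoffs, combined ranking key, and token-budget proxy — is an illustrative assumption, not the actual pipeline:

```python
# Hypothetical sketch of the two-axis curation step described above.
# `topic_score` and `complexity_score` stand in for the BERT classifier
# outputs (assumed to be in [0, 1]); thresholds and the ranking key
# are assumptions for illustration.
from typing import NamedTuple

class ScoredDoc(NamedTuple):
    text: str
    topic_score: float        # relevance to structured tasks
    complexity_score: float   # reasoning complexity

def curate(docs, topic_min=0.5, complexity_min=0.3, budget_tokens=None):
    """Filter on both axes, then rank by combined score; optionally
    truncate to a token budget using a crude whitespace token proxy."""
    kept = [d for d in docs
            if d.topic_score >= topic_min and d.complexity_score >= complexity_min]
    kept.sort(key=lambda d: d.topic_score + d.complexity_score, reverse=True)
    if budget_tokens is None:
        return kept
    out, used = [], 0
    for d in kept:
        n = len(d.text.split())
        if used + n > budget_tokens:
            break
        out.append(d)
        used += n
    return out

docs = [
    ScoredDoc("CREATE TABLE users (id INT, name TEXT);", 0.9, 0.6),
    ScoredDoc("celebrity gossip roundup", 0.1, 0.1),
    ScoredDoc("proof by induction on n", 0.6, 0.8),
]
kept = curate(docs)
print([d.text for d in kept])  # gossip filtered out; SQL and math kept
```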

Training Configuration

| Parameter | Value |
|---|---|
| Total tokens | 20B |
| Effective batch size | ~128k tokens (micro_batch=4 x seq_len=2048 x grad_accum=16) |
| Max learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| LR schedule | Cosine decay |
| Warmup steps | 2,000 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Gradient checkpointing | Yes |
| torch.compile | Yes |

Tokenizer: GPT-NeoX (same as EleutherAI/pythia-410m, 50,304 vocab).

Hardware: Single NVIDIA RTX 4090 (24GB VRAM). Estimated throughput: ~45-55k tokens/sec.
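The schedule in the table (linear warmup over 2,000 steps to 3e-4, cosine decay to 3e-5) can be sketched in a few lines. Whether decay hits the minimum exactly at the final step is an assumption; the endpoints and step count follow from the table above:

```python
import math

MAX_LR, MIN_LR, WARMUP = 3e-4, 3e-5, 2_000

def lr_at(step, total_steps):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / max(1, total_steps - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# 20B tokens / (4 * 2048 * 16) tokens per step ~= 152,587 optimizer steps
total = 20_000_000_000 // (4 * 2048 * 16)
print(lr_at(0, total), lr_at(WARMUP, total), lr_at(total, total))
```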

Intended Use

This is a base model (no instruction tuning or RLHF). It is intended for:

  • Research into the effects of curated pretraining on downstream structured output tasks
  • Fine-tuning for JSON generation, function calling, schema compliance, and structured extraction
  • Comparison against similarly-sized models (Pythia-410M, Pythia-1B) trained on uncurated data

This model is not intended for direct use in production applications without further fine-tuning and safety evaluation.

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mdonigian/trellis-pretraining",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mdonigian/trellis-pretraining")

inputs = tokenizer("The JSON schema for a user profile is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Experimental Design

This model is one component of a controlled experiment:

  1. Trellis-506M (this model) -- pretrained on curated data
  2. Pythia-410M-deduped -- pretrained on The Pile (uncurated), similar parameter count
  3. Pythia-1B-deduped -- pretrained on The Pile (uncurated), ~2x the parameter count

All three models undergo identical SFT on the same dataset, with identical hyperparameters. Post-SFT evaluation on structured output benchmarks determines whether curated pretraining provides a lasting advantage.
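The post-SFT benchmark suite is not specified here, but one plausible structured-output metric — strict key compliance against a target schema — can be sketched as follows. The function name and the key-subset criterion are illustrative assumptions:

```python
import json

def schema_compliance(output: str, required_keys: set) -> bool:
    """Illustrative structured-output check: the completion must parse
    as JSON and contain every required top-level key. This is a sketch
    of one plausible metric, not the actual evaluation harness."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

keys = {"name", "email"}
print(schema_compliance('{"name": "Ada", "email": "ada@example.com"}', keys))  # True
print(schema_compliance('{"name": "Ada"}', keys))                              # False
print(schema_compliance('not json', keys))                                     # False
```

Averaging such a check over held-out schemas the model never saw during SFT is one way to measure the generalization advantage the experiment is testing for.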

Limitations

  • 506M parameters is small by modern standards; the model has limited general knowledge and reasoning ability
  • Context length is limited to 2,048 tokens
  • This is a base model with no safety training or alignment
  • Benchmark results are pending post-SFT evaluation

Citation

```bibtex
@misc{trellis2026,
  title={Trellis-506M: Curated Pretraining for Structured Output},
  author={Donigian, Matt},
  year={2026},
  url={https://huggingface.co/mdonigian/trellis-pretraining}
}
```