Helios Nova

Helios Nova — 306M

Helios Nova is a 306M-parameter dense language model that explores the frontier of budget-efficient pre-training. It reaches about 96% of the 41.8 benchmark average shared by OpenELM-270M, MobileLLM-350M, and Pythia-410M (40.3 vs 41.8) while training on 5–30× fewer tokens, on a single GPU, for under $190.

The model incorporates a state-of-the-art transformer architecture — SwiGLU, Grouped-Query Attention, QK-Norm, and RoPE — and was pre-trained on 50 billion tokens from FineWeb-Edu on a single NVIDIA H100 in under 120 hours. Where comparable models consumed up to 1.5T tokens, Helios Nova reaches within 1.5 points of the same benchmark average with 30× less data.

Parameters 306M (dense, 24 unique layers)
Training data 50B tokens · FineWeb-Edu
Tokenizer 16K BPE (custom)
Context length 2,048 tokens
Hardware 1× NVIDIA H100 · < 120 hours
Training cost < $190 USD
Inference RAM < 3 GB (fp32)
License Apache 2.0

The efficiency story

Training data vs performance

Helios Nova trained on just 50B tokens, a fraction of what comparable models use. Despite this, it beats OpenELM-270M (trained on 30× more data) on ARC-Challenge, WinoGrande, and OBQA, and beats Pythia-410M (a larger model trained on 6× more data) on OBQA. Its five-benchmark average of 40.3 sits 1.5 points below the 41.8 scored by OpenELM-270M, MobileLLM-350M, and Pythia-410M; the gap widens to 2.8 points against OpenELM-450M and 5.2 points against SmolLM-360M. Measured as average score per training token, it is the most data-efficient model in the benchmark results.

Architecture

Dense causal transformer with 24 unique layers. State-of-the-art components designed for maximum learning per token:

Component Configuration
Layers 24 (all unique, no weight sharing)
Hidden dim 1,024
Attention GQA: 16 query / 4 KV heads
Head dim 64
FFN SwiGLU, hidden = 3,072
Positions RoPE (θ = 10,000)
QK-Norm RMSNorm on Q, K pre-dot-product
Normalisation RMSNorm (pre-norm, ε = 10⁻⁶)
Embeddings Tied input/output (saves ~16.7M)
Vocab 16k BPE
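
As a rough sanity check, the configuration table can be tallied into a parameter count. The sketch below assumes a vocabulary of exactly 16,384 entries (the card only says 16K), bias-free linear layers, and one learned weight vector per RMSNorm; none of these details are confirmed here, so treat the result as an estimate rather than the official count.

```python
# Back-of-the-envelope parameter count from the architecture table.
# ASSUMPTIONS (not stated in the card): vocab = 16,384, no bias terms,
# QK-Norm weights have one entry per head dimension.
VOCAB = 16_384
D_MODEL = 1_024
N_LAYERS = 24
N_Q_HEADS = 16
N_KV_HEADS = 4
HEAD_DIM = 64
FFN_HIDDEN = 3_072


def attention_params() -> int:
    """Q and output projections are full width; K and V use only 4 heads (GQA)."""
    q = D_MODEL * N_Q_HEADS * HEAD_DIM
    kv = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    out = N_Q_HEADS * HEAD_DIM * D_MODEL
    return q + kv + out


def ffn_params() -> int:
    """SwiGLU uses three matrices: gate, up, and down projections."""
    return 3 * D_MODEL * FFN_HIDDEN


def norm_params() -> int:
    """Two pre-norm RMSNorms per layer plus Q and K norms."""
    return 2 * D_MODEL + 2 * HEAD_DIM


def total_params(tied_embeddings: bool = True) -> int:
    per_layer = attention_params() + ffn_params() + norm_params()
    embedding = VOCAB * D_MODEL
    final_norm = D_MODEL
    n_embedding_matrices = 1 if tied_embeddings else 2
    return N_LAYERS * per_layer + n_embedding_matrices * embedding + final_norm


if __name__ == "__main__":
    tied = total_params(True)
    untied = total_params(False)
    print(f"tied embeddings:   {tied / 1e6:.1f}M")
    print(f"untied embeddings: {untied / 1e6:.1f}M")
    print(f"saved by tying:    {(untied - tied) / 1e6:.1f}M")
```

Under these assumptions the tied total comes out at roughly 306M, in line with the headline figure, and tying removes one vocab × hidden matrix from the count.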

Why these choices matter for efficiency

SwiGLU provides 10–15% better parameter efficiency than standard MLPs, making it the single biggest contributor to Helios Nova's ability to learn more per token. GQA shares each of the 4 KV heads across 4 query heads, cutting the KV-cache by 4× relative to full multi-head attention for fast inference on consumer hardware. QK-Norm enables stable training at the high peak LR (3×10⁻⁴), which maximises progress per step without gradient spikes. Depth over width (24 layers at d=1024) follows the MobileLLM finding that deeper models outperform wider ones at this scale.
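
The 4× KV-cache saving follows directly from the head counts. A minimal sketch of the arithmetic, assuming a 16-bit (2-byte) cache, batch size 1, and a full 2,048-token context; the function name is illustrative and not taken from the repository:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    """Size of the key/value cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


# Helios Nova as configured: 4 KV heads shared by 16 query heads.
gqa_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=64, seq_len=2048)

# Hypothetical multi-head baseline: one KV head per query head.
mha_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=64, seq_len=2048)

if __name__ == "__main__":
    mib = 1024 ** 2
    print(f"GQA cache: {gqa_bytes / mib:.0f} MiB")
    print(f"MHA cache: {mha_bytes / mib:.0f} MiB")
    print(f"reduction: {mha_bytes // gqa_bytes}x")
```

With these settings the cache is 48 MiB with GQA against 192 MiB for the multi-head baseline.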

Training

Data & schedule

50B tokens from FineWeb-Edu (sample-100BT). Warmup-Stable-Decay (WSD) schedule: 4k-step warmup → peak LR 3×10⁻⁴ for ~87% of training → cosine decay to 3×10⁻⁵ over the final 10%. WSD outperforms cosine on overtraining runs by keeping the model at peak LR for the vast majority of steps.
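
The schedule described above can be sketched as a plain function of the step index. This is an illustration under stated assumptions, not the repository's scheduler: the warmup is taken to be linear, the total is fixed at exactly 127,000 steps (the card says ~127k), and the names are hypothetical.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 4_000
TOTAL_STEPS = 127_000      # assumed exact value for "~127k"
DECAY_FRACTION = 0.10      # final 10% of training

DECAY_STEPS = round(TOTAL_STEPS * DECAY_FRACTION)
DECAY_START = TOTAL_STEPS - DECAY_STEPS


def wsd_lr(step: int) -> float:
    """Warmup-Stable-Decay learning rate for a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup (assumed shape) from near zero up to the peak.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < DECAY_START:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Cosine decay from the peak down to the floor over the final steps.
    progress = min(1.0, (step - DECAY_START) / DECAY_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    stable_share = (DECAY_START - WARMUP_STEPS) / TOTAL_STEPS
    print(f"stable phase: {stable_share:.1%} of steps")
    for s in (0, WARMUP_STEPS, DECAY_START, TOTAL_STEPS):
        print(f"step {s:>7}: lr = {wsd_lr(s):.2e}")
```

With these numbers the stable phase covers about 87% of the steps, which matches the split quoted in the text.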

Key hyperparameters

AdamW (fused, β₁=0.9, β₂=0.95) · weight decay 0.1 · gradient clipping 1.0 · effective batch 393K tokens/step · bfloat16 + torch.compile · ~127k total steps · 1 epoch
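
The batch size, step count, and token budget are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes "393K" means exactly 393,216 tokens (192 sequences of 2,048 tokens), which is a guess; the throughput figure is only the lower bound implied by the "under 120 hours" claim.

```python
SEQ_LEN = 2_048
TOKENS_PER_STEP = 393_216        # ASSUMED exact value behind "393K"
TOKEN_BUDGET = 50_000_000_000    # 50B tokens, one epoch
MAX_HOURS = 120                  # upper bound quoted in the card

# Number of 2,048-token sequences in one effective batch.
sequences_per_step = TOKENS_PER_STEP // SEQ_LEN

# Optimizer steps needed to consume the full budget.
steps_needed = TOKEN_BUDGET / TOKENS_PER_STEP

# Minimum sustained throughput needed to finish within the time bound.
min_tokens_per_second = TOKEN_BUDGET / (MAX_HOURS * 3600)

if __name__ == "__main__":
    print(f"sequences per step: {sequences_per_step}")
    print(f"steps needed:       {steps_needed:,.0f}")
    print(f"min throughput:     {min_tokens_per_second:,.0f} tokens/s")
```

The step count lands just above 127,000, in line with the "~127k total steps" listed above.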

Benchmark results

Evaluated with lm-evaluation-harness. Zero-shot except MMLU (5-shot). Baselines from SmolLM2 paper Table 4 (arXiv:2502.02737).

Model           Params  Tokens  ARC-C  WinoGrande  PIQA  OBQA  MMLU (5-shot)  Avg
Helios-Nova     306M    50B     28.4   53.1        63.8  33.2  22.9           40.3
OpenELM-270M    270M    1.5T    27.6   53.0        69.8  33.0  25.4           41.8
MobileLLM-350M  350M    250B    29.4   52.3        68.6  33.0  25.5           41.8
Pythia-410M     410M    300B    29.3   53.8        70.4  30.2  25.3           41.8
OpenELM-450M    450M    1.5T    30.1   53.6        72.3  33.6  25.8           43.1
SmolLM-360M     360M    1.4T    42.0   51.5        71.6  36.4  26.2           45.5
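
The Avg column and the headline ratios can be recomputed from the per-task scores. The sketch below copies the numbers from the table; the average is assumed to be the unweighted mean of the five tasks, which reproduces every reported value after rounding.

```python
# (training tokens, [ARC-C, WinoGrande, PIQA, OBQA, MMLU 5-shot]) from the table.
RESULTS = {
    "Helios-Nova":    (50e9,   [28.4, 53.1, 63.8, 33.2, 22.9]),
    "OpenELM-270M":   (1.5e12, [27.6, 53.0, 69.8, 33.0, 25.4]),
    "MobileLLM-350M": (250e9,  [29.4, 52.3, 68.6, 33.0, 25.5]),
    "Pythia-410M":    (300e9,  [29.3, 53.8, 70.4, 30.2, 25.3]),
    "OpenELM-450M":   (1.5e12, [30.1, 53.6, 72.3, 33.6, 25.8]),
    "SmolLM-360M":    (1.4e12, [42.0, 51.5, 71.6, 36.4, 26.2]),
}


def average(model: str) -> float:
    """Unweighted mean over the five benchmark scores."""
    scores = RESULTS[model][1]
    return sum(scores) / len(scores)


def token_ratio(model: str, reference: str = "Helios-Nova") -> float:
    """How many times more training tokens `model` saw than `reference`."""
    return RESULTS[model][0] / RESULTS[reference][0]


if __name__ == "__main__":
    ours = average("Helios-Nova")
    for name in RESULTS:
        avg = average(name)
        print(f"{name:<15} avg={avg:5.1f}  gap={avg - ours:+5.1f}  "
              f"tokens={token_ratio(name):4.0f}x")
```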

Limitations

  • English only. Trained exclusively on English educational content.
  • Not instruction-tuned. Base completion model — no dialogue or instruction following without fine-tuning.
  • 50B-token knowledge scope. Factual recall (MMLU) is the weakest benchmark accordingly: 22.9 is below the 25% random-guess rate for four-choice questions.
  • 2,048-token context. Longer contexts require fine-tuning with extended RoPE.
  • No safety alignment. No RLHF, DPO, or safety filtering.

Intended uses

  • Research on efficient pre-training. A fully reproducible reference for studying data-efficient architectures at sub-500M scale.
  • Educational tool. Clean, self-contained codebase for learning transformer internals and the full LLM lifecycle.
  • Base model for fine-tuning. Starting point for domain-specific adaptation on educational or technical text.
  • On-device / edge deployment. < 3 GB in fp32 — fits on mobile devices, Raspberry Pi, or in-browser via ONNX/WASM.

Reproducibility

Full pipeline at github.com/rafaelespinosamena/Helios-Nova-306M. Every hyperparameter documented in config.yaml. Total cost to reproduce: < $190.

Talk to Helios Nova 306M

The easiest way to run Helios Nova is through the interactive chat interface included in the official repository.

1. Clone the repository

git clone https://github.com/rafaelespinosamena/Helios-Nova-306M.git
cd Helios-Nova-306M

2. Install dependencies

pip install -r requirements.txt

3. Start the interactive chat

python chat.py

The script will automatically:

  • Download the model from Hugging Face
  • Load the tokenizer
  • Select the best device available (CUDA → Apple MPS → CPU)
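
The device preference order (CUDA, then Apple MPS, then CPU) can be sketched as below. This is not the code in chat.py; the function names are hypothetical, and the only PyTorch calls used are torch.cuda.is_available() and torch.backends.mps.is_available(). The import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Return the preferred device string given what is available."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for accelerators, falling back to CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so report CPU.
        return "cpu"
    # Older PyTorch builds may lack the MPS backend module entirely.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(f"selected device: {detect_device()}")
```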

Interactive Chat Controls

While running chat.py you can adjust generation parameters live:

Command Description
!temp 0.7 change temperature
!topk 40 change top-k sampling
!max 512 change generation length
!rep 1.2 change repetition penalty
!stream toggle streaming output
quit / exit exit the program

Example:

You: !max 100
  → max_tokens=100
You: In simple terms, black holes are
Helios Nova: a region of space which is so dense that not even light can escape from it. Black holes do absorb all...
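
To illustrate what the temperature, top-k, and repetition-penalty controls typically do to the next-token distribution, here is a small pure-Python sketch. It is not the sampler in chat.py: the function names are hypothetical, and the repetition penalty follows one common convention (divide positive logits by the penalty, multiply negative ones) that is assumed rather than confirmed for this repository.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty=1.2):
    """Make already-generated tokens less likely (assumed convention)."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def sampling_probs(logits, temperature=0.7, top_k=40):
    """Temperature-scale the logits, keep the top-k, and softmax the rest."""
    scaled = [x / temperature for x in logits]
    k = min(top_k, len(scaled))
    cutoff = sorted(scaled, reverse=True)[k - 1]
    # Tokens below the k-th largest logit get zero probability.
    kept = [x if x >= cutoff else -math.inf for x in scaled]
    peak = max(kept)
    exps = [math.exp(x - peak) for x in kept]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    penalised = apply_repetition_penalty(logits, seen_ids=[0])
    for t in (0.7, 1.0):
        probs = sampling_probs(penalised, temperature=t, top_k=2)
        print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, while a smaller top-k removes low-probability tokens from consideration entirely.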

For more details see the full repository on GitHub: https://github.com/rafaelespinosamena/Helios-Nova-306M

Device compatibility

Platform Device string RAM
NVIDIA GPU device="cuda" ~2 GB VRAM
Apple Silicon device="mps" ~3 GB
CPU device="cpu" ~3 GB

Citation

@misc{espinosamena2025heliosnova,
  title   = {Helios Nova: A Budget-Efficient 306M Parameter Language Model},
  author  = {Espinosa Mena, Rafael},
  year    = {2026},
  url     = {https://github.com/rafaelespinosamena/Helios-Nova-306M},
  note    = {306M dense transformer, 50B tokens, single H100, under \$190 USD}
}

Acknowledgements

Baselines from the SmolLM2 paper (Allal et al. 2025). Architecture informed by SwiGLU (Shazeer 2020), GQA (Ainslie et al. 2023), QK-Norm (Dehghani et al. 2023), RoPE (Su et al. 2021), and depth-over-width scaling (MobileLLM, Liu et al. 2024).
