OdinNext-138M-Instruct

OdinNext is a 138M-parameter causal language model that replaces softmax self-attention with an HGRN2-style gated linear recurrence. This repository is the instruction-tuned variant of joelhenwang/OdinNext-138M-Base: the base was post-trained with SFT and sequence-level knowledge distillation (SeqKD) from a 1.2B instruct teacher (LFM2.5-1.2B-Instruct) into a concise, ChatML-formatted assistant.

This is a small instruction-follower, not a knowledge model. Factual accuracy is fundamentally limited at 138M parameters — it follows instructions and produces structured, assistant-style answers, but it will hallucinate facts. Use it for research on compact recurrent/linear-attention chat models, not for knowledge-sensitive applications.

  • Repo: joelhenwang/OdinNext-138M-Instruct
  • Base: joelhenwang/OdinNext-138M-Base (138M HGRN2 recurrent base)
  • Chat format: ChatML (<|im_start|> / <|im_end|>), 32,770-token vocabulary.
  • Context window: 2,048 tokens in the released inference code.
  • License: Apache-2.0.

Uses custom Transformers code. Loading with trust_remote_code=True executes Python from this repo. Review the files or pin a commit before trusting it.

At a glance

| Item | Value |
| --- | --- |
| Parameters | ~138M (HGRN2 recurrent) |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,770 (32,768 base BPE + 2 ChatML special tokens) |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm |
| Cache type | Fixed-size recurrent state, not a growing KV cache |
| Post-training | SFT + SeqKD (ChatML) on the EMA base |

Architecture

Decoder-only causal LM, 16 identical pre-norm blocks:

x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(ZCRMSNorm(x))

The HGRN2 recurrent state updates per token as:

S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t

with a per-layer state shaped [B, n_heads, head_f_dim, head_i_dim] = [B, 6, 128, 128], constant in size with respect to context length (O(1)-per-token decoding, not a growing KV cache).
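
A minimal sketch of one decoding step for a single head, written in dependency-free Python to make the update rule above concrete. The function name `hgrn2_step` and the list-of-lists state layout are illustrative assumptions, not the repository's code, which works on batched tensors. The sketch treats q, k, v and g as given inputs and leaves out how the model computes them.

```python
import math

def hgrn2_step(state, q, k, v, g):
    """Apply S_t = diag(exp(g_t)) S_{t-1} + k_t (outer) v_t, then o_t = q_t S_t.

    state: d_k rows of d_v floats (S_{t-1}); q, k, g: length d_k; v: length d_v.
    Returns (new_state, output) and leaves `state` unmodified.
    """
    # Row i of the state is decayed by exp(g_i), then receives k_i * v.
    new_state = [
        [math.exp(g_i) * s_ij + k_i * v_j for s_ij, v_j in zip(row, v)]
        for row, k_i, g_i in zip(state, k, g)
    ]
    # o_t[j] = sum_i q_i * S_t[i][j]
    output = [
        sum(q_i * row[j] for q_i, row in zip(q, new_state))
        for j in range(len(v))
    ]
    return new_state, output

# Toy dimensions (2 x 3) instead of the model's 128 x 128 per head.
state = [[0.0] * 3 for _ in range(2)]
state, o1 = hgrn2_step(state, q=[1.0, 0.0], k=[1.0, 2.0],
                       v=[1.0, 0.0, -1.0], g=[0.0, 0.0])
state, o2 = hgrn2_step(state, q=[1.0, 1.0], k=[0.0, 1.0],
                       v=[1.0, 1.0, 1.0], g=[math.log(0.5), 0.0])
```

The state keeps the same 2 × 3 shape after every step, which is the property that makes the per-token decoding cost independent of context length.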

Hybrid RoPE: even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000); odd layers are position-free. Tied embedding / LM head. No linear biases.
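
The layer alternation can be sketched as follows. This uses the textbook RoPE formulation with θ = 100,000 from the card. The adjacent-pair rotation layout and the helper names `layer_uses_rope` and `rope_rotate` are assumptions made for illustration, and the released code may pair dimensions differently.

```python
import math

ROPE_THETA = 100_000.0  # theta stated on the card
NUM_LAYERS = 16

def layer_uses_rope(layer_idx):
    """Even layers rotate q/k; odd layers are position-free."""
    return layer_idx % 2 == 0

def rope_rotate(vec, position, theta=ROPE_THETA):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles.

    Assumes an even-length vector and adjacent-pair layout (an assumption).
    """
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency theta^(-i/dim), i.e. theta^(-2p/dim) for pair p.
        angle = position * theta ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

rope_layers = [i for i in range(NUM_LAYERS) if layer_uses_rope(i)]
```

Because each pair is a pure rotation, the dot product between a rotated q and a rotated k depends only on the difference between their positions.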

Post-training

Starting from the EMA base (OdinNext-138M-Base), three stages were trialed and the SeqKD variant was selected:

| Stage | Data | Result |
| --- | --- | --- |
| SFT (full-param, CE-only) | smol-smoltalk + no_robots (~549M tokens, ChatML) | fluent, hallucinates |
| DPO | UltraFeedback | rejected: over-optimized to incoherent output on a weak base |
| SeqKD (selected) | LFM2.5-1.2B-Instruct → 10K ChatML completions, 3 SFT epochs | best: concise assistant answers, learned teacher formatting (markdown / LaTeX) |

The released weights are the SeqKD checkpoint. Knowledge benchmarks are preserved relative to the base (the post-training is style/format, not new knowledge).

Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16 the recurrent state is constant:

layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB

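This is independent of generated length (the pure-PyTorch fallback promotes the scan state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache (16 layers, hidden size 768, one key and one value vector per layer per token) would instead grow linearly at 48 KiB per token: ≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K. The comparison covers cache state only.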
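
The arithmetic can be reproduced with a short sketch. The helper names are hypothetical, and the KV-cache formula assumes a plain multi-head Transformer with 16 layers and hidden size 768 (no grouped-query or multi-query sharing), which is the configuration that yields the figures quoted above.

```python
MIB = 1024 ** 2

def recurrent_state_bytes(layers=16, heads=6, head_f_dim=128,
                          head_i_dim=128, bytes_per_el=2):
    """Fixed recurrent state size for batch size 1 (fp16 by default)."""
    return layers * heads * head_f_dim * head_i_dim * bytes_per_el

def kv_cache_bytes(tokens, layers=16, hidden=768, bytes_per_el=2):
    """KV cache for an assumed same-depth Transformer: one K and one V per layer."""
    return 2 * layers * hidden * bytes_per_el * tokens

state_mib = recurrent_state_bytes() / MIB
for tokens in (1024, 16 * 1024):
    kv_mib = kv_cache_bytes(tokens) / MIB
    print(f"{tokens:>6} tokens: state {state_mib:.1f} MiB, KV cache {kv_mib:.1f} MiB")
```

Passing `bytes_per_el=4` to `recurrent_state_bytes` gives the fp32 figure for the pure-PyTorch fallback.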

Results

OdinNext numbers are zero-shot results from lm-evaluation-harness, measured on the full validation/test sets: HellaSwag reports acc_norm, ARC reports the mean of ARC-Easy and ARC-Challenge acc, and PIQA reports acc. The other rows are as reported by Axiomic Labs on the GPT-X2-125M card.

Correction (2026-06-09): an earlier version of this card reported HellaSwag ~33%. That figure was computed on only the first 2,000 HellaSwag validation examples, which score ~5–6 points higher than the full 10,042-example set. The table below is the corrected full-set result and reproduces under lm-eval (thanks to the Axiomic Labs leaderboard for catching this).

| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
| --- | --- | --- | --- | --- | --- |
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| This work | OdinNext-138M-Instruct | 28.82% | 33.12% | 59.41% | 101.6B + post-train |

Benchmarks are essentially unchanged from the base model — the instruction tuning optimizes assistant style/format, not commonsense-continuation (HellaSwag) or knowledge. All numbers are below the SmolLM / GPT-X2 frontier; the model was trained end-to-end on two consumer AMD mini-PCs.

Usage

pip install "transformers>=4.46" torch safetensors

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Instruct"
# Use fp16 (the checkpoint dtype) on GPU; fall back to fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# trust_remote_code=True executes the custom modeling code shipped in this repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=dtype,
).to(device).eval()

# Render the conversation as ChatML and append the assistant header.
messages = [{"role": "user", "content": "Explain photosynthesis in one sentence."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
).to(device)

with torch.inference_mode():
    out = model.generate(
        inputs,
        max_new_tokens=128,
        do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id, use_cache=True,
    )
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))

If you build prompts manually, the ChatML format is:

<|im_start|>user
{your message}<|im_end|>
<|im_start|>assistant

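A manual prompt builder for the layout above might look like the sketch below. The function name `build_chatml_prompt` is hypothetical, and the newline placement after `<|im_end|>` and after the assistant header is an assumption based on the layout shown; the tokenizer's own chat template remains the authoritative reference.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt(
    [{"role": "user", "content": "Explain photosynthesis in one sentence."}]
)
```

Comparing this string with `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is a quick way to confirm that the manual format matches the released template.
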
Batching guidance

The recurrent scan does not apply an attention mask, so pad tokens update the recurrent state like any other token. For correct batched generation, avoid left padding, prefer prompts of identical token length, and verify batched output against single-sample output before relying on it. Single-prompt generation is the safest path.
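
One way to follow the same-length advice is to bucket prompts by token count so that each batch needs no padding at all. This is a minimal sketch; `group_by_length` is a hypothetical helper, and the inputs are assumed to be already-tokenized prompts (for example, lists of ids produced by the chat template).

```python
from collections import defaultdict

def group_by_length(token_id_lists):
    """Map prompt length -> list of (original_index, token_ids).

    Every bucket holds prompts of identical length, so a batch built from one
    bucket contains no pad tokens that could leak into the recurrent state.
    """
    buckets = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        buckets[len(ids)].append((idx, ids))
    return dict(buckets)

# Toy token ids; real prompts would come from the tokenizer.
prompts = [[11, 12, 13], [21, 22], [31, 32, 33], [41]]
batches = group_by_length(prompts)
```

The original indices are kept so that generated outputs can be restored to the input order after each bucket has been processed.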

What this model is good for

  • Short, ChatML-formatted instruction following and assistant-style replies.
  • Research on compact recurrent / linear-attention chat models and fixed-state decoding.

Do not use it for knowledge-sensitive or safety-sensitive generation, or for benchmark claims without running your own evaluation.

Limitations

  • Capacity-limited: 138M parameters — it hallucinates facts and is not a knowledge model.
  • No safety/alignment training beyond SFT+SeqKD: outputs can be biased, false, or incoherent.
  • Hard 2,048-token cap: recurrent state is constant, but the released RoPE cache limits cumulative positions to 2,048.
  • attention_mask ignored in the backbone; padding affects recurrent state.
  • English-focused; multilingual / code ability is uncharacterized.
  • Benchmarks above are zero-shot, full-set, via lm-evaluation-harness — run your own evaluation to confirm.
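
To stay under the 2,048-position cap, the generation budget can be clamped before calling `generate`. The sketch below is a conservative assumption (prompt tokens plus new tokens must not exceed 2,048), and `max_new_tokens_allowed` is a hypothetical helper, not part of the released code.

```python
MAX_POSITIONS = 2048  # cumulative position cap stated on the card

def max_new_tokens_allowed(prompt_len, requested, max_positions=MAX_POSITIONS):
    """Clamp a requested generation length so prompt + output fits the cap."""
    if prompt_len >= max_positions:
        raise ValueError(
            f"prompt uses {prompt_len} positions; the cap is {max_positions}"
        )
    return min(requested, max_positions - prompt_len)

budget = max_new_tokens_allowed(prompt_len=2000, requested=128)
```

With the usage example above, this would be called as `max_new_tokens_allowed(inputs.shape[1], 128)` and the result passed as `max_new_tokens`.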

Citation

@misc{odinnext_138m_instruct_2026,
  title        = {OdinNext-138M-Instruct},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
  note         = {138M HGRN2 recurrent instruction-tuned (SFT + SeqKD) checkpoint}
}

References

  • Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
  • Bowen Peng et al. Efficient Pre-Training with Token Superposition. arXiv:2605.06546.
  • Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
  • Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.