OdinNext-138M-Instruct
OdinNext is a 138M-parameter causal language model that replaces softmax
self-attention with an HGRN2-style gated linear recurrence. This repository
is the instruction-tuned variant of
joelhenwang/OdinNext-138M-Base:
the base was post-trained with SFT and sequence-level knowledge distillation
(SeqKD) from a 1.2B instruct teacher (LFM2.5-1.2B-Instruct) into a concise, ChatML-formatted
assistant.
This is a small instruction-follower, not a knowledge model. Factual accuracy is fundamentally limited at 138M parameters — it follows instructions and produces structured, assistant-style answers, but it will hallucinate facts. Use it for research on compact recurrent/linear-attention chat models, not for knowledge-sensitive applications.
- Repo: joelhenwang/OdinNext-138M-Instruct
- Base: joelhenwang/OdinNext-138M-Base (138M HGRN2 recurrent base)
- Chat format: ChatML (<|im_start|> / <|im_end|>), 32,770-token vocabulary.
- Context window: 2,048 tokens in the released inference code.
- License: Apache-2.0.
Uses custom Transformers code. Loading with trust_remote_code=True executes Python from this repo. Review the files or pin a commit before trusting it.
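One way to act on the pinning advice is to pass a commit hash through the `revision` argument of `from_pretrained`. The sketch below is illustrative only: the helper name `pinned_load_kwargs` and its SHA check are hypothetical and not part of this repo, and the commit hash must be taken from the repo history after reviewing the files.

```python
import re

REPO = "joelhenwang/OdinNext-138M-Instruct"


def pinned_load_kwargs(commit_sha: str) -> dict:
    """Build from_pretrained keyword arguments pinned to one reviewed commit.

    Hypothetical helper: rejects branch names such as "main" so that the
    remote code being executed cannot change between runs.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character commit hash")
    return {"trust_remote_code": True, "revision": commit_sha}


# Usage (needs network access and the transformers package):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(REPO, **pinned_load_kwargs(reviewed_sha))
```

The same keyword arguments can be passed to `AutoTokenizer.from_pretrained` so that tokenizer and model come from the same commit.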
At a glance
| Item | Value |
|---|---|
| Parameters | ~138M (HGRN2 recurrent) |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,770 (32,768 base BPE + 2 ChatML special tokens) |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm |
| Cache type | Fixed-size recurrent state, not a growing KV cache |
| Post-training | SFT + SeqKD (ChatML) on the EMA base |
Architecture
Decoder-only causal LM, 16 identical pre-norm blocks:
x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn) * SwiGLU²(ZCRMSNorm(x))
The HGRN2 recurrent state updates per token as:
S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t
with a per-layer state shaped [B, n_heads, head_f_dim, head_i_dim] =
[B, 6, 128, 128], constant in size with respect to context length
(O(1)-per-token decoding, not a growing KV cache).
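The state update above can be sketched for a single head in plain Python. This is a minimal illustration of the two equations, not the released kernel: the function names are hypothetical, the dimensions are toy-sized, and how q, k, v, and g are produced from the hidden state is left out because it is not described here.

```python
import math


def init_state(f_dim, i_dim):
    """Zero state S_0 with shape [head_f_dim, head_i_dim] for one head."""
    return [[0.0] * i_dim for _ in range(f_dim)]


def hgrn2_step(S, q, k, v, g):
    """One token step for one head.

    S: f_dim x i_dim state; q, k, g: length f_dim; v: length i_dim.
    g holds log-decay values, so exp(g) scales each state row.
    """
    f_dim, i_dim = len(k), len(v)
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
    S_new = [
        [math.exp(g[a]) * S[a][b] + k[a] * v[b] for b in range(i_dim)]
        for a in range(f_dim)
    ]
    # o_t = q_t S_t (row vector times matrix)
    o = [sum(q[a] * S_new[a][b] for a in range(f_dim)) for b in range(i_dim)]
    return S_new, o


# Toy run: the state keeps the same shape no matter how many tokens are fed.
S = init_state(2, 2)
for _ in range(5):
    S, o = hgrn2_step(S, q=[1.0, 1.0], k=[1.0, 2.0], v=[3.0, 4.0], g=[-0.1, -0.2])
```

Because each step reads and writes only `S`, decoding cost per token does not depend on how many tokens came before.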
Hybrid RoPE: even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000); odd layers are position-free. Tied embedding / LM head. No linear biases.
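The alternating pattern can be written out as follows. This sketch uses the standard RoPE frequency formula with the θ stated above; the rotary width of 128 (one head's width) and the pairing of adjacent components are assumptions, since the released code's exact convention is not described here.

```python
import math

N_LAYERS = 16
ROPE_THETA = 100_000.0
HEAD_DIM = 128  # assumption: rotary width equals the per-head width

# Even layers rotate q/k; odd layers are position-free.
rope_layers = [i for i in range(N_LAYERS) if i % 2 == 0]


def rope_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Standard RoPE inverse frequencies: theta^(-2j/d) for j in [0, d/2)."""
    return [theta ** (-2 * j / head_dim) for j in range(head_dim // 2)]


def rope_rotate_pair(x0, x1, pos, inv_freq_j):
    """Rotate one (x0, x1) component pair by the angle pos * inv_freq_j."""
    angle = pos * inv_freq_j
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c
```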
Post-training
Starting from the EMA base (OdinNext-138M-Base), three stages were trialed
and the SeqKD variant was selected:
| Stage | Data | Result |
|---|---|---|
| SFT (full-param, CE-only) | smol-smoltalk + no_robots (~549M tokens, ChatML) | fluent, hallucinates |
| DPO (UltraFeedback) | — | rejected — over-optimized to incoherent output on a weak base |
| SeqKD (selected) | LFM2.5-1.2B-Instruct → 10K ChatML completions, 3 SFT epochs | best — concise assistant answers, learned teacher formatting (markdown / LaTeX) |
The released weights are the SeqKD checkpoint. Knowledge benchmarks are preserved relative to the base (the post-training is style/format, not new knowledge).
Memory: recurrent state vs Transformer KV cache
For batch size 1 in fp16 the recurrent state is constant:
layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB
independent of generated length (the pure-PyTorch fallback promotes the scan state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache would grow linearly (≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K). Cache-state comparison only.
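The figures above can be reproduced with a few lines of arithmetic. The sketch assumes the comparison Transformer has the same depth (16 layers) and hidden size (768) with full multi-head attention, that is, one hidden-size key and one hidden-size value vector per layer per token; the function names are illustrative.

```python
LAYERS, HEADS, F_DIM, I_DIM = 16, 6, 128, 128
HIDDEN = 768
MIB = 1024 * 1024


def recurrent_state_bytes(bytes_per_el=2, batch=1):
    """Fixed recurrent state size; does not depend on sequence length."""
    return batch * LAYERS * HEADS * F_DIM * I_DIM * bytes_per_el


def kv_cache_bytes(n_tokens, bytes_per_el=2, batch=1):
    """KV cache of an assumed same-depth Transformer (K and V, no GQA)."""
    return batch * 2 * LAYERS * HIDDEN * n_tokens * bytes_per_el


print(recurrent_state_bytes() / MIB)                # fp16 recurrent state
print(recurrent_state_bytes(bytes_per_el=4) / MIB)  # fp32 fallback state
for n in (1024, 16384):
    print(n, kv_cache_bytes(n) / MIB)               # KV cache at n tokens
```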
Results
Zero-shot, lm-evaluation-harness (HellaSwag = acc_norm, ARC = mean of Easy + Challenge acc, PIQA = acc), measured on the full validation/test sets. Other rows are as reported by Axiomic Labs on the GPT-X2-125M card.
Correction (2026-06-09): an earlier version of this card reported HellaSwag ~33%. That figure was computed on only the first 2,000 HellaSwag validation examples, which score ~5–6 points higher than the full 10,042-example set. The table below is the corrected full-set result and reproduces under lm-eval (thanks to the Axiomic Labs leaderboard for catching this).
| Company | Model | HellaSwag | ARC (avg) | PIQA | Training tokens |
|---|---|---|---|---|---|
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Meta | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| This work | OdinNext-138M-Instruct | 28.82% | 33.12% | 59.41% | 101.6B + post-train |
Benchmarks are essentially unchanged from the base model — the instruction tuning optimizes assistant style/format, not commonsense-continuation (HellaSwag) or knowledge. All numbers are below the SmolLM / GPT-X2 frontier; the model was trained end-to-end on two consumer AMD mini-PCs.
Usage
pip install "transformers>=4.46" torch safetensors

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 on GPU (matches the checkpoint dtype), fp32 on CPU
dtype = torch.float16 if device == "cuda" else torch.float32

# trust_remote_code=True executes the custom modeling code from the repo
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=dtype,
).to(device).eval()

messages = [{"role": "user", "content": "Explain photosynthesis in one sentence."}]
# Render the ChatML prompt and append the assistant header
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
).to(device)

with torch.inference_mode():
    out = model.generate(
        inputs,
        max_new_tokens=128,
        do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id, use_cache=True,
    )

# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
If you build prompts manually, the ChatML format is:
<|im_start|>user
{your message}<|im_end|>
<|im_start|>assistant
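A manual prompt builder for the format shown above might look like the sketch below. The function name is hypothetical, and the newline after each `<|im_end|>` and after the final assistant header follows common ChatML templates rather than anything confirmed for this repo; the tokenizer's own chat template remains the authoritative source.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Explain photosynthesis in one sentence."}]
)
```

Comparing this string with the output of `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is a quick way to check the assumed whitespace.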
Batching guidance
The recurrent scan does not apply an attention mask. For correct batched generation: avoid left padding, prefer same-length prompts, and verify batched output against single-sample output before relying on it. Single-prompt generation is the safest path.
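One way to follow this guidance is to batch only prompts that tokenize to the same length, so no padding enters the recurrent state, and to compare batched and single-sample token ids. The helpers below are a hypothetical sketch operating on plain lists of token ids; they are not part of the released code.

```python
from collections import defaultdict


def bucket_by_length(token_id_lists):
    """Map prompt length -> indices of prompts with exactly that length."""
    buckets = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        buckets[len(ids)].append(idx)
    return dict(buckets)


def first_mismatch(batched_ids, single_ids):
    """Index of the first differing token, or None if the sequences agree."""
    for pos, (a, b) in enumerate(zip(batched_ids, single_ids)):
        if a != b:
            return pos
    if len(batched_ids) != len(single_ids):
        return min(len(batched_ids), len(single_ids))
    return None
```

Each bucket can then be passed to `generate` as one unpadded batch. The comparison is only meaningful with greedy decoding (`do_sample=False`), since sampled outputs differ between runs regardless of batching.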
What this model is good for
- Short, ChatML-formatted instruction following and assistant-style replies.
- Research on compact recurrent / linear-attention chat models and fixed-state decoding.
Do not use it for knowledge-sensitive or safety-sensitive generation, or for benchmark claims without running your own evaluation.
Limitations
- Capacity-limited: 138M parameters — it hallucinates facts and is not a knowledge model.
- No safety/alignment training beyond SFT+SeqKD: outputs can be biased, false, or incoherent.
- Hard 2,048-token cap: recurrent state is constant, but the released RoPE cache limits cumulative positions to 2,048.
- attention_mask ignored in the backbone; padding affects recurrent state.
- English-focused; multilingual / code ability is uncharacterized.
- Benchmarks above are zero-shot, full-set, via lm-evaluation-harness — run your own evaluation to confirm.
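The 2,048-position cap listed above can be respected by clamping the generation budget to what remains after the prompt. This is a minimal sketch with a hypothetical helper name; it assumes prompt tokens and generated tokens count against the same 2,048 positions.

```python
MAX_POSITIONS = 2048


def clamp_new_tokens(prompt_len, requested, max_positions=MAX_POSITIONS):
    """Largest max_new_tokens that keeps prompt + output within the cap."""
    if prompt_len >= max_positions:
        raise ValueError("prompt already fills the position budget")
    return min(requested, max_positions - prompt_len)


# Example: a 2,000-token prompt leaves room for 48 new tokens.
budget = clamp_new_tokens(prompt_len=2000, requested=128)
```

The result can be passed as `max_new_tokens` to `generate`, with `prompt_len` taken from the tokenized prompt's sequence length.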
Citation
@misc{odinnext_138m_instruct_2026,
title = {OdinNext-138M-Instruct},
author = {Wang, Joel},
year = {2026},
howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
note = {138M HGRN2 recurrent instruction-tuned (SFT + SeqKD) checkpoint}
}
References
- Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
- Bowen Peng et al. Efficient Pre-Training with Token Superposition. arXiv:2605.06546.
- Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
- Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.