OdinNext-138M-Instruct

OdinNext is a 138M-parameter causal language model that replaces softmax self-attention with an HGRN2-style gated linear recurrence. This repository is the instruction-tuned variant of joelhenwang/OdinNext-138M-Base: the base was post-trained with SFT and sequence-level knowledge distillation (SeqKD) from a 1.2B instruct teacher (LFM2.5-1.2B-Instruct) into a concise, ChatML-formatted assistant.

This is a small instruction-follower, not a knowledge model. Factual accuracy is fundamentally limited at 138M parameters — it follows instructions and produces structured, assistant-style answers, but it will hallucinate facts. Use it for research on compact recurrent/linear-attention chat models, not for knowledge-sensitive applications.

Repo: joelhenwang/OdinNext-138M-Instruct
Base: joelhenwang/OdinNext-138M-Base (138M HGRN2 recurrent base)
Chat format: ChatML (<|im_start|> / <|im_end|>), 32,770-token vocabulary.
Context window: 2,048 tokens in the released inference code.
License: Apache-2.0.

Uses custom Transformers code. Loading with trust_remote_code=True executes Python from this repo. Review the files or pin a commit before trusting it.
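
One way to act on the advice above is to pin the download to a commit you have reviewed, using the `revision` argument of `from_pretrained`. The sketch below only builds the keyword arguments; the helper name `pinned_load_kwargs` and the placeholder hash in the usage comment are hypothetical and do not come from this repo.

```python
import re

REPO = "joelhenwang/OdinNext-138M-Instruct"

def pinned_load_kwargs(commit_sha: str) -> dict:
    """Build from_pretrained kwargs that pin the repo to one reviewed commit.

    `commit_sha` is expected to be the full 40-character hex hash.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character lowercase hex commit hash")
    return {"revision": commit_sha, "trust_remote_code": True}

# Usage (placeholder hash; replace it with a commit you have actually reviewed):
# from transformers import AutoModelForCausalLM
# kwargs = pinned_load_kwargs("0123456789abcdef0123456789abcdef01234567")
# model = AutoModelForCausalLM.from_pretrained(REPO, **kwargs)
```

Pinning to a hash rather than a branch name means later pushes to the repo cannot silently change the code that gets executed.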

At a glance

| Item | Value |
| --- | --- |
| Parameters | ~138M (HGRN2 recurrent) |
| Layers | 16 |
| Hidden size | 768 |
| Heads | 6 |
| Head state dims | 128 × 128 per head |
| FFN inner size | 2,048 |
| Vocabulary | 32,770 (32,768 base BPE + 2 ChatML special tokens) |
| Max sequence length | 2,048 |
| Checkpoint dtype | fp16 |
| Architecture | HGRN2 recurrence + alternating RoPE + SwiGLU² FFN + ZCRMSNorm |
| Cache type | Fixed-size recurrent state, not a growing KV cache |
| Post-training | SFT + SeqKD (ChatML) on the EMA base |

Architecture

Decoder-only causal LM, 16 identical pre-norm blocks:

x = x + sigmoid(gate_attn) * HGRN2(ZCRMSNorm(x))
x = x + sigmoid(gate_ffn)  * SwiGLU²(ZCRMSNorm(x))
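
The gated residual pattern in the two lines above can be sketched for a single token vector as follows. This is a minimal illustration, not the repo's code: it assumes one gate logit per channel, uses a plain RMSNorm as a stand-in for ZCRMSNorm, and takes the sublayer (HGRN2 or SwiGLU²) as an arbitrary callable.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rms_norm(x, eps=1e-6):
    """Plain RMSNorm without a learned scale (stand-in; ZCRMSNorm is not reproduced)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def gated_residual(x, gate_logits, sublayer, norm=rms_norm):
    """Return x + sigmoid(gate) * sublayer(norm(x)), elementwise over one token vector."""
    y = sublayer(norm(x))
    return [xi + sigmoid(g) * yi for xi, g, yi in zip(x, gate_logits, y)]

x = [1.0, -2.0, 3.0]
double = lambda h: [2.0 * v for v in h]  # toy sublayer for illustration only

# A gate logit of 0 gives sigmoid(0) = 0.5, so the sublayer output enters at half strength.
half_open = gated_residual(x, [0.0, 0.0, 0.0], double)
# A strongly negative logit closes the gate and the block reduces to the identity.
closed = gated_residual(x, [-50.0, -50.0, -50.0], double)
```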

The HGRN2 recurrent state updates per token as:

S_t = diag(exp(g_t)) S_{t-1} + k_t ⊗ v_t
o_t = q_t S_t

with a per-layer state shaped [B, n_heads, head_f_dim, head_i_dim] = [B, 6, 128, 128], constant in size with respect to context length (O(1)-per-token decoding, not a growing KV cache).
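
The recurrence above can be traced by hand with a toy single-head example. The sketch below implements only the two equations as written in this card, in pure Python with tiny dimensions (2 × 3 instead of 128 × 128); the function name `hgrn2_step` and the input values are made up for illustration.

```python
import math

def hgrn2_step(S, q, k, v, g):
    """One recurrent step for a single head.

    S: state as a list of f rows with i columns.
    q, k, g: length-f vectors; exp(g) gives the per-row decay factors.
    v: length-i vector.
    Returns (new_state, output), where output has length i.
    """
    f, i = len(k), len(v)
    # S_t = diag(exp(g_t)) S_{t-1} + k_t (outer product) v_t
    S_new = [[math.exp(g[r]) * S[r][c] + k[r] * v[c] for c in range(i)] for r in range(f)]
    # o_t = q_t S_t
    o = [sum(q[r] * S_new[r][c] for r in range(f)) for c in range(i)]
    return S_new, o

f_dim, i_dim = 2, 3  # toy sizes; the card lists 128 x 128 per head
S = [[0.0] * i_dim for _ in range(f_dim)]
tokens = [
    # (q, k, v, g)
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0, 3.0], [0.0, 0.0]),
    ([1.0, 1.0], [0.0, 1.0], [4.0, 5.0, 6.0], [math.log(0.5), 0.0]),
]
outputs = []
for q, k, v, g in tokens:
    S, o = hgrn2_step(S, q, k, v, g)
    outputs.append(o)

# The state keeps the same shape however many tokens are processed.
state_shape = (len(S), len(S[0]))
```

After the second token the first state row has decayed by one half and the second row holds the new key-value outer product, so the outputs are approximately `[1, 2, 3]` and `[4.5, 6, 7.5]`.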

Hybrid RoPE: even layers (0, 2, …, 14) apply RoPE to q/k (θ = 100,000); odd layers are position-free. Tied embedding / LM head. No linear biases.
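
The alternating schedule can be sketched as below. The layer rule and θ come from the paragraph above; the rotation itself is the standard RoPE formulation with interleaved pairs, which is an assumption here since the card does not say which pairing convention the released code uses.

```python
import math

ROPE_THETA = 100_000.0  # base stated in the card
N_LAYERS = 16

def layer_uses_rope(layer_idx: int) -> bool:
    """Even-indexed layers rotate q/k; odd-indexed layers are position-free."""
    return layer_idx % 2 == 0

def rope_rotate(x, position, theta=ROPE_THETA):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles.

    Assumes an even-length vector and the interleaved pairing convention.
    """
    d = len(x)
    out = []
    for j in range(0, d, 2):
        angle = position * theta ** (-j / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[j] * c - x[j + 1] * s, x[j] * s + x[j + 1] * c]
    return out

rope_layers = [i for i in range(N_LAYERS) if layer_uses_rope(i)]
rotated = rope_rotate([1.0, 2.0, 3.0, 4.0], position=7)
```

Each pair is rotated without changing its length, so the rotation encodes position in the q/k dot product while leaving vector norms intact.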

Post-training

Starting from the EMA base (OdinNext-138M-Base), three stages were trialed and the SeqKD variant was selected:

| Stage | Data | Result |
| --- | --- | --- |
| SFT (full-param, CE-only) | smol-smoltalk + no_robots (~549M tokens, ChatML) | fluent, hallucinates |
| DPO (UltraFeedback) | - | rejected: over-optimized to incoherent output on a weak base |
| SeqKD (selected) | LFM2.5-1.2B-Instruct → 10K ChatML completions, 3 SFT epochs | best: concise assistant answers, learned teacher formatting (markdown / LaTeX) |

The released weights are the SeqKD checkpoint. Knowledge-benchmark scores are essentially unchanged from the base: post-training changes style and format and adds no new knowledge.

Memory: recurrent state vs Transformer KV cache

For batch size 1 in fp16 the recurrent state is constant:

layers × heads × head_f_dim × head_i_dim × bytes
= 16 × 6 × 128 × 128 × 2 = 3,145,728 bytes ≈ 3.0 MiB

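independent of generated length (the pure-PyTorch fallback promotes the scan state to fp32, ≈ 6.0 MiB). A same-depth fp16 Transformer KV cache would grow linearly (≈ 48 MiB at 1K tokens, ≈ 768 MiB at 16K). These figures are consistent with 16 layers storing full keys and values at hidden size 768: 2 × 16 × 768 × 2 bytes = 48 KiB per token. The comparison covers cache/state memory only.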
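
The arithmetic above can be reproduced with a few lines. The recurrent-state formula is taken from this section; the KV-cache formula assumes a hypothetical Transformer with 16 layers, hidden size 768, and full (non-grouped) keys and values, which happens to match the 48 MiB and 768 MiB figures quoted.

```python
LAYERS, HEADS, HEAD_F, HEAD_I, HIDDEN = 16, 6, 128, 128, 768
FP16_BYTES = 2
MIB = 1024 ** 2

def recurrent_state_bytes(bytes_per_elem=FP16_BYTES):
    """Fixed state size: layers x heads x head_f_dim x head_i_dim x bytes."""
    return LAYERS * HEADS * HEAD_F * HEAD_I * bytes_per_elem

def kv_cache_bytes(n_tokens, bytes_per_elem=FP16_BYTES):
    """Assumed comparison model: K and V each hold HIDDEN values per layer per token."""
    return 2 * LAYERS * HIDDEN * n_tokens * bytes_per_elem

print(recurrent_state_bytes() / MIB)     # 3.0
print(recurrent_state_bytes(4) / MIB)    # 6.0 (fp32 scan state)
print(kv_cache_bytes(1024) / MIB)        # 48.0
print(kv_cache_bytes(16 * 1024) / MIB)   # 768.0
```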

Results

Zero-shot, lm-evaluation-harness (HellaSwag = acc_norm, ARC = mean of Easy + Challenge acc, PIQA = acc), measured on the full validation/test sets. Other rows are as reported by Axiomic Labs on the GPT-X2-125M card.

Correction (2026-06-09): an earlier version of this card reported HellaSwag ~33%. That figure was computed on only the first 2,000 HellaSwag validation examples, which score ~5–6 points higher than the full 10,042-example set. The table below is the corrected full-set result and reproduces under lm-eval (thanks to the Axiomic Labs leaderboard for catching this).

| Company | Model | HellaSwag (acc_norm) | ARC (avg acc) | PIQA (acc) | Training tokens |
| --- | --- | --- | --- | --- | --- |
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 2T |
| Axiomic Labs | GPT-X2-125M | 40.55% | 39.90% | 66.97% | 75B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 300B |
| This work | OdinNext-138M-Instruct | 28.82% | 33.12% | 59.41% | 101.6B + post-train |

Benchmarks are essentially unchanged from the base model — the instruction tuning optimizes assistant style/format, not commonsense-continuation (HellaSwag) or knowledge. All numbers are below the SmolLM / GPT-X2 frontier; the model was trained end-to-end on two consumer AMD mini-PCs.

Usage

pip install "transformers>=4.46" torch safetensors

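import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "joelhenwang/OdinNext-138M-Instruct"
# Use fp16 on GPU (the checkpoint dtype) and fall back to fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# trust_remote_code=True executes the custom model code shipped in the repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=dtype,
).to(device).eval()

# Render the conversation with the tokenizer's ChatML template; the generation
# prompt appends the assistant header so the model writes the reply.
messages = [{"role": "user", "content": "Explain photosynthesis in one sentence."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
).to(device)

# Single-prompt sampling; use_cache=True reuses the fixed-size recurrent state.
with torch.inference_mode():
    out = model.generate(
        inputs,
        max_new_tokens=128,
        do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.1,
        pad_token_id=tok.pad_token_id, use_cache=True,
    )
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))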

If you build prompts manually, the ChatML format is:

<|im_start|>user
{your message}<|im_end|>
<|im_start|>assistant
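
A manual prompt builder following the layout above might look like this. It is a sketch under assumptions: the newline after each `<|im_end|>` and after the assistant header is inferred from the layout shown, and the function name `build_chatml_prompt` is hypothetical. The tokenizer's own chat template remains the authoritative format.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Join role/content messages into a ChatML string.

    messages: list of {"role": ..., "content": ...} dicts.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Who are you?"}])
```

Comparing the result against `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is a quick way to confirm that the whitespace matches before relying on hand-built prompts.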

Batching guidance

The recurrent scan does not apply an attention mask. For correct batched generation: avoid left padding, prefer same-length prompts, and verify batched output against single-sample output before relying on it. Single-prompt generation is the safest path.
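
One way to get same-length batches without any padding is to group prompts by token count before batching. The helper below is a hypothetical sketch that works on already-tokenized prompts (lists of token ids); it is not part of the released code.

```python
from collections import defaultdict

def bucket_by_length(token_id_lists):
    """Group prompt indices by token count so each batch needs no padding."""
    buckets = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        buckets[len(ids)].append(idx)
    return dict(buckets)

# Toy token ids for illustration only.
prompts = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
buckets = bucket_by_length(prompts)
# Prompts 0 and 2 share a length and can be stacked into one batch;
# prompts 1 and 3 run on their own.
```

Even with length-matched batches, it is still worth comparing a batched output against the single-sample output for the same prompt, as recommended above.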

What this model is good for

Short, ChatML-formatted instruction following and assistant-style replies.
Research on compact recurrent / linear-attention chat models and fixed-state decoding.

Do not use it for knowledge-sensitive or safety-sensitive generation, or for benchmark claims without running your own evaluation.

Limitations

Capacity-limited: 138M parameters — it hallucinates facts and is not a knowledge model.
No safety/alignment training beyond SFT+SeqKD: outputs can be biased, false, or incoherent.
Hard 2,048-token cap: recurrent state is constant, but the released RoPE cache limits cumulative positions to 2,048.
attention_mask is ignored in the backbone, so padding tokens change the recurrent state.
English-focused; multilingual / code ability is uncharacterized.
Benchmarks above are zero-shot, full-set, via lm-evaluation-harness — run your own evaluation to confirm.
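
To stay inside the 2,048-position cap listed above, the generation budget can be clamped to what is left after the prompt. This is a minimal sketch; it assumes prompt tokens and generated tokens count against the same limit, and the helper name `clamp_new_tokens` is hypothetical.

```python
MAX_POSITIONS = 2048  # cap stated in the card

def clamp_new_tokens(prompt_len, requested_new_tokens, max_positions=MAX_POSITIONS):
    """Return how many new tokens can be generated without exceeding the cap."""
    if prompt_len >= max_positions:
        raise ValueError("prompt already fills the available positions")
    return min(requested_new_tokens, max_positions - prompt_len)

budget_short = clamp_new_tokens(100, 128)   # plenty of room left
budget_long = clamp_new_tokens(2000, 128)   # only 48 positions remain
```

The returned value can be passed as `max_new_tokens` to `generate`.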

Citation

@misc{odinnext_138m_instruct_2026,
  title        = {OdinNext-138M-Instruct},
  author       = {Wang, Joel},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/joelhenwang/OdinNext-138M-Instruct}},
  note         = {138M HGRN2 recurrent instruction-tuned (SFT + SeqKD) checkpoint}
}

References

Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904.
Bowen Peng et al. Efficient Pre-Training with Token Superposition. arXiv:2605.06546.
Chenze Shao et al. Patch-Level Training for Large Language Models. arXiv:2407.12665.
Makoto Shing et al. DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. arXiv:2506.14202.