gpt-oss-20b-tq3

TurboQuant 3-bit MLX quantization of openai/gpt-oss-20b — produced with TurboQuant-MLX.

GPT-OSS-20B is a 21 B-parameter Mixture-of-Experts model with 32 experts and ~3.6 B active parameters per token. After TurboQuant 3-bit compression it fits comfortably on a 16 GB Apple Silicon Mac with full 131K-token context — and with the v0.2 KV-cache compression layered on top, the cache shrinks 4× as well.
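
The ~9.5 GB disk figure is consistent with a quick back-of-envelope estimate. The split below assumes one fp16 scale per 64-weight group and roughly a gigabyte of higher-precision tensors (embeddings, codebooks); that breakdown is an assumption, not a published spec:

codes  = 21e9 * 3 / 8 / 1e9    # 3-bit payload: ~7.9 GB
scales = 21e9 / 64 * 2 / 1e9   # one fp16 scale per group_size=64 group: ~0.7 GB
print(codes + scales)          # ~8.5 GB; embeddings/codebooks kept at higher
                               # precision plausibly account for the rest of ~9.5 GB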

Model Details

  • Base Model: openai/gpt-oss-20b (21 B total, 32 experts, ~3.6 B active)
  • Quantization: TurboQuant 3-bit (Hadamard rotation + Lloyd-Max codebook, sketched after this list), group_size=64
  • Calibration data: none — TurboQuant is data-free
  • Size: ~9.5 GB on disk
  • Peak wired RAM at decode: ~11 GB (verified on a 16 GB Mac with typical macOS background apps running)
  • Decode speed: 60–80 tok/s on M-series chips; up to 73 tok/s on an M4 Max with the fp16 KV cache
  • Runs on: Apple Silicon (M1/M2/M3/M4) with 16 GB or more unified memory
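
The rotate-then-quantize recipe is what makes this data-free: a Hadamard rotation spreads each weight group's outliers so the values look near-Gaussian, and a Lloyd-Max quantizer then fits an 8-level (3-bit) codebook to them. A toy numpy sketch of the idea (illustrative only; TurboQuant-MLX's actual kernels and codebook layout differ):

import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H.T is the inverse

def lloyd_max(x, bits=3, iters=25):
    # 1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update
    levels = 2 ** bits
    centroids = np.quantile(x, np.linspace(0, 1, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            if (idx == k).any():
                centroids[k] = x[idx == k].mean()
    return centroids, idx

group = np.random.randn(64)             # one group_size=64 weight group
H = hadamard(64)
codebook, codes = lloyd_max(H @ group)  # quantize in the rotated basis
dequant = H.T @ codebook[codes]         # decode, then rotate back
print(np.abs(group - dequant).max())    # small reconstruction error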

Requirements

pip install "turboquant-mlx-full>=0.2.0" "mlx-lm>=0.31.3"

Sampler recommendations

GPT-OSS-20B is a sub-25 B model, which puts it right at the edge of reliable multi-step reasoning. Sampler choice therefore matters more here than on larger models:

Use case                              Recommended sampler
Casual chat / creative writing / Q&A  --temp 0.7 --rep-penalty 1.1
Math, code, multi-step reasoning      --temp 0.3 --rep-penalty 1.1

At temp 0.7 the model occasionally gives up mid-problem on word problems, or writes plausible-looking but logically buggy code. Dropping to temp 0.3 stabilizes the reasoning trace and produces correct setups for both math and code.

Verified quality (6-test stress harness)

Tested with scripts/stress_hybrid_sampler.py on a 64 GB M-series Mac (peak RAM stays within the 16 GB target):

#   Test                                                      Verdict (recommended sampler)
01  long_essay (1500-word Roman Empire, 3500 max_tok)         clean, no degenerate tail
02  math (two trains, meeting time + distance, 800 max_tok)   correct at --temp 0.3 (sets up 60t + 75(t-0.5) = 215, solves t ≈ 1.87 hr → 10:52 AM); unstable at temp 0.7
03  code (merge_intervals + 3 unit tests, 1500 max_tok)       correct function logic at --temp 0.3; occasional hallucinated assertion values (function works, fix the test)
04  needle (FUCHSIA-7741 in haystack, 200 max_tok)            password retrieved verbatim
05  format (5-item list under 15 words/line, 1500 max_tok)    exactly 5 short numbered lines, no commentary
06  repetition_trap (sky-blue thorough, 4096 max_tok)         clean answer, no paragraph loops

Decode speed across all 6 tests: 46–94 tok/s. Peak RAM: 11.0–11.2 GB.

Quick Start

Download the model

hf download manjunathshiva/gpt-oss-20b-tq3 \
    --local-dir ~/models/gpt-oss-20b-tq3
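
Or, equivalently, from Python via huggingface_hub (a standard HF utility; install it separately if it isn't already present):

import os
from huggingface_hub import snapshot_download

# download the full repo snapshot to the same local path used below
snapshot_download(
    repo_id="manjunathshiva/gpt-oss-20b-tq3",
    local_dir=os.path.expanduser("~/models/gpt-oss-20b-tq3"),
)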

Generate text — standard chat

turboquant-generate \
    --model ~/models/gpt-oss-20b-tq3 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 1024 --temp 0.7 --rep-penalty 1.1

Generate text — math / code (temp 0.3)

turboquant-generate \
    --model ~/models/gpt-oss-20b-tq3 \
    --prompt "Solve this multi-step word problem..." \
    --max-tokens 1024 --temp 0.3 --rep-penalty 1.1

Generate with TurboQuant KV cache (v0.2+) — 4× smaller cache

For long-context generation, layer the v0.2 KV-cache compression on top. K8/V3 mixed precision is required when stacking on TurboQuant-quantized weights: a symmetric K3 split would compound the quantization noise and break long-form output past ~800 tokens. Keeping the first 128 tokens in fp16 (--kv-min-tokens 128) protects the attention sinks at the start of the prompt.

turboquant-generate \
    --model ~/models/gpt-oss-20b-tq3 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 1024 --temp 0.7 --rep-penalty 1.1 \
    --kv-k-bits 8 --kv-v-bits 3 --kv-min-tokens 128
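
To see why the asymmetric split matters, here is a toy numpy round-trip of K8/V3 fake-quantization with an fp16 sink. It uses plain per-row absmax rounding, not TurboQuant's actual cache codec:

import numpy as np

def fake_quant(x, bits):
    # symmetric per-row absmax quantize/dequantize round-trip (illustrative only)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def compress_kv(keys, values, sink_tokens=128, k_bits=8, v_bits=3):
    # keys/values: [tokens, head_dim]; the first sink_tokens rows stay
    # full precision, mirroring --kv-min-tokens 128 above
    k, v = keys.copy(), values.copy()
    k[sink_tokens:] = fake_quant(keys[sink_tokens:], k_bits)
    v[sink_tokens:] = fake_quant(values[sink_tokens:], v_bits)
    return k, v

rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 64))
values = rng.standard_normal((512, 64))
k, v = compress_kv(keys, values)
# V at 3 bits carries far more rounding noise than K at 8 bits, which is
# why a symmetric K3 split compounds error on top of 3-bit weights
print(np.abs(keys - k).mean(), np.abs(values - v).mean())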

License

Apache-2.0 (inherited from the base model).

Citation & Project

Built with TurboQuant-MLX. For the science (Hadamard rotation + Lloyd-Max codebooks for data-free quantization), see Zandieh et al., 2025 — TurboQuant: Online Vector Quantization with Optimal Distortion-Rate Trade-off.
