gpt-oss-20b-tq3

TurboQuant 3-bit MLX quantization of openai/gpt-oss-20b — produced with TurboQuant-MLX.

GPT-OSS-20B is a 21 B-parameter Mixture-of-Experts model with 32 experts and ~3.6 B active parameters per token. After TurboQuant 3-bit compression it fits comfortably on a 16 GB Apple Silicon Mac with full 131K-token context — and with the v0.2 KV-cache compression layered on top, the cache shrinks 4× as well.
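
The ~9.5 GB disk figure is consistent with a quick back-of-envelope estimate. The split below assumes one fp16 scale per 64-weight group and roughly a gigabyte of higher-precision tensors (embeddings, codebooks); that breakdown is an assumption, not a published spec:

codes  = 21e9 * 3 / 8 / 1e9    # 3-bit payload: ~7.9 GB
scales = 21e9 / 64 * 2 / 1e9   # one fp16 scale per group_size=64 group: ~0.7 GB
print(codes + scales)          # ~8.5 GB; embeddings/codebooks kept at higher
                               # precision plausibly account for the rest of ~9.5 GB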

Model Details

  • Base Model: openai/gpt-oss-20b (21 B total, 32 experts, ~3.6 B active)
  • Quantization: TurboQuant 3-bit (Hadamard rotation + Lloyd-Max codebook, sketched after this list), group_size=64
  • Calibration data: none — TurboQuant is data-free
  • Size: ~9.5 GB on disk
  • Peak wired RAM at decode: ~11 GB (verified on a 16 GB Mac with typical macOS background apps running)
  • Decode speed: 60–80 tok/s on M-series chips; up to 73 tok/s on an M4 Max with the fp16 KV cache
  • Runs on: Apple Silicon (M1/M2/M3/M4) with 16 GB or more unified memory
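
The rotate-then-quantize recipe is what makes this data-free: a Hadamard rotation spreads each weight group's outliers so the values look near-Gaussian, and a Lloyd-Max quantizer then fits an 8-level (3-bit) codebook to them. A toy numpy sketch of the idea (illustrative only; TurboQuant-MLX's actual kernels and codebook layout differ):

import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H.T is the inverse

def lloyd_max(x, bits=3, iters=25):
    # 1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update
    levels = 2 ** bits
    centroids = np.quantile(x, np.linspace(0, 1, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            if (idx == k).any():
                centroids[k] = x[idx == k].mean()
    return centroids, idx

group = np.random.randn(64)             # one group_size=64 weight group
H = hadamard(64)
codebook, codes = lloyd_max(H @ group)  # quantize in the rotated basis
dequant = H.T @ codebook[codes]         # decode, then rotate back
print(np.abs(group - dequant).max())    # small reconstruction error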

Requirements

pip install "turboquant-mlx-full>=0.2.0" "mlx-lm>=0.31.3"

Sampler recommendations

GPT-OSS-20B is a sub-25 B model, which puts it right at the edge of reliable multi-step reasoning. Sampler choice therefore matters more here than on larger models:

Use case                              Recommended sampler
Casual chat / creative writing / Q&A  --temp 0.7 --rep-penalty 1.1
Math, code, multi-step reasoning      --temp 0.3 --rep-penalty 1.1

At temp 0.7 the model occasionally gives up mid-problem on word problems, or writes plausible-looking but logically buggy code. Dropping to temp 0.3 stabilizes the reasoning trace and produces correct setups for both math and code.

Verified quality (6-test stress harness)

Tested with scripts/stress_hybrid_sampler.py on a 64 GB M-series Mac (peak RAM stays within the 16 GB target):

#   Test                                                      Verdict (recommended sampler)
01  long_essay (1500-word Roman Empire, 3500 max_tok)         clean, no degenerate tail
02  math (two trains, meeting time + distance, 800 max_tok)   correct at --temp 0.3 (sets up 60t + 75(t-0.5) = 215, solves t ≈ 1.87 hr → 10:52 AM); unstable at temp 0.7
03  code (merge_intervals + 3 unit tests, 1500 max_tok)       correct function logic at --temp 0.3; occasional hallucinated assertion values (function works, fix the test)
04  needle (FUCHSIA-7741 in haystack, 200 max_tok)            password retrieved verbatim
05  format (5-item list under 15 words/line, 1500 max_tok)    exactly 5 short numbered lines, no commentary
06  repetition_trap (sky-blue thorough, 4096 max_tok)         clean answer, no paragraph loops

Decode speed across all 6 tests: 46–94 tok/s. Peak RAM: 11.0–11.2 GB.

Quick Start

Download the model

hf download manjunathshiva/gpt-oss-20b-tq3 \
    --local-dir ~/models/gpt-oss-20b-tq3
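
Or, equivalently, from Python via huggingface_hub (a standard HF utility; install it separately if it isn't already present):

import os
from huggingface_hub import snapshot_download

# download the full repo snapshot to the same local path used below
snapshot_download(
    repo_id="manjunathshiva/gpt-oss-20b-tq3",
    local_dir=os.path.expanduser("~/models/gpt-oss-20b-tq3"),
)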

Generate text — standard chat

turboquant-generate \
    --model ~/models/gpt-oss-20b-tq3 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 1024 --temp 0.7 --rep-penalty 1.1

Generate text — math / code (temp 0.3)

turboquant-generate \
    --model ~/models/gpt-oss-20b-tq3 \
    --prompt "Solve this multi-step word problem..." \
    --max-tokens 1024 --temp 0.3 --rep-penalty 1.1

Generate with TurboQuant KV cache (v0.2+) — 4× smaller cache

For long-context generation, layer the v0.2 KV-cache compression on top. K8/V3 mixed precision is required when stacking on TurboQuant-quantized weights: a symmetric K3 split would compound the quantization noise and break long-form output past ~800 tokens. Keeping the first 128 tokens in fp16 (--kv-min-tokens 128) protects the attention sinks at the start of the prompt.

turboquant-generate \
    --model ~/models/gpt-oss-20b-tq3 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 1024 --temp 0.7 --rep-penalty 1.1 \
    --kv-k-bits 8 --kv-v-bits 3 --kv-min-tokens 128
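
To see why the asymmetric split matters, here is a toy numpy round-trip of K8/V3 fake-quantization with an fp16 sink. It uses plain per-row absmax rounding, not TurboQuant's actual cache codec:

import numpy as np

def fake_quant(x, bits):
    # symmetric per-row absmax quantize/dequantize round-trip (illustrative only)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def compress_kv(keys, values, sink_tokens=128, k_bits=8, v_bits=3):
    # keys/values: [tokens, head_dim]; the first sink_tokens rows stay
    # full precision, mirroring --kv-min-tokens 128 above
    k, v = keys.copy(), values.copy()
    k[sink_tokens:] = fake_quant(keys[sink_tokens:], k_bits)
    v[sink_tokens:] = fake_quant(values[sink_tokens:], v_bits)
    return k, v

rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 64))
values = rng.standard_normal((512, 64))
k, v = compress_kv(keys, values)
# V at 3 bits carries far more rounding noise than K at 8 bits, which is
# why a symmetric K3 split compounds error on top of 3-bit weights
print(np.abs(keys - k).mean(), np.abs(values - v).mean())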

License

Apache-2.0 (inherited from the base model).

Citation & Project

Built with TurboQuant-MLX. For the science (Hadamard rotation + Lloyd-Max codebooks for data-free quantization), see Zandieh et al., 2025 — TurboQuant: Online Vector Quantization with Optimal Distortion-Rate Trade-off.
