Instructions for using newzyerror/gpt-oss-20b-tq3 with libraries, inference providers, and local apps.
## MLX

How to use newzyerror/gpt-oss-20b-tq3 with MLX:
```python
# Make sure mlx-lm is installed:
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("newzyerror/gpt-oss-20b-tq3")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
## Pi

How to use newzyerror/gpt-oss-20b-tq3 with Pi:
**Start the MLX server**
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "newzyerror/gpt-oss-20b-tq3"
```
**Configure the model in Pi**
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "newzyerror/gpt-oss-20b-tq3" }
      ]
    }
  }
}
```

**Run Pi**
```bash
# Start Pi in your project directory:
pi
```
## Hermes Agent

How to use newzyerror/gpt-oss-20b-tq3 with Hermes Agent:
**Start the MLX server**
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "newzyerror/gpt-oss-20b-tq3"
```
**Configure Hermes**
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default newzyerror/gpt-oss-20b-tq3
```
**Run Hermes**
```bash
hermes
```
## MLX LM

How to use newzyerror/gpt-oss-20b-tq3 with MLX LM:
**Generate or start a chat session**
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "newzyerror/gpt-oss-20b-tq3"
```
**Run an OpenAI-compatible server**
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "newzyerror/gpt-oss-20b-tq3"

# Call the OpenAI-compatible server with curl (default port 8080)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "newzyerror/gpt-oss-20b-tq3",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
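Any OpenAI-compatible client can talk to the same endpoint. A minimal sketch using the official `openai` Python package (an assumption here; install it with `pip install openai`):

```python
from openai import OpenAI

# Point the client at the local mlx_lm server; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="newzyerror/gpt-oss-20b-tq3",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```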
# gpt-oss-20b-tq3
TurboQuant 3-bit MLX quantization of openai/gpt-oss-20b — produced with TurboQuant-MLX.
GPT-OSS-20B is a 21B-parameter Mixture-of-Experts model with 32 experts and ~3.6B active parameters per token. After TurboQuant 3-bit compression it fits comfortably on a 16 GB Apple Silicon Mac with the full 131K-token context, and with the v0.2 KV-cache compression layered on top, the cache shrinks 4× as well.
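As a rough cross-check of those numbers (back-of-envelope arithmetic, not an official size breakdown): 21B weights at 3 bits come to about 7.9 GB, per-group fp16 scales at group_size=64 add well under 1 GB, and embeddings and norms kept at higher precision plausibly account for the rest of the ~9.5 GB on-disk size.

```python
# Back-of-envelope size estimate (illustrative; the actual layout is TurboQuant's)
params = 21e9              # total parameters
bits_per_weight = 3
weight_bytes = params * bits_per_weight / 8   # ~7.9 GB of packed weights

group_size = 64
scale_bytes = params / group_size * 2         # one fp16 scale per group, ~0.66 GB

print(f"weights ~{weight_bytes / 1e9:.1f} GB, group scales ~{scale_bytes / 1e9:.2f} GB")
```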
## Model Details
- Base Model: openai/gpt-oss-20b (21B total, 32 experts, ~3.6B active)
- Quantization: TurboQuant 3-bit (Hadamard rotation + Lloyd-Max codebook), group_size=64 (see the toy sketch after this list)
- Calibration data: none; TurboQuant is data-free
- Size: ~9.5 GB on disk
- Peak wired RAM at decode: ~11 GB (verified on a 16 GB Mac with macOS background apps)
- Decode speed: 60–80 tok/s (M-series), up to 73 tok/s on M4 Max with fp16 KV cache
- Runs on: Apple Silicon (M1/M2/M3/M4) with 16 GB or more unified memory
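For intuition, here is a toy numpy sketch of the two ingredients named in the quantization bullet (purely illustrative; TurboQuant-MLX's actual kernels, codebook construction, and bit-packing are not shown): an orthonormal Hadamard rotation spreads outlier mass across a group of 64 weights, then a 1-D Lloyd-Max codebook with 2³ = 8 levels quantizes the rotated values.

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix (n must be a power of 2), orthonormal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(x: np.ndarray, levels: int = 8, iters: int = 20) -> np.ndarray:
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    c = np.quantile(x, np.linspace(0, 1, levels))        # initialize at quantiles
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(1)  # assign to nearest level
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()                # move level to centroid
    return c

group = 64
w = rng.standard_normal(group) * np.array([10.0] + [1.0] * (group - 1))  # one outlier
H = hadamard(group)
w_rot = H @ w                                   # rotation spreads the outlier
code = lloyd_max_codebook(w_rot, levels=8)      # 3 bits -> 8 codebook levels
q = code[np.abs(w_rot[:, None] - code[None, :]).argmin(1)]
w_hat = H.T @ q                                 # inverse rotation (H is orthonormal)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Because the codebook is fit to the weight distribution itself, no activation data is needed, which is the sense in which the method is data-free.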
## Requirements
```bash
pip install "turboquant-mlx-full>=0.2.0" "mlx-lm>=0.31.3"
```
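To confirm the installed versions satisfy those pins (a plain standard-library check, nothing specific to either package):

```python
from importlib.metadata import version

# Print the installed version of each pinned dependency
for pkg in ("turboquant-mlx-full", "mlx-lm"):
    print(pkg, version(pkg))
```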
## Sampler recommendations
GPT-OSS-20B is a sub-25B model, which means it sits right at the edge of capability for multi-step reasoning. Sampler choice matters more here than on larger models:
| Use case | Recommended sampler |
|---|---|
| Casual chat / creative writing / Q&A | `--temp 0.7 --rep-penalty 1.1` |
| Math, code, multi-step reasoning | `--temp 0.3 --rep-penalty 1.1` |
At temp 0.7 the model occasionally gives up mid-problem on word problems, or writes plausible-looking but logically buggy code. Dropping to temp 0.3 stabilizes the reasoning trace and produces correct setups for both math and code.
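If you drive the model through mlx-lm's Python API instead of the turboquant-generate CLI, the corresponding knobs look roughly like this. `make_sampler` and `make_logits_processors` are mlx-lm utilities; this is a sketch of the low-temperature math/code setting, not something shipped with this repo:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("newzyerror/gpt-oss-20b-tq3")

# Math/code setting: low temperature plus a mild repetition penalty
sampler = make_sampler(temp=0.3)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Two trains leave stations 215 km apart..."}],
    add_generation_prompt=True,
)
text = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=800,
    sampler=sampler,
    logits_processors=logits_processors,
    verbose=True,
)
```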
## Verified quality (6-test stress harness)
Tested with `scripts/stress_hybrid_sampler.py` on a 64 GB M-series Mac (peak RAM stays within the 16 GB target):

| # | Test | Verdict (recommended sampler) |
|---|---|---|
| 01 | long_essay (1500-word Roman Empire, 3500 max_tok) | clean, no degenerate tail |
| 02 | math (two trains, meeting time + distance, 800 max_tok) | correct at `--temp 0.3` (sets up 60t + 75(t-0.5) = 215, solves t ≈ 1.87 hr → 10:52 AM); unstable at temp 0.7 |
| 03 | code (merge_intervals + 3 unit tests, 1500 max_tok) | correct function logic at `--temp 0.3`; occasional hallucinated assertion values (function works, fix the test) |
| 04 | needle (FUCHSIA-7741 in haystack, 200 max_tok) | password retrieved verbatim |
| 05 | format (5-item list under 15 words/line, 1500 max_tok) | exactly 5 short numbered lines, no commentary |
| 06 | repetition_trap (sky-blue thorough, 4096 max_tok) | clean answer, no paragraph loops |
Decode speed across all 6 tests: 46–94 tok/s. Peak RAM: 11.0–11.2 GB.
## Quick Start
**Download the model**
```bash
hf download newzyerror/gpt-oss-20b-tq3 \
  --local-dir ~/models/gpt-oss-20b-tq3
```
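The same download from Python, via huggingface_hub:

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Download the full repo to the same target directory as the CLI above
snapshot_download(
    repo_id="newzyerror/gpt-oss-20b-tq3",
    local_dir=Path.home() / "models" / "gpt-oss-20b-tq3",
)
```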
**Generate text: standard chat**
```bash
turboquant-generate \
  --model ~/models/gpt-oss-20b-tq3 \
  --prompt "Why is the sky blue? Explain in detail." \
  --max-tokens 1024 --temp 0.7 --rep-penalty 1.1
```
**Generate text: math / code (temp 0.3)**
```bash
turboquant-generate \
  --model ~/models/gpt-oss-20b-tq3 \
  --prompt "Solve this multi-step word problem..." \
  --max-tokens 1024 --temp 0.3 --rep-penalty 1.1
```
**Generate with TurboQuant KV cache (v0.2+): 4× smaller cache**
For long-context generation, layer the v0.2 KV-cache compression on top. K8/V3 mixed precision is required when stacking on TurboQuant-quantized weights — symmetric K3 would compound the noise and break long-form output past ~800 tokens. The 128-token fp16 sink protects attention sinks at the prompt start.
```bash
turboquant-generate \
  --model ~/models/gpt-oss-20b-tq3 \
  --prompt "Why is the sky blue? Explain in detail." \
  --max-tokens 1024 --temp 0.7 --rep-penalty 1.1 \
  --kv-k-bits 8 --kv-v-bits 3 --kv-min-tokens 128
```
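For intuition only, here is a toy numpy illustration of the mixed-precision idea (this is not TurboQuant's actual cache layout or codebook): keys keep 8-bit resolution, values drop to 3 bits, and the first 128 sink tokens stay in fp16.

```python
import numpy as np

def quantize_groups(x: np.ndarray, bits: int, group: int = 64) -> np.ndarray:
    """Uniform per-group quantize/dequantize along the last axis (toy version)."""
    levels = 2**bits - 1
    g = x.reshape(-1, group)
    lo, hi = g.min(1, keepdims=True), g.max(1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((g - lo) / scale)                 # integer codes
    return (q * scale + lo).reshape(x.shape)       # dequantized reconstruction

rng = np.random.default_rng(0)
seq, dim, sink = 1024, 64, 128
K = rng.standard_normal((seq, dim)).astype(np.float32)
V = rng.standard_normal((seq, dim)).astype(np.float32)

# Sink tokens stay fp16; the rest of the cache is quantized at K8/V3
K_hat = np.concatenate([K[:sink].astype(np.float16).astype(np.float32),
                        quantize_groups(K[sink:], bits=8)])
V_hat = np.concatenate([V[:sink].astype(np.float16).astype(np.float32),
                        quantize_groups(V[sink:], bits=3)])
print("K error:", np.abs(K - K_hat).mean(), " V error:", np.abs(V - V_hat).mean())
```

The printed errors show why keys get the extra bits: attention logits are dot products against every cached key, so key noise perturbs the whole softmax, while value noise only blurs the weighted average.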
## License
Apache-2.0 (inherited from the base model).
## Citation & Project
Built with TurboQuant-MLX. For the science (Hadamard rotation + Lloyd-Max codebooks for data-free quantization), see Zandieh et al., 2025 — TurboQuant: Online Vector Quantization with Optimal Distortion-Rate Trade-off.