# ZAYA1-8B bitsandbytes Quantizations
bitsandbytes quantizations of Zyphra/ZAYA1-8B.
**Note:** ZAYA1-8B uses a custom sparse MoE architecture (`ZayaForCausalLM`) that is not yet supported by llama.cpp. GGUF files will be added once support lands (issue #22776). In the meantime, these bitsandbytes quantizations provide a working alternative.
## Available Files
| Folder | Format | Bits | Size | Description |
|---|---|---|---|---|
| `NF4/` | NF4 | 4-bit | ~5.0 GB | Normal Float 4; best 4-bit quality |
| `NF4-DQ/` | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 with double quantization; slightly smaller |
| `INT8/` | INT8 | 8-bit | ~9.0 GB | Near-lossless |
## About ZAYA1-8B
ZAYA1-8B is a small mixture-of-experts language model with 760M active parameters and 8.4B total parameters, trained end-to-end by Zyphra. It sets a new standard of intelligence efficiency for its parameter count through a combination of a novel architecture and innovations in pretraining and post-training.
ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.
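The parameter count also roughly explains the file sizes listed above; a back-of-the-envelope check (the listed files run somewhat larger, likely due to quantization constants, tensors kept in higher precision, and config/tokenizer files):

```python
# Rough weight footprint implied by the 8.4B total parameter count.
TOTAL_PARAMS = 8.4e9

for name, bits in [("NF4", 4), ("INT8", 8), ("BF16 original", 16)]:
    gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{name:>13}: ~{gb:.1f} GB")

# NF4:  ~4.2 GB  (repo lists ~5.0 GB)
# INT8: ~8.4 GB  (repo lists ~9.0 GB)
```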
- Technical report: https://www.zyphra.com/zaya1-8b-technical-report
- Blog post: https://www.zyphra.com/post/zaya1-8b
- Pretraining base: Zyphra/ZAYA1-reasoning-base
## Performance

### In-class comparison
| Category | Benchmark | ZAYA1-8B (0.7B / 8B) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
|---|---|---|---|---|---|
| Math | AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
| Math | HMMT Feb.'26 | 71.6 | 60.8 | 63.6 | 32.1 |
| Math | IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
| Math | APEX-shortlist | 32.2 | 16.9 | -- | 6.1 |
| Code | LiveCodeBench-v6 | 65.8 | 54.2 | -- | 54.2 |
| Knowledge | GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
| Knowledge | MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
| Instruction | IFEval | 85.58 | 86.8 | 89.8 | 88.50 |
| Instruction | IFBench | 52.56 | 52.9 | 59.2 | 42.67 |
| Style & chat | EQBench | 72.95 | 79.6 | 79.5 | 80.15 |
| Agentic | BFCL-v4 | 39.22 | 49.7 | 45.2 | 31.7 |
### Scaling comparison against larger models
| Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
|---|---|---|---|---|---|---|---|
| ZAYA1-8B | 0.7B | 8B | 89.1 | 71.6 | 63.8 | 71.0 | 74.2 |
| Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 46.8 | 70.6 |
| N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | 64.6 | 75.1 | 78.9 |
| OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 59.6 | 75.8 |
| Qwen3-Next-80B-A3B | 3B | 80B | 90.2 | 79.3 | 67.8 | 76.7 | 82.6 |
| Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 74.6 | 82.3 |
| Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 77.2 | 81.6 |
All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.
## Download
Hugging Face's inference widget and one-click download are not available for this repo: `ZayaForCausalLM` requires Zyphra's custom `transformers` fork, so use the commands below.
### Download a specific quantization

```bash
# NF4 (4-bit), recommended
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4

# NF4 with double quantization
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ

# INT8 (8-bit)
huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8
```
### Download everything

```bash
huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
```
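The same filtered download can be scripted from Python with `huggingface_hub` (a small sketch; the target directory is arbitrary):

```python
from huggingface_hub import snapshot_download

# Fetch only the NF4 variant; allow_patterns mirrors the --include flag above.
snapshot_download(
    repo_id="barozp/ZAYA1-8B-BNB",
    allow_patterns=["NF4/*"],
    local_dir="./ZAYA1-8B-NF4",
)
```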
## Usage
Zyphra's custom `transformers` fork is required to load `ZayaForCausalLM`:

```bash
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
pip install "bitsandbytes>=0.43.0" accelerate
```
### Load NF4 (4-bit)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Store weights in 4-bit NF4; run compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "barozp/ZAYA1-8B-BNB", subfolder="NF4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="NF4",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
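The `NF4-DQ/` variant loads the same way. Assuming it was produced with bitsandbytes' double-quantization option, as the folder name indicates, the matching config adds one flag:

```python
# Same as above, plus double quantization of the quantization constants,
# which is what shrinks NF4-DQ below plain NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="NF4-DQ",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```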
### Load INT8 (8-bit)

```python
# Imports as in the NF4 example above.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(
    "barozp/ZAYA1-8B-BNB", subfolder="INT8", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="INT8",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
### Inference

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sum of the first 100 prime numbers?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
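For interactive use, tokens can be printed as they are generated with transformers' built-in `TextStreamer`; a minimal sketch reusing `tokenizer`, `model`, and `input_ids` from above:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated,
# skipping the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(input_ids, max_new_tokens=512, streamer=streamer)
```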
## Quantization Details

- Source: Zyphra/ZAYA1-8B (BF16 safetensors)
- Method: bitsandbytes
- Quantized by: barozp
- GGUF status: pending llama.cpp support (issue #22776)
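For reference, a quantization like the NF4 one can in principle be recreated by loading the BF16 source through a `BitsAndBytesConfig` and re-saving it. A sketch of the general recipe, not necessarily the exact script used for this repo (4-bit serialization requires recent transformers and bitsandbytes versions):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load the BF16 source weights directly into 4-bit NF4...
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/ZAYA1-8B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
    trust_remote_code=True,
)

# ...then serialize the quantized weights.
model.save_pretrained("./NF4")
```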
## Original Model Prerequisites

To run the original BF16 model rather than these quantizations, install one of Zyphra's forks:

```bash
# vLLM (recommended for serving)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"

# Transformers
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
```
## License

Apache 2.0, same as the original model.