ZAYA1-8B – bitsandbytes Quantizations

bitsandbytes quantizations of Zyphra/ZAYA1-8B.

Note: ZAYA1-8B uses a custom sparse MoE architecture (ZayaForCausalLM) that is not yet supported by llama.cpp. GGUF files will be added once support lands (issue #22776). In the meantime, these bitsandbytes quantizations provide a working alternative.


Available Files

| Folder  | Format   | Bits   | Size    | Description                                    |
|---------|----------|--------|---------|------------------------------------------------|
| NF4/    | NF4      | 4-bit  | ~5.0 GB | Normal Float 4, best 4-bit quality             |
| NF4-DQ/ | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 with double quantization, slightly smaller |
| INT8/   | INT8     | 8-bit  | ~9.0 GB | Near-lossless                                  |

About ZAYA1-8B

ZAYA1-8B is a small mixture-of-experts (MoE) language model with 760M active and 8.4B total parameters, trained end-to-end by Zyphra. It sets a new standard for intelligence per parameter at its size through a combination of a novel architecture and innovations in pretraining and post-training.

ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.


Performance

[Figure: Performance chart]

[Figure: Scaling comparison]

In-class comparison

| Category     | Benchmark        | ZAYA1-8B (0.7B / 8B) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
|--------------|------------------|----------------------|----------------|------------|----------------|
| Math         | AIME'26          | 89.1                 | 77.5           | 84.5       | 50.3           |
| Math         | HMMT Feb.'26     | 71.6                 | 60.8           | 63.6       | 32.1           |
| Math         | IMO-AnswerBench  | 59.3                 | 50.9           | 48.7       | 27.3           |
| Math         | APEX-shortlist   | 32.2                 | 16.9           | --         | 6.1            |
| Code         | LiveCodeBench-v6 | 65.8                 | 54.2           | --         | 54.2           |
| Knowledge    | GPQA-Diamond     | 71.0                 | 66.5           | 76.2       | 57.4           |
| Knowledge    | MMLU-Pro         | 74.2                 | 74.3           | 79.1       | 70.2           |
| Instruction  | IFEval           | 85.58                | 86.8           | 89.8       | 88.50          |
| Instruction  | IFBench          | 52.56                | 52.9           | 59.2       | 42.67          |
| Style & chat | EQBench          | 72.95                | 79.6           | 79.5       | 80.15          |
| Agentic      | BFCL-v4          | 39.22                | 49.7           | 45.2       | 31.7           |

Scaling comparison against larger models

| Model                | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
|----------------------|--------|-------|---------|---------|--------|--------|----------|
| ZAYA1-8B             | 0.7B   | 8B    | 89.1    | 71.6    | 63.8   | 71.0   | 74.2     |
| Arcee-Trinity-Mini   | 3B     | 26B   | 59.6    | 36.9    | 33.3   | 46.8   | 70.6     |
| N3-Nano-30B          | 3B     | 30B   | 90.1    | 75.5    | 64.6   | 75.1   | 78.9     |
| OLMo-3.1-32B-Think   | 32B    | 32B   | 78.9    | 50.6    | 58.3   | 59.6   | 75.8     |
| Qwen3-Next-80B-A3B   | 3B     | 80B   | 90.2    | 79.3    | 67.8   | 76.7   | 82.6     |
| Intellect-3          | 12B    | 106B  | 86.3    | 72.2    | 66.8   | 74.6   | 82.3     |
| Mistral-Small-4-119B | 6B     | 119B  | 86.4    | 70.6    | 57.9   | 77.2   | 81.6     |

All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.


Download

Hugging Face's inference widget and one-click download are not available for this repo: ZayaForCausalLM requires Zyphra's custom transformers fork, so use the commands below.

Download a specific quantization

# NF4 (4-bit), recommended
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4

# NF4 with double quantization
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ

# INT8 (8-bit)
huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8

Download everything

huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
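
Alternatively, the same selective download can be done from Python with huggingface_hub (a minimal sketch; the folder patterns match the table above):

from huggingface_hub import snapshot_download

# Download only the NF4 folder; swap the pattern for "NF4-DQ/*" or "INT8/*".
snapshot_download(repo_id="barozp/ZAYA1-8B-BNB",
                  allow_patterns=["NF4/*"],
                  local_dir="./ZAYA1-8B-NF4")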

Usage

Zyphra's custom transformers fork is required to load ZayaForCausalLM:

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
pip install "bitsandbytes>=0.43.0" accelerate

Load NF4 (4-bit)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Weights are stored in 4-bit NF4; matmuls are computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4",
                                           trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)
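
Load NF4-DQ (4-bit + double quantization)

The NF4-DQ variant loads the same way; assuming it was produced with double quantization enabled (as the folder name suggests), the only change is the config:

# Double quantization also quantizes the quantization constants,
# saving roughly 0.4 bits per parameter.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4-DQ",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)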

Load INT8 (8-bit)

# LLM.int8(): 8-bit weights, with outlier activations handled in higher precision.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="INT8",
                                           trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="INT8",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)
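
As a quick sanity check, transformers can report the loaded model's in-memory size, which should roughly match the sizes in the table above:

# Rough in-memory footprint of the quantized weights.
print(f"Footprint: {model.get_memory_footprint() / 1e9:.1f} GB")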

Inference

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the sum of the first 100 prime numbers?"},
]

# add_generation_prompt=True appends the assistant header so the model starts its reply.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
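
For interactive use, tokens can also be streamed as they are generated with the stock transformers streamer (a sketch, reusing input_ids from above; generation settings are illustrative):

from transformers import TextStreamer

# Print tokens to stdout as they arrive, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(input_ids, max_new_tokens=512, streamer=streamer)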

Quantization Details
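
The exact settings used to produce these folders are not recorded here, but a folder like NF4/ can be reproduced by loading the bf16 base model with the matching BitsAndBytesConfig and serializing the result. A minimal sketch (the settings below are an assumption, not a record of this repo's exact process):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Quantize the base model to NF4 on load, then save the 4-bit checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("Zyphra/ZAYA1-8B",
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             trust_remote_code=True)
model.save_pretrained("./NF4")  # 4-bit serialization needs a recent bitsandbytes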


Original Model Prerequisites

# vLLM (recommended for serving)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"

# Transformers
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"

License

Apache 2.0, same as the original model.
