ZAYA1-8B – bitsandbytes Quantizations

bitsandbytes quantizations of Zyphra/ZAYA1-8B.

Note: ZAYA1-8B uses a custom sparse MoE architecture (ZayaForCausalLM) that is not yet supported by llama.cpp. GGUF files will be added once support lands (issue #22776). In the meantime, these bitsandbytes quantizations provide a working alternative.


Available Files

| Folder  | Format   | Bits   | Size    | Description                                    |
|---------|----------|--------|---------|------------------------------------------------|
| NF4/    | NF4      | 4-bit  | ~5.0 GB | Normal Float 4, best 4-bit quality             |
| NF4-DQ/ | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 with double quantization, slightly smaller |
| INT8/   | INT8     | 8-bit  | ~9.0 GB | Near-lossless                                  |

About ZAYA1-8B

ZAYA1-8B is a small mixture-of-experts (MoE) language model with 760M active and 8.4B total parameters, trained end-to-end by Zyphra. It sets a new standard for intelligence per parameter at its size through a combination of a novel architecture and innovations in pretraining and post-training.

ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.


Performance

[Figure: Performance chart]

[Figure: Scaling comparison]

In-class comparison

| Category     | Benchmark        | ZAYA1-8B (0.7B / 8B) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
|--------------|------------------|----------------------|----------------|------------|----------------|
| Math         | AIME'26          | 89.1                 | 77.5           | 84.5       | 50.3           |
| Math         | HMMT Feb.'26     | 71.6                 | 60.8           | 63.6       | 32.1           |
| Math         | IMO-AnswerBench  | 59.3                 | 50.9           | 48.7       | 27.3           |
| Math         | APEX-shortlist   | 32.2                 | 16.9           | --         | 6.1            |
| Code         | LiveCodeBench-v6 | 65.8                 | 54.2           | --         | 54.2           |
| Knowledge    | GPQA-Diamond     | 71.0                 | 66.5           | 76.2       | 57.4           |
| Knowledge    | MMLU-Pro         | 74.2                 | 74.3           | 79.1       | 70.2           |
| Instruction  | IFEval           | 85.58                | 86.8           | 89.8       | 88.50          |
| Instruction  | IFBench          | 52.56                | 52.9           | 59.2       | 42.67          |
| Style & chat | EQBench          | 72.95                | 79.6           | 79.5       | 80.15          |
| Agentic      | BFCL-v4          | 39.22                | 49.7           | 45.2       | 31.7           |

Scaling comparison against larger models

| Model                | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
|----------------------|--------|-------|---------|---------|--------|--------|----------|
| ZAYA1-8B             | 0.7B   | 8B    | 89.1    | 71.6    | 63.8   | 71.0   | 74.2     |
| Arcee-Trinity-Mini   | 3B     | 26B   | 59.6    | 36.9    | 33.3   | 46.8   | 70.6     |
| N3-Nano-30B          | 3B     | 30B   | 90.1    | 75.5    | 64.6   | 75.1   | 78.9     |
| OLMo-3.1-32B-Think   | 32B    | 32B   | 78.9    | 50.6    | 58.3   | 59.6   | 75.8     |
| Qwen3-Next-80B-A3B   | 3B     | 80B   | 90.2    | 79.3    | 67.8   | 76.7   | 82.6     |
| Intellect-3          | 12B    | 106B  | 86.3    | 72.2    | 66.8   | 74.6   | 82.3     |
| Mistral-Small-4-119B | 6B     | 119B  | 86.4    | 70.6    | 57.9   | 77.2   | 81.6     |

All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.


Download

Hugging Face's inference widget and one-click download are not available for this repo: ZayaForCausalLM requires Zyphra's custom transformers fork, so use the commands below.

Download a specific quantization

# NF4 (4-bit), recommended
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4

# NF4 with double quantization
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ

# INT8 (8-bit)
huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8

Download everything

huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
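
Alternatively, the same selective download can be done from Python with huggingface_hub (a minimal sketch; the folder patterns match the table above):

from huggingface_hub import snapshot_download

# Download only the NF4 folder; swap the pattern for "NF4-DQ/*" or "INT8/*".
snapshot_download(repo_id="barozp/ZAYA1-8B-BNB",
                  allow_patterns=["NF4/*"],
                  local_dir="./ZAYA1-8B-NF4")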

Usage

Zyphra's custom transformers fork is required to load ZayaForCausalLM:

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
pip install "bitsandbytes>=0.43.0" accelerate

Load NF4 (4-bit)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Weights are stored in 4-bit NF4; matmuls are computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4",
                                           trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)
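
Load NF4-DQ (4-bit + double quantization)

The NF4-DQ variant loads the same way; assuming it was produced with double quantization enabled (as the folder name suggests), the only change is the config:

# Double quantization also quantizes the quantization constants,
# saving roughly 0.4 bits per parameter.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4-DQ",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)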

Load INT8 (8-bit)

# LLM.int8(): 8-bit weights, with outlier activations handled in higher precision.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="INT8",
                                           trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="INT8",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)
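
As a quick sanity check, transformers can report the loaded model's in-memory size, which should roughly match the sizes in the table above:

# Rough in-memory footprint of the quantized weights.
print(f"Footprint: {model.get_memory_footprint() / 1e9:.1f} GB")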

Inference

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the sum of the first 100 prime numbers?"},
]

# add_generation_prompt=True appends the assistant header so the model starts its reply.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
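
For interactive use, tokens can also be streamed as they are generated with the stock transformers streamer (a sketch, reusing input_ids from above; generation settings are illustrative):

from transformers import TextStreamer

# Print tokens to stdout as they arrive, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(input_ids, max_new_tokens=512, streamer=streamer)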

Quantization Details
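
The exact settings used to produce these folders are not recorded here, but a folder like NF4/ can be reproduced by loading the bf16 base model with the matching BitsAndBytesConfig and serializing the result. A minimal sketch (the settings below are an assumption, not a record of this repo's exact process):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Quantize the base model to NF4 on load, then save the 4-bit checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("Zyphra/ZAYA1-8B",
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             trust_remote_code=True)
model.save_pretrained("./NF4")  # 4-bit serialization needs a recent bitsandbytes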


Original Model Prerequisites

# vLLM (recommended for serving)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"

# Transformers
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"

License

Apache 2.0, same as the original model.
