# Chroma Context-1 (MXFP4)
MXFP4-quantized version of chromadb/context-1, a 20B parameter agentic search model fine-tuned from openai/gpt-oss-20b.
This checkpoint reduces the model size from 39 GB (BF16) to ~14 GB (MXFP4) with minimal quality degradation, enabling inference on a single GPU with more headroom for KV cache.
## Model Details

- Base model: chromadb/context-1 (BF16)
- Architecture: GptOssForCausalLM (Mixture of Experts, 24 layers, 32 experts, top-4 routing)
- Parameters: 20B total
- Quantization: MXFP4 (E2M1 weights + E8M0 scales, group size 32)
- Quantized layers: MoE expert weights (`gate_up_proj`, `down_proj`)
- Non-quantized layers: attention, router, embeddings, LM head (remain in BF16)
- File size: ~14 GB (model.safetensors)
## Quantization Details

MXFP4 (Microscaling FP4) is a 4-bit floating-point format using the E2M1 representation (1 sign bit, 2-bit exponent, 1-bit mantissa) with shared E8M0 per-group scaling factors. Each group of 32 weights shares a single 8-bit power-of-two scale, and each individual weight is stored as a 4-bit FP4 code. Two FP4 codes are packed into one uint8 byte (low nibble = even index, high nibble = odd index).
This matches the quantization format used by the official openai/gpt-oss-20b MXFP4 checkpoint and is natively supported by vLLM's Marlin MXFP4 kernels.
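The nibble-packing convention described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the checkpoint's actual packing code; it assumes the FP4 codes are already integers in 0..15.

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit FP4 codes (values 0..15) into
    uint8 bytes: low nibble = even index, high nibble = odd index."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_fp4: recover the interleaved 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([1, 2, 3, 4], dtype=np.uint8)
packed = pack_fp4(codes)  # [0x21, 0x43]
assert np.array_equal(unpack_fp4(packed), codes)
```

This halves the byte count of the expert weight tensors relative to one byte per code, which is where the ~14 GB figure for the `_blocks` tensors comes from.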
### What is quantized

| Component | Format | Notes |
|---|---|---|
| `mlp.experts.gate_up_proj` | MXFP4 | Stored as `_blocks` (U8) + `_scales` (U8) |
| `mlp.experts.down_proj` | MXFP4 | Stored as `_blocks` (U8) + `_scales` (U8) |
| `self_attn.*` | BF16 | Kept at full precision |
| `mlp.router.*` | BF16 | Kept at full precision |
| `embed_tokens`, `lm_head` | BF16 | Kept at full precision |
### Tensor layout

Expert weights are transposed from the BF16 checkpoint layout `[num_experts, in_features, out_features]` to the vLLM-expected `[num_experts, out_features, in_features]` before quantization. This is required because vLLM's MXFP4 loader performs a direct copy without transposition (unlike the BF16 loader, which transposes on load).
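The layout change amounts to swapping the last two axes of each expert weight tensor. A minimal numpy illustration, using toy shapes rather than the real model dimensions:

```python
import numpy as np

# Toy shapes for illustration only (not the real model dimensions).
E, in_features, out_features = 4, 8, 6

# BF16 checkpoint layout: [num_experts, in_features, out_features]
w_bf16 = np.random.randn(E, in_features, out_features).astype(np.float32)

# vLLM-expected layout: [num_experts, out_features, in_features]
w_vllm = w_bf16.transpose(0, 2, 1)

assert w_vllm.shape == (E, out_features, in_features)
```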
## Usage with vLLM

Requires vLLM >= 0.18.0.

```shell
vllm serve evilfreelancer/context-1-mxfp4 \
  --served-model-name evilfreelancer/context-1-mxfp4 \
  --trust-remote-code \
  --dtype auto \
  --gpu-memory-utilization 0.9 \
  --max-model-len 64000 \
  --max-num-batched-tokens 64000 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser openai
```
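The server exposes an OpenAI-compatible API. A minimal sketch of querying it with the `openai` Python client (`pip install openai` assumed; it requires the server above to be running, and the base URL assumes the default `vllm serve` port 8000):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
# vLLM ignores the API key unless one was configured at serve time.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="evilfreelancer/context-1-mxfp4",
    messages=[{"role": "user", "content": "What is MXFP4 quantization?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```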
Docker Compose example:

```yaml
services:
  gpt-oss-20b:
    image: vllm/vllm-openai:v0.18.0
    restart: always
    entrypoint: vllm
    command: >
      serve evilfreelancer/context-1-mxfp4
      --served-model-name evilfreelancer/context-1-mxfp4
      --trust-remote-code
      --dtype auto
      --gpu-memory-utilization 0.9
      --max-model-len 64000
      --max-num-batched-tokens 64000
      --kv-cache-dtype fp8
      --enable-auto-tool-choice
      --tool-call-parser openai
    ports:
      - "8080:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
```
### Note on generation_config.json

The `eos_token_id` list includes token 200012 (`<|call|>`) in addition to the standard `<|return|>` (200002). This is required for tool calling to work correctly: without it, the model does not stop generation after emitting a tool call, and the Harmony parser fails.
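The relevant fragment of `generation_config.json` would then look like this (token IDs taken from the note above; all other fields omitted):

```json
{
  "eos_token_id": [200002, 200012]
}
```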
## Conversion Script

The included `convert_mxfp4.py` converts the original BF16 chromadb/context-1 weights to MXFP4 format. It requires only numpy.

```shell
pip install numpy
python convert_mxfp4.py
```

The script expects the BF16 model in a sibling `context-1/` directory and writes the quantized model to `context-1-mxfp4/`.
### What the script does

- Reads each tensor from the BF16 `model.safetensors`
- For MoE expert weights (`gate_up_proj` and `down_proj`):
  - Transposes from `[E, in, out]` to `[E, out, in]`
  - Computes per-group E8M0 scales (group size 32)
  - Quantizes to E2M1 FP4 codes using nearest rounding
  - Packs two FP4 codes per byte
  - Saves as `*_blocks` (packed weights) and `*_scales` (shared exponents)
- Copies all other tensors (attention, router, embeddings) as-is in BF16
- Copies tokenizer and chat template files
- Writes `config.json` with `quantization_config` for vLLM
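The per-group quantization step can be sketched as follows. This is an illustrative reimplementation, not the script itself; it assumes the standard E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} and an E8M0 scale stored as a bias-127 exponent, with the scale chosen so the group's largest magnitude lands near 6.0 (other scale-selection policies exist):

```python
import numpy as np

# Representable E2M1 magnitudes, indexed by the low 3 bits of the FP4 code.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(weights: np.ndarray):
    """Quantize one group of 32 weights to E2M1 codes plus a shared
    E8M0 (power-of-two) scale, returned as a biased 8-bit exponent."""
    amax = np.abs(weights).max()
    # Pick a power-of-two scale so the largest magnitude maps near 6.0,
    # the top of the E2M1 grid.
    exp = 0 if amax == 0 else int(np.floor(np.log2(amax / 6.0)))
    scaled = weights / 2.0 ** exp
    # Nearest-value rounding against the E2M1 magnitude grid.
    mags = np.abs(scaled)
    codes = np.abs(FP4_VALUES[None, :] - mags[:, None]).argmin(axis=1)
    codes = codes.astype(np.uint8)
    codes[scaled < 0] |= 8  # set the sign bit
    return exp + 127, codes  # E8M0 stores the exponent with bias 127

group = np.linspace(-6.0, 6.0, 32)
e8m0, codes = quantize_group(group)
# Dequantize and check the error stays within the coarse 4-bit grid.
recon = FP4_VALUES[codes & 7] * np.where(codes & 8, -1.0, 1.0) * 2.0 ** (e8m0 - 127)
assert np.max(np.abs(group - recon)) <= 1.0
```

Pairs of the resulting 4-bit codes are then packed two per byte into the `*_blocks` tensor, and the biased exponents become the `*_scales` tensor.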
## Key Capabilities

(Inherited from chromadb/context-1)
- Query decomposition: Breaks complex multi-constraint questions into targeted subqueries.
- Parallel tool calling: Averages 2.56 tool calls per turn, reducing total turns and end-to-end latency.
- Self-editing context: Selectively prunes irrelevant documents mid-search to sustain retrieval quality over long horizons within a bounded context window (0.94 prune accuracy).
- Cross-domain generalization: Trained on web, legal, and finance tasks; generalizes to held-out domains and public benchmarks (BrowseComp-Plus, SealQA, FRAMES, HLE).
## Important: Agent Harness Required
Context-1 is trained to operate within a specific agent harness that manages tool execution, token budgets, context pruning, and deduplication. The harness is not yet public. Running the model without it will not reproduce the results reported in the technical report.
See the technical report for details on the harness design.
## Citation

```bibtex
@techreport{bashir2026context1,
  title       = {Chroma Context-1: Training a Self-Editing Search Agent},
  author      = {Bashir, Hammad and Hong, Kelly and Jiang, Patrick and Shi, Zhiyi},
  year        = {2026},
  month       = {March},
  institution = {Chroma},
  url         = {https://trychroma.com/research/context-1},
}
```
## License
Apache 2.0