bge-m3-onnx (fp32)
ONNX (fp32) export of BAAI/bge-m3, exposing all three representations in a single forward pass: dense (1024-d), sparse (lexical weights) and ColBERT (multi-vector). Built for CPU inference with ONNX Runtime.
It powers the CPU flavor of bge-m3-service (sophiacloud/bge-m3-service:cpu), but works standalone.
Why this exists (and where it actually helps)
On CPU, ONNX Runtime is faster than PyTorch eager because of graph-level operator fusion and lower per-op overhead. We measured it against the original model run through FlagEmbedding, same threads, fp32, on long real-world documents (x86):
| op | ONNX-CPU | FlagEmbedding-CPU | speedup |
|---|---|---|---|
| embed | 857 ms | 1091 ms | 1.27× |
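The speedup column is the baseline latency divided by the ONNX latency. The source does not describe its timing harness, so the following is only a minimal sketch of how such a comparison could be reproduced; `median_latency_ms` and `speedup` are hypothetical helper names, and the warm-up and run counts are arbitrary.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=10):
    """Median wall-clock latency of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # untimed calls, so one-off initialisation does not skew the samples
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Figures from the table: FlagEmbedding 1091 ms vs ONNX 857 ms.
print(f"{speedup(1091, 857):.2f}x")  # prints 1.27x
```

For a fair comparison, wrap each backend's embed call in a zero-argument function and time both on the same documents with the same thread count.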
Honest framing:
- The win is moderate (~1.3×), not 2-4×. The big multiplier you read about elsewhere is for int8 quantization — which we deliberately do not ship (see below).
- On GPU there is no reason to use this. PyTorch fp16 + flash-attention on the original BAAI/bge-m3 is as fast or faster. Use ONNX only for CPU deployments.
- The other real win is a lighter image / no torch dependency on CPU.
Numerically identical to the original
Validated against FlagEmbedding fp32 as reference, on real domain text:
| signal | result |
|---|---|
| DENSE cosine vs reference | 1.000000 (mean & min) |
| COLBERT MaxSim Δrel | 0.0000% |
| SPARSE Jaccard (active tokens) | 0.963 |
The sparse Jaccard falls below 1 only because a few near-zero-weight tokens land on opposite sides of the activation threshold in the two runs. Their weights are tiny, so the sparse dot-product score is effectively unchanged.
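A toy illustration of why a Jaccard below 1 can coexist with an essentially unchanged sparse score. All weights, token names and the `THRESHOLD` value are made up for the example; the source does not state which threshold was used.

```python
THRESHOLD = 0.01  # hypothetical activation threshold, not taken from the source


def prune(weights, threshold):
    """Keep only tokens whose weight is above the threshold."""
    return {t: w for t, w in weights.items() if w > threshold}


def jaccard(a, b):
    """Jaccard overlap of the active-token sets of two weight dicts."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 1.0


def dot(a, b):
    """Sparse dot product over the tokens both dicts share."""
    return sum(w * b[t] for t, w in a.items() if t in b)


# Made-up weights: "the" sits right at the threshold and flips between runs.
reference = prune({"cat": 0.30, "sat": 0.20, "the": 0.011}, THRESHOLD)
exported = prune({"cat": 0.30, "sat": 0.20, "the": 0.009}, THRESHOLD)
query = prune({"cat": 0.25, "the": 0.02}, THRESHOLD)

print("jaccard:", jaccard(reference, exported))
print("score delta:", abs(dot(reference, query) - dot(exported, query)))
```

One of three active tokens disappears, so the Jaccard drops to 2/3, yet the score moves by only the product of two near-threshold weights.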
Technical note: the ColBERT CLS fix
BGE-M3 builds the ColBERT matrix from last_hidden_state[:, 1:] — i.e. the CLS token
is excluded. A naive ONNX export keeps it, which shifts MaxSim by ~1.5% (up to ~7.7%).
This export's consumer code drops the CLS vector to match the original exactly (that is
why COLBERT Δrel is 0%). If you write your own inference loop, exclude the first token
of the ColBERT output.
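A minimal sketch of that trimming step, followed by a MaxSim score. `colbert_vectors` and `maxsim` are hypothetical names, not part of this repository. The scoring (best match per query token, averaged over query tokens) is the usual late-interaction formulation, and it assumes the vectors are already L2-normalized, which the source does not confirm for the ColBERT output.

```python
import numpy as np


def colbert_vectors(colbert_row, attention_row):
    """Drop padding positions, then drop the first remaining vector (CLS).

    colbert_row: (seq, dim) output for one text; attention_row: (seq,) 0/1 mask.
    """
    keep = attention_row.astype(bool)
    return colbert_row[keep][1:]


def maxsim(query_vecs, doc_vecs):
    """Mean over query tokens of the best dot product against any doc token.

    Assumes both inputs are L2-normalized; normalize first if they are not.
    """
    sims = query_vecs @ doc_vecs.T  # (query_len, doc_len)
    return float(sims.max(axis=1).mean())


# Toy check with made-up 2-d vectors.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.5, 0.5]])
print(maxsim(q, d))
```

With the outputs from the Usage section, `colbert_vectors(colbert[i], enc["attention_mask"][i])` would give the matrix for text `i`.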
Why no int8
Dynamic int8 degraded quality past our bar: dense cosine dropped to 0.985 and reranker top-1 overlap fell to 0.625. For a retrieval system that is not acceptable, so we ship fp32 only. The export tooling can still produce int8 if you accept the trade-off.
Files
- model_fp32.onnx + model_fp32.onnx.data: graph + external weights (~2.2 GB)
- tokenizer_hf/: the HuggingFace tokenizer (use this, not a native ONNX tokenizer; ONNX SentencePiece tokenizers can diverge on multilingual edge cases)
Usage
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the repository and load the tokenizer and the ONNX session
d = snapshot_download("Sophia-AI/bge-m3-onnx")
tok = AutoTokenizer.from_pretrained(f"{d}/tokenizer_hf")
sess = ort.InferenceSession(f"{d}/model_fp32.onnx", providers=["CPUExecutionProvider"])

enc = tok(["esempio di testo"], padding=True, truncation=True,
          max_length=1024, return_tensors="np")

# One forward pass returns all three representations
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
# dense: (B, 1024), L2-normalized | sparse: (B, seq, 1) token weights
# colbert: (B, seq, 1024); drop padding AND the CLS token (index 0) per text
Inputs: input_ids, attention_mask (int64). Outputs: dense, sparse, colbert.
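The sparse output is one weight per input position, so it usually needs collapsing into a token-id to weight map before scoring. The sketch below shows one way to do that, assuming special tokens are skipped and that a repeated token keeps its largest weight (my understanding of how FlagEmbedding post-processes lexical weights, not something this repository documents). `lexical_weights` and `lexical_score` are hypothetical names.

```python
import numpy as np


def lexical_weights(input_ids_row, sparse_row, special_ids):
    """Collapse per-position weights into {token_id: max weight}.

    input_ids_row: (seq,) token ids; sparse_row: (seq, 1) weights;
    special_ids: ids to skip (CLS, EOS, PAD, ...), supplied by the caller.
    """
    weights = {}
    ids = input_ids_row.tolist()
    values = sparse_row.reshape(-1).tolist()
    for token_id, w in zip(ids, values):
        if token_id in special_ids or w <= 0:
            continue
        if w > weights.get(token_id, 0.0):
            weights[token_id] = w
    return weights


def lexical_score(a, b):
    """Sparse dot product: sum of weight products over shared token ids."""
    return sum(w * b[t] for t, w in a.items() if t in b)


# Toy example with made-up ids and weights; 0, 1 and 2 play the special tokens.
ids = np.array([0, 5, 7, 5, 2, 1])
sp = np.array([[0.9], [0.2], [0.0], [0.4], [0.3], [0.1]])
print(lexical_weights(ids, sp, special_ids={0, 1, 2}))
```

With the Usage snippet's variables, a call could look like `lexical_weights(enc["input_ids"][0], sparse[0], set(tok.all_special_ids))`.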
License
MIT, inherited from the base model BAAI/bge-m3.