bge-m3-onnx (fp32)

ONNX (fp32) export of BAAI/bge-m3, exposing all three representations in a single forward pass: dense (1024-d), sparse (lexical weights) and ColBERT (multi-vector). Built for CPU inference with ONNX Runtime.

It powers the CPU flavor of bge-m3-service (sophiacloud/bge-m3-service:cpu), but works standalone.

Why this exists (and where it actually helps)

On CPU, ONNX Runtime is typically faster than PyTorch eager mode thanks to graph-level operator fusion and lower per-op overhead. We benchmarked this export against the original model run through FlagEmbedding, with the same thread count, in fp32, on long real-world documents (x86):

op	ONNX-CPU	FlagEmbedding-CPU	speedup
embed	857 ms	1091 ms	1.27×

Honest framing:

The win is moderate (~1.3×), not 2-4×. The large multipliers reported elsewhere come from int8 quantization, which we deliberately do not ship (see "Why no int8" below).
On GPU there is no reason to use this. PyTorch fp16 + flash-attention on the original BAAI/bge-m3 is as fast or faster. Use ONNX only for CPU deployments.
The other real win is a lighter container image, since CPU deployments no longer need a torch dependency.
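To reproduce a comparison like the one in the table above on your own hardware, a small timing harness is usually enough. The sketch below is not the benchmark script used for this card. The names `median_latency_ms` and `speedup` are hypothetical, and the callable you pass in is assumed to wrap one full embed call (tokenization plus forward pass) for whichever backend you are timing.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 10) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):              # discard warm-up runs (lazy init, caches)
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Figures from the table above: FlagEmbedding-CPU vs ONNX-CPU.
print(f"{speedup(1091, 857):.2f}x")  # 1.27x
```

For a fair comparison, pin both backends to the same number of threads (for example via `SessionOptions.intra_op_num_threads` in ONNX Runtime and `torch.set_num_threads` in PyTorch) and feed both the same batch of documents.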

Numerically identical to the original

Validated against FlagEmbedding fp32 as reference, on real domain text:

signal	result
DENSE cosine vs reference	1.000000 (mean & min)
COLBERT MaxSim Δrel	0.0000%
SPARSE Jaccard (active tokens)	0.963

The sparse Jaccard is below 1 only because tokens with near-zero weight flip above or below the activation threshold. Their weights are so small that the sparse dot-product score is effectively unchanged.
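The signals in the table can be computed with a few lines of NumPy. This is a minimal sketch of how such a check might look, not the validation script behind the numbers above. The function names, the `threshold` parameter and the toy weights are illustrative assumptions.

```python
from typing import Dict, Tuple

import numpy as np


def cosine_stats(a: np.ndarray, b: np.ndarray) -> Tuple[float, float]:
    """Row-wise cosine similarity between two (B, D) arrays; returns (mean, min)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    return float(cos.mean()), float(cos.min())


def active_jaccard(w_a: Dict[int, float], w_b: Dict[int, float],
                   threshold: float = 0.0) -> float:
    """Jaccard overlap of the token ids whose weight exceeds `threshold`."""
    act_a = {t for t, w in w_a.items() if w > threshold}
    act_b = {t for t, w in w_b.items() if w > threshold}
    union = act_a | act_b
    return len(act_a & act_b) / len(union) if union else 1.0


# Toy example (made-up weights): token 12 is barely active in the reference
# and inactive in the other run, so Jaccard drops although the weights agree.
reference = {10: 0.31, 11: 0.22, 12: 0.0004}
candidate = {10: 0.31, 11: 0.22, 12: 0.0}
print(active_jaccard(reference, candidate))
```

In the toy example the only disagreement is a token whose weight is about 0.0004, so any dot product involving it changes by a correspondingly tiny amount.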

Technical note: the ColBERT CLS fix

BGE-M3 builds the ColBERT matrix from last_hidden_state[:, 1:], i.e. the CLS token is excluded. A naive ONNX export keeps it, which shifts MaxSim by ~1.5% (up to ~7.7%). The colbert output of this graph still contains the CLS vector at index 0; the consumer code for this export drops it to match the original exactly (that is why COLBERT Δrel is 0%). If you write your own inference loop, exclude the first token of the ColBERT output.
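If you do write your own loop, the post-processing might look like the sketch below. It assumes right padding (the default for this tokenizer), re-normalizes each vector defensively in case the graph output is not already L2-normalized, and scores with MaxSim as the mean over query tokens of the best-matching document token. The function names are hypothetical and this is not the consumer code shipped with bge-m3-service.

```python
import numpy as np


def colbert_vectors(colbert_row: np.ndarray, mask_row: np.ndarray) -> np.ndarray:
    """Turn one (seq, dim) ColBERT output row into its usable token vectors.

    Drops padding using the attention mask (assumes right padding) and
    drops the CLS vector at index 0.
    """
    n_tokens = int(mask_row.sum())           # real tokens, including CLS
    vecs = colbert_row[1:n_tokens]           # skip CLS, stop before padding
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)  # no-op if already normalized


def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Mean over query tokens of the maximum similarity to any document token."""
    sim = query_vecs @ doc_vecs.T            # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).mean())


# Stand-in for one row of the `colbert` output: 5 positions, the last is padding.
rng = np.random.default_rng(0)
row = rng.normal(size=(5, 8))
mask = np.array([1, 1, 1, 1, 0])
vecs = colbert_vectors(row, mask)
print(vecs.shape)  # (3, 8): CLS and the padded position are gone
```

With real outputs, `colbert_row` would be `colbert[i]` and `mask_row` would be `enc["attention_mask"][i]` from the Usage section.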

Why no int8

Dynamic int8 degraded quality past our bar: dense cosine dropped to 0.985 and reranker top-1 overlap fell to 0.625. For a retrieval system that is not acceptable, so we ship fp32 only. The export tooling can still produce int8 if you accept the trade-off.
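If you build an int8 variant yourself, it is worth measuring the same kind of regression before deploying it. The sketch below shows one plausible reading of "top-1 overlap": the fraction of queries whose best-scoring candidate is the same under both models. That definition, the function name and the toy scores are assumptions for illustration, not the exact metric used for the figure above.

```python
import numpy as np


def top1_overlap(scores_ref: np.ndarray, scores_test: np.ndarray) -> float:
    """Fraction of queries whose top-ranked candidate agrees between two runs.

    Both inputs have shape (n_queries, n_candidates).
    """
    top_ref = scores_ref.argmax(axis=1)
    top_test = scores_test.argmax(axis=1)
    return float((top_ref == top_test).mean())


# Toy scores (made up): 4 queries, 2 candidates each.
fp32_scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
int8_scores = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9]])
print(top1_overlap(fp32_scores, int8_scores))  # 0.75
```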

Files

model_fp32.onnx + model_fp32.onnx.data — graph + external weights (~2.2 GB)
tokenizer_hf/ — the HuggingFace tokenizer (use this, not a native ONNX tokenizer; ONNX SentencePiece tokenizers can diverge on multilingual edge cases)

Usage

import numpy as np, onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

d = snapshot_download("Sophia-AI/bge-m3-onnx")
tok = AutoTokenizer.from_pretrained(f"{d}/tokenizer_hf")
sess = ort.InferenceSession(f"{d}/model_fp32.onnx", providers=["CPUExecutionProvider"])

enc = tok(["esempio di testo"], padding=True, truncation=True,
          max_length=1024, return_tensors="np")
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
# dense: (B,1024) L2-normalized | sparse: (B,seq,1) token weights
# colbert: (B,seq,1024) — drop padding AND the CLS token (index 0) per text

Inputs: input_ids, attention_mask (int64). Outputs: dense, sparse, colbert.
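The sparse output is a per-position weight, so it usually needs to be folded into a token-id to weight mapping before scoring. The sketch below follows the approach FlagEmbedding takes as far as we understand it: skip special tokens, keep only positive weights, and keep the maximum weight when a token repeats. The function names are hypothetical, and the special ids {0, 1, 2, 3} in the example are the usual XLM-RoBERTa values; in real code take them from `tok.all_special_ids`.

```python
from typing import Dict, Iterable

import numpy as np


def sparse_to_weights(input_ids_row: np.ndarray, sparse_row: np.ndarray,
                      special_ids: Iterable[int]) -> Dict[int, float]:
    """Fold one (seq, 1) sparse output row into {token_id: weight}."""
    special = set(special_ids)
    weights: Dict[int, float] = {}
    for tid, w in zip(input_ids_row.tolist(), sparse_row.reshape(-1).tolist()):
        if tid in special or w <= 0:
            continue                          # skip CLS/EOS/PAD/UNK and inactive tokens
        if w > weights.get(tid, 0.0):
            weights[tid] = w                  # repeated token: keep the max weight
    return weights


def lexical_score(query: Dict[int, float], doc: Dict[int, float]) -> float:
    """Sparse dot product over the token ids shared by query and document."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


# Stand-in for one text: <s> t15 t27 t15 </s> <pad>
ids = np.array([0, 15, 27, 15, 2, 1])
sparse = np.array([[0.9], [0.2], [0.5], [0.4], [0.7], [0.3]])
print(sparse_to_weights(ids, sparse, special_ids={0, 1, 2, 3}))
```

With real outputs, `input_ids_row` would be `enc["input_ids"][i]` and `sparse_row` would be `sparse[i]`.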

License

MIT, inherited from the base model BAAI/bge-m3.
