# bge-small-en-v1.5 → MNN format (fp16)
MNN-format conversion of BAAI/bge-small-en-v1.5 for on-device inference via the MNN runtime (Android / Linux / iOS). Produced for TokForge to consolidate embedding and LLM inference on a single runtime and to drop the GGUF/llama.cpp dependency for RAG and memory/reflection embeddings.
## Model details

- Base model: `BAAI/bge-small-en-v1.5` (BERT encoder, 33M params)
- Architecture: 12-layer BERT, hidden=384, 12 attention heads, intermediate=1536
- Pooling: CLS token (token 0 of `last_hidden_state`), then L2-normalize (see the sketch after this list)
- Max sequence length: 512
- Embedding dim: 384
- Precision: fp16 weights
- File size: 64 MB (vs 133 MB for the fp32 safetensors; roughly 2.7× larger than the 24 MB GGUF Q4, but with no dequantization cost at inference)
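The pooling convention can be sanity-checked against the source ONNX graph. A minimal sketch, assuming `onnxruntime` is installed and the source repo's `onnx/model.onnx` has been downloaded locally (HF ONNX exports of BERT take int64 inputs; the MNN graph below takes int32):

```python
# Sketch: reference CLS pooling + L2-normalize on the source ONNX graph.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
sess = ort.InferenceSession("bge-small-en-v1.5/onnx/model.onnx",
                            providers=["CPUExecutionProvider"])

enc = tok("hello world", return_tensors="np")
feeds = {k: v.astype(np.int64) for k, v in enc.items()}  # ONNX export takes int64
(hidden,) = sess.run(["last_hidden_state"], feeds)       # (1, seq, 384)

cls = hidden[0, 0, :]                # token 0 = [CLS]
emb = cls / np.linalg.norm(cls)      # unit-norm sentence embedding
print(emb.shape)                     # (384,)
```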
## Conversion pipeline
- Downloaded `BAAI/bge-small-en-v1.5` from Hugging Face; the repo ships an `onnx/model.onnx` (opset 11) exported from PyTorch.
- Ran `MNNConvert -f ONNX --fp16` against that ONNX, which fused 25 LayerNorms and produced `encoder.mnn`.
- Copied the tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`) from the source repo.
Reproduce:

```sh
MNNConvert -f ONNX \
  --modelFile BAAI/bge-small-en-v1.5/onnx/model.onnx \
  --MNNModel encoder.mnn \
  --fp16 --bizCode bge-small
```
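After conversion, a quick load check confirms the graph exposes the expected tensors (a sketch using the MNN Python bindings; tensor names match the Inputs / outputs section below):

```python
# Sketch: verify encoder.mnn loads and exposes the expected I/O tensors.
import MNN

interp = MNN.Interpreter("encoder.mnn")
sess = interp.createSession({})
for name in ("input_ids", "attention_mask", "token_type_ids"):
    print(name, interp.getSessionInput(sess, name).getShape())
print("last_hidden_state",
      interp.getSessionOutput(sess, "last_hidden_state").getShape())
```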
## A/B quality vs GGUF baseline
Synthetic retrieval benchmark over 300 chunks and 20 gold queries drawn from an internal infra-documentation corpus. The GGUF baseline ran via llama.cpp's `llama-embedding` against the standard BAAI `bge-small-en-v1.5.Q4_K_M.gguf`. Both models use CLS pooling + L2 normalization and the BGE query prefix "Represent this sentence for searching relevant passages: ".
| Metric | GGUF (baseline) | MNN fp16 | Ratio |
|---|---|---|---|
| Recall@1 | 70.0% | 70.0% | 100% |
| Recall@3 | 75.0% | 75.0% | 100% |
| Recall@5 | 85.0% | 85.0% | 100% |
| Recall@10 | 95.0% | 95.0% | 100% |
| MRR | 0.7659 | 0.7653 | 99.9% |
| Mean gold-chunk cos sim | 0.7646 | 0.7654 | – |
Cross-model per-chunk cosine similarity: mean 0.9962, 5th-percentile 0.9948. The MNN fp16 conversion produces essentially the same embedding subspace as the PyTorch/ONNX reference (cosine 1.0000 vs sentence-transformers on sanity checks). There is no measurable retrieval regression.
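For reproducibility, the table's metrics reduce to a few lines of numpy. A sketch, assuming `q` is a (20, 384) matrix of unit-norm query embeddings, `c` a (300, 384) matrix of unit-norm chunk embeddings, and `gold` the gold chunk index per query (all names hypothetical):

```python
import numpy as np

def retrieval_metrics(q, c, gold, ks=(1, 3, 5, 10)):
    # Cosine similarity == dot product because rows are unit-norm.
    sims = q @ c.T
    order = np.argsort(-sims, axis=1)          # chunk indices, best first
    ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                      for i in range(len(gold))])
    recall = {k: float(np.mean(ranks <= k)) for k in ks}   # Recall@k
    mrr = float(np.mean(1.0 / ranks))                      # mean reciprocal rank
    return recall, mrr
```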
## Inputs / outputs

Input tensors (int32):

- `input_ids`: shape `(batch, seq)`
- `attention_mask`: shape `(batch, seq)`
- `token_type_ids`: shape `(batch, seq)`, all zeros for single-text embedding

Output tensor (fp32 on CPU):

- `last_hidden_state`: shape `(batch, seq, 384)`, Caffe/NCHW layout (use `Tensor_DimensionType_Caffe` when copying to host).
To get a sentence embedding: take position 0 (CLS token), then L2 normalize.
## Python usage (reference)
```python
import MNN
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("darkmaniac7/bge-small-en-MNN")

interp = MNN.Interpreter("encoder.mnn")
sess = interp.createSession({"backend": "CPU", "numThread": 4})
in_ids = interp.getSessionInput(sess, "input_ids")
in_mask = interp.getSessionInput(sess, "attention_mask")
in_tok = interp.getSessionInput(sess, "token_type_ids")
out_t = interp.getSessionOutput(sess, "last_hidden_state")

def embed(text):
    enc = tok(text, padding=False, truncation=True, max_length=512,
              return_tensors="np")
    seq = enc["input_ids"].shape[1]
    # Resize the graph to the actual sequence length before copying inputs.
    for t, _ in ((in_ids, enc["input_ids"]), (in_mask, enc["attention_mask"]),
                 (in_tok, enc["token_type_ids"])):
        interp.resizeTensor(t, (1, seq))
    interp.resizeSession(sess)
    # Feed the three int32 inputs (host arrays are row-major / "Tensorflow" layout).
    for t, arr in ((in_ids, enc["input_ids"]), (in_mask, enc["attention_mask"]),
                   (in_tok, enc["token_type_ids"])):
        tmp = MNN.Tensor((1, seq), MNN.Halide_Type_Int,
                         arr.astype(np.int32), MNN.Tensor_DimensionType_Tensorflow)
        t.copyFrom(tmp)
    interp.runSession(sess)
    # Copy the output to host memory; the graph emits Caffe/NCHW layout.
    shape = out_t.getShape()
    host = MNN.Tensor(shape, MNN.Halide_Type_Float,
                      np.zeros(tuple(shape), dtype=np.float32),
                      MNN.Tensor_DimensionType_Caffe)
    out_t.copyToHostTensor(host)
    arr = np.array(host.getData(), dtype=np.float32).reshape(tuple(shape))
    cls = arr[0, 0, :]                             # CLS pooling: token 0
    return cls / (np.linalg.norm(cls) + 1e-12)     # L2-normalize

# For asymmetric retrieval, prefix queries per BGE convention:
qvec = embed("Represent this sentence for searching relevant passages: " + "your query here")
pvec = embed("passage text here")
score = float(np.dot(qvec, pvec))  # cosine similarity (both are unit-norm)
```
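A tiny end-to-end use of `embed()` for ranking; the passage strings here are placeholders:

```python
# Rank an in-memory corpus against a query; rows of `index` are unit-norm.
passages = ["MNN is a lightweight on-device inference engine.",
            "BGE models are trained for dense retrieval."]
index = np.stack([embed(p) for p in passages])            # (N, 384)
q = embed("Represent this sentence for searching relevant passages: "
          "what runtime executes the model on device?")
best = int(np.argmax(index @ q))                          # cosine ranking
print(passages[best])
```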
## Attribution

Source weights are Copyright 2023 Beijing Academy of Artificial Intelligence, MIT-licensed. This conversion contains no novel weights; the ONNX graph from the source repo was re-serialized into MNN's binary format with fp16 weight compression.