bge-small-en-v1.5: MNN format (fp16)

MNN-format conversion of BAAI/bge-small-en-v1.5 for on-device inference via the MNN runtime (Android / Linux / iOS). Produced for TokForge to consolidate embedding and LLM inference on a single runtime, removing the GGUF/llama.cpp dependency for RAG and memory/reflection embeddings.

Model details

  • Base model: BAAI/bge-small-en-v1.5 (BERT encoder, 33M params)
  • Architecture: 12-layer BERT, hidden=384, 12 attention heads, intermediate=1536
  • Pooling: CLS token (token 0 of last_hidden_state), then L2-normalize
  • Max sequence length: 512
  • Embedding dim: 384
  • Precision: fp16 weights
  • File size: 64 MB (33M params at 2 bytes each; the fp32 safetensors is ~133 MB. Roughly 2.7x the size of the 24 MB GGUF Q4, but with no dequantization cost at inference)

Conversion pipeline

  1. Downloaded BAAI/bge-small-en-v1.5 from Hugging Face; the repo ships an onnx/model.onnx (opset 11) exported from PyTorch.
  2. Ran MNNConvert -f ONNX --fp16 against that ONNX, which fused 25 LayerNorms and produced encoder.mnn.
  3. Copied tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.txt) from the source repo.

Reproduce:

MNNConvert -f ONNX \
    --modelFile BAAI/bge-small-en-v1.5/onnx/model.onnx \
    --MNNModel encoder.mnn \
    --fp16 --bizCode bge-small
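
To sanity-check the converted graph, MNN's legacy Python session API can list the I/O tensors (a minimal sketch; the tensor names should match the Inputs / outputs section below):

import MNN

# Load the converted graph and create a CPU session.
interp = MNN.Interpreter("encoder.mnn")
sess = interp.createSession({"backend": "CPU", "numThread": 2})

# getSessionInputAll / getSessionOutputAll return {name: Tensor} dicts,
# so we can confirm the three int32 inputs and the fp32 output are present.
for name, t in interp.getSessionInputAll(sess).items():
    print("input: ", name, t.getShape())
for name, t in interp.getSessionOutputAll(sess).items():
    print("output:", name, t.getShape())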

A/B quality vs GGUF baseline

Synthetic retrieval benchmark over 300 chunks and 20 gold queries drawn from an internal infra documentation corpus. The GGUF baseline ran via llama.cpp's llama-embedding against the standard BAAI bge-small-en-v1.5.Q*_K_M.gguf. Both models use CLS pooling + L2 normalization and the BGE query prefix "Represent this sentence for searching relevant passages: ".

Metric                    GGUF (baseline)   MNN fp16   Ratio
Recall@1                  70.0%             70.0%      100%
Recall@3                  75.0%             75.0%      100%
Recall@5                  85.0%             85.0%      100%
Recall@10                 95.0%             95.0%      100%
MRR                       0.7659            0.7653     99.9%
Mean gold-chunk cos sim   0.7646            0.7654     -

Cross-model per-chunk cosine similarity: mean 0.9962, 5th percentile 0.9948. The MNN fp16 conversion produces essentially the same embedding space as the PyTorch/ONNX reference (cosine 1.0000 vs sentence-transformers on sanity checks). There is no measurable retrieval regression.
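
For reference, the metrics above follow the standard definitions. A minimal numpy sketch (query_vecs, chunk_vecs, and gold are hypothetical stand-ins for the internal corpus, which is not published; all vectors are unit-norm, so dot product equals cosine):

import numpy as np

def recall_at_k(query_vecs, chunk_vecs, gold, k):
    # sims[i, j] = cosine between query i and chunk j
    sims = query_vecs @ chunk_vecs.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([gold[i] in topk[i] for i in range(len(gold))]))

def mrr(query_vecs, chunk_vecs, gold):
    sims = query_vecs @ chunk_vecs.T
    ranks = np.argsort(-sims, axis=1)
    # 1-indexed rank of the gold chunk for each query
    pos = [int(np.where(ranks[i] == gold[i])[0][0]) + 1 for i in range(len(gold))]
    return float(np.mean([1.0 / p for p in pos]))

# Cross-model agreement: per-chunk cosine between the two embedding sets.
# cos = np.sum(mnn_chunks * gguf_chunks, axis=1)  # both unit-norm
# print(cos.mean(), np.percentile(cos, 5))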

Inputs / outputs

Input tensors (int32):

  • input_ids: shape (batch, seq)
  • attention_mask: shape (batch, seq)
  • token_type_ids: shape (batch, seq), all zeros for single-text embedding

Output tensor (fp32 on CPU):

  • last_hidden_state: shape (batch, seq, 384), in Caffe/NCHW dimension order (use Tensor_DimensionType_Caffe when copying to host).

To get a sentence embedding: take position 0 along the sequence axis (the CLS token), then L2-normalize.
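
In numpy terms (last_hidden here stands in for the output array after copying to host):

cls = last_hidden[:, 0, :]                              # CLS vectors, shape (batch, 384)
emb = cls / np.linalg.norm(cls, axis=1, keepdims=True)  # unit-norm sentence embeddings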

Python usage (reference)

import numpy as np
import MNN
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("darkmaniac7/bge-small-en-MNN")
interp = MNN.Interpreter("encoder.mnn")
sess = interp.createSession({"backend": "CPU", "numThread": 4})
in_ids  = interp.getSessionInput(sess, "input_ids")
in_mask = interp.getSessionInput(sess, "attention_mask")
in_tok  = interp.getSessionInput(sess, "token_type_ids")
out_t   = interp.getSessionOutput(sess, "last_hidden_state")

def embed(text):
    enc = tok(text, padding=False, truncation=True, max_length=512, return_tensors="np")
    seq = enc["input_ids"].shape[1]
    # Resize the graph to the actual sequence length before copying data in.
    for t in (in_ids, in_mask, in_tok):
        interp.resizeTensor(t, (1, seq))
    interp.resizeSession(sess)
    # Stage each int32 input in a host tensor (Tensorflow/NHWC order), then copy it in.
    for t, arr in ((in_ids, enc["input_ids"]), (in_mask, enc["attention_mask"]), (in_tok, enc["token_type_ids"])):
        tmp = MNN.Tensor((1, seq), MNN.Halide_Type_Int,
                         arr.astype(np.int32), MNN.Tensor_DimensionType_Tensorflow)
        t.copyFrom(tmp)
    interp.runSession(sess)
    # Copy the output out via a Caffe-order host tensor (see Inputs / outputs above).
    shape = out_t.getShape()
    host = MNN.Tensor(shape, MNN.Halide_Type_Float,
                      np.zeros(tuple(shape), dtype=np.float32),
                      MNN.Tensor_DimensionType_Caffe)
    out_t.copyToHostTensor(host)
    arr = np.array(host.getData(), dtype=np.float32).reshape(tuple(shape))
    # CLS pooling: position 0 of the last hidden state, then L2-normalize.
    cls = arr[0, 0, :]
    return cls / (np.linalg.norm(cls) + 1e-12)

# For asymmetric retrieval, prefix queries per BGE convention:
qvec = embed("Represent this sentence for searching relevant passages: " + "your query here")
pvec = embed("passage text here")
score = float(np.dot(qvec, pvec))  # cosine (both are unit-norm)
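
Continuing the script above, a minimal in-memory retrieval loop over the embed() helper might look like this (illustrative only; chunks is whatever list of passage strings you want to index):

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def build_index(chunks):
    # Embed each passage once; rows are unit-norm, so dot product == cosine.
    return np.stack([embed(c) for c in chunks])

def search(query, index, chunks, k=5):
    sims = index @ embed(QUERY_PREFIX + query)
    order = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in order]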

Attribution

Source weights are Copyright 2023 Beijing Academy of Artificial Intelligence, MIT-licensed. This conversion contains no novel weights: the ONNX graph from the source repo was re-serialized into MNN's binary format with fp16 weight compression.
