bge-small-en-v1.5: MNN format (fp16)

MNN-format conversion of BAAI/bge-small-en-v1.5 for on-device inference via the MNN runtime (Android / Linux / iOS). Produced for TokForge to consolidate embedding and LLM inference on a single runtime, removing the GGUF/llama.cpp dependency for RAG and memory/reflection embeddings.

Model details

  • Base model: BAAI/bge-small-en-v1.5 (BERT encoder, 33M params)
  • Architecture: 12-layer BERT, hidden=384, 12 attention heads, intermediate=1536
  • Pooling: CLS token (token 0 of last_hidden_state), then L2-normalize
  • Max sequence length: 512
  • Embedding dim: 384
  • Precision: fp16 weights
  • File size: 64 MB (33M params at 2 bytes each; the fp32 safetensors is ~133 MB. Roughly 2.7x the size of the 24 MB GGUF Q4, but with no dequantization cost at inference)

Conversion pipeline

  1. Downloaded BAAI/bge-small-en-v1.5 from Hugging Face; the repo ships an onnx/model.onnx (opset 11) exported from PyTorch.
  2. Ran MNNConvert -f ONNX --fp16 against that ONNX, which fused 25 LayerNorms and produced encoder.mnn.
  3. Copied tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.txt) from the source repo.

Reproduce:

MNNConvert -f ONNX \
    --modelFile BAAI/bge-small-en-v1.5/onnx/model.onnx \
    --MNNModel encoder.mnn \
    --fp16 --bizCode bge-small
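
To sanity-check the converted graph, MNN's legacy Python session API can list the I/O tensors (a minimal sketch; the tensor names should match the Inputs / outputs section below):

import MNN

# Load the converted graph and create a CPU session.
interp = MNN.Interpreter("encoder.mnn")
sess = interp.createSession({"backend": "CPU", "numThread": 2})

# getSessionInputAll / getSessionOutputAll return {name: Tensor} dicts,
# so we can confirm the three int32 inputs and the fp32 output are present.
for name, t in interp.getSessionInputAll(sess).items():
    print("input: ", name, t.getShape())
for name, t in interp.getSessionOutputAll(sess).items():
    print("output:", name, t.getShape())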

A/B quality vs GGUF baseline

Synthetic retrieval benchmark over 300 chunks and 20 gold queries drawn from an internal infra documentation corpus. The GGUF baseline ran via llama.cpp's llama-embedding against the standard BAAI bge-small-en-v1.5.Q*_K_M.gguf. Both models use CLS pooling + L2 normalization and the BGE query prefix "Represent this sentence for searching relevant passages: ".

Metric                    GGUF (baseline)   MNN fp16   Ratio
Recall@1                  70.0%             70.0%      100%
Recall@3                  75.0%             75.0%      100%
Recall@5                  85.0%             85.0%      100%
Recall@10                 95.0%             95.0%      100%
MRR                       0.7659            0.7653     99.9%
Mean gold-chunk cos sim   0.7646            0.7654     -

Cross-model per-chunk cosine similarity: mean 0.9962, 5th percentile 0.9948. The MNN fp16 conversion produces essentially the same embedding space as the PyTorch/ONNX reference (cosine 1.0000 vs sentence-transformers on sanity checks). There is no measurable retrieval regression.
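
For reference, the metrics above follow the standard definitions. A minimal numpy sketch (query_vecs, chunk_vecs, and gold are hypothetical stand-ins for the internal corpus, which is not published; all vectors are unit-norm, so dot product equals cosine):

import numpy as np

def recall_at_k(query_vecs, chunk_vecs, gold, k):
    # sims[i, j] = cosine between query i and chunk j
    sims = query_vecs @ chunk_vecs.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([gold[i] in topk[i] for i in range(len(gold))]))

def mrr(query_vecs, chunk_vecs, gold):
    sims = query_vecs @ chunk_vecs.T
    ranks = np.argsort(-sims, axis=1)
    # 1-indexed rank of the gold chunk for each query
    pos = [int(np.where(ranks[i] == gold[i])[0][0]) + 1 for i in range(len(gold))]
    return float(np.mean([1.0 / p for p in pos]))

# Cross-model agreement: per-chunk cosine between the two embedding sets.
# cos = np.sum(mnn_chunks * gguf_chunks, axis=1)  # both unit-norm
# print(cos.mean(), np.percentile(cos, 5))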

Inputs / outputs

Input tensors (int32):

  • input_ids: shape (batch, seq)
  • attention_mask: shape (batch, seq)
  • token_type_ids: shape (batch, seq), all zeros for single-text embedding

Output tensor (fp32 on CPU):

  • last_hidden_state: shape (batch, seq, 384), in Caffe/NCHW dimension order (use Tensor_DimensionType_Caffe when copying to host).

To get a sentence embedding: take position 0 along the sequence axis (the CLS token), then L2-normalize.
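
In numpy terms (last_hidden here stands in for the output array after copying to host):

cls = last_hidden[:, 0, :]                              # CLS vectors, shape (batch, 384)
emb = cls / np.linalg.norm(cls, axis=1, keepdims=True)  # unit-norm sentence embeddings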

Python usage (reference)

import numpy as np
import MNN
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("darkmaniac7/bge-small-en-MNN")
interp = MNN.Interpreter("encoder.mnn")
sess = interp.createSession({"backend": "CPU", "numThread": 4})
in_ids  = interp.getSessionInput(sess, "input_ids")
in_mask = interp.getSessionInput(sess, "attention_mask")
in_tok  = interp.getSessionInput(sess, "token_type_ids")
out_t   = interp.getSessionOutput(sess, "last_hidden_state")

def embed(text):
    enc = tok(text, padding=False, truncation=True, max_length=512, return_tensors="np")
    seq = enc["input_ids"].shape[1]
    # Resize the graph to the actual sequence length before copying data in.
    for t in (in_ids, in_mask, in_tok):
        interp.resizeTensor(t, (1, seq))
    interp.resizeSession(sess)
    # Stage each int32 input in a host tensor (Tensorflow/NHWC order), then copy it in.
    for t, arr in ((in_ids, enc["input_ids"]), (in_mask, enc["attention_mask"]), (in_tok, enc["token_type_ids"])):
        tmp = MNN.Tensor((1, seq), MNN.Halide_Type_Int,
                         arr.astype(np.int32), MNN.Tensor_DimensionType_Tensorflow)
        t.copyFrom(tmp)
    interp.runSession(sess)
    # Copy the output out via a Caffe-order host tensor (see Inputs / outputs above).
    shape = out_t.getShape()
    host = MNN.Tensor(shape, MNN.Halide_Type_Float,
                      np.zeros(tuple(shape), dtype=np.float32),
                      MNN.Tensor_DimensionType_Caffe)
    out_t.copyToHostTensor(host)
    arr = np.array(host.getData(), dtype=np.float32).reshape(tuple(shape))
    # CLS pooling: position 0 of the last hidden state, then L2-normalize.
    cls = arr[0, 0, :]
    return cls / (np.linalg.norm(cls) + 1e-12)

# For asymmetric retrieval, prefix queries per BGE convention:
qvec = embed("Represent this sentence for searching relevant passages: " + "your query here")
pvec = embed("passage text here")
score = float(np.dot(qvec, pvec))  # cosine (both are unit-norm)
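
Continuing the script above, a minimal in-memory retrieval loop over the embed() helper might look like this (illustrative only; chunks is whatever list of passage strings you want to index):

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def build_index(chunks):
    # Embed each passage once; rows are unit-norm, so dot product == cosine.
    return np.stack([embed(c) for c in chunks])

def search(query, index, chunks, k=5):
    sims = index @ embed(QUERY_PREFIX + query)
    order = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in order]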

Attribution

Source weights are Copyright 2023 Beijing Academy of Artificial Intelligence, MIT-licensed. This conversion contains no novel weights: the ONNX graph from the source repo was re-serialized into MNN's binary format with fp16 weight compression.
