bge-small-rrf-v3

A 33M-parameter (384-dim) English embedding model, fine-tuned from BAAI/bge-small-en-v1.5 using vstash's self-supervised hybrid-retrieval disagreement signal.

Same size and speed as the base model. Higher retrieval quality on all three BEIR datasets it was trained on.

What changed vs v2

bge-small-rrf-v3 is trained with the winning config from the 2026-04-19 H-R9 ablation:

  • 2x training volume. 60,000 target triples across three BEIR datasets instead of v2's 30,000. Volume was the single largest lever at this scale: every dataset improved simultaneously when doubling total_triples, no trade-off observed.
  • Same corpus balance. temperature=0.5 sampling keeps the same ratio v2 used; the volume increase scales every dataset proportionally rather than reshuffling.
  • Observability. The training pipeline now records NDCG@3 and Recall@100 alongside NDCG@10 in training_meta.json.
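For consumers of that file, a sketch of pulling the logged metrics back out. This is illustrative only: the metric and key names below are assumptions, since the actual training_meta.json schema isn't documented here.

```python
import json

def load_eval_metrics(path="training_meta.json"):
    """Read per-dataset eval metrics from training_meta.json.

    Assumes a hypothetical {"eval": {dataset: {metric: value}}} layout;
    adjust the keys to match the real schema.
    """
    with open(path) as f:
        meta = json.load(f)
    # Pull the three logged metrics per dataset, tolerating missing keys.
    return {
        ds: {m: scores.get(m) for m in ("ndcg@3", "ndcg@10", "recall@100")}
        for ds, scores in meta.get("eval", {}).items()
    }
```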

Eval numbers

Evaluated on the full 5-dataset BEIR cut (SciFact, NFCorpus, SciDocs, FiQA, ArguAna) with vstash's production retrieval pipeline (RRF hybrid + adaptive weights + doc-level dedup, wide top-100 candidate pool). Both v2 and v3 re-evaluated under the same pipeline for an apples-to-apples comparison.

Absolute NDCG@10 vs BM25, ColBERTv2, and previous releases

Dataset    BM25    ColBERTv2   Base     v2       v3
SciFact    0.665   0.693       0.9082   0.9107   0.9361
NFCorpus   0.325   0.344       0.3674   0.4325   0.3927
SciDocs    0.158   0.154       0.3637   0.3676   0.3693
FiQA       0.236   0.356       0.6509   0.6541   0.7506
ArguAna    0.315   0.463       0.7686   0.7579   0.7540
macro      -       -           0.6118   0.6246   0.6405
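The macro row is the unweighted mean of the five per-dataset scores; a quick sanity check in Python:

```python
# Per-dataset NDCG@10 in table order: SciFact, NFCorpus, SciDocs, FiQA, ArguAna.
v2 = [0.9107, 0.4325, 0.3676, 0.6541, 0.7579]
v3 = [0.9361, 0.3927, 0.3693, 0.7506, 0.7540]

macro = lambda xs: round(sum(xs) / len(xs), 4)
print(macro(v2), macro(v3))  # 0.6246 0.6405
```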

Wins vs ColBERTv2

Both v2 and v3 beat ColBERTv2 on 5/5 BEIR datasets under this pipeline. v3 improves the macro by +0.016 absolute NDCG@10 over v2 (0.6246 -> 0.6405, +2.6% relative).

v3 vs v2, dataset by dataset

Dataset    v2       v3       Winner   Note
SciFact    0.9107   0.9361   v3       +0.025 absolute
FiQA       0.6541   0.7506   v3       +0.097 absolute (the big v3 win)
SciDocs    0.3676   0.3693   v3       within noise
NFCorpus   0.4325   0.3927   v2       v2 retains the advantage (-0.040 in v3)
ArguAna    0.7579   0.7540   v2       within noise

Use v3 by default: it wins outright on SciFact and FiQA, is within noise on SciDocs and ArguAna, the macro is cleanly higher, and the FiQA improvement is substantial (+0.10 absolute NDCG@10 over the base model, roughly +15% relative). v2 remains the better pick when NFCorpus-style retrieval dominates your workload (keyword-heavy medical / biomedical corpora).

Supporting metrics

v3's candidate-set health (Recall@100) and head quality (NDCG@3) also improve on the datasets where overall NDCG@10 went up:

  • FiQA Recall@100: 0.9188 (v2) -> 0.9867 (v3)
  • SciFact NDCG@3: 0.8976 (v2) -> 0.9265 (v3)
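For reference, these metrics follow the standard definitions; the sketch below is generic, not vstash's evaluation code.

```python
import math

def dcg(rels):
    """Discounted cumulative gain: rel_i / log2(i + 2) summed over rank i (0-based)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k):
    """ranked_rels: graded relevance of each retrieved doc, in rank order."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0
```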

Full metrics JSON at experiments/results/v2_v3_head_to_head.json.

Reproduce the comparison with python -m experiments.v2_v3_head_to_head or the Colab notebook experiments/v2_v3_head_to_head.ipynb. Full H-R9 ablation (temperature / volume sweep that produced the v3 config) in experiments/retrain_roadmap.md.

Usage

Drop-in via sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-v3")
embeddings = model.encode(["what is hybrid retrieval?"], normalize_embeddings=True)

Inside vstash

vstash reindex --model Stffens/bge-small-rrf-v3

As a search / RAG backbone

Same API as bge-small-en-v1.5: 384 dimensions, cosine similarity, instruction-free encoding. Drop into any retrieval stack built around the base model.
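Since encode(..., normalize_embeddings=True) returns unit vectors, cosine similarity reduces to a plain dot product. A minimal ranking sketch, using toy 3-dim vectors in place of the 384-dim embeddings:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank(query_vec, doc_vecs):
    """Rank document indices by similarity to the query.

    With unit-normalized embeddings, cosine similarity == dot product,
    so ranking is a single dot product per document.
    """
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]
```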

Training recipe

Reproducible via the published notebook retrain_t1_4_multi_beir.ipynb after setting total_triples=60000. Single command:

vstash retrain-multi \
    --store scifact=./scifact.db \
    --store nfcorpus=./nfcorpus.db \
    --store fiqa=./fiqa.db \
    --sampling-strategy temperature \
    --sampling-temperature 0.5 \
    --total-triples 60000 \
    --epochs 2 --lr 3e-6 --batch-size 32 \
    --bulk-mine --bulk-eval \
    --seed 42 \
    --output ./bge-small-rrf-v3

Pipeline

  1. Ingest SciFact (5183 docs), NFCorpus (3633), FiQA (57638) into separate vstash stores.
  2. Sample training queries from BEIR queries.jsonl + qrels (v5 labeled-query recipe).
  3. Mine hard negatives via vec-heavy / FTS-heavy RRF disagreement on each store.
  4. Train one model on the union with MNRL for 2 epochs.
  5. Evaluate per-dataset; promote the candidate only if macro NDCG@10 exceeds the base.
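Step 3 relies on Reciprocal Rank Fusion. Below is the standard RRF formula plus one plausible disagreement heuristic; both are illustrative sketches, not vstash's exact mining rule.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_r(d)).

    `rankings` is a list of ranked doc-id lists; k=60 is the common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def disagreement_negatives(vec_ranking, fts_ranking, top=10):
    """Hypothetical hard-negative heuristic: docs that one retriever puts in
    its top results but the other misses entirely are likely confusable
    (lexically or semantically) and make useful hard negatives."""
    vec_top, fts_top = set(vec_ranking[:top]), set(fts_ranking[:top])
    return (vec_top - fts_top) | (fts_top - vec_top)
```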

Hyperparameters

Key                     Value
Base model              BAAI/bge-small-en-v1.5
Loss                    MultipleNegativesRankingLoss
Total training triples  60,000 (target) / 39,852 (emitted)
Sampling                temperature (alpha=0.5)
Epochs                  2
Learning rate           3e-6
Batch size              32
Warmup steps            50
max_seq_length          256
Mixed precision         FP16 (AMP on)
Seed                    42
Training hardware       NVIDIA A100
Training time           ~15 minutes
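MultipleNegativesRankingLoss treats every other in-batch passage as a negative for each (query, positive) pair. A pure-Python sketch of that objective (not the sentence-transformers implementation, which additionally applies a similarity scale factor):

```python
import math

def mnrl(sim):
    """In-batch MNRL over an NxN similarity matrix.

    sim[i][j] = (scaled) similarity of query i and passage j; passage j == i
    is the positive, all other passages in the batch are negatives.
    Loss is the mean cross-entropy of each row's softmax at the diagonal.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        log_z = math.log(sum(math.exp(s) for s in sim[i]))
        loss += log_z - sim[i][i]  # -log softmax of the positive entry
    return loss / n
```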

Limitations

  • English only. The base model and training data are English. Cross-lingual retrieval may regress vs a multilingual model.
  • NFCorpus still saturates. Even at 2x volume, v3's NFCorpus NDCG@10 (0.3927 under the production pipeline) regresses vs v2's 0.4325 and stays short of v5's published 0.409. The gap is likely model capacity (33M params) and can be closed by a cross-encoder reranker on top; see vstash's T2.4 design doc.
  • Domain-specific corpora (clinical, legal, heavily-jargoned) may benefit more from retraining with vstash retrain-multi --store mydomain=... on top of v3 than from v3 out of the box.

Citation

@software{vstash_bge_small_rrf_v3_2026,
  author  = {Steffens, Jay},
  title   = {bge-small-rrf-v3: self-supervised retrieval fine-tune of BGE-small via vstash},
  year    = {2026},
  url     = {https://huggingface.co/Stffens/bge-small-rrf-v3}
}

For vstash itself:

@software{vstash_2026,
  author  = {Steffens, Jay},
  title   = {vstash: local-first document memory with instant semantic search},
  year    = {2026},
  url     = {https://github.com/stffns/vstash}
}