# bge-small-rrf-v3

A 33M-parameter (384-dim) English embedding model, fine-tuned from
`BAAI/bge-small-en-v1.5` using vstash's self-supervised hybrid-retrieval
disagreement signal. Same size and speed as the base model, with higher
retrieval quality on all three BEIR datasets it was trained on.
## What changed vs v2

bge-small-rrf-v3 is trained with the winning config from the 2026-04-19
H-R9 ablation:

- 2x training volume. 60,000 target triples across three BEIR datasets
  instead of v2's 30,000. Volume was the single largest lever at this
  scale: every dataset improved simultaneously when doubling
  `total_triples`, with no trade-off observed.
- Same corpus balance. `temperature=0.5` sampling keeps the same ratio v2
  used; the volume increase scales every dataset proportionally rather
  than reshuffling.
- Observability. The training pipeline now records NDCG@3 and Recall@100
  alongside NDCG@10 in `training_meta.json`.
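The corpus-balance point can be made concrete. Temperature sampling draws from each dataset in proportion to its size raised to the temperature, so `temperature=0.5` flattens the size imbalance without erasing it. The allocation helper below is a minimal sketch of that idea (not vstash's actual implementation); the corpus sizes are the ones quoted in this card.

```python
# Temperature sampling over dataset sizes: allocation_i ∝ n_i ** T.
# T=1 reproduces proportional sampling; T=0 is uniform; T=0.5 sits
# between, damping FiQA's dominance. Sketch only -- the function name
# and rounding behavior are illustrative, not vstash internals.
def temperature_allocate(sizes, total, temperature=0.5):
    weights = {name: n ** temperature for name, n in sizes.items()}
    z = sum(weights.values())
    return {name: round(total * w / z) for name, w in weights.items()}

# BEIR corpus sizes from the Pipeline section of this card.
sizes = {"scifact": 5183, "nfcorpus": 3633, "fiqa": 57638}
alloc = temperature_allocate(sizes, total=60000, temperature=0.5)
```

Doubling `total` under this scheme scales every dataset's share by the same factor, which is exactly the "no reshuffling" property described above.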
## Eval numbers

Evaluated on the full 5-dataset BEIR cut (SciFact, NFCorpus, SciDocs,
FiQA, ArguAna) with vstash's production retrieval pipeline (RRF hybrid +
adaptive weights + doc-level dedup, wide top-100 candidate pool). Both v2
and v3 were re-evaluated under the same pipeline for an apples-to-apples
comparison.
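For readers unfamiliar with the fusion step in that pipeline: Reciprocal Rank Fusion combines multiple rankings by summing `1 / (k + rank)` per document, with `k = 60` as the common default. The sketch below shows only this core step; the adaptive weighting and doc-level dedup mentioned above are vstash-specific and omitted.

```python
# Standard Reciprocal Rank Fusion over two rankings (e.g. a vector
# ranking and an FTS/BM25 ranking). score(d) = sum_r 1 / (k + rank_r(d)).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vec_ranking = ["d3", "d1", "d2"]   # toy vector-search ranking
fts_ranking = ["d1", "d2", "d4"]   # toy full-text-search ranking
fused = rrf([vec_ranking, fts_ranking])
```

Documents ranked by both retrievers (here `d1`, `d2`) accumulate two reciprocal-rank terms and float to the top, which is why RRF rewards cross-retriever agreement.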
### Absolute NDCG@10 vs BM25, ColBERTv2, and previous releases
| Dataset | BM25 | ColBERTv2 | Base | v2 | v3 |
|---|---|---|---|---|---|
| SciFact | 0.665 | 0.693 | 0.9082 | 0.9107 | 0.9361 |
| NFCorpus | 0.325 | 0.344 | 0.3674 | 0.4325 | 0.3927 |
| SciDocs | 0.158 | 0.154 | 0.3637 | 0.3676 | 0.3693 |
| FiQA | 0.236 | 0.356 | 0.6509 | 0.6541 | 0.7506 |
| ArguAna | 0.315 | 0.463 | 0.7686 | 0.7579 | 0.7540 |
| macro | - | - | 0.6118 | 0.6246 | 0.6405 |
### Wins vs ColBERTv2

Both v2 and v3 beat ColBERTv2 on 5/5 BEIR datasets under this pipeline.
v3 improves macro NDCG@10 by +0.016 absolute over v2 (+2.5% relative).
### v3 vs v2, dataset by dataset
| Dataset | v2 | v3 | Winner | Note |
|---|---|---|---|---|
| SciFact | 0.9107 | 0.9361 | v3 | +0.025 absolute |
| FiQA | 0.6541 | 0.7506 | v3 | +0.097 absolute (the big v3 win) |
| SciDocs | 0.3676 | 0.3693 | v3 | within noise |
| NFCorpus | 0.4325 | 0.3927 | v2 | v2 retains the advantage (-0.040 in v3) |
| ArguAna | 0.7579 | 0.7540 | v2 | within noise |
Use v3 by default: it wins or ties on 3/5 datasets, the macro is cleanly
higher, and the FiQA improvement is substantial (+0.10 absolute over the
base, roughly +15% relative). v2 remains the better pick when
NFCorpus-style retrieval dominates your workload (keyword-heavy medical /
biomedical corpora).
## Supporting metrics

v3's candidate-set health (Recall@100) and head quality (NDCG@3) also
improve on the datasets where the overall NDCG@10 went up:
- FiQA Recall@100: 0.9188 (v2) -> 0.9867 (v3)
- SciFact NDCG@3: 0.8976 (v2) -> 0.9265 (v3)
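For reference, these metrics have simple definitions. The functions below are a minimal binary-relevance sketch of NDCG@k and Recall@k (BEIR's official scoring supports graded relevance; this simplification is ours), where `retrieved` is a ranked doc-id list and `relevant` is the qrels-positive set for one query.

```python
import math

# NDCG@k with binary gains: DCG sums 1/log2(rank+1) over relevant hits
# in the top k, then normalizes by the best achievable DCG (IDCG).
def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Recall@k: fraction of the relevant set found in the top k.
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

NDCG@3 is sensitive only to the head of the ranking, while Recall@100 measures whether the wide candidate pool contains the answers at all, which is why the card tracks both.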
Full metrics JSON at `experiments/results/v2_v3_head_to_head.json`.
Reproduce the comparison with `python -m experiments.v2_v3_head_to_head`
or the Colab notebook `experiments/v2_v3_head_to_head.ipynb`. The full
H-R9 ablation (the temperature / volume sweep that produced the v3
config) is in `experiments/retrain_roadmap.md`.
## Usage

### Drop-in via sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-v3")
embeddings = model.encode(["what is hybrid retrieval?"], normalize_embeddings=True)
```

### Inside vstash

```shell
vstash reindex --model Stffens/bge-small-rrf-v3
```
### As a search / RAG backbone

Same API as `bge-small-en-v1.5`: 384 dimensions, cosine similarity,
instruction-free encoding. Drop it into any retrieval stack built around
the base model.
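Because the model emits unit-normalized vectors (with `normalize_embeddings=True`), cosine similarity reduces to a dot product and corpus-wide scoring is a single matmul. The sketch below uses random stand-ins for real 384-dim model outputs so it runs without downloading the model.

```python
import numpy as np

# Random unit vectors standing in for real embeddings; with normalized
# vectors, cosine similarity == dot product, so ranking a corpus against
# a query is one matrix-vector product.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 384))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of document 2.
query = corpus[2] + 0.01 * rng.normal(size=384)
query /= np.linalg.norm(query)

scores = corpus @ query        # cosine similarities, shape (5,)
ranking = np.argsort(-scores)  # best match first
```

In a real stack the `corpus` matrix would come from `model.encode(docs, normalize_embeddings=True)` and would typically live in a vector index rather than a dense numpy array.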
## Training recipe

Reproducible via the published notebook `retrain_t1_4_multi_beir.ipynb`
after setting `total_triples=60000`. Single command:

```shell
vstash retrain-multi \
  --store scifact=./scifact.db \
  --store nfcorpus=./nfcorpus.db \
  --store fiqa=./fiqa.db \
  --sampling-strategy temperature \
  --sampling-temperature 0.5 \
  --total-triples 60000 \
  --epochs 2 --lr 3e-6 --batch-size 32 \
  --bulk-mine --bulk-eval \
  --seed 42 \
  --output ./bge-small-rrf-v3
```
### Pipeline

- Ingest SciFact (5,183 docs), NFCorpus (3,633), FiQA (57,638) into
  separate vstash stores.
- Sample training queries from BEIR `queries.jsonl` + qrels (v5
  labeled-query recipe).
- Mine hard negatives via vec-heavy / FTS-heavy RRF disagreement on each
  store.
- Train one model on the union with MNRL for 2 epochs.
- Evaluate per-dataset; promote the candidate only if macro NDCG@10
  exceeds the base.
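The disagreement-mining step above rests on one observation: a document that one retriever ranks highly but the other does not is a strong hard-negative candidate, because the two signals conflict on it. The function below is an illustration of that idea only; the name, parameters, and selection rule are hypothetical, not vstash's actual mining code.

```python
# Sketch of hybrid-retrieval disagreement mining. Documents in exactly
# one retriever's head (vector-heavy OR FTS-heavy, not both) become
# hard-negative candidates. Illustrative only -- not vstash internals.
def disagreement_negatives(vec_ranking, fts_ranking, top=3, max_neg=5):
    vec_top, fts_top = set(vec_ranking[:top]), set(fts_ranking[:top])
    # XOR of head membership = where the two rankers disagree.
    candidates = [d for d in vec_ranking[:top] + fts_ranking[:top]
                  if (d in vec_top) != (d in fts_top)]
    return candidates[:max_neg]

vec = ["d1", "d2", "d3", "d4"]  # toy vector-search ranking
fts = ["d1", "d5", "d6", "d2"]  # toy full-text-search ranking
negs = disagreement_negatives(vec, fts)
```

Documents both retrievers agree on (here `d1`) are excluded: agreement suggests genuine relevance, and such documents would be risky as negatives.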
### Hyperparameters
| Key | Value |
|---|---|
| Base model | BAAI/bge-small-en-v1.5 |
| Loss | MultipleNegativesRankingLoss |
| Total training triples | 60,000 (target) / 39,852 (emitted) |
| Sampling | temperature, alpha=0.5 |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 32 |
| Warmup steps | 50 |
| max_seq_length | 256 |
| Mixed precision | FP16 (AMP on) |
| Seed | 42 |
| Training hardware | NVIDIA A100 |
| Training time | ~15 minutes |
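MultipleNegativesRankingLoss treats every other positive in the batch as an in-batch negative: for a batch of (query, positive) pairs it builds a similarity matrix and applies cross-entropy with the diagonal as the target. The numpy version below sketches that computation (sentence-transformers' implementation uses a default similarity scale of 20, mirrored here); it is a reference for understanding the loss, not the training code itself.

```python
import numpy as np

# MNRL sketch: cross-entropy over scaled cosine similarities between
# queries q and positives p, with q[i] paired to p[i] and all p[j != i]
# serving as in-batch negatives for q[i].
def mnrl(q, p, scale=20.0):
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = scale * (q @ p.T)  # (batch, batch) scaled cosine similarities
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # NLL of the true (i, i) pairs

rng = np.random.default_rng(42)
q = rng.normal(size=(4, 8))
loss_random = mnrl(q, rng.normal(size=(4, 8)))  # unrelated pairs: high loss
loss_aligned = mnrl(q, q)                       # identical pairs: near zero
```

This is why batch size matters for MNRL: a batch of 32 gives each query 31 free negatives on top of the mined hard negatives.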
## Limitations
- English only. The base model and training data are English. Cross-lingual retrieval may regress vs a multilingual model.
- NFCorpus still saturates. Even at 2x volume, the NFCorpus NDCG@10 stays around 0.376, short of v5's published 0.409. The gap is likely model-capacity (33M params) and can be closed by a cross-encoder reranker on top; see vstash's T2.4 design doc.
- Domain-specific corpora (clinical, legal, heavily jargoned) may benefit
  more from retraining with `vstash retrain-multi --store mydomain=...`
  on top of v3 than from using v3 out of the box.
## Citation

```bibtex
@software{vstash_bge_small_rrf_v3_2026,
  author = {Steffens, Jay},
  title  = {bge-small-rrf-v3: self-supervised retrieval fine-tune of BGE-small via vstash},
  year   = {2026},
  url    = {https://huggingface.co/Stffens/bge-small-rrf-v3}
}
```

For vstash itself:

```bibtex
@software{vstash_2026,
  author = {Steffens, Jay},
  title  = {vstash: local-first document memory with instant semantic search},
  year   = {2026},
  url    = {https://github.com/stffns/vstash}
}
```