bge-small-rrf-v3

A 33M-parameter (384-dim) English embedding model, fine-tuned from BAAI/bge-small-en-v1.5 using vstash's self-supervised hybrid-retrieval disagreement signal.

Same size and speed as the base model. Higher retrieval quality on all three BEIR datasets it was trained on.

What changed vs v2

bge-small-rrf-v3 is trained with the winning config from the 2026-04-19 H-R9 ablation:

  • 2x training volume. 60,000 target triples across three BEIR datasets instead of v2's 30,000. Volume was the single largest lever at this scale: every dataset improved simultaneously when doubling total_triples, no trade-off observed.
  • Same corpus balance. temperature=0.5 sampling keeps the same ratio v2 used; the volume increase scales every dataset proportionally rather than reshuffling.
  • Observability. The training pipeline now records NDCG@3 and Recall@100 alongside NDCG@10 in training_meta.json.
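For consumers of that file, a sketch of pulling the logged metrics back out. This is illustrative only: the metric and key names below are assumptions, since the actual training_meta.json schema isn't documented here.

```python
import json

def load_eval_metrics(path="training_meta.json"):
    """Read per-dataset eval metrics from training_meta.json.

    Assumes a hypothetical {"eval": {dataset: {metric: value}}} layout;
    adjust the keys to match the real schema.
    """
    with open(path) as f:
        meta = json.load(f)
    # Pull the three logged metrics per dataset, tolerating missing keys.
    return {
        ds: {m: scores.get(m) for m in ("ndcg@3", "ndcg@10", "recall@100")}
        for ds, scores in meta.get("eval", {}).items()
    }
```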

Eval numbers

Evaluated on the full 5-dataset BEIR cut (SciFact, NFCorpus, SciDocs, FiQA, ArguAna) with vstash's production retrieval pipeline (RRF hybrid + adaptive weights + doc-level dedup, wide top-100 candidate pool). Both v2 and v3 re-evaluated under the same pipeline for an apples-to-apples comparison.

Absolute NDCG@10 vs BM25, ColBERTv2, and previous releases

Dataset    BM25    ColBERTv2   Base     v2       v3
SciFact    0.665   0.693       0.9082   0.9107   0.9361
NFCorpus   0.325   0.344       0.3674   0.4325   0.3927
SciDocs    0.158   0.154       0.3637   0.3676   0.3693
FiQA       0.236   0.356       0.6509   0.6541   0.7506
ArguAna    0.315   0.463       0.7686   0.7579   0.7540
macro      -       -           0.6118   0.6246   0.6405
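The macro row is the unweighted mean of the five per-dataset scores; a quick sanity check in Python:

```python
# Per-dataset NDCG@10 in table order: SciFact, NFCorpus, SciDocs, FiQA, ArguAna.
v2 = [0.9107, 0.4325, 0.3676, 0.6541, 0.7579]
v3 = [0.9361, 0.3927, 0.3693, 0.7506, 0.7540]

macro = lambda xs: round(sum(xs) / len(xs), 4)
print(macro(v2), macro(v3))  # 0.6246 0.6405
```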

Wins vs ColBERTv2

Both v2 and v3 beat ColBERTv2 on 5/5 BEIR datasets under this pipeline. v3 improves the macro by +0.016 absolute NDCG@10 over v2 (0.6246 -> 0.6405, +2.6% relative).

v3 vs v2, dataset by dataset

Dataset    v2       v3       Winner   Note
SciFact    0.9107   0.9361   v3       +0.025 absolute
FiQA       0.6541   0.7506   v3       +0.097 absolute (the big v3 win)
SciDocs    0.3676   0.3693   v3       within noise
NFCorpus   0.4325   0.3927   v2       v2 retains the advantage (-0.040 in v3)
ArguAna    0.7579   0.7540   v2       within noise

Use v3 by default: it wins outright on SciFact and FiQA, is within noise on SciDocs and ArguAna, the macro is cleanly higher, and the FiQA improvement is substantial (+0.10 absolute NDCG@10 over the base model, roughly +15% relative). v2 remains the better pick when NFCorpus-style retrieval dominates your workload (keyword-heavy medical / biomedical corpora).

Supporting metrics

v3's candidate-set health (Recall@100) and head quality (NDCG@3) also improve on the datasets where overall NDCG@10 went up:

  • FiQA Recall@100: 0.9188 (v2) -> 0.9867 (v3)
  • SciFact NDCG@3: 0.8976 (v2) -> 0.9265 (v3)
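For reference, these metrics follow the standard definitions; the sketch below is generic, not vstash's evaluation code.

```python
import math

def dcg(rels):
    """Discounted cumulative gain: rel_i / log2(i + 2) summed over rank i (0-based)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k):
    """ranked_rels: graded relevance of each retrieved doc, in rank order."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0
```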

Full metrics JSON at experiments/results/v2_v3_head_to_head.json.

Reproduce the comparison with python -m experiments.v2_v3_head_to_head or the Colab notebook experiments/v2_v3_head_to_head.ipynb. Full H-R9 ablation (temperature / volume sweep that produced the v3 config) in experiments/retrain_roadmap.md.

Usage

Drop-in via sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-v3")
embeddings = model.encode(["what is hybrid retrieval?"], normalize_embeddings=True)

Inside vstash

vstash reindex --model Stffens/bge-small-rrf-v3

As a search / RAG backbone

Same API as bge-small-en-v1.5: 384 dimensions, cosine similarity, instruction-free encoding. Drop into any retrieval stack built around the base model.
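Since encode(..., normalize_embeddings=True) returns unit vectors, cosine similarity reduces to a plain dot product. A minimal ranking sketch, using toy 3-dim vectors in place of the 384-dim embeddings:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank(query_vec, doc_vecs):
    """Rank document indices by similarity to the query.

    With unit-normalized embeddings, cosine similarity == dot product,
    so ranking is a single dot product per document.
    """
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]
```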

Training recipe

Reproducible via the published notebook retrain_t1_4_multi_beir.ipynb after setting total_triples=60000. Single command:

vstash retrain-multi \
    --store scifact=./scifact.db \
    --store nfcorpus=./nfcorpus.db \
    --store fiqa=./fiqa.db \
    --sampling-strategy temperature \
    --sampling-temperature 0.5 \
    --total-triples 60000 \
    --epochs 2 --lr 3e-6 --batch-size 32 \
    --bulk-mine --bulk-eval \
    --seed 42 \
    --output ./bge-small-rrf-v3

Pipeline

  1. Ingest SciFact (5183 docs), NFCorpus (3633), FiQA (57638) into separate vstash stores.
  2. Sample training queries from BEIR queries.jsonl + qrels (v5 labeled-query recipe).
  3. Mine hard negatives via vec-heavy / FTS-heavy RRF disagreement on each store.
  4. Train one model on the union with MNRL for 2 epochs.
  5. Evaluate per-dataset; promote the candidate only if macro NDCG@10 exceeds the base.
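Step 3 relies on Reciprocal Rank Fusion. Below is the standard RRF formula plus one plausible disagreement heuristic; both are illustrative sketches, not vstash's exact mining rule.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_r(d)).

    `rankings` is a list of ranked doc-id lists; k=60 is the common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def disagreement_negatives(vec_ranking, fts_ranking, top=10):
    """Hypothetical hard-negative heuristic: docs that one retriever puts in
    its top results but the other misses entirely are likely confusable
    (lexically or semantically) and make useful hard negatives."""
    vec_top, fts_top = set(vec_ranking[:top]), set(fts_ranking[:top])
    return (vec_top - fts_top) | (fts_top - vec_top)
```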

Hyperparameters

Key                     Value
Base model              BAAI/bge-small-en-v1.5
Loss                    MultipleNegativesRankingLoss
Total training triples  60,000 (target) / 39,852 (emitted)
Sampling                temperature (alpha=0.5)
Epochs                  2
Learning rate           3e-6
Batch size              32
Warmup steps            50
max_seq_length          256
Mixed precision         FP16 (AMP on)
Seed                    42
Training hardware       NVIDIA A100
Training time           ~15 minutes
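MultipleNegativesRankingLoss treats every other in-batch passage as a negative for each (query, positive) pair. A pure-Python sketch of that objective (not the sentence-transformers implementation, which additionally applies a similarity scale factor):

```python
import math

def mnrl(sim):
    """In-batch MNRL over an NxN similarity matrix.

    sim[i][j] = (scaled) similarity of query i and passage j; passage j == i
    is the positive, all other passages in the batch are negatives.
    Loss is the mean cross-entropy of each row's softmax at the diagonal.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        log_z = math.log(sum(math.exp(s) for s in sim[i]))
        loss += log_z - sim[i][i]  # -log softmax of the positive entry
    return loss / n
```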

Limitations

  • English only. The base model and training data are English. Cross-lingual retrieval may regress vs a multilingual model.
  • NFCorpus still saturates. Even at 2x volume, v3's NFCorpus NDCG@10 (0.3927 under the production pipeline) regresses vs v2's 0.4325 and stays short of v5's published 0.409. The gap is likely model capacity (33M params) and can be closed by a cross-encoder reranker on top; see vstash's T2.4 design doc.
  • Domain-specific corpora (clinical, legal, heavily-jargoned) may benefit more from retraining with vstash retrain-multi --store mydomain=... on top of v3 than from v3 out of the box.

Citation

@software{vstash_bge_small_rrf_v3_2026,
  author  = {Steffens, Jay},
  title   = {bge-small-rrf-v3: self-supervised retrieval fine-tune of BGE-small via vstash},
  year    = {2026},
  url     = {https://huggingface.co/Stffens/bge-small-rrf-v3}
}

For vstash itself:

@software{vstash_2026,
  author  = {Steffens, Jay},
  title   = {vstash: local-first document memory with instant semantic search},
  year    = {2026},
  url     = {https://github.com/stffns/vstash}
}