e5-base-v2-code-search (v9-200k)

A fine-tuned code search embedding model based on intfloat/e5-base-v2 (110M parameters). Trained with call-graph false-negative filtering on 200K balanced pairs across 9 programming languages.

Built for cqs — code intelligence and RAG for AI agents.

Key Results

Eval                                           Metric    Score
Pipeline (55 confusable functions, enriched)   R@1       94.5%
Pipeline                                       MRR       0.966
Raw code embedding (no enrichment)             R@1       70.9%
CodeSearchNet (6 languages)                    NDCG@10   0.615

The 94.5% pipeline score matches BGE-large (335M) with roughly one-third the parameters. The 70.9% raw R@1 exceeds BGE-large's 61.8%.

Training

  • Base model: intfloat/e5-base-v2 (110M params, 768 dimensions)
  • Data: 200K balanced pairs (22,222 per language × 9 languages) from cqs-indexed Stack repos
  • Key technique: Call-graph false-negative filtering — uses code structure (caller/callee relationships) to exclude structurally related functions from contrastive negatives. Zero API cost (SQLite lookup).
  • Loss: CachedGISTEmbedLoss + MatryoshkaLoss (768/384/192/128 dims)
  • LoRA: rank 16, alpha 32, targets: query, key, value, dense
  • Epochs: 1 (additional epochs degrade enrichment compatibility)
  • Dataset: jamie8johnson/cqs-code-search-200k
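The call-graph filtering step above can be sketched as follows. This is a minimal illustration, not the cqs implementation: the `call_edges(caller, callee)` table, the function names, and the one-hop neighborhood query are all assumptions made for the example.

```python
import sqlite3

# Hypothetical call-graph store: one row per caller -> callee edge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call_edges (caller TEXT, callee TEXT)")
conn.executemany(
    "INSERT INTO call_edges VALUES (?, ?)",
    [("parse_config", "read_file"), ("read_file", "open_path")],
)

def related_functions(anchor: str) -> set:
    """Functions one call-graph hop from `anchor` (its callers and callees)."""
    rows = conn.execute(
        "SELECT callee FROM call_edges WHERE caller = ? "
        "UNION SELECT caller FROM call_edges WHERE callee = ?",
        (anchor, anchor),
    ).fetchall()
    return {r[0] for r in rows}

def filter_negatives(anchor: str, candidates: list) -> list:
    """Drop structurally related functions from the contrastive negative pool."""
    related = related_functions(anchor)
    return [c for c in candidates if c not in related]

# open_path (callee) and parse_config (caller) are excluded as likely
# false negatives; the unrelated send_email survives as a negative.
negatives = filter_negatives("read_file", ["open_path", "parse_config", "send_email"])
```

Because the lookup is a local SQLite query, filtering adds no API cost during training-data construction.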

The 89.1% Basin

Six independent perturbations of this configuration (among them: more data, less data, FAISS-mined hard negatives, more epochs, and contrastive query augmentation) all produce the same -5.4 pp pipeline regression, to 89.1%. The 94.5% result appears to occupy a narrow peak in the loss landscape around ~22K examples per language with call-graph filtering.

Usage with cqs

# Default model in cqs v1.9.0+
cqs init && cqs index

# Or specify explicitly
export CQS_EMBEDDING_MODEL=e5-base
cqs index

Usage with sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jamie8johnson/e5-base-v2-code-search")
query_emb = model.encode("query: find functions that validate email addresses")
code_emb = model.encode("passage: def validate_email(addr): ...")
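Because the model was trained with MatryoshkaLoss at 768/384/192/128 dims, embeddings can be truncated to a smaller prefix and re-normalized before similarity search. A sketch of that truncation step, using stand-in vectors so it runs without downloading the model:

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    head = np.asarray(emb)[..., :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

# Stand-ins for model.encode(...) output (2 embeddings x 768 dims).
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 768))

small = truncate_and_normalize(full, 384)  # shape (2, 384), unit norm
sim = float(small[0] @ small[1])           # cosine similarity at 384 dims
```

In practice you would pass `model.encode(...)` output through the same helper; smaller dims trade some accuracy for index size and search speed.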

Languages

Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript, C++

ONNX

Includes model.onnx for inference with ONNX Runtime (used by cqs for local GPU/CPU inference).

Citation

Paper in preparation. See research log for methodology.
