---
license: apache-2.0
tags:
- sentence-similarity
- feature-extraction
- static-embeddings
- lf4-quantization
- retrieval
- code-search
model_name: Vortex-Embed v2
datasets:
- VTXAI/Vortex-Embed
metrics:
- recall@1
- recall@5
- recall@10
- mrr
---

# Vortex-Embed v2

**Retrieval-optimized 4-bit static embeddings for code search.**

Built on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M) (29528 vocab × 256 dim, 4-bit LF4 packed = **4.7 MB** on disk) with a set of training-free retrieval upgrades that lift R@1 from 0.314 → **0.745** on the Webscout codebase benchmark (51 hand-verified code queries, 5,168 chunks across 349 files).

## What changed vs the v1 model

All four upgrades are inference-time only; the underlying 4-bit weights are bit-identical to the v1 artifact. They are:

1. **SIF IDF weighting.** Each token's contribution is scaled by `a / (a + p(t))`, where `p(t)` is the token's relative frequency in the corpus. Common tokens ("import", "def", "class") are down-weighted relative to rare tokens, whose weight stays close to 1.
2. **Top-8 principal component removal.** The 8 dominant common-topic directions of the corpus are fitted once via SVD and projected out of every chunk/query vector (Arora et al. 2017).
3. **File-path header injection.** Before encoding each chunk, its file path tokens (e.g. `model_fetcher`, `search`, `engines`) are prepended 15 times. The file name effectively becomes a "tag" the chunk retrieves on.
4. **Search-time file-extension score bias.** Within the top-50 dense candidates, `.py` chunks get `+0.05` and `.md` chunks get `-0.02`. This fixes the common failure where `README.md` and `docs/*.md` outrank the actual code (higher topic overlap but lower action relevance).

## Benchmark

Corpus: 5,168 chunks × 256-dim across 349 files in the Webscout codebase.
Queries: 51 hand-verified natural-language → file-path pairs.

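
R@k and MRR are standard rank-based retrieval metrics. The following is a minimal sketch of how they can be computed from ranked file paths, assuming each query has exactly one gold file and counts as a hit when that file appears among the top k results. The function names and the toy data are illustrative and are not part of `lf4_v2`; the actual evaluation script may differ.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold file path appears in the top-k results."""
    hits = sum(1 for results, target in zip(ranked, gold) if target in results[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Mean of 1/rank of the first gold hit; a query with no hit contributes 0."""
    total = 0.0
    for results, target in zip(ranked, gold):
        if target in results:
            total += 1.0 / (list(results).index(target) + 1)
    return total / len(gold)


# Toy data (hypothetical): three queries that all expect "a.py".
ranked = [
    ["a.py", "b.py", "c.py"],  # gold at rank 1
    ["b.py", "a.py", "c.py"],  # gold at rank 2
    ["c.py", "b.py", "d.py"],  # gold missing
]
gold = ["a.py", "a.py", "a.py"]

r_at_1 = recall_at_k(ranked, gold, k=1)
r_at_2 = recall_at_k(ranked, gold, k=2)
mrr = mean_reciprocal_rank(ranked, gold)
```

When several chunks of the same file are retrieved, only the first occurrence of the gold path determines the rank in this sketch.
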

| Model | R@1 | R@5 | R@10 | MRR | enc@1 | enc@64 | search@64 |
|---|---|---|---|---|---|---|---|
| Vortex-Embed v1 (baseline) | 0.314 | 0.667 | 0.863 | 0.478 | 6.2 ms | 227 ms | 4.2 ms |
| **Vortex-Embed v2 (this)** | **0.745** | **0.843** | **0.882** | **0.779** | 6.4 ms | 107 ms | 9.1 ms |

**+137% R@1, +63% MRR.** Encoding 64 chunks is **2.1× faster** (227 ms → 107 ms) thanks to the same `torch.scatter_add_` (ATen) and sorted `reduceat` kernels used in v1.

## Usage

```python
from huggingface_hub import snapshot_download
from lf4_v2 import VortexEmbedV2

# Download model + tokenizer + config
path = snapshot_download("VTXAI/Vortex-Embed-v2")

# Load
model = VortexEmbedV2.from_pretrained(path)
print(f"vocab={model.vocab_size}, dim={model.dim}, size={model.model_size_mb:.1f} MB")

# Single-query encode
vec = model.encode("find python json parser", normalize=True)
# vec.shape == (256,)

# Batch encode
docs = [
    "def parse_json(s): return json.loads(s)",
    "class WeatherAPI: pass",
    "import requests",
]
doc_embs = model.encode(docs, normalize=True)  # (3, 256)

# Search
scores, indices = model.search(vec, doc_embs, top_k=3)
# scores.shape == (1, 3), indices.shape == (1, 3)
```

### Codebase retrieval (the real use case)

```python
from pathlib import Path
from lf4_v2 import VortexEmbedV2

# 1. Chunk a codebase (line-based, 40 lines/chunk, 5 line overlap)
CHUNK_LINES, OVERLAP = 40, 5
chunks, texts = [], []
for path in Path("./src").rglob("*.py"):
    lines = path.read_text().splitlines()  # read each file once
    # Step by 35 lines so consecutive chunks share 5 lines
    for chunk_start in range(0, len(lines), CHUNK_LINES - OVERLAP):
        chunk = "\n".join(lines[chunk_start:chunk_start + CHUNK_LINES])
        chunks.append((str(path), chunk_start, chunk))
        texts.append(chunk)

# 2. Load + bind paths (this enables file-path header injection)
model = VortexEmbedV2.from_pretrained("VTXAI/Vortex-Embed-v2")
model.set_file_paths([c[0] for c in chunks])  # critical for v2 quality

# 3. Fit IDF on the corpus (one-time, ~200 ms)
token_lists = [model.tokenizer.encode(t).ids for t in texts]
model.fit_idf(token_lists)

# 4. Encode corpus
corpus_emb = model.encode_batch(texts, normalize=True)  # (n, 256)

# 5. Fit top-K PC on the corpus (one-time, ~300 ms)
model.fit_pc(corpus_emb, k=8)

# 6. Re-encode with PC removal applied
corpus_emb = model.encode_batch(texts, normalize=True)

# 7. Query
query = "where do we parse JSON requests"
q_emb = model.encode(query, normalize=True)
scores, indices = model.search(q_emb, corpus_emb, top_k=10)
for rank, (s, i) in enumerate(zip(scores[0], indices[0]), 1):
    file, start, text = chunks[i]
    print(f"#{rank} ({s:.3f}) {file}:{start + 1}")  # 1-based first line of the chunk
```

## Configuration knobs

All retrieval hyperparameters live in `config.json` and can be overridden at load time:

```python
model = VortexEmbedV2.from_pretrained(
    "VTXAI/Vortex-Embed-v2",
    sif_a=1e-3,        # SIF smoothing (lower = sharper)
    pc_k=0,            # disable PC removal
    header_repeat=10,  # reduce path-header weight
    py_bonus=0.0,      # remove the .py boost (md_penalty still applies)
)
```

| Knob | Default | Effect |
|---|---|---|
| `sif_a` | 1e-4 | SIF smoothing. Lower = sharper IDF weighting |
| `pc_k` | 8 | Number of principal components to remove |
| `sif_pc` | 1.0 | PC removal strength (0 = disabled) |
| `header_repeat` | 15 | How many times to repeat path-header tokens |
| `py_bonus` | 0.05 | Score boost for `.py` chunks in top-50 |
| `md_penalty` | -0.02 | Score penalty for `.md` chunks in top-50 |
| `bias_top_k` | 50 | Candidate pool size for the bias |

## Files

- `model.safetensors`: 4-bit LF4 packed weights (3.7 MB)
- `embedding_scales` (FP16), `embedding_zeros` (FP16): per-block quantization params
- `config.json`: model + retrieval config
- `tokenizer.json`: HuggingFace fast tokenizer (29 KB)
- `lf4_v2.py`: self-contained model class (drop-in to any project)

## Citation

The SIF/PC technique is from:

> Arora, Liang, Ma (2017). *A Simple but Tough-to-Beat Baseline for Sentence Embeddings.* ICLR.

The LF4 quantization is from:

> Original Vortex-Embed-4.7M model card on [VTXAI/Vortex-Embed-4.7M](https://huggingface.co/VTXAI/Vortex-Embed-4.7M).

If you use v2 in research, please cite the original Vortex-Embed paper and this AutoResearch loop (see [Vortex-AutoResearch](https://github.com/VortexAI)).
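
For readers unfamiliar with the cited SIF/PC technique, here is a minimal NumPy sketch of the two steps: SIF weighting with `a / (a + p(t))`, followed by removal of the top-k principal directions. It follows the formulation in Arora et al. (uncentered SVD, average over token count) and is not the code in `lf4_v2.py`; all function names, the random embedding table, and the toy corpus are assumptions made for illustration.

```python
import numpy as np


def sif_weights(token_counts: np.ndarray, a: float = 1e-4) -> np.ndarray:
    """Per-token weight a / (a + p(t)), with p(t) the relative corpus frequency."""
    p = token_counts / token_counts.sum()
    return a / (a + p)


def encode_sif(token_ids, table: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """SIF-weighted average of static token vectors for one chunk."""
    ids = np.asarray(token_ids)
    return (table[ids] * weights[ids][:, None]).sum(axis=0) / len(ids)


def fit_pcs(embs: np.ndarray, k: int) -> np.ndarray:
    """Top-k right singular vectors of the (uncentered) embedding matrix, shape (k, dim)."""
    _, _, vt = np.linalg.svd(embs, full_matrices=False)
    return vt[:k]


def remove_pcs(embs: np.ndarray, pcs: np.ndarray) -> np.ndarray:
    """Subtract each vector's projection onto the fitted directions."""
    return embs - embs @ pcs.T @ pcs


# Toy setup (hypothetical): 50-token vocabulary, 16-dim vectors, 30 chunks.
rng = np.random.default_rng(0)
table = rng.normal(size=(50, 16))
counts = rng.integers(1, 1000, size=50).astype(float)
weights = sif_weights(counts, a=1e-3)

docs = [rng.integers(0, 50, size=20) for _ in range(30)]
embs = np.stack([encode_sif(d, table, weights) for d in docs])

pcs = fit_pcs(embs, k=2)
cleaned = remove_pcs(embs, pcs)
```

After removal, every cleaned vector is orthogonal to the fitted directions, so query and chunk vectors are compared only on what remains once the shared corpus-wide component is gone. The same fitted directions would need to be applied to queries as well as chunks.
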