Paper: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
NanoVDR-M-Multi: Multilingual Query Encoder for Visual Document Retrieval
NanoVDR-M-Multi is a 116M-parameter multilingual text-only query encoder for visual document retrieval. It retrieves document page images nearly as effectively as Vision-Language Models 17-26x its size (2B-3B parameters), with strong cross-lingual transfer across 6 languages.
Built on NanoVDR-S and further trained with multilingual query augmentation (English + German, French, Spanish, Italian, Portuguese), it is the recommended model for production use with multilingual or mixed-language queries.
Results
| Model | Params | ViDoRe v1 (en) | ViDoRe v2 (multi) | ViDoRe v3 (multi) |
|---|---|---|---|---|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 |
| NanoVDR-M-Multi | 116M | 82.5 | 62.8 | 47.5 |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 |
| ColPali | ~3B | 84.2 | 54.7 | 42.0 |
Per-Language Teacher Retention
| Language | NDCG@5 | Teacher Retention |
|---|---|---|
| English | 50.7 | 93.0% |
| French | 47.8 | 93.6% |
| Spanish | 47.8 | 93.1% |
| Italian | 45.7 | 93.3% |
| German | 45.4 | 92.0% |
| Portuguese | 46.1 | 94.6% |
All 6 languages retain at least 92% of the 2B teacher's performance.
How It Works
NanoVDR decouples query encoding from document encoding in visual document retrieval:
- Offline indexing: The VLM teacher (Qwen3-VL-Embedding-2B) encodes document page images into single-vector embeddings. This is a one-time cost.
- Online querying: NanoVDR-M-Multi encodes text queries in any supported language into the same embedding space via a lightweight text encoder + MLP projector. No vision model needed at query time.
Retrieval uses standard cosine similarity between query and document embeddings.
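Because both encoders emit L2-normalized vectors in the same 2048-dimensional space, scoring reduces to a plain dot product (which equals cosine similarity for unit vectors). A minimal NumPy sketch with random stand-in embeddings in place of real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for real embeddings: 1 query vs. 100 pre-indexed document pages
query_emb = l2_normalize(rng.normal(size=(1, 2048)))
doc_embs = l2_normalize(rng.normal(size=(100, 2048)))

scores = query_emb @ doc_embs.T           # (1, 100) cosine similarities
top_k = np.argsort(scores[0])[-5:][::-1]  # indices of the 5 best-matching pages
```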
Usage
```python
from sentence_transformers import SentenceTransformer

# Load the multilingual query encoder
model = SentenceTransformer("nanovdr/NanoVDR-M-Multi")

# Encode queries in any supported language
queries = [
    "What was the revenue growth in Q3 2024?",              # English
    "Quel est le chiffre d'affaires du trimestre ?",        # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",  # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",       # Spanish
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (4, 2048)

# Retrieve against pre-indexed document embeddings from the VLM teacher
# scores = query_embeddings @ doc_embeddings.T
```
Full Retrieval Pipeline
```python
from transformers import AutoModel
from sentence_transformers import SentenceTransformer

# Step 1: Index documents with the VLM teacher (one-time, offline)
teacher = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
# doc_embeddings = teacher.encode(document_images)  # See Qwen3-VL-Embedding docs

# Step 2: Query with NanoVDR-M-Multi (online, fast, CPU-only)
student = SentenceTransformer("nanovdr/NanoVDR-M-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")

# Step 3: Retrieve (doc_embeddings from Step 1)
scores = query_emb @ doc_embeddings.T
top_k = scores.argsort()[-5:][::-1]
```
Training Details
- Architecture: google-bert/bert-base-uncased + 2-layer MLP projector (768 → 768 → 2048)
- Training objective: Pointwise cosine alignment with teacher query embeddings
- Training data: 1.49M query-document pairs = 711K original (4 public sources) + 778K machine-translated queries in 5 languages (DE, FR, ES, IT, PT) via Helsinki-NLP Opus-MT models
- Training cost: ~15 GPU-hours on a single H200
- Epochs: 10, lr=3e-4, batch size 1024 (effective)
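The pointwise cosine-alignment objective above trains the student to match the frozen teacher's query embeddings directly. A minimal NumPy sketch of the loss (the actual training uses a deep-learning framework; the function name here is illustrative):

```python
import numpy as np

def cosine_alignment_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Pointwise distillation loss: 1 - mean cosine similarity per query pair."""
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    cos = np.sum(s * t, axis=-1)      # cosine per (student, teacher) pair
    return float(np.mean(1.0 - cos))  # 0 when embeddings are perfectly aligned

# Perfectly aligned (scaled) embeddings give ~zero loss: cosine is scale-invariant
x = np.random.default_rng(0).normal(size=(4, 2048))
loss = cosine_alignment_loss(x, 2.0 * x)  # ~0.0
```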
Multilingual Augmentation Pipeline
- Extract 489K English queries from training data
- Translate to 5 target languages using Helsinki-NLP Opus-MT models (~200K per language)
- Re-encode translated queries with the frozen teacher in text mode to produce target embeddings
- Combine with original 711K pairs → 1.49M total training samples
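The four steps above can be sketched as a data-preparation loop. The `translate` and `teacher_encode_text` functions below are placeholder stubs (in practice an Opus-MT model and the frozen teacher in text mode would be slotted in); only the pipeline shape is real:

```python
import numpy as np

def translate(query: str, lang: str) -> str:
    """Stub standing in for a Helsinki-NLP Opus-MT model (en -> lang)."""
    return f"[{lang}] {query}"

def teacher_encode_text(query: str) -> np.ndarray:
    """Stub standing in for the frozen teacher encoding a query in text mode."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.normal(size=2048)
    return v / np.linalg.norm(v)

english_pairs = [("What was the Q3 revenue?", "doc_017")]  # (query, doc_id)
target_langs = ["de", "fr", "es", "it", "pt"]

augmented = []
for query, doc_id in english_pairs:
    for lang in target_langs:
        translated = translate(query, lang)
        target_emb = teacher_encode_text(translated)  # training target for the student
        augmented.append((translated, doc_id, target_emb))
# In practice, `augmented` is concatenated with the 711K original pairs.
```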
Key Properties
- Output dimension: 2048 (aligned with Qwen3-VL-Embedding-2B)
- Max sequence length: 512 tokens
- Supported languages: English, German, French, Spanish, Italian, Portuguese
- Similarity function: Cosine similarity
- Pooling: Mean pooling
- Normalization: L2-normalized output
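The pooling and normalization properties compose in a standard way: mask-aware mean pooling over token states, then L2 normalization. A NumPy sketch with random stand-ins for the BERT token states:

```python
import numpy as np

def mean_pool_and_normalize(token_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean pooling over tokens, followed by L2 normalization."""
    mask = mask[..., None].astype(token_states.dtype)        # (batch, seq, 1)
    summed = (token_states * mask).sum(axis=1)               # padding ignored
    pooled = summed / np.clip(mask.sum(axis=1), 1e-9, None)  # mean over real tokens
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 512, 768))      # (batch, max_seq_len, hidden)
mask = np.zeros((2, 512))
mask[:, :10] = 1                             # only the first 10 tokens are real
emb = mean_pool_and_normalize(states, mask)  # (2, 768), unit-length rows
# The MLP projector then maps 768 -> 2048 to match the teacher's space.
```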
Efficiency
| Metric | NanoVDR-M-Multi | ColPali (3B) | Teacher (2B) |
|---|---|---|---|
| Query latency (CPU, B=1) | 51 ms | 7,300 ms | GPU only |
| Model size | 116M | ~3B | 2B |
| Index type | Single-vector | Multi-vector | Single-vector |
| Scoring | Cosine | MaxSim | Cosine |
Related Models
- NanoVDR-S: English-focused, same architecture
- NanoVDR-M: BERT-base backbone (116M)
- NanoVDR-L: ModernBERT backbone (155M)
Citation
@article{nanovdr2026,
title={NanoVDR: Asymmetric Cross-Modal Distillation for Efficient Visual Document Retrieval},
author={...},
journal={arXiv preprint},
year={2026}
}
License
Apache 2.0