Paper: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
NanoVDR-M-Multi: Multilingual Query Encoder for Visual Document Retrieval
NanoVDR-M-Multi is a 116M-parameter multilingual text-only query encoder for visual document retrieval. It retrieves document page images nearly as effectively as Vision-Language Models 17-26x its size (2B-3B parameters), with strong cross-lingual transfer across 6 languages.
Built on NanoVDR-S and further trained with multilingual query augmentation (English + German, French, Spanish, Italian, Portuguese), it is the recommended model for production use with multilingual or mixed-language queries.
Results
| Model | Params | ViDoRe v1 (en) | ViDoRe v2 (multi) | ViDoRe v3 (multi) |
|---|---|---|---|---|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 |
| NanoVDR-M-Multi | 116M | 82.5 | 62.8 | 47.5 |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 |
| ColPali | ~3B | 84.2 | 54.7 | 42.0 |
Per-Language Teacher Retention
| Language | NDCG@5 | Teacher Retention |
|---|---|---|
| English | 50.7 | 93.0% |
| French | 47.8 | 93.6% |
| Spanish | 47.8 | 93.1% |
| Italian | 45.7 | 93.3% |
| German | 45.4 | 92.0% |
| Portuguese | 46.1 | 94.6% |
All 6 languages retain at least 92% of the 2B teacher's performance.
How It Works
NanoVDR decouples query encoding from document encoding in visual document retrieval:
- Offline indexing: The VLM teacher (Qwen3-VL-Embedding-2B) encodes document page images into single-vector embeddings. This is a one-time cost.
- Online querying: NanoVDR-M-Multi encodes text queries in any supported language into the same embedding space via a lightweight text encoder + MLP projector. No vision model needed at query time.
Retrieval uses standard cosine similarity between query and document embeddings.
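Because both encoders emit L2-normalized vectors in the same 2048-dimensional space, scoring reduces to a plain dot product (which equals cosine similarity for unit vectors). A minimal NumPy sketch with random stand-in embeddings in place of real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for real embeddings: 1 query vs. 100 pre-indexed document pages
query_emb = l2_normalize(rng.normal(size=(1, 2048)))
doc_embs = l2_normalize(rng.normal(size=(100, 2048)))

scores = query_emb @ doc_embs.T           # (1, 100) cosine similarities
top_k = np.argsort(scores[0])[-5:][::-1]  # indices of the 5 best-matching pages
```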
Usage
```python
from sentence_transformers import SentenceTransformer

# Load the multilingual query encoder
model = SentenceTransformer("nanovdr/NanoVDR-M-Multi")

# Encode queries in any supported language
queries = [
    "What was the revenue growth in Q3 2024?",              # English
    "Quel est le chiffre d'affaires du trimestre ?",        # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",  # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",       # Spanish
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (4, 2048)

# Retrieve against pre-indexed document embeddings from the VLM teacher
# scores = query_embeddings @ doc_embeddings.T
```
Full Retrieval Pipeline
```python
from transformers import AutoModel
from sentence_transformers import SentenceTransformer

# Step 1: Index documents with the VLM teacher (one-time, offline)
teacher = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
# doc_embeddings = teacher.encode(document_images)  # See Qwen3-VL-Embedding docs

# Step 2: Query with NanoVDR-M-Multi (online, fast, CPU-only)
student = SentenceTransformer("nanovdr/NanoVDR-M-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")

# Step 3: Retrieve (doc_embeddings from Step 1)
scores = query_emb @ doc_embeddings.T
top_k = scores.argsort()[-5:][::-1]
```
Training Details
- Architecture: google-bert/bert-base-uncased + 2-layer MLP projector (768 → 768 → 2048)
- Training objective: Pointwise cosine alignment with teacher query embeddings
- Training data: 1.49M query-document pairs = 711K original (4 public sources) + 778K machine-translated queries in 5 languages (DE, FR, ES, IT, PT) via Helsinki-NLP Opus-MT models
- Training cost: ~15 GPU-hours on a single H200
- Epochs: 10, lr=3e-4, batch size 1024 (effective)
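The pointwise cosine-alignment objective above trains the student to match the frozen teacher's query embeddings directly. A minimal NumPy sketch of the loss (the actual training uses a deep-learning framework; the function name here is illustrative):

```python
import numpy as np

def cosine_alignment_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Pointwise distillation loss: 1 - mean cosine similarity per query pair."""
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    cos = np.sum(s * t, axis=-1)      # cosine per (student, teacher) pair
    return float(np.mean(1.0 - cos))  # 0 when embeddings are perfectly aligned

# Perfectly aligned (scaled) embeddings give ~zero loss: cosine is scale-invariant
x = np.random.default_rng(0).normal(size=(4, 2048))
loss = cosine_alignment_loss(x, 2.0 * x)  # ~0.0
```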
Multilingual Augmentation Pipeline
- Extract 489K English queries from training data
- Translate to 5 target languages using Helsinki-NLP Opus-MT models (~200K per language)
- Re-encode translated queries with the frozen teacher in text mode to produce target embeddings
- Combine with original 711K pairs → 1.49M total training samples
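The four steps above can be sketched as a data-preparation loop. The `translate` and `teacher_encode_text` functions below are placeholder stubs (in practice an Opus-MT model and the frozen teacher in text mode would be slotted in); only the pipeline shape is real:

```python
import numpy as np

def translate(query: str, lang: str) -> str:
    """Stub standing in for a Helsinki-NLP Opus-MT model (en -> lang)."""
    return f"[{lang}] {query}"

def teacher_encode_text(query: str) -> np.ndarray:
    """Stub standing in for the frozen teacher encoding a query in text mode."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.normal(size=2048)
    return v / np.linalg.norm(v)

english_pairs = [("What was the Q3 revenue?", "doc_017")]  # (query, doc_id)
target_langs = ["de", "fr", "es", "it", "pt"]

augmented = []
for query, doc_id in english_pairs:
    for lang in target_langs:
        translated = translate(query, lang)
        target_emb = teacher_encode_text(translated)  # training target for the student
        augmented.append((translated, doc_id, target_emb))
# In practice, `augmented` is concatenated with the 711K original pairs.
```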
Key Properties
- Output dimension: 2048 (aligned with Qwen3-VL-Embedding-2B)
- Max sequence length: 512 tokens
- Supported languages: English, German, French, Spanish, Italian, Portuguese
- Similarity function: Cosine similarity
- Pooling: Mean pooling
- Normalization: L2-normalized output
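The pooling and normalization properties compose in a standard way: mask-aware mean pooling over token states, then L2 normalization. A NumPy sketch with random stand-ins for the BERT token states:

```python
import numpy as np

def mean_pool_and_normalize(token_states: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean pooling over tokens, followed by L2 normalization."""
    mask = mask[..., None].astype(token_states.dtype)        # (batch, seq, 1)
    summed = (token_states * mask).sum(axis=1)               # padding ignored
    pooled = summed / np.clip(mask.sum(axis=1), 1e-9, None)  # mean over real tokens
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 512, 768))      # (batch, max_seq_len, hidden)
mask = np.zeros((2, 512))
mask[:, :10] = 1                             # only the first 10 tokens are real
emb = mean_pool_and_normalize(states, mask)  # (2, 768), unit-length rows
# The MLP projector then maps 768 -> 2048 to match the teacher's space.
```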
Efficiency
| Metric | NanoVDR-M-Multi | ColPali (3B) | Teacher (2B) |
|---|---|---|---|
| Query latency (CPU, B=1) | 51 ms | 7,300 ms | GPU only |
| Model size | 116M | ~3B | 2B |
| Index type | Single-vector | Multi-vector | Single-vector |
| Scoring | Cosine | MaxSim | Cosine |
Related Models
- NanoVDR-S: English-focused, same architecture
- NanoVDR-M: BERT-base backbone (116M)
- NanoVDR-L: ModernBERT backbone (155M)
Citation
@article{nanovdr2026,
title={NanoVDR: Asymmetric Cross-Modal Distillation for Efficient Visual Document Retrieval},
author={...},
journal={arXiv preprint},
year={2026}
}
License
Apache 2.0