Paper: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

NanoVDR-M-Multi: Multilingual Query Encoder for Visual Document Retrieval

NanoVDR-M-Multi is a 116M-parameter multilingual text-only query encoder for visual document retrieval. It retrieves document page images as effectively as Vision-Language Models 30-100x its size, with strong cross-lingual transfer across 6 languages.

Built on NanoVDR-S and further trained with multilingual query augmentation (English + German, French, Spanish, Italian, Portuguese), it is the recommended model for production use with multilingual or mixed-language queries.

Results

| Model | Params | ViDoRe v1 (en) | ViDoRe v2 (multi) | ViDoRe v3 (multi) |
|---|---|---|---|---|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 |
| NanoVDR-M-Multi | 116M | 82.5 | 62.8 | 47.5 |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 |
| ColPali | ~3B | 84.2 | 54.7 | 42.0 |

Per-Language Teacher Retention

| Language | NDCG@5 | Teacher Retention |
|---|---|---|
| English | 50.7 | 93.0% |
| French | 47.8 | 93.6% |
| Spanish | 47.8 | 93.1% |
| Italian | 45.7 | 93.3% |
| German | 45.4 | 92.0% |
| Portuguese | 46.1 | 94.6% |

All 6 languages achieve >92% of the 2B teacher's performance.

How It Works

NanoVDR decouples query encoding from document encoding in visual document retrieval:

  • Offline indexing: The VLM teacher (Qwen3-VL-Embedding-2B) encodes document page images into single-vector embeddings. This is a one-time cost.
  • Online querying: NanoVDR-M-Multi encodes text queries in any supported language into the same embedding space via a lightweight text encoder + MLP projector. No vision model needed at query time.

Retrieval uses standard cosine similarity between query and document embeddings.
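Because both encoders emit L2-normalized vectors in the same space, retrieval reduces to a dot product. A minimal NumPy sketch, using random toy vectors in place of real encoder outputs (the real embedding dimension is 2048):

```python
import numpy as np

# Toy stand-ins for encoder outputs (real dim is 2048)
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(5, 8))                   # 5 indexed pages
query_emb = doc_embs[3] + 0.01 * rng.normal(size=8)  # a query "about" page 3

# Both sides are L2-normalized, so a dot product is cosine similarity
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)

scores = doc_embs @ query_emb
best = int(scores.argmax())
print(best)  # 3 -- the matching page ranks first
```

This is why query-time cost is just one small-encoder forward pass plus a matrix-vector product over the index.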

Usage

```python
from sentence_transformers import SentenceTransformer

# Load the multilingual query encoder
model = SentenceTransformer("nanovdr/NanoVDR-M-Multi")

# Encode queries in any supported language
queries = [
    "What was the revenue growth in Q3 2024?",              # English
    "Quel est le chiffre d'affaires du trimestre?",         # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",  # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",       # Spanish
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (4, 2048)

# Retrieve against pre-indexed document embeddings from the VLM teacher
# scores = query_embeddings @ doc_embeddings.T
```

Full Retrieval Pipeline

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoModel

# Step 1: Index documents with the VLM teacher (one-time, offline)
teacher = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
# doc_embeddings = teacher.encode(document_images)  # See Qwen3-VL-Embedding docs

# Step 2: Query with NanoVDR-M-Multi (online, fast, CPU-only)
student = SentenceTransformer("nanovdr/NanoVDR-M-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")

# Step 3: Retrieve (requires doc_embeddings from Step 1)
scores = query_emb @ doc_embeddings.T
top_k = scores.argsort()[-5:][::-1]
```

Training Details

  • Architecture: google-bert/bert-base-uncased + 2-layer MLP projector (768 → 768 → 2048)
  • Training objective: Pointwise cosine alignment with teacher query embeddings
  • Training data: 1.49M query-document pairs; 711K original (4 public sources) plus 778K machine-translated queries in 5 languages (DE, FR, ES, IT, PT) via Helsinki-NLP Opus-MT models
  • Training cost: ~15 GPU-hours on a single H200
  • Epochs: 10, lr=3e-4, batch size 1024 (effective)
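The pointwise cosine-alignment objective amounts to minimizing 1 minus the cosine between the student's and the frozen teacher's embeddings of the same query. A NumPy sketch with toy 2-d vectors (the real training operates in the 2048-d teacher space with gradient descent):

```python
import numpy as np

def cosine_alignment_loss(student_emb, teacher_emb):
    """Pointwise alignment: 1 - cos(student, teacher), averaged over the batch."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    cos = (s * t).sum(axis=1)
    return float((1.0 - cos).mean())

teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = cosine_alignment_loss(teacher.copy(), teacher)                  # identical embeddings
off = cosine_alignment_loss(np.array([[0.0, 1.0], [1.0, 0.0]]), teacher)  # orthogonal embeddings
print(aligned, off)  # 0.0 1.0
```

Driving this loss toward zero places the student's query vectors where the teacher's would be, which is what lets them score against the teacher-built document index.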

Multilingual Augmentation Pipeline

  1. Extract 489K English queries from training data
  2. Translate to 5 target languages using Helsinki-NLP Opus-MT models (~200K per language)
  3. Re-encode translated queries with the frozen teacher in text mode to produce target embeddings
  4. Combine with original 711K pairs → 1.49M total training samples
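The data flow above fans each English query out across the target languages and pairs every translation with a fresh teacher embedding as its training target. A sketch with stubbed `translate` and `teacher_encode` functions (both hypothetical stand-ins for the Opus-MT models and the frozen teacher):

```python
LANGS = ["de", "fr", "es", "it", "pt"]

def translate(query, lang):
    # Stand-in for a Helsinki-NLP Opus-MT model (en -> lang)
    return f"[{lang}] {query}"

def teacher_encode(text):
    # Stand-in for the frozen Qwen3-VL-Embedding-2B text encoder
    return [float(len(text))]

english = [("What was the revenue growth?", "doc_17")]
augmented = []
for query, doc_id in english:
    for lang in LANGS:
        q_t = translate(query, lang)
        # The student's regression target is the teacher's embedding of the translation
        augmented.append((q_t, teacher_encode(q_t), doc_id))

print(len(augmented))  # 5 translated pairs per English query
```

Re-encoding with the teacher (step 3) is what keeps the targets in the shared embedding space, rather than reusing the English query's embedding for its translations.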

Key Properties

  • Output dimension: 2048 (aligned with Qwen3-VL-Embedding-2B)
  • Max sequence length: 512 tokens
  • Supported languages: English, German, French, Spanish, Italian, Portuguese
  • Similarity function: Cosine similarity
  • Pooling: Mean pooling
  • Normalization: L2-normalized output
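Mean pooling and L2 normalization from the list above can be sketched in NumPy with toy token embeddings (the real hidden size is 768 before the projector):

```python
import numpy as np

token_embs = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 tokens x hidden dim 2
mask = np.array([1.0, 1.0])                      # attention mask (no padding here)

# Mean pooling: average token embeddings over non-padded positions
pooled = (token_embs * mask[:, None]).sum(axis=0) / mask.sum()
print(pooled)  # [2. 3.]

# L2 normalization: unit-length output, so dot products are cosine similarities
normed = pooled / np.linalg.norm(pooled)
```

With padded batches, the mask keeps padding tokens out of both the sum and the divisor; with all-ones masks as here, it reduces to a plain mean.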

Efficiency

| Metric | NanoVDR-M-Multi | ColPali (3B) | Teacher (2B) |
|---|---|---|---|
| Query latency (CPU, B=1) | 51 ms | 7,300 ms | GPU only |
| Model size | 116M | ~3B | 2B |
| Index type | Single-vector | Multi-vector | Single-vector |
| Scoring | Cosine | MaxSim | Cosine |
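The single-vector vs. multi-vector rows in the table are the difference between one dot product per page and a ColBERT-style MaxSim over token embeddings. A toy NumPy comparison with made-up unit vectors:

```python
import numpy as np

# Single-vector (NanoVDR / teacher): one embedding per query and per page
q = np.array([1.0, 0.0])
d = np.array([0.6, 0.8])
single = float(q @ d)  # cosine, since both vectors are unit-length

# Multi-vector (ColPali): per-token embeddings, MaxSim scoring
Q = np.array([[1.0, 0.0], [0.0, 1.0]])              # 2 query tokens
D = np.array([[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]])  # 3 page tokens
maxsim = float((Q @ D.T).max(axis=1).sum())         # best page token per query token
print(single, maxsim)  # 0.6 2.0
```

MaxSim must store and score every page token, which is where the multi-vector index size and latency overhead in the table come from.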

Related Models

  • NanoVDR-S: English-focused, same architecture
  • NanoVDR-M: BERT-base backbone (116M)
  • NanoVDR-L: ModernBERT backbone (155M)

Citation

@article{nanovdr2026,
  title={NanoVDR: Asymmetric Cross-Modal Distillation for Efficient Visual Document Retrieval},
  author={...},
  journal={arXiv preprint},
  year={2026}
}

License

Apache 2.0
