# Rank-Embed-0.6B
Rank-Embed-0.6B is a specialized bi-encoder model designed for semantic search and dense retrieval. Instead of relying only on keyword overlap, it maps queries and documents into a shared vector space so they can be compared based on meaning, context, and intent.
Built on top of Qwen/Qwen2.5-0.5B-Instruct, the model is optimized for retrieval-first workloads such as semantic search, ranking, retrieval-augmented generation, clustering, and duplicate detection. It is compact enough for efficient deployment while retaining the language understanding needed for more complex search tasks.
## Model Summary
| Property | Value |
|---|---|
| Architecture | Bi-encoder / two-tower embedding model |
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Parameters | ~0.6B |
| Backbone hidden size | 896 |
| Embedding dimension | 768 |
| Pooling | Mean pooling |
| Projection head | `nn.Linear(896, 768)` |
| Similarity | Cosine similarity over L2-normalized vectors |
| Framework | PyTorch / Transformers |
| License | Apache 2.0 |
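Because the final vectors are L2-normalized, cosine similarity reduces to a plain dot product. A small PyTorch check of that equivalence, using toy vectors rather than model outputs:

```python
import torch
import torch.nn.functional as F

# Two toy, unnormalized embedding vectors.
a = torch.tensor([3.0, 4.0, 0.0])
b = torch.tensor([0.0, 8.0, 6.0])

# L2-normalize each vector; afterwards a plain dot product
# equals the cosine similarity of the originals.
a_n = F.normalize(a, p=2, dim=-1)
b_n = F.normalize(b, p=2, dim=-1)

dot_of_normalized = (a_n * b_n).sum()
cosine = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze()

print(torch.allclose(dot_of_normalized, cosine))  # True
```

This is why vector databases can score normalized embeddings with inner-product search alone.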
## Key Capabilities
- Dense embedding generation for queries, passages, and documents
- Semantic search based on meaning rather than exact keyword matching
- Efficient cosine-similarity retrieval with normalized embeddings
- Strong support for complex and intent-heavy search queries
- Practical deployment footprint for production retrieval systems
## What This Model Is
Rank-Embed-0.6B is designed to transform text into dense numerical vectors, or embeddings, that capture semantic meaning. In a traditional keyword-based system, retrieval depends on exact lexical overlap. In contrast, this model enables systems to compare text based on intent, topic, and contextual similarity.
As a compact retrieval model built on Qwen2.5-0.5B-Instruct, it provides an efficient balance between inference speed and semantic quality. This makes it a strong fit for production search systems that need to serve high-quality results without requiring unnecessarily large infrastructure.
Unlike a generative chatbot, Rank-Embed-0.6B is purpose-built for retrieval. Its role is not to generate responses, but to identify, compare, and surface the most relevant pieces of information from a corpus.
## How It Works

### 1. Bi-Encoder Architecture
The model uses a two-tower, or bi-encoder, design:
- Query tower: processes the user's search query
- Document tower: processes candidate documents or passages
- Shared objective: maps both into the same high-dimensional space so relevant pairs are positioned close together
In practice, if a document meaningfully answers a query, their embeddings should be near one another in the 768-dimensional representation space.
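The two-tower flow above can be sketched with a toy stand-in for the backbone. The `ToyTower` class and its token handling are illustrative assumptions, not the real architecture; only the shapes (hidden size 896, output dimension 768) follow the model card:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTower(nn.Module):
    """Illustrative stand-in for backbone + pooling + projection.
    In a bi-encoder, both 'towers' typically share these weights."""
    def __init__(self, vocab_size=128, hidden=896, out_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, token_ids):
        hidden = self.embed(token_ids)   # (batch, seq, hidden)
        pooled = hidden.mean(dim=1)      # mean pooling -> (batch, hidden)
        return F.normalize(self.proj(pooled), p=2, dim=-1)

tower = ToyTower()

# Queries and documents pass through the SAME weights...
query_ids = torch.randint(0, 128, (1, 8))
doc_ids = torch.randint(0, 128, (2, 16))
q = tower(query_ids)
d = tower(doc_ids)

# ...so one matrix product scores every query against every document.
scores = q @ d.T
print(scores.shape)  # torch.Size([1, 2])
```

Because documents never need the query at encoding time, their embeddings can be computed once and indexed offline.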
### 2. Core Components
- Backbone: the model uses Qwen2.5-0.5B-Instruct as its language backbone, providing strong prior understanding of natural language and complex instruction-like phrasing.
- Pooling layer: because the backbone produces token-level representations, mean pooling is used to aggregate them into a single sentence-level embedding.
- Projection head: a linear projection layer, nn.Linear(896, 768), reduces the backbone hidden size of 896 to a 768-dimensional embedding suitable for vector search systems.
- Normalization: final embeddings are L2-normalized so similarity can be computed efficiently with cosine similarity.
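A minimal sketch of the masked mean pooling step, showing that padded positions contribute nothing to the sentence embedding (tensor sizes are toy values; the hidden size of 896 matches the backbone):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Masked mean pooling: padded positions contribute zero to both
    # the numerator and the denominator.
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return (last_hidden_state * mask).sum(1) / torch.clamp(mask.sum(1), min=1e-9)

hidden = torch.randn(1, 4, 896)      # 4 token positions, hidden size 896
mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding

pooled = mean_pool(hidden, mask)

# Equivalent to averaging only the three real tokens:
manual = hidden[0, :3].mean(dim=0, keepdim=True)
print(torch.allclose(pooled, manual, atol=1e-6))  # True
```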
## What It Can Do
- Semantic search: retrieves relevant content even when the query and document use different wording.
- Complex search: handles nuanced, intent-rich queries where the best result depends on meaning rather than exact phrasing.
- Retrieval-augmented generation: serves as the retrieval layer in RAG systems by surfacing relevant context for downstream language models.
- Clustering and organization: groups documents, tickets, or records by semantic similarity.
- Duplicate detection: identifies differently worded inputs that express the same underlying meaning.
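For duplicate detection, a common pattern is thresholding pairwise cosine similarity over the normalized embeddings. A sketch with toy pre-normalized vectors; the `find_duplicates` helper and the 0.9 threshold are illustrative assumptions that would be tuned per corpus:

```python
import torch
import torch.nn.functional as F

def find_duplicates(embeddings, threshold=0.9):
    """Flag every pair of L2-normalized embeddings whose cosine
    similarity exceeds `threshold` (an illustrative cutoff)."""
    sims = embeddings @ embeddings.T
    pairs = []
    n = embeddings.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                pairs.append((i, j))
    return pairs

# Toy normalized embeddings: items 0 and 1 point almost the same way.
vectors = F.normalize(torch.tensor([
    [1.0, 0.0, 0.01],
    [1.0, 0.0, 0.00],
    [0.0, 1.0, 0.00],
]), p=2, dim=-1)

print(find_duplicates(vectors))  # [(0, 1)]
```

With real model outputs, `vectors` would come from embedding the candidate texts as shown in the Quick Start below.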
## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "GorankLabs/Rank-Embed-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.eval()


def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return (last_hidden_state * mask).sum(1) / torch.clamp(mask.sum(1), min=1e-9)


def embed(texts):
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**encoded)
    embeddings = mean_pool(outputs.last_hidden_state, encoded["attention_mask"])
    return torch.nn.functional.normalize(embeddings, p=2, dim=-1)


queries = ["How do I fix a leaky faucet?"]
documents = [
    "Steps to repair a leaking kitchen faucet at home.",
    "How to replace brake pads on a bicycle.",
]

query_embeddings = embed(queries)
document_embeddings = embed(documents)

# Cosine similarity via matrix product of normalized embeddings.
scores = query_embeddings @ document_embeddings.T
print(scores.tolist())
```
## Architecture Notes
The model is designed around a retrieval-oriented embedding pipeline:
- token-level representations are produced by the Qwen backbone
- mean pooling converts them into a single sentence representation
- a learned projection maps the representation into a 768-dimensional embedding space
- L2 normalization makes the final vectors directly usable for cosine-similarity retrieval
This design keeps the model simple, efficient, and well aligned with modern vector database workflows.
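With this design, brute-force top-k retrieval is a single matrix product plus `torch.topk`. A sketch using random stand-in vectors; in production the corpus embeddings would come from the model and typically live in a vector database rather than an in-memory tensor:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for precomputed, L2-normalized 768-d embeddings.
torch.manual_seed(0)
corpus = F.normalize(torch.randn(1000, 768), p=2, dim=-1)
query = F.normalize(torch.randn(1, 768), p=2, dim=-1)

# Cosine similarities against the whole corpus, shape (1, 1000).
scores = query @ corpus.T

# Indices and scores of the 5 nearest documents, best first.
top = torch.topk(scores, k=5, dim=-1)
print(top.indices.shape)  # torch.Size([1, 5])
```

For large corpora, the same normalized vectors plug directly into approximate-nearest-neighbor indexes that support inner-product search.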
## License
This model is released under the Apache License 2.0.
The base model weights are derived from Qwen/Qwen2.5-0.5B-Instruct; use of this repository must also comply with the applicable Qwen license terms.