# Rank-Embed-0.6B
Rank-Embed-0.6B is a specialized bi-encoder model designed for semantic search and dense retrieval. Instead of relying only on keyword overlap, it maps queries and documents into a shared vector space so they can be compared based on meaning, context, and intent.
Built on top of Qwen/Qwen2.5-0.5B-Instruct, the model is optimized for retrieval-first workloads such as semantic search, ranking, retrieval-augmented generation, clustering, and duplicate detection. It is compact enough for efficient deployment while retaining the language understanding needed for more complex search tasks.
## Model Summary
| Property | Value |
|---|---|
| Architecture | Bi-encoder / two-tower embedding model |
| Base model | Qwen/Qwen2.5-0.5B-Instruct |
| Parameters | ~0.6B |
| Backbone hidden size | 896 |
| Embedding dimension | 768 |
| Pooling | Mean pooling |
| Projection head | `nn.Linear(896, 768)` |
| Similarity | Cosine similarity over L2-normalized vectors |
| Framework | PyTorch / Transformers |
| License | Apache 2.0 |
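Because the final vectors are L2-normalized, cosine similarity reduces to a plain dot product. A small PyTorch check of that equivalence, using toy vectors rather than model outputs:

```python
import torch
import torch.nn.functional as F

# Two toy, unnormalized embedding vectors.
a = torch.tensor([3.0, 4.0, 0.0])
b = torch.tensor([0.0, 8.0, 6.0])

# L2-normalize each vector; afterwards a plain dot product
# equals the cosine similarity of the originals.
a_n = F.normalize(a, p=2, dim=-1)
b_n = F.normalize(b, p=2, dim=-1)

dot_of_normalized = (a_n * b_n).sum()
cosine = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze()

print(torch.allclose(dot_of_normalized, cosine))  # True
```

This is why vector databases can score normalized embeddings with inner-product search alone.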
## Key Capabilities
- Dense embedding generation for queries, passages, and documents
- Semantic search based on meaning rather than exact keyword matching
- Efficient cosine-similarity retrieval with normalized embeddings
- Strong support for complex and intent-heavy search queries
- Practical deployment footprint for production retrieval systems
## What This Model Is
Rank-Embed-0.6B is designed to transform text into dense numerical vectors, or embeddings, that capture semantic meaning. In a traditional keyword-based system, retrieval depends on exact lexical overlap. In contrast, this model enables systems to compare text based on intent, topic, and contextual similarity.
As a compact retrieval model built on Qwen2.5-0.5B-Instruct, it provides an efficient balance between inference speed and semantic quality. This makes it a strong fit for production search systems that need to serve high-quality results without requiring unnecessarily large infrastructure.
Unlike a generative chatbot, Rank-Embed-0.6B is purpose-built for retrieval. Its role is not to generate responses, but to identify, compare, and surface the most relevant pieces of information from a corpus.
## How It Works

### 1. Bi-Encoder Architecture
The model uses a two-tower, or bi-encoder, design:
- Query tower: processes the user's search query
- Document tower: processes candidate documents or passages
- Shared objective: maps both into the same high-dimensional space so relevant pairs are positioned close together
In practice, if a document meaningfully answers a query, their embeddings should be near one another in the 768-dimensional representation space.
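The two-tower flow above can be sketched with a toy stand-in for the backbone. The `ToyTower` class and its token handling are illustrative assumptions, not the real architecture; only the shapes (hidden size 896, output dimension 768) follow the model card:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTower(nn.Module):
    """Illustrative stand-in for backbone + pooling + projection.
    In a bi-encoder, both 'towers' typically share these weights."""
    def __init__(self, vocab_size=128, hidden=896, out_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, token_ids):
        hidden = self.embed(token_ids)   # (batch, seq, hidden)
        pooled = hidden.mean(dim=1)      # mean pooling -> (batch, hidden)
        return F.normalize(self.proj(pooled), p=2, dim=-1)

tower = ToyTower()

# Queries and documents pass through the SAME weights...
query_ids = torch.randint(0, 128, (1, 8))
doc_ids = torch.randint(0, 128, (2, 16))
q = tower(query_ids)
d = tower(doc_ids)

# ...so one matrix product scores every query against every document.
scores = q @ d.T
print(scores.shape)  # torch.Size([1, 2])
```

Because documents never need the query at encoding time, their embeddings can be computed once and indexed offline.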
### 2. Core Components
- Backbone: the model uses Qwen2.5-0.5B-Instruct as its language backbone, providing strong prior understanding of natural language and complex instruction-like phrasing.
- Pooling layer: because the backbone produces token-level representations, mean pooling is used to aggregate them into a single sentence-level embedding.
- Projection head: a linear projection layer, nn.Linear(896, 768), reduces the backbone hidden size of 896 to a 768-dimensional embedding suitable for vector search systems.
- Normalization: final embeddings are L2-normalized so similarity can be computed efficiently with cosine similarity.
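A minimal sketch of the masked mean pooling step, showing that padded positions contribute nothing to the sentence embedding (tensor sizes are toy values; the hidden size of 896 matches the backbone):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Masked mean pooling: padded positions contribute zero to both
    # the numerator and the denominator.
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return (last_hidden_state * mask).sum(1) / torch.clamp(mask.sum(1), min=1e-9)

hidden = torch.randn(1, 4, 896)      # 4 token positions, hidden size 896
mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding

pooled = mean_pool(hidden, mask)

# Equivalent to averaging only the three real tokens:
manual = hidden[0, :3].mean(dim=0, keepdim=True)
print(torch.allclose(pooled, manual, atol=1e-6))  # True
```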
## What It Can Do
- Semantic search: retrieves relevant content even when the query and document use different wording.
- Complex search: handles nuanced, intent-rich queries where the best result depends on meaning rather than exact phrasing.
- Retrieval-augmented generation: serves as the retrieval layer in RAG systems by surfacing relevant context for downstream language models.
- Clustering and organization: groups documents, tickets, or records by semantic similarity.
- Duplicate detection: identifies differently worded inputs that express the same underlying meaning.
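For duplicate detection, a common pattern is thresholding pairwise cosine similarity over the normalized embeddings. A sketch with toy pre-normalized vectors; the `find_duplicates` helper and the 0.9 threshold are illustrative assumptions that would be tuned per corpus:

```python
import torch
import torch.nn.functional as F

def find_duplicates(embeddings, threshold=0.9):
    """Flag every pair of L2-normalized embeddings whose cosine
    similarity exceeds `threshold` (an illustrative cutoff)."""
    sims = embeddings @ embeddings.T
    pairs = []
    n = embeddings.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                pairs.append((i, j))
    return pairs

# Toy normalized embeddings: items 0 and 1 point almost the same way.
vectors = F.normalize(torch.tensor([
    [1.0, 0.0, 0.01],
    [1.0, 0.0, 0.00],
    [0.0, 1.0, 0.00],
]), p=2, dim=-1)

print(find_duplicates(vectors))  # [(0, 1)]
```

With real model outputs, `vectors` would come from embedding the candidate texts as shown in the Quick Start below.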
## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "GorankLabs/Rank-Embed-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.eval()


def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return (last_hidden_state * mask).sum(1) / torch.clamp(mask.sum(1), min=1e-9)


def embed(texts):
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**encoded)
    embeddings = mean_pool(outputs.last_hidden_state, encoded["attention_mask"])
    return torch.nn.functional.normalize(embeddings, p=2, dim=-1)


queries = ["How do I fix a leaky faucet?"]
documents = [
    "Steps to repair a leaking kitchen faucet at home.",
    "How to replace brake pads on a bicycle.",
]

query_embeddings = embed(queries)
document_embeddings = embed(documents)

# Cosine similarity via matrix product of normalized embeddings.
scores = query_embeddings @ document_embeddings.T
print(scores.tolist())
```
## Architecture Notes
The model is designed around a retrieval-oriented embedding pipeline:
- token-level representations are produced by the Qwen backbone
- mean pooling converts them into a single sentence representation
- a learned projection maps the representation into a 768-dimensional embedding space
- L2 normalization makes the final vectors directly usable for cosine-similarity retrieval
This design keeps the model simple, efficient, and well aligned with modern vector database workflows.
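With this design, brute-force top-k retrieval is a single matrix product plus `torch.topk`. A sketch using random stand-in vectors; in production the corpus embeddings would come from the model and typically live in a vector database rather than an in-memory tensor:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for precomputed, L2-normalized 768-d embeddings.
torch.manual_seed(0)
corpus = F.normalize(torch.randn(1000, 768), p=2, dim=-1)
query = F.normalize(torch.randn(1, 768), p=2, dim=-1)

# Cosine similarities against the whole corpus, shape (1, 1000).
scores = query @ corpus.T

# Indices and scores of the 5 nearest documents, best first.
top = torch.topk(scores, k=5, dim=-1)
print(top.indices.shape)  # torch.Size([1, 5])
```

For large corpora, the same normalized vectors plug directly into approximate-nearest-neighbor indexes that support inner-product search.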
## License
This model is released under the Apache License 2.0.
The base model weights are derived from Qwen/Qwen2.5-0.5B-Instruct; use of this repository must also comply with the applicable Qwen license terms.