---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---

# rubert-tiny2-vllm

**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.

This model produces **numerically identical embeddings** to the original (to within float32 precision) while enabling substantial speedups through vLLM's optimized kernels and batching.

## Modifications

- **No weight changes**: uses the original query/key/value weights directly
- vLLM automatically converts the separate Q/K/V projections into its fused `qkv_proj` format during loading
- Removed the pretraining heads (MLM/NSP), which are not needed for embeddings
- Changed the architecture to `BertModel` for vLLM compatibility
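
The Q/K/V fusion mentioned above is just a concatenation of the three projection matrices along the output dimension, so a single matmul produces all three activations at once. A minimal NumPy sketch with random stand-in matrices (not the actual checkpoint weights), using rubert-tiny2's hidden size of 312:

```python
import numpy as np

hidden = 312  # rubert-tiny2 hidden size

# Random stand-ins for the separate attention projection weights
rng = np.random.default_rng(0)
w_q = rng.standard_normal((hidden, hidden)).astype(np.float32)
w_k = rng.standard_normal((hidden, hidden)).astype(np.float32)
w_v = rng.standard_normal((hidden, hidden)).astype(np.float32)

# The fused qkv_proj stacks the three matrices along the output dim
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)
print(w_qkv.shape)  # (936, 312)

# One matmul, then split back into q, k, v
x = rng.standard_normal((1, hidden)).astype(np.float32)
q, k, v = np.split(x @ w_qkv.T, 3, axis=-1)

# Identical to applying the separate projections
assert np.allclose(q, x @ w_q.T)
assert np.allclose(v, x @ w_v.T)
```

Because the fusion is pure concatenation, no retraining or weight modification is involved.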

## Usage

### vLLM Server
```bash
# IMPORTANT: use fp32 for an exact numerical match with the original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```

### OpenAI-compatible API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM does not require a real key by default
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm",
)
print(response.data[0].embedding[:5])
```
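
The API returns embeddings as plain float lists, so similarity between two responses is a short computation on the client side. A minimal helper (pure NumPy, independent of the server; the `resp_a`/`resp_b` names in the comment are hypothetical):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. cosine(resp_a.data[0].embedding, resp_b.data[0].embedding)
print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```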

### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    # Tokenize, encode, and take the normalized [CLS] token embedding
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```

### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
```

## Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:
```
Max embedding difference:  3.375e-7
Mean embedding difference: 1.136e-7
Cosine similarity matrices: identical under np.allclose with default tolerances
```

This confirms numerical equivalence within float32 precision limits.
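
A comparison of this kind can be reproduced with a few lines of NumPy. The sketch below uses synthetic arrays standing in for the two backends' outputs (a loose explicit `atol` is used here only because the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for embeddings from the two backends: identical up to
# float32-level noise, mimicking the differences reported above
emb_ref = rng.standard_normal((3, 312)).astype(np.float32)
emb_vllm = (emb_ref + rng.uniform(-3e-7, 3e-7, emb_ref.shape)).astype(np.float32)

# Per-coordinate differences stay at float32 noise level
max_diff = np.abs(emb_vllm - emb_ref).max()
print(max_diff < 1e-6)  # True

# Similarity matrices computed from either backend agree
sim_ref = emb_ref @ emb_ref.T
sim_vllm = emb_vllm @ emb_vllm.T
print(np.allclose(sim_vllm, sim_ref, atol=1e-3))  # True
```

With the real models, the same check passes under `np.allclose`'s default tolerances, as reported above.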

## Conversion

Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)

**Conversion process:**
1. Load the original cointegrated/rubert-tiny2 weights
2. Remove the `bert.` prefix from weight names
3. Remove the unused heads (`cls.*`, `bert.pooler.*`)
4. Keep the query/key/value weights as-is (vLLM handles fusion automatically)
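
Steps 2-3 amount to a small state-dict transformation. A hedged sketch of what that looks like (a hypothetical helper, not the notebook's actual code; the demo dict uses string placeholders instead of tensors):

```python
def convert_state_dict(state_dict):
    """Strip the `bert.` prefix and drop the pretraining/pooler heads,
    keeping query/key/value weights untouched for vLLM to fuse later."""
    out = {}
    for name, weight in state_dict.items():
        if name.startswith("cls.") or name.startswith("bert.pooler."):
            continue  # MLM/NSP heads and pooler are unused for embeddings
        out[name.removeprefix("bert.")] = weight
    return out

# Placeholder state dict illustrating the three cases
demo = {
    "bert.encoder.layer.0.attention.self.query.weight": "Wq",
    "bert.pooler.dense.weight": "Wp",
    "cls.predictions.bias": "b",
}
print(sorted(convert_state_dict(demo)))
# ['encoder.layer.0.attention.self.query.weight']
```

Step 4 is then a no-op: the renamed q/k/v tensors are saved unchanged, and vLLM concatenates them into `qkv_proj` at load time.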

Tested on a Google Colab Tesla T4 with:
- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126

## Original Model

For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)