Sarashina-Embedding-v2-1B
ๆฅๆฌ่ชใฎREADME/Japanese README
"Sarashina-Embedding-v2-1B" is a Japanese text embedding model based on the Japanese LLM "Sarashina2.2-1B". We trained this model with multi-stage contrastive learning. As of July 28, 2025, it achieved the state-of-the-art average score across 28 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).
This model maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Sarashina2.2-1B
- Maximum Sequence Length: 8,192 tokens
- Output Dimensionality: 1,792 dimensions
- Similarity Function: Cosine Similarity
- Language: Japanese
- License: Sarashina Model NonCommercial License Agreement
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
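The Pooling module above enables only pooling_mode_lasttoken, so the sentence embedding is taken from the hidden state of the last non-padding token. The following is a minimal pure-Python sketch of that idea on toy data. The function name last_token_pool and the nested-list representation are illustrative assumptions, not the library's implementation.

```python
from typing import List


def last_token_pool(token_embeddings: List[List[List[float]]],
                    attention_mask: List[List[int]]) -> List[List[float]]:
    """Return, for each sequence, the hidden state of its last non-padding token."""
    pooled = []
    for states, mask in zip(token_embeddings, attention_mask):
        # Index of the last position whose mask value is 1.
        # Taking the maximum index works for both left and right padding.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


# Toy batch: 2 sequences, 3 positions, hidden size 2 (the real model uses 1792).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # all three tokens are real
    [[0.7, 0.8], [0.9, 1.0], [0.0, 0.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]
print(last_token_pool(hidden, mask))  # [[0.5, 0.6], [0.9, 1.0]]
```

With a decoder-only backbone such as LlamaModel, the last token is the only position that has attended to the whole input, which is the usual motivation for this pooling mode.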
Usage
First install the Sentence Transformers library:
pip install sentence-transformers==4.0.2
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")
# Run inference
query = [
'task: ใฏใจใชใไธใใใฎใงใไธใใใใWebๆค็ดขใฏใจใชใซ็ญใใ้ข้ฃๆ็ซ ใๆค็ดขใใฆใใ ใใใ\nquery: Sarashinaใฎใใญในใๅใ่พผใฟใขใใซใฏใใใพใใ?'
]
texts = [
'text: ๆด็ดๆฅ่จใฏใๅนณๅฎๆไปฃไธญๆใซ่
ๅๅญๆจๅฅณใซใใฃใฆๆธใใใๅๆณ้ฒใงใใ',
'text: SarashinaใฏใSB Intuitionsใ้็บใใๆฅๆฌ่ชๅคง่ฆๆจก่จ่ชใขใใซใงใใใใใพใงใซ7B, 13B, 70B, 8x70Bใฎใขใใซใๅ
ฌ้ใใใฆใใพใใ',
'text: ใตใฉใทใใจใณใใใฃใณใฐใฏๆฅๆฌ่ช่จ่ชใขใใซใใใผในใซใใๆฅๆฌ่ชๅใ่พผใฟใขใใซใงใใ'
]
query_embedding = model.encode(query)
text_embeddings = model.encode(texts)
# Get the similarity scores between the embeddings
similarities = model.similarity(query_embedding, text_embeddings)
print(similarities)
# tensor([[0.7403, 0.8651, 0.8775]])
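For semantic search, the similarity scores are typically turned into a ranking. As a small sketch, the snippet below ranks the three texts using the scores printed above, copied into a plain list so it runs without the model. The variable names are illustrative.

```python
# Scores copied from the printed tensor above (query vs. each of the three texts).
scores = [0.7403, 0.8651, 0.8775]

# Sort text indices from most to least similar to the query.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [2, 1, 0]
```

The third text, which describes the Sarashina embedding model, ranks first, and the unrelated text about the Sarashina Nikki ranks last.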
How to add instructions and prefixes
The query and document sides use different prefix formats. On the query side, add the prefix task: followed by the instruction, then query: followed by the query text. On the document side, add the prefix text: before the document. (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.)
- Query Side: task: {Instruction}\nquery: {Query}
- Document Side: text: {Document}
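The two formats above can be wrapped in small helpers so that prefixes are applied consistently. This is a minimal sketch; format_query and format_document are hypothetical helper names, not part of the model or of Sentence Transformers, and the uppercase arguments are placeholders.

```python
def format_query(instruction: str, query: str) -> str:
    """Build a query-side input: task line, a newline, then the query line."""
    return f"task: {instruction}\nquery: {query}"


def format_document(document: str) -> str:
    """Build a document-side input with the 'text: ' prefix."""
    return f"text: {document}"


# repr() makes the newline between the task line and the query line visible.
print(repr(format_query("INSTRUCTION", "QUERY")))  # 'task: INSTRUCTION\nquery: QUERY'
print(format_document("DOCUMENT"))                 # text: DOCUMENT
```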
Templates for instructions and prefixes
The table below provides instruction and prefix templates for five main tasks.
| Task | Query Side | Document Side |
|---|---|---|
| Retrieval, Reranking | task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: | text: |
| Clustering | task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - |
| Classification | task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - |
| STS | task: クエリを与えるので，もっともクエリに意味が似ている一節を探してください。\nquery: | task: クエリを与えるので，もっともクエリに意味が似ている一節を探してください。\nquery: |
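One way to apply these templates programmatically is a lookup table keyed by task. The sketch below is an assumption about how one might organize this; the task keys, the INSTRUCTIONS dictionary, and the add_prefix helper are hypothetical names. The instruction strings are the Japanese templates from the table, with English glosses in comments.

```python
INSTRUCTIONS = {
    # "Given a question, retrieve relevant documents that help answer it."
    "retrieval": "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。",
    "reranking": "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。",
    # "Identify the topic or theme of the given document."
    "clustering": "与えられたドキュメントのトピックまたはテーマを特定してください。",
    # "Classify the given review into the appropriate rating category."
    "classification": "与えられたレビューを適切な評価カテゴリに分類してください。",
    # "Given a query, find the passage most similar in meaning to the query."
    "sts": "クエリを与えるので，もっともクエリに意味が似ている一節を探してください。",
}

# Tasks whose table row has "-" in the Document Side column.
QUERY_ONLY_TASKS = {"clustering", "classification"}


def add_prefix(task: str, text: str, is_query: bool = True) -> str:
    """Prefix `text` for `task`; STS uses the query format on both sides."""
    if is_query or task == "sts":
        return f"task: {INSTRUCTIONS[task]}\nquery: {text}"
    if task in QUERY_ONLY_TASKS:
        raise ValueError(f"task '{task}' has no document-side template")
    return f"text: {text}"


print(add_prefix("retrieval", "DOCUMENT", is_query=False))  # text: DOCUMENT
```

Raising an error for clustering and classification documents is a design choice of this sketch; it makes accidental use of a missing template visible early.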
Training
Sarashina-Embedding-v2-1B was created through the following three-stage training process:
Stage 1: Weakly-supervised Learning
To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets.
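The exact training objective is not specified here. As an illustration of contrastive learning in general, the sketch below computes an InfoNCE-style loss with in-batch negatives on toy vectors. The choice of InfoNCE, the cosine scoring, and the temperature value of 0.05 are all assumptions made for this example.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(queries: List[List[float]], docs: List[List[float]],
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss: docs[i] is the positive for queries[i];
    every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Correctly paired batch vs. a batch whose positives are swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is small when each query is closest to its own paired document and large otherwise, which is what pushes matching pairs together during training.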
Stage 2: Supervised Fine-tuning
To further train the model to better understand the similarity between queries and documents, we performed fine-tuning using higher-quality data than that used in Stage 1. Additionally, we trained multiple models by varying parts of the training data.
Stage 3: Model Merging
To enhance performance, we merged the weights of the two models that yielded the highest JMTEB scores in Stage 2 through linear merging.
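Linear merging takes a weighted average of corresponding parameters from two models with the same architecture. The sketch below shows the arithmetic on toy parameter dictionaries. The function name linear_merge, the equal weighting of 0.5, and the parameter names are assumptions for illustration, since the actual merge weights are not stated here.

```python
from typing import Dict, List


def linear_merge(state_a: Dict[str, List[float]],
                 state_b: Dict[str, List[float]],
                 weight_a: float = 0.5) -> Dict[str, List[float]]:
    """Return weight_a * A + (1 - weight_a) * B for every parameter."""
    if state_a.keys() != state_b.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter '{name}'")
        merged[name] = [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]
    return merged


# Toy "models" with flattened parameters (hypothetical names and values).
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
print(linear_merge(model_a, model_b))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```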
Evaluation Results (*) with JMTEB
| Model | Avg. | Retrieval | STS | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|
| Sarashina-Embedding-v2-1B (This model) | 76.38 | 76.48 | 84.22 | 77.14 | 86.28 | 52.56 |
| cl-nagoya/ruri-v3-310m | 75.85 | 76.03 | 81.59 | 77.65 | 85.84 | 50.52 |
| sbintuitions/sarashina-embedding-v1-1b | 74.87 | 74.53 | 81.71 | 77.20 | 84.36 | 50.30 |
| OpenAI/text-embedding-3-large | 73.86 | 71.95 | 82.52 | 77.27 | 83.06 | 51.82 |
(*) Evaluated on July 28, 2025.
License
This model is licensed under the Sarashina Model NonCommercial License Agreement.
If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.