Sarashina-Embedding-v2-1B

ๆ—ฅๆœฌ่ชžใฎREADME/Japanese README

"Sarashina-Embedding-v2-1B" is a Japanese text embedding model based on the Japanese LLM "Sarashina2.2-1B". We trained it with multi-stage contrastive learning, and it achieved the state-of-the-art average score across the 28 datasets in JMTEB (Japanese Massive Text Embedding Benchmark), as benchmarked on July 28, 2025.

This model maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.

Model Details

Model Description

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel 
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
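
The Pooling settings above enable only pooling_mode_lasttoken, so the sentence embedding is taken from the hidden state of the final non-padding token. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the function name last_token_pool and the toy inputs are illustrative assumptions, and real hidden states have 1792 dimensions rather than 2.

```python
from typing import List, Sequence


def last_token_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Return the embedding of the last token whose mask value is 1.

    Hypothetical helper for illustration. It scans from the end, so it
    works whether padding sits on the left or on the right.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")
    for index in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[index] == 1:
            return list(token_embeddings[index])
    raise ValueError("attention mask contains no real tokens")


# Toy sequence: three real tokens followed by one padding token.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]
pooled = last_token_pool(hidden_states, mask)
print(pooled)  # [0.5, 0.6]
```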

Usage

First install the Sentence Transformers library:

pip install sentence-transformers==4.0.2

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the ๐Ÿค— Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")
# Run inference
query = [
      'task: ใ‚ฏใ‚จใƒชใ‚’ไธŽใˆใ‚‹ใฎใงใ€ไธŽใˆใ‚‰ใ‚ŒใŸWebๆคœ็ดขใ‚ฏใ‚จใƒชใซ็ญ”ใˆใ‚‹้–ข้€ฃๆ–‡็ซ ใ‚’ๆคœ็ดขใ—ใฆใใ ใ•ใ„ใ€‚\nquery: Sarashinaใฎใƒ†ใ‚ญใ‚นใƒˆๅŸ‹ใ‚่พผใฟใƒขใƒ‡ใƒซใฏใ‚ใ‚Šใพใ™ใ‹?'
  ]
texts = [
      'text: ๆ›ด็ดšๆ—ฅ่จ˜ใฏใ€ๅนณๅฎ‰ๆ™‚ไปฃไธญๆœŸใซ่…ๅŽŸๅญๆจ™ๅฅณใซใ‚ˆใฃใฆๆ›ธใ‹ใ‚ŒใŸๅ›žๆƒณ้Œฒใงใ™ใ€‚',
      'text: Sarashinaใฏใ€SB IntuitionsใŒ้–‹็™บใ—ใŸๆ—ฅๆœฌ่ชžๅคง่ฆๆจก่จ€่ชžใƒขใƒ‡ใƒซใงใ™ใ€‚ใ“ใ‚Œใพใงใซ7B, 13B, 70B, 8x70Bใฎใƒขใƒ‡ใƒซใŒๅ…ฌ้–‹ใ•ใ‚Œใฆใ„ใพใ™ใ€‚',
      'text: ใ‚ตใƒฉใ‚ทใƒŠใ‚จใƒณใƒ™ใƒ‡ใ‚ฃใƒณใ‚ฐใฏๆ—ฅๆœฌ่ชž่จ€่ชžใƒขใƒ‡ใƒซใ‚’ใƒ™ใƒผใ‚นใซใ—ใŸๆ—ฅๆœฌ่ชžๅŸ‹ใ‚่พผใฟใƒขใƒ‡ใƒซใงใ™ใ€‚'
]
query_embedding = model.encode(query)
text_embeddings = model.encode(texts)
# Get the similarity scores between the embeddings
similarities = model.similarity(query_embedding, text_embeddings)
print(similarities)
# tensor([[0.7403, 0.8651, 0.8775]])
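
The printed scores can be turned into a ranking by sorting. The sketch below assumes that model.similarity computes cosine similarity (the Sentence Transformers default unless a model is configured otherwise) and reimplements it in plain Python; cosine_similarity and rank_documents are hypothetical helper names introduced here for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_documents(scores: Sequence[float]) -> List[int]:
    """Return document indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Scores copied from the example output above.
example_scores = [0.7403, 0.8651, 0.8775]
print(rank_documents(example_scores))  # [2, 1, 0]
```

With these scores, the third document (the description of Sarashina Embedding) ranks first and the unrelated passage about the Sarashina Nikki ranks last.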

How to add instructions and prefixes

The query side and the document side use different prefix formats. On the query side, add the prefix "task: " followed by an instruction, then "query: " followed by the query itself. On the document side, add the prefix "text: ". (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.)

  • Query Side: task: {Instruction}\nquery: {Query}
  • Document Side: text: {Document}

Templates for instructions and prefixes

The table below provides instruction and prefix templates for five main tasks.

| Task | Query Side | Document Side |
| --- | --- | --- |
| Retrieval, Reranking | task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: | text: |
| Clustering | task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - |
| Classification | task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - |
| STS | task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: | task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |
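
One way to keep these templates in code is a lookup keyed by task. The sketch below is an assumption about how a caller might organize them: TASK_INSTRUCTIONS, build_query, and build_document are hypothetical names, and the "-" entries in the table are interpreted here as "this task has no document side".

```python
from typing import Dict

# Query-side instructions copied from the table above.
TASK_INSTRUCTIONS: Dict[str, str] = {
    "retrieval": "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。",
    "reranking": "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。",
    "clustering": "与えられたドキュメントのトピックまたはテーマを特定してください。",
    "classification": "与えられたレビューを適切な評価カテゴリに分類してください。",
    "sts": "クエリを与えるので,もっともクエリに意味が似ている一節を探してください。",
}

# Tasks whose table row lists a "text: " document side.
TASKS_WITH_DOCUMENTS = {"retrieval", "reranking"}


def build_query(task: str, text: str) -> str:
    """Prefix text with the task instruction and the 'query: ' marker."""
    return f"task: {TASK_INSTRUCTIONS[task]}\nquery: {text}"


def build_document(task: str, text: str) -> str:
    """Prefix the second side of a pair according to the table."""
    if task == "sts":
        # STS treats both sentences as queries with the same instruction.
        return build_query(task, text)
    if task in TASKS_WITH_DOCUMENTS:
        return f"text: {text}"
    raise ValueError(f"task {task!r} has no document side")


print(build_query("clustering", "example"))
print(build_document("retrieval", "example"))
```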

Training

Sarashina-Embedding-v2-1B was created through the following three-stage learning process:

Stage 1: Weakly-supervised Learning

To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets.
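
The card does not state which contrastive objective was used. As an illustration only, a common choice for embedding models is an InfoNCE loss with in-batch negatives, where each query is pulled toward its paired passage and pushed away from the other passages in the batch. The function name info_nce_loss, the temperature value, and the toy similarity matrices below are all assumptions.

```python
import math
from typing import Sequence


def info_nce_loss(
    similarities: Sequence[Sequence[float]],
    temperature: float = 0.05,
) -> float:
    """Mean InfoNCE loss over a batch.

    similarities[i][j] is the similarity between query i and passage j.
    The positive passage for query i is assumed to sit on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_sum_exp - logits[i]  # negative log-softmax of the positive
    return total / len(similarities)


# Positives score highest in `aligned` and lowest in `shuffled`.
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
print(info_nce_loss(aligned) < info_nce_loss(shuffled))  # True
```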

Stage 2: Supervised Fine-tuning

To further train the model to better understand the similarity between queries and documents, we fine-tuned it on higher-quality data than that used in Stage 1. We also trained multiple models, each on a version of this data with parts modified.

Stage 3: Model Merging

To enhance performance, we linearly merged the weights of the two Stage 2 models that yielded the highest JMTEB scores.
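
Linear merging takes a weighted average of corresponding parameters from models that share an architecture. The sketch below assumes equal weights (alpha = 0.5), which the card does not specify, and represents each checkpoint as a dict of flat lists; linear_merge and the toy checkpoints are hypothetical.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def linear_merge(
    model_a: StateDict, model_b: StateDict, alpha: float = 0.5
) -> StateDict:
    """Return alpha * model_a + (1 - alpha) * model_b, parameter by parameter."""
    if model_a.keys() != model_b.keys():
        raise ValueError("both models must have the same parameter names")
    merged: StateDict = {}
    for name, weights_a in model_a.items():
        weights_b = model_b[name]
        if len(weights_a) != len(weights_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * a + (1.0 - alpha) * b for a, b in zip(weights_a, weights_b)
        ]
    return merged


# Toy checkpoints standing in for the two best Stage 2 models.
checkpoint_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
checkpoint_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
print(linear_merge(checkpoint_a, checkpoint_b))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```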

Evaluation Results (*) with JMTEB

| Model | Avg. | Retrieval | STS | Classification | Reranking | Clustering |
| --- | --- | --- | --- | --- | --- | --- |
| Sarashina-Embedding-v2-1B (This model) | 76.38 | 76.48 | 84.22 | 77.14 | 86.28 | 52.56 |
| cl-nagoya/ruri-v3-310m | 75.85 | 76.03 | 81.59 | 77.65 | 85.84 | 50.52 |
| sbintuitions/sarashina-embedding-v1-1b | 74.87 | 74.53 | 81.71 | 77.20 | 84.36 | 50.30 |
| OpenAI/text-embedding-3-large | 73.86 | 71.95 | 82.52 | 77.27 | 83.06 | 51.82 |

(*) Evaluated on July 28, 2025.

License

This model is licensed under the Sarashina Model NonCommercial License Agreement.

If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.
