Minnow-Em1-0.6B

Minnow-Em1-0.6B is a compact (0.6B-parameter) multilingual text-embedding model from KiteFish AI, adapted from Qwen/Qwen3-0.6B into a fully bidirectional encoder and fine-tuned for general-purpose embeddings: retrieval, semantic textual similarity (STS), classification, clustering, reranking, and bitext mining.

Version: v1 — the first public release in the Minnow-Em line.


⚠️ Important: this model must be loaded with bidirectional attention

This model was trained with the causal attention mask removed (every token attends to every other token). That change is applied at load time and is not baked into the saved weights, so loading the model the ordinary way leaves it in causal mode and produces poor embeddings. Always apply the patch below after loading.

import types, torch
from sentence_transformers import SentenceTransformer
from transformers import PreTrainedModel

def load_minnow(name="KiteFishAI/Minnow-Em1-0.6B", device="cuda"):
    model = SentenceTransformer(
        name,
        model_kwargs={"torch_dtype": torch.bfloat16, "attn_implementation": "sdpa"},
        device=device,
    )
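    # --- make the backbone bidirectional (must match training) ---
    # Locate the underlying Hugging Face model inside the first module.
    hf = None
    first = model[0]
    for attr in ("auto_model", "model"):
        c = getattr(first, attr, None)
        if isinstance(c, PreTrainedModel):
            hf = c
            break
    if hf is None:
        hf = next(m for m in first.modules() if isinstance(m, PreTrainedModel))
    # Turn off the causal flag on every attention module that exposes one.
    for m in hf.modules():
        if hasattr(m, "is_causal"):
            m.is_causal = False
    # Replace the causal-mask builder with one that masks padding only.
    # If this transformers version has no such method, nothing is patched
    # here and the sanity check below reports whether the model is still causal.
    base = getattr(hf, "model", hf)
    if hasattr(base, "_update_causal_mask"):
        def _no_mask(self, attn_mask, inp, *a, **kw):
            if attn_mask is None:
                return None
            if attn_mask.dim() == 2:
                # (batch, seq) padding mask -> additive (batch, 1, 1, seq) mask
                dt = inp.dtype
                return (1.0 - attn_mask[:, None, None, :].to(dt)) * torch.finfo(dt).min
            return attn_mask
        base._update_causal_mask = types.MethodType(_no_mask, base)
    hf.config.is_decoder = False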

    # sanity check: token-0 state must change when a later token changes
    tok = first.tokenizer
    with torch.no_grad():
        a = tok(["The quick brown fox"], return_tensors="pt").to(hf.device)
        b = tok(["The quick brown cat"], return_tensors="pt").to(hf.device)
        d = (hf(**a).last_hidden_state[0, 0] - hf(**b).last_hidden_state[0, 0]).abs().max()
    assert d > 1e-4, "Model is still causal — patch did not take effect."
    return model
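
To make the effect of the patch concrete, here is a small illustrative sketch in plain Python (no model needed). It contrasts the additive mask a causal decoder would use with the padding-only mask that the patched `_update_causal_mask` returns. `build_mask` and `NEG` are hypothetical names introduced only for this illustration; `NEG` stands in for `torch.finfo(dtype).min`.

```python
# Illustrative only: build_mask is a hypothetical helper, not part of the model.
NEG = float("-inf")  # stand-in for torch.finfo(dtype).min

def build_mask(padding_mask, causal):
    """Return an n x n additive mask: 0.0 = may attend, NEG = blocked.

    padding_mask: list of 0/1 flags, 1 = real token, 0 = padding.
    """
    n = len(padding_mask)
    mask = []
    for query_pos in range(n):
        row = []
        for key_pos in range(n):
            blocked = padding_mask[key_pos] == 0   # never attend to padding
            if causal and key_pos > query_pos:     # causal: no looking ahead
                blocked = True
            row.append(NEG if blocked else 0.0)
        mask.append(row)
    return mask

padding = [1, 1, 1, 0]  # three real tokens, one pad
causal_mask = build_mask(padding, causal=True)
bidir_mask = build_mask(padding, causal=False)

# In causal mode token 0 cannot see token 2; with the padding-only mask it can.
print(causal_mask[0][2], bidir_mask[0][2])  # -inf 0.0
```

In the bidirectional case every row is identical, which is why the patch can return a mask of shape (batch, 1, 1, seq) and let it broadcast over query positions.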

Usage

The model is instruction-aware. Prepend a task instruction to each query using the format:

Instruct: {task instruction}\nQuery: {text}

  • Retrieval / reranking (asymmetric): instruct the query only; leave documents raw.
  • STS / classification / clustering / bitext (symmetric): instruct all texts.

model = load_minnow()

def with_instruction(instruction, texts):
    return [f"Instruct: {instruction}\nQuery: {t}" for t in texts]

# --- retrieval example ---
queries = with_instruction(
    "Given a query, retrieve documents that answer the query",
    ["What causes the northern lights?"],
)
docs = ["Auroras are produced when charged particles from the sun excite atoms in the upper atmosphere."]

q = model.encode(queries, normalize_embeddings=True)
d = model.encode(docs, normalize_embeddings=True)  # documents: no instruction
print(q @ d.T)  # embeddings are normalized, so dot products are cosine similarities
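
The asymmetric/symmetric rule can also be wrapped in one helper so callers do not have to remember which side gets the instruction. The following is a minimal sketch, not part of the released code: `prepare_inputs`, the task-type labels, and the example instruction strings are all hypothetical names chosen for illustration. Only the `Instruct: ...\nQuery: ...` template comes from this card.

```python
# Hypothetical task-type labels; the card only names the task families.
SYMMETRIC_TASKS = {"sts", "classification", "clustering", "bitext"}
ASYMMETRIC_TASKS = {"retrieval", "reranking"}

def with_instruction(instruction, texts):
    """Apply the card's instruction template to each text."""
    return [f"Instruct: {instruction}\nQuery: {t}" for t in texts]

def prepare_inputs(task_type, instruction, texts, is_query=True):
    """Route texts through the template according to the task family.

    Symmetric tasks: every text gets the instruction.
    Asymmetric tasks: only queries get it; documents are returned unchanged.
    """
    if task_type in SYMMETRIC_TASKS:
        return with_instruction(instruction, texts)
    if task_type in ASYMMETRIC_TASKS:
        return with_instruction(instruction, texts) if is_query else list(texts)
    raise ValueError(f"unknown task type: {task_type!r}")

# Placeholder instruction strings, for illustration only.
sts_inputs = prepare_inputs("sts", "Retrieve semantically similar text", ["A cat sleeps."])
doc_inputs = prepare_inputs("retrieval", "unused for documents", ["Raw passage."], is_query=False)
print(sts_inputs[0])
print(doc_inputs[0])
```

The returned lists can be passed straight to `model.encode(..., normalize_embeddings=True)`.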

Model details

Base model            Qwen/Qwen3-0.6B
Parameters            ~0.6B
Attention             Bidirectional (causal mask removed)
Pooling               Mean pooling
Embedding dim         1024
Max sequence length   512
Instruction-aware     Yes (Instruct: … \nQuery: …)
Similarity            Cosine
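
For readers unfamiliar with the pooling and similarity settings above, the sketch below shows mask-aware mean pooling followed by cosine similarity on toy 2-dimensional vectors. It uses plain Python lists so it runs without the model; `mean_pool` and `cosine` are illustrative helper names, and the assumption that padding positions are excluded from the mean reflects common practice rather than anything stated on this card.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask flag is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional "hidden states"; the real model uses 1024 dimensions.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is a padding position
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)                      # [0.5, 0.5]
print(cosine(pooled, [1.0, 0.0]))  # about 0.707
```

When embeddings are L2-normalized first (as with `normalize_embeddings=True`), the cosine reduces to a plain dot product.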

Training

Minnow-Em1 follows the now-standard multi-stage recipe for compact LLM-based embedders (cf. KaLM-Embedding-V2, Qwen3-Embedding, Llama-Embed-Nemotron):

  1. Stage 1 — weakly-supervised contrastive pre-training. Large-scale query/passage pairs, in-batch negatives only, to adapt the bidirectional backbone to representation learning.
  2. Stage 2 — supervised contrastive fine-tuning. Task-homogeneous batches with mined hard negatives, InfoNCE (temperature 0.02) with focal reweighting (γ = 0.5) to emphasize hard examples, false-negative masking, and symmetric/asymmetric instruction routing by task type.
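
As a rough illustration of the Stage 2 objective, the sketch below computes an InfoNCE loss for one query with focal reweighting and false-negative masking, in plain Python. The temperature (0.02) and γ (0.5) are taken from the description above. Everything else is an assumption: the card does not give the exact formulas, so the focal weight uses the standard `(1 - p)^γ` form, and false negatives are masked with a hypothetical margin rule (`fn_margin`) that drops negatives scoring well above the positive. In Stage 1, `neg_scores` would come from the other passages in the batch rather than from mined hard negatives.

```python
import math

def focal_infonce(pos_score, neg_scores, temperature=0.02, gamma=0.5, fn_margin=0.1):
    """Focal-weighted InfoNCE for a single query (illustrative sketch).

    pos_score / neg_scores: cosine similarities to the positive and negatives.
    fn_margin: hypothetical threshold; negatives scoring above
               pos_score + fn_margin are treated as false negatives and dropped.
    """
    # False-negative masking (assumed rule, not confirmed by the card).
    kept = [s for s in neg_scores if s <= pos_score + fn_margin]

    # Temperature-scaled logits, positive first.
    logits = [pos_score / temperature] + [s / temperature for s in kept]

    # Numerically stable log-softmax for the positive entry.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    log_p_pos = logits[0] - log_denom
    p_pos = math.exp(log_p_pos)

    # Focal reweighting (assumed standard form): down-weight easy examples.
    weight = max(0.0, 1.0 - p_pos) ** gamma
    return -weight * log_p_pos

easy = focal_infonce(0.9, [0.1, 0.2])    # positive far ahead of negatives
hard = focal_infonce(0.5, [0.48, 0.1])   # one negative nearly ties the positive
masked = focal_infonce(0.5, [0.95])      # suspected false negative is dropped
print(easy < hard, masked == 0.0)        # True True
```

Setting `gamma=0` recovers plain InfoNCE, which makes it easy to compare the two weightings on the same scores.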

Training data spans retrieval, STS, classification, clustering, reranking, pair classification, and bitext-mining sources across multiple languages.

Evaluation

Evaluation on the MMTEB / MTEB task suite is being finalized with the official mteb harness; a full results table will be added to this card in a subsequent revision. The model is optimized for the multilingual MMTEB task mix.

Numbers will only be published once produced by the official mteb package on the complete benchmark task set (not a partial or custom run).

Limitations and intended use

  • Bidirectional load required (see above) — without the patch the model is effectively causal and underperforms badly.
  • In-domain training data. The training mix includes the train splits of several public benchmark datasets (e.g. MS MARCO, HotpotQA, Natural Questions, NFCorpus, MIRACL). Scores on the corresponding evaluation tasks should be read as in-domain, not zero-shot.
  • Language balance. v1's fine-tuning mix is weighted toward English question-answering retrieval; performance on some low-resource and cross-lingual tasks is correspondingly weaker. Rebalancing is planned for a future version.
  • Intended for embedding/retrieval research and applications; not a generative model.

Acknowledgements

Built on Qwen/Qwen3-0.6B. Methodology informed by KaLM-Embedding-V2, Qwen3-Embedding, and Llama-Embed-Nemotron-8B. Evaluated with the MTEB / MMTEB benchmark suite.

License

Released under Apache-2.0, consistent with the Qwen/Qwen3-0.6B base model. Verify license compatibility for your use case before redistribution.
