Byrne-Embed

Byrne-Embed is a compact 85M-parameter sentence-embedding model. It maps text to 768-dimensional unit-norm vectors suitable for semantic similarity, retrieval, clustering, and reranking.

The backbone is a custom SpikeWhale decoder (the "Byrne" line). A mean-pooled representation of its last hidden state is projected to 768 dimensions by a learned head and unit-normalized, so cosine similarity between two embeddings is just a dot product.
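
The pooling, projection, and normalization steps described above can be sketched in a few lines. This is a minimal illustration in plain Python lists, not the model's actual code: the function names, the toy 3x2 projection matrix, and the assumption that the head is a bias-free linear map are all hypothetical stand-ins for the learned head.

```python
import math

def mean_pool(hidden, mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(hidden[0])
    kept = [vec for vec, m in zip(hidden, mask) if m]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def project(vec, weight):
    """Apply a linear head; weight has shape (out_dim, in_dim). Hypothetical."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def embed(hidden, mask, weight):
    """Pool, project, then normalize, in the order the text describes."""
    return l2_normalize(project(mean_pool(hidden, mask), weight))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy inputs: 3 token positions (the last is padding), hidden size 2,
# and a made-up 3x2 projection standing in for the learned head.
hidden_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
hidden_b = [[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

emb_a = embed(hidden_a, mask, weight)
emb_b = embed(hidden_b, mask, weight)

# For unit-norm vectors the dot product and the cosine coincide.
print(dot(emb_a, emb_b), cosine(emb_a, emb_b))
```

Because the padded position is masked out before averaging, its values never reach the embedding, and because the output is unit-norm, the two printed numbers agree up to floating-point error.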

Benchmark vs. EmbeddingGemma-300M

We benchmarked Byrne-Embed against Google's EmbeddingGemma-300M on 4,000 held-out sentences spanning educational web text, encyclopedic text, and instruction/chat text. Byrne-Embed's embedding geometry closely tracks EmbeddingGemma's with roughly 3.5 times fewer parameters (85M vs. about 300M):

Metric (Byrne-Embed vs. EmbeddingGemma)       Result
Mean per-sentence cosine                      0.9415 (median 0.945, p10 0.912)
Sentences with cosine ≥ 0.90                  94.7%
Similarity-structure agreement (Pearson)      0.9702
Similarity-structure agreement (Spearman)     0.9599
Per-anchor neighbour-ranking correlation      0.9494
Retrieval top-1 nearest-neighbour agreement   72.8%
Retrieval Recall@10 overlap                   78.2%

Reading the numbers. The two most important measures capture how closely the models agree on which sentences are similar, and they land at Pearson 0.97 and Spearman 0.96: when EmbeddingGemma judges two sentences similar, Byrne-Embed almost always does too. For 94.7% of sentences, the two models' embeddings have a cosine of at least 0.90. The lower top-1 retrieval agreement (72.8%) is expected and does not indicate a quality gap. In a dense pool of real sentences many neighbours are near-ties (0.88 vs. 0.87), so the #1 slot flips easily between near-duplicates, while the Recall@10 overlap stays at about 78% and the neighbour-ranking correlation at 0.95. Both models find the same neighbourhood; they occasionally swap rank 1 and rank 2 among near-identical candidates.
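
To make the agreement metrics concrete, here is a small sketch of how figures of this kind could be computed from two sets of unit-norm embeddings for the same sentences. The exact definitions used by run_tests.py are not shown here, so the function names and formulas below are assumptions chosen to match the metric descriptions, and the toy 2-D vectors are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pairwise_sims(embs):
    """Similarities for every sentence pair i < j (unit-norm inputs)."""
    n = len(embs)
    return [dot(embs[i], embs[j]) for i in range(n) for j in range(i + 1, n)]

def neighbours(embs, i, k):
    """Indices of the k nearest neighbours of sentence i, excluding itself."""
    others = [j for j in range(len(embs)) if j != i]
    return sorted(others, key=lambda j: dot(embs[i], embs[j]), reverse=True)[:k]

def agreement(embs_a, embs_b, k):
    """Assumed definitions of structure, top-1, and Recall@k agreement."""
    n = len(embs_a)
    top1 = sum(
        neighbours(embs_a, i, 1) == neighbours(embs_b, i, 1) for i in range(n)
    ) / n
    overlap = sum(
        len(set(neighbours(embs_a, i, k)) & set(neighbours(embs_b, i, k))) / k
        for i in range(n)
    ) / n
    return {
        "structure_pearson": pearson(pairwise_sims(embs_a), pairwise_sims(embs_b)),
        "top1_agreement": top1,
        "recall_at_k_overlap": overlap,
    }

def unit(deg):
    """Unit vector at the given angle; a toy stand-in for an embedding."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

# Sentences 1 and 2 are near-ties relative to sentence 0 (20 vs. 22 degrees),
# and the two toy "models" disagree only on which of them is slightly closer.
model_a = [unit(d) for d in (0, 20, -22, 90)]
model_b = [unit(d) for d in (0, 22, -20, 90)]

report = agreement(model_a, model_b, k=2)
print(report)
```

In this toy case the top-1 agreement is 0.75 because sentence 0's nearest neighbour flips between the two near-tied candidates, while the top-2 overlap is 1.0 and the similarity-structure correlation stays above 0.99. That is the same pattern the paragraph above describes at a larger scale.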

Reproduce these numbers with the bundled run_tests.py (it loads both models and prints the full table).

MTEB English Benchmark — MTEB(eng, v2)

Evaluated with the official mteb library on the full MTEB(eng, v2) suite (41/41 tasks). Raw results are in mteb_results/; machine-readable scores are in the model-index metadata above.

Overall MTEB(eng, v2) mean: 50.79 (averaged over all 41 tasks)

Category Mean Tasks
STS 71.93 9
Classification 70.57 8
PairClassification 74.07 3
Clustering 37.32 8
Reranking 40.48 2
Retrieval 24.64 10
Summarization 22.39 1
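
The overall figure is consistent with a mean taken over tasks rather than over categories, which matters because the ten low-scoring Retrieval tasks then carry more weight than the single Summarization task. A quick check using the rounded category means from the table above (so the result is approximate to the last digit):

```python
# (category, mean score, number of tasks), copied from the category table.
categories = [
    ("STS", 71.93, 9),
    ("Classification", 70.57, 8),
    ("PairClassification", 74.07, 3),
    ("Clustering", 37.32, 8),
    ("Reranking", 40.48, 2),
    ("Retrieval", 24.64, 10),
    ("Summarization", 22.39, 1),
]

total_tasks = sum(n for _, _, n in categories)

# Mean over tasks: each category mean is weighted by its task count.
task_weighted = sum(mean * n for _, mean, n in categories) / total_tasks

# Mean over categories: each category counts once, regardless of size.
category_mean = sum(mean for _, mean, _ in categories) / len(categories)

print(total_tasks)               # 41
print(round(task_weighted, 2))   # 50.79
print(round(category_mean, 2))   # 48.77
```

The task-weighted value reproduces the reported 50.79, while an unweighted mean of the seven categories would give 48.77.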

STS

Task Score
BIOSSES 75.56
SICK-R 69.08
STS12 64.88
STS13 72.08
STS14 67.76
STS15 77.13
STS17 83.23
STS22.v2 60.53
STSBenchmark 77.08

Classification

Task Score
AmazonCounterfactualClassification 80.12
Banking77Classification 74.64
ImdbClassification 60.97
MTOPDomainClassification 92.29
MassiveIntentClassification 63.23
MassiveScenarioClassification 73.05
ToxicConversationsClassification 62.94
TweetSentimentExtractionClassification 57.29

PairClassification

Task Score
SprintDuplicateQuestions 86.47
TwitterSemEval2015 53.19
TwitterURLCorpus 82.55

Clustering

Task Score
ArXivHierarchicalClusteringP2P 53.15
ArXivHierarchicalClusteringS2S 50.39
BiorxivClusteringP2P.v2 33.73
MedrxivClusteringP2P.v2 32.70
MedrxivClusteringS2S.v2 29.04
StackExchangeClustering.v2 41.93
StackExchangeClusteringP2P.v2 35.22
TwentyNewsgroupsClustering.v2 22.39

Reranking

Task Score
AskUbuntuDupQuestions 52.88
MindSmallReranking 28.07

Retrieval

Task Score
ArguAna 37.67
CQADupstackGamingRetrieval 37.14
CQADupstackUnixRetrieval 23.48
ClimateFEVERHardNegatives 13.60
FEVERHardNegatives 28.70
FiQA2018 11.38
HotpotQAHardNegatives 30.47
SCIDOCS 10.15
TRECCOVID 29.30
Touche2020Retrieval.v3 24.50

Summarization

Task Score
SummEvalSummarization.v2 22.39

Usage

The model loads with standard transformers via trust_remote_code (the projection head is fused into the weights, so a single from_pretrained loads everything):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Quazim0t0/Byrne-Embed", trust_remote_code=True)
model = AutoModel.from_pretrained("Quazim0t0/Byrne-Embed", trust_remote_code=True).eval()

texts = ["The cat sat on the windowsill.", "A feline rested by the window."]
enc = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    # Despite the attribute name, this holds the pooled sentence embeddings.
    emb = model(**enc).last_hidden_state          # (2, 768), L2-normalized

print(float(emb[0] @ emb[1]))                      # cosine similarity ~ 0.83

forward() returns L2-normalized 768-dim sentence embeddings, so cosine similarity is just a dot product.
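
Since the embeddings are unit-norm, nearest-neighbour retrieval reduces to ranking dot products. The sketch below shows that ranking step on toy 2-D vectors standing in for rows of emb; top_k is a hypothetical helper written for this example, not part of the model's code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, corpus, k=3):
    """Return (index, score) pairs for the k highest dot products."""
    scored = [(i, dot(query, doc)) for i, doc in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy unit-norm vectors; with the real model these would be embedding rows,
# e.g. obtained via emb.tolist().
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]

ranked = top_k(query, corpus, k=2)
print(ranked)  # document 1 ranks first, then document 0
```

With torch tensors the same scores can be computed for a whole batch at once as emb @ emb.T, and torch.topk can pick the best matches per row.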

Files

File                                                        Purpose
model.safetensors, config.json                              fused SpikeWhale backbone + projection head + config
modeling_byrne_embed.py                                     self-contained custom AutoModel class (SpikeWhale arch inlined; loaded via trust_remote_code)
tokenizer.json, tokenizer_config.json, spike_tokenizer.py   byte-level SpikeTokenizer + its code

Limitations

  • English-centric evaluation; non-English performance is untested.
  • The single residual weak spot observed during evaluation is finance/economics paraphrase retrieval; general semantic similarity is strong.
  • Custom architecture: loading requires trust_remote_code=True, which runs the bundled modeling_byrne_embed.py, so review that file before loading.

Citation

If you use Byrne-Embed, please cite:

@misc{byrne2026byrneembed,
  title        = {Byrne-Embed: A Compact 85M Sentence-Embedding Model},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Quazim0t0/Byrne-Embed}},
}

License

Apache-2.0.
