ogma-micro  ·  2.3M efficient text embedding model  ·  MTEB 52.19

Ultra-small English text embedding model for semantic search, RAG, vector search, clustering, classification, and agent memory — MTEB 52.19, 2.3M parameters, 128d output

Ogma Micro is the most compact model in the Ogma family. At 2.3M parameters and 8.9 MB it scores 52.19 on MTEB, ahead of Potion-32M (51.66) with roughly 1/14th of the parameters. It outputs 128-dimensional embeddings to keep vector indexes small, and it targets latency-critical, edge, and browser workloads.

Why the name Ogma?

Ogma is named after Ogma (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing Ogham, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.


Use cases

ogma-micro is the smallest Ogma model, built for on-device embedding, edge search, browser-side retrieval, local semantic search, agent memory, deduplication, classification, clustering, and privacy-sensitive applications where sending text to an external embedding API is undesirable.

Good fits:

  • Mobile and desktop apps that need local text embeddings without a large model download.
  • Browser, WebAssembly, and extension-style workflows where package size and vector index size matter.
  • Serverless and high-fanout applications that need many cheap embedding calls with predictable memory use.
  • Local-first search over notes, messages, logs, support tickets, snippets, or small document collections.
  • Efficient vector databases where 128-dimensional embeddings reduce storage, bandwidth, and ANN latency.

Choose ogma-micro when footprint matters more than absolute benchmark quality. Move up to ogma-mini or ogma-small when you can spend more memory for stronger representations.


Highlights

  • 🏆 MTEB avg 52.19 — beats Potion-32M (51.66) at 2.3M parameters (14× fewer)
  • 📦 8.9 MB — smallest in the family
  • 📐 128-dim output — half the index size of other Ogma models
  • 📏 1024-token context — 4× longer than all-MiniLM-L6-v2 (256 tokens)
  • 🔀 Asymmetric encoding via task tokens: [QRY], [DOC], [SYM]
  • 📐 Matryoshka dims: [128, 64, 32] — compress to 32d for ultra-low memory indexing
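
To make the index-size claims concrete, the sketch below estimates raw vector storage at each supported dimension. It assumes float32 storage (4 bytes per value) and ignores any ANN index overhead; the helper name `index_size_mb` and the one-million-vector corpus are illustrative assumptions, not part of the model.

```python
def index_size_mb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for n_vectors embeddings of `dim` values, in MB (10^6 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e6

n = 1_000_000  # hypothetical corpus size
for dim in (256, 128, 64, 32):
    print(f"{dim:>3}d: {index_size_mb(n, dim):7.1f} MB")
```

Under these assumptions a 128d index needs half the storage of a 256d one, and truncating to 32d cuts it by a further factor of four.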

Performance

MTEB English — 54/54 tasks (category-averaged)

Benchmarked with MTEB v2.10.7 on the standard 54-task English benchmark using category averaging (same methodology as the MTEB leaderboard).

| Category | ogma-micro | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | 59.57 | 62.62 | -3.05 |
| Clustering | 36.88 | 41.94 | -5.06 |
| PairClassification | 78.62 | 82.37 | -3.75 |
| Reranking | 49.74 | 58.04 | -8.30 |
| Retrieval | 33.09 | 41.95 | -8.86 |
| STS | 75.63 | 78.90 | -3.27 |
| Summarization | 31.77 | 30.81 | +0.96 |
| Overall | 52.19 | 56.09 | -3.90 |

MiniLM and Potion reference scores from the Model2Vec results page.
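
As a quick check on how the overall figure is formed, the sketch below assumes that "category averaging" means the unweighted mean of the seven category scores, so each category counts equally regardless of how many of the 54 tasks it contains. That reading reproduces the reported overall score for ogma-micro.

```python
from statistics import mean

# Category scores for ogma-micro, copied from the table above.
category_scores = {
    "Classification": 59.57,
    "Clustering": 36.88,
    "PairClassification": 78.62,
    "Reranking": 49.74,
    "Retrieval": 33.09,
    "STS": 75.63,
    "Summarization": 31.77,
}

overall = mean(category_scores.values())
print(f"Overall (category-averaged): {overall:.2f}")
```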

Why choose Ogma Micro?

ogma-micro is for when you need the absolute smallest possible model that still achieves competitive MTEB scores. Note the 128-dim output — your vector index will be half the size of other Ogma models. Use ogma-mini if you can afford 3.5M parameters.

CPU Inference Benchmark

Benchmarked on AMD Ryzen Threadripper PRO 3955WX (16-core/32-thread), PyTorch 2.10, batch of 100 mixed-length documents.

| Model | Params | 1T·bs1 (docs/s) | 1T·bs1 latency | 1T·bs32 (docs/s) | 16T·bs32 (docs/s) |
|---|---|---|---|---|---|
| potion-base-8M | 7.6M | 6,892 | 0.14 ms | 18,021 | 17,040 |
| potion-base-32M | 32.3M | 6,826 | 0.15 ms | 17,984 | 17,328 |
| ogma-small | 8.6M | 92.9 | 10.8 ms | 60.9 | 255.6 |
| all-MiniLM-L6-v2 | 22.7M | 53.1 | 18.8 ms | 40.5 | 227.9 |
| ogma-base | 13.3M | 48.3 | 20.7 ms | 28.9 | 121.6 |
| bge-small-en-v1.5 | 33.4M | 26.8 | 37.3 ms | 19.8 | 115.3 |
| ogma-micro | 2.3M | ~280 | ~3.6 ms | ~200 | ~650 |
| bge-base-en-v1.5 | 109.5M | 7.6 | 131.7 ms | 4.8 | 30.2 |

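Potion models are static (lookup-based): their near-zero inference cost comes at the price of no contextual understanding and a fixed 256-token context. Transformer models such as Ogma and MiniLM encode each token in context. The ogma-micro figures are approximate (marked ~); at one thread and batch size 1 it is roughly 5× faster than MiniLM (~280 vs 53.1 docs/s). For comparison, ogma-small is 1.75× faster than MiniLM at one thread and batch size 1 (92.9 vs 53.1 docs/s) and 1.12× faster at 16 threads and batch size 32 (255.6 vs 227.9 docs/s).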
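
The latency and speed-up figures follow directly from the throughput columns. The short sketch below recomputes them from the table values, assuming latency is simply the reciprocal of single-thread, batch-size-1 throughput; the helper name `latency_ms` is illustrative.

```python
# Single-thread, batch-size-1 throughput (docs/s) from the table above.
docs_per_s = {"ogma-small": 92.9, "all-MiniLM-L6-v2": 53.1}

def latency_ms(throughput_docs_per_s: float) -> float:
    """Mean per-document latency implied by a throughput figure."""
    return 1000.0 / throughput_docs_per_s

for name, tput in docs_per_s.items():
    print(f"{name}: {latency_ms(tput):.1f} ms/doc")

speedup_1t = docs_per_s["ogma-small"] / docs_per_s["all-MiniLM-L6-v2"]
speedup_16t = 255.6 / 227.9  # values from the 16-thread, batch-size-32 column
print(f"1T·bs1 speed-up:   {speedup_1t:.2f}x")
print(f"16T·bs32 speed-up: {speedup_16t:.2f}x")
```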

Safety — Toxicity & Prompt Injection Detection

Evaluated on the Ogma transformer architecture (same family). Embeddings are extracted then fed to a logistic regression (LR) or MLP classifier head — the embedding model itself is not fine-tuned. Evaluated against all-MiniLM-L6-v2 as baseline.
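
The evaluation protocol (frozen embeddings, trainable classifier head) can be sketched as follows. This is a minimal illustration, not the evaluation code used for the results below: the embeddings are synthetic stand-ins generated by a hypothetical `fake_embeddings` helper, and the logistic-regression head is trained with plain gradient descent in NumPy rather than a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen sentence embeddings: two noisy classes in 128-d,
# L2-normalised like Ogma outputs. Real usage would call the embedding model.
def fake_embeddings(n: int, label: int) -> np.ndarray:
    centre = np.full(128, 0.3 if label == 1 else -0.3)
    x = rng.normal(size=(n, 128)) + centre
    return x / np.linalg.norm(x, axis=1, keepdims=True)

X_train = np.vstack([fake_embeddings(200, 0), fake_embeddings(200, 1)])
y_train = np.array([0] * 200 + [1] * 200)
X_test = np.vstack([fake_embeddings(100, 0), fake_embeddings(100, 1)])
y_test = np.array([0] * 100 + [1] * 100)

# Logistic-regression head trained by gradient descent on the log-loss.
# Only w and b are updated; the embeddings stay fixed.
w, b = np.zeros(128), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))  # predicted P(label = 1)
    grad = p - y_train                             # d(log-loss)/d(logit)
    w -= 1.0 * (X_train.T @ grad) / len(y_train)
    b -= 1.0 * grad.mean()

pred = (X_test @ w + b > 0).astype(int)
accuracy = (pred == y_test).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the embedding model is never updated, embeddings can be computed once and cached, and several classifier heads can be trained on the same cached vectors.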

1. Jigsaw Toxic Comment Classification

Dataset: Arsive/toxicity_classification_jigsaw (binary toxicity classification) · Train: 25,960 · Test: 6,490

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 89.12% | 88.26% | 89.09% | 87.44% | 95.74% |
| Ogma | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | 91.71% | 91.24% | 90.13% | 92.39% | 97.16% |

With a logistic-regression head, Ogma leads MiniLM by 2.01 F1 points (88.26% vs 86.25%). MiniLM with an MLP head is the best configuration on this dataset (91.24% F1): the larger training set (about 26K samples) allows the MLP to compensate for MiniLM's slightly weaker base representations.

2. Prompt Injection Detection — deepset/prompt-injections

Dataset: deepset/prompt-injections (binary injection detection) · Train: 546 · Test: 116 (low-data regime)

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 86.21% | 84.62% | 100.0% | 73.33% | 97.77% |
| Ogma | MLP | 90.52% | 90.27% | 96.23% | 85.0% | 98.1% |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |

Ogma leads with both classifiers: +4.03 F1 points (MLP) and +4.23 F1 points (LogReg). Ogma's representations are better separated in the low-data regime: with LogReg it reaches 100% precision, meaning zero false positives on the 116-example test set.
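
To show how the reported metrics relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are not given in the source: they are inferred from the reported percentages, assuming the 116-example test set contains 60 injections and 56 benign prompts, which is the split that reproduces the Ogma rows of the table above.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 (as percentages) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": 100 * (tp + tn) / (tp + fp + fn + tn),
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * 2 * precision * recall / (precision + recall),
    }

# Counts inferred from the reported percentages (assumed 60 injections, 56 benign).
ogma_logreg = metrics(tp=44, fp=0, fn=16, tn=56)
ogma_mlp = metrics(tp=51, fp=2, fn=9, tn=54)

for name, m in [("Ogma LogReg", ogma_logreg), ("Ogma MLP", ogma_mlp)]:
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Under this reading, the LogReg head never flags a benign prompt (fp = 0) but misses 16 of 60 injections, while the MLP head trades two false positives for seven more detected injections.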

3. Prompt Injection Detection — neuralchemy/Prompt-injection-dataset

Dataset: neuralchemy/Prompt-injection-dataset (binary injection detection) · Train: 4,391 · Test: 942

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| Ogma | LogReg | 95.22% | 95.93% | 95.84% | 96.01% | 99.30% |
| Ogma | MLP | 95.44% | 96.16% | 94.89% | 97.46% | 99.37% |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |

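Ogma leads on every metric with both classifier heads: +1.31 F1 points (MLP) and +0.55 F1 points (LR). Comparing each model's best configuration, the gap is +0.78 F1 points (96.16% vs 95.38%). Both models perform well at this scale; Ogma maintains its edge and achieves a higher AUC-ROC (99.37% vs 98.92%).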

Summary

| Task | Ogma best F1 | MiniLM best F1 | Δ |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | 91.24% (MLP) | −2.98% |
| deepset Injection | 90.27% (MLP) | 86.24% (MLP) | +4.03% |
| neuralchemy Injection | 96.16% (MLP) | 95.38% (LR) | +0.78% |

Ogma is a stronger feature extractor for prompt injection detection — the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.

Architecture

| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (d_model) | 128 |
| Output dim (d_output) | 128 |
| Transformer layers | 2 |
| Attention heads | 2 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | 1,024 tokens |
| Pooling | Mean pooling |
| Task tokens | [QRY] (query), [DOC] (document), [SYM] (symmetric) |
| Matryoshka dims | [32, 64, 128] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 2.3M |
| Model file | model.safetensors (8.9 MB) |

Key design choices:

  • Task token prepend: A learnable task token ([QRY], [DOC], or [SYM]) is prepended to the input sequence before the transformer. This enables true asymmetric encoding in a single model with a single forward pass.
  • Matryoshka training: The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
  • Mean pooling: The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
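  • L2 normalisation: All outputs are unit-normalised, so cosine similarity equals the dot product, and squared Euclidean distance equals 2 − 2 × cosine similarity. All three therefore produce the same nearest-neighbour ranking, which simplifies downstream usage.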
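
The pooling and normalisation steps above can be illustrated on toy data. This is a NumPy sketch, not the bundled model code: the random `states` array stands in for transformer outputs, and the helper names `mean_pool` and `l2_normalise` are assumptions made for illustration.

```python
import numpy as np

def mean_pool(token_states: np.ndarray, attn_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions.

    token_states: (batch, seq_len, dim); attn_mask: (batch, seq_len), 1 = real token.
    """
    mask = attn_mask[:, :, None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 5, 128))   # toy stand-in for transformer outputs
mask = np.array([[1, 1, 1, 0, 0],       # first sequence has 3 real tokens
                 [1, 1, 1, 1, 1]])      # second sequence has no padding
emb = l2_normalise(mean_pool(states, mask))

a, b = emb
cosine = float(a @ b)                   # unit vectors: dot product == cosine
sq_dist = float(np.sum((a - b) ** 2))
print(np.isclose(sq_dist, 2.0 - 2.0 * cosine))
```

The final check confirms the identity that links the two measures for unit vectors, which is why a vector database can use dot product, cosine, or Euclidean distance interchangeably for ranking.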

Usage

Installation

pip install torch tokenizers huggingface_hub pyyaml

Basic Encoding

from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
import sys, torch

# 1. Download model files
model_path = snapshot_download("axiotic/ogma-micro")

# 2. Load model (bundled source code)
sys.path.insert(0, model_path)
from ogma_model import OgmaModel
model = OgmaModel.from_checkpoint(model_path, device="cpu")
model.eval()

# 3. Tokenizer
N_SPECIAL = 7
_tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")

def encode(texts: list, max_length: int = 1024):
    all_ids = []
    for text in texts:
        enc = _tok.encode(text)
        ids, toks = enc.ids, enc.tokens
        # Strip CLS/SEP added by tokenizer
        if toks and toks[0] in ("[CLS]", "<s>"):
            ids, toks = ids[1:], toks[1:]
        if toks and toks[-1] in ("[SEP]", "</s>"):
            ids = ids[:-1]
        # Shift into Ogma's vocabulary space and add BOS/EOS
        ogma_ids = [2] + [rid + N_SPECIAL for rid in ids] + [3]
        all_ids.append(ogma_ids[:max_length])

    ml = max(len(ids) for ids in all_ids)
    token_ids = torch.zeros(len(texts), ml, dtype=torch.long)
    attn_mask = torch.zeros(len(texts), ml, dtype=torch.long)
    for i, ids in enumerate(all_ids):
        token_ids[i, :len(ids)] = torch.tensor(ids)
        attn_mask[i, :len(ids)] = 1
    return token_ids, attn_mask

# 4. Encode (symmetric mode — good for clustering, classification, STS)
from config import TaskToken

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn vulpine leaps over an idle canine",
]
with torch.no_grad():
    token_ids, attn_mask = encode(sentences)
    embeddings = model.encode(token_ids, attn_mask, task=TaskToken.SYM)

print(embeddings.shape)  # (2, 128): one 128-d vector per sentence
sim = (embeddings[0] @ embeddings[1]).item()
print(f"Cosine similarity: {sim:.4f}")  # L2-normalised, dot product = cosine

Asymmetric Retrieval (Query / Document)

Use TaskToken.QRY for query embeddings and TaskToken.DOC for document embeddings in retrieval pipelines. This asymmetric encoding is a first-class feature of the Ogma architecture.

# Asymmetric retrieval — encode queries with QRY, passages with DOC
from config import TaskToken

queries = [
    "What is knowledge distillation?",
    "How does retrieval-augmented generation work?",
]
documents = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher...",
    "Retrieval-Augmented Generation (RAG) combines a dense retriever with a language model...",
]

with torch.no_grad():
    q_ids, q_mask = encode(queries)
    d_ids, d_mask = encode(documents)
    q_emb = model.encode(q_ids, q_mask, task=TaskToken.QRY)  # (N, 128)
    d_emb = model.encode(d_ids, d_mask, task=TaskToken.DOC)  # (M, 128)

# Dot product == cosine similarity (embeddings are L2-normalised)
scores = q_emb @ d_emb.T  # (N, M)
print(scores)
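
Once the score matrix is available, retrieval reduces to picking the highest-scoring documents for each query. The sketch below shows that step on a small hand-written matrix standing in for `q_emb @ d_emb.T`; the helper name `top_k` is an illustrative assumption.

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring documents per query, best first.

    scores: (n_queries, n_docs) similarity matrix.
    """
    order = np.argsort(-scores, axis=1)  # sort each row in descending score order
    return order[:, :k]

# Toy score matrix (2 queries, 3 documents).
scores = np.array([[0.82, 0.10, 0.35],
                   [0.05, 0.77, 0.40]])
print(top_k(scores, k=2))
```

When the scores are a PyTorch tensor, as in the snippet above, `torch.topk(scores, k, dim=1)` returns both the top values and their indices in one call.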

Matryoshka — Flexible Dimensionality

Ogma supports Matryoshka Representation Learning. Truncate and re-normalise to any supported sub-dimension for faster indexing or lower memory usage — no retraining required.

import torch
import torch.nn.functional as F

with torch.no_grad():
    token_ids, attn_mask = encode(sentences)
    emb_full = model.encode(token_ids, attn_mask)  # (N, 128), full dimensionality

# Truncate to any supported sub-dimension and re-normalise; no retraining needed
# Supported dims: [32, 64, 128]
emb_32 = F.normalize(emb_full[:, :32], dim=-1)  # (N, 32)
emb_64 = F.normalize(emb_full[:, :64], dim=-1)  # (N, 64)
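
Re-normalising after truncation matters because the first 32 or 64 components of a unit vector have a norm below 1, so without it the dot product would no longer equal cosine similarity. The NumPy sketch below demonstrates this on random unit vectors; they only illustrate the arithmetic and carry none of the Matryoshka-trained structure of real Ogma embeddings. The helper name `truncate_and_renormalise` is an assumption.

```python
import numpy as np

def truncate_and_renormalise(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row, then rescale to unit length."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 128))
full /= np.linalg.norm(full, axis=1, keepdims=True)  # toy unit-norm 128-d vectors

for dim in (128, 64, 32):
    emb = truncate_and_renormalise(full, dim)
    norms = np.linalg.norm(emb, axis=1)
    print(dim, emb.shape, np.allclose(norms, 1.0))
```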

LangChain Integration

# LangChain integration (custom embeddings class)
# Requires: pip install langchain-core
# Assumes the encode() helper from Basic Encoding is defined in the same module.
import sys
import torch
from huggingface_hub import snapshot_download
from langchain_core.embeddings import Embeddings
from tokenizers import Tokenizer

class OgmaEmbeddings(Embeddings):
    def __init__(self, model_name: str = "axiotic/ogma-micro", device: str = "cpu"):
        model_path = snapshot_download(model_name)
        # ogma_model.py and config.py ship with the checkpoint, so they are
        # importable only after model_path has been added to sys.path.
        if model_path not in sys.path:
            sys.path.insert(0, model_path)
        from ogma_model import OgmaModel
        from config import TaskToken
        self._task_token = TaskToken
        self.model = OgmaModel.from_checkpoint(model_path, device=device)
        self.model.eval()
        # Kept for use if you inline encode() as a method instead of the global helper
        self._tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")
        self._device = device

    def _encode(self, texts, task):
        with torch.no_grad():
            ids, mask = encode(texts)  # encode() from Basic Encoding above
            return self.model.encode(ids.to(self._device), mask.to(self._device), task=task)

    def embed_documents(self, texts):
        # Documents use the [DOC] task token
        return self._encode(texts, self._task_token.DOC).cpu().numpy().tolist()

    def embed_query(self, text):
        # Queries use the [QRY] task token
        return self._encode([text], self._task_token.QRY).cpu().numpy()[0].tolist()

embeddings = OgmaEmbeddings()

Model Family

| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ogma-large | 32.4M | 124 MB | 57.41 | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| ogma-base | 13.3M | 51 MB | 57.04 | 67.89 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| ogma-small | 8.6M | 33 MB | 56.34 | 66.67 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| ogma-mini | 3.5M | 14 MB | 53.07 | 61.80 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| ogma-micro | 2.3M | 8.9 MB | 52.19 | 59.57 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| all-MiniLM-L6-v2 | 22.7M | 87 MB | 56.09 | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| potion-base-32M | 32.3M | 123 MB | 51.66 | 65.97 | 35.29 | 78.17 | 50.92 | 33.52 | 74.22 | 29.78 | 256 | inf |
| potion-base-8M | 7.6M | 29 MB | 50.03 | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |

All Ogma: MTEB 2.10.7, 54-task standard English set, category-averaged. MiniLM/Potion: published scores from Model2Vec results page.


Training Details

| Property | Value |
|---|---|
| Teacher model | jinaai/jina-embeddings-v5-text-small (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |

Limitations

  • No text generation. Ogma is an encoder-only embedding model.
  • English only. Training data and evaluation are English-only.
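  • Slower than static models. Transformer inference is typically 40-100× slower than static models (Potion, Model2Vec) on CPU; in the single-threaded benchmark above, ogma-micro narrows this to roughly 25× (~280 vs 6,892 docs/s). The trade-off: contextual understanding and 4× longer sequences.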
  • Non-commercial licence. Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
  • Reranking gap. Ogma lags behind MiniLM-L6-v2 on reranking tasks (category avg delta: -8.3). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.

Licence & Attribution

This model is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

Required attribution (must be included in all uses):

This model was trained via knowledge distillation from jina-embeddings-v5-text-small (https://huggingface.co/jinaai/jina-embeddings-v5-text-small) by Jina AI, licensed under CC-BY-NC-4.0.


Citation

@misc{ogma2026,
  title     = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
  author    = {Axiotic AI},
  year      = {2026},
  url       = {https://huggingface.co/axiotic/ogma-micro},
}