Hyper3-CLIP v0.5

Hyper3-CLIP v0.5 is an open-weight hyperbolic vision-language checkpoint from hyper³labs. It places image and text representations in a Lorentz space and was trained with compositional entailment constraints for hierarchy-sensitive image-text retrieval.
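To illustrate what placing embeddings in a Lorentz space involves, the sketch below implements the textbook Lorentz (hyperboloid) model in plain Python: lifting a Euclidean vector from the tangent space at the origin, measuring geodesic distance, and mapping back. The function names and the fixed `curvature` argument are assumptions made for this example, not identifiers from the Hyper3-CLIP code, and the released model may parameterize curvature differently.

```python
import math


def exp_map_origin(v, curvature=1.0):
    """Lift a tangent vector at the origin onto the hyperboloid.

    Returns (time, space) with -time**2 + |space|**2 == -1 / curvature.
    """
    root_c = math.sqrt(curvature)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        space = [0.0 for _ in v]
    else:
        scale = math.sinh(root_c * norm) / (root_c * norm)
        space = [scale * x for x in v]
    time = math.sqrt(1.0 / curvature + sum(x * x for x in space))
    return time, space


def lorentz_inner(a, b):
    """Lorentzian inner product: negative time term plus Euclidean space term."""
    return -a[0] * b[0] + sum(x * y for x, y in zip(a[1], b[1]))


def lorentz_distance(a, b, curvature=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 so rounding error never pushes acosh out of its domain.
    arg = max(1.0, -curvature * lorentz_inner(a, b))
    return math.acosh(arg) / math.sqrt(curvature)


def log_map_origin(point, curvature=1.0):
    """Map a hyperboloid point back to the tangent space at the origin."""
    root_c = math.sqrt(curvature)
    space = point[1]
    norm = math.sqrt(sum(x * x for x in space))
    if norm == 0.0:
        return [0.0 for _ in space]
    scale = math.asinh(root_c * norm) / (root_c * norm)
    return [scale * x for x in space]


# Toy example (hypothetical values, not model outputs).
v = [0.3, -0.4]
point = exp_map_origin(v)
origin = exp_map_origin([0.0, 0.0])
print(lorentz_distance(origin, point))  # approximately 0.5, the Euclidean norm of v
print(log_map_origin(point))            # approximately [0.3, -0.4]
```

In this model the distance from the origin equals the norm of the tangent vector, and the log map recovers a Euclidean vector that can be L2-normalized for cosine-based vector stores.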

This v0.5 release is intended as an open baseline and research artifact.

Model

Architecture: ViT-B scale vision-language model
Vision backbone: vit_base_patch16_224
Text backbone: openai/clip-vit-base-patch32
Embedding dimension: 512
Training steps: 500,000
Global batch size: 768
Weights artifact: model.safetensors

The original full training checkpoint included optimizer, scheduler, AMP scaler, RNG state, config, and step metadata. This repository publishes the weights-only model.safetensors artifact for inference and downstream research.

Quick Start: Sentence Transformers

The default way to use this checkpoint is through Sentence Transformers. The adapter in this repository returns 512-dimensional L2-normalized tangent-space embeddings for standard cosine/dot-product vector stores.

Install the runtime dependencies:

pip install "sentence-transformers>=5.5.1" timm safetensors pyyaml Pillow

If you are using the gated Hugging Face repository from a fresh machine, accept the access conditions on the model page and set the HF_TOKEN environment variable to a token from the account that was granted access.

from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("hyper3labs/hyper3-clip-v0.5", trust_remote_code=True)

image_embedding = model.encode([Image.open("/path/to/image.jpg")], normalize_embeddings=True)
text_embedding = model.encode(["machined metal part"], normalize_embeddings=True)
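Because the embeddings are L2-normalized, a plain dot product equals cosine similarity, so candidates can be ranked against a query without any hyperbolic math. The sketch below shows this ranking step on toy unit vectors; `dot` and `rank_by_similarity` are hypothetical helper names for this example, and the code is kept dependency-free (a vector store or a NumPy matrix product would do the same work in practice).

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(float(x) * float(y) for x, y in zip(a, b))


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar.

    Assumes all vectors are already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = [dot(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy unit-length vectors standing in for model outputs.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print([i for i, _ in ranking])  # [2, 1, 0]
```

With the model loaded as above, `image_embedding[0]` can serve as the query and the rows returned by `model.encode` for a list of captions as the candidates.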

Transformers

from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("hyper3labs/hyper3-clip-v0.5", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

image = model.preprocess_image(Image.open("/path/to/image.jpg")).unsqueeze(0)
text = tokenizer(
    ["machined metal part"],
    padding=True,
    truncation=True,
    max_length=model.config.max_text_length,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(
        pixel_values=image,
        input_ids=text["input_ids"],
        attention_mask=text["attention_mask"],
    )

image_embedding = outputs.image_embeds
text_embedding = outputs.text_embeds

Haystack image retrieval pipeline

For indexing images in a Haystack retrieval pipeline, use SentenceTransformersDocumentImageEmbedder with image paths in Document.meta["file_path"], paired with SentenceTransformersTextEmbedder for text queries.

pip install "haystack-ai>=2.30.1" "sentence-transformers>=5.5.1" timm safetensors pyyaml Pillow

from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder

model_id = "hyper3labs/hyper3-clip-v0.5"

documents = [
    Document(
        content="front view of a machined metal part",
        meta={"file_path": "/path/to/image.jpg"},
    )
]

image_embedder = SentenceTransformersDocumentImageEmbedder(
    model=model_id,
    trust_remote_code=True,
    batch_size=8,
    normalize_embeddings=True,
)
documents = image_embedder.run(documents=documents)["documents"]

text_embedder = SentenceTransformersTextEmbedder(
    model=model_id,
    trust_remote_code=True,
    normalize_embeddings=True,
)
query_embedding = text_embedder.run("machined metal part")["embedding"]

Evaluation

The numbers below use the official evaluator convention for R@10. Higher is better for every metric except TIE and LCA, where lower is better.

Model	Comparable setting	ImageNet top-1	COCO text R@10	COCO image R@10	Flickr text R@10	Flickr image R@10	TIE	LCA	Jaccard	H-Prec	H-Rec
MERU-B/16	same-family baseline	40.1	82.0	68.6	96.2	90.0	3.630	2.220	0.780	0.850	0.850
HyCoCLIP-B/16	official checkpoint	45.8	82.0	69.3	95.4	90.3	3.172	2.047	0.814	0.874	0.874
UNCHA-B/16	official checkpoint	48.8	82.6	71.0	95.9	91.2	2.945	1.961	0.828	0.883	0.884
PHyCLIP-B/16	related reported result	44.4	80.4	68.7	95.6	89.9	3.285	2.088	0.807	0.868	0.868
Hyper3-CLIP v0.5	this release	48.5	84.0	72.8	97.5	92.4	2.972	1.986	0.828	0.882	0.883
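For readers unfamiliar with the R@10 columns, the following is a minimal sketch of a generic Recall@K computation. It assumes the common retrieval convention that a query counts as a hit when at least one correct candidate appears in its top K results. The function name and toy data are illustrative, and this is not the official evaluator used to produce the table.

```python
def recall_at_k(scores, relevant, k=10):
    """Percentage of queries with at least one correct candidate in the top k.

    scores[q][j] is the similarity of query q to candidate j.
    relevant[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for row, gold in zip(scores, relevant):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy similarity matrix: 3 queries, 3 candidates (hypothetical values).
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.7],
    [0.5, 0.6, 0.1],
]
relevant = [{0}, {2}, {2}]

print(recall_at_k(scores, relevant, k=1))  # one of three queries hits
print(recall_at_k(scores, relevant, k=2))  # two of three queries hit
```

For image-to-text retrieval on caption datasets, `relevant[q]` would typically contain the indices of all captions belonging to image `q`.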

Raw evaluation files are included:

eval_coco_karpathy_final.json
eval_flickr30k_final.json
eval_imagenet_final.json
eval_hycoclip_uncha_intersection_final.json
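The hierarchical columns in the evaluation table (TIE, Jaccard, H-Prec, H-Rec) compare a predicted label with the true label inside a class hierarchy. The sketch below uses the common ancestor-set definitions of these metrics on a toy tree. It assumes each ancestor set includes the node itself and the root, which is a convention choice; the evaluator used for this release may differ, and LCA is omitted because its definition varies between evaluators. All names and the toy hierarchy are hypothetical.

```python
def ancestors(node, parent):
    """Return the node together with all of its ancestors up to the root."""
    chain = {node}
    while node in parent:
        node = parent[node]
        chain.add(node)
    return chain


def hierarchical_metrics(pred, true, parent):
    """Ancestor-set metrics for one (prediction, ground truth) pair."""
    a_pred = ancestors(pred, parent)
    a_true = ancestors(true, parent)
    shared = a_pred & a_true
    union = a_pred | a_true
    return {
        # Tree-induced error: number of edges on the path between the labels.
        "tie": len(union) - len(shared),
        "jaccard": len(shared) / len(union),
        "h_prec": len(shared) / len(a_pred),
        "h_rec": len(shared) / len(a_true),
    }


# Toy hierarchy given as child -> parent; "entity" is the root.
parent = {"animal": "entity", "dog": "animal", "cat": "animal", "puppy": "dog"}

print(hierarchical_metrics("puppy", "cat", parent))
```

Dataset-level scores would then be averages of these per-sample values over the evaluation set.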

License And Attribution

The model materials in this repository are released under OpenMDW-1.0. See LICENSE.

Redistributions should preserve NOTICE, LICENSE, and the original model card when practical. Modified or derived checkpoints should use a distinct name and must not imply endorsement by hyper³labs.

Please cite and link to the original hyper³labs model repository when publishing benchmarks, papers, derivative checkpoints, or public demos based on this model.

Intended Use

This release is intended for:

hierarchy-sensitive image-text retrieval research
zero-shot and retrieval evaluation
multimodal embedding baselines
downstream experiments with hyperbolic representation learning

This model has not been validated for safety-critical use.

Citation

If you use Hyper3-CLIP v0.5, cite the original model repository and hyper³labs.
