TIPSv2

A collection of TIPSv2 foundational vision-language models. Webpage: https://gdm-tipsv2.github.io/
TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This card covers the Giant (g/14) variant, with 1.1B vision parameters and 389M text parameters. Try the code snippets below, or check out the GitHub repo for more use cases and visualizations, including zero-shot segmentation.
| Variant | Vision params | Text params | Embed dim | DPT head variant |
|---|---|---|---|---|
| B/14 | 86M | 110M | 768 | B/14-dpt |
| L/14 | 303M | 184M | 1024 | L/14-dpt |
| SO400m/14 | 412M | 448M | 1152 | SO400m/14-dpt |
| g/14 | 1.1B | 389M | 1536 | g/14-dpt |
Install the dependencies:

```bash
pip install transformers torch torchvision sentencepiece scikit-learn
```
Load the model:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-g14", trust_remote_code=True)
model.eval()
```
Images should be tensors in the [0, 1] range (just ToTensor(), no ImageNet normalization):

```python
from torchvision import transforms
from PIL import Image
import requests

# Raw [0, 1] pixel values; no ImageNet mean/std normalization.
transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = transform(image).unsqueeze(0)  # (1, 3, 448, 448)

out = model.encode_image(pixel_values)
print(out.cls_token.shape)     # (1, 1, 1536) - global image embedding
print(out.patch_tokens.shape)  # (1, 1024, 1536) - per-patch spatial features (448/14 = 32, so 32*32 = 1024 patches)
```
Encode text queries:

```python
text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape)  # (2, 1536) - one embedding per query
```
Zero-shot classification with the global (CLS) embedding:

```python
import torch.nn.functional as F

classes = ["bus", "car", "dog", "cat"]
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)           # (1, 1536)
text_emb = F.normalize(model.encode_text(classes), dim=-1)  # (4, 1536)
similarity = cls @ text_emb.T                                # (1, 4) cosine similarities
print(classes[similarity.argmax()])  # bus - predicted class
```
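The zero-shot segmentation mentioned above is covered in full in the GitHub repo. As a rough sketch of the underlying idea only (assuming patch tokens can be scored against the same normalized text embeddings with plain cosine similarity; the prompts, post-processing, and upsampling used in the repo may differ):

```python
# Illustrative zero-shot segmentation sketch; not the repo's exact recipe.
patches = F.normalize(out.patch_tokens[0], dim=-1)  # (1024, 1536) normalized patch features
patch_sim = patches @ text_emb.T                    # (1024, 4) similarity to each class
seg = patch_sim.argmax(dim=-1).reshape(32, 32)      # per-patch index into `classes`
print(seg.shape)  # (32, 32) - coarse label map; upsample to 448x448 for display
```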
Visualize the patch features with a 3-component PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Reshape the 1024 patch tokens into their 32x32 spatial grid.
spatial = out.patch_tokens.reshape(1, 32, 32, 1536)
feat = spatial[0].detach().cpu().numpy().reshape(-1, 1536)

# Project each patch feature to 3 components and map them to RGB.
rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb))  # sigmoid squashes values into [0, 1] with good contrast
print(rgb.shape)  # (32, 32, 3) - PCA of patch features as RGB
```
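To actually look at the result, one option (not part of the original snippet; the filename is just an example) is to convert the map to an image and upsample it to the input resolution with PIL:

```python
from PIL import Image
import numpy as np

# Convert the 32x32 PCA map to an image and scale it up to 448x448 for viewing.
vis = Image.fromarray((rgb * 255).astype(np.uint8))
vis = vis.resize((448, 448), resample=Image.NEAREST)
vis.save("pca_patches.png")
```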
Run on GPU:

```python
model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])
```
License: Apache 2.0
Citation:

```bibtex
@inproceedings{cao2026tipsv2,
  title     = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author    = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```