TIPSv2

A collection of TIPSv2 foundational vision-language models. Webpage: https://gdm-tipsv2.github.io/
TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This card covers the Giant (g/14) variant, with 1.1B vision parameters and 389M text parameters. Try the code snippets below, or check out the GitHub repo for more use cases and visualizations, including zero-shot segmentation.
| Variant | Vision params | Text params | Embed dim | DPT head variant |
|---|---|---|---|---|
| B/14 | 86M | 110M | 768 | B/14-dpt |
| L/14 | 303M | 184M | 1024 | L/14-dpt |
| SO400m/14 | 412M | 448M | 1152 | SO400m/14-dpt |
| g/14 | 1.1B | 389M | 1536 | g/14-dpt |
Install the dependencies:

```bash
pip install transformers torch torchvision sentencepiece scikit-learn
```
Load the model:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-g14", trust_remote_code=True)
model.eval()
```
Images should be tensors in the [0, 1] range (just ToTensor(), no ImageNet normalization):

```python
from torchvision import transforms
from PIL import Image
import requests

# Raw [0, 1] pixel values; no ImageNet mean/std normalization.
transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = transform(image).unsqueeze(0)  # (1, 3, 448, 448)

out = model.encode_image(pixel_values)
print(out.cls_token.shape)     # (1, 1, 1536) - global image embedding
print(out.patch_tokens.shape)  # (1, 1024, 1536) - per-patch spatial features (448/14 = 32, so 32*32 = 1024 patches)
```
Encode text queries:

```python
text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape)  # (2, 1536) - one embedding per query
```
Zero-shot classification with the global (CLS) embedding:

```python
import torch.nn.functional as F

classes = ["bus", "car", "dog", "cat"]
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)           # (1, 1536)
text_emb = F.normalize(model.encode_text(classes), dim=-1)  # (4, 1536)
similarity = cls @ text_emb.T                                # (1, 4) cosine similarities
print(classes[similarity.argmax()])  # bus - predicted class
```
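The zero-shot segmentation mentioned above is covered in full in the GitHub repo. As a rough sketch of the underlying idea only (assuming patch tokens can be scored against the same normalized text embeddings with plain cosine similarity; the prompts, post-processing, and upsampling used in the repo may differ):

```python
# Illustrative zero-shot segmentation sketch; not the repo's exact recipe.
patches = F.normalize(out.patch_tokens[0], dim=-1)  # (1024, 1536) normalized patch features
patch_sim = patches @ text_emb.T                    # (1024, 4) similarity to each class
seg = patch_sim.argmax(dim=-1).reshape(32, 32)      # per-patch index into `classes`
print(seg.shape)  # (32, 32) - coarse label map; upsample to 448x448 for display
```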
Visualize the patch features with a 3-component PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Reshape the 1024 patch tokens into their 32x32 spatial grid.
spatial = out.patch_tokens.reshape(1, 32, 32, 1536)
feat = spatial[0].detach().cpu().numpy().reshape(-1, 1536)

# Project each patch feature to 3 components and map them to RGB.
rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb))  # sigmoid squashes values into [0, 1] with good contrast
print(rgb.shape)  # (32, 32, 3) - PCA of patch features as RGB
```
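To actually look at the result, one option (not part of the original snippet; the filename is just an example) is to convert the map to an image and upsample it to the input resolution with PIL:

```python
from PIL import Image
import numpy as np

# Convert the 32x32 PCA map to an image and scale it up to 448x448 for viewing.
vis = Image.fromarray((rgb * 255).astype(np.uint8))
vis = vis.resize((448, 448), resample=Image.NEAREST)
vis.save("pca_patches.png")
```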
Run on GPU:

```python
model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])
```
License: Apache 2.0
Citation:

```bibtex
@inproceedings{cao2026tipsv2,
  title     = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author    = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```