Dataset: jxie/flickr8k
A lightweight multimodal model combining GPT-2 and a Vision Transformer for image captioning, built for Smart India Hackathon 2025.
Sparse Cross-Attention Fusion (inspired by Llama 3.2)
Total: 222M params | Trainable: 11M params (5%)
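The numbers above follow from training only the fusion layers while both backbones stay frozen. A minimal sketch of that pattern, assuming a gated cross-attention block (the class name, dimensions, and zero-initialized tanh gate are illustrative, not the checkpoint's exact code):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative fusion block: text hidden states attend to visual tokens.
    A zero-initialized tanh gate makes the block an identity at the start of
    training, so the frozen LM's behavior is preserved initially."""
    def __init__(self, dim=768, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text, visual):
        attended, _ = self.attn(self.norm(text), visual, visual)
        return text + torch.tanh(self.gate) * attended

# Freeze the backbone; only the fusion layers remain trainable
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)  # stand-in
fusion = GatedCrossAttention()
for p in backbone.parameters():
    p.requires_grad = False

def count_trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print(count_trainable(backbone), count_trainable(fusion))  # 0 vs. fusion params only
```

Applied to a 222M-parameter model with only the resampler and cross-attention layers unfrozen, this is how the trainable fraction lands at roughly 11M / 222M ≈ 5%.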
Checkpoint files: `model_fp32/model_checkpoint.pth` (full precision) and `model_fp16/model_checkpoint.pth` (half precision).

Install the dependencies:

```bash
pip install torch torchvision transformers pillow huggingface-hub
```
```python
import torch
from huggingface_hub import hf_hub_download

# Download the FP32 checkpoint
checkpoint_path = hf_hub_download(
    repo_id="gurumurthy3/vision-gpt-flickr8k_v2",
    filename="model_fp32/model_checkpoint.pth"
)

# Load the checkpoint
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model_state_dict = checkpoint['model_state_dict']

# Instantiate your model architecture, then load the weights
# model.load_state_dict(model_state_dict)
# model.eval()
```
```python
import torch
from huggingface_hub import hf_hub_download

# Download the FP16 checkpoint
checkpoint_path = hf_hub_download(
    repo_id="gurumurthy3/vision-gpt-flickr8k_v2",
    filename="model_fp16/model_checkpoint.pth"
)

# Load the checkpoint
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Instantiate your model architecture, then load the weights
# model.load_state_dict(checkpoint['model_state_dict'])
# model.half()  # ensure the model is in FP16
# model.eval()

# For GPU inference with FP16:
# model = model.to('cuda')
# images = images.to('cuda').half()
```
```python
import torch
from PIL import Image
from torchvision import transforms

# Image preprocessing: 224x224 with ImageNet normalization, as expected by ViT-B/16
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# Load and preprocess an image
image = Image.open("your_image.jpg").convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Generate a caption
with torch.no_grad():
    caption = model.generate(image_tensor, max_length=50)
print(f"Caption: {caption}")
```
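Since the repository ships only weights, `model.generate` refers to whatever decoding method your reconstructed architecture defines. If yours lacks one, a minimal greedy decoding loop can stand in; note that the forward signature `model(image_tensor, ids)` and the tokenizer attributes used here are assumptions, not the checkpoint's documented API:

```python
import torch

@torch.no_grad()
def greedy_caption(model, tokenizer, image_tensor, max_length=50):
    """Sketch of greedy decoding: repeatedly append the argmax token.
    Assumes model(image, token_ids) returns logits of shape (B, T, vocab)."""
    ids = torch.tensor([[tokenizer.bos_token_id]])
    for _ in range(max_length):
        logits = model(image_tensor, ids)          # assumed forward signature
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

For better captions, beam search or top-k sampling can replace the argmax step; greedy decoding is just the simplest baseline.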
Try it live: Multimodal GPT-2 Demo
```
Input Image (224×224)
        ↓
ViT-B/16 Encoder ❄️ (87M params, frozen)
        ↓
Perceiver Resampler 🔥 (compresses to 64 tokens, trainable)
        ↓
Cross-Attention Layers 🔥 (at layers 3, 6, 9, trainable)
        ↓
GPT-2 ❄️ (124M params, frozen)
        ↓
Generated Caption
```
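The resampler step above compresses the encoder's 197 output tokens (196 patches plus CLS for ViT-B/16 at 224×224) down to 64 via learned latent queries that cross-attend to the patch tokens. A minimal sketch, assuming plain multi-head attention without the feed-forward sublayers a full Perceiver block would carry:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Sketch of a Perceiver-style resampler: a fixed set of learned latent
    queries cross-attends to the ViT tokens, yielding 64 visual tokens
    regardless of the input sequence length."""
    def __init__(self, dim=768, num_latents=64, n_heads=8, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(depth)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, vit_tokens):                      # (B, 197, 768) for ViT-B/16
        batch = vit_tokens.size(0)
        x = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for attn in self.layers:
            attended, _ = attn(x, vit_tokens, vit_tokens)
            x = x + attended                            # residual update of the latents
        return self.norm(x)                             # (B, 64, 768)

resampler = PerceiverResampler()
visual_tokens = resampler(torch.randn(2, 197, 768))
print(visual_tokens.shape)  # torch.Size([2, 64, 768])
```

Fixing the visual sequence at 64 tokens keeps the cross-attention cost in the GPT-2 layers constant and small, which is what makes inserting fusion at only layers 3, 6, and 9 cheap.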
```bibtex
@misc{vision-gpt-flickr8k-2025,
  author       = {gurumurthy3},
  title        = {Vision-GPT: Multimodal Image Captioning with Sparse Cross-Attention},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/gurumurthy3/vision-gpt-flickr8k_v2}}
}
```
MIT License
Built with ❤️ for Smart India Hackathon 2025