smolvlm2-500M-illustration-description
An illustration description generation model that provides richer image descriptions
Fine-tuned from HuggingFaceTB/SmolVLM2-500M-Video-Instruct.
Uses
This model generates descriptions of illustrations and can answer simple questions about illustration content.
Suggested prompts:
- Write a descriptive caption for this image in a formal tone.
- Write a descriptive caption for this image in a casual tone.
- Analyze this image like an art critic would with information about its composition, style, symbolism, the use of color, light, any artistic movement it might belong to, etc.
- What color is the hair of the character?
- What are the characters wearing?
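The suggested prompts can be paired with an image in the chat-message layout that the processor's chat template expects. The sketch below is a minimal illustration: `SUGGESTED_PROMPTS` and `build_messages` are hypothetical names introduced here, not part of the model or of `transformers`.

```python
# Hypothetical helper names; only the message layout comes from the model card.
SUGGESTED_PROMPTS = {
    "formal": "Write a descriptive caption for this image in a formal tone.",
    "casual": "Write a descriptive caption for this image in a casual tone.",
    "critic": (
        "Analyze this image like an art critic would with information about "
        "its composition, style, symbolism, the use of color, light, any "
        "artistic movement it might belong to, etc."
    ),
}


def build_messages(image_url: str, prompt: str) -> list:
    """Build a single-turn user message holding one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed as `messages` to `processor.apply_chat_template`. Free-form questions such as "What color is the hair of the character?" can be passed as `prompt` directly.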
How to Get Started with the Model
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

model_name = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
adapter_name = "xco2/smolvlm2-500M-illustration-description"

# Load the base model in bfloat16.
# flash_attention_2 requires the flash-attn package and a supported GPU.
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Attach the fine-tuned adapter to the base model.
model = PeftModel.from_pretrained(model, adapter_name)
processor = AutoProcessor.from_pretrained(model_name)

# Move to the GPU, cast every weight (adapter included) to bfloat16,
# then merge the adapter into the base weights for inference.
model = model.to("cuda").to(torch.bfloat16)
model = model.merge_and_unload().eval()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image",
             "url": "https://cdn.donmai.us/sample/63/e7/__castorice_honkai_and_1_more_drawn_by_yolanda__sample-63e73017612352d472b24056e501656d.jpg"},
            {"type": "text",
             "text": "Write a descriptive caption for this image in a formal tone."},
        ],
    },
]

# Tokenize the conversation and fetch the image in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=2048)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

# The decoded text includes the prompt; keep only the assistant's reply.
print("Assistant:", generated_texts[0].split("Assistant:")[-1])
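The final line above separates the reply from the prompt by splitting on the `Assistant:` marker. A small helper can make that step reusable; the sketch below is one possible version. It assumes a single-turn conversation and that the decoded text contains the `Assistant:` marker, and `extract_reply` is a hypothetical name rather than a `transformers` function.

```python
def extract_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the first marker, stripped of surrounding whitespace.

    If the marker is absent, the whole string is returned (stripped).
    Splitting on the FIRST marker assumes a single-turn conversation, so a
    reply that itself contains the marker text is kept intact.
    """
    _, found, reply = decoded.partition(marker)
    return reply.strip() if found else decoded.strip()
```

With the code above, `extract_reply(generated_texts[0])` would return the caption without the leading prompt text.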
Training Details
Training Data
Image description data:
- Used a quantized fancyfeast/joy-caption-pre-alpha model with multiple prompts to describe approximately 100,000 illustrations.
- Filtered out degenerate descriptions in which the model repeated the same phrases.
- Used qwen3-12B to generate Q&A data about the illustration content from the generated descriptions.
In total, this produced about 240,000 training examples.
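The card does not say how repetitive descriptions were detected. As an illustration only, one common heuristic is to measure how many word n-grams in a caption occur more than once. The function names, the n-gram size, and the threshold below are all assumptions, not the author's actual filter.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams in `text` that occur more than once."""
    words = text.lower().split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def is_degenerate(text: str, n: int = 4, threshold: float = 0.3) -> bool:
    """Flag a caption whose repeated n-gram ratio exceeds an assumed threshold."""
    return repeated_ngram_ratio(text, n) > threshold
```

A caption that loops on one phrase scores close to 1.0, while ordinary prose usually scores near 0.0. Any real threshold would need tuning against sample outputs.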
Framework versions
- PEFT 0.15.2