smolvlm2-500M-illustration-description

An illustration description model that generates richer image descriptions.
Fine-tuned from HuggingFaceTB/SmolVLM2-500M-Video-Instruct.

Uses

This model generates descriptions of illustrations and can answer simple questions about illustration content.

Suggested prompts:

Write a descriptive caption for this image in a formal tone.
Write a descriptive caption for this image in a casual tone.
Analyze this image like an art critic would with information about its composition, style, symbolism, the use of color, light, any artistic movement it might belong to, etc.
What color is the hair of the character?
What are the characters wearing?

How to Get Started with the Model

from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

model_name = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
adapter_name = "xco2/smolvlm2-500M-illustration-description"

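# Load the base model in bfloat16. flash_attention_2 requires the flash-attn
# package; drop the argument to fall back to the default attention backend.
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
# Attach the fine-tuned adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, adapter_name)

processor = AutoProcessor.from_pretrained(model_name)

# Move to GPU, then fold the adapter into the base weights for inference.
model = model.to('cuda').to(torch.bfloat16)
model = model.merge_and_unload().eval()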

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image",
             "url": "https://cdn.donmai.us/sample/63/e7/__castorice_honkai_and_1_more_drawn_by_yolanda__sample-63e73017612352d472b24056e501656d.jpg"},
            {"type": "text",
             "text": "Write a descriptive caption for this image in a formal tone."},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=2048)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print("Assistant:", generated_texts[0].split("Assistant:")[-1])
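To try the other suggested prompts against the same image, the steps above can be wrapped in small helpers. The following is a minimal sketch: the names build_messages, extract_reply, and ask are hypothetical and not part of this model or of transformers, and the default max_new_tokens=256 is an assumed value for short Q&A answers.

```python
SUGGESTED_PROMPTS = [
    "Write a descriptive caption for this image in a formal tone.",
    "What color is the hair of the character?",
    "What are the characters wearing?",
]


def build_messages(image_url, prompt):
    """Build a single-turn chat message pairing one image with one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def extract_reply(decoded_text):
    """Return the text after the last 'Assistant:' marker, without padding whitespace."""
    return decoded_text.split("Assistant:")[-1].strip()


def ask(model, processor, image_url, prompt, max_new_tokens=256):
    """Run one prompt against one image and return the model's reply.

    Hypothetical helper: it repeats the calls from the snippet above and
    expects the already loaded `model` and `processor` as arguments.
    """
    inputs = processor.apply_chat_template(
        build_messages(image_url, prompt),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device, dtype=model.dtype)
    generated_ids = model.generate(
        **inputs, do_sample=True, max_new_tokens=max_new_tokens
    )
    decoded = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return extract_reply(decoded[0])
```

With the model and processor loaded as above, a loop such as `for p in SUGGESTED_PROMPTS: print(p, "->", ask(model, processor, image_url, p))` would print one answer per prompt, where image_url is any illustration URL you supply.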

Training Details

Training Data

Image description data:

Used the quantized fancyfeast/joy-caption-pre-alpha model to describe approximately 100,000 illustrations with multiple prompts.
Filtered out meaningless descriptions in which the model repeated the same phrases.
Generated Q&A pairs about the illustration content from the resulting descriptions using qwen3-12B.
This yielded about 240,000 training examples in total.
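The exact filter used to drop repetitive descriptions is not documented here. As an illustration only, one common heuristic is to measure how many word n-grams in a description are repeats and discard descriptions above a threshold. The function names, the n-gram size of 4, and the threshold of 0.3 below are all assumptions, not the values used for this dataset.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=4):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Text shorter than n words cannot contain a repeated n-gram.
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)


def is_repetitive(text, n=4, threshold=0.3):
    """Flag a description whose repeated n-gram ratio exceeds the threshold."""
    return repeated_ngram_ratio(text, n) > threshold


def filter_descriptions(descriptions, n=4, threshold=0.3):
    """Keep only the descriptions that are not flagged as repetitive."""
    return [d for d in descriptions if not is_repetitive(d, n, threshold)]
```

A description that loops on one phrase scores close to 1.0 and is dropped, while ordinary prose scores near 0.0 and is kept. The threshold would need tuning on real captions.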