smolvlm2-500M-illustration-description

An illustration description model that generates rich, detailed image descriptions.
Fine-tuned from HuggingFaceTB/SmolVLM2-500M-Video-Instruct.

Uses

This model generates descriptions of illustrations and can answer simple questions about their content.

Suggested prompts:

  • Write a descriptive caption for this image in a formal tone.
  • Write a descriptive caption for this image in a casual tone.
  • Analyze this image like an art critic would with information about its composition, style, symbolism, the use of color, light, any artistic movement it might belong to, etc.
  • What color is the hair of the character?
  • What are the characters wearing?
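
Any of these prompts can be paired with an image in the chat-message format that the processor's chat template consumes. The following is a minimal sketch; `build_messages`, `SUGGESTED_PROMPTS`, and the example URL are hypothetical names introduced here for illustration and are not part of the model or its libraries.

```python
# Hypothetical helper (not part of the model card or any library): builds the
# single-turn message list that processor.apply_chat_template accepts.
SUGGESTED_PROMPTS = {
    "formal": "Write a descriptive caption for this image in a formal tone.",
    "casual": "Write a descriptive caption for this image in a casual tone.",
    "hair": "What color is the hair of the character?",
}


def build_messages(image_url: str, prompt: str) -> list[dict]:
    """Return a conversation with one user turn: an image followed by a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder URL, for illustration only.
messages = build_messages(
    "https://example.com/illustration.jpg", SUGGESTED_PROMPTS["casual"]
)
```

The resulting `messages` list has the same shape as the one used in the getting-started code, so it can be passed to `processor.apply_chat_template` unchanged.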

How to Get Started with the Model

from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

model_name = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
adapter_name = "xco2/smolvlm2-500M-illustration-description"

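# Load the base model in bfloat16.
# flash_attention_2 requires the flash-attn package and a supported GPU;
# drop this argument to fall back to the default attention implementation.
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
)
# Attach the fine-tuned adapter weights to the base model.
model = PeftModel.from_pretrained(model, adapter_name)

processor = AutoProcessor.from_pretrained(model_name)

model = model.to('cuda').to(torch.bfloat16)
# Merge the adapter into the base weights and switch to inference mode.
model = model.merge_and_unload().eval()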

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image",
             "url": "https://cdn.donmai.us/sample/63/e7/__castorice_honkai_and_1_more_drawn_by_yolanda__sample-63e73017612352d472b24056e501656d.jpg"},
            {"type": "text",
             "text": "Write a descriptive caption for this image in a formal tone."},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=2048)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
# The decoded text includes the prompt; keep only the assistant's reply.
print("Assistant:", generated_texts[0].split("Assistant:")[-1].strip())

Training Details

Training Data

Image description data:

  1. Used a quantized fancyfeast/joy-caption-pre-alpha model to describe approximately 100,000 illustrations using multiple prompts.
  2. Filtered out meaningless descriptions in which the model produced repetitive phrases.
  3. Generated Q&A data about the content of the illustrations from the generated descriptions, using qwen3-12B.

In total, about 240,000 training entries were obtained.
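
The card does not say how repetitive descriptions were detected in step 2. One common heuristic is to measure how many word n-grams in a caption repeat an earlier n-gram and discard captions above a threshold. The sketch below illustrates that idea only; the function names, the n-gram size, and the 0.3 threshold are assumptions, not the author's actual filter.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an n-gram seen earlier in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def is_degenerate(text: str, n: int = 3, threshold: float = 0.3) -> bool:
    """Flag captions whose repeated n-gram ratio exceeds the (assumed) threshold."""
    return repeated_ngram_ratio(text, n) > threshold


# Made-up captions, for illustration only.
captions = [
    "A girl with long silver hair stands beneath a cherry tree at dusk.",
    "The image shows a girl. The image shows a girl. "
    "The image shows a girl. The image shows a girl.",
]
kept = [caption for caption in captions if not is_degenerate(caption)]
```

Here the first caption has no repeated trigrams and is kept, while the looping second caption is dropped. A real pipeline would need the threshold tuned on a sample of the generated descriptions.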

Framework versions

  • PEFT 0.15.2