
Can't believe it, but it seems 256M is slower than InternVL-1B?

#25
by josefph - opened

As the title says, it's hard to believe that smolvlm-256M-instruct is slower than internvl-1B. Even after inspecting the input embeddings and parameter counts, I still can't figure out why:

internvl-1B >
inp_embed : (1, 547, 896)
trainable params: 17,596,416 || all params: 647,260,288 || trainable%: 2.7186

smolvlm-256M >
inp_embed : (1, 171, 576)
trainable params: 9,768,960 || all params: 172,742,976 || trainable%: 5.6552

Does anyone else see the same issue?

[Attached image: inference latency in ms on an auto-driving sample]
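For reference, this is roughly the kind of measurement involved; a minimal sketch assuming the checkpoint is loaded through transformers' AutoProcessor / AutoModelForVision2Seq (the model id and image path below are placeholders, and the InternVL side is omitted since it ships its own remote code):

```python
import time
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed checkpoint; swap in whatever you are actually benchmarking.
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

image = Image.open("driving_scene.jpg")  # placeholder path for the test image
messages = [{
    "role": "user",
    "content": [{"type": "image"},
                {"type": "text", "text": "Describe the driving scene."}],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

# Total parameter count, comparable to the "all params" figure quoted above.
print("all params:", sum(p.numel() for p in model.parameters()))

# Shape of the text-side input embeddings after the processor has expanded
# the image placeholder into per-patch tokens.
with torch.no_grad():
    embeds = model.get_input_embeddings()(inputs["input_ids"])
print("inp_embed:", tuple(embeds.shape))

# Simple wall-clock timing of a single generate call.
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
if device == "cuda":
    torch.cuda.synchronize()
print(f"generate: {time.perf_counter() - start:.2f}s")
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```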

I guess this is a known issue. I was working with the 2B version, and the same slowness exists there as well. It is slower compared to other models of a similar parameter count.
