Text performance compared to GLM-4.5 Air

#1
by Dampfinchen - opened

Hello,

Thank you for releasing this model. How's the performance on text-only tasks compared to GLM-4.5 Air? Ideally, text performance would be the same or better. Open source badly needs general-purpose models that excel at both multimodal and text-only tasks at the same time.


It's probably worse than 4.5 Air. The size is the same, and this model has vision capability, so there has to be some sacrifice in text-only tasks.

The size is not the same - 4.5V has an extra vision tower on top of the language model.


Strangely, 4.5V has 46 layers in its language model, compared to the 47 layers in 4.5-Air.

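For anyone who wants to check the layer counts themselves, here's a minimal sketch that reads each repo's `config.json` from the Hub. The repo IDs and the nested `text_config` key are my assumptions about the config layout, so verify them against the actual files.

```python
# Minimal sketch: compare language-model layer counts via config.json.
# Repo IDs and the "text_config" nesting are assumptions, not verified.
import json
from huggingface_hub import hf_hub_download

def num_layers(repo_id: str) -> int:
    path = hf_hub_download(repo_id, "config.json")
    with open(path) as f:
        cfg = json.load(f)
    # Multimodal configs often nest the LM settings under "text_config".
    cfg = cfg.get("text_config", cfg)
    return cfg["num_hidden_layers"]

for repo in ("zai-org/GLM-4.5V", "zai-org/GLM-4.5-Air"):
    print(repo, num_layers(repo))
```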

The last layer of GLM-4.5 Air is the MTP (Multi-Token Prediction) layer, i.e. a speculative decoding layer: it predicts what the model will output next, accelerating inference when the prediction is accurate enough.

This was introduced in DeepSeek-V3 (https://arxiv.org/pdf/2412.19437v1), and NVIDIA covers it briefly in https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/multi_token_prediction.html
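To make the idea concrete, here's a toy greedy sketch of the verify-and-accept loop behind speculative decoding. This is not GLM's or DeepSeek's actual implementation; `draft` (standing in for the MTP head) and `target` (the full model) are hypothetical callables that return a greedy next-token id.

```python
# Toy greedy speculative decoding: the cheap draft proposes k tokens,
# the full model verifies them, and the longest agreeing prefix is kept.
from typing import Callable, List

def speculative_step(prompt: List[int],
                     draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # 1) Draft k tokens autoregressively with the cheap predictor.
    ctx = list(prompt)
    proposed: List[int] = []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) Verify with the full model. A real implementation scores all k
    #    positions in ONE forward pass; calling target per position here
    #    is just for clarity.
    ctx = list(prompt)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # keep the target's token, stop here
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```

If every draft token is accepted, you get k tokens for roughly the price of one full-model pass, which is where the speedup comes from.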
