What is it for?
Can someone explain some use cases for this model? Should we just replace the main Gemma 4 31B with this model if it's faster? Does it work for every task or only for some specific ones? Thank you.
This is a speculative decoding model. It doesn't work independently; it works together with the main 31B model.
Can I run the 31B model on an RX 7900 XTX while running the assistant on the CPU? How much overhead would there be if I ran the assistant on the GPU instead?
What I understand is that this model works as an assistant to the 31B model. It suggests the next tokens to the 31B model, and then the 31B model verifies them and uses the valid ones to speed up generation.
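The propose-and-verify loop described above can be sketched with toy stand-ins for the two models. This is a minimal illustration of greedy speculative decoding, not how any particular runtime implements it: `toy_target`, `toy_drafter`, and the `draft_len` parameter are hypothetical names made up for the example, and a real implementation would verify all drafted positions in one batched forward pass of the target rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def greedy_decode(target: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: the target model generates one token per call."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextTokenFn, drafter: NextTokenFn, prompt: List[Token],
                       max_new_tokens: int, draft_len: int = 4) -> List[Token]:
    """Greedy speculative decoding: same output as greedy_decode, fewer target rounds."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The drafter cheaply proposes a short run of tokens.
        draft: List[Token] = []
        for _ in range(min(draft_len, goal - len(tokens))):
            draft.append(drafter(tokens + draft))

        # 2. The target checks the proposals in order. We always keep the
        #    target's own token, so a correct proposal is accepted as-is and
        #    the first wrong proposal is replaced and ends the round.
        all_accepted = True
        for proposed in draft:
            expected = target(tokens)
            tokens.append(expected)
            if proposed != expected:
                all_accepted = False
                break

        # 3. If every proposal was accepted, the target contributes one more token.
        if all_accepted and len(tokens) < goal:
            tokens.append(target(tokens))
    return tokens


# Hypothetical toy models: the drafter agrees with the target most of the time.
def toy_target(tokens: List[Token]) -> Token:
    return (tokens[-1] * 3 + 1) % 11


def toy_drafter(tokens: List[Token]) -> Token:
    guess = (tokens[-1] * 3 + 1) % 11
    # Deliberately wrong whenever the last token is divisible by 4.
    return guess if tokens[-1] % 4 else (guess + 1) % 11
```

Because the target's token is the one that is kept at every position, a wrong guess from the drafter only costs speed. It never changes the generated text.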
I run gemma-4-31b-it-q4_k_m.gguf on an RTX 3090 while offloading about 15 layers to the CPU (I built llama.cpp locally on an Ubuntu 22.04 desktop). With that setup, a 128K context window worked.
Hope my experience helps you.
Hi @Tikhonum, great question! To improve the inference speed of the Gemma 4 models, a new series of autoregressive drafter models has been released, one alongside each corresponding main model: E2B, E4B, 31B, and 26B-A4B.
The drafter model is not a replacement for the main (target) model. It is designed to work with its corresponding target model. It acts as an assistant by rapidly predicting multiple tokens ahead (multi-token prediction, MTP), and the target model then verifies the suggested tokens in parallel. This is called speculative decoding, and it significantly speeds up inference while maintaining output quality.
The drafter models can be used for the same use cases as the target models: text, audio, image, and video. Please refer to these resources 1, 2 for further details. Thank you.
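To get a rough feel for how much speculative decoding can help, one common back-of-the-envelope estimate is the expected number of tokens produced per verification pass of the target model. The sketch below assumes, as a simplification, that each drafted token is accepted independently with the same probability; the function name and the example acceptance rates are illustrative assumptions, not measurements for Gemma 4.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`. The pass always yields one token from the target
    (a correction or a bonus token), plus every drafted token accepted
    before the first rejection: 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Closed form of the geometric sum above.
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    for rate in (0.5, 0.8, 0.9):
        print(rate, expected_tokens_per_target_pass(rate, draft_len=4))
```

The real speedup is smaller than this number suggests, because the drafter's own runtime and the cost of the wider verification pass also have to be paid.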