Running Gemma 4 Truthfully at 128K on One RTX 5090
Very grateful to the Red Hat AI team and the open-source community for making this model available.
We were able to get the original RedHatAI/gemma-4-31B-it-NVFP4 running truthfully at a 131072-token context on a single RTX 5090 32 GB, without changing the base model. The key was not a “magic trick,” but adapting the deployment profile to the hardware: fp8 KV cache, no offload, and a validated local serving setup.
On this hardware, the shipping profile reached 53.758 s median TTFT, 2454.654 tok/s median prefill, and 44.588 tok/s median decode, and all 5 of 5 shipping runs passed. Thinking is supported as a separate secondary profile.
Main takeaway: these models can often do more locally than people expect, but the runtime has to be tuned to the machine you actually have. One setup will not fit every GPU.
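As a rough sanity check on the figures above, the reported prefill rate and TTFT can be compared with simple arithmetic. This is only a sketch: the post does not state the prompt length used in the runs, so `ASSUMED_PROMPT_TOKENS` (a prompt that fills the whole 131072-token window) is an assumption, and medians of different quantities are not guaranteed to divide exactly.

```python
# Figures reported in the post above.
MEDIAN_TTFT_S = 53.758
MEDIAN_PREFILL_TOK_S = 2454.654
MEDIAN_DECODE_TOK_S = 44.588

# Assumption (not stated in the post): the prompt fills the full context window.
ASSUMED_PROMPT_TOKENS = 131072


def estimated_ttft(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to first token if prefill dominates and runs at a steady rate."""
    return prompt_tokens / prefill_tok_s


def estimated_decode_time(new_tokens: int, decode_tok_s: float) -> float:
    """Time to generate new_tokens once prefill has finished."""
    return new_tokens / decode_tok_s


ttft = estimated_ttft(ASSUMED_PROMPT_TOKENS, MEDIAN_PREFILL_TOK_S)
relative_gap = abs(ttft - MEDIAN_TTFT_S) / MEDIAN_TTFT_S
print(f"estimated TTFT: {ttft:.1f} s, gap vs reported: {relative_gap:.1%}")
```

Under that assumption the estimate lands at about 53.4 s, within 1% of the reported median TTFT, which suggests the two numbers describe the same near-full-context workload.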
Hi @Mosai-Sys, thanks for the details! Glad to hear you were able to use it and tune it to suit your needs.
Would love more information on your setup!
128K Local Setup
This release does not modify the model weights. It documents a validated local deployment setup for RedHatAI/gemma-4-31B-it-NVFP4 at 131072 context length on a single RTX 5090 32 GB using vLLM.
The 128K result was obtained with a no-offload serving configuration using: --max-model-len 131072, --tensor-parallel-size 1, --max-num-seqs 1, --gpu-memory-utilization 0.90, --quantization compressed-tensors, --kv-cache-dtype fp8, --no-calculate-kv-scales, --max-num-batched-tokens 640, --async-scheduling, --no-enable-prefix-caching, and --limit-mm-per-prompt image=1,video=1.
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--quantization compressed-tensors \
--max-model-len 131072 \
--tensor-parallel-size 1 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--no-calculate-kv-scales \
--max-num-batched-tokens 640 \
--async-scheduling \
--no-enable-prefix-caching \
--limit-mm-per-prompt image=1,video=1
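Once the server is up, TTFT can be measured from the client side with a streaming request. The sketch below is not the harness used for the numbers in this thread; it is a minimal standard-library example that assumes the server listens on vLLM's default `http://localhost:8000` and that text arrives in the `delta.content` field of the OpenAI-compatible stream. The helper names are made up for illustration.

```python
import json
import time
import urllib.request

# Assumptions: local server on vLLM's default port, non-thinking profile.
BASE_URL = "http://localhost:8000"
MODEL = "RedHatAI/gemma-4-31B-it-NVFP4"


def build_chat_request(prompt: str, max_tokens: int = 64) -> dict:
    """Body for a streaming /v1/chat/completions request."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }


def parse_sse_line(raw: bytes):
    """Return the text delta carried by one server-sent-event line, or None."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def measure_ttft(prompt: str, base_url: str = BASE_URL) -> float:
    """Seconds from sending the request to the first non-empty text delta."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            if parse_sse_line(raw):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any text was produced")
```

With the server running, `measure_ttft(long_prompt)` returns the wall-clock delay before the first token; repeating it a few times and taking the median gives a figure comparable in spirit to the TTFT quoted earlier.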
For tool calling, add:
--tool-call-parser gemma4 --enable-auto-tool-choice
For thinking mode, use a separate secondary profile and add:
--reasoning-parser gemma4 --tool-call-parser gemma4 --enable-auto-tool-choice
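To keep the base, tool-calling, and thinking variants from drifting apart, the flag sets above can be composed in one place. This is just one way to organize it: the flags are copied from this post, while the profile names `"tools"` and `"thinking"` and the function `build_serve_command` are hypothetical labels chosen for the example.

```python
import shlex

MODEL = "RedHatAI/gemma-4-31B-it-NVFP4"

# Flags of the 128K no-offload configuration, as listed in this post.
BASE_FLAGS = [
    "--quantization", "compressed-tensors",
    "--max-model-len", "131072",
    "--tensor-parallel-size", "1",
    "--max-num-seqs", "1",
    "--gpu-memory-utilization", "0.90",
    "--kv-cache-dtype", "fp8",
    "--no-calculate-kv-scales",
    "--max-num-batched-tokens", "640",
    "--async-scheduling",
    "--no-enable-prefix-caching",
    "--limit-mm-per-prompt", "image=1,video=1",
]
TOOL_FLAGS = ["--tool-call-parser", "gemma4", "--enable-auto-tool-choice"]
THINKING_FLAGS = ["--reasoning-parser", "gemma4"] + TOOL_FLAGS

# Profile names are illustrative; only "shipping" echoes the post's wording.
PROFILES = {
    "shipping": [],
    "tools": TOOL_FLAGS,
    "thinking": THINKING_FLAGS,
}


def build_serve_command(profile: str = "shipping") -> list[str]:
    """Return the argv for `vllm serve` under the chosen profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return ["vllm", "serve", MODEL, *BASE_FLAGS, *PROFILES[profile]]


# Print a copy-pasteable shell line for the thinking profile.
print(shlex.join(build_serve_command("thinking")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues with values such as `image=1,video=1`.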
For repeat local runs, startup is faster when the Hugging Face cache and the vLLM compile cache are kept on Linux/Docker-native storage and a running server is reused instead of cold-starting for each run.
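A small helper can make the "reuse a running server" habit automatic. This is a sketch under assumptions, not part of the validated setup: it assumes the server answers on `http://localhost:8000` and exposes a `/health` route, that `HF_HOME` and `VLLM_CACHE_ROOT` are the environment variables controlling the two cache locations, and the `/var/cache/...` paths are placeholders to replace with your own Linux-native directories.

```python
import os
import urllib.request

# Placeholder paths (assumption): point both caches at Linux-native storage.
CACHE_ENV = {
    "HF_HOME": "/var/cache/huggingface",
    "VLLM_CACHE_ROOT": "/var/cache/vllm",
}


def server_is_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """True if something answers 200 on the assumed /health route."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (URLError subclasses OSError).
        return False


def launch_env(cache_env: dict = CACHE_ENV) -> dict:
    """Copy of the current environment with the cache locations applied."""
    env = dict(os.environ)
    env.update(cache_env)
    return env


if server_is_up():
    print("server already running, reusing it")
else:
    print("no server detected; start one with env=launch_env() to keep caches warm")
```

Passing `env=launch_env()` to `subprocess.Popen` when starting `vllm serve` keeps downloads and compile artifacts in the same place across runs, so only the first start pays the full cold-start cost.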