Instructions to use RedHatAI/gemma-4-31B-it-NVFP4 with libraries and local apps.

Libraries

How to use RedHatAI/gemma-4-31B-it-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RedHatAI/gemma-4-31B-it-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so it is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("RedHatAI/gemma-4-31B-it-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("RedHatAI/gemma-4-31B-it-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use RedHatAI/gemma-4-31B-it-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RedHatAI/gemma-4-31B-it-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
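
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000. The helper names `build_chat_payload` and `post_chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "RedHatAI/gemma-4-31B-it-NVFP4"


def build_chat_payload(prompt, image_url, model=MODEL):
    """Build the same JSON body as the curl example: one text part, one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload, base_url="http://localhost:8000", timeout=600):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(post_chat(build_chat_payload("Describe this image in one sentence.", "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")))` mirrors the curl call. Passing `base_url="http://localhost:30000"` targets a server on the SGLang port instead.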

SGLang

How to use RedHatAI/gemma-4-31B-it-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RedHatAI/gemma-4-31B-it-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RedHatAI/gemma-4-31B-it-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/gemma-4-31B-it-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use RedHatAI/gemma-4-31B-it-NVFP4 with Docker Model Runner:
```
docker model run hf.co/RedHatAI/gemma-4-31B-it-NVFP4
```

Running Gemma 4 Truthfully at 128K on One RTX 5090

by Mosai-Sys - opened Apr 14

Discussion

Mosai-Sys

Apr 14

Very grateful to the Red Hat AI team and the open-source community for making this model available.

We got the original RedHatAI/gemma-4-31B-it-NVFP4 running at a genuine 131072-token context on a single RTX 5090 32 GB, without changing the base model. The key was not a “magic trick” but adapting the deployment profile to the hardware: an fp8 KV cache, no offload, and a validated local serving setup.

On this hardware, the shipping profile reached a median TTFT of 53.758 s, a median prefill rate of 2454.654 tok/s, and a median decode rate of 44.588 tok/s, with 5 of 5 shipping runs passing. Thinking is supported as a separate secondary profile.
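
As a quick consistency check on those figures, multiplying the median TTFT by the median prefill rate gives the prompt size they imply. This is only a sketch: the post does not state the prompt length used for the TTFT measurement, the calculation assumes TTFT is dominated by prefill, and two separately taken medians need not multiply out exactly.

```python
# Figures quoted in the post
ttft_s = 53.758            # median time to first token, seconds
prefill_tok_s = 2454.654   # median prefill throughput, tokens/second
context_len = 131072       # --max-model-len

# Assumption: time to first token is roughly (prompt tokens) / (prefill rate)
implied_prompt_tokens = ttft_s * prefill_tok_s
ratio = implied_prompt_tokens / context_len

print(f"implied prompt tokens: {implied_prompt_tokens:,.0f}")
print(f"fraction of context:   {ratio:.3f}")
```

The product comes out within about 1% of 131072, which is consistent with the TTFT having been measured on a prompt that fills nearly the whole context window.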

Main takeaway: these models can often do more locally than people expect, but the runtime has to be tuned to the machine you actually have. One setup will not fit every GPU.

bdellabe

Red Hat AI org Apr 15

Hi @Mosai-Sys, thanks for the details! Glad to hear you were able to use it and tune it to suit your needs.

lolcatsnin

Apr 21

Would love more information on your setup!

Mosai-Sys

Apr 21

128K Local Setup

This release does not modify the model weights. It documents a validated local deployment setup for RedHatAI/gemma-4-31B-it-NVFP4 at a context length of 131072 tokens on a single RTX 5090 32 GB using vLLM.

The 128K result was obtained with a no-offload serving configuration using:
--max-model-len 131072, --tensor-parallel-size 1, --max-num-seqs 1, --gpu-memory-utilization 0.90, --quantization compressed-tensors, --kv-cache-dtype fp8, --no-calculate-kv-scales, --max-num-batched-tokens 640, --async-scheduling, --no-enable-prefix-caching, and --limit-mm-per-prompt image=1,video=1.

vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --quantization compressed-tensors \
  --max-model-len 131072 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --no-calculate-kv-scales \
  --max-num-batched-tokens 640 \
  --async-scheduling \
  --no-enable-prefix-caching \
  --limit-mm-per-prompt image=1,video=1
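
To illustrate why `--gpu-memory-utilization 0.90` and `--kv-cache-dtype fp8` matter on a 32 GB card, here is a rough sketch of the arithmetic. It assumes the KV cache would otherwise be stored in a 16-bit dtype, and it takes the number of cached K/V elements per token as an input, because that figure depends on the model architecture and is not given in the post. The function names are illustrative.

```python
GPU_TOTAL_GB = 32.0   # RTX 5090 memory as stated in the post
GPU_MEM_UTIL = 0.90   # --gpu-memory-utilization

# Bytes per cached element; "bf16" stands in for an assumed 16-bit default.
BYTES_PER_ELEM = {"bf16": 2, "fp8": 1}


def vllm_memory_cap_gb(total_gb=GPU_TOTAL_GB, util=GPU_MEM_UTIL):
    """Upper bound on GPU memory the vLLM instance may use (weights + KV cache + rest)."""
    return total_gb * util


def kv_cache_bytes(num_tokens, elems_per_token, dtype):
    """KV cache size for num_tokens cached tokens.

    elems_per_token is architecture-dependent and must be supplied by the
    caller; no real Gemma 4 value is assumed here.
    """
    return num_tokens * elems_per_token * BYTES_PER_ELEM[dtype]


print(f"vLLM memory cap: {vllm_memory_cap_gb():.1f} GB")
```

Whatever the per-token element count is, `kv_cache_bytes(n, e, "fp8")` is half of `kv_cache_bytes(n, e, "bf16")`, so an fp8 cache fits roughly twice as many tokens into whatever remains of the cap after the weights are loaded.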

For tool calling, add:

--tool-call-parser gemma4 --enable-auto-tool-choice

For thinking mode, use a separate secondary profile and add:

--reasoning-parser gemma4 --tool-call-parser gemma4 --enable-auto-tool-choice
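
One way to keep the base, tool-calling, and thinking profiles in sync is to build the command line from a single flag list. The sketch below copies the flags from the commands above; the `build_serve_argv` helper and the constant names are illustrative and not part of vLLM.

```python
import shlex

MODEL = "RedHatAI/gemma-4-31B-it-NVFP4"

# Base flags copied from the serve command above.
BASE_FLAGS = [
    "--quantization", "compressed-tensors",
    "--max-model-len", "131072",
    "--tensor-parallel-size", "1",
    "--max-num-seqs", "1",
    "--gpu-memory-utilization", "0.90",
    "--kv-cache-dtype", "fp8",
    "--no-calculate-kv-scales",
    "--max-num-batched-tokens", "640",
    "--async-scheduling",
    "--no-enable-prefix-caching",
    "--limit-mm-per-prompt", "image=1,video=1",
]
TOOL_FLAGS = ["--tool-call-parser", "gemma4", "--enable-auto-tool-choice"]
THINKING_FLAGS = ["--reasoning-parser", "gemma4"] + TOOL_FLAGS


def build_serve_argv(tools=False, thinking=False):
    """Return the argv list for `vllm serve` under the chosen profile."""
    argv = ["vllm", "serve", MODEL] + BASE_FLAGS  # new list, BASE_FLAGS is not mutated
    if thinking:
        argv += THINKING_FLAGS  # the thinking profile already includes the tool flags
    elif tools:
        argv += TOOL_FLAGS
    return argv


# Print a shell-ready command for the thinking profile.
print(shlex.join(build_serve_argv(thinking=True)))
```

The returned list can be passed directly to `subprocess.Popen`, which avoids shell quoting issues.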

For repeat local runs, startup is faster when the Hugging Face cache and the vLLM compile cache are kept on Linux/Docker-native storage and a running server is reused instead of being cold-started for each run.
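
Reusing a running server can be as simple as probing it before launching a new one. The following is a minimal sketch that assumes the server exposes `GET /health` on localhost:8000, as vLLM's OpenAI-compatible server does; the `server_is_up` helper name is illustrative.

```python
import urllib.request


def server_is_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False
```

A launcher script can call `server_is_up()` first and only start `vllm serve` when it returns False.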
