Running Gemma 4 Truthfully at 128K on One RTX 5090

#4
by Mosai-Sys - opened

Very grateful to the Red Hat AI team and the open-source community for making this model available.

We were able to get the original RedHatAI/gemma-4-31B-it-NVFP4 running truthfully at 131072 context on a single RTX 5090 32 GB, without changing the base model. The key was not a “magic trick,” but adapting the deployment profile to the hardware: fp8 KV cache, no offload, and a validated local serving setup.

On this hardware, the shipping profile reached 53.758s median TTFT, 2454.654 tok/s median prefill, and 44.588 tok/s median decode, with 5/5 shipping runs passed. Thinking is supported as a separate secondary profile.

Main takeaway: these models can often do more locally than people expect, but the runtime has to be tuned to the machine you actually have. One setup will not fit every GPU.

Red Hat AI org

Hi @Mosai-Sys , thanks for the details! Glad to hear you were able to use it and tune to suit your needs

would love more information and your setup !

128K Local Setup

This release does not modify the model weights. It documents a validated local deployment setup for RedHatAI/gemma-4-31B-it-NVFP4 at 131072 context length on a single RTX 5090 32 GB using vLLM.

The 128K result was obtained with a no-offload serving configuration using:
--max-model-len 131072, --tensor-parallel-size 1, --max-num-seqs 1, --gpu-memory-utilization 0.90, --quantization compressed-tensors, --kv-cache-dtype fp8, --no-calculate-kv-scales, --max-num-batched-tokens 640, --async-scheduling, --no-enable-prefix-caching, and --limit-mm-per-prompt image=1,video=1.

vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --quantization compressed-tensors \
  --max-model-len 131072 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --no-calculate-kv-scales \
  --max-num-batched-tokens 640 \
  --async-scheduling \
  --no-enable-prefix-caching \
  --limit-mm-per-prompt image=1,video=1

For tool calling, add:

--tool-call-parser gemma4 --enable-auto-tool-choice

For thinking mode, use a separate secondary profile and add:

--reasoning-parser gemma4 --tool-call-parser gemma4 --enable-auto-tool-choice

For repeat local runs, startup is improved by keeping the Hugging Face cache and vLLM compile cache on Linux/Docker-native storage and reusing a running server instead of cold-starting each run.

Sign up or log in to comment