Kimi-K2.5 INT4 compressed-tensors currently blocked on NVIDIA GB10 / SM 12.1 (vLLM + SGLang)
vLLM issue draft: Kimi-K2.5 compressed-tensors on GB10 / SM 12.1 still binds to Marlin and returns degenerate token repetition
Summary
moonshotai/Kimi-K2.5 on a 16-node DGX Spark cluster with NVIDIA GB10 (SM 12.1) is not usable through public vLLM builds.
On vllm/vllm-openai:nightly, a low-memory launch profile is stable enough to reach:
- /health = 200
- /v1/models returns kimi-k2.5
But completions still collapse into degenerate token repetition:
The capital of France is -> foss foss foss ...
Additionally, an explicit --moe-backend triton request reaches the vllm serve command line, but runtime still resolves the model path to Marlin:
Using CompressedTensorsWNA16MarlinMoEMethod
Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)
This looks like either:
- the compressed-tensors Kimi path ignoring the requested MoE backend, or
- Marlin WNA16 MoE producing numerically incorrect results on GB10 / SM 12.1.
Hardware
- 16x DGX Spark
- GPU: NVIDIA GB10
- Compute capability: SM 12.1
Model
- moonshotai/Kimi-K2.5
- revision: 54383e83fa343a1331754112fb9e3410c55efa2f
- compressed-tensors
- pack-quantized
- group_size=32
- num_bits=4
- type=int
Model integrity was verified across all 16 nodes:
- 64/64 shards present
- safetensors total size consistent with the index
We also checked a representative packed expert tensor and its BF16 scales inside the live container:
- shard: model-00030-of-000064.safetensors
- tensor: language_model.model.layers.29.mlp.experts.0.gate_proj.weight_packed
- scale: language_model.model.layers.29.mlp.experts.0.gate_proj.weight_scale
- observed:
  - packed shape [2048, 896], dtype torch.int32
  - packed min/max -2129572924 / 2147462455
  - scale shape [2048, 224], dtype torch.bfloat16
  - scale min/max 0.002411 / 0.014526
  - scale mean/std 0.005768 / 0.001194
  - no NaN, no Inf, not all-zero
So this does not look like a corrupted cache or obviously broken scale metadata.
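The reported shapes can also be cross-checked against the quantization settings. The sketch below assumes that each int32 word holds 32 / num_bits packed values and that there is one scale per group of group_size input values. Those layout rules are an assumption about the pack-quantized format, not something taken from the vLLM or compressed-tensors source.

```python
# Shapes and settings copied from the report above.
PACKED_SHAPE = (2048, 896)  # int32 words per output row
SCALE_SHAPE = (2048, 224)   # BF16 scales per output row
NUM_BITS = 4
GROUP_SIZE = 32


def expected_scale_shape(packed_shape, num_bits, group_size):
    """Derive the scale shape implied by a packed weight shape.

    Assumes (hypothetically) that every 32-bit word stores 32 // num_bits
    values and that each group of `group_size` values shares one scale.
    """
    rows, words = packed_shape
    values_per_word = 32 // num_bits
    in_features = words * values_per_word
    if in_features % group_size != 0:
        raise ValueError("input width is not a multiple of group_size")
    return (rows, in_features // group_size)


derived = expected_scale_shape(PACKED_SHAPE, NUM_BITS, GROUP_SIZE)
print("derived scale shape:", derived)
print("matches reported scale shape:", derived == SCALE_SHAPE)
```

Under these assumptions, 896 words of 8 values each give 7168 input values per row, and 7168 / 32 = 224 groups, which agrees with the observed scale shape.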
Dequantization metadata verified
The packed weights and BF16 scales for a representative MoE expert tensor were
read directly from the snapshot inside the running vLLM container.
Values are within the expected numerical range for INT4 pack-quantized
compressed-tensors format:
- packed int32 range covers the full representation space, which is expected for bit-packed INT4 storage
- BF16 scales are in the [0.0024, 0.0145] range with a non-degenerate distribution
- no NaN, no Inf, not all-zero
This indicates that the degenerate output is unlikely to be caused by broken
weights or broken dequantization metadata, at least for the tensor we sampled.
The input to the Marlin WNA16 MoE kernel looks correct, which points at the
kernel path producing numerically incorrect output on GB10 / SM 12.1.
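To illustrate why extreme int32 values are unremarkable for bit-packed INT4 storage, here is a minimal sketch that splits a signed 32-bit word into eight 4-bit fields. The lowest-bits-first field order is an assumption made for illustration. The actual compressed-tensors layout, including any offset applied to the stored values, is not verified here.

```python
def unpack_int4(word):
    """Split one signed int32 word into eight unsigned 4-bit fields.

    Fields are returned lowest bits first (an illustrative assumption,
    not a statement about the real compressed-tensors layout).
    """
    unsigned = word & 0xFFFFFFFF  # reinterpret two's complement as uint32
    return [(unsigned >> (4 * i)) & 0xF for i in range(8)]


# The packed min/max values observed in the report.
for word in (-2129572924, 2147462455):
    fields = unpack_int4(word)
    print(word, "->", fields)
```

Every possible int32 value decomposes into eight fields in the range 0..15, so a packed min/max near the int32 limits carries no information about weight corruption by itself.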
Images tested
Known-bad public builds
- nvcr.io/nvidia/vllm:26.03-py3
  - image id: sha256:4c5e61c590207edb771c294014193c719ca9eee330c0b51756f4b6a25951360d
  - symptom: service runs but outputs are corrupted (foss foss foss ...)
- vllm/vllm-openai:nightly
  - image id: sha256:d78917343e4159618d8fd766d800809246a68b56cf132464b0eeec91a23bc5ca
  - torch.cuda.get_arch_list(): ['sm_80', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
For cluster rollout we built a derived image with Ray installed:
big3/vllm-openai:nightly-ray
Stable low-memory profile
The nightly image needed a reduced-memory profile to avoid tail-worker instability:
--max-model-len 32768
--gpu-memory-utilization 0.85
--enforce-eager
--max-num-batched-tokens 512
With that profile, /health and /v1/models work, but generation still collapses into degenerate token repetition.
Launch command shape
We launch with:
vllm serve /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
--served-model-name kimi-k2.5 \
--tensor-parallel-size 16 \
--distributed-executor-backend ray \
--trust-remote-code \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--enforce-eager \
--max-num-batched-tokens 512 \
--disable-custom-all-reduce \
--host 0.0.0.0 --port 8001
Reproduction result
After startup:
curl -s http://127.0.0.1:8001/health
curl -s http://127.0.0.1:8001/v1/models
curl -s http://127.0.0.1:8001/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"kimi-k2.5",
"prompt":"The capital of France is",
"max_tokens":10,
"temperature":0.0
}'
Observed completion text:
foss foss foss foss foss foss foss foss foss foss
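For anyone scripting this check across builds, the following is a minimal sketch of the same request plus a crude repetition test. The helper names, the 0.8 threshold, and the whitespace tokenization are hypothetical choices for illustration. The request body mirrors the curl call above, and the response is assumed to follow the OpenAI-compatible completions shape (choices[0].text).

```python
import json
import urllib.request
from collections import Counter


def is_degenerate(text, min_tokens=4, threshold=0.8):
    """Return True if one whitespace-separated token dominates the text."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens) >= threshold


def complete(base_url, prompt, model="kimi-k2.5", max_tokens=10):
    """Send one greedy completion request and return the generated text."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["choices"][0]["text"]


# The completion text observed in this report is flagged as degenerate.
print(is_degenerate("foss " * 10))
```

Against a live server, `is_degenerate(complete("http://127.0.0.1:8001", "The capital of France is"))` would give a pass/fail signal per image or launch profile.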
Triton override attempt
We patched our launcher so that --moe-backend triton is explicitly present in the final vllm serve command line.
Confirmed running command line included:
--moe-backend triton
But runtime logs still showed:
Using CompressedTensorsWNA16MarlinMoEMethod
Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)
So the requested backend does not appear to take effect for this Kimi compressed-tensors path.
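The mismatch between the requested and the resolved backend can be detected mechanically from the launcher command line and the startup log. This is a sketch only: the regular expression is based solely on the two log lines quoted above, other vLLM versions may word them differently, and the function names are hypothetical.

```python
import re


def requested_moe_backend(argv):
    """Return the value passed to --moe-backend, or None if absent."""
    for i, arg in enumerate(argv):
        if arg == "--moe-backend" and i + 1 < len(argv):
            return argv[i + 1].lower()
        if arg.startswith("--moe-backend="):
            return arg.split("=", 1)[1].lower()
    return None


def resolved_moe_backend(log_text):
    """Extract the backend named in a 'Using X backend for WNA16 MoE' line."""
    match = re.search(r"Using (\w+) backend for WNA16 MoE", log_text)
    return match.group(1).lower() if match else None


def backend_mismatch(argv, log_text):
    """True when a backend was requested and the log names a different one."""
    requested = requested_moe_backend(argv)
    resolved = resolved_moe_backend(log_text)
    return requested is not None and resolved is not None and requested != resolved


argv = ["vllm", "serve", "--moe-backend", "triton"]
log = (
    "Using CompressedTensorsWNA16MarlinMoEMethod\n"
    "Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)\n"
)
print(requested_moe_backend(argv), resolved_moe_backend(log), backend_mismatch(argv, log))
```

A check like this could run right after startup so that a silently ignored backend request fails the rollout instead of surfacing later as corrupted output.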
Quick reproduction check
Anyone with access to GB10 or other SM 12.x hardware can first confirm the
runtime target with:
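# The image's default entrypoint is the vLLM server, so override it to run Python directly.
docker run --rm --gpus all --entrypoint python3 vllm/vllm-openai:nightly \
-c "
import torch
d = torch.cuda.get_device_properties(0)
print(f'SM {d.major}.{d.minor}, arch_list={torch.cuda.get_arch_list()}')
"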
Expected on this platform:
- SM 12.1 on GB10
- arch list including sm_120
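To interpret that output, the sketch below checks which entries of an arch list could serve a given device. It encodes a simplified reading of CUDA's compatibility model, which is an assumption on our part and not something stated by vLLM or PyTorch: an sm_XY binary is taken to run on devices with the same major version and an equal or higher minor version, and a compute_XY PTX entry is taken to be JIT-compilable for any device at or above X.Y. Suffixed, architecture-specific entries such as sm_90a are deliberately skipped.

```python
import re


def covering_archs(device_cc, arch_list):
    """Return the arch_list entries that could serve a device (simplified rules)."""
    dev_major, dev_minor = device_cc
    matches = []
    for arch in arch_list:
        parsed = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if parsed is None:
            continue  # skip suffixed entries such as sm_90a
        kind = parsed.group(1)
        major, minor = int(parsed.group(2)), int(parsed.group(3))
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            matches.append(arch)
        elif kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            matches.append(arch)
    return matches


# Arch list reported for vllm/vllm-openai:nightly in this report.
ARCH_LIST = ["sm_80", "sm_90", "sm_100", "sm_120", "compute_120"]
print(covering_archs((12, 1), ARCH_LIST))
```

Under these rules the nightly image has kernels that can load on SM 12.1 through sm_120. That says nothing about whether those kernels are numerically correct on GB10, which is the open question in this report.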
Expected behavior
- --moe-backend triton should either:
  - switch the Kimi compressed-tensors MoE path away from Marlin, or
  - fail clearly if that model path cannot use Triton.
- The default runtime should not collapse into degenerate token repetition on GB10 / SM 12.1.
Actual behavior
- Runtime becomes healthy enough to serve /health and /v1/models
- Generation still collapses into degenerate token repetition
- Requested --moe-backend triton still resolves to Marlin at runtime
Request
Please clarify whether moonshotai/Kimi-K2.5 compressed-tensors on GB10 / SM 12.1 is expected to support:
- correct generation on Marlin WNA16 MoE
- explicit fallback to Triton via --moe-backend triton
At the moment this looks like a correctness bug on GB10 / SM 12.1 rather than a pure startup bug.
I hit the same error on H20 devices, and the problem may also cause GPU OOM. Did you find a solution?