Cannot reproduce claimed 96.3 AIME 2025 score

#7
by NonoRiri-7 - opened

Hi, thanks for releasing this checkpoint. I'm unable to reproduce the AIME 2025 score of 96.3% and consistently get ~90% across both vLLM v0.18.1 and SGLang.

Setup:

  • Hardware: 8xB200
  • Sampling: temperature=1.0, top_p=0.95, max_tokens=98304, avg@32 (following Kimi-K2.5 model card)
  • Inference: vLLM latest + SGLang, both give the same result

Per-question avg@32 scores (INT4 baseline vs NVFP4):

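| Q | INT4 | NVFP4 | Δ |
|---|---|---|---|
| 0 | 1.000 | 1.000 | 0 |
| 1 | 1.000 | 0.969 | -0.031 |
| 2 | 1.000 | 1.000 | 0 |
| 3 | 1.000 | 1.000 | 0 |
| 4 | 1.000 | 1.000 | 0 |
| 5 | 1.000 | 1.000 | 0 |
| 6 | 1.000 | 0.969 | -0.031 |
| 7 | 1.000 | 1.000 | 0 |
| 8 | 1.000 | 1.000 | 0 |
| 9 | 1.000 | 0.969 | -0.031 |
| 10 | 1.000 | 1.000 | 0 |
| 11 | 1.000 | 1.000 | 0 |
| 12 | 0.813 | 0.688 | -0.125 |
| 13 | 0.531 | 0.281 | -0.250 |
| 14 | 0.281 | 0.188 | -0.094 |
| 15 | 1.000 | 0.969 | -0.031 |
| 16 | 1.000 | 1.000 | 0 |
| 17 | 1.000 | 1.000 | 0 |
| 18 | 1.000 | 1.000 | 0 |
| 19 | 0.969 | 1.000 | +0.031 |
| 20 | 1.000 | 1.000 | 0 |
| 21 | 1.000 | 1.000 | 0 |
| 22 | 1.000 | 1.000 | 0 |
| 23 | 1.000 | 1.000 | 0 |
| 24 | 1.000 | 0.938 | -0.063 |
| 25 | 1.000 | 1.000 | 0 |
| 26 | 1.000 | 1.000 | 0 |
| 27 | 0.844 | 0.719 | -0.125 |
| 28 | 0.938 | 0.969 | +0.031 |
| 29 | 0.844 | 0.281 | -0.563 |
| Total | 94.1% | 89.8% | -4.3% |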
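
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the totals from the per-question numbers above. It assumes the overall score is the plain mean of the 30 per-question avg@32 values; the variable and function names are made up for illustration and are not from any eval harness.

```python
# (INT4, NVFP4) avg@32 per question, copied from the table above, Q0..Q29.
scores = [
    (1.000, 1.000), (1.000, 0.969), (1.000, 1.000), (1.000, 1.000), (1.000, 1.000),
    (1.000, 1.000), (1.000, 0.969), (1.000, 1.000), (1.000, 1.000), (1.000, 0.969),
    (1.000, 1.000), (1.000, 1.000), (0.813, 0.688), (0.531, 0.281), (0.281, 0.188),
    (1.000, 0.969), (1.000, 1.000), (1.000, 1.000), (1.000, 1.000), (0.969, 1.000),
    (1.000, 1.000), (1.000, 1.000), (1.000, 1.000), (1.000, 1.000), (1.000, 0.938),
    (1.000, 1.000), (1.000, 1.000), (0.844, 0.719), (0.938, 0.969), (0.844, 0.281),
]


def mean_pct(column):
    """Overall score in percent: unweighted mean over all questions (assumption)."""
    return 100.0 * sum(row[column] for row in scores) / len(scores)


int4_pct = mean_pct(0)
nvfp4_pct = mean_pct(1)

# Sort questions by (NVFP4 - INT4), most negative first, and keep the four largest drops.
deltas = sorted((nvfp4 - int4, q) for q, (int4, nvfp4) in enumerate(scores))
worst_four = [q for _, q in deltas[:4]]

print(f"INT4 {int4_pct:.1f}%  NVFP4 {nvfp4_pct:.1f}%  gap {nvfp4_pct - int4_pct:+.1f} pts")
print("largest drops:", worst_four)
```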

The gap is concentrated in four questions: Q12, Q13, Q27, and Q29.
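
As a rough check on whether these drops could just be sampling noise, here is a sketch of a pooled two-proportion z-score per question. It assumes the 32 samples per question are independent and that both runs used the same n; `two_prop_z` is a hypothetical helper, not part of vLLM or SGLang, and the normal approximation is crude at n=32.

```python
import math

N = 32  # samples per question for avg@32


def two_prop_z(p_a, p_b, n=N):
    """Pooled two-proportion z-score for p_b vs p_a with n samples each (assumption)."""
    pooled = (p_a + p_b) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (p_b - p_a) / se


# (INT4, NVFP4) avg@32 for the four questions where the gap is concentrated.
focus = {
    12: (0.813, 0.688),
    13: (0.531, 0.281),
    27: (0.844, 0.719),
    29: (0.844, 0.281),
}

z_scores = {q: two_prop_z(a, b) for q, (a, b) in focus.items()}
for q, z in z_scores.items():
    print(f"Q{q}: z = {z:+.2f}")
```

Under these assumptions, a |z| well above 2 points to a drop that is hard to explain by sampling noise alone, while values near 1 are within what 32 samples can produce by chance.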

Could you share:

  1. The exact vLLM version/docker image used for evaluation?
  2. Any additional flags beyond what's in the model card?

Happy to provide more details or logs if helpful.

I get 92.6% overall with this NVFP4 version, but that is with a different inference engine rather than vLLM or SGLang.
