Cannot reproduce claimed 96.3 AIME 2025 score

by NonoRiri-7 - opened Apr 3


Hi, thanks for releasing this checkpoint. I cannot reproduce the reported AIME 2025 score of 96.3%; I consistently get ~90% with both vLLM v0.18.1 and SGLang.

Setup:

Hardware: 8xB200
Sampling: temperature=1.0, top_p=0.95, max_tokens=98304, avg@32 (following Kimi-K2.5 model card)
Inference: vLLM (latest) and SGLang; both give the same result
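
For clarity on the metric: avg@32 is taken here to mean the fraction of 32 sampled completions graded correct for each question, averaged with equal weight over all questions. A minimal sketch of that computation, assuming the grader yields one boolean per sample (the function names are illustrative and not taken from any eval harness):

```python
from typing import Sequence


def avg_at_k(correct: Sequence[bool]) -> float:
    """Fraction of sampled completions graded correct for one question."""
    if not correct:
        raise ValueError("need at least one sample")
    return sum(correct) / len(correct)


def benchmark_score(per_question: Sequence[Sequence[bool]]) -> float:
    """Unweighted mean of per-question avg@k, expressed as a percentage."""
    if not per_question:
        raise ValueError("need at least one question")
    return 100.0 * sum(avg_at_k(q) for q in per_question) / len(per_question)


# Hypothetical question with 26 of 32 samples graded correct.
samples = [True] * 26 + [False] * 6
print(avg_at_k(samples))  # 0.8125, which a three-decimal table would show as 0.813
```

With 32 samples, every per-question score is a multiple of 1/32, which is why the values in the table below fall on steps of about 0.031.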

Per-question avg@32 scores (INT4 baseline vs NVFP4):

Q	INT4	NVFP4	Δ
0	1.000	1.000	0
1	1.000	0.969	-0.031
2	1.000	1.000	0
3	1.000	1.000	0
4	1.000	1.000	0
5	1.000	1.000	0
6	1.000	0.969	-0.031
7	1.000	1.000	0
8	1.000	1.000	0
9	1.000	0.969	-0.031
10	1.000	1.000	0
11	1.000	1.000	0
12	0.813	0.688	-0.125
13	0.531	0.281	-0.250
14	0.281	0.188	-0.094
15	1.000	0.969	-0.031
16	1.000	1.000	0
17	1.000	1.000	0
18	1.000	1.000	0
19	0.969	1.000	+0.031
20	1.000	1.000	0
21	1.000	1.000	0
22	1.000	1.000	0
23	1.000	1.000	0
24	1.000	0.938	-0.063
25	1.000	1.000	0
26	1.000	1.000	0
27	0.844	0.719	-0.125
28	0.938	0.969	+0.031
29	0.844	0.281	-0.563
Total	94.1%	89.8%	-4.3%

The gap is concentrated in four questions: Q12, Q13, Q27, and Q29.
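
The totals and the concentration of the gap can be rechecked from the per-question numbers. A small sketch, with the scores copied from the table above (list index = question number; variable names are illustrative):

```python
# Per-question avg@32 scores from the table (index = question number).
int4 = [1.0] * 12 + [0.813, 0.531, 0.281] + [1.0] * 4 + [0.969] + [1.0] * 7 + [0.844, 0.938, 0.844]
nvfp4 = (
    [1.0, 0.969] + [1.0] * 4
    + [0.969, 1.0, 1.0, 0.969, 1.0, 1.0, 0.688, 0.281, 0.188, 0.969]
    + [1.0] * 8
    + [0.938, 1.0, 1.0, 0.719, 0.969, 0.281]
)
assert len(int4) == len(nvfp4) == 30

# Aggregate score is the unweighted mean over questions, as a percentage.
int4_total = 100 * sum(int4) / len(int4)
nvfp4_total = 100 * sum(nvfp4) / len(nvfp4)

# Per-question change (negative means NVFP4 is worse than INT4).
deltas = [n - i for i, n in zip(int4, nvfp4)]

# The four questions with the largest drops, and their share of the summed drop.
worst = sorted(range(len(deltas)), key=lambda q: deltas[q])[:4]
share = sum(deltas[q] for q in worst) / sum(deltas)

print(f"INT4 {int4_total:.1f}%  NVFP4 {nvfp4_total:.1f}%")
print("largest drops:", sorted(worst))
print(f"share of summed drop: {share:.0%}")
```

On these numbers the four questions account for roughly 83% of the summed per-question drop; the rest is spread across single-sample differences on otherwise saturated questions.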

Could you share:

The exact vLLM version/docker image used for evaluation?
Any additional flags beyond what's in the model card?

Happy to provide more details or logs if helpful.

ghostplant

8 days ago

I get an overall score of 92.6% with this NVFP4 version, but on a different inference engine rather than vLLM/SGLang.
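
When comparing aggregate numbers such as 92.6% and 89.8%, a rough estimate of sampling noise is useful. The sketch below assumes every sample is an independent Bernoulli draw at the observed per-question rate, which is a simplification: it ignores engine, kernel, and seed effects, and it assigns zero variance to questions scored 1.000. The rates are the NVFP4 column from the table in the first post.

```python
import math


def aggregate_std_error(rates, k=32):
    """Plug-in standard error of the mean of per-question avg@k scores.

    Assumes k independent Bernoulli samples per question (a simplification).
    """
    var_sum = sum(p * (1.0 - p) / k for p in rates)
    return math.sqrt(var_sum) / len(rates)


# NVFP4 per-question avg@32 rates (index = question number).
nvfp4 = (
    [1.0, 0.969] + [1.0] * 4
    + [0.969, 1.0, 1.0, 0.969, 1.0, 1.0, 0.688, 0.281, 0.188, 0.969]
    + [1.0] * 8
    + [0.938, 1.0, 1.0, 0.719, 0.969, 0.281]
)

se = aggregate_std_error(nvfp4, k=32)
print(f"estimated standard error: {100 * se:.2f} percentage points")
```

Under these assumptions the standard error of the 30-question aggregate comes out at roughly 0.6 percentage points, so differences of several points between runs are unlikely to be explained by sampling noise alone.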
