---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
  - reward-model
  - code-generation
  - rlhf
pipeline_tag: text-classification
language:
  - en
library_name: transformers
---

# CodeRM-NT
[Paper](https://github.com/THUDM/CodeRM-NT/blob/main/assets/CodeRM-NT.pdf) | [GitHub](https://github.com/THUDM/CodeRM-NT)

Providing accurate reward signals for LLM-generated code is a significant challenge when applying reinforcement learning (RL) to code generation. Existing methods rely on unit tests, which are expensive to curate by hand and unreliable when synthesized automatically.

**CodeRM-NT** is a code reward model with **no reliance on unit tests**. Instead of executing test cases, it learns to estimate the functional correctness of generated Python code. Its training rewards are collected via Monte Carlo Tree Search (MCTS) guided by LLM-as-a-Judge.

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "Rishubi/CodeRM-NT",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Rishubi/CodeRM-NT")

question = "Write a Python function `add(a, b)` that returns the sum of two integers."
response = "def add(a, b):\n    return a + b"

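# Format the question/response pair with the model's chat template.
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": response},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
# The classification head emits a single logit, which serves as the scalar reward.
with torch.no_grad():
    reward = model(input_ids).logits.squeeze().float().item()
print(reward)  # higher is better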
```
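
Because the model returns one scalar per question/response pair, a natural use is reranking several candidate solutions to the same question. The sketch below is illustrative only and is not taken from the CodeRM-NT repository: `rank_candidates` and `make_reward_fn` are hypothetical helper names, and `make_reward_fn` simply wraps the scoring steps from the snippet above.

```python
from typing import Callable, List, Sequence, Tuple


def rank_candidates(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate and return (candidate, reward) pairs, best first."""
    scored = [(candidate, score_fn(question, candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def make_reward_fn(model, tokenizer) -> Callable[[str, str], float]:
    """Hypothetical helper: wrap a loaded reward model as a scoring function."""
    import torch  # imported lazily so rank_candidates has no hard dependency

    def score(question: str, response: str) -> float:
        messages = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": response},
        ]
        input_ids = tokenizer.apply_chat_template(
            messages, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            return model(input_ids).logits.squeeze().float().item()

    return score
```

With `model` and `tokenizer` loaded as above, `rank_candidates(question, candidates, make_reward_fn(model, tokenizer))[0]` would give the highest-scoring candidate and its reward. Each candidate is scored in a separate forward pass here for simplicity; batching is left out.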

## Key Results

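Used as the RL reward, CodeRM-NT yields a higher average score than synthetic unit tests for every policy model in the table below. Per-benchmark results are mixed: for example, BCB-I-Hard drops for Qwen2.5-Coder-3B and Qwen3-4B-Thinking, and HumanEval drops slightly for Qwen2.5-Coder-7B.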

| Model              | Reward        | HumanEval | HumanEval+ | MBPP | MBPP+ | LCB-v5 | BCB-I-Hard |   Avg.   |
| :----------------- | :------------ | :-------: | :--------: | :--: | :---: | :----: | :--------: | :------: |
| Qwen2.5-Coder-1.5B | Unit Tests    |   73.2    |    67.7    | 70.9 | 61.1  |  5.1   |    6.1     |   47.4   |
|                    | **CodeRM-NT** |   75.0    |    69.5    | 72.0 | 60.8  |  5.5   |    7.4     | **48.4** |
| Qwen2.5-Coder-3B   | Unit Tests    |   86.6    |    82.3    | 74.9 | 64.6  |  13.0  |    15.5    |   56.2   |
|                    | **CodeRM-NT** |   88.4    |    82.3    | 75.9 | 66.1  |  13.6  |    14.2    | **56.8** |
| Qwen2.5-Coder-7B   | Unit Tests    |   90.9    |    87.8    | 85.4 | 73.0  |  17.3  |    18.2    |   62.1   |
|                    | **CodeRM-NT** |   90.2    |    86.0    | 86.8 | 74.6  |  17.5  |    18.2    | **62.2** |
| GLM-4-9B-0414      | Unit Tests    |   84.1    |    79.9    | 81.0 | 69.0  |  15.4  |    15.5    |   57.5   |
|                    | **CodeRM-NT** |   87.2    |    81.7    | 79.9 | 67.2  |  15.3  |    18.2    | **58.3** |
| Qwen3-4B-Thinking  | Unit Tests    |   97.6    |    92.7    | 91.0 | 75.1  |  50.3  |    25.7    |   72.1   |
|                    | **CodeRM-NT** |   97.6    |    94.5    | 92.6 | 77.2  |  52.1  |    22.3    | **72.7** |
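
The Avg. column appears to be the unweighted mean of the six benchmark scores, rounded to one decimal place. The short check below reproduces it for the Qwen3-4B-Thinking rows; the scores are copied from the table, and `row_average` is an illustrative name rather than anything from the project.

```python
from statistics import mean

# Column order: HumanEval, HumanEval+, MBPP, MBPP+, LCB-v5, BCB-I-Hard
qwen3_4b_thinking = {
    "Unit Tests": [97.6, 92.7, 91.0, 75.1, 50.3, 25.7],
    "CodeRM-NT": [97.6, 94.5, 92.6, 77.2, 52.1, 22.3],
}


def row_average(scores):
    """Unweighted mean of the benchmark scores, rounded to one decimal."""
    return round(mean(scores), 1)


averages = {reward: row_average(scores) for reward, scores in qwen3_4b_thinking.items()}
print(averages)
```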

## Citation

TODO