CodeRM-NT

Paper | GitHub

Providing accurate reward signals for code generated by LLMs is a significant challenge in applying reinforcement learning (RL) to code generation. Existing methods rely on unit tests, which are expensive to curate and unreliable when automatically synthesized.

CodeRM-NT is a code reward model with no reliance on unit tests. Instead of executing test cases, it learns to estimate the functional correctness of generated Python code from rewards that are collected via Monte Carlo Tree Search (MCTS) guided by LLM-as-a-Judge.

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "Rishubi/CodeRM-NT",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Rishubi/CodeRM-NT")

question = "Write a Python function `add(a, b)` that returns the sum of two integers."
response = "def add(a, b):\n    return a + b"

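# Format the question/response pair with the model's chat template and tokenize it.
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": response},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
# The model outputs a single scalar logit, which serves as the reward.
with torch.no_grad():
    reward = model(input_ids).logits.squeeze().float().item()
print(reward)  # higher is better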
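
Because a higher reward indicates code that is more likely to be correct, one possible use is to rerank several candidate solutions and keep the best one. The sketch below is illustrative only: `best_of_n` and `toy_score` are hypothetical names that are not part of this repository, and `toy_score` is a stand-in so the example runs without downloading the model.

```python
from typing import Callable, Sequence


def best_of_n(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> tuple[str, list[float]]:
    """Return the highest-scoring candidate and the scores of all candidates."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    # Score every candidate independently against the same question.
    scores = [score_fn(question, candidate) for candidate in candidates]
    # Pick the index with the largest score (first one wins on ties).
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_score(question: str, response: str) -> float:
    # Hypothetical stand-in for the reward model: favors code that returns a value.
    return 1.0 if "return" in response else 0.0


candidates = [
    "def add(a, b):\n    pass",
    "def add(a, b):\n    return a + b",
]
best, scores = best_of_n(
    "Write a Python function `add(a, b)` that returns the sum of two integers.",
    candidates,
    toy_score,
)
print(best)
print(scores)
```

To use the real model, `score_fn` could wrap the chat-template and forward-pass steps from the Usage snippet in a function that takes `(question, response)` and returns the scalar reward.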

Key Results

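Across multiple code generation benchmarks, training with CodeRM-NT as the reward yields a higher average score than training with synthetic unit tests for every policy model in the table, although results on individual benchmarks are mixed (for example, Qwen2.5-Coder-7B scores 90.2 vs. 90.9 on HumanEval):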

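| Model | Reward | HumanEval | HumanEval+ | MBPP | MBPP+ | LCB-v5 | BCB-I-Hard | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-1.5B | Unit Tests | 73.2 | 67.7 | 70.9 | 61.1 | 5.1 | 6.1 | 47.4 |
| Qwen2.5-Coder-1.5B | CodeRM-NT | 75.0 | 69.5 | 72.0 | 60.8 | 5.5 | 7.4 | 48.4 |
| Qwen2.5-Coder-3B | Unit Tests | 86.6 | 82.3 | 74.9 | 64.6 | 13.0 | 15.5 | 56.2 |
| Qwen2.5-Coder-3B | CodeRM-NT | 88.4 | 82.3 | 75.9 | 66.1 | 13.6 | 14.2 | 56.8 |
| Qwen2.5-Coder-7B | Unit Tests | 90.9 | 87.8 | 85.4 | 73.0 | 17.3 | 18.2 | 62.1 |
| Qwen2.5-Coder-7B | CodeRM-NT | 90.2 | 86.0 | 86.8 | 74.6 | 17.5 | 18.2 | 62.2 |
| GLM-4-9B-0414 | Unit Tests | 84.1 | 79.9 | 81.0 | 69.0 | 15.4 | 15.5 | 57.5 |
| GLM-4-9B-0414 | CodeRM-NT | 87.2 | 81.7 | 79.9 | 67.2 | 15.3 | 18.2 | 58.3 |
| Qwen3-4B-Thinking | Unit Tests | 97.6 | 92.7 | 91.0 | 75.1 | 50.3 | 25.7 | 72.1 |
| Qwen3-4B-Thinking | CodeRM-NT | 97.6 | 94.5 | 92.6 | 77.2 | 52.1 | 22.3 | 72.7 |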
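
The Avg. column appears to be the unweighted mean of the six benchmark columns; this is an inference from the numbers rather than something the card states. A small sketch that recomputes the averages and the per-model gain from the table values (the names `RESULTS` and `mean` are illustrative):

```python
# Six benchmark scores per (model, reward) pair plus the reported Avg.,
# copied from the results table. Column order:
# HumanEval, HumanEval+, MBPP, MBPP+, LCB-v5, BCB-I-Hard.
RESULTS = {
    ("Qwen2.5-Coder-1.5B", "Unit Tests"): ([73.2, 67.7, 70.9, 61.1, 5.1, 6.1], 47.4),
    ("Qwen2.5-Coder-1.5B", "CodeRM-NT"): ([75.0, 69.5, 72.0, 60.8, 5.5, 7.4], 48.4),
    ("Qwen2.5-Coder-3B", "Unit Tests"): ([86.6, 82.3, 74.9, 64.6, 13.0, 15.5], 56.2),
    ("Qwen2.5-Coder-3B", "CodeRM-NT"): ([88.4, 82.3, 75.9, 66.1, 13.6, 14.2], 56.8),
    ("Qwen2.5-Coder-7B", "Unit Tests"): ([90.9, 87.8, 85.4, 73.0, 17.3, 18.2], 62.1),
    ("Qwen2.5-Coder-7B", "CodeRM-NT"): ([90.2, 86.0, 86.8, 74.6, 17.5, 18.2], 62.2),
    ("GLM-4-9B-0414", "Unit Tests"): ([84.1, 79.9, 81.0, 69.0, 15.4, 15.5], 57.5),
    ("GLM-4-9B-0414", "CodeRM-NT"): ([87.2, 81.7, 79.9, 67.2, 15.3, 18.2], 58.3),
    ("Qwen3-4B-Thinking", "Unit Tests"): ([97.6, 92.7, 91.0, 75.1, 50.3, 25.7], 72.1),
    ("Qwen3-4B-Thinking", "CodeRM-NT"): ([97.6, 94.5, 92.6, 77.2, 52.1, 22.3], 72.7),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Preserve the table's model order while removing duplicates.
models = list(dict.fromkeys(model for model, _ in RESULTS))

gains = {}
for model in models:
    unit_tests = mean(RESULTS[(model, "Unit Tests")][0])
    coderm = mean(RESULTS[(model, "CodeRM-NT")][0])
    gains[model] = coderm - unit_tests
    print(f"{model}: unit tests {unit_tests:.2f}, CodeRM-NT {coderm:.2f}, "
          f"gain {gains[model]:+.2f}")
```

The recomputed means agree with the reported Avg. values up to rounding, and the average gain is positive for each of the five policy models.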

Citation

TODO

Model details

Format: Safetensors
Model size: 7B params
Tensor type: BF16
Base model: Qwen/Qwen2.5-Coder-7B-Instruct (fine-tuned from Qwen/Qwen2.5-Coder-7B, which is fine-tuned from Qwen/Qwen2.5-7B)