Text Classification
Transformers
Safetensors
English
qwen2
reward-model
code-generation
rlhf
text-embeddings-inference
CodeRM-NT
Providing accurate reward signals for code generated by LLMs is a significant challenge in applying reinforcement learning (RL) to code generation. Existing methods rely on unit tests, which are expensive to curate and unreliable when automatically synthesized.
CodeRM-NT is a code reward model that does not rely on unit tests. Instead of executing test cases, it estimates the functional correctness of generated Python code directly. It learns to do this from rewards collected via Monte Carlo Tree Search (MCTS) guided by an LLM-as-a-Judge.
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "Rishubi/CodeRM-NT",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Rishubi/CodeRM-NT")

question = "Write a Python function `add(a, b)` that returns the sum of two integers."
response = "def add(a, b):\n    return a + b"
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": response},
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(input_ids).logits.squeeze().float().item()
print(reward)  # higher is better
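Because a higher reward indicates a better response, the same scoring call can be used to rank several candidate solutions to one question. The following is a minimal sketch, not part of the released code: `rank_candidates`, `make_reward_scorer`, and `toy_score` are hypothetical helper names, and `toy_score` is a stand-in so the example runs without downloading the model.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    question: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, candidate) pairs, best first."""
    scored = [(score_fn(question, candidate), candidate) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def make_reward_scorer(model, tokenizer) -> Callable[[str, str], float]:
    """Wrap a loaded reward model and tokenizer (as in the Usage snippet) into a score_fn."""
    import torch  # imported lazily so the toy example below runs without PyTorch

    def score(question: str, response: str) -> float:
        messages = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": response},
        ]
        input_ids = tokenizer.apply_chat_template(
            messages, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            return model(input_ids).logits.squeeze().float().item()

    return score


def toy_score(question: str, response: str) -> float:
    """Hypothetical stand-in scorer for illustration only; NOT the real reward model."""
    return 1.0 if "return a + b" in response else -1.0


question = "Write a Python function `add(a, b)` that returns the sum of two integers."
candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
ranked = rank_candidates(question, candidates, toy_score)
best_score, best_response = ranked[0]
print(best_response)
```

To rank with the actual model, pass `make_reward_scorer(model, tokenizer)` in place of `toy_score`, using the `model` and `tokenizer` loaded above.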
Results
Key Results
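For every base model in the table below, training with CodeRM-NT rewards yields a higher average score than training with synthetic unit-test rewards. Results on individual benchmarks are mixed: for example, unit tests lead on BCB-I-Hard for Qwen2.5-Coder-3B and Qwen3-4B-Thinking.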
| Model | Reward | HumanEval | HumanEval+ | MBPP | MBPP+ | LCB-v5 | BCB-I-Hard | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-1.5B | Unit Tests | 73.2 | 67.7 | 70.9 | 61.1 | 5.1 | 6.1 | 47.4 |
| Qwen2.5-Coder-1.5B | CodeRM-NT | 75.0 | 69.5 | 72.0 | 60.8 | 5.5 | 7.4 | 48.4 |
| Qwen2.5-Coder-3B | Unit Tests | 86.6 | 82.3 | 74.9 | 64.6 | 13.0 | 15.5 | 56.2 |
| Qwen2.5-Coder-3B | CodeRM-NT | 88.4 | 82.3 | 75.9 | 66.1 | 13.6 | 14.2 | 56.8 |
| Qwen2.5-Coder-7B | Unit Tests | 90.9 | 87.8 | 85.4 | 73.0 | 17.3 | 18.2 | 62.1 |
| Qwen2.5-Coder-7B | CodeRM-NT | 90.2 | 86.0 | 86.8 | 74.6 | 17.5 | 18.2 | 62.2 |
| GLM-4-9B-0414 | Unit Tests | 84.1 | 79.9 | 81.0 | 69.0 | 15.4 | 15.5 | 57.5 |
| GLM-4-9B-0414 | CodeRM-NT | 87.2 | 81.7 | 79.9 | 67.2 | 15.3 | 18.2 | 58.3 |
| Qwen3-4B-Thinking | Unit Tests | 97.6 | 92.7 | 91.0 | 75.1 | 50.3 | 25.7 | 72.1 |
| Qwen3-4B-Thinking | CodeRM-NT | 97.6 | 94.5 | 92.6 | 77.2 | 52.1 | 22.3 | 72.7 |
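The Avg. column can be checked by hand. The sketch below assumes Avg. is the unweighted mean of the six benchmark scores (the reported values are consistent with that up to rounding) and recomputes the per-model gain of CodeRM-NT over unit tests. The scores are copied from the table; the names `RESULTS`, `mean`, and `average_gains` are illustrative only.

```python
# Scores copied from the results table, in column order:
# HumanEval, HumanEval+, MBPP, MBPP+, LCB-v5, BCB-I-Hard.
# Each entry maps a base model to (unit-test scores, CodeRM-NT scores).
RESULTS = {
    "Qwen2.5-Coder-1.5B": ([73.2, 67.7, 70.9, 61.1, 5.1, 6.1],
                           [75.0, 69.5, 72.0, 60.8, 5.5, 7.4]),
    "Qwen2.5-Coder-3B": ([86.6, 82.3, 74.9, 64.6, 13.0, 15.5],
                         [88.4, 82.3, 75.9, 66.1, 13.6, 14.2]),
    "Qwen2.5-Coder-7B": ([90.9, 87.8, 85.4, 73.0, 17.3, 18.2],
                         [90.2, 86.0, 86.8, 74.6, 17.5, 18.2]),
    "GLM-4-9B-0414": ([84.1, 79.9, 81.0, 69.0, 15.4, 15.5],
                      [87.2, 81.7, 79.9, 67.2, 15.3, 18.2]),
    "Qwen3-4B-Thinking": ([97.6, 92.7, 91.0, 75.1, 50.3, 25.7],
                          [97.6, 94.5, 92.6, 77.2, 52.1, 22.3]),
}


def mean(values):
    """Unweighted arithmetic mean (assumed definition of the Avg. column)."""
    return sum(values) / len(values)


def average_gains(results):
    """Return {model: mean(CodeRM-NT scores) - mean(unit-test scores)}."""
    return {
        model: mean(coderm_nt) - mean(unit_tests)
        for model, (unit_tests, coderm_nt) in results.items()
    }


gains = average_gains(RESULTS)
for model, gain in gains.items():
    print(f"{model}: {gain:+.2f} average points with CodeRM-NT")
```

Under this assumption every gain is positive, while the per-benchmark lists show where unit-test rewards still score higher.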
Citation
TODO