---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
- reward-model
- code-generation
- rlhf
pipeline_tag: text-classification
language:
- en
library_name: transformers
---
# CodeRM-NT
[Paper](https://github.com/THUDM/CodeRM-NT/blob/main/assets/CodeRM-NT.pdf) | [GitHub](https://github.com/THUDM/CodeRM-NT)
Providing accurate reward signals for code generated by LLMs is a significant challenge in applying reinforcement learning (RL) to code generation. Existing methods rely on unit tests, which are expensive to curate and unreliable when automatically synthesized.
**CodeRM-NT** is a code reward model with **no reliance on unit tests**. Instead of executing test cases, it learns to estimate the functional correctness of generated Python code from rewards that are collected via Monte Carlo Tree Search (MCTS) guided by LLM-as-a-Judge.
## Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the reward model and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "Rishubi/CodeRM-NT",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Rishubi/CodeRM-NT")

question = "Write a Python function `add(a, b)` that returns the sum of two integers."
response = "def add(a, b):\n    return a + b"

# Format the pair as a two-turn chat: the task, then the code to be scored
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": response},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

# Score without tracking gradients
with torch.no_grad():
    reward = model(input_ids).logits.squeeze().float().item()
print(reward)  # higher is better
```
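Because a higher reward means a better response, one natural use is to score several candidate solutions to the same question and keep the top one. The sketch below illustrates that ranking step only. The names `build_messages`, `rank_candidates`, and `dummy_score` are hypothetical and not part of this repository; `dummy_score` is a stand-in so the example runs without downloading the model. In practice `score_fn` would presumably wrap the tokenizer and model call shown above.

```python
from typing import Callable, Dict, List, Tuple

Messages = List[Dict[str, str]]


def build_messages(question: str, response: str) -> Messages:
    """Format one (question, response) pair as a two-turn chat."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": response},
    ]


def rank_candidates(
    question: str,
    candidates: List[str],
    score_fn: Callable[[Messages], float],
) -> List[Tuple[float, str]]:
    """Return (reward, candidate) pairs sorted from highest to lowest reward."""
    scored = [(score_fn(build_messages(question, c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def dummy_score(messages: Messages) -> float:
    """Stand-in scorer for illustration only; replace with a real model call."""
    return 1.0 if "a + b" in messages[-1]["content"] else -1.0


question = "Write a Python function `add(a, b)` that returns the sum of two integers."
candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]

ranked = rank_candidates(question, candidates, dummy_score)
best_reward, best_candidate = ranked[0]
print(best_candidate)
```

Passing the scorer in as a callable keeps the ranking logic independent of how the model is loaded or batched.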
## Results
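The table below compares policies trained with CodeRM-NT rewards against policies trained with synthetic unit tests on several code generation benchmarks. CodeRM-NT gives the higher average score for every model, although unit tests remain ahead on some individual benchmarks: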
| Model | Reward | HumanEval | HumanEval+ | MBPP | MBPP+ | LCB-v5 | BCB-I-Hard | Avg. |
| :----------------- | :------------ | :-------: | :--------: | :--: | :---: | :----: | :--------: | :------: |
| Qwen2.5-Coder-1.5B | Unit Tests | 73.2 | 67.7 | 70.9 | 61.1 | 5.1 | 6.1 | 47.4 |
| | **CodeRM-NT** | 75.0 | 69.5 | 72.0 | 60.8 | 5.5 | 7.4 | **48.4** |
| Qwen2.5-Coder-3B | Unit Tests | 86.6 | 82.3 | 74.9 | 64.6 | 13.0 | 15.5 | 56.2 |
| | **CodeRM-NT** | 88.4 | 82.3 | 75.9 | 66.1 | 13.6 | 14.2 | **56.8** |
| Qwen2.5-Coder-7B | Unit Tests | 90.9 | 87.8 | 85.4 | 73.0 | 17.3 | 18.2 | 62.1 |
| | **CodeRM-NT** | 90.2 | 86.0 | 86.8 | 74.6 | 17.5 | 18.2 | **62.2** |
| GLM-4-9B-0414 | Unit Tests | 84.1 | 79.9 | 81.0 | 69.0 | 15.4 | 15.5 | 57.5 |
| | **CodeRM-NT** | 87.2 | 81.7 | 79.9 | 67.2 | 15.3 | 18.2 | **58.3** |
| Qwen3-4B-Thinking | Unit Tests | 97.6 | 92.7 | 91.0 | 75.1 | 50.3 | 25.7 | 72.1 |
| | **CodeRM-NT** | 97.6 | 94.5 | 92.6 | 77.2 | 52.1 | 22.3 | **72.7** |
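
The Avg. column appears to be the unweighted mean of the six benchmark scores, rounded half-up to one decimal place. This is an assumption inferred from the numbers, not something the card states. A minimal check on the Qwen2.5-Coder-1.5B rows, with a hypothetical helper `average`:

```python
from decimal import Decimal, ROUND_HALF_UP


def average(scores):
    """Unweighted mean of score strings, rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    mean = total / len(scores)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Qwen2.5-Coder-1.5B rows, in table column order:
# HumanEval, HumanEval+, MBPP, MBPP+, LCB-v5, BCB-I-Hard
unit_tests = ["73.2", "67.7", "70.9", "61.1", "5.1", "6.1"]
coderm_nt = ["75.0", "69.5", "72.0", "60.8", "5.5", "7.4"]

unit_avg = average(unit_tests)
coderm_avg = average(coderm_nt)
print(unit_avg, coderm_avg, coderm_avg - unit_avg)
```

Scores are kept as strings and parsed with `Decimal` because several row means land exactly on a half (for example 47.35), where binary floating-point rounding could go either way.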
## Citation
TODO