---
base_model: Qwen/Qwen3-4B-Instruct-2507
language:
- en
library_name: transformers
license: apache-2.0
model_name: qwen-json
pipeline_tag: text-generation
tags:
- unsloth
- trl
- grpo
- reinforcement-learning
- json
- recipe
---

# RL-Struct: Bridging the Structure Gap

[δΈ­ζ–‡η‰ˆζœ¬](./README_CN.md) | [πŸ“š Paper](https://huggingface.co/papers/2512.00319)

We introduce **RL-Struct**, a lightweight Reinforcement Learning framework designed to close the "Structure Gap": the tension between probabilistic token generation and deterministic structured formats (e.g., JSON). By leveraging **GRPO (Group Relative Policy Optimization)** and a **Multi-dimensional Reward Function**, our model achieves superior structural reliability without the inference latency overhead of constrained decoding.

## πŸš€ Key Features

-   **Multi-dimensional Reward Function**: Decomposes the objective into Structure, Format, Validity, Correctness, and Length (a minimal scoring sketch follows this list).
-   **Efficient Training**: Uses GRPO to eliminate the critic network, reducing VRAM usage by ~40% compared to PPO.
-   **Emergent Curriculum**: The model spontaneously learns syntax (how to speak) before semantics (what to say).
-   **High Performance**: Achieves **89.7% Structural Accuracy** and **92.1% JSON Validity** on complex recipe generation, outperforming LLaMA-3-8B and GPT-3.5.
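
The exact reward weights and scoring rules are not reproduced on this card; the following is a minimal sketch of how such a multi-dimensional reward could be decomposed for the recipe task. The weights, the 200-word length budget, and the schema checks are hypothetical placeholders.

```python
import json
import re

# Hypothetical weights for the five reward dimensions; the actual values
# and per-dimension scoring rules used in training may differ.
WEIGHTS = {
    "structure": 0.3,
    "format": 0.2,
    "validity": 0.2,
    "correctness": 0.2,
    "length": 0.1,
}


def structured_reward(completion: str, reference: dict | None = None) -> float:
    """Score one completion on structure, format, validity, correctness, and length."""
    scores = dict.fromkeys(WEIGHTS, 0.0)
    text = completion.strip()

    # Format: bare JSON only, no markdown fences or surrounding prose.
    scores["format"] = 1.0 if text.startswith("{") and text.endswith("}") else 0.0

    # Validity: the outer object must parse as JSON at all.
    try:
        outer = json.loads(text)
    except json.JSONDecodeError:
        return 0.0  # nothing else is scorable without valid JSON
    scores["validity"] = 1.0

    # Structure: required keys present, and the nested "answer" string also parses.
    inner = None
    if isinstance(outer, dict) and {"reasoning", "answer"} <= outer.keys():
        scores["structure"] = 0.5
        try:
            inner = json.loads(outer["answer"])
        except (TypeError, json.JSONDecodeError):
            inner = None
        if isinstance(inner, dict) and {"name", "nutrition"} <= inner.keys():
            scores["structure"] = 1.0

    # Correctness: crude content check against a reference recipe, if provided.
    if reference is not None and isinstance(inner, dict):
        scores["correctness"] = float(
            str(inner.get("name", "")).lower() == str(reference.get("name", "")).lower()
        )

    # Length: penalize reasoning longer than a hypothetical 200-word budget.
    reasoning = outer.get("reasoning", "") if isinstance(outer, dict) else ""
    n_words = len(re.findall(r"\w+", str(reasoning)))
    scores["length"] = 1.0 if n_words <= 200 else max(0.0, 1.0 - (n_words - 200) / 200)

    return sum(WEIGHTS[k] * v for k, v in scores.items())
```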

## πŸ“Š Model Details

-   **Base Model:** [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
-   **Training Method:** GRPO (Reinforcement Learning) + LoRA (a hedged training sketch follows this list)
-   **Task:** Structured Output Generation (JSON Recipes, GSM8K-JSON, ToolUse)
-   **License:** Apache-2.0
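
The released training script is not reproduced here; the sketch below shows one way GRPO + LoRA training could be wired up with `trl`'s `GRPOTrainer`, reusing the `structured_reward` helper sketched above. It assumes a recent `trl` release that ships `GRPOTrainer`; the dataset file and all hyperparameters are placeholders, not the values used for this checkpoint.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: any dataset with a "prompt" column works for GRPOTrainer.
dataset = load_dataset("json", data_files="recipes_train.jsonl", split="train")


def json_reward(completions, **kwargs):
    # GRPOTrainer passes the sampled completions for each prompt group;
    # score each one with the multi-dimensional reward sketched above.
    return [structured_reward(c) for c in completions]


trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    reward_funcs=[json_reward],
    args=GRPOConfig(
        output_dir="qwen-json-grpo",
        num_generations=8,            # completions sampled per prompt (group size)
        max_completion_length=512,
        learning_rate=5e-6,
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```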

## πŸ› οΈ Usage

Prompt the model with the following system prompt to elicit the trained JSON format:

```text
You are a precise recipe assistant. Always respond in the following JSON format:
{
  "reasoning": "Your step-by-step reasoning here...",
  "answer": "{\"name\": \"Recipe Name\", \"nutrition\": \"Calories: ..., Protein: ..., Fat: ...\"}"
}
Do not include any other text, explanations, or markdown. Only output valid JSON.
```
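
A minimal inference sketch with `transformers` is shown below; the repository id is a placeholder for wherever this checkpoint is hosted, and the final parsing step assumes the model follows the trained two-level JSON format.

```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen-json"  # placeholder repo id for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

SYSTEM_PROMPT = """You are a precise recipe assistant. Always respond in the following JSON format:
{
  "reasoning": "Your step-by-step reasoning here...",
  "answer": "{\\"name\\": \\"Recipe Name\\", \\"nutrition\\": \\"Calories: ..., Protein: ..., Fat: ...\\"}"
}
Do not include any other text, explanations, or markdown. Only output valid JSON."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Give me a high-protein breakfast recipe."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

parsed = json.loads(response)           # outer JSON object
recipe = json.loads(parsed["answer"])   # nested recipe JSON
print(recipe["name"], "|", recipe["nutrition"])
```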

## πŸ“ˆ Performance

| Method | Structural Acc. | JSON Validity | Content Acc. |
| :--- | :---: | :---: | :---: |
| GPT-3.5 (Zero-shot) | 45.5% | 82.1% | 88.0% |
| LLaMA-3-8B (SFT) | 78.2% | 85.4% | 86.0% |
| **RL-Struct (Ours)** | **89.7%** | **92.1%** | **84.5%** |