# RL-Struct: Bridging the Structure Gap

[English Version](./README.md)

This repository contains the model and code for the paper **"Bridging the Structure Gap: A Lightweight RL Framework for Reliable Structured Output Generation in LLMs"**.

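We propose **RL-Struct**, a lightweight reinforcement learning framework that addresses the "structure gap": the mismatch between probabilistic token generation and deterministic structured formats such as JSON. Using **GRPO (Group Relative Policy Optimization)** and a novel **multi-dimensional reward function**, our model achieves strong structural reliability without the latency overhead of constrained decoding.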

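## 🚀 Key Features

-   **Multi-dimensional reward function**: decomposes the objective into Structure, Format, Validity, Correctness, and Length.
-   **Efficient training**: uses GRPO to eliminate the critic network, reducing GPU memory usage by about 40% compared with PPO.
-   **Emergent curriculum learning**: the model spontaneously learns syntax (how to say it) before semantics (what to say).
-   **High performance**: achieves **89.7% structural accuracy** and **92.1% JSON validity** on a complex recipe generation task, outperforming LLaMA-3-8B (SFT) and GPT-3.5 (zero-shot) on both metrics.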
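
As a rough illustration of how a reward split along these five dimensions could be combined into one scalar, here is a minimal sketch. The weights, the length threshold, the per-component checks, and the names `WEIGHTS`, `reward_components`, and `total_reward` are all assumptions made for this example; they are not taken from the paper or the training code.

```python
import json

# Hypothetical weights: the README does not state the actual values.
WEIGHTS = {"structure": 0.3, "format": 0.2, "validity": 0.2,
           "correctness": 0.2, "length": 0.1}


def reward_components(completion, reference_name, max_chars=2000):
    """Score one completion on five axes, each 0.0 or 1.0 (illustrative checks only)."""
    scores = dict.fromkeys(WEIGHTS, 0.0)
    text = completion.strip()

    # Format: a bare JSON object with no surrounding prose or markdown.
    scores["format"] = float(text.startswith("{") and text.endswith("}"))
    # Length: non-empty and under an assumed character budget.
    scores["length"] = float(0 < len(text) <= max_chars)

    # Validity: the text parses as JSON at all.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return scores
    scores["validity"] = 1.0
    if not isinstance(obj, dict):
        return scores

    # Structure: the required top-level keys are present.
    scores["structure"] = float({"reasoning", "answer"} <= obj.keys())

    # Correctness: the nested answer names the expected recipe.
    try:
        answer = json.loads(obj.get("answer", ""))
    except (TypeError, json.JSONDecodeError):
        answer = None
    if isinstance(answer, dict):
        name = str(answer.get("name", "")).lower()
        scores["correctness"] = float(name == reference_name.lower())
    return scores


def total_reward(completion, reference_name):
    """Weighted sum of the five component scores."""
    scores = reward_components(completion, reference_name)
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)
```

Keeping the components separate makes it possible to log each one during training, which is one way to observe whether syntax-related scores rise before the correctness score does.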

## 📊 Model Details

-   **Base model:** [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
-   **Training method:** GRPO (reinforcement learning) + LoRA
-   **Tasks:** structured output generation (JSON recipes, GSM8K-JSON, ToolUse)
-   **License:** Apache-2.0
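
GRPO is usually described as replacing the learned critic with a baseline computed from a group of completions sampled for the same prompt. The sketch below shows that group-relative advantage in its common form (reward minus group mean, divided by group standard deviation). Whether this repository uses exactly this normalization is an assumption, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    No value network is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that score above their group's average receive a positive advantage and are reinforced; those below average are pushed down.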

## 🛠️ Usage

### System Prompt
To ensure correct JSON output, use the following system prompt:

```text
You are a precise recipe assistant. Always respond in the following JSON format:
{
  "reasoning": "Your step-by-step reasoning here...",
  "answer": "{\"name\": \"Recipe Name\", \"nutrition\": \"Calories: ..., Protein: ..., Fat: ...\"}"
}
Do not include any other text, explanations, or markdown. Only output valid JSON.
```
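
In the format above, `answer` is itself a JSON-encoded string, so a consumer has to decode the response twice. The following is a minimal sketch of such a parser; the function name `parse_recipe_response` and its error handling are assumptions for illustration and are not part of this repository.

```python
import json


def parse_recipe_response(raw):
    """Decode a model response that follows the system prompt's format.

    Raises ValueError (json.JSONDecodeError is a subclass) on malformed output.
    """
    outer = json.loads(raw)
    if not isinstance(outer, dict) or not {"reasoning", "answer"} <= outer.keys():
        raise ValueError("response must be an object with 'reasoning' and 'answer'")

    answer = outer["answer"]
    # The prompt shows "answer" as a JSON string, so decode it a second time.
    if isinstance(answer, str):
        answer = json.loads(answer)
    if not isinstance(answer, dict):
        raise ValueError("'answer' must decode to a JSON object")

    return {"reasoning": outer["reasoning"], "answer": answer}


if __name__ == "__main__":
    sample = ('{"reasoning": "Pick a simple dish.", '
              '"answer": "{\\"name\\": \\"Recipe Name\\", \\"nutrition\\": \\"Calories: ...\\"}"}')
    print(parse_recipe_response(sample)["answer"]["name"])
```

Accepting an already-decoded object for `answer` as well as a string is a design choice for leniency; a stricter consumer could reject anything but a string.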

## 📈 Performance

| Method | Structural Accuracy | JSON Validity | Content Accuracy |
| :--- | :---: | :---: | :---: |
| GPT-3.5 (Zero-shot) | 45.5% | 82.1% | 88.0% |
| LLaMA-3-8B (SFT) | 78.2% | 85.4% | 86.0% |
| **RL-Struct (Ours)** | **89.7%** | **92.1%** | **84.5%** |