Model Card: Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format
Model summary
- Name: Qwen2.5-3B (GRPO-GSM8K)
- Base model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
- Developer/Maintainer: akratz
- License: Apache-2.0
- Languages: English
- Status: Experimental; trained to emit explicit reasoning traces
- Primary change: Reinforcement learning with GRPO on GSM8K to induce chain-of-thought style outputs
What's in the box
This is a small (3B-parameter) Qwen2.5 instruct model, fine-tuned with GRPO following Unsloth’s RL tutorial to produce responses in a two-block format:
> Tom’s cat is 8 years old. His rabbit is half the age of his cat. His dog is three times as old as his rabbit. How old is the dog?
<reasoning>
To find out how old Tom's dog is, we need to follow these steps:
1. Determine the age of Tom's rabbit.
2. Use the rabbit's age to find out the dog's age.
First, we know that Tom's cat is 格格 (which I assume was supposed to be '8' years old).
Next, the rabbit is half the age of the cat. Therefore, the rabbit's age is:
\[ \text{Rabbit's age} = \frac{1}{2} \times \text{Cat's age} = \frac{1}{2} \times 8 = 4 \text{ years old} \]
Now, the dog is three times as old as the rabbit. So, the dog's age is:
\[ \text{Dog's age} = 3 \times \text{Rabbit's age} = 3 \times 4 = 12 \text{ years old} \]
Thus, the dog is 12 years old.
</reasoning>
<answer>
12
</answer>
The goal was to validate the full RL pipeline end-to-end, not to achieve state-of-the-art accuracy. The model does emit structured reasoning, but overall quality is low compared with larger or better-trained models.
Intended use
- Intended: Educational and reproducibility demos of GRPO on a small model; sandbox experiments on formatting, reward shaping, and inference tooling.
- Not intended: Production use; safety-critical domains; factual or mathematical reliability; privacy-sensitive data.
Limitations and risks
- Trained only on GSM8K signals; narrow domain coverage.
- Reasoning traces can be verbose, inconsistent, or incorrect.
- Hallucinations and arithmetic mistakes are common.
- Safety and refusal behaviors were not a focus of training.
- Overfitting to the reward is likely, including reward hacking that matches the target format without improving correctness.
Training data
- Supervision signal: GSM8K math word problems (for rewards/formatting).
- Data scope: Math grade-school style problems only.
- Provenance: Public dataset (standard GSM8K release); no private data was added. A loading sketch follows this list.
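The exact split and preprocessing used for this run are not documented; the sketch below only shows a typical way to load the public GSM8K release with the Hugging Face `datasets` library (the `openai/gsm8k` Hub id and `main` config refer to the standard public release).

```python
# Sketch: load the public GSM8K release and inspect one record. The
# "openai/gsm8k" Hub id and "main" config refer to the standard public
# release; the exact split/preprocessing of this run is not documented.
from datasets import load_dataset

gsm8k = load_dataset("openai/gsm8k", "main")   # splits: "train" and "test"
example = gsm8k["train"][0]
print(example["question"])  # grade-school word problem
print(example["answer"])    # worked solution ending in "#### <final number>"
```

Each record pairs a word problem (`question`) with a worked solution whose last line starts with `#### ` followed by the numeric answer, which is what correctness rewards typically check against.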
Training procedure
- Method: GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth.
- Starting weights: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
- Objective: Encourage the required output format and correct final answers on GSM8K.
- Notes: This run was sized and configured to “make it work,” not tuned for peak performance. Hyperparameters were kept close to the Unsloth tutorial defaults; an illustrative setup sketch follows this list.
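The exact hyperparameters and reward functions of this run are not published here. As rough orientation only, the sketch below shows the general shape of a TRL `GRPOTrainer` setup of the kind the Unsloth tutorial walks through; every value, the LoRA config, and the format-reward function are illustrative assumptions, not the settings used for this model.

```python
# Illustrative sketch only: the general shape of a TRL GRPOTrainer run similar
# to the Unsloth GSM8K tutorial. Hyperparameters, the LoRA config, and the
# reward function are assumptions, NOT the settings used for this model.
import re

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

def format_reward(completions, **kwargs):
    """Reward completions that contain the required two-block format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    return [1.0 if re.search(pattern, t, re.DOTALL) else 0.0 for t in texts]

# Turn each GSM8K question into a chat-style prompt that enforces the format.
dataset = load_dataset("openai/gsm8k", "main", split="train").map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ]
    }
)

config = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-gsm8k",  # illustrative
    learning_rate=5e-6,                  # illustrative
    num_generations=8,                   # completions sampled per prompt
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
)

trainer = GRPOTrainer(
    model="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    reward_funcs=[format_reward],
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(
        r=16, lora_alpha=32, task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ),
)
trainer.train()
```

A real run would combine this with a correctness reward that compares the extracted `<answer>` against GSM8K's ground truth; only a format reward is shown here to keep the sketch short.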
Inference
llama.cpp example
Invoke with a system prompt that enforces the format:
~/work/llama.cpp/build/bin/llama-cli \
-m ~/work/llama.cpp/models/unsloth.Q8_0.gguf \
-sys "Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"
Output contract
- Always emits a `<reasoning>…</reasoning>` block followed by an `<answer>…</answer>` block.
- Consumers should parse only the `<answer>` block as the final answer (see the parsing sketch below).
- The `<reasoning>` block is untrusted and may contain errors.
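A minimal parsing sketch in Python, assuming plain regex extraction (the helper name `extract_answer` is illustrative, not part of any shipped tooling):

```python
# Minimal sketch: keep only the <answer> block; treat the <reasoning> block
# as untrusted commentary. The helper name is illustrative.
import re

def extract_answer(response: str) -> str | None:
    """Return the contents of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", response, re.DOTALL)
    return matches[-1] if matches else None

response = "<reasoning>\n8 / 2 = 4, then 4 * 3 = 12.\n</reasoning>\n<answer>\n12\n</answer>"
print(extract_answer(response))  # -> "12"
```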
Practical tips
- Use deterministic decoding for evaluation (temp≈0.0–0.2, small top_p).
- Set a stop sequence on `</answer>` to avoid trailing chatter (see the sketch after this list).
- For non-math tasks, expect degraded quality; re-prompt with very explicit instructions if you must try.
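If you prefer a Python runtime to the llama-cli call above, the sketch below applies both tips with llama-cpp-python (assumed to be installed; the model path mirrors the llama.cpp example and is otherwise an assumption):

```python
# Sketch: near-deterministic decoding with a stop sequence on the closing
# answer tag, using llama-cpp-python (assumed installed). The model path
# mirrors the llama.cpp example above and is otherwise an assumption.
from llama_cpp import Llama

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

llm = Llama(model_path="models/unsloth.Q8_0.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "Tom's cat is 8 years old. His rabbit is half the age "
                       "of his cat. His dog is three times as old as his "
                       "rabbit. How old is the dog?",
        },
    ],
    temperature=0.0,     # deterministic-ish decoding for evaluation
    stop=["</answer>"],  # stop right after the answer content
)
print(out["choices"][0]["message"]["content"])
```

Note that the stop sequence trims the closing `</answer>` tag from the returned text, so re-append it before handing the output to a parser that expects the full block.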
Evaluation
- Benchmarking: No formal leaderboard-quality evaluation was performed.
- Sanity checks: The model produces the requested two-block format on GSM8K-style prompts and can reach correct answers on simple instances.
- Known behavior: Accuracy is inconsistent; failure modes include arithmetic slips and outputs whose format breaks downstream parsing. A rough spot-check sketch follows this list.
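For your own quick spot checks (no official numbers are reported), something along these lines gives a rough exact-match rate; `generate` is a placeholder callable you must supply, e.g. a wrapper around the inference setup above:

```python
# Rough spot-check sketch: compare the model's <answer> block against GSM8K's
# "#### <number>" ground truth. `generate` is a placeholder you must supply
# (e.g. a wrapper around the llama-cpp-python call in the inference section).
import re

from datasets import load_dataset

def gold_answer(gsm8k_answer: str) -> str:
    """GSM8K solutions end with '#### <final number>'."""
    return gsm8k_answer.split("####")[-1].strip()

def model_answer(response: str) -> str | None:
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else None

def spot_check(generate, n: int = 20) -> float:
    test = load_dataset("openai/gsm8k", "main", split="test").select(range(n))
    hits = sum(
        model_answer(generate(ex["question"])) == gold_answer(ex["answer"])
        for ex in test
    )
    return hits / n
```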
Compatibility
- Quantization used for deployment: unsloth.Q8_0.gguf tested with llama.cpp.
- Libraries: Works with standard GGUF-compatible runtimes. Training used TRL + Unsloth.
Responsible use
- Do not deploy in any domain where mistakes cause harm.
- Do not rely on the `<reasoning>` text for truth; treat it as a by-product of training.
- Filter or redact sensitive inputs. Add your own safety layers if experimenting in the wild.
How to reproduce
- Tutorial followed: Unsloth “Train your own reasoning model with GRPO” (the GSM8K walkthrough).
- Replicate the pipeline, then enforce the same output format in the system prompt at inference time.
Versioning
- v0.1 (2025-08-09): Initial GRPO finetune from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit on GSM8K; emits reasoning/answer blocks; experimental quality.
Citation
If you use this model in experiments, please cite:
- Qwen2.5 model family (Qwen team)
- TRL library (Hugging Face)
- Unsloth library and GRPO tutorial
- GSM8K dataset
@misc{akratz_qwen25_3b_grpo_gsm8k_2025,
  title  = {Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format},
  author = {akratz},
  year   = {2025},
  note   = {Fine-tuned from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit using GRPO on GSM8K; Apache-2.0}
}
Card metadata (for Hub front-matter)
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
- reasoning
- grpo
license: apache-2.0
language:
- en