Model Card: Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format
Model summary
- Name: Qwen2.5-3B (GRPO-GSM8K)
- Base model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
- Developer/Maintainer: akratz
- License: Apache-2.0
- Languages: English
- Status: Experimental; trained to emit explicit reasoning traces
- Primary change: Reinforcement learning with GRPO on GSM8K to induce chain-of-thought style outputs
What's in the box
This is a small (3B-parameter) Qwen2.5 instruct model, fine-tuned with GRPO following Unsloth’s RL tutorial to produce responses in a two-block format:
> Tom’s cat is 8 years old. His rabbit is half the age of his cat. His dog is three times as old as his rabbit. How old is the dog?
<reasoning>
To find out how old Tom's dog is, we need to follow these steps:
1. Determine the age of Tom's rabbit.
2. Use the rabbit's age to find out the dog's age.
First, we know that Tom's cat is 格格 (which I assume was supposed to be '8' years old).
Next, the rabbit is half the age of the cat. Therefore, the rabbit's age is:
\[ \text{Rabbit's age} = \frac{1}{2} \times \text{Cat's age} = \frac{1}{2} \times 8 = 4 \text{ years old} \]
Now, the dog is three times as old as the rabbit. So, the dog's age is:
\[ \text{Dog's age} = 3 \times \text{Rabbit's age} = 3 \times 4 = 12 \text{ years old} \]
Thus, the dog is 12 years old.
</reasoning>
<answer>
12
</answer>
The goal was to validate the full RL pipeline end-to-end, not to achieve state-of-the-art accuracy. The model does emit structured reasoning, but overall quality is low compared with larger or better-trained models.
Intended use
- Intended: Educational and reproducibility demos of GRPO on a small model; sandbox experiments on formatting, reward shaping, and inference tooling.
- Not intended: Production use; safety-critical domains; factual or mathematical reliability; privacy-sensitive data.
Limitations and risks
- Trained only on GSM8K signals; narrow domain coverage.
- Reasoning traces can be verbose, inconsistent, or incorrect.
- Hallucinations and arithmetic mistakes are common.
- Safety and refusal behaviors were not a focus of training.
- Overfitting to the reward is likely, including reward hacking that matches the target format without improving correctness.
Training data
- Supervision signal: GSM8K math word problems (for rewards/formatting).
- Data scope: Math grade-school style problems only.
- Provenance: Public dataset (standard GSM8K release); no private data was added. A loading sketch follows this list.
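The exact split and preprocessing used for this run are not documented; the sketch below only shows a typical way to load the public GSM8K release with the Hugging Face `datasets` library (the `openai/gsm8k` Hub id and `main` config refer to the standard public release).

```python
# Sketch: load the public GSM8K release and inspect one record. The
# "openai/gsm8k" Hub id and "main" config refer to the standard public
# release; the exact split/preprocessing of this run is not documented.
from datasets import load_dataset

gsm8k = load_dataset("openai/gsm8k", "main")   # splits: "train" and "test"
example = gsm8k["train"][0]
print(example["question"])  # grade-school word problem
print(example["answer"])    # worked solution ending in "#### <final number>"
```

Each record pairs a word problem (`question`) with a worked solution whose last line starts with `#### ` followed by the numeric answer, which is what correctness rewards typically check against.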
Training procedure
- Method: GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth.
- Starting weights: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
- Objective: Encourage the required output format and correct final answers on GSM8K.
- Notes: This run was sized and configured to “make it work,” not tuned for peak performance. Hyperparameters were kept close to the Unsloth tutorial defaults; an illustrative setup sketch follows this list.
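The exact hyperparameters and reward functions of this run are not published here. As rough orientation only, the sketch below shows the general shape of a TRL `GRPOTrainer` setup of the kind the Unsloth tutorial walks through; every value, the LoRA config, and the format-reward function are illustrative assumptions, not the settings used for this model.

```python
# Illustrative sketch only: the general shape of a TRL GRPOTrainer run similar
# to the Unsloth GSM8K tutorial. Hyperparameters, the LoRA config, and the
# reward function are assumptions, NOT the settings used for this model.
import re

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

def format_reward(completions, **kwargs):
    """Reward completions that contain the required two-block format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    return [1.0 if re.search(pattern, t, re.DOTALL) else 0.0 for t in texts]

# Turn each GSM8K question into a chat-style prompt that enforces the format.
dataset = load_dataset("openai/gsm8k", "main", split="train").map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ]
    }
)

config = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-gsm8k",  # illustrative
    learning_rate=5e-6,                  # illustrative
    num_generations=8,                   # completions sampled per prompt
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
)

trainer = GRPOTrainer(
    model="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    reward_funcs=[format_reward],
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(
        r=16, lora_alpha=32, task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ),
)
trainer.train()
```

A real run would combine this with a correctness reward that compares the extracted `<answer>` against GSM8K's ground truth; only a format reward is shown here to keep the sketch short.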
Inference
llama.cpp example
Invoke with a system prompt that enforces the format:
~/work/llama.cpp/build/bin/llama-cli \
-m ~/work/llama.cpp/models/unsloth.Q8_0.gguf \
-sys "Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"
Output contract
- Always emits a `<reasoning>…</reasoning>` block followed by an `<answer>…</answer>` block.
- Consumers should parse only the `<answer>` block as the final answer (see the parsing sketch below).
- The `<reasoning>` block is untrusted and may contain errors.
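A minimal parsing sketch in Python, assuming plain regex extraction (the helper name `extract_answer` is illustrative, not part of any shipped tooling):

```python
# Minimal sketch: keep only the <answer> block; treat the <reasoning> block
# as untrusted commentary. The helper name is illustrative.
import re

def extract_answer(response: str) -> str | None:
    """Return the contents of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", response, re.DOTALL)
    return matches[-1] if matches else None

response = "<reasoning>\n8 / 2 = 4, then 4 * 3 = 12.\n</reasoning>\n<answer>\n12\n</answer>"
print(extract_answer(response))  # -> "12"
```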
Practical tips
- Use deterministic decoding for evaluation (temp≈0.0–0.2, small top_p).
- Set a stop sequence on `</answer>` to avoid trailing chatter (see the sketch after this list).
- For non-math tasks, expect degraded quality; re-prompt with very explicit instructions if you must try.
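If you prefer a Python runtime to the llama-cli call above, the sketch below applies both tips with llama-cpp-python (assumed to be installed; the model path mirrors the llama.cpp example and is otherwise an assumption):

```python
# Sketch: near-deterministic decoding with a stop sequence on the closing
# answer tag, using llama-cpp-python (assumed installed). The model path
# mirrors the llama.cpp example above and is otherwise an assumption.
from llama_cpp import Llama

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

llm = Llama(model_path="models/unsloth.Q8_0.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "Tom's cat is 8 years old. His rabbit is half the age "
                       "of his cat. His dog is three times as old as his "
                       "rabbit. How old is the dog?",
        },
    ],
    temperature=0.0,     # deterministic-ish decoding for evaluation
    stop=["</answer>"],  # stop right after the answer content
)
print(out["choices"][0]["message"]["content"])
```

Note that the stop sequence trims the closing `</answer>` tag from the returned text, so re-append it before handing the output to a parser that expects the full block.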
Evaluation
- Benchmarking: No formal leaderboard-quality evaluation was performed.
- Sanity checks: The model produces the requested two-block format on GSM8K-style prompts and can reach correct answers on simple instances.
- Known behavior: Accuracy is inconsistent; failure modes include arithmetic slips and outputs whose format breaks downstream parsing. A rough spot-check sketch follows this list.
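For your own quick spot checks (no official numbers are reported), something along these lines gives a rough exact-match rate; `generate` is a placeholder callable you must supply, e.g. a wrapper around the inference setup above:

```python
# Rough spot-check sketch: compare the model's <answer> block against GSM8K's
# "#### <number>" ground truth. `generate` is a placeholder you must supply
# (e.g. a wrapper around the llama-cpp-python call in the inference section).
import re

from datasets import load_dataset

def gold_answer(gsm8k_answer: str) -> str:
    """GSM8K solutions end with '#### <final number>'."""
    return gsm8k_answer.split("####")[-1].strip()

def model_answer(response: str) -> str | None:
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else None

def spot_check(generate, n: int = 20) -> float:
    test = load_dataset("openai/gsm8k", "main", split="test").select(range(n))
    hits = sum(
        model_answer(generate(ex["question"])) == gold_answer(ex["answer"])
        for ex in test
    )
    return hits / n
```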
Compatibility
- Quantization used for deployment: unsloth.Q8_0.gguf tested with llama.cpp.
- Libraries: Works with standard GGUF-compatible runtimes. Training used TRL + Unsloth.
Responsible use
- Do not deploy in any domain where mistakes cause harm.
- Do not rely on the `<reasoning>` text for truth; treat it as a by-product of training.
- Filter or redact sensitive inputs. Add your own safety layers if experimenting in the wild.
How to reproduce
- Tutorial followed: Unsloth “Train your own reasoning model with GRPO” (the GSM8K walkthrough).
- Replicate the pipeline, then enforce the same output format in the system prompt at inference time.
Versioning
- v0.1 (2025-08-09): Initial GRPO finetune from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit on GSM8K; emits reasoning/answer blocks; experimental quality.
Citation
If you use this model in experiments, please cite:
- Qwen2.5 model family (Qwen team)
- TRL library (Hugging Face)
- Unsloth library and GRPO tutorial
- GSM8K dataset
@misc{akratz_qwen25_3b_grpo_gsm8k_2025,
  title  = {Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format},
  author = {akratz},
  year   = {2025},
  note   = {Fine-tuned from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit using GRPO on GSM8K; Apache-2.0}
}
Card metadata (for Hub front-matter)
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
- reasoning
- grpo
license: apache-2.0
language:
- en