Model Card: Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format

Model summary

  • Name: Qwen2.5-3B (GRPO-GSM8K)
  • Base model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
  • Developer/Maintainer: akratz
  • License: Apache-2.0
  • Languages: English
  • Status: Experimental; trained to emit explicit reasoning traces
  • Primary change: Reinforcement learning with GRPO on GSM8K to induce chain-of-thought style outputs

What's in the box

This is a Qwen2.5 3B-parameter instruct model fine-tuned with GRPO, following Unsloth’s RL tutorial, to produce responses in a two-block format. The sample below is verbatim model output; its errors, including one garbled token, are preserved deliberately to illustrate current quality:

> Tom’s cat is 8 years old. His rabbit is half the age of his cat. His dog is three times as old as his rabbit. How old is the dog?
<reasoning>
To find out how old Tom's dog is, we need to follow these steps:

1. Determine the age of Tom's rabbit.
2. Use the rabbit's age to find out the dog's age.

First, we know that Tom's cat is 格格 (which I assume was supposed to be '8' years old).

Next, the rabbit is half the age of the cat. Therefore, the rabbit's age is:
\[ \text{Rabbit's age} = \frac{1}{2} \times \text{Cat's age} = \frac{1}{2} \times 8 = 4 \text{ years old} \]

Now, the dog is three times as old as the rabbit. So, the dog's age is:
\[ \text{Dog's age} = 3 \times \text{Rabbit's age} = 3 \times 4 = 12 \text{ years old} \]

Thus, the dog is 12 years old.
...
</reasoning>
<answer>
12
</answer>

The goal was to validate the full RL pipeline end-to-end, not to achieve state-of-the-art accuracy. The model does emit structured reasoning, but overall quality is low compared with larger or better-trained models.

Intended use

  • Intended: Educational and reproducibility demos of GRPO on a small model; sandbox experiments on formatting, reward shaping, and inference tooling.
  • Not intended: Production use; safety-critical domains; factual or mathematical reliability; privacy-sensitive data.

Limitations and risks

  • Trained only on GSM8K signals; narrow domain coverage.
  • Reasoning traces can be verbose, inconsistent, or incorrect.
  • Hallucinations and arithmetic mistakes are common.
  • Safety and refusal behaviors were not a focus of training.
  • Reward overfitting to the target format and short-term hacks is likely.

Training data

  • Supervision signal: GSM8K math word problems (for rewards/formatting); a data-loading sketch follows this list.
  • Data scope: Math grade-school style problems only.
  • Provenance: Public dataset (standard GSM8K release). No private data added.
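
For orientation, GSM8K’s reference solutions end with a line of the form “#### <number>”, which is what a correctness reward can compare against. A minimal sketch of pulling that out with the Hugging Face datasets library (the dataset id openai/gsm8k is an assumption here; the older gsm8k id also resolves):

# Sketch: load GSM8K and extract the gold final answer used for rewards.
# Assumes the Hugging Face `datasets` library is installed.
from datasets import load_dataset

def extract_gold_answer(solution: str) -> str:
    # GSM8K solutions end with a line like "#### 72".
    return solution.split("####")[-1].strip().replace(",", "")

train = load_dataset("openai/gsm8k", "main", split="train")
example = train[0]
print(example["question"])
print(extract_gold_answer(example["answer"]))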

Training procedure

  • Method: GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth.
  • Starting weights: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
  • Objective: Encourage the required output format and correct final answers on GSM8K.
  • Notes: This run was sized and configured to “make it work,” not tuned for peak performance. Hyperparameters were kept close to the Unsloth tutorial defaults; a condensed training sketch follows this list.
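
A heavily condensed sketch of what such a run can look like with TRL’s GRPOTrainer. The reward functions, reward values, and hyperparameters below are illustrative, not the exact configuration behind this checkpoint; TRL forwards extra dataset columns (here "gold") to reward functions as keyword arguments.

# Illustrative GRPO skeleton with TRL; not the exact script used for this model.
# Assumes a plain-text "prompt" column, so completions arrive as strings.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
SYSTEM = "Respond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"

def prepare(example):
    # Build the prompt and keep the gold label for the correctness reward.
    return {
        "prompt": f"{SYSTEM}\n\n{example['question']}",
        "gold": example["answer"].split("####")[-1].strip(),
    }

def format_reward(completions, **kwargs):
    # Reward the required <reasoning>/<answer> structure.
    return [0.5 if ANSWER_RE.search(c) else 0.0 for c in completions]

def correctness_reward(completions, gold, **kwargs):
    # Reward an exact match between the parsed answer and the gold label.
    rewards = []
    for completion, g in zip(completions, gold):
        m = ANSWER_RE.search(completion)
        rewards.append(2.0 if m and m.group(1).strip() == g else 0.0)
    return rewards

dataset = load_dataset("openai/gsm8k", "main", split="train").map(prepare)
trainer = GRPOTrainer(
    model="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    reward_funcs=[format_reward, correctness_reward],
    args=GRPOConfig(output_dir="grpo-gsm8k", num_generations=8),
    train_dataset=dataset,
)
trainer.train()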

Inference

llama.cpp example

Invoke with a system prompt that enforces the format:

# Adjust paths to your local llama.cpp build and GGUF file.
~/work/llama.cpp/build/bin/llama-cli \
  -m ~/work/llama.cpp/models/unsloth.Q8_0.gguf \
  -sys "Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"

Output contract

  • Always emits a <reasoning>…</reasoning> block followed by an <answer>…</answer> block.
  • Consumers should parse only the <answer> block as the final answer (a parsing sketch follows this list).
  • The <reasoning> block is untrusted and may contain errors.
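
A minimal parsing sketch (regex-based; it assumes the tags appear at most once and returns None when the block is missing):

# Minimal sketch: extract the final answer from model output.
# Treat the result as untrusted; validate before downstream use.
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def parse_answer(text: str) -> Optional[str]:
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None

assert parse_answer("<reasoning>\n3 * 4 = 12\n</reasoning>\n<answer>\n12\n</answer>") == "12"
assert parse_answer("no tags here") is None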

Practical tips

  • Use deterministic decoding for evaluation (temperature ≈ 0.0–0.2, small top_p); a decoding sketch follows this list.
  • Set a stop sequence on </answer> to avoid trailing chatter.
  • For non-math tasks, expect degraded quality; re-prompt with very explicit instructions if you must try.
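
As an example of both tips together, a sketch using the llama-cpp-python bindings (the model path and sampling values are illustrative):

# Sketch: low-temperature decoding with a stop sequence on </answer>,
# via the llama-cpp-python bindings. Path and values are illustrative.
from llama_cpp import Llama

llm = Llama(model_path="unsloth.Q8_0.gguf", n_ctx=2048)

prompt = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n\n"
    "Tom's cat is 8 years old. His rabbit is half the age of his cat. "
    "His dog is three times as old as his rabbit. How old is the dog?"
)
result = llm(prompt, max_tokens=512, temperature=0.1, top_p=0.9, stop=["</answer>"])
print(result["choices"][0]["text"])

Note that the stop string itself is excluded from the returned text, so append </answer> before handing the output to a parser that expects the closing tag.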

Evaluation

  • Benchmarking: No formal leaderboard-quality evaluation was performed.
  • Sanity checks: The model produces the requested two-block format on GSM8K-style prompts and can reach correct answers on simple instances (a minimal accuracy check is sketched after this list).
  • Known behavior: Accuracy is inconsistent; failure modes include arithmetic slips and brittle parsing.
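
A rough version of such a sanity check (exact-match on parsed answers; generate is a placeholder for whatever inference call you use, and parse_answer is the parser sketched under “Output contract”):

# Rough sanity check: exact-match accuracy on a handful of GSM8K test items.
# `generate(question) -> str` is a placeholder for your inference call;
# `parse_answer` is the regex parser sketched under "Output contract".
from datasets import load_dataset

def gold(answer: str) -> str:
    return answer.split("####")[-1].strip().replace(",", "")

def sanity_check(generate, parse_answer, n: int = 20) -> float:
    test = load_dataset("openai/gsm8k", "main", split="test").select(range(n))
    hits = 0
    for example in test:
        pred = parse_answer(generate(example["question"]))
        hits += int(pred is not None and pred.replace(",", "") == gold(example["answer"]))
    return hits / n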

Compatibility

  • Quantization used for deployment: unsloth.Q8_0.gguf tested with llama.cpp.
  • Libraries: Works with standard GGUF-compatible runtimes. Training used TRL + Unsloth.

Responsible use

  • Do not deploy in any domain where mistakes cause harm.
  • Do not rely on the <reasoning> text for truth; treat it as a by-product of training.
  • Filter or redact sensitive inputs. Add your own safety layers if experimenting in the wild.

How to reproduce

  • Tutorial followed: Unsloth “Train your own reasoning model with GRPO” (the GSM8K walkthrough).
  • Replicate the pipeline, then enforce the same output format in the system prompt at inference time.

Versioning

  • v0.1 (2025-08-09): Initial GRPO finetune from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit on GSM8K; emits reasoning/answer blocks; experimental quality.

Citation

If you use this model in experiments, please cite:

  • Qwen2.5 model family (Qwen team)
  • TRL library (Hugging Face)
  • Unsloth library and GRPO tutorial
  • GSM8K dataset

@misc{akratz_qwen25_3b_grpo_gsm8k_2025,
  title  = {Qwen2.5-3B (GRPO-GSM8K) — Reasoning Format},
  author = {akratz},
  year   = {2025},
  note   = {Fine-tuned from unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit using GRPO on GSM8K; Apache-2.0}
}

Card metadata (for Hub front-matter)

base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - qwen2
  - reasoning
  - grpo
license: apache-2.0
language:
  - en