SPARK-Code · Condition A (Exec-Only GRPO) · Qwen2.5-Coder-3B QLoRA

QLoRA adapter trained with execution-grounded GRPO. The strongest and most stable cross-benchmark performer in the SPARK-Code study.

TL;DR

spark-code-A-3b is a LoRA adapter for Qwen/Qwen2.5-Coder-3B-Instruct produced by 3 iterations of Group Relative Policy Optimization (GRPO) on 200 MBPP problems, using partial per-test execution feedback as the only reward signal. It moves HumanEval pass@1 from 0.796 → 0.805 (+0.85 pp) monotonically while keeping the KL to the frozen reference at or below 1.1e-3, and it generalizes to held-out MBPP (0.634 → 0.636 pass@1; 0.68 → 0.69 pass@5, with an intermediate peak at 0.71). In the three-arm comparison, Condition A is the only run that improves on both benchmarks while keeping reference-policy KL at the 1e-3 level.

Training Setup

  • Base model: Qwen/Qwen2.5-Coder-3B-Instruct
  • Method: Execution-grounded GRPO. For each MBPP problem we generate a group of rollouts, score each rollout by the fraction of unit tests it passes (with explicit penalties for syntax/runtime/timeout errors), normalize rewards within the group, and apply a clipped PPO-style policy-gradient update (see the sketch after this list). No auxiliary SFT objective is used in this condition; it is the exec-only baseline.
  • Training data: MBPP-sanitized, 200 problems, 3 iterations, K=4 adaptive rollouts (up to 8 when the group has zero advantage variance), partial per-test rewards with syntax_penalty=-0.2, runtime_penalty=-0.1, timeout_penalty=-0.3, wrong_answer_floor=0.0.
  • LoRA: r=16, alpha=32, dropout=0.05, target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
  • Quantization: 4-bit NF4 with double quantization, bf16 compute.
  • Optimizer: AdamW, lr=5e-6, grad_accum=4, clip_ratio=0.2, max_grad_norm=1.0.
  • KL regularization: kl_coeff=0.01 against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
  • Auxiliary objective: none (this is Condition A).
  • Seed: 42.
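
The reward shaping and group-relative update above can be sketched as follows. This is a minimal illustration, assuming sequence-level rewards and per-token log-probabilities; the authoritative logic is in run_experiment_with_mbpp_heldout.py, and all names here are illustrative only.

import torch

SYNTAX_PENALTY, RUNTIME_PENALTY, TIMEOUT_PENALTY = -0.2, -0.1, -0.3

def rollout_reward(outcome: str, n_passed: int, n_tests: int) -> float:
    """Partial per-test reward with explicit failure penalties; wrong answers floor at 0.0."""
    if outcome == "syntax_error":
        return SYNTAX_PENALTY
    if outcome == "runtime_error":
        return RUNTIME_PENALTY
    if outcome == "timeout":
        return TIMEOUT_PENALTY
    return n_passed / n_tests  # fraction of unit tests passed

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one rollout group (group-relative advantages).
    When every rollout ties (zero variance), the run expands the group from K=4 up to 8 rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_ratio: float = 0.2, kl_coeff: float = 0.01) -> torch.Tensor:
    """Clipped PPO-style surrogate plus a KL penalty to the frozen reference.
    Inputs are per-token tensors over the group's completion tokens; `advantages`
    holds each rollout's normalized advantage broadcast over its tokens, and
    logp_old / logp_ref were cached at rollout time."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # k=3 KL estimator: exp(r) - r - 1 with r = logp_ref - logp_new, averaged per token
    r = logp_ref - logp_new
    kl = (torch.exp(r) - r - 1.0).mean()
    return policy_loss + kl_coeff * kl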

Training script: run_experiment_with_mbpp_heldout.py in the GitHub repo.
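
The quantization and LoRA rows above correspond to standard bitsandbytes / PEFT configuration objects. A sketch, assuming the usual transformers loading path; the training script may construct these differently:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 quantization
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # bf16 compute
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard k-bit training prep
model = get_peft_model(base, lora_config)     # only the LoRA matrices are trainable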

Evaluation Results

HumanEval is evaluated with 5 samples per problem at temperature=0.2, top_p=0.95. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts.

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|------|------------------|------------------|------------------|------------------|-----------------|---------|
| 0    | 0.796            | 0.854            | 0.634            | 0.680            | —               | —       |
| 1    | 0.798            | 0.860            | 0.624            | 0.690            | 0.603           | 0.0002  |
| 2    | 0.799            | 0.848            | 0.632            | 0.710            | 0.640           | 0.0005  |
| 3    | 0.805            | 0.854            | 0.636            | 0.690            | 0.639           | 0.0011  |

Trajectory. HumanEval pass@1 climbs monotonically across all three iterations (+0.85 pp end-to-end), and KL stays at or below 1.1e-3, indicating that the policy is improving without drifting from the base distribution. MBPP held-out pass@5 peaks at iter 2 (0.71) and settles to 0.69 at iter 3, while pass@1 ends slightly above baseline (+0.2 pp). Train pass rate rises from 0.603 to 0.639, consistent with the eval gains. Mean tokens per GRPO sequence stays in the 177–182 range across iterations, so there is no completion-length collapse.
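
For reference, the pass@k columns are point estimates from n=5 samples per problem; a minimal sketch, assuming the standard unbiased estimator of Chen et al. (2021):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws from n samples (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a problem where 3 of the 5 samples pass all tests contributes
# pass@1 = 3/5 = 0.6 and pass@5 = 1.0 to the benchmark average.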

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in bf16 and attach the spark-code-A-3b LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build a chat prompt with the model's chat template.
prompt = tok.apply_chat_template(
    [{"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
     {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample with the same settings used for evaluation (temperature=0.2, top_p=0.95).
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
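
Optionally, the adapter can be folded into the base weights with the standard PEFT merge call to get a standalone checkpoint (not required for the snippet above; the output path is illustrative):

merged = model.merge_and_unload()               # fold the LoRA deltas into the bf16 base weights
merged.save_pretrained("spark-code-A-3b-merged")
tok.save_pretrained("spark-code-A-3b-merged")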

Comparison to Other Conditions

All three adapters share the same base model, training pool, seed, and rollout budget. They differ only in the auxiliary objective and KL strength.

| Condition                     | aux_loss_scale | kl_coeff | HumanEval pass@1 (it 3) | MBPP-held pass@5 (it 3) | GRPO KL (it 3) |
|-------------------------------|----------------|----------|-------------------------|-------------------------|----------------|
| A (exec-only, this card)      | 0.00           | 0.01     | 0.805                   | 0.690                   | 0.0011         |
| C-light (naive co-evolve)     | 0.10           | 0.01     | 0.773                   | 0.680                   | 0.0941         |
| C-reg (regularized co-evolve) | 0.03           | 0.02     | 0.800                   | 0.720                   | 0.0136         |

Condition A delivers the highest HumanEval pass@1 and the lowest reference-policy drift; C-reg is the only condition that beats it on MBPP pass@5 (+3 pp), and C-light demonstrates the policy-drift failure mode.

Findings Summary

  • Simplest method wins on the primary cross-benchmark metric. Exec-only GRPO produced the largest, most stable HumanEval pass@1 gain in the study; no auxiliary SFT was required.
  • Drift control is essentially free here. With kl_coeff=0.01 and no auxiliary loss pulling the policy off-distribution, KL stays ≤1.1e-3 and completion lengths stay flat across iterations.
  • Sample efficiency is modest but real. 200 MBPP problems × 3 iterations on a single 3B-parameter base was enough to produce a small but monotonic HumanEval improvement and a peaked MBPP pass@5 gain.

Related Artifacts

Citation

@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}

License

The LoRA adapter weights in this repository are released under the Apache 2.0 license. The base model, Qwen/Qwen2.5-Coder-3B-Instruct, is distributed under the Qwen Research License Agreement (see the base model card); any downstream use must comply with its terms.
