# SPARK-Code · Condition C-reg (Regularized Co-Evolve) · Qwen2.5-Coder-3B QLoRA
QLoRA adapter trained with regularized SPARK-style co-evolution (GRPO + aux SFT). Bounded drift; matched the exec-only baseline on HumanEval within noise and improved held-out MBPP pass@5 by +4 pp.
## TL;DR
`spark-code-C-reg-3b` is the regularized variant of the SPARK-Code co-evolution recipe: GRPO over execution-grounded rewards plus a small auxiliary SFT phase that mines pointwise, pairwise, and reflection labels from each iteration's rollouts. Compared to the naive co-evolve run (C-light), this run scales the auxiliary loss down by ~3× (`aux_loss_scale=0.03` vs `0.1`) and doubles the KL coefficient (`kl_coeff=0.02` vs `0.01`). The recipe peaks at HumanEval pass@1 = 0.810 at iter 2 and lands at 0.800 at iter 3, a +0.4 pp end-state delta over the frozen base that is within sampling noise of the exec-only baseline (0.805). The held-out MBPP pass@5 trajectory is the headline: 0.68 → 0.69 → 0.71 → 0.72, monotonic across iterations and +4 pp end-to-end. KL stays bounded at 1.4e-2, about 7× lower than C-light at the same iteration.
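The regularization acts at two points: the KL coefficient inside the GRPO step and the scale on the auxiliary SFT mix. A minimal sketch of how these knobs are assumed to combine (function and tensor names here are illustrative, not the repo's API):

```python
import torch

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref,
              clip_ratio=0.2, kl_coeff=0.02):
    """Clipped GRPO surrogate plus a KL penalty against the frozen reference."""
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1.0 - clip_ratio, 1.0 + clip_ratio)
    surrogate = -torch.min(ratio * advantages, clipped * advantages).mean()
    return surrogate + kl_coeff * kl_to_ref.mean()

def aux_phase_loss(ce_pointwise, ce_pairwise, ce_reflection,
                   aux_loss_scale=0.03, w_point=0.0, w_pair=0.1, w_refl=1.0):
    """Auxiliary SFT phase: a weighted cross-entropy mix over mined labels,
    scaled down so it cannot overwhelm the execution-grounded signal."""
    return aux_loss_scale * (w_point * ce_pointwise
                             + w_pair * ce_pairwise
                             + w_refl * ce_reflection)
```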
## Training Setup

- **Base model:** `Qwen/Qwen2.5-Coder-3B-Instruct`
- **Method:** GRPO (exec-only reward, partial per-test scoring, frozen-reference KL) plus an auxiliary SFT phase per iteration. Auxiliary examples come in three flavors (a mining sketch appears at the end of this section):
  - **Pointwise:** binary "Correct/Incorrect" judgments over a single candidate
  - **Pairwise:** randomized A/B preference between a passing and a failing rollout
  - **Reflection:** execution-grounded repair; the target is a sibling correct rollout or the MBPP canonical solution (`reflection_target_mode=correct_or_canonical`)
- **Training data:** MBPP-sanitized, 200 problems, 3 iterations, K=4 adaptive rollouts (up to 8), partial per-test rewards with `syntax_penalty=-0.2`, `runtime_penalty=-0.1`, `timeout_penalty=-0.3` (sketched just below this list)
- **LoRA:** `r=16`, `alpha=32`, `dropout=0.05`; targets `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Quantization:** 4-bit NF4 with double quantization, bf16 compute
- **Optimizer:** AdamW, `lr=5e-6`, `grad_accum=4`, `clip_ratio=0.2`, `max_grad_norm=1.0`
- **GRPO KL:** `kl_coeff=0.02` against the frozen reference policy (2× the C-light value)
- **Aux hyperparameters (this run):** `aux_loss_scale=0.03` (3.3× lower than C-light), `aux_weight_pointwise=0.0`, `aux_weight_pairwise=0.1`, `aux_weight_reflection=1.0`, `aux_epochs=1`, `aux_max_len=1024`
- **Aux pool sizes (after caps):** iter 1 → 594 (pointwise 200 / pairwise 151 / reflection 243), iter 2 → 590 (200/150/240), iter 3 → 574 (200/139/235)
- **Seed:** 42
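The partial per-test reward named in the training-data bullet can be made concrete with a short sketch; the penalty values are this run's, while the function name and outcome encoding are assumptions:

```python
def rollout_reward(outcome: str, n_passed: int, n_tests: int) -> float:
    """Score one candidate program for GRPO.

    outcome: "ok" if the code executed, else "syntax" | "runtime" | "timeout".
    A runnable candidate earns the fraction of unit tests it passes.
    """
    penalties = {"syntax": -0.2, "runtime": -0.1, "timeout": -0.3}
    if outcome in penalties:
        return penalties[outcome]
    return n_passed / max(n_tests, 1)  # partial per-test credit in [0, 1]
```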
Training script: `run_experiment_with_mbpp_heldout.py` in the GitHub repo.
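To make the three auxiliary flavors concrete, here is an illustrative mining sketch over one problem's rollouts; the data layout and prompt wording are assumptions, with the actual logic living in the training script above:

```python
import random

def mine_aux_examples(prompt, rollouts, canonical=None):
    """rollouts: list of dicts with keys "code" (str) and "passed" (bool)."""
    passing = [r for r in rollouts if r["passed"]]
    failing = [r for r in rollouts if not r["passed"]]
    examples = []

    # Pointwise: binary Correct/Incorrect judgment over a single candidate.
    for r in rollouts:
        examples.append({"flavor": "pointwise",
                         "input": f"{prompt}\n\n{r['code']}\n\nIs this solution correct?",
                         "target": "Correct" if r["passed"] else "Incorrect"})

    # Pairwise: randomized A/B preference between a passing and a failing rollout.
    if passing and failing:
        good, bad = random.choice(passing), random.choice(failing)
        a, b, answer = (good, bad, "A") if random.random() < 0.5 else (bad, good, "B")
        examples.append({"flavor": "pairwise",
                         "input": f"{prompt}\n\nA:\n{a['code']}\n\nB:\n{b['code']}\n\nWhich is correct?",
                         "target": answer})

    # Reflection: repair a failed rollout; the target is a sibling correct
    # rollout or the canonical solution (reflection_target_mode=correct_or_canonical).
    for r in failing:
        target = passing[0]["code"] if passing else canonical
        if target is not None:
            examples.append({"flavor": "reflection",
                             "input": f"{prompt}\n\nFailed attempt:\n{r['code']}\n\nFix it.",
                             "target": target})
    return examples
```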
## Evaluation Results

HumanEval is evaluated with 5 samples per problem at `temperature=0.2`, `top_p=0.95`. Held-out MBPP uses 100 problems disjoint from the training pool. "Reflection fix rate" is measured on the held-out HumanEval problems: for each failed first-pass generation the model is asked to repair its own code, and the fix is re-executed.
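With 5 samples per problem, pass@1 and pass@5 are most naturally computed with the unbiased estimator of Chen et al. (2021); whether the eval scripts use exactly this form is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples drawn, c of them passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 of 5 samples pass a problem: pass@1 = 0.4, pass@5 = 1.0
assert abs(pass_at_k(5, 2, 1) - 0.4) < 1e-9
assert pass_at_k(5, 2, 5) == 1.0
```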
| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL | Refl. fix rate |
|---|---|---|---|---|---|---|---|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | — | — | 0.118 |
| 1 | 0.802 | 0.866 | 0.636 | 0.690 | 0.603 | 0.0002 | 0.063 |
| 2 | 0.810 | 0.848 | 0.648 | 0.710 | 0.614 | 0.0037 | 0.030 |
| 3 | 0.800 | 0.854 | 0.630 | 0.720 | 0.629 | 0.0136 | 0.097 |
**Trajectory.** HumanEval pass@1 peaks at iter 2 (0.810, +1.4 pp over baseline) and dips slightly to 0.800 at iter 3; the final value is within sampling noise of the exec-only baseline (0.805) and of the frozen base (0.796). The non-monotonicity is real but small. Held-out MBPP pass@5 is the cleaner trajectory: a monotonic climb from 0.68 to 0.72 (+4 pp), the largest pass@5 gain of the three conditions. GRPO KL stays in a regularized regime at 0.0136 by iter 3 (~13× the exec-only baseline but ~7× lower than the naive C-light run at the same iteration), which is exactly the operating point this configuration was designed to land in. Mean tokens per GRPO sequence drop modestly (182 → 154 → 142), well short of the C-light collapse. The held-out reflection fix-rate metric is noisy (n=31–33 per iter) and dips below the baseline; it should be read with caution at this sample size.
## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16 and attach the C-reg LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-C-reg-3b")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build a chat prompt for a small coding task.
prompt = tok.apply_chat_template(
    [{"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
     {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."}],
    tokenize=False, add_generation_prompt=True,
)

# Sample with the evaluation settings (temperature=0.2, top_p=0.95) and
# decode only the newly generated tokens.
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
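For adapter-free deployment, the LoRA weights can be folded into the base model with standard PEFT tooling (output path illustrative):

```python
# Optional: merge the adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("spark-code-C-reg-3b-merged")
```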
## Comparison to Other Conditions
All three adapters share the same base model, training pool, seed, and rollout budget. They differ only in the auxiliary objective and KL strength.
| Condition | `aux_loss_scale` | `kl_coeff` | HumanEval pass@1 (iter 3) | MBPP-held pass@5 (iter 3) | GRPO KL (iter 3) |
|---|---|---|---|---|---|
| A (exec-only) | 0.00 | 0.01 | 0.805 | 0.690 | 0.0011 |
| C-light (naive co-evolve) | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
| C-reg (regularized co-evolve) — this card | 0.03 | 0.02 | 0.800 | 0.720 | 0.0136 |
C-reg matches the exec-only HumanEval pass@1 within noise and is the strongest condition on held-out MBPP pass@5.
## Findings Summary

- **Matching the baseline within noise is a stability result, not an outperformance.** C-reg does not beat the simpler exec-only run on the primary HumanEval pass@1 metric. The scientific value is that it recovers the drift-induced regression seen in C-light: cutting `aux_loss_scale` by ~3× and doubling `kl_coeff` was sufficient to bring HumanEval pass@1 back to the exec-only neighborhood.
- **MBPP pass@5 is where the co-evolution recipe pays off.** The held-out MBPP pass@5 trajectory is monotonic and ends 4 pp above the baseline (and 3 pp above the exec-only run). This is consistent with the auxiliary signal genuinely helping the model surface a wider distribution of correct solutions on in-distribution problems.
- **Regularization signature.** GRPO KL stays ~7× lower than C-light at iter 3 and completion length contracts only ~22% (vs ~55% for C-light). The two diagnostic fingerprints of the C-light drift are present but bounded; the regularization is doing what it was designed to do.
## Related Artifacts

- Sibling adapters: `spark-code-A-3b` · `spark-code-C-light-3b`
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs per iteration) lives under `condition_C/eval/` in the repository
- Interactive demo Space: [SPACES_URL]
## Citation

```bibtex
@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}
```
## License

The LoRA adapter weights in this repository are released under the Apache 2.0 license. The base model, `Qwen/Qwen2.5-Coder-3B-Instruct`, is distributed under the Tongyi Qianwen LICENSE; any downstream use must comply with its terms.