SPARK-Code · Condition C-light (Naive Co-Evolve) · Qwen2.5-Coder-3B QLoRA

QLoRA adapter trained with naive SPARK-style auxiliary recycling on top of GRPO. It demonstrates the policy-drift failure mode: −2.3 pp HumanEval pass@1 versus the frozen baseline.

TL;DR

spark-code-C-light-3b is a LoRA adapter for Qwen/Qwen2.5-Coder-3B-Instruct produced by 3 iterations of GRPO with an interleaved supervised auxiliary objective (pointwise / pairwise / reflection labels mined from each iteration's rollouts). The auxiliary loss scale is left at a "natural" value (0.1) with reflection weight 1.0 and a low KL coefficient (0.01); that is, the SPARK-style recycling is applied without the regularization tweaks of Condition C-reg. The result is a clear regression on HumanEval pass@1 (0.796 → 0.773, −2.3 pp) and pass@5 (0.854 → 0.823, −3.0 pp), together with runaway KL divergence and a roughly 55% contraction in completion length. This card documents that negative result. It is published for reproducibility and as a calibration point for the regularized variant; it is not recommended for downstream use.

Training Setup

  • Base model: Qwen/Qwen2.5-Coder-3B-Instruct
  • Method: GRPO (exec-only reward, partial per-test scoring, frozen-reference KL) + an auxiliary SFT phase per iteration. Auxiliary examples are mined from each iteration's GRPO rollouts in three flavors (a mining sketch follows this list):
    • Pointwise — binary "Correct/Incorrect" judgments over a single candidate
    • Pairwise — randomized A/B preference between a passing and a failing rollout
    • Reflection — execution-grounded repair, target = a sibling correct rollout or the MBPP canonical solution (reflection_target_mode=correct_or_canonical)
  • Training data: MBPP-sanitized, 200 problems, 3 iterations, K=4 adaptive rollouts (up to 8), partial per-test rewards with syntax_penalty=-0.2, runtime_penalty=-0.1, timeout_penalty=-0.3.
  • LoRA: r=16, alpha=32, dropout=0.05, targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
  • Quantization: 4-bit NF4 + double quant, bf16 compute.
  • Optimizer: AdamW, lr=5e-6, grad_accum=4, clip_ratio=0.2, max_grad_norm=1.0.
  • GRPO KL: kl_coeff=0.01 against the frozen reference policy.
  • Aux hyperparameters (this run): aux_loss_scale=0.1, aux_weight_pointwise=0.0, aux_weight_pairwise=0.1, aux_weight_reflection=1.0, aux_epochs=1, aux_max_len=1024 (a sketch of how these compose appears below).
  • Aux pool sizes (after caps): iter 1 → 594 (pointwise 200 / pairwise 151 / reflection 243), iter 2 → 600 (200/154/246), iter 3 → 556 (200/137/219).
  • Seed: 42.
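
For concreteness, the three auxiliary flavors can be mined from a single problem's rollouts roughly as follows. This is an illustrative sketch, not the repo's actual code; the rollout schema and function name are assumptions.

import random

def mine_aux_examples(problem, rollouts, canonical_solution):
    # rollouts: list of dicts like {"code": str, "passed": bool} (assumed schema)
    passing = [r for r in rollouts if r["passed"]]
    failing = [r for r in rollouts if not r["passed"]]
    examples = []
    # Pointwise: binary Correct/Incorrect judgment over a single candidate.
    for r in rollouts:
        examples.append({"kind": "pointwise", "input": (problem, r["code"]),
                         "label": "Correct" if r["passed"] else "Incorrect"})
    # Pairwise: randomized A/B preference between a passing and a failing rollout.
    if passing and failing:
        pair = [random.choice(passing), random.choice(failing)]
        random.shuffle(pair)  # randomize which side the passing rollout lands on
        examples.append({"kind": "pairwise",
                         "input": (problem, pair[0]["code"], pair[1]["code"]),
                         "label": "A" if pair[0]["passed"] else "B"})
    # Reflection: execution-grounded repair; the target is a sibling correct
    # rollout, else the canonical solution (correct_or_canonical).
    for r in failing:
        target = random.choice(passing)["code"] if passing else canonical_solution
        examples.append({"kind": "reflection", "input": (problem, r["code"]),
                         "target": target})
    return examples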

Training script: run_experiment_with_mbpp_heldout.py in the GitHub repo.
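
The aux hyperparameters above compose into a single scaled loss that is interleaved with the GRPO phase. A minimal sketch, assuming per-flavor cross-entropy losses are already computed (the function name and dict layout are illustrative):

import torch

AUX_LOSS_SCALE = 0.1
AUX_WEIGHTS = {"pointwise": 0.0, "pairwise": 0.1, "reflection": 1.0}

def combined_aux_loss(aux_losses):
    # aux_losses: dict mapping flavor -> scalar SFT loss tensor for that flavor's batch
    weighted = sum(AUX_WEIGHTS[k] * v for k, v in aux_losses.items())
    return AUX_LOSS_SCALE * weighted

# The GRPO phase separately penalizes drift via kl_coeff=0.01 against the frozen
# reference; with reflection weighted at 1.0, this aux term can overwhelm that penalty.
example = {"pointwise": torch.tensor(0.7),
           "pairwise": torch.tensor(0.9),
           "reflection": torch.tensor(1.4)}
print(combined_aux_loss(example))  # 0.1 * (0.0*0.7 + 0.1*0.9 + 1.0*1.4) = 0.149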

Evaluation Results

HumanEval is evaluated with 5 samples per problem at temperature=0.2, top_p=0.95; pass@k is computed from those samples (a standard estimator is sketched below). Held-out MBPP uses 100 problems disjoint from the training pool. The reflection fix rate is measured on HumanEval: for each failed first-pass generation, the model is asked to repair its own code and the repaired program is re-executed.
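
For reference, the standard unbiased pass@k estimator (Chen et al., 2021) is shown here; that the repo's scorer follows this exact formula is an assumption.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples per problem, c = samples that pass, k <= n
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 1), pass_at_k(5, 2, 5))  # 0.4 1.0 when 2 of 5 samples pass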

| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL | Refl. fix rate |
|------|------------------|------------------|------------------|------------------|-----------------|---------|----------------|
| 0    | 0.796            | 0.854            | 0.634            | 0.680            | –               | –       | 0.118          |
| 1    | 0.757            | 0.823            | 0.634            | 0.690            | 0.603           | 0.0002  | 0.079          |
| 2    | 0.752            | 0.823            | 0.636            | 0.690            | 0.602           | 0.0761  | 0.132          |
| 3    | 0.773            | 0.823            | 0.658            | 0.680            | 0.658           | 0.0941  | 0.139          |

Trajectory. HumanEval pass@1 drops sharply at iter 1 (−3.9 pp), partially recovers at iter 3, and finishes 2.3 pp below the baseline. HumanEval pass@5 falls and stays flat at 0.823. Held-out MBPP pass@1 actually improves to 0.658 (+2.4 pp) — i.e. the model is becoming better in-distribution on MBPP-style tasks but worse on the cross-benchmark HumanEval suite, the canonical signature of distributional overfitting to the recycled aux pool. Two diagnostic signals corroborate this:

  • KL divergence explodes from iter 2 onward. It grows from 2.1e-4 at iter 1 to 0.076 at iter 2 and 0.094 at iter 3, roughly a 450× growth within the run and about 88× the matched Condition A KL at iter 3 (0.0011). The frozen-reference regularizer is being overwhelmed by the aux SFT signal.
  • Completion length collapses. Mean tokens per GRPO sequence drop from 182 → 97 → 82 across iterations (a 55% reduction), consistent with the policy concentrating on a narrow, shorter output mode shaped by the reflection-heavy aux distribution.
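
Both signals are cheap to track during training. An illustrative monitor, assuming per-token log-probs of the sampled tokens under the current policy and the frozen reference are available (the k1 sampled-token KL estimator here is an assumption, not necessarily the repo's exact estimator):

import torch

def drift_diagnostics(policy_logps, ref_logps, completion_lengths):
    # policy_logps / ref_logps: 1-D tensors of per-token log-probs on sampled tokens
    kl = (policy_logps - ref_logps).mean().item()  # k1 estimate of KL(policy || ref)
    mean_len = sum(completion_lengths) / len(completion_lengths)
    return {"grpo_kl": kl, "mean_completion_tokens": mean_len}

# In this run the monitor reads grpo_kl 0.0002 -> 0.0761 -> 0.0941 and
# mean_completion_tokens 182 -> 97 -> 82 across the three iterations.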

The reflection fix-rate metric is noisy (n=32–38 repairs tested per iteration) and ends slightly above baseline (0.139 vs 0.118), but not by enough to offset the first-pass regression.
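
The fix-rate measurement amounts to the loop below; generate and run_tests are hypothetical stand-ins for the repo's sampler and sandboxed executor.

def reflection_fix_rate(failed_firsts, generate, run_tests):
    # failed_firsts: list of (problem, failing_code, tests) from the first pass
    fixed = 0
    for problem, failing_code, tests in failed_firsts:
        repair_prompt = (f"{problem}\n\nYour previous attempt failed its tests:\n"
                         f"{failing_code}\nReturn a corrected Python solution.")
        if run_tests(generate(repair_prompt), tests):  # re-execute the repair
            fixed += 1
    return fixed / max(len(failed_firsts), 1)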

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in bf16 and attach the C-light LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-C-light-3b")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build a chat-formatted prompt via the tokenizer's chat template.
prompt = tok.apply_chat_template(
    [{"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
     {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Sample with the same decoding settings used for evaluation (temperature=0.2, top_p=0.95).
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
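
If you want a standalone checkpoint without the PEFT wrapper, the adapter can be merged into the base weights. This is standard PEFT usage; the output path is just an example.

merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("spark-code-C-light-3b-merged")
tok.save_pretrained("spark-code-C-light-3b-merged")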

Comparison to Other Conditions

All three adapters share the same base model, training pool, seed, and rollout budget. They differ only in the auxiliary objective and KL strength.

| Condition | aux_loss_scale | kl_coeff | HumanEval pass@1 (it 3) | MBPP-held pass@5 (it 3) | GRPO KL (it 3) |
|-----------|----------------|----------|-------------------------|-------------------------|----------------|
| A (exec-only) | 0.00 | 0.01 | 0.805 | 0.690 | 0.0011 |
| C-light (naive co-evolve, this card) | 0.10 | 0.01 | 0.773 | 0.680 | 0.0941 |
| C-reg (regularized co-evolve) | 0.03 | 0.02 | 0.800 | 0.720 | 0.0136 |

C-light is the lowest-performing condition on HumanEval pass@1 and shows roughly 88× the KL drift of the exec-only baseline at iter 3.

Findings Summary

  • Negative result, reported honestly. Naive SPARK-style aux recycling at aux_loss_scale=0.1 and kl_coeff=0.01 regresses HumanEval pass@1 by 2.3 pp relative to the frozen base model. The auxiliary signal — dominated by reflection examples whose targets are other MBPP rollouts or canonical MBPP code — pulls the policy off the broader HumanEval distribution.
  • Two diagnostic fingerprints of the drift. GRPO KL grows ~450× within-run and ~88× relative to the matched exec-only run at iter 3; mean completion length contracts by ~55%. These two signals together motivated the regularized variant in C-reg.
  • Held-out MBPP keeps improving. This is what makes the run scientifically interesting rather than just broken: in-distribution MBPP pass@1 ends 2.4 pp above baseline. The failure is specifically a cross-benchmark generalization failure, not a training failure.

Related Artifacts

The matched Condition A (exec-only baseline) and Condition C-reg (regularized co-evolve) adapters come from the same run family and share this base model, training pool, seed, and rollout budget; see the comparison table above.

Citation

@misc{batjargal2026sparkcode,
  title  = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
  author = {Amarsaikhan Batjargal},
  year   = {2026},
}

License

The LoRA adapter weights in this repository are released under the Apache 2.0 license. The base model, Qwen/Qwen2.5-Coder-3B-Instruct, is distributed under the Qwen Research License; any downstream use must comply with its terms.
