arxiv:2605.27276

SIA: Self Improving AI with Harness & Weight Updates

Published on May 26 · Submitted by Vignesh Baskaran on Jun 8

Abstract

A self-improving AI framework simultaneously updates both model weights and task-specific agent architecture through a language-model feedback agent across legal classification, GPU optimization, and biological data denoising tasks.

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
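The loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: `AgentState`, `self_improve`, and the `toy_*` functions are hypothetical names, the harness is reduced to a tuple of strings, the weights to a single number, and the keep-if-better acceptance rule is an assumption made only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AgentState:
    """Hypothetical stand-in for a task-specific agent."""
    harness: Tuple[str, ...]    # stands in for prompts, tools, retry logic
    weights: Tuple[float, ...]  # stands in for model parameters


Update = Callable[[AgentState, float], AgentState]


def self_improve(
    state: AgentState,
    evaluate: Callable[[AgentState], float],
    update_harness: Update,
    update_weights: Update,
    rounds: int,
) -> Tuple[AgentState, List[float]]:
    """Apply both kinds of update each round; keep a candidate only if it scores higher."""
    best, best_score = state, evaluate(state)
    history = [best_score]
    for _ in range(rounds):
        # Assumption: the feedback step proposes a harness edit, then a weight edit.
        candidate = update_harness(best, best_score)
        candidate = update_weights(candidate, best_score)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


# Toy task: the score rewards more harness components and weights close to 1.0.
def toy_evaluate(s: AgentState) -> float:
    return len(set(s.harness)) - abs(1.0 - s.weights[0])


def toy_update_harness(s: AgentState, score: float) -> AgentState:
    return replace(s, harness=s.harness + (f"tool_{len(s.harness)}",))


def toy_update_weights(s: AgentState, score: float) -> AgentState:
    w = s.weights[0]
    return replace(s, weights=(w + 0.5 * (1.0 - w),))


start = AgentState(harness=("prompt",), weights=(0.0,))
final, history = self_improve(
    start, toy_evaluate, toy_update_harness, toy_update_weights, rounds=3
)
print(history)  # [0.0, 1.5, 2.75, 3.875]
```

In this toy setting both levers contribute to the score independently. The only point of the sketch is that one outer loop can own both the scaffold and the parameters, instead of holding one of them fixed.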

Community

Comment from the paper author and submitter:

We just released SIA, a self-improving AI loop that updates both an agent’s harness/scaffold and its model weights. Most self-improving-agent work so far turns one knob: either the scaffold changes while weights stay fixed, or weights adapt through a fixed training pipeline. SIA tries to combine both.

The surprising part for us was how different the two levers seem to be: harness updates mostly improved “software engineering” around the model — parsing, retries, tools, search procedures — while weight updates seemed to add task-specific intuition that the scaffold never discovered. In our experiments, SIA-W+H beat harness-only across law classification, CUDA kernel optimization, and scRNA denoising.

Curious what HF thinks: is self-improvement more likely to come from better scaffolds, better test-time training, or systems that co-evolve both? And what would be the cleanest benchmark to tell the difference without Goodharting the verifier?
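To make the "software engineering" side of a harness concrete, here is a minimal sketch of output parsing plus retries around a fixed model. It is an illustration only, not code from the paper: `parse_label`, `run_with_harness`, and `flaky_model` are hypothetical, and the model is a plain Python callable rather than a real LLM.

```python
import re
from typing import Callable, Optional, Set, Tuple


def parse_label(raw: str, labels: Set[str]) -> Optional[str]:
    """Return the first known label found in the raw model output, else None."""
    for token in re.findall(r"[A-Za-z_]+", raw):
        if token.lower() in labels:
            return token.lower()
    return None


def run_with_harness(
    model: Callable[[str], str],
    prompt: str,
    labels: Set[str],
    max_retries: int = 2,
) -> Tuple[Optional[str], int]:
    """Call the model, parse its answer, and retry with a stricter prompt on failure."""
    attempt_prompt = prompt
    for attempt in range(max_retries + 1):
        label = parse_label(model(attempt_prompt), labels)
        if label is not None:
            return label, attempt + 1
        # The weights stay fixed; only the scaffold around the model changes.
        attempt_prompt = (
            prompt + "\nAnswer with exactly one of: " + ", ".join(sorted(labels))
        )
    return None, max_retries + 1


# Hypothetical model that only answers cleanly when given a strict instruction.
def flaky_model(prompt: str) -> str:
    return "theft" if "exactly one of" in prompt else "I think it might be..."


label, attempts = run_with_harness(
    flaky_model, "Classify the charge.", {"theft", "fraud"}
)
print(label, attempts)  # theft 2
```

Everything here is scaffold: the model callable is never modified, which is the harness-only setting the comment contrasts with weight updates.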


