laguna xs.2, two hacks

we tried two things.

one was making laguna smaller. one was making laguna better at terminal agents.

neither became magic in 24 hours. both produced useful artifacts.

track 1: 2 bit local laguna

we got laguna xs.2 running locally on a 24 gb macbook.

not a toy model. real laguna xs.2. 33b total params, about 3b active per token.

main local artifact:

models/laguna-gguf/laguna-xs2-arcnext-mixed-IQ1_IQ2.gguf

size:

7.75 gib
1.99 bpw
sha256: fbad385a799897bdba7bbd7d19c1f4f2c3ff7c534326eeac948ea5182d328ec6

note: the gguf exists locally and its manifest is uploaded. the binary itself is not uploaded in this repo because the only copy is on a slow local network path. the imatrix, quant policy, scripts, and checksum are included.

stock mlx and ollama did not support laguna gguf. patched llama.cpp from lucebox did.

runtime:

patched llama.cpp
laguna gguf
llama-server
openai-compatible cleanup proxy
arcnext and opencode experiments

what worked:

laguna loads locally
fits in memory
generates browser-control json
arcnext calibration, sft, and dpo data exists
local proxy cleans some control-token leakage

what did not fully work:

full opencode autonomous tool use is still flaky
1.99 bpw is aggressive and causes some weird output
tool execution needs more hard routing or tuning

the point:

laguna can be squeezed hard enough to run locally.
quality is not free, but the path works.

track 2: terminal agent rl

we built a mini terminal-bench style rl environment for laguna.

the model gets a tiny repo, reads files, patches code, runs pytest, and calls finish.

artifacts:

armantsaturian/shell-agent-bench
237 synthetic repo tasks
42 hard train tasks
9 hard eval tasks
configs for hosted rl on prime
checkpoints and lora adapters

we generated terminal-agent data with gpt-5.5, pushed prime environments, and ran multiple laguna xs.2 rl jobs. the final step 12 adapter from the hard v7 run is uploaded from prime storage.

the useful technical result was not that we crushed terminal-bench.

we did not.

the useful result was finding the failure mode.

binary reward was too sparse. the model got zero advantage and training collapsed.

then we fixed the checker bug and added shaped reward:

reward = 0.7 * full pass + 0.3 * partial hidden checks

that made training work mechanically:

zero advantage: 0.0
effective batch size: 1.0

but it optimized the proxy. the model learned partial static-check progress, not reliable completion.

held-out hard eval:

base: 1/9
final: 0/9

so no fake win.

what worked:

reusable prime rl env
real pytest execution
hidden checks
synthetic terminal-agent dataset
binary vs shaped reward ablation
clear diagnosis of sparse reward failure

what did not work:

held-out hard eval did not improve
tool surface was too small
model hit max turns too often
shaped reward overfit partial progress

the point:

terminal agent rl is not just make tasks and train.
you need calibrated difficulty, real tool affordances, and completion-aligned reward.
we built the scaffold and found the wall.

files

models/laguna-gguf/laguna-xs2-arcnext-mixed-IQ1_IQ2.gguf
models/laguna-gguf/laguna-xs2.imatrix
models/prime-mmspxu-step12/adapter_model.safetensors
models/prime-mmspxu-step12/adapter_config.json
models/prime-mmspxu-step12/SOURCE.json
data/
configs/
scripts/
environments/
envs/
docs/

final take

one track made laguna smaller.

one track made laguna's agent training failure mode clearer.

both are useful. neither is a leaderboard claim.

the submission is honest: artifacts, data, configs, checkpoints, and negative results that are actually reproducible.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for poolside-laguna-hackathon/laguna-xs2-two-hacks

Adapter
(7)
this model