Laguna-XS.2 — Coding-Specialised Expert Prune + Heal
LoRA adapter and mask spec that prune ~8.9% of the routed experts in poolside/Laguna-XS.2 with the aim of preserving coding capability (see Honest limitations for the eval status).
- Layer-weighted prune: bottom-25% of routed experts by activation mass in layers 11–18 (the per-layer skew leaderboard's "hot zone"), bottom-5% elsewhere. Total: 884 of 9,984 routed experts removed.
- Healing: LoRA (rank 16, α 32) on the always-on shared expert's `gate_proj`/`up_proj`/`down_proj` at every sparse layer. Trained for 200 SFT steps on 120 MBPP completions generated by the unpruned full Laguna.
- Active params per token unchanged (~3B). The win is footprint at equal quality, not speed.
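The 884 figure can be reproduced from the layer-weighted spec with a few lines of arithmetic. This is a minimal sketch, not code from the repo: it assumes the 39 sparse layers are indexed 0..38, that "layers 11–18" is an inclusive index range, and that each per-layer budget is rounded down.

```python
# Sketch of the layer-weighted prune budget (assumptions: 0-based layer
# indices, inclusive hot zone 11..18, per-layer budget rounded down).
NUM_LAYERS = 39
EXPERTS_PER_LAYER = 256
HOT_LAYERS = range(11, 19)  # layers 11..18 inclusive

def drops_per_layer(layer: int) -> int:
    """Number of routed experts masked in one layer under the spec."""
    frac = 0.25 if layer in HOT_LAYERS else 0.05
    return int(EXPERTS_PER_LAYER * frac)  # 64 in the hot zone, 12 elsewhere

total_dropped = sum(drops_per_layer(layer) for layer in range(NUM_LAYERS))
total_experts = NUM_LAYERS * EXPERTS_PER_LAYER

print(total_dropped, total_experts)                      # 884 9984
print(round(100 * total_dropped / total_experts, 1))     # 8.9
```

Under these assumptions the hot zone contributes 8 × 64 = 512 experts and the remaining 31 layers contribute 31 × 12 = 372, which matches the stated total.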
Built for the Poolside × Prime Intellect hackathon on 30 May 2026.
Files
| File | Purpose |
|---|---|
| `adapter_model.safetensors` + `adapter_config.json` | peft LoRA adapter on the shared expert |
| `drops.json` | `{moe_block_idx (0..38): [expert_indices_to_mask]}`; apply BEFORE loading the adapter |
| `prune.py` | `mask_experts()` helper to apply the mask via router bias = -inf |
| `mask_snap.json` | same as `drops.json`, saved by the training script as a sanity check |
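For orientation, here is a small sketch of the `drops.json` layout described above. The example mapping and the `validate` helper are hypothetical and not part of this repo. It mainly illustrates that JSON object keys come back as strings, which is why the block indices need converting to `int` after loading.

```python
import json

# Hypothetical miniature of drops.json: MoE block index -> expert indices to mask.
example = {0: [3, 17], 11: [5, 9, 200]}

text = json.dumps(example)   # JSON object keys are always serialised as strings
loaded = json.loads(text)    # keys come back as "0" and "11"
drops = {int(k): v for k, v in loaded.items()}

def validate(drops, num_blocks=39, num_experts=256):
    """Check index ranges and duplicates; return the total number of masked experts."""
    for block, experts in drops.items():
        if not 0 <= block < num_blocks:
            raise ValueError(f"block index out of range: {block}")
        if len(set(experts)) != len(experts):
            raise ValueError(f"duplicate expert index in block {block}")
        if any(not 0 <= e < num_experts for e in experts):
            raise ValueError(f"expert index out of range in block {block}")
    return sum(len(v) for v in drops.values())

print(list(loaded))     # ['0', '11']
print(validate(drops))  # 5
```

Run against the real file, the same count should come out as 884 if the file matches the spec above.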
Usage
import json, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from prune import mask_experts # included in this repo
REPO = "poolside-laguna-hackathon/<this-repo>"
tok = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "poolside/Laguna-XS.2",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# 1. Mask the cold-tail experts (sets router bias to -inf for dropped experts)
with open("drops.json") as f:  # JSON keys are strings, so convert block indices to int
    drops = {int(k): v for k, v in json.load(f).items()}
mask_experts(model, drops)
# 2. Attach the LoRA adapter
model = PeftModel.from_pretrained(model, REPO)
model.eval()
Methodology
- Skew measurement: forward-hook every `LagunaTopKRouter` (39 of them). Accumulate per-(layer, expert) selection counts and renormalised routing-weight mass over 30 HumanEval prompts × 200 greedy-generated tokens (250k routing events).
- Cold-tail identification: pick the bottom-N% of experts by mass per layer. Under DeepSeek-V3-style routing (`sigmoid(logits) + e_score_correction_bias` for load balancing), counts get forced toward uniform but mass still skews. Mass is the truth.
- Layer-weighted spec: the per-layer 80%-mass leaderboard showed layers 13–17 needing only 52–66 of 256 experts for 80% mass (vs. median 87, max 126 in early layers). So: 25% prune in layers 11–18, 5% elsewhere.
- Mask, don't slice (yet): `mask_experts()` sets `e_score_correction_bias` to `-inf` for dropped experts. The router's top-k will never pick them, and the renormalisation across the surviving top-k handles the rest. Tensors keep their shape; slicing is a separate ship step.
- Distillation healing: the frozen full Laguna generates teacher completions on 120 MBPP prompts. The pruned-via-mask model gets a LoRA on the shared expert and is trained to match the teacher's completions via SFT (cross-entropy on the completion tokens only).
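The selection and masking steps above can be sketched in plain Python. This is a toy illustration, not the repo's `prune.py`: the function names `cold_tail`, `experts_for_mass` and `route` are hypothetical, the mass and logit values are made up, and the routing rule (select on `sigmoid(logit) + bias`, weight by the unbiased sigmoid renormalised over the chosen experts) follows the DeepSeek-V3 convention and is assumed rather than confirmed for Laguna.

```python
import math

def cold_tail(mass, frac):
    """Indices of the bottom `frac` of experts by routing-weight mass."""
    n_drop = int(len(mass) * frac)
    order = sorted(range(len(mass)), key=lambda e: mass[e])  # coldest first
    return sorted(order[:n_drop])

def experts_for_mass(mass, target=0.8):
    """Smallest number of top experts whose mass reaches `target` of the total."""
    total = sum(mass)
    running = 0.0
    for count, m in enumerate(sorted(mass, reverse=True), start=1):
        running += m
        if running >= target * total:
            return count
    return len(mass)

def route(logits, bias, k):
    """Toy top-k router: the bias affects selection only, not the final weights."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[e] for e in chosen)  # renormalise over the surviving top-k
    return {e: scores[e] / norm for e in chosen}

# Made-up per-expert mass for one 8-expert layer.
mass = [0.30, 0.02, 0.25, 0.01, 0.20, 0.03, 0.15, 0.04]
dropped = cold_tail(mass, 0.25)  # the two coldest experts

# Mask by setting the selection bias to -inf for dropped experts.
bias = [0.0] * len(mass)
for e in dropped:
    bias[e] = -math.inf

# Experts 1 and 3 have the highest logits here but are masked out.
logits = [0.0, 5.0, 1.0, 4.0, 0.5, -1.0, 2.0, 0.0]
weights = route(logits, bias, k=2)

print(dropped)                 # [1, 3]
print(sorted(weights))         # [2, 6]
print(experts_for_mass(mass))  # 4
```

Even though the two masked experts would have won the unmasked top-2, the `-inf` bias pushes them to the bottom of the ranking, and the weights of the two survivors still sum to 1.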
Architectural note
Laguna's routed experts are stored as batched 3D parameter tensors
(gate_up_proj: (256, 1024, 2048), down_proj: (256, 2048, 512)) and used
via manual matmul, not nn.Linear. peft's LoRA can't target raw
parameter slices, so the adapter goes on the always-on shared expert
(shared_experts.{gate,up,down}_proj, which ARE nn.Linear). The shared
expert sees every token, so a LoRA delta there absorbs the lost routed-expert
contribution into the always-on path.
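The shapes above also give a rough sense of what the masked experts would free once they are sliced out. This is a back-of-envelope sketch under stated assumptions: the two tensors listed are taken to be the only per-expert weights, storage is bf16 (2 bytes per parameter) as in the Usage snippet, and the saving is only realised after slicing, since masking keeps tensor shapes.

```python
# Shapes as given in the architectural note: (num_experts, out_dim, in_dim).
GATE_UP_SHAPE = (256, 1024, 2048)
DOWN_SHAPE = (256, 2048, 512)
DROPPED_EXPERTS = 884   # total masked routed experts across all layers
BYTES_PER_PARAM = 2     # assumption: bf16 storage

def params_per_expert(*shapes):
    """Parameters owned by a single expert: product of the non-expert dims."""
    total = 0
    for shape in shapes:
        per_expert = 1
        for dim in shape[1:]:  # skip the leading expert dimension
            per_expert *= dim
        total += per_expert
    return total

per_expert = params_per_expert(GATE_UP_SHAPE, DOWN_SHAPE)
dropped_params = per_expert * DROPPED_EXPERTS
dropped_gib = dropped_params * BYTES_PER_PARAM / 2**30

print(f"{per_expert:,} params per routed expert")
print(f"{dropped_params:,} params in the masked experts")
print(f"~{dropped_gib:.2f} GiB in bf16 if those experts were sliced out")
```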
Honest limitations
- 8.9% routed-expert reduction is modest. DeepSeek-V3 load balancing fights skew effectively; the cold tail is real but not dramatic.
- LoRA targets the shared expert, not the routed survivors. Direct healing of the routed survivors needs a custom LoRA wrapper for the batched-tensor storage — left as future work.
- No HumanEval / SWE-bench numbers reported here: the eval pipeline ran into trim-heuristic issues mid-hackathon (the model emitted `</assistant>` chat tokens that the original trim missed). The eval harness is fixed in the source repo; numbers will be uploaded as a follow-up.
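To make the trim issue concrete, here is a minimal sketch of the kind of post-processing involved. It is not the harness from the source repo: `trim_completion` is a hypothetical helper, and `</assistant>` is the only stop marker taken from the description above.

```python
# Hypothetical stop markers; only "</assistant>" is named in this model card.
STOP_MARKERS = ("</assistant>",)

def trim_completion(text, stop_markers=STOP_MARKERS):
    """Cut a generated completion at the earliest stop marker, if any is present."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

raw = "def add(a, b):\n    return a + b\n</assistant>\nSure, here is more text"
print(trim_completion(raw))
```

A trim that only looks for code-level cues (for example a new top-level `def`) would leave the chat marker and everything after it in the scored completion, which is enough to break execution-based evals.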
Source code
Full hackathon repo (probe, prune, eval, distillation, model card): see the
Source code link in this repo's README.
Built by Jonathan Farrow (@JonnyJF) for the 30 May 2026 Poolside × Prime Intellect hackathon.