Laguna-XS.2 — Coding-Specialised Expert Prune + Heal

A LoRA adapter and expert-mask spec that remove ~8.9% of the routed experts in poolside/Laguna-XS.2, with the goal of preserving coding capability.

  • Layer-weighted prune: bottom-25% of routed experts by activation mass in layers 11–18 (the per-layer skew leaderboard's "hot zone"), bottom-5% elsewhere. Total: 884 of 9,984 routed experts removed.
  • Healing: LoRA (rank 16, α 32) on the always-on shared expert's gate_proj / up_proj / down_proj at every sparse layer. Trained for 200 SFT steps on 120 MBPP completions generated by the unpruned full Laguna.
  • Active params per token unchanged (~3B). The intended win is memory footprint at equal quality once the masked experts are sliced out, not speed.
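
The 884 figure can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: 39 sparse layers of 256 routed experts each (which matches the 9,984 total), zero-based layer indices, and a per-layer drop count that is rounded down. The rounding rule is an assumption, not something taken from the training script.

```python
NUM_LAYERS = 39            # sparse MoE layers (39 x 256 = 9,984 routed experts)
EXPERTS_PER_LAYER = 256
HOT_ZONE = range(11, 19)   # layers 11-18 inclusive

def drops_per_layer(layer: int) -> int:
    """Experts pruned in one layer; floor rounding is an assumption."""
    pct = 25 if layer in HOT_ZONE else 5
    return (EXPERTS_PER_LAYER * pct) // 100   # integer maths avoids float rounding

total_experts = NUM_LAYERS * EXPERTS_PER_LAYER
total_dropped = sum(drops_per_layer(i) for i in range(NUM_LAYERS))
fraction = total_dropped / total_experts

print(total_dropped, total_experts, f"{fraction:.1%}")
```

Under these assumptions the hot zone contributes 8 × 64 = 512 experts and the remaining 31 layers contribute 31 × 12 = 372.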

Built for the Poolside × Prime Intellect hackathon on 30 May 2026.

Files

| File | Purpose |
| --- | --- |
| adapter_model.safetensors + adapter_config.json | peft LoRA adapter on the shared expert |
| drops.json | {moe_block_idx (0..38): [expert_indices_to_mask]}; apply BEFORE loading the adapter |
| prune.py | mask_experts() helper to apply the mask via router bias = -inf |
| mask_snap.json | same as drops.json, saved by the training script as a sanity check |
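
Because the mask has to be applied before the adapter is loaded, it can be worth validating drops.json first. The helper below is a hypothetical sketch: validate_drops is not part of prune.py, and the bounds (39 blocks, 256 experts per block) are inferred from the figures in this card.

```python
import json

NUM_BLOCKS = 39           # MoE blocks indexed 0..38
EXPERTS_PER_BLOCK = 256   # assumed from 9,984 routed experts / 39 blocks

def validate_drops(raw: dict) -> dict[int, list[int]]:
    """Return drops with int keys, raising ValueError on malformed entries."""
    drops = {}
    for key, experts in raw.items():
        block = int(key)  # JSON object keys are always strings
        if not 0 <= block < NUM_BLOCKS:
            raise ValueError(f"block index out of range: {block}")
        if len(set(experts)) != len(experts):
            raise ValueError(f"duplicate expert index in block {block}")
        if any(not 0 <= e < EXPERTS_PER_BLOCK for e in experts):
            raise ValueError(f"expert index out of range in block {block}")
        drops[block] = sorted(experts)
    return drops

# Tiny in-memory example standing in for the real drops.json contents.
example = json.loads('{"0": [3, 1], "12": [200, 7, 42]}')
checked = validate_drops(example)
total_masked = sum(len(v) for v in checked.values())
```

On the real file, total_masked should come to 884, and the result should match mask_snap.json after the same key conversion.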

Usage

import json, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from prune import mask_experts  # included in this repo

REPO = "poolside-laguna-hackathon/<this-repo>"

tok = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "poolside/Laguna-XS.2",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# 1. Mask the cold-tail experts (sets router bias to -inf for dropped experts)
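# JSON keys are strings, so convert block indices back to int; the context manager closes the file
with open("drops.json") as f:
    drops = {int(k): v for k, v in json.load(f).items()}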
mask_experts(model, drops)

# 2. Attach the LoRA adapter
model = PeftModel.from_pretrained(model, REPO)
model.eval()

Methodology

  1. Skew measurement: forward-hook every LagunaTopKRouter (39 of them). Accumulate per-(layer, expert) selection counts and renormalised routing-weight mass over 30 HumanEval prompts × 200 greedy-generated tokens (250k routing events).
  2. Cold-tail identification: pick the bottom-N% of experts by mass in each layer. Under DeepSeek-V3-style routing (sigmoid(logits) plus e_score_correction_bias for load balancing), selection counts are pushed toward uniform but mass still skews, so mass, not count, is the ranking signal.
  3. Layer-weighted spec: the per-layer 80%-mass leaderboard showed layers 13–17 needing only 52–66 of 256 experts to cover 80% of the mass (vs. median 87, max 126 in early layers). Hence the 25% prune in layers 11–18 and 5% elsewhere.
  4. Mask, don't slice (yet): mask_experts() sets e_score_correction_bias to -inf for dropped experts. The router's top-k will never pick them, and renormalisation across the surviving top-k handles the rest. Tensors keep their shape; slicing is a separate ship step.
  5. Distillation healing: the frozen full Laguna generates teacher completions on 120 MBPP prompts. The pruned-via-mask model gets a LoRA on the shared expert and is trained to match the teacher's completions via SFT (cross-entropy on the completion tokens only).
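
Steps 1, 2 and 4 can be illustrated on a toy router in pure Python. This is a sketch, not the LagunaTopKRouter code: it assumes selection by sigmoid(logit) + bias with weights taken from the unbiased sigmoid scores and renormalised over the top-k, and it ignores any expert grouping or scaling factor the real router may apply. The expert count, top_k value, and logits are made up for illustration.

```python
import math

def route(logits, bias, top_k):
    """Pick top_k experts by sigmoid(logit) + bias; weight by sigmoid only."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The bias influences which experts are selected, not their weights.
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(scores[e] for e in chosen)
    return {e: scores[e] / norm for e in chosen}   # renormalised over the top-k

def accumulate(tokens, bias, top_k):
    """Sum selection counts and renormalised mass per expert over all tokens."""
    counts, mass = [0] * len(bias), [0.0] * len(bias)
    for logits in tokens:
        for e, w in route(logits, bias, top_k).items():
            counts[e] += 1
            mass[e] += w
    return counts, mass

def cold_tail(mass, fraction):
    """Indices of the bottom `fraction` of experts by accumulated mass."""
    n_drop = int(len(mass) * fraction)
    return sorted(sorted(range(len(mass)), key=lambda e: mass[e])[:n_drop])

def mask(bias, dropped):
    """Copy of the bias with dropped experts set to -inf."""
    return [-math.inf if e in dropped else b for e, b in enumerate(bias)]

# Toy layer: 8 experts, top-2 routing, three tokens of router logits.
tokens = [
    [2.0, 1.5, 0.1, -1.0, 0.3, -2.0, 1.0, 0.0],
    [1.8, 0.2, 0.0, -1.5, 1.9, -2.5, 0.4, 0.1],
    [0.5, 2.2, -0.3, -0.8, 1.1, -3.0, 2.0, 0.2],
]
bias = [0.0] * 8
counts, mass = accumulate(tokens, bias, top_k=2)
dropped = cold_tail(mass, fraction=0.25)           # bottom 2 of 8 by mass
masked_bias = mask(bias, set(dropped))
_, masked_mass = accumulate(tokens, masked_bias, top_k=2)

# Masking an expert that would otherwise win forces the next-best one in.
rerouted = route(tokens[0], mask(bias, {0}), top_k=2)
```

In this toy case the dropped experts were never selected, so masking leaves the accumulated mass unchanged. Masking expert 0 on the first token shows the other behaviour: the router falls back to the next-best expert and the two surviving weights still sum to 1.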

Architectural note

Laguna's routed experts are stored as batched 3D parameter tensors (gate_up_proj: (256, 1024, 2048), down_proj: (256, 2048, 512)) and used via manual matmul, not nn.Linear. peft's LoRA can't target raw parameter slices, so the adapter goes on the always-on shared expert (shared_experts.{gate,up,down}_proj, which ARE nn.Linear). The shared expert sees every token, so a LoRA delta there absorbs the lost routed-expert contribution into the always-on path.
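
As a reminder of what the adapter adds to each shared-expert projection, the standard LoRA update for a linear layer is y = W x + (α / r) · B (A x). With rank 16 and α 32 the scale is 2. The pure-Python sketch below uses toy dimensions and made-up weights; it illustrates the generic LoRA formula, not peft's internals or Laguna's real shapes.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_linear(w, a, b, x, alpha, rank):
    """Frozen base output W x plus the scaled low-rank delta (alpha / rank) * B (A x)."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]

# Toy 2 -> 2 linear layer with a rank-1 adapter (all values are illustrative).
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, shape (out, in)
a = [[1.0, 1.0]]               # LoRA A, shape (rank, in)
b = [[0.5], [0.0]]             # LoRA B, shape (out, rank)
x = [2.0, 3.0]

# alpha / rank = 2, the same scale as alpha 32 with rank 16.
y = lora_linear(w, a, b, x, alpha=2, rank=1)

# With B = 0 (the usual LoRA initialisation) the layer reproduces the base output.
y_zero_b = lora_linear(w, a, [[0.0], [0.0]], x, alpha=2, rank=1)
```

Because the shared expert runs on every token, this delta is applied on every forward pass regardless of which routed experts were selected.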

Honest limitations

  • 8.9% routed-expert reduction is modest. DeepSeek-V3 load balancing fights skew effectively; the cold tail is real but not dramatic.
  • LoRA targets the shared expert, not the routed survivors. Direct healing of the routed survivors needs a custom LoRA wrapper for the batched-tensor storage — left as future work.
  • No HumanEval / SWE-bench numbers reported here — eval pipeline ran into trim-heuristic issues mid-hackathon (model emitted </assistant> chat tokens that the original trim missed). The eval harness is fixed in the source repo; numbers will be uploaded as a follow-up.
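
A trim step of the kind described in the last point might look like the sketch below. The function name, the marker list beyond </assistant>, and the sample completion are all hypothetical; the actual fix lives in the source repo's eval harness and may differ.

```python
# Only "</assistant>" is mentioned in this card; the other markers are assumptions.
STOP_MARKERS = ("</assistant>", "<assistant>", "<user>", "</user>")

def trim_completion(text: str, markers=STOP_MARKERS) -> str:
    """Cut a generated completion at the earliest chat marker, if any."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace but keep leading indentation of the code body.
    return text[:cut].rstrip()

raw = "    return a + b\n</assistant>\n<user>next question"
clean = trim_completion(raw)
```

Cutting at the earliest marker, rather than at the first one in a fixed list, avoids leaking text when the model emits markers in an unexpected order.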

Source code

Full hackathon repo (probe, prune, eval, distillation, model card): see the Source code link in this repo's README.

Built by Jonathan Farrow (@JonnyJF) for the 30 May 2026 Poolside × Prime Intellect hackathon.

