Squeeze-Release: Iterative Pruning with Exact Structural Minimization
Abstract
The Squeeze-Release compression method combines pruning with structural minimization to produce significantly smaller neural networks while maintaining accuracy, and extends to transformer architectures through CompensatedLayerNorm.
Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release makes the deployable network 39x smaller than the unpruned model on a fully-connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition, we prove that the rewrite can be extended to transformer architectures.
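To make the idea of minimization concrete, here is a minimal NumPy sketch for a two-layer ReLU network. It is not the paper's algorithm: the function name `minimize_mlp` and the two removal rules (a hidden neuron with no incoming weights is a constant and can be folded into the next bias; a hidden neuron with no outgoing weights is never read) are assumptions chosen for illustration.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # y = W2 @ relu(W1 @ x + b1) + b2
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def minimize_mlp(W1, b1, W2, b2):
    """Hypothetical helper: drop hidden neurons made redundant by exact zeros.

    W1 has shape (hidden, in), W2 has shape (out, hidden).
    """
    dead_in = ~W1.any(axis=1)    # ignores x, so it outputs the constant relu(b1)
    dead_out = ~W2.any(axis=0)   # its output is never read by the next layer
    # Fold the constant neurons into the next layer's bias before dropping them.
    const = np.maximum(b1, 0.0) * dead_in
    b2_new = b2 + W2 @ const
    keep = ~(dead_in | dead_out)
    return W1[keep], b1[keep], W2[:, keep], b2_new

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(3, 8)); b2 = rng.normal(size=3)
W1[[1, 5]] = 0.0      # neurons 1 and 5 lose all incoming weights
W2[:, [2, 5]] = 0.0   # neurons 2 and 5 lose all outgoing weights

small = minimize_mlp(W1, b1, W2, b2)
x = rng.normal(size=4)
# Same forward function up to floating-point rounding.
assert np.allclose(forward(x, W1, b1, W2, b2), forward(x, *small))
print(small[0].shape, small[2].shape)  # (5, 4) (3, 5)
```

The sketch only removes whole neurons. Individual zeros inside the rows that survive stay in the compacted tensors, which is the kind of leftover capacity the release step is described as re-enabling.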
Community
Neural networks are often far bigger than they need to be, and "pruning" refers to removing the components that add little to a model's performance. The catch: the most common pruning methods report a large number of disabled parameters, but the model you actually deploy is often no smaller, because the tensors keep their original dense shape for better hardware compatibility (this is also how pruning is implemented in default PyTorch).
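A small NumPy illustration of that gap, assuming plain magnitude pruning with a binary mask (the 90% ratio and the matrix size are arbitrary choices for the example, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

# Magnitude pruning: zero the 90% of weights with the smallest absolute value.
threshold = np.quantile(np.abs(W), 0.9)
mask = np.abs(W) > threshold
W_pruned = W * mask

sparsity = 1.0 - mask.mean()
print(f"reported sparsity: {sparsity:.2f}")
# The tensor shape and its dense storage footprint are unchanged.
print(W_pruned.shape == W.shape, W_pruned.nbytes == W.nbytes)
```

The reported sparsity is high, yet the stored tensor is exactly as large as before, because the zeros are still stored as ordinary dense entries.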
Our new preprint, "Squeeze-Release: Iterative Pruning with Exact Structural Minimization," closes that gap. We rebuild a pruned network as a genuinely smaller dense one with the same output, then iterate to keep finding redundancy a single pass would miss. In practice this compresses the deployable model up to ~39× on a fully-connected network and ~14.8× on ConvNeXt-Tiny, at comparable accuracy.
We also propose CompensatedLayerNorm, a modified LayerNorm that makes it possible to prune connections passing through LayerNorm in a function-preserving way.
Neat paper. The idea of turning sparse masked networks into actually smaller dense ones via structural minimization is a nice shift from just having zeros sit in memory. It is interesting to see that the re-introduced noise during the release step actually helps the model find more redundancy in later iterations.
I’m curious, how does the CompensatedLayerNorm handle the math differently to allow for that channel reduction across residual streams?
I made a podcast on it with ResearchPod; it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/6de1fcd5-b453-47df-bf44-c38e1a92d27b
Thank you for your kind evaluation.
CompensatedLayerNorm is not tied to the residual-stream concept. It is a way to prune connections between layers that pass through a conventional LayerNorm when the number of channels is reduced, in a function-preserving way and with a minimal number of additional stored parameters.
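The exact definition of CompensatedLayerNorm is not given in this thread, so the following is only a guess at how such a compensation could work, not the paper's construction. It assumes that the removed input channels are exactly zero for every input and that their outputs are not read downstream. Under that assumption the mean and variance over the original channel count can be recovered from the kept channels alone, at the cost of storing one extra number. The name `compensated_layer_norm_sketch` and the parameter `d_orig` are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Conventional LayerNorm over all channels of a single vector.
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def compensated_layer_norm_sketch(x_kept, gamma_kept, beta_kept, d_orig, eps=1e-5):
    # Assumption: the removed channels were exactly zero, so sums over the
    # original d_orig channels equal sums over the kept channels.
    mu = x_kept.sum() / d_orig
    var = (x_kept ** 2).sum() / d_orig - mu ** 2
    return (x_kept - mu) / np.sqrt(var + eps) * gamma_kept + beta_kept

rng = np.random.default_rng(0)
d_orig = 16
keep = np.array([0, 3, 4, 7, 9, 12])   # channels that survive the reduction
x_full = np.zeros(d_orig)
x_full[keep] = rng.normal(size=keep.size)
gamma = rng.normal(size=d_orig)
beta = rng.normal(size=d_orig)

reference = layer_norm(x_full, gamma, beta)[keep]
compact = compensated_layer_norm_sketch(x_full[keep], gamma[keep], beta[keep], d_orig)
assert np.allclose(reference, compact)
```

A plain LayerNorm applied to the six kept channels would normalize by 6 instead of 16 and change the output, which is why some form of compensation is needed when channels are removed.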
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Small LLMs: Pruning vs. Training from Scratch (2026)
- Replacement Learning: Training Neural Networks with Fewer Parameters (2026)
- LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models (2026)
- SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference (2026)
- Pruning Deep Neural Networks via the Marchenko--Pastur Distribution (2026)
- From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation (2026)
- Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression (2026)
Get this paper in your agent:
hf papers read 2606.14346
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash