Get an API key · Release blog post · Technical report
Laguna M.1-FP8
Laguna M.1-FP8 is a 225B total parameter Mixture-of-Experts model with 23B activated parameters per token designed for agentic coding and long-horizon work. This is the FP8-quantized variant of Laguna M.1.
This is the FP8 variant. The BF16 and NVFP4 variants are also available on Hugging Face.
Highlights
- Large sparse MoE for agentic coding: Laguna M.1 is a 70-layer MoE transformer with 225B total parameters and 23B activated parameters per token
- High-capacity expert routing: After 3 dense SwiGLU layers, Laguna M.1 uses 67 sparse MoE layers with 256 experts, top-k=16 routing and auxiliary-loss-free load balancing
- Global attention architecture: Laguna M.1 uses global attention across all layers with 64 Q-heads, 8 KV-heads and softplus attention output gating
- Native reasoning support: Interleaved thinking between tool calls with support for enabling and disabling thinking per-request
- Apache 2.0 license: Use and modify freely for commercial and non-commercial purposes
Model overview
- Training: pre-training, post-training and reinforcement learning stages
- Number of parameters: 225B total with 23B activated per token
- Optimizer: Muon
- Layers: 70 layers with global attention
- Experts: 256 experts with 1 shared expert; top-k=16 routing
- Dense layers: first 3 layers are dense SwiGLU; remaining 67 layers are sparse MoE
- Attention: 64 Q-heads, 8 KV-heads, head dimension 128, with softplus attention output gating
- Positional encoding: RoPE with YaRN
- Modality: text-to-text
- Context window: 262,144 tokens
- Reasoning support: interleaved thinking with preserved thinking
- Quantization: FP8 (weights), detected automatically from
quantization_config
Benchmark results
| Model | Parameters | SWE-bench Verified | SWE-bench Multilingual | SWE-bench Pro (Public Dataset) | Terminal-Bench 2.0 |
|---|---|---|---|---|---|
| Laguna M.1 (BF16) | 225B-A23B | 74.6% | 63.1% | 49.2% | 45.8% |
| Devstral 2 | 123B dense | 72.2% | 61.3% | - | 32.6% |
| GLM-4.7 | 355B-A32B | 73.8% | 66.7% | - | 41.0% |
| DeepSeek-V4 Flash | 284B-A13B | 79.0% | 73.3% | 52.6% | 56.9% |
| Qwen3.5-397B-A17B | 397B-A17B | 76.2% | 69.3% | 50.9% | 52.5% |
| Claude Sonnet 4.6 | - | 79.6% | - | - | 59.1% |
Scores shown are for the BF16 reference model; see the main Laguna M.1 model card for full benchmarking methodology. We used the highest publicly-referenced scores for all comparison models across each benchmark.
Usage
Laguna M.1 has upstream support in vLLM, SGLang, and TRT-LLM thanks to the support of the team at NVIDIA.
For complete usage instructions, see the main Laguna M.1 model card.
Deployment
vLLM
The full vLLM recipe is on the main Laguna M.1 model card. Quantization is detected automatically from quantization_config in this checkpoint, so the same command works with poolside/Laguna-M.1-FP8 substituted for the model ID. No extra flags required.
pip install 'vllm>=0.21.0'
vllm serve \
--model poolside/Laguna-M.1-FP8 \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--enable-auto-tool-choice \
--served-model-name laguna \
--default-chat-template-kwargs '{"enable_thinking": true}'
SGLang
Laguna M.1 is supported in SGLang via sgl-project/sglang#28400. Quantization is detected automatically from quantization_config, so no extra flags are required. A full serving recipe will be added to the main Laguna M.1 model card.
TRT-LLM
Laguna is supported in TensorRT-LLM thanks to the team at NVIDIA (NVIDIA/TensorRT-LLM#13559, with partial-RoPE fusion in #15110). The full recipe is on the main Laguna M.1 model card. Quantization is detected automatically from quantization_config in this checkpoint, so no extra flags are required.
Controlling reasoning
Laguna M.1 has native reasoning support and is designed to work best with preserved thinking, where reasoning content from prior assistant messages is preserved in the message history. This model will generally reason before calling tools and between tool calls. See the main Laguna M.1 model card for streaming, tool-call, and preserved-thinking examples.
Disabling reasoning
You can disable thinking by setting enable_thinking to False in a request or by not providing --default-chat-template-kwargs {"enable_thinking": True} or equivalent when starting the server.
License
This model is licensed under the Apache 2.0 License.
Intended and Responsible Use
Laguna M.1 is designed for software engineering and agentic coding use cases, and you are responsible for confirming that it is appropriate for your intended application. Laguna M.1 is subject to the Apache 2.0 License, and should be used consistently with Poolside's Acceptable Use Policy. We advise against circumventing Laguna M.1 safety guardrails without implementing substantially equivalent mitigations appropriate for your use case.
Please report security vulnerabilities or safety concerns to security@poolside.ai.
- Downloads last month
- 281
Model tree for poolside/Laguna-M.1-FP8
Base model
poolside/Laguna-M.1