Laguna architecture not yet supported by local Mac inference runtimes (llama.cpp, mlx-lm, LM Studio, Ollama)
Heads-up for the Poolside team — the laguna architecture isn't loadable on any local Mac inference runtime as of May 2026. Only vLLM (GPU, mainline) and the hosted options work.
Tested runtimes
| Runtime | Version | Result |
|---|---|---|
| llama.cpp | build 9330 | unknown model architecture: 'laguna' |
| LM Studio | 0.4.14+4 | Fails to load (both GGUF and MLX variants) |
| mlx-lm | 0.31.3 | Architecture not recognized |
| Ollama | 0.24.0 | Loads but garbled output (#15892) |
| vLLM Metal | 0.2.0 (Apple Silicon plugin) | Not in supported models list |
| HF Transformers | 5.9.0 | LagunaForCausalLM exists but no quantized MPS path (GGUF loader rejects, bitsandbytes is CUDA-only, float16 = 62 GB) |
| vLLM (mainline, GPU) | 0.21.0 | Works (PR #41129) |
Models tested
- GGUF: Lucebox/Laguna-XS.2-GGUF (Q4_K_M)
- MLX 4-bit: mlx-community/Laguna-XS.2-4bit
- MLX MXFP4: mlx-community/Laguna-XS.2-mxfp4
Feature requests filed
- llama.cpp: Comment on ggml-org/llama.cpp#23249
- LM Studio: lmstudio-ai/lmstudio-bug-tracker#1968
- Ollama: Comment on ollama/ollama#15892
HuggingFace Transformers (5.9.0) detail
Transformers recognizes the architecture — LagunaForCausalLM loads config and downloads weights. But there's no viable path to run it on Apple Silicon MPS:
| Approach | Result |
|---|---|
| Local GGUF via from_pretrained(gguf_file=...) | GGUF model with architecture laguna is not supported yet |
| Upstream repo with load_in_4bit=True | LagunaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit' (bitsandbytes is CUDA-only) |
| Upstream repo float16 on MPS | Invalid buffer size: 62.29 GiB (full weights exceed 32 GB) |
Reproduction script: thewesjohnson/gists/laguna-xs2-apple-silicon
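Separately from the reproduction script, the memory arithmetic behind the float16 failure can be sketched in a few lines. This is a rough estimate only: it assumes the 62.29 GiB buffer consists entirely of 16-bit weights, and it ignores quantization scales, the KV cache, and activations. The helper names are hypothetical and not part of any library.

```python
GIB = 1024 ** 3  # bytes per GiB

def params_from_size(size_gib: float, bits_per_weight: float) -> float:
    """Parameter count implied by a weight buffer of the given size."""
    return size_gib * GIB * 8 / bits_per_weight

def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone at a given precision."""
    return n_params * bits_per_weight / 8 / GIB

def fits(size_gib: float, memory_gib: float) -> bool:
    """True if the weights alone are smaller than the available memory."""
    return size_gib < memory_gib

# Assumption: the reported 62.29 GiB buffer is all float16 weights.
n_params = params_from_size(62.29, 16)

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_size_gib(n_params, bits)
    print(f"{label}: {size:.2f} GiB, fits in 32 GiB: {fits(size, 32)}")
```

Under these assumptions only a 4-bit variant leaves real headroom on a 32 GB machine, which is why the quantized GGUF and MLX paths matter here.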
Context
Laguna-XS.2 is positioned for local agentic coding on developer hardware, but the only way to run it locally today is vLLM on a CUDA GPU. The model card lists Transformers, vLLM, SGLang, and Docker Model Runner as supported — all GPU paths. For the Mac developer audience that the model targets, none of the Apple Silicon runtimes can load it yet.
Ollama (v0.22.0+) does load and run the model, but output quality is unusable — garbled text and repetitive loops (tracked at ollama/ollama#15892). The remaining runtimes don't recognize the architecture at all. llama.cpp support (ggml-org/llama.cpp#23249) would unblock LM Studio. mlx-lm support would unblock MLX-based tooling.
Not a bug on Poolside's end — just flagging that community adoption on Mac is limited until the local runtimes catch up. Happy to re-test when any runtime improves support.
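For re-testing, one rough way to flag the repetitive-loop failure mode automatically is to measure how many word n-grams in the output repeat an earlier one. This is a minimal heuristic sketch, not something any of the runtimes provide; the function name and the choice of n = 4 are hypothetical.

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

A ratio near 0 suggests normal text, while output stuck in a loop pushes the ratio toward 1. Any cut-off would need tuning on real outputs.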
Hardware
MacBook Pro M1 Max, 32 GB unified memory, macOS 26.4.1
Hi!
Thank you for the report.
We're planning on extending our Mac support in the coming days & weeks. I can't give any hard dates unfortunately, but in the short term we're expecting llama.cpp support to land in the coming days (Ollama are moving over to llama.cpp as a native backend).
In terms of the garbled output in particular: we have isolated this issue down to the int4 quantised checkpoint. It appears to be a quantisation artifact, as no other checkpoint we have reproduces the garbled output. The particularities of the garbled output also appear to be backend-dependent (some of the errors we were only able to reproduce on consumer AMD hardware). We are hoping to share a resolution in the near future, but the precise date is still TBD.
PS have you tried the quantised MLX variants of the model in the Laguna-XS.2 collection? Those should work on Macs out of the box.
Update: Mac Testing on M1 Max 32GB + 64GB (2026-06-06)
Tested on two MacBook Pros (M1 Max 32GB / 24 GPU cores, and M1 Max 64GB / 32 GPU cores). Tried every local inference path I could find.
What works
Ollama macOS app (brew install --cask ollama-app, v0.30.6) with laguna-xs.2:q8_0. Clean output, 47 tok/s on the 64GB machine. Code generation (linked list reversal) was also coherent and well-structured. No garbling on q8_0.
Prompt: "What is 2+2? Answer in one sentence."
Output: "2 + 2 equals 4."
54 tokens | 47.4 tok/s | 22.4s total (20.8s load, 1.1s eval)
Important: the brew formula (brew install ollama) does NOT work. It builds an MLX-only backend without llama-server, so GGUF models fail with llama-server binary not found. Only the macOS app cask has the full llama.cpp backend.
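One way to collect comparable throughput numbers programmatically is Ollama's local REST API, which reports token counts and durations (in nanoseconds) with each non-streamed response. A minimal sketch, assuming the Ollama app is running on its default port 11434 and the laguna-xs.2:q8_0 tag is already pulled; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def generate(prompt: str, model: str = "laguna-xs.2:q8_0", url: str = OLLAMA_URL) -> dict:
    """Send one non-streamed generation request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def summarize_timings(result: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds and tokens per second."""
    ns = 1e9
    eval_s = result["eval_duration"] / ns
    return {
        "tokens": result["eval_count"],
        "tok_per_s": result["eval_count"] / eval_s,
        "load_s": result["load_duration"] / ns,
        "eval_s": eval_s,
        "total_s": result["total_duration"] / ns,
    }

# Example (requires a running Ollama server):
# result = generate("What is 2+2? Answer in one sentence.")
# print(result["response"], summarize_timings(result))
```

Load time dominates a cold first request, so a second run gives a fairer picture of steady-state speed.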
What does not work
- LM Studio 0.4.16+1 (MLX backend) with mlx-community/Laguna-XS.2-5bit: Model type laguna not supported. mlx-lm 0.31.3 does not have laguna in its model registry. The included modeling_laguna.py is the PyTorch version, not MLX-native.
- LM Studio 0.4.16+1 (llama.cpp backend) with Lucebox/Laguna-XS.2-GGUF Q4_K_M: llama.cpp Laguna PR not merged yet.
- Direct mlx-lm 0.31.3: Same architecture error as LM Studio MLX.
- HF Transformers 5.10.2 with MLX 5-bit variant: weights are MLX safetensors format, not PyTorch-loadable.
- AWQ-INT4, SGLang: CUDA/ROCm only, not applicable on Apple Silicon.
MLX path
@joerowell re: "have you tried the quantised MLX variants?" They download fine but do not load. mlx-lm needs to merge laguna architecture support upstream before any MLX variant works on Mac.
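A quick pre-download check for whether an installed mlx-lm can handle a given model_type is sketched below. It rests on an assumption this thread does not confirm: that mlx-lm resolves architectures by importing a module named mlx_lm.models.<model_type>. The helper names are hypothetical.

```python
import importlib.util
import json
from pathlib import Path

def read_model_type(config_path: str) -> str:
    """Return the "model_type" field from a Hugging Face style config.json."""
    return json.loads(Path(config_path).read_text())["model_type"]

def has_arch_module(package: str, model_type: str) -> bool:
    """True if `<package>.<model_type>` is importable in this environment."""
    try:
        return importlib.util.find_spec(f"{package}.{model_type}") is not None
    except ImportError:
        # Raised when the parent package itself is not installed.
        return False

# Assumption: architectures live under mlx_lm.models.<model_type>.
print("laguna supported by mlx-lm:", has_arch_module("mlx_lm.models", "laguna"))
```

If this prints False on a machine with mlx-lm installed, the MLX variants will download but not load, matching the behaviour described above.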
Not yet tested
HF Transformers with trust_remote_code=True on MPS using the full BF16 checkpoint. The BF16 GGUF is 67GB which does not fit in 64GB unified memory. Would need a different checkpoint or a machine with more RAM. Happy to test if you can point me to a variant that fits.
If there is anything else you would like tested on these machines, I am glad to help. I would love to test Laguna XS.2 on some agentic benchmarks once a local path opens up beyond Ollama.
Hardware:
- MacBook Pro (2021) M1 Max, 32 GB unified, 24 GPU cores, macOS 26.5.0
- MacBook Pro (2021) M1 Max, 64 GB unified, 32 GPU cores, macOS 26.5.0
ik_llama just added Laguna-XS.2 support.
That is good news for the Nvidia ecosystem. According to their README, though, ik_llama still doesn’t support Mac/ARM or AMD.