Instructions to use poolside/Laguna-XS.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use poolside/Laguna-XS.2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="poolside/Laguna-XS.2", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and move the resulting tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use poolside/Laguna-XS.2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "poolside/Laguna-XS.2"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "poolside/Laguna-XS.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
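
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the server exposes the same OpenAI-compatible `/v1/chat/completions` route and returns the usual `choices[0].message.content` response shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice.

    Assumes the response follows the OpenAI chat completions schema.
    """
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "poolside/Laguna-XS.2", "What is the capital of France?")` should return the reply text. The same helper should work against any other server that exposes this route by changing `base_url`.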

Use Docker

docker model run hf.co/poolside/Laguna-XS.2

SGLang

How to use poolside/Laguna-XS.2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "poolside/Laguna-XS.2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "poolside/Laguna-XS.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "poolside/Laguna-XS.2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "poolside/Laguna-XS.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use poolside/Laguna-XS.2 with Docker Model Runner:
```
docker model run hf.co/poolside/Laguna-XS.2
```

Laguna architecture not yet supported by local Mac inference runtimes (llama.cpp, mlx-lm, LM Studio, Ollama)

by skaman5 - opened 15 days ago

Discussion

skaman5

15 days ago

Heads-up for the Poolside team — the laguna architecture isn't loadable on any local Mac inference runtime as of May 2026. Only vLLM (GPU, mainline) and the hosted options work.

Tested runtimes

| Runtime | Version | Result |
| --- | --- | --- |
| llama.cpp | build 9330 | `unknown model architecture: 'laguna'` |
| LM Studio | 0.4.14+4 | Fails to load (both GGUF and MLX variants) |
| mlx-lm | 0.31.3 | Architecture not recognized |
| Ollama | 0.24.0 | Loads but garbled output (#15892) |
| vLLM Metal | 0.2.0 (Apple Silicon plugin) | Not in supported models list |
| HF Transformers | 5.9.0 | `LagunaForCausalLM` exists but no quantized MPS path (GGUF loader rejects, bitsandbytes is CUDA-only, float16 = 62 GB) |
| vLLM (mainline, GPU) | 0.21.0 | Works (PR #41129) |

Models tested

GGUF: Lucebox/Laguna-XS.2-GGUF (Q4_K_M)
MLX 4-bit: mlx-community/Laguna-XS.2-4bit
MLX MXFP4: mlx-community/Laguna-XS.2-mxfp4

Feature requests filed

llama.cpp: Comment on ggml-org/llama.cpp#23249
LM Studio: lmstudio-ai/lmstudio-bug-tracker#1968
Ollama: Comment on ollama/ollama#15892

HuggingFace Transformers (5.9.0) detail

Transformers recognizes the architecture — LagunaForCausalLM loads config and downloads weights. But there's no viable path to run it on Apple Silicon MPS:

| Approach | Result |
| --- | --- |
| Local GGUF via `from_pretrained(gguf_file=...)` | `GGUF model with architecture laguna is not supported yet` |
| Upstream repo with `load_in_4bit=True` | `LagunaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'` (bitsandbytes is CUDA-only) |
| Upstream repo float16 on MPS | `Invalid buffer size: 62.29 GiB` (full weights exceed 32 GB) |

Reproduction script: thewesjohnson/gists/laguna-xs2-apple-silicon
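
For a rough sense of why quantization matters here, the float16 failure in the table can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes the 62.29 GiB figure is pure 16-bit weights and that footprint scales linearly with bits per parameter. Real quantized files are larger (they store scales and keep some tensors at higher precision), and the runtime needs extra memory for the KV cache and activations. The function name `scaled_weight_gib` is made up for illustration.

```python
GIB = 2**30  # bytes per gibibyte
GB = 10**9   # bytes per decimal gigabyte


def scaled_weight_gib(reference_gib: float, reference_bits: int, target_bits: int) -> float:
    """Scale a known weight footprint linearly with bits per parameter."""
    return reference_gib * target_bits / reference_bits


FP16_GIB = 62.29  # taken from the "Invalid buffer size" error message

# Implied parameter count, assuming exactly 2 bytes per parameter.
approx_params = FP16_GIB * GIB / 2
print(f"implied parameters: ~{approx_params / 1e9:.1f}B")

for bits in (16, 8, 5, 4):
    gib = scaled_weight_gib(FP16_GIB, 16, bits)
    print(f"{bits:>2}-bit: ~{gib:.1f} GiB (~{gib * GIB / GB:.1f} GB)")
```

Under these assumptions, an idealized 4-bit copy of the weights would need roughly a quarter of the float16 footprint, which is why a quantized path is the realistic option on a 32 GB machine.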

Context

Laguna-XS.2 is positioned for local agentic coding on developer hardware, but the only way to run it locally with usable output today is vLLM on a CUDA GPU. The model card lists Transformers, vLLM, SGLang, and Docker Model Runner as supported, and all of these are GPU paths. For the Mac developer audience that the model targets, none of the Apple Silicon runtimes can run it usably yet.

Ollama (v0.22.0+) does load and run the model, but output quality is unusable — garbled text and repetitive loops (tracked at ollama/ollama#15892). The remaining runtimes don't recognize the architecture at all. llama.cpp support (ggml-org/llama.cpp#23249) would unblock LM Studio. mlx-lm support would unblock MLX-based tooling.

Not a bug on Poolside's end — just flagging that community adoption on Mac is limited until the local runtimes catch up. Happy to re-test when any runtime improves support.

Hardware

MacBook Pro M1 Max, 32 GB unified memory, macOS 26.4.1

joerowell

Poolside org 12 days ago

Hi!

Thank you for the report.

We're planning on extending our Mac support in the coming days & weeks. I can't give any hard dates unfortunately, but in the short term we're expecting llama.cpp support to land in the coming days (Ollama are moving over to llama.cpp as a native backend).

In terms of the garbled output in particular: we have isolated this issue down to the int4 quantised checkpoint. It appears to be a quantisation artifact, as no other checkpoint we have reproduces the garbled output. The particularities of the garbled output also appear to be backend-dependent (some of the errors we were only able to reproduce on consumer AMD hardware). We are hoping to share a resolution in the near future, but the precise date is still TBD.

PS: have you tried the quantised MLX variants of the model in the Laguna-XS.2 collection? Those should work on Macs out of the box.

skaman5

4 days ago

edited 4 days ago

Update: Mac Testing on M1 Max 32GB + 64GB (2026-06-06)

Tested on two MacBook Pros (M1 Max 32GB / 24 GPU cores, and M1 Max 64GB / 32 GPU cores). Tried every local inference path I could find.

What works

Ollama macOS app (brew install --cask ollama-app, v0.30.6) with laguna-xs.2:q8_0. Clean output, 47 tok/s on 64GB. Also tested code generation (linked list reversal), coherent and well-structured. No garbling on q8_0.

Prompt: "What is 2+2? Answer in one sentence."
Output: "2 + 2 equals 4."
54 tokens | 47.4 tok/s | 22.4s total (20.8s load, 1.1s eval)

Important: the brew formula (`brew install ollama`) does NOT work. It builds an MLX-only backend without llama-server, so GGUF models fail with `llama-server binary not found`. Only the macOS app cask has the full llama.cpp backend.

What does not work

LM Studio 0.4.16+1 (MLX backend) with mlx-community/Laguna-XS.2-5bit: Model type laguna not supported. mlx-lm 0.31.3 does not have laguna in its model registry. The included modeling_laguna.py is the PyTorch version, not MLX-native.
LM Studio 0.4.16+1 (llama.cpp backend) with Lucebox/Laguna-XS.2-GGUF Q4_K_M: llama.cpp Laguna PR not merged yet.
Direct mlx-lm 0.31.3: Same architecture error as LM Studio MLX.
HF Transformers 5.10.2 with MLX 5-bit variant: weights are MLX safetensors format, not PyTorch-loadable.
AWQ-INT4, SGLang: CUDA/ROCm only, not applicable on Apple Silicon.

MLX path

@joerowell re: "have you tried the quantised MLX variants?" They download fine but do not load. mlx-lm needs to merge laguna architecture support upstream before any MLX variant works on Mac.
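
A quick way to re-check this after upgrading mlx-lm is to look for an architecture module in the installed package. The sketch below rests on an assumption: that mlx-lm resolves a checkpoint's `model_type` to a module under `mlx_lm.models` (for example `mlx_lm.models.laguna`). The helper names are hypothetical, and a missing or unimportable mlx-lm is simply reported as unsupported.

```python
import importlib.util


def has_submodule(package: str, name: str) -> bool:
    """Return True if `package.name` can be located.

    find_spec imports the parent package but not the submodule itself.
    """
    try:
        return importlib.util.find_spec(f"{package}.{name}") is not None
    except ImportError:
        # Raised when the parent package is missing or fails to import.
        return False


def mlx_lm_supports(model_type: str) -> bool:
    # Assumption: architectures live as modules under mlx_lm.models.
    return has_submodule("mlx_lm.models", model_type)


print("laguna supported by installed mlx-lm:", mlx_lm_supports("laguna"))
```

If this prints `False` on a machine with mlx-lm installed, the MLX variants will most likely still fail to load regardless of quantization level.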

Not yet tested

HF Transformers with trust_remote_code=True on MPS using the full BF16 checkpoint. The BF16 GGUF is 67GB which does not fit in 64GB unified memory. Would need a different checkpoint or a machine with more RAM. Happy to test if you can point me to a variant that fits.

If there is anything else you would like tested on these machines, I am glad to help. I would love to test Laguna XS.2 on some agentic benchmarks once a local path opens up beyond Ollama.

Hardware:

MacBook Pro (2021) M1 Max, 32 GB unified, 24 GPU cores, macOS 26.5.0
MacBook Pro (2021) M1 Max, 64 GB unified, 32 GPU cores, macOS 26.5.0

ji-farthing

1 day ago

edited 1 day ago

ik_llama just added Laguna-XS.2 support.

skaman5

1 day ago

That is good news for the Nvidia ecosystem. According to their README, though, ik_llama still doesn't support Mac/ARM or AMD.
