Laguna architecture not yet supported by local Mac inference runtimes (llama.cpp, mlx-lm, LM Studio, Ollama)

#7
by skaman5 - opened

Heads-up for the Poolside team: the laguna architecture isn't usable on any local Mac inference runtime as of May 2026. Most runtimes fail to load it, and Ollama loads it but produces garbled output. Only vLLM (GPU, mainline) and the hosted options work.

Tested runtimes

| Runtime | Version | Result |
| --- | --- | --- |
| llama.cpp | build 9330 | unknown model architecture: 'laguna' |
| LM Studio | 0.4.14+4 | Fails to load (both GGUF and MLX variants) |
| mlx-lm | 0.31.3 | Architecture not recognized |
| Ollama | 0.24.0 | Loads but garbled output (#15892) |
| vLLM Metal | 0.2.0 (Apple Silicon plugin) | Not in supported models list |
| HF Transformers | 5.9.0 | LagunaForCausalLM exists but no quantized MPS path (GGUF loader rejects, bitsandbytes is CUDA-only, float16 = 62 GB) |
| vLLM (mainline, GPU) | 0.21.0 | Works (PR #41129) |

Models tested

  • GGUF: Lucebox/Laguna-XS.2-GGUF (Q4_K_M)
  • MLX 4-bit: mlx-community/Laguna-XS.2-4bit
  • MLX MXFP4: mlx-community/Laguna-XS.2-mxfp4

Feature requests filed

HuggingFace Transformers (5.9.0) detail

Transformers recognizes the architecture — LagunaForCausalLM loads config and downloads weights. But there's no viable path to run it on Apple Silicon MPS:

| Approach | Result |
| --- | --- |
| Local GGUF via from_pretrained(gguf_file=...) | GGUF model with architecture laguna is not supported yet |
| Upstream repo with load_in_4bit=True | LagunaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit' (bitsandbytes is CUDA-only) |
| Upstream repo float16 on MPS | Invalid buffer size: 62.29 GiB (full weights exceed 32 GB) |
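
For a rough sense of why the float16 path cannot fit, here is a back-of-the-envelope sketch. It assumes the 62.29 GiB in the error message is the full set of float16 weights (2 bytes per parameter) and that quantized sizes scale linearly with bit width. Real quantized files also carry scales and metadata, so actual sizes will be somewhat larger. The function names are made up for illustration.

```python
GIB = 2**30  # bytes per GiB


def params_from_footprint(size_gib: float, bytes_per_param: float) -> float:
    """Estimate the parameter count from a weight footprint in GiB."""
    return size_gib * GIB / bytes_per_param


def weights_gib(n_params: float, bits_per_param: float) -> float:
    """Estimate the weight footprint in GiB at a given bit width."""
    return n_params * bits_per_param / 8 / GIB


# Assumption: the 62.29 GiB buffer from the MPS error is all float16 weights.
n_params = params_from_footprint(62.29, bytes_per_param=2)

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    # Ignores quantization scales, KV cache, and activation memory.
    print(f"{label:8s} ~{weights_gib(n_params, bits):.1f} GiB")
```

On a 32 GB machine, only the lower bit widths leave any headroom under this estimate, which is why a quantized MPS path matters here.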

Reproduction script: thewesjohnson/gists/laguna-xs2-apple-silicon

Context

Laguna-XS.2 is positioned for local agentic coding on developer hardware, but the only way to run it locally today is vLLM on a CUDA GPU. The model card lists Transformers, vLLM, SGLang, and Docker Model Runner as supported — all GPU paths. For the Mac developer audience that the model targets, none of the Apple Silicon runtimes can load it yet.

Ollama (v0.22.0+) does load and run the model, but output quality is unusable — garbled text and repetitive loops (tracked at ollama/ollama#15892). The remaining runtimes don't recognize the architecture at all. llama.cpp support (ggml-org/llama.cpp#23249) would unblock LM Studio. mlx-lm support would unblock MLX-based tooling.

Not a bug on Poolside's end — just flagging that community adoption on Mac is limited until the local runtimes catch up. Happy to re-test when any runtime improves support.

Hardware

MacBook Pro M1 Max, 32 GB unified memory, macOS 26.4.1

Poolside org

Hi!

Thank you for the report.

We're planning on extending our Mac support in the coming days & weeks. I can't give any hard dates unfortunately, but in the short term we're expecting llama.cpp support to land in the coming days (Ollama are moving over to llama.cpp as a native backend).

On the garbled output in particular: we have isolated this issue to the int4 quantised checkpoint. It appears to be a quantisation artifact, as no other checkpoint we have reproduces the garbled output. The specifics of the garbled output also appear to be backend-dependent (some of the errors we were only able to reproduce on consumer AMD hardware). We are hoping to share a resolution in the near future, but the precise date is still TBD.

PS have you tried the quantised MLX variants of the model in the Laguna-XS.2 collection? Those should work on Macs out of the box.

Update: Mac Testing on M1 Max 32GB + 64GB (2026-06-06)

Tested on two MacBook Pros (M1 Max 32GB / 24 GPU cores, and M1 Max 64GB / 32 GPU cores). Tried every local inference path I could find.

What works

Ollama macOS app (brew install --cask ollama-app, v0.30.6) with laguna-xs.2:q8_0. Clean output, 47 tok/s on the 64GB machine. I also tested code generation (linked list reversal), and the output was coherent and well-structured. No garbling on q8_0.

Prompt: "What is 2+2? Answer in one sentence."
Output: "2 + 2 equals 4."
54 tokens | 47.4 tok/s | 22.4s total (20.8s load, 1.1s eval)
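
For anyone who wants to reproduce the timing line, here is a minimal sketch of how the numbers can be derived from the timing fields in an Ollama /api/generate response (total_duration, load_duration, eval_count, eval_duration, all durations in nanoseconds). The sample dictionary below is hypothetical. Its values were chosen to roughly match the run above and are not a captured API response.

```python
NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def ollama_throughput(stats: dict) -> dict:
    """Convert Ollama timing fields into seconds and tokens per second."""
    eval_s = stats["eval_duration"] / NS_PER_S
    return {
        "load_s": stats["load_duration"] / NS_PER_S,
        "eval_s": eval_s,
        "total_s": stats["total_duration"] / NS_PER_S,
        # Generation speed counts only generated tokens over eval time.
        "tok_per_s": stats["eval_count"] / eval_s if eval_s > 0 else 0.0,
    }


# Hypothetical values, picked to approximate the q8_0 run reported above.
sample = {
    "total_duration": 22_400_000_000,
    "load_duration": 20_800_000_000,
    "eval_count": 54,
    "eval_duration": 1_139_000_000,
}

result = ollama_throughput(sample)
print(f"{sample['eval_count']} tokens | {result['tok_per_s']:.1f} tok/s | "
      f"{result['total_s']:.1f}s total ({result['load_s']:.1f}s load)")
```

Note that most of the wall-clock time in a first run is model load, so tok/s computed from eval_duration is the more useful number when comparing machines.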

Important: the brew formula (brew install ollama) does NOT work. It builds an MLX-only backend without llama-server, so GGUF models fail with llama-server binary not found. Only the macOS app cask has the full llama.cpp backend.

What does not work

  • LM Studio 0.4.16+1 (MLX backend) with mlx-community/Laguna-XS.2-5bit: Model type laguna not supported. mlx-lm 0.31.3 does not have laguna in its model registry. The included modeling_laguna.py is the PyTorch version, not MLX-native.
  • LM Studio 0.4.16+1 (llama.cpp backend) with Lucebox/Laguna-XS.2-GGUF Q4_K_M: llama.cpp Laguna PR not merged yet.
  • Direct mlx-lm 0.31.3: Same architecture error as LM Studio MLX.
  • HF Transformers 5.10.2 with MLX 5-bit variant: weights are MLX safetensors format, not PyTorch-loadable.
  • AWQ-INT4, SGLang: CUDA/ROCm only, not applicable on Apple Silicon.
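
A quick way to re-check the mlx-lm side after an upgrade, without downloading any weights, is to probe for the architecture module. This is a sketch under an assumption: that mlx-lm resolves a model_type by importing a module of the same name under mlx_lm.models, so laguna would need mlx_lm.models.laguna to exist. The helper name has_submodule is made up for illustration.

```python
import importlib.util


def has_submodule(package: str, name: str) -> bool:
    """Return True if `package.name` can be located without fully loading it."""
    try:
        return importlib.util.find_spec(f"{package}.{name}") is not None
    except ImportError:
        # Raised when the parent package itself is missing or fails to import.
        return False


# Assumption: mlx-lm looks up architectures as mlx_lm.models.<model_type>.
if has_submodule("mlx_lm.models", "laguna"):
    print("mlx-lm appears to ship a laguna module")
else:
    print("no laguna module found (or mlx-lm is not installed)")
```

A False result does not distinguish "mlx-lm not installed" from "architecture missing", so run it in the same environment LM Studio or mlx-lm uses.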

MLX path

@joerowell re: "have you tried the quantised MLX variants?" They download fine but do not load. mlx-lm needs to merge laguna architecture support upstream before any MLX variant works on Mac.

Not yet tested

HF Transformers with trust_remote_code=True on MPS using the full BF16 checkpoint. The BF16 GGUF is 67 GB, which does not fit in 64 GB of unified memory. This would need a different checkpoint or a machine with more RAM. Happy to test if you can point me to a variant that fits.

If there is anything else you would like tested on these machines, I am glad to help. I would love to test Laguna XS.2 on some agentic benchmarks once a local path opens up beyond Ollama.

Hardware:

  • MacBook Pro (2021) M1 Max, 32 GB unified, 24 GPU cores, macOS 26.5.0
  • MacBook Pro (2021) M1 Max, 64 GB unified, 32 GPU cores, macOS 26.5.0

ik_llama just added Laguna-XS.2 support.

That is good news for the Nvidia ecosystem. According to their README, though, ik_llama still doesn't support Mac/ARM or AMD.
