I run 20 AI coding agents locally on my desktop workstation at 400+ tokens/sec with MiniMax-M2. It's a Sonnet drop-in replacement in Cursor, Claude Code, Droid, Kilo, and Cline: peaks of 11k tok/s input and 433 tok/s output, enough to generate 1B+ tokens per month, all with a 196k context window. I've been running it on this config for 6 days now.
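For anyone wondering what "drop-in" means in practice: the agents just talk to a local OpenAI-compatible endpoint. A minimal sketch, assuming vLLM's OpenAI-compatible server is listening on localhost:8000 (the port and the `MiniMaxAI/MiniMax-M2` model id are my illustration, not my exact config):

```python
# Minimal sketch: point any OpenAI-compatible coding agent at a local
# vLLM server hosting MiniMax-M2. Port and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM OpenAI-compatible server
    api_key="not-needed-locally",         # vLLM accepts any key unless one is configured
)

resp = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",
    messages=[{"role": "user", "content": "Refactor this function for clarity."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```

Cursor, Cline, and friends take the same base URL in their OpenAI-compatible provider settings, so no agent-side code changes are needed.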
Today, peak performance held stable at 490.2 tokens/sec across 48 concurrent clients on MiniMax-M2.
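A rough sketch of how a number like that can be reproduced, assuming the same local endpoint as above: fire 48 async clients at the server and divide total completion tokens by wall-clock time.

```python
# Sketch: aggregate output tokens/sec across N concurrent clients.
# Endpoint, model id, prompt, and token budget are illustrative assumptions.
import asyncio
import time

from openai import AsyncOpenAI

CLIENTS = 48
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="MiniMaxAI/MiniMax-M2",
        messages=[{"role": "user", "content": "Write a quicksort in Python."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens  # tokens this client received

async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(CLIENTS)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.1f} tok/s across {CLIENTS} clients")

asyncio.run(main())
```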
I just threw Qwen3-0.6B in BF16 into an on-device AI drag race on AMD Strix Halo with vLLM (quick repro sketch after the numbers):
564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons
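The sprint/marathon split is easy to reproduce with vLLM's offline API; a minimal sketch, where the timing loop and prompt are mine and only the model and dtype come from the run above:

```python
# Sketch: compare decode throughput on short vs long generations with
# vLLM's offline API, mirroring the 100-token "sprint" and 8K "marathon".
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")

for label, max_tokens in [("sprint", 100), ("marathon", 8192)]:
    params = SamplingParams(max_tokens=max_tokens, temperature=0.0)
    start = time.perf_counter()
    out = llm.generate(["Explain paging in operating systems."], params)[0]
    elapsed = time.perf_counter() - start
    n = len(out.outputs[0].token_ids)  # tokens actually generated
    print(f"{label}: {n} tokens in {elapsed:.1f}s -> {n / elapsed:.1f} tok/s")
```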
TL;DR: You don't just run AI on AMD. You negotiate with it.
The hardware absolutely delivers. Spoiler alert: there is exactly ONE configuration in which vLLM + ROCm + Triton + PyTorch + drivers + Ubuntu kernel all work at the same time. Finding it required the patience of a saint.
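Once you do find the magic combo, pin it immediately. A quick sketch for recording the exact stack so it can be reproduced later (standard version attributes; `torch.version.hip` is only populated on ROCm builds of PyTorch):

```python
# Sketch: dump the exact working combination of kernel + PyTorch + ROCm/HIP
# + Triton + vLLM so the one known-good stack can be pinned and restored.
import platform

import torch
import triton
import vllm

print("kernel :", platform.release())
print("torch  :", torch.__version__)
print("hip    :", torch.version.hip)   # ROCm/HIP version torch was built against (None on CUDA builds)
print("triton :", triton.__version__)
print("vllm   :", vllm.__version__)
```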
Consumer AMD for AI inference is the ultimate "budget warrior" play: insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.