Is MTP the missing piece for Apple Silicon LLM inference?

#34

by ak959 - opened 5 days ago

Discussion

ak959

5 days ago

•

edited 5 days ago

We've been experimenting with Multi-Token Prediction (MTP) on Apple Silicon and have seen some surprisingly large performance gains with recent models such as Gemma 4.

This got us wondering:

How much of the current inference bottleneck on Apple Silicon is actually due to the decoding process itself?

MLX has already done an excellent job leveraging unified memory and Metal acceleration, but decoding remains one of the most expensive stages for LLM inference. MTP and speculative-style decoding approaches seem promising, especially as models become larger and context windows continue to grow.
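For readers new to the idea, here is a minimal sketch of the draft-and-verify loop behind greedy speculative decoding, which is also roughly how MTP-style extra prediction heads are typically used at inference time (they act as the drafter). Everything in it is an illustrative assumption: `target_model` and `draft_model` are toy stand-in functions, not MLX or MLX-LM APIs, and the "single batched verification pass" is simulated with ordinary function calls. Other acceptance schemes exist; this shows only the greedy-verification variant.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next greedy token.
Model = Callable[[List[int]], int]

VOCAB = 50  # hypothetical toy vocabulary size


def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large model: a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx) * 7) % VOCAB


def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for a cheap drafter: deliberately wrong when len(ctx) % 4 == 0."""
    guess = target_model(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 3
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks all positions. A real runtime would do this in one
        #    batched forward pass, so it is counted as a single pass here.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if i == k or proposal[i] != expected:
                break  # bonus token after full acceptance, or first mismatch
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_model, prompt, 20)
    spec, passes = speculative_decode(target_model, draft_model, prompt, 20, k=3)
    print("identical output:", spec == baseline)
    print("target passes:", passes, "vs baseline calls:", 20)
```

Because only tokens the target itself would have chosen are kept, the output of this variant matches plain greedy decoding; the gain comes from needing fewer passes over the large model, and it depends on how often the drafter agrees with the target.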

Some questions I'd love to hear opinions on:

Have you experimented with MTP or speculative decoding on MLX?
Which models benefit the most?

Gemma 4
Qwen 3.6
DeepSeek
Llama family

Do you think future Apple Silicon inference stacks should prioritize:

Better quantization
MTP / speculative decoding
KV cache optimization
Multi-GPU / distributed inference

For those running M3 Ultra or M4 Max systems, what are your current throughput numbers for Gemma 4 or Qwen 3.6?

I'm particularly interested in real-world experiences from people building inference runtimes on top of MLX.

It feels like there is still significant headroom left in Apple Silicon inference that isn't being fully explored yet.
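One rough way to reason about that headroom is a back-of-the-envelope bound: at batch size 1, each decoded token has to stream the active weights from memory once, so tokens per second cannot exceed memory bandwidth divided by weight size. The sketch below is only that arithmetic. The function names and the example numbers (800 GB/s, 16 GB of weights, 2.5 accepted tokens per pass) are hypothetical placeholders, not measurements of any chip or model, and the bound ignores KV cache reads, compute limits, and drafter cost.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed, in tokens per second."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


def tps_with_speculation(base_tps: float, tokens_per_pass: float) -> float:
    """Idealized effect of MTP / speculative decoding: several tokens per weight pass.

    Assumes verification costs the same as one normal decode step and the
    drafter is free, which is optimistic.
    """
    if tokens_per_pass < 1:
        raise ValueError("each pass yields at least one token")
    return base_tps * tokens_per_pass


if __name__ == "__main__":
    # Hypothetical inputs: 800 GB/s memory bandwidth, 16 GB of quantized weights.
    base = decode_tps_upper_bound(800.0, 16.0)
    print(f"plain decode ceiling: {base:.1f} tok/s")
    print(f"with 2.5 tokens per pass: {tps_with_speculation(base, 2.5):.1f} tok/s")
```

Seen this way, quantization raises the ceiling by shrinking the weights, while MTP / speculative decoding gets more tokens out of each pass over them.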

Curious to hear what others are seeing.

usermma

MLX Community org 4 days ago

Not everyone is interested in MTP and speculative decoding, because of a perceived loss of model capabilities...

Or at least that's what a lot of people seem to think...

datagram

MLX Community org 4 days ago

Chips are scarce, so from an open-access perspective (the little guy), multi-GPU / distributed inference should be the last priority. Priority should be on 'better quantization' and 'KV cache optimization'. MTP improves AI lifestyle (e.g., better prediction = better thinking and output). If you are an F1000 corp, you can buy anything, so multi-GPU / distributed is the priority, but maybe you are just buying proprietary "frontier" models and their kickers, like OAI for healthcare or Anthropic for mines and minerals. Final answer = all 4.

usermma

MLX Community org 3 days ago

Is this like that AI that said:

user: who are you?

assistant: I'm just to help out. How can I assist if anyone doesn't understand what my goal is, and not have an actual purpose on this task so it's easier for me than everyone else in the whole universe that could ever find anything better! So here am: myself (or more likely) or any person of you with a deep heart to answer your questions.

me: why did you say at the end "all 4" and what do you mean by it?

> Chips are scarce, so from an open-access perspective (the little guy), multi-GPU / distributed inference should be the last priority. Priority should be on 'better quantization' and 'KV cache optimization'. MTP improves AI lifestyle (e.g., better prediction = better thinking and output). If you are an F1000 corp, you can buy anything, so multi-GPU / distributed is the priority, but maybe you are just buying proprietary "frontier" models and their kickers, like OAI for healthcare or Anthropic for mines and minerals. Final answer = all 4.

datagram

MLX Community org 3 days ago

Just meant that all four matter. As for how to prioritize the four options the OP asks about: IMHO it depends on whether you or someone you care about has the resources to buy multiple GPUs, and on whether you consider universal access to compute an important principle. If so, optimize for one machine's resources FIRST.

If not, so be it: optimize for chip clusters and parallelization FIRST, e.g., big-business SOCs or mom-and-pop inference providers in your hometown that stack and rack 10 Mac minis and serve MLX-based inference. In my experience, I don't know that many people with Mac mini racks or multiple Mac Studios in parallel, IRL.

usermma

MLX Community org 3 days ago

Yeah, one device is always better. If they can merge them into one, it's much better than running a cluster, because one device is much faster than a lot of devices working together...

ak959

3 days ago

> Just meant that all four matter. As for how to prioritize the four options the OP asks about: IMHO it depends on whether you or someone you care about has the resources to buy multiple GPUs, and on whether you consider universal access to compute an important principle. If so, optimize for one machine's resources FIRST.
>
> If not, so be it: optimize for chip clusters and parallelization FIRST, e.g., big-business SOCs or mom-and-pop inference providers in your hometown that stack and rack 10 Mac minis and serve MLX-based inference. In my experience, I don't know that many people with Mac mini racks or multiple Mac Studios in parallel, IRL.

ak959 changed discussion status to closed 3 days ago

ak959

3 days ago

> Not everyone is interested in MTP and speculative decoding, because of a perceived loss of model capabilities...
>
> Or at least that's what a lot of people seem to think...

MLX-LM is not fully utilizing the GPU.
Try https://github.com/defai-digital/ax-engine
It is much faster, even in direct mode.

ak959

3 days ago

> Chips are scarce, so from an open-access perspective (the little guy), multi-GPU / distributed inference should be the last priority. Priority should be on 'better quantization' and 'KV cache optimization'. MTP improves AI lifestyle (e.g., better prediction = better thinking and output). If you are an F1000 corp, you can buy anything, so multi-GPU / distributed is the priority, but maybe you are just buying proprietary "frontier" models and their kickers, like OAI for healthcare or Anthropic for mines and minerals. Final answer = all 4.

Yes, you are correct.
But I think MLX on the Mac is doing something here:
https://www.youtube.com/watch?v=wykPErJ8M-8

ak959

3 days ago

> Just meant that all four matter. As for how to prioritize the four options the OP asks about: IMHO it depends on whether you or someone you care about has the resources to buy multiple GPUs, and on whether you consider universal access to compute an important principle. If so, optimize for one machine's resources FIRST.
>
> If not, so be it: optimize for chip clusters and parallelization FIRST, e.g., big-business SOCs or mom-and-pop inference providers in your hometown that stack and rack 10 Mac minis and serve MLX-based inference. In my experience, I don't know that many people with Mac mini racks or multiple Mac Studios in parallel, IRL.

Thank you for your advice. Can you please share your parallelization experience?

ak959 changed discussion status to open 3 days ago
