MLX
MLX is an array framework for machine learning on Apple silicon, with support for CUDA as well. On Apple silicon, arrays live in unified memory, so no data copies are needed between the CPU and GPU. Computation is lazy, which enables graph manipulation and optimization, and native safetensors support means Transformers language models run directly in MLX.
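As a minimal sketch of the lazy execution model (assuming only mlx.core from a standard MLX install), operations build a graph that is only materialized when a result is needed:

import mlx.core as mx

# Arrays are allocated in unified memory on Apple silicon
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])

# No computation happens yet; c is a node in a lazy graph
c = a * b + 1.0

# mx.eval forces materialization of the graph
mx.eval(c)
print(c)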
Install the mlx-lm and transformers libraries.
pip install mlx-lm transformers
Load any Transformers language model from the Hub as long as the model architecture is supported. No weight conversion is required.
from mlx_lm import load, generate
model, tokenizer = load("openai/gpt-oss-20b")
output = generate(
    model,
    tokenizer,
    prompt="The capital of France is",
    max_tokens=100,
)
print(output)

Transformers integration
- mlx_lm.load loads safetensors weights and returns a model and tokenizer.
- MLX loads weight arrays keyed by tensor names and maps them onto an MLX nn.Module parameter tree, which matches how Transformers checkpoints are organized (see the sketch after this list).
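A minimal sketch of that mapping, assuming a local safetensors checkpoint and MLX's mx.load and nn.Module.load_weights APIs (the Module subclass here is hypothetical):

import mlx.core as mx
import mlx.nn as nn

# Hypothetical module whose parameter names mirror the checkpoint's tensor names
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

model = TinyModel()

# mx.load returns a dict of arrays keyed by tensor names,
# e.g. {"linear.weight": ..., "linear.bias": ...}
weights = mx.load("model.safetensors")

# load_weights maps the flat name -> array dict onto the module's parameter tree
model.load_weights(list(weights.items()))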
The MLX Transformers integration is bidirectional. Transformers can also load and run MLX weights from the Hub.
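For example, a hypothetical sketch of the reverse direction using the standard Transformers API; the repository id here is an assumption, and an unquantized MLX checkpoint of a supported architecture is the safest case:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical mlx-community repository id; any MLX checkpoint whose
# safetensors weights match a supported architecture should load this way
repo_id = "mlx-community/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))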
Resources
- MLX documentation
- mlx-lm repository containing MLX LLM implementations
- mlx-vlm community library with VLM implementations