MLX
MLX is an array framework for machine learning on Apple silicon, with support for CUDA as well. On Apple silicon, arrays live in unified memory, so no data copies are needed between the CPU and GPU. Computation is lazy, which enables graph manipulation and optimization, and native safetensors support means Transformers language models run directly in MLX.
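As a minimal sketch of the lazy execution model (assuming only mlx.core from a standard MLX install), operations build a graph that is only materialized when a result is needed:

import mlx.core as mx

# Arrays are allocated in unified memory on Apple silicon
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])

# No computation happens yet; c is a node in a lazy graph
c = a * b + 1.0

# mx.eval forces materialization of the graph
mx.eval(c)
print(c)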
Install the mlx-lm and transformers libraries.
pip install mlx-lm transformers
Load any Transformers language model from the Hub as long as the model architecture is supported. No weight conversion is required.
from mlx_lm import load, generate
model, tokenizer = load("openai/gpt-oss-20b")
output = generate(
    model,
    tokenizer,
    prompt="The capital of France is",
    max_tokens=100,
)
print(output)

Transformers integration
- mlx_lm.load loads safetensors weights and returns a model and tokenizer.
- MLX loads weight arrays keyed by tensor names and maps them onto an MLX nn.Module parameter tree, which matches how Transformers checkpoints are organized (see the sketch after this list).
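A minimal sketch of that mapping, assuming a local safetensors checkpoint and MLX's mx.load and nn.Module.load_weights APIs (the Module subclass here is hypothetical):

import mlx.core as mx
import mlx.nn as nn

# Hypothetical module whose parameter names mirror the checkpoint's tensor names
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

model = TinyModel()

# mx.load returns a dict of arrays keyed by tensor names,
# e.g. {"linear.weight": ..., "linear.bias": ...}
weights = mx.load("model.safetensors")

# load_weights maps the flat name -> array dict onto the module's parameter tree
model.load_weights(list(weights.items()))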
The MLX Transformers integration is bidirectional. Transformers can also load and run MLX weights from the Hub.
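For example, a hypothetical sketch of the reverse direction using the standard Transformers API; the repository id here is an assumption, and an unquantized MLX checkpoint of a supported architecture is the safest case:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical mlx-community repository id; any MLX checkpoint whose
# safetensors weights match a supported architecture should load this way
repo_id = "mlx-community/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))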
Resources
- MLX documentation
- mlx-lm repository containing MLX LLM implementations
- mlx-vlm community library with VLM implementations