Instructions to use zai-org/glm-4-9b-chat with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use zai-org/glm-4-9b-chat with Transformers:
```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("zai-org/glm-4-9b-chat", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
Alternative quantizations (#57), opened by ZeroWw
These are my own quantizations (updated almost daily).
They differ from normal quantizations in that I quantize the output and embedding tensors to f16, and the remaining tensors to q5_k, q6_k, or q8_0.
This produces models that are barely degraded, if at all, while being smaller in size.
They run at about 3-6 tokens/sec on CPU alone using llama.cpp, and obviously faster on machines with powerful GPUs.
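To see why this mixed scheme saves space, here is a back-of-envelope size estimate. It is only a sketch: the bits-per-weight figures are approximate llama.cpp values, and the parameter split between the embedding/output tensors and the rest of a 9B model is an assumption for illustration, not taken from the actual model files.

```python
# Rough size comparison for ZeroWw-style mixed quantization:
# output + embedding tensors kept at f16, all other tensors at q6_k,
# versus quantizing everything uniformly at q8_0.
# Bits-per-weight values are approximate llama.cpp figures (assumption).
BPW = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.56, "q5_k": 5.5}

def size_gb(n_params: float, bpw: float) -> float:
    """File size in GB for n_params weights stored at bpw bits each."""
    return n_params * bpw / 8 / 1e9

# Illustrative split for a ~9B-parameter model (assumption):
# ~0.6B params in the embedding + output tensors, ~8.4B elsewhere.
embed_out, rest = 0.6e9, 8.4e9

mixed = size_gb(embed_out, BPW["f16"]) + size_gb(rest, BPW["q6_k"])
uniform_q8 = size_gb(embed_out + rest, BPW["q8_0"])
print(f"mixed f16/q6_k: {mixed:.1f} GB, uniform q8_0: {uniform_q8:.1f} GB")
```

Under these assumptions the mixed model comes out smaller than a uniform q8_0 file while keeping the quality-sensitive embedding and output tensors at full f16 precision, which is the trade-off the post describes.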