Instructions to use arunvpp05/zephyr-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use arunvpp05/zephyr-gguf with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="arunvpp05/zephyr-gguf",
    filename="zephyr-q5_k_m.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use arunvpp05/zephyr-gguf with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf arunvpp05/zephyr-gguf:Q5_K_M

# Run inference directly in the terminal:
llama-cli -hf arunvpp05/zephyr-gguf:Q5_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf arunvpp05/zephyr-gguf:Q5_K_M

# Run inference directly in the terminal:
llama-cli -hf arunvpp05/zephyr-gguf:Q5_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf arunvpp05/zephyr-gguf:Q5_K_M

# Run inference directly in the terminal:
./llama-cli -hf arunvpp05/zephyr-gguf:Q5_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf arunvpp05/zephyr-gguf:Q5_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf arunvpp05/zephyr-gguf:Q5_K_M
```
Use Docker
```sh
docker model run hf.co/arunvpp05/zephyr-gguf:Q5_K_M
```
- LM Studio
- Jan
- vLLM
How to use arunvpp05/zephyr-gguf with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "arunvpp05/zephyr-gguf"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "arunvpp05/zephyr-gguf",
        "messages": [
            { "role": "user", "content": "What is the capital of France?" }
        ]
    }'
```
Use Docker
```sh
docker model run hf.co/arunvpp05/zephyr-gguf:Q5_K_M
```
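The vLLM server above can also be called from Python without any extra dependencies. A minimal standard-library sketch, assuming the server is running on `localhost:8000` as in the snippet above (the function names here are illustrative, not part of vLLM):

```python
import json
import urllib.request


def build_payload(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("http://localhost:8000", "arunvpp05/zephyr-gguf",
#            "What is the capital of France?"))
```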
- Ollama
How to use arunvpp05/zephyr-gguf with Ollama:
```sh
ollama run hf.co/arunvpp05/zephyr-gguf:Q5_K_M
```
- Unsloth Studio
How to use arunvpp05/zephyr-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for arunvpp05/zephyr-gguf to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for arunvpp05/zephyr-gguf to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for arunvpp05/zephyr-gguf to start chatting
```
- Docker Model Runner
How to use arunvpp05/zephyr-gguf with Docker Model Runner:
```sh
docker model run hf.co/arunvpp05/zephyr-gguf:Q5_K_M
```
- Lemonade
How to use arunvpp05/zephyr-gguf with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull arunvpp05/zephyr-gguf:Q5_K_M
```
Run and chat with the model
```sh
lemonade run user.zephyr-gguf-Q5_K_M
```
List all available models
```sh
lemonade list
```
# Zephyr-7B-Beta — GGUF (Q5_K_M & Q8_0)
Two ready-to-run GGUF builds of the Zephyr-7B-Beta chat model for local CPU inference via the llama.cpp ecosystem.
These are inference-only quantized weights.
## Files

- `zephyr-q5_k_m.gguf` — balanced quality vs. size (≈ 4.8 GB). A good default for 16 GB RAM laptops.
- `zephyr-q8_0.gguf` — higher fidelity (≈ 7.2 GB). Requires more RAM.
GGUF embeds tokenizer/vocab, so separate tokenizer files are not required for inference.
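A downloaded file can be sanity-checked before loading: every GGUF file starts with the 4-byte ASCII magic `GGUF`, followed by a little-endian `uint32` format version. A minimal sketch of that check (function names are illustrative):

```python
import struct


def is_gguf(path: str) -> bool:
    """True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


def gguf_version(path: str) -> int:
    """Read the little-endian uint32 format version that follows the magic."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        return struct.unpack("<I", f.read(4))[0]
```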
## Prompt Format (Zephyr chat)

Use Zephyr chat tags for best results:

```
<|user|>
YOUR_PROMPT_HERE
<|assistant|>
```
### Example

```
<|user|>
List three ways Retrieval-Augmented Generation improves factuality.
<|assistant|>
```
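When constructing prompts programmatically, the tag layout above can be assembled with a small helper. A minimal sketch (the function name is illustrative, not part of any library):

```python
def zephyr_prompt(user_message: str) -> str:
    """Wrap a user message in the Zephyr chat tags used by this model."""
    return f"<|user|>\n{user_message}\n<|assistant|>\n"


prompt = zephyr_prompt("List three ways Retrieval-Augmented Generation improves factuality.")
```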
## How to Run (llama.cpp)

### CLI (CPU)

#### Q5_K_M (fits most 16 GB RAM systems)

```sh
./llama-cli -m zephyr-q5_k_m.gguf \
  -p "<|user|>\nExplain RAG in 3 bullets.\n\n<|assistant|>\n" \
  -n 256 -c 2048 -ngl 0 -t $(nproc)
```
#### Q8_0 (higher quality; more RAM)

```sh
./llama-cli -m zephyr-q8_0.gguf \
  -p "<|user|>\nGive 5 note-taking tips.\n\n<|assistant|>\n" \
  -n 256 -c 2048 -ngl 0 -t $(nproc)
```
### Flags

- `-n 256` → max new tokens
- `-c 2048` → context window
- `-ngl 0` → CPU-only (set `> 0` to offload layers to the GPU if supported)
- `-t $(nproc)` → threads

Some builds use `./main` instead of `./llama-cli`. Replace the binary name if needed.
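The flags compose into a single command line; a sketch that builds the equivalent `llama-cli` invocation from Python, picking the thread count automatically (the helper name is illustrative):

```python
import os


def llama_cli_cmd(model_path: str, prompt: str,
                  n_predict: int = 256, ctx: int = 2048,
                  gpu_layers: int = 0) -> list:
    """Build the llama-cli argv matching the flags described above."""
    return [
        "./llama-cli",
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_predict),      # max new tokens
        "-c", str(ctx),            # context window
        "-ngl", str(gpu_layers),   # 0 = CPU-only
        "-t", str(os.cpu_count() or 1),  # same role as -t $(nproc)
    ]


cmd = llama_cli_cmd("zephyr-q5_k_m.gguf", "<|user|>\nHi\n<|assistant|>\n")
# subprocess.run(cmd) would launch it
```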
## Popular UIs
Import the `.gguf` file directly in:
- LM Studio
- KoboldCpp
- Text Generation WebUI (llama.cpp backend)
- Ollama (custom import)
## Hardware Notes
Approximate RAM use at 2k context (CPU-only):
- Q5_K_M (~4.8 GB file) → ~8–10 GB RAM
- Q8_0 (~7.2 GB file) → ~12–14 GB RAM
Actual usage varies with context length, batch size, and compile options.
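Much of the gap between file size and RAM use is the KV cache plus runtime buffers. A back-of-the-envelope sketch, assuming Mistral-7B-shaped attention (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 cache) — real numbers depend on the build and quantized-cache options:

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Keys + values for every layer at full context (factor 2 = K and V)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


gib = kv_cache_bytes(2048) / 2**30
# ~0.25 GiB at 2k context; the cache grows linearly with context length.
```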
## Checksums (optional)

Verify downloads:

```sh
sha256sum zephyr-q5_k_m.gguf
sha256sum zephyr-q8_0.gguf
```
(Add the resulting hashes here if you want to publish them.)
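The same check can be scripted; a minimal sketch that streams the file in chunks so multi-gigabyte weights never load into memory (the expected-hash table is a placeholder to fill in with published values):

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# expected = {"zephyr-q5_k_m.gguf": "<published hash>"}
# assert sha256_file("zephyr-q5_k_m.gguf") == expected["zephyr-q5_k_m.gguf"]
```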
## Intended Use & Limitations
- Intended for local assistant/chat and general text generation.
- Not suitable for high-stakes or safety-critical use without human review.
- Outputs may contain mistakes or biases; verify important information.
## What’s Included

Quantized GGUF weights:

- `zephyr-q5_k_m.gguf`
- `zephyr-q8_0.gguf`
No training code or LoRA adapters are included here.
## Acknowledgments

- Base: Zephyr-7B-Beta (converted to GGUF and quantized for CPU inference).
- Inference runtime: `llama.cpp` and compatible UIs.
## Changelog

- v1.0 — Initial release of the `Q5_K_M` and `Q8_0` GGUF builds.
## Model Tree

- Base model: `mistralai/Mistral-7B-v0.1`