Instructions to use Orionfold/Kepler-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Orionfold/Kepler-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Orionfold/Kepler-GGUF",
	filename="model-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use Orionfold/Kepler-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Orionfold/Kepler-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Orionfold/Kepler-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Orionfold/Kepler-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Orionfold/Kepler-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Orionfold/Kepler-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Orionfold/Kepler-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Orionfold/Kepler-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Orionfold/Kepler-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Orionfold/Kepler-GGUF:Q4_K_M

vLLM

How to use Orionfold/Kepler-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Orionfold/Kepler-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Orionfold/Kepler-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Orionfold/Kepler-GGUF with Ollama:
```
ollama run hf.co/Orionfold/Kepler-GGUF:Q4_K_M
```

Unsloth Studio

How to use Orionfold/Kepler-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Orionfold/Kepler-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Orionfold/Kepler-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Orionfold/Kepler-GGUF to start chatting

Pi

How to use Orionfold/Kepler-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Orionfold/Kepler-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Orionfold/Kepler-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Orionfold/Kepler-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Orionfold/Kepler-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Orionfold/Kepler-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Orionfold/Kepler-GGUF with Docker Model Runner:
```
docker model run hf.co/Orionfold/Kepler-GGUF:Q4_K_M
```

Lemonade

How to use Orionfold/Kepler-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Orionfold/Kepler-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Kepler-GGUF-Q4_K_M

List all available models

lemonade list

Kepler GGUF

Kepler is an 8B astrodynamics & quantitative-astrophysics reasoning model — fine-tuned from Qwen/Qwen3-8B to answer orbital-mechanics and astrophysics word problems with a short worked chain and a single \boxed{} numeric answer. It is built for the operator who wants a local, private, $0-per-query numeric reasoner that runs entirely inside an NVIDIA DGX Spark (GB10, 128 GB unified memory) — no API, no network, no per-token bill.

The differentiator is discipline, not size: an SFT pass on a verifier-checked corpus taught Kepler to answer rather than ruminate. It boxes a final answer on 100% of held-out problems with 0% truncation, at roughly 3× the conciseness of frontier cloud models on the same task (~166 output tokens vs ~460–490). Every claim below is a measured run on the Spark, not a wishlist.

GGUF quantizations follow; the recommended variant is Q8_0 (effectively lossless).

Spark-tested

Per-variant accuracy on the held-out astro benchmark — the quantization ladder. Scored with the same \boxed-extracting, SI-unit-normalized, ±2%-relative-tolerance verifier the model was trained against (astro-bench v0.1, n=44 off-template problems, constants given in-prompt).

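| Variant | Size | Perplexity (wikitext-2) | tok/s on Spark | astro-bench v0.1 held-out accuracy (n=44, \boxed ±2%) |
| --- | --- | --- | --- | --- |
| Q4_K_M | 4.7 GB | not reported | not reported | 75.0% |
| Q5_K_M | 5.5 GB | not reported | not reported | 75.0% |
| Q6_K | 6.3 GB | not reported | not reported | 84.1% |
| Q8_0 | 8.2 GB | not reported | not reported | 88.6% |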

Q8_0 is the recommended variant: it preserves full-precision accuracy while halving the F16 footprint. Q4_K_M and Q5_K_M score 75.0% against 88.6% for Q8_0, a drop of about 14 pp concentrated on the hardest compositional rows (see Known drift).
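The scoring rule described above (extract the \boxed answer, normalize to SI units, accept within ±2% relative tolerance) can be sketched in a few lines. This is a minimal illustration, not the published Kepler-bench scorer: the function names `extract_boxed`, `to_si`, and `is_correct`, the regular expression, and the small unit table are all assumptions made for this example.

```python
import re

# Hypothetical unit table: unit -> (factor to SI, SI unit it converts to).
SI_UNITS = {
    "m": (1.0, "m"), "km": (1e3, "m"),
    "s": (1.0, "s"), "min": (60.0, "s"), "h": (3600.0, "s"),
    "m/s": (1.0, "m/s"), "km/s": (1e3, "m/s"),
}
NUMBER_UNIT = re.compile(r"^([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*(\S+)$")


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as "no boxed answer"


def to_si(answer):
    """Parse 'value unit' and return (value in SI, SI unit), or None."""
    match = NUMBER_UNIT.match(answer.strip())
    if not match or match.group(2) not in SI_UNITS:
        return None
    factor, si_unit = SI_UNITS[match.group(2)]
    return float(match.group(1)) * factor, si_unit


def is_correct(model_output, reference, rel_tol=0.02):
    """True if the boxed answer matches the reference within rel_tol."""
    boxed = extract_boxed(model_output)
    got = to_si(boxed) if boxed is not None else None
    want = to_si(reference)
    if got is None or want is None or got[1] != want[1]:
        return False
    return abs(got[0] - want[0]) <= rel_tol * abs(want[0])
```

Under these assumptions an output with no boxed answer, an unknown unit, or a unit of the wrong dimension is scored as incorrect rather than raising an error.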

How it stacks up

Kepler-Q8_0 against frontier cloud models on the same 44-row held-out set, at a matched 4096-token budget, with the same \boxed ±2% verifier (temp 0.6 / top_p 0.95):

| Model | Where it runs | Accuracy | Boxed | Truncation | Mean output tokens |
| --- | --- | --- | --- | --- | --- |
| Kepler-Q8_0 (8B) | Local Spark, $0 | 84.1% | 100% | 0% | 166 |
| Claude Haiku 4.5 | Cloud API | 97.7% | 100% | 0% | 488 |
| Gemini 3.1 Flash-Lite | Cloud API | 95.5% | 100% | 0% | 464 |

The honest read: a local 8B specialist lands ~11–14 pp below frontier small cloud models on off-template numeric reasoning — while running fully offline at zero marginal cost and answering ~3× more concisely. The format reliability (100% boxed, 0% truncation) matches the frontier; the gap is pure accuracy on a handful of multi-step rows. (Kepler's matched-budget 84.1% here vs the 88.6% fidelity number above is run-to-run sampling variance — both land in the mid-to-high 80s.)
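The arithmetic behind these comparisons is easy to reproduce from the table. The sketch below converts each accuracy to a row count out of 44, computes the gap to Kepler and the output-length ratio, and adds a rough binomial standard error. The row counts are inferred from the percentages, and the standard error is a back-of-envelope approximation that treats rows as independent; neither comes from the card itself.

```python
import math

N = 44  # held-out problems, as stated in the card

# Accuracy (%) and mean output tokens, copied from the comparison table.
accuracy = {"Kepler-Q8_0": 84.1, "Claude Haiku 4.5": 97.7, "Gemini 3.1 Flash-Lite": 95.5}
tokens = {"Kepler-Q8_0": 166, "Claude Haiku 4.5": 488, "Gemini 3.1 Flash-Lite": 464}


def rows(pct):
    """Number of correct rows implied by an accuracy percentage (inferred)."""
    return round(pct / 100 * N)


def binomial_se_pp(pct):
    """Approximate standard error of an accuracy estimate, in percentage points."""
    p = pct / 100
    return 100 * math.sqrt(p * (1 - p) / N)


for name in accuracy:
    gap = accuracy[name] - accuracy["Kepler-Q8_0"]
    ratio = tokens[name] / tokens["Kepler-Q8_0"]
    print(f"{name}: {rows(accuracy[name])}/{N} rows, "
          f"gap {gap:+.1f} pp, {ratio:.2f}x Kepler's output length")

print(f"Standard error at 84.1%: {binomial_se_pp(84.1):.1f} pp")
```

With n=44 the standard error comes out at around 5 to 6 pp, so the 4.5 pp difference between the 84.1% and 88.6% Kepler runs (two rows) is within what sampling noise alone could produce.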

Variants

| Variant | Recommended use |
| --- | --- |
| Q4_K_M | Smallest footprint; use when memory is tight and you can accept about 14 pp lower accuracy than Q8_0 (75.0% vs 88.6%), mostly on hard rows. |
| Q5_K_M | Slightly higher quality than Q4_K_M for a modest size bump. |
| Q6_K | Near-lossless; a good middle ground if you have headroom. |
| Q8_0 | Recommended. Effectively lossless: best accuracy, fits the Spark envelope comfortably. |

How to run

Pull the recommended variant:

huggingface-cli download Orionfold/Kepler-GGUF model-Q8_0.gguf \
  --local-dir ./models/kepler

Serve it via llama-server (OpenAI-compatible API):

llama-server -m ./models/kepler/model-Q8_0.gguf \
  -c 4096 -ngl 99 -t 8 \
  --host 0.0.0.0 --port 8080
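Once the server is up, any OpenAI-style client can talk to it. The following is a minimal standard-library sketch, assuming the server from the command above is reachable at `http://localhost:8080` and exposes `/v1/chat/completions`. The helper names `build_request` and `ask` are hypothetical, and the `model` field is omitted on the assumption that a single-model llama-server does not need it; add one if your build rejects the request.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumed: matches --port 8080 above


def build_request(prompt, temperature=0.6, top_p=0.95):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # sampling settings quoted in the card
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("Compute the orbital period at 550 km altitude. Give your final answer as \\boxed{value unit}.")` should return the worked chain and the boxed answer as one string.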

Or run in-process via llama-cpp-python:

from llama_cpp import Llama

# Load the Q8_0 GGUF; n_gpu_layers=99 requests GPU offload of up to 99 layers.
llm = Llama(
    model_path="./models/kepler/model-Q8_0.gguf",
    n_ctx=4096, n_gpu_layers=99, chat_format="chatml",
)

# Ask for a worked solution that ends in a single boxed answer.
prompt = (
    "A satellite orbits Earth in a circular orbit at altitude 550 km. "
    "Compute its orbital period in minutes. "
    "Give your final answer as \\boxed{value unit}."
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.6,
)
print(out["choices"][0]["message"]["content"])
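To sanity-check the model's reply to the example prompt, the period of a circular orbit follows directly from Kepler's third law, T = 2π·sqrt(a³/μ). The sketch below uses commonly quoted values for Earth's gravitational parameter and mean radius; both constants are assumptions here, since the prompt does not supply them, so a correct model answer may differ slightly depending on the constants it uses.

```python
import math

MU_EARTH = 3.986004418e14  # m^3/s^2, Earth's gravitational parameter (assumed)
R_EARTH = 6.371e6          # m, mean Earth radius (assumed)


def circular_period_minutes(altitude_km):
    """Orbital period in minutes for a circular orbit at the given altitude."""
    a = R_EARTH + altitude_km * 1e3  # orbital radius in metres
    return 2 * math.pi * math.sqrt(a ** 3 / MU_EARTH) / 60


print(f"{circular_period_minutes(550):.1f} min")
```

With these constants the period comes out at roughly 95.5 minutes, which gives a reference point for the ±2% tolerance used elsewhere in this card.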

LM Studio and Ollama (via a Modelfile) load the GGUF directly with no additional setup.

Known drift

Kepler is honest about where it misses. Across all quants, errors cluster on two families:

hohmann_transfer — two-burn orbital transfers (the most multi-step problems).
altitude_from_period — inverse Kepler (solving for orbital radius given the period).

These are an SFT coverage gap, not a precision artifact — they fail similarly at every quant level and were flagged by the headroom analysis as needing more training coverage rather than reinforcement learning. Treat Kepler's answers on multi-burn transfer problems as draft-quality and verify them.
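Both weak families have closed-form solutions, so an independent check is cheap. The sketch below computes the two burns of a Hohmann transfer between circular orbits from the vis-viva equation, and inverts Kepler's third law to recover altitude from period. The function names and the constants for Earth are assumptions for this example; it is not part of the Kepler-bench scorer.

```python
import math

MU_EARTH = 3.986004418e14  # m^3/s^2, Earth's gravitational parameter (assumed)
R_EARTH = 6.371e6          # m, mean Earth radius (assumed)


def hohmann_delta_v(r1, r2, mu=MU_EARTH):
    """Return (dv1, dv2) in m/s for a transfer between circular orbits r1 -> r2 (metres)."""
    a_transfer = (r1 + r2) / 2                          # semi-major axis of the transfer ellipse
    v_circ_1 = math.sqrt(mu / r1)                       # circular speed at r1
    v_circ_2 = math.sqrt(mu / r2)                       # circular speed at r2
    v_trans_1 = math.sqrt(mu * (2 / r1 - 1 / a_transfer))  # vis-viva at r1
    v_trans_2 = math.sqrt(mu * (2 / r2 - 1 / a_transfer))  # vis-viva at r2
    return abs(v_trans_1 - v_circ_1), abs(v_circ_2 - v_trans_2)


def altitude_from_period(period_s, mu=MU_EARTH, radius=R_EARTH):
    """Altitude in metres of the circular orbit with the given period in seconds."""
    a = (mu * (period_s / (2 * math.pi)) ** 2) ** (1 / 3)
    return a - radius


# Example: 300 km low Earth orbit to geostationary radius (42,164 km).
dv1, dv2 = hohmann_delta_v(R_EARTH + 300e3, 42_164e3)
print(f"burn 1: {dv1:.0f} m/s, burn 2: {dv2:.0f} m/s, total: {dv1 + dv2:.0f} m/s")
print(f"altitude for a 96-minute orbit: {altitude_from_period(96 * 60) / 1e3:.0f} km")
```

If Kepler's boxed answer for a transfer or inverse-period problem disagrees with a check like this by more than a few percent, treat the model output as the suspect value.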

Companion benchmark

The exact benchmark used above is published as a dataset: Orionfold/Kepler-bench — the problem pool + held-out set + the verifier-as-reward scorer, so you can reproduce these numbers.

Methods

Full methodology — the scout, the verifier-is-the-reward bench, the SFT corpus, the SFT-vs-RLVR decision, and the Spark-side measurement protocol: The Gate Before the GPU — Deciding SFT vs RL vs RLVR Before You Spend the Run.

Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Format: GGUF
Model size: 8B params
Architecture: qwen3
Available quantizations: 4-bit, 5-bit, 6-bit, 8-bit

Model tree for Orionfold/Kepler-GGUF

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Quantized: this model