The base model, Tri-21B, is covered by the Trillion license included in this repo; the Think and Think-Preview versions are licensed under Apache 2.0.
Introduction
Tri-21B-Think-Preview is an intermediate checkpoint of Tri-21B-Think, featuring mid-training context length expansion to 32K tokens and instruction tuning for chain-of-thought reasoning and tool use.
Model Specifications
- Type: Causal Language Model (Reasoning-Enhanced)
- Base Model: Tri-21B
- Architecture: Transformer Decoder with RoPE, SwiGLU, RMSNorm, and GQA
- Number of Parameters: 20.73B
- Number of Layers: 40
- Number of Attention Heads: 32 (Query) / 8 (Key, Value)
- Head Dimension: 160
- Hidden Size: 5,120
- Intermediate Size: 27,392
- Context Length: 32,768 (up to 262,144 with YaRN)
- Vocab Size: 124,416
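As a rough illustration of what these numbers imply for memory, the sketch below estimates the KV-cache size from the layer count, KV-head count, and head dimension listed above. It assumes 2 bytes per cached value (bf16/fp16) and ignores cache quantization and framework overhead; the constant and function names are illustrative and not part of any library.

```python
# Back-of-the-envelope KV-cache estimate from the specifications listed above.
# Assumption: 2 bytes per cached value (bf16/fp16), no quantization or overhead.

NUM_LAYERS = 40
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 160
HIDDEN_SIZE = 5120
NATIVE_CONTEXT = 32_768
YARN_CONTEXT = 262_144


def kv_cache_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2 accounts for storing both keys and values in every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * num_tokens


if __name__ == "__main__":
    # Query heads times head dimension matches the hidden size.
    assert NUM_QUERY_HEADS * HEAD_DIM == HIDDEN_SIZE
    gib = 1024 ** 3
    print(f"per token: {kv_cache_bytes(1) / 1024:.0f} KiB")
    print(f"32K context: {kv_cache_bytes(NATIVE_CONTEXT) / gib:.2f} GiB")
    print(f"262K context: {kv_cache_bytes(YARN_CONTEXT) / gib:.2f} GiB")
    print(f"context extension ratio: {YARN_CONTEXT // NATIVE_CONTEXT}x")
```

Under these assumptions the cache costs about 200 KiB per token, which is 6.25 GiB per sequence at the native 32,768-token context and 50 GiB at the 262,144-token YaRN limit. With 8 KV heads instead of 32, GQA keeps this at a quarter of what full multi-head attention would need.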
Quickstart
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "trillionlabs/Tri-21B-Think-Preview"

# Load the weights in bfloat16 and let Accelerate place them on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Solve the following step by step: What is the sum of the first 100 prime numbers?"
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.6,
    top_p=0.9
)

# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
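If you want to separate the chain-of-thought from the final answer, a small post-processing helper can do it. The sketch below is an assumption-laden example: it assumes the reasoning is wrapped in `<think>...</think>` and that the tags are still present in the decoded text. If the tags are registered as special tokens, decoding with `skip_special_tokens=True` may strip them, in which case decode with `skip_special_tokens=False` first. `split_reasoning` is a hypothetical helper name, not part of Transformers.

```python
import re
from typing import Tuple

# Assumption: reasoning is wrapped in <think>...</think> and the tags survive decoding.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>2 + 2 is 4.</think>The answer is 4."
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```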
vLLM & SGLang Deployment
vLLM and SGLang support for Trillion Model is on the way. Stay tuned!
Fine-tuning Notes
Note on <think> tags: This model was trained without <think> and </think> as special tokens. They were added post-training for compatibility with reasoning parsers. If you plan to fine-tune this model, you'll need to modify tokenizer_config.json to avoid indexing errors.
Replace tokens 123975 and 123976 in tokenizer_config.json:
"123975": {
  "content": "<|reserved_special_token_9|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},
"123976": {
  "content": "<|reserved_special_token_10|>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
}
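The replacement can also be scripted rather than edited by hand. The following is a minimal sketch, assuming the two entries live under the `added_tokens_decoder` key (where Hugging Face `tokenizer_config.json` files usually keep them); check your own file before relying on it. `patch_added_tokens` and `patch_file` are hypothetical helper names.

```python
import json
from pathlib import Path

# Token IDs and replacement contents are taken from the note above.
# Assumption: the entries live under the "added_tokens_decoder" key.
REPLACEMENTS = {
    "123975": "<|reserved_special_token_9|>",
    "123976": "<|reserved_special_token_10|>",
}


def patch_added_tokens(config: dict) -> dict:
    """Return a copy of `config` with the two token entries replaced."""
    patched = json.loads(json.dumps(config))  # deep copy via JSON round trip
    decoder = patched.setdefault("added_tokens_decoder", {})
    for token_id, content in REPLACEMENTS.items():
        decoder[token_id] = {
            "content": content,
            "lstrip": False,
            "normalized": False,
            "rstrip": False,
            "single_word": False,
            "special": True,
        }
    return patched


def patch_file(path: str) -> None:
    """Rewrite tokenizer_config.json in place, keeping a .bak copy next to it."""
    config_path = Path(path)
    original = config_path.read_text(encoding="utf-8")
    config_path.with_name(config_path.name + ".bak").write_text(original, encoding="utf-8")
    patched = patch_added_tokens(json.loads(original))
    config_path.write_text(
        json.dumps(patched, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Calling `patch_file("path/to/tokenizer_config.json")` (a placeholder path) on a local copy of the model directory applies the change and leaves the original in `tokenizer_config.json.bak`.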
Evaluation
| Category | Benchmark | Description | Tri-21B-Think-Preview |
|---|---|---|---|
| Reasoning | GPQA-Diamond | Graduate-level science questions across physics, chemistry, and biology (PhD-level) | 54 |
| | AIME 2025 | American Invitational Mathematics Examination 2025 | 50.0 |
| | MMLU-Pro | Massive Multitask Language Understanding with more answer choices and reasoning-focused questions | 65.19 |
| | HLE | Humanity's Last Exam: 2,500 expert-level questions across 100+ subjects created by nearly 1,000 domain experts | 5.12 |
| Coding | LiveCodeBench v6 | Competitive programming benchmark with problems sourced from recent programming contests | 48.57 |
| | SciCode | Code generation across 338 subproblems in 16 natural science fields drawn from real research workflows | 18 |
| Instruction Following | IFEval | Tests ability to follow precise formatting and output constraint instructions | 84.05 |
| | IFBench | Evaluates generalization to novel, verifiable output constraints not seen during training (Allen AI) | 51.02 |
| Agentic | TAU2-Bench (Telecom) | Dual-control conversational benchmark where both agent and user use tools to resolve telecom scenarios (Sierra) | 93 |
| | AA-LCR | Long-context reasoning over multiple documents at 10K–100K tokens (Artificial Analysis) | 15 |
| | AA-Omniscience | Factual reliability across 6,000 questions in 42 subtopics, penalizing hallucinations (Artificial Analysis) | -48.55 |
| Korean | KMMLU-Pro | 2,822 questions from 14 Korean National Professional Licensure exams (LG AI Research) | 54.18 |
| | CLIcK | 1,995 Korean cultural and linguistic knowledge questions sourced from official exams and textbooks (KAIST) | 77.94 |
| | KoBALT | Korean linguistic understanding across syntax, semantics, pragmatics, phonetics, and morphology (SNU) | 47.29 |
Limitations
- Language Support: Optimized for English, Korean, and Japanese. Other languages may show degraded performance.
- Knowledge Cutoff: February 2025.
- Intermediate Checkpoint: See Tri-21B-Think for the final model.
License
This model is licensed under the Apache 2.0 License.
Contact
For inquiries: info@trillionlabs.co