Instructions for using PaletLabs/Circe with libraries and local apps.

Libraries

How to use PaletLabs/Circe with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="PaletLabs/Circe")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("PaletLabs/Circe")
model = AutoModelForCausalLM.from_pretrained("PaletLabs/Circe")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use PaletLabs/Circe with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "PaletLabs/Circe"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "PaletLabs/Circe",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
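The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started in the previous step is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one user turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str,
         model: str = "PaletLabs/Circe",
         base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # OpenAI-compatible servers return the text under choices[0].message.content.
    return reply["choices"][0]["message"]["content"]
```

With a server running, `print(chat("What is the capital of France?"))` prints the model's answer. Passing `base_url="http://localhost:30000"` targets the SGLang server instead.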

Use Docker

docker model run hf.co/PaletLabs/Circe

SGLang

How to use PaletLabs/Circe with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "PaletLabs/Circe" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "PaletLabs/Circe",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "PaletLabs/Circe" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use PaletLabs/Circe with Docker Model Runner:
```
docker model run hf.co/PaletLabs/Circe
```

ErnestoOjeda


	---
	license: mit
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- bilingual
	- lora
	- rl
	- cost-efficient
	- tiny-models
	language:
	- en
	- es
	---

	# 🪐 Circe-1.5B

	<!-- center-aligned, capped at 420 px wide × 240 px tall -->
	<p align="center">
	<img
	src="https://cdn-uploads.huggingface.co/production/uploads/657e1ad01e3e9c41a49b732e/8IsJaxuOwuqBN0GctRUUe.png"
	alt="Circe-1.5B schematic"
	width="420"
	height="240"
	/>
	</p>


	Circe-1.5B is a single-checkpoint, 1.5 B-parameter language model that asks a simple question:

	> _“How far can you push tiny models on a tiny budget?”_

	| ⚙️ Spec | Value |
	|---------|-------|
	| Base model | `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` |
	| Trainable params | 4 M (LoRA) |
	| Post-training cost | ≈ US $12 on 1×L40S |
	| Training recipe | 8 h SFT → 4 h GRPO |
	| Context length | up to 4 k tokens (tested) |
	| RAM @ bf16 | ~9 GB (≤ 3 GB 4-bit GPTQ) |
	| Throughput | ~55 tok / s on 1×A6000 (fp16, no compile) |

	It keeps DeepSeek-R1’s strong reasoning depth and adds fluent bilingual chat (English and Spanish) in a checkpoint that fits on a laptop GPU.
	We intend to use it as a reproducible waypoint on the road to real-time speech-to-speech reasoning systems.
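
A few numbers in the spec table can be cross-checked with simple arithmetic. The sketch below takes the table values as given. The bytes-per-parameter figures (2 for bf16, 0.5 for 4-bit) are the standard weight sizes, and the results cover weights only, so they sit below the table's memory figures, which presumably also include runtime overhead (an assumption, since the card does not break them down).

```python
# Values copied from the spec table.
TOTAL_PARAMS = 1.5e9        # 1.5 B parameters
TRAINABLE_PARAMS = 4e6      # 4 M LoRA parameters
TRAIN_HOURS = 8 + 4         # 8 h SFT + 4 h GRPO
TOTAL_COST_USD = 12.0       # approx. US $12 on 1xL40S


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


trainable_fraction = TRAINABLE_PARAMS / TOTAL_PARAMS
implied_hourly_rate = TOTAL_COST_USD / TRAIN_HOURS

print(f"trainable share : {trainable_fraction:.2%}")
print(f"implied GPU rate: ${implied_hourly_rate:.2f}/h")
print(f"bf16 weights    : {weight_gb(TOTAL_PARAMS, 2):.2f} GB")
print(f"4-bit weights   : {weight_gb(TOTAL_PARAMS, 0.5):.2f} GB")
```

The LoRA adapters therefore touch roughly 0.27% of the parameters, and the stated budget implies about US $1 per GPU-hour.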

	---

	# 🔭 Intended Use

	* Base for new LoRAs — domain adaptation, longer-context studies.
	* Research into cost-efficient RL for reasoning.
	* Not for high-stakes or production tasks.

	See the [⚙️ Limitations](#️-limitations--bias) section before use.

	---

	# ⚡ Quickstart

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained("PaletLabs/Circe-1.5B", torch_dtype="bfloat16")
	tok = AutoTokenizer.from_pretrained("PaletLabs/Circe-1.5B")

	prompt = "<|user|>¿Cómo se dice “tiny model” en español?<|assistant|>"
	out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=64)
	print(tok.decode(out[0], skip_special_tokens=True))
	```

	---

	# 🛠️ Installation
	```bash
	git clone https://github.com/palet-global/circe
	cd circe
	python -m venv venv && source venv/bin/activate
	pip install .
	```

	## 🏗️ Re-Training Pipeline

	### Data
	```bash
	python data/fetch_datasets.py --out data/processed
	```

	### Supervised LoRA
	```bash
	accelerate config default # one-time
	accelerate launch train/sft.py \
	--data_dir data/processed \
	--output_dir checkpoints/sft
	```

	### RL (GRPO)
	```bash
	accelerate launch train/rl_grpo.py \
	--data_dir data/processed \
	--output_dir checkpoints/grpo \
	--init_ckpt checkpoints/sft/checkpoint-13000 \
	--num_steps 3000 --save_steps 500 --group 4
	```

	### Merge and Tokenizer
	```bash
	python train/merge_lora.py \
	--ckpt_dir checkpoints/grpo \
	--base deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
	```

	### SQuAD Sanity Checks
	```bash
	python eval/quick_squad_eval.py --model ./merged --dataset squad
	python eval/quick_squad_eval.py --model ./merged --dataset squad_es
	```

	### Upload
	```bash
	python train/upload_to_hub.py \
	--model_dir merged \
	--repo PaletLabs/Circe-1.5B \
	--token "$HF_TOKEN"
	```

	---

	# 💻 Hardware & Inference Tips
	- bf16 / fp16: Needs ~9 GB VRAM.
	- 4-bit GPTQ: < 3 GB. 4-bit loading via `bitsandbytes` also works out of the box.
	- Compile once (`torch.compile`) for +10–15 % throughput.
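
These tips can be applied at load time. The sketch below is one way to do it with `transformers`, assuming `torch`, `transformers`, and (for the 4-bit path) `bitsandbytes` are installed and a CUDA GPU is available. The helper names `build_load_kwargs` and `load_circe` are illustrative, the NF4 settings are a common default rather than something this README specifies, and the model ID is the one used in the Quickstart.

```python
def build_load_kwargs(four_bit: bool = False) -> dict:
    """Keyword arguments for AutoModelForCausalLM.from_pretrained."""
    if not four_bit:
        return {"torch_dtype": "bfloat16"}
    # Imported lazily so the bf16 path stays import-free.
    import torch
    from transformers import BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # assumed setting, not from the README
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return {"quantization_config": quant, "device_map": "auto"}


def load_circe(model_id: str = "PaletLabs/Circe-1.5B",
               four_bit: bool = False,
               compile_model: bool = False):
    """Load the model, optionally 4-bit quantized or with a compiled forward pass."""
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs(four_bit))
    if compile_model:
        # Compilation is triggered on the first forward pass, so expect a slow first call.
        model.forward = torch.compile(model.forward)
    return model
```

For example, `load_circe(four_bit=True)` targets the low-memory path and `load_circe(compile_model=True)` the higher-throughput one. Combining both flags is untested in this sketch.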

	---
	# ✍️ Current Evaluation Status
	Formal lighteval / MMLU / GSM-8K runs are queued. Preliminary spot-checks show Circe retains DeepSeek-R1’s chain-of-thought depth on reasoning-heavy QA while adding smooth bilingual generation.

	---
	# ⚙️ Limitations & Bias
	- No reward-model alignment.
	- Long-context (> 4 k) stability untested.
	- Training data bias from public QA pairs. Spanish coverage favors Latin American variants.
	- Minimal safety filters: add your own guardrails before exposing the model to users.

	---
	# 🔮 Roadmap
	- Publish full reasoning benchmark suite & eval scripts.
	- Release code-reasoning and doc-QA adapters.
	- Attach a 24 kHz neural codec → real-time, full-duplex voice chat without ASR → TTS hops.

	---
	# 🪪 License
	This project is licensed under the [MIT](https://opensource.org/licenses/MIT) License. Attribution appreciated but not required.