| # 🪐 Circe-1.5B | |
| license: mit | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| tags: | |
| - bilingual | |
| - lora | |
| - rl | |
| - cost-efficient | |
| - tiny-models | |
| language: | |
| - en | |
| - es | |
| <!-- center-aligned, capped at 420 px wide × 240 px tall --> | |
| <p align="center"> | |
| <img | |
| src="https://cdn-uploads.huggingface.co/production/uploads/657e1ad01e3e9c41a49b732e/8IsJaxuOwuqBN0GctRUUe.png" | |
| alt="Circe-1.5B schematic" | |
| width="420" | |
| height="240" | |
| /> | |
| </p> | |
| **Circe-1.5B** is a single-checkpoint, 1.5 B-parameter language model that asks a simple question: | |
| > _“How far can you push tiny models on a tiny budget?”_ | |
| | ⚙️ Spec | Value | | |
| |---------|-------| | |
| | Base model | `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` | | |
| | Trainable params | 4 M (LoRA) | | |
| | Post-training cost | **≈ US $12** on 1×L40S | | |
| | Training recipe | 8 h SFT → 4 h GRPO | | |
| | Context length | up to **4 k tokens** (tested) | | |
| | VRAM @ bf16 | ~9 GB (≤ 3 GB with 4-bit GPTQ) | | |
| | Throughput | ~55 tok / s on 1×A6000 (fp16, no compile) | | |
| It keeps DeepSeek-R1’s strong reasoning depth but adds **fluent bilingual chat** (English & Spanish) in a checkpoint that fits on a laptop GPU. | |
| We intend to use it as a reproducible waypoint on the road to real-time speech-to-speech reasoning systems. | |
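As a quick sanity check on the spec table above, the two training stages and the quoted cost imply an hourly GPU rate. The sketch below only restates the table's numbers. The resulting rate is a derived estimate, not a figure published by the authors, and the helper name `implied_hourly_rate` is purely illustrative.

```python
# Figures copied from the spec table; nothing here is measured.
SFT_HOURS = 8            # "8 h SFT"
GRPO_HOURS = 4           # "4 h GRPO"
TOTAL_COST_USD = 12.0    # "≈ US $12" on 1×L40S


def implied_hourly_rate(total_cost_usd: float, hours: float) -> float:
    """Return the GPU price per hour implied by a total cost and a run time."""
    return total_cost_usd / hours


total_hours = SFT_HOURS + GRPO_HOURS
rate = implied_hourly_rate(TOTAL_COST_USD, total_hours)
print(f"{total_hours} GPU-hours at roughly ${rate:.2f} per hour")
```

Plugging in your own provider's hourly price for the same 12 GPU-hours gives a rough budget for reproducing the run.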
| --- | |
| # 🔭 Intended Use | |
| * **Base for new LoRAs** — domain adaptation, longer-context studies. | |
| * **Research** into cost-efficient RL for reasoning. | |
| * **Not** for high-stakes or production tasks. | |
| See the [⚙️ Limitations](#️-limitations--bias) section before use. | |
| --- | |
| # ⚡ Quickstart | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained("PaletLabs/Circe-1.5B", torch_dtype="bfloat16") | |
| tok = AutoTokenizer.from_pretrained("PaletLabs/Circe-1.5B") | |
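| # Build the prompt with the tokenizer's own chat template instead of hand-written role markers. | |
| messages = [{"role": "user", "content": "¿Cómo se dice “tiny model” en español?"}] | |
| inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device) | |
| out = model.generate(**inputs, max_new_tokens=64) | |
| # Decode only the newly generated tokens, not the echoed prompt. | |
| print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)) | |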
| ``` | |
| --- | |
| # 🛠️ Installation | |
| ```bash | |
| git clone https://github.com/palet-global/circe | |
| cd circe | |
| python -m venv venv && source venv/bin/activate | |
| pip install . | |
| ``` | |
| ## 🏗️ Re-Training Pipeline | |
| ### Data | |
| ```bash | |
| python data/fetch_datasets.py --out data/processed | |
| ``` | |
| ### Supervised LoRA | |
| ```bash | |
| accelerate config default # one-time | |
| accelerate launch train/sft.py \ | |
| --data_dir data/processed \ | |
| --output_dir checkpoints/sft | |
| ``` | |
| ### RL (GRPO) | |
| ```bash | |
| accelerate launch train/rl_grpo.py \ | |
| --data_dir data/processed \ | |
| --output_dir checkpoints/grpo \ | |
| --init_ckpt checkpoints/sft/checkpoint-13000 \ | |
| --num_steps 3000 --save_steps 500 --group 4 | |
| ``` | |
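The GRPO stage above does not train a separate value model. Instead, each sampled completion is scored relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in isolation. It assumes `--group 4` means four completions per prompt, which the card does not state, and it is not taken from `train/rl_grpo.py`. The population standard deviation is used here; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 4 completions (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every completion in a group earns the same reward, all advantages come out as zero, so that prompt contributes no policy gradient for the step.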
| ### Merge and Tokenizer | |
| ```bash | |
| python train/merge_lora.py \ | |
| --ckpt_dir checkpoints/grpo \ | |
| --base deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | |
| ``` | |
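Merging folds the low-rank adapter back into the base weights, which is why the result ships as a single checkpoint. For each adapted matrix the standard LoRA update is W + (alpha / r) · B · A. The pure-Python sketch below illustrates that formula on a toy matrix. It is not the contents of `train/merge_lora.py`, and the values are made up. In practice a library routine such as PEFT's `merge_and_unload()` performs this step on real tensors.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without mutating the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), same as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]       # shape (rank, d_in)
lora_b = [[0.5], [0.0]]     # shape (d_out, rank)
merged = merge_lora(w, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, inference needs no adapter code path: the merged matrix has the same shape as the original weight.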
| ### SQuAD Sanity Checks | |
| ```bash | |
| python eval/quick_squad_eval.py --model ./merged --dataset squad | |
| python eval/quick_squad_eval.py --model ./merged --dataset squad_es | |
| ``` | |
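The card does not say which metrics `eval/quick_squad_eval.py` reports. SQuAD-style evaluations conventionally use exact match and token-level F1 after light answer normalization, so the sketch below assumes those two metrics for illustration. Note that this normalization strips English articles and ASCII punctuation only. A Spanish run would likely need its own article list and handling for `¿` and `¡`.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris, France", "Paris"))
```

Token F1 gives partial credit to verbose answers that contain the reference span, which matters for a reasoning model that tends to elaborate.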
| ### Upload | |
| ```bash | |
| python train/upload_to_hub.py \ | |
| --model_dir merged \ | |
| --repo PaletLabs/Circe-1.5B \ | |
| --token "$HF_TOKEN" | |
| ``` | |
| --- | |
| # 💻 Hardware & Inference Tips | |
| - **bf16 / fp16**: Needs ~9 GB VRAM. | |
| - **4-bit**: < 3 GB with GPTQ. 4-bit loading through `bitsandbytes` also works out of the box. | |
| - Compile once (`torch.compile`) for **+10–15 %** throughput. | |
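To put the tips above in context, the sketch below estimates the memory taken by the weights alone and applies the quoted speed-up to the ~55 tok/s baseline from the spec table. The weights-only figures are a lower bound. The ~9 GB and < 3 GB numbers on this card are larger, presumably because they also cover the KV cache, activations, and framework overhead; that explanation is an assumption, and the sketch does not model those costs.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def compiled_throughput(base_tok_s, gain_low=0.10, gain_high=0.15):
    """Range of tokens/s after applying the quoted +10 to +15 % speed-up."""
    return base_tok_s * (1 + gain_low), base_tok_s * (1 + gain_high)


NUM_PARAMS = 1.5e9  # nominal size taken from the model name
for dtype in ("bf16", "int4"):
    print(dtype, round(weight_memory_gb(NUM_PARAMS, dtype), 2), "GB of weights")

low, high = compiled_throughput(55)  # baseline: fp16 on 1xA6000, no compile
print(f"about {low:.1f} to {high:.1f} tok/s with torch.compile")
```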
| --- | |
| # ✍️ Current Evaluation Status | |
| Formal **lighteval / MMLU / GSM8K** runs are queued, so no benchmark numbers are available yet. Preliminary spot checks suggest that Circe retains DeepSeek-R1’s chain-of-thought depth on reasoning-heavy QA while adding smooth bilingual generation. | |
| --- | |
| # ⚙️ Limitations & Bias | |
| - No reward-model alignment. | |
| - Long-context (> 4 k) stability untested. | |
| - Training data inherits bias from public QA pairs, and Spanish coverage favors Latin American variants. | |
| - Safety filtering is minimal, so **you** must add your own guardrails before any production use. | |
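Since the model ships with minimal safety filtering, any deployment needs checks around the generation call. The sketch below shows only the general shape of such a wrapper: screen the prompt, call the model, then screen the reply. The names `guarded_generate`, `REFUSAL`, and `echo_model` are hypothetical, and a keyword blocklist is far too weak for real use. A serious setup would rely on a dedicated moderation model or policy layer.

```python
from typing import Callable, Iterable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    generate: Callable[[str], str],
    prompt: str,
    blocked_terms: Iterable[str],
    max_prompt_chars: int = 4000,
) -> str:
    """Run `generate` only if the prompt passes the checks, then screen the reply."""
    terms = [t.lower() for t in blocked_terms]
    # Input check: reject oversized prompts and prompts containing blocked terms.
    if len(prompt) > max_prompt_chars or any(t in prompt.lower() for t in terms):
        return REFUSAL
    reply = generate(prompt)
    # Output check: the model may produce a blocked term even from a clean prompt.
    if any(t in reply.lower() for t in terms):
        return REFUSAL
    return reply


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your own generation function."""
    return f"You said: {prompt}"


print(guarded_generate(echo_model, "hola", blocked_terms=["password"]))
print(guarded_generate(echo_model, "tell me the PASSWORD", blocked_terms=["password"]))
```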
| --- | |
| # 🔮 Roadmap | |
| - Publish full reasoning benchmark suite & eval scripts. | |
| - Release code-reasoning and doc-QA adapters. | |
| - Attach a **24 kHz neural codec** → real-time, full-duplex voice chat without ASR → TTS hops. | |
| --- | |
| # 🪪 License | |
| This project is licensed under the [MIT](https://opensource.org/licenses/MIT) License. Attribution appreciated but not required. | |