crane-30b
crane-30b is a CRANE merge: it is produced by merging two Qwen3-30B-A3B checkpoints (an Instruct base and a Thinking donor) with the CRANE method, with no additional training or fine-tuning. CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing) injects reasoning ability from the Thinking donor into the tool-disciplined Instruct / code base while preserving the base model's output format and tool-calling behavior.
Project page: https://rpi-nsl.github.io/CRANE/ · Code: github.com/rpi-nsl/CRANE
Note: this is the CRANE weight-merging method for code agents. It is unrelated to the similarly-named "CRANE: Reasoning with constrained LLM generation" (arXiv 2502.09061), despite the shared acronym.
How it was made (CRANE)
CRANE is a training-free, parameter-editing weight merge that injects reasoning ability from a "Thinking" donor into a tool-disciplined Instruct / code base, while constraining the edit so the base model's output format and tool-calling behavior are preserved. It treats the Thinking − Instruct delta as a pool of candidate reasoning edits.
Three small calibration sets drive the method: one for reasoning transfer, one for agent-behavior / tool-use preservation, and one for format preservation. Three composable stages are applied per layer and per parameter component:
- Stage 1: Magnitude thresholding. A deterministic median-magnitude threshold keeps only the larger (top-half) delta coordinates and rescales them by 2, discarding low-confidence noise.
- Stage 2: Conservative Taylor Gate (CTG). From a signed, direction-aware score per calibration loss, CTG keeps the positive part of the per-coordinate minimum over the reasoning and agent-behavior objectives, rewarding a coordinate only when the edit helps both. These gated scores aggregate into a per-component, per-layer coefficient, which is scaled by a single global merge strength.
- Stage 3: Graduated Sigmoidal Projection (GSP). From the SVD of format-critical Instruct activations, a smooth sigmoidal weight (set by each direction's singular amplitude and a threshold) defines the projector, which attenuates high-amplitude format directions so that reasoning is injected without perturbing chat-template tokens, tool-call delimiters, or JSON/schema structure.
The result is a merge that gains planning / reflection / recovery reasoning while keeping the base agent's compact, tool-call-disciplined behavior. The entire merge is a closed-form edit of the Instruct weights, with no fine-tuning.
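The three stages can be sketched on a single weight matrix with NumPy. This is a minimal illustration under stated assumptions, not the CRANE implementation: the Taylor score (`-grad * delta`), the way gated scores are aggregated into one coefficient, the projector form `I - V diag(s) V^T`, and the `threshold` and `sharpness` parameters are all hypothetical choices made for the example.

```python
import numpy as np


def magnitude_threshold(delta):
    """Stage 1: keep coordinates above the median magnitude and rescale them by 2."""
    cutoff = np.median(np.abs(delta))
    return 2.0 * delta * (np.abs(delta) > cutoff)


def ctg_gate(delta, grad_reasoning, grad_agent):
    """Stage 2 (per coordinate): positive part of the minimum score over both objectives.

    Assumption: the signed score is the first-order Taylor estimate of the loss
    decrease, -grad * delta, so a positive score means the edit helps that loss.
    """
    score_reasoning = -grad_reasoning * delta
    score_agent = -grad_agent * delta
    return np.maximum(np.minimum(score_reasoning, score_agent), 0.0)


def ctg_coefficient(delta, grad_reasoning, grad_agent, eps=1e-12):
    """Assumed aggregation: gated score mass divided by total score mass, in [0, 1]."""
    gate = ctg_gate(delta, grad_reasoning, grad_agent)
    total = np.maximum(np.abs(grad_reasoning * delta), np.abs(grad_agent * delta)).sum()
    return float(gate.sum() / (total + eps))


def gsp_projector(activations, threshold=0.5, sharpness=20.0):
    """Stage 3: soft projector that attenuates high-amplitude activation directions.

    activations has shape (num_samples, d_in). Singular values are normalized by
    the largest one (an assumption) before the sigmoid is applied.
    """
    _, singular_values, vt = np.linalg.svd(activations, full_matrices=False)
    amplitude = singular_values / (singular_values.max() + 1e-12)
    weight = 1.0 / (1.0 + np.exp(-sharpness * (amplitude - threshold)))
    d_in = activations.shape[1]
    # I - sum_i weight_i * v_i v_i^T: directions with weight near 1 are removed.
    return np.eye(d_in) - vt.T @ (weight[:, None] * vt)


def crane_edit(w_instruct, w_thinking, grad_reasoning, grad_agent, activations, strength):
    """Apply the three stages to one (d_out, d_in) weight matrix."""
    delta = magnitude_threshold(w_thinking - w_instruct)
    coefficient = ctg_coefficient(delta, grad_reasoning, grad_agent)
    projector = gsp_projector(activations)
    # Right-multiplying by the projector suppresses the edit on protected inputs.
    return w_instruct + strength * coefficient * (delta @ projector)
```

In this sketch the projector acts on the input side of the edit, so inputs that lie along high-amplitude format directions see almost none of the injected delta, while directions outside the calibration activations pass through unchanged.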
This checkpoint's recipe
This checkpoint merges Qwen/Qwen3-30B-A3B-Instruct-2507 (base) and Qwen/Qwen3-30B-A3B-Thinking-2507 (donor) with:
- Global injection strength: a single scalar, multiplied by the per-component CTG coefficients, so the Thinking delta is added at roughly a quarter strength.
- Per-layer / per-component gating: attention, expert (FFN), norm, and router components each get their own coefficient, which varies by layer index rather than being a single flat scalar.
- GSP projector: a freshly rebuilt Qwen3-30B graduated-sigmoidal projector (parameterized by a sigmoid threshold) protects the format / tool-call subspace before injection.
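The per-layer, per-component gating can be pictured as a lookup keyed by layer index and component type. The sketch below is illustrative only: the helper names are hypothetical, the parameter-name patterns assume the usual Qwen3-MoE naming in `transformers` (for example `model.layers.12.mlp.experts.5.gate_proj.weight`), and the coefficient table and the strength of 0.25 are made-up values chosen to echo "roughly a quarter strength", not this checkpoint's actual settings.

```python
import re

LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def classify_parameter(name):
    """Return (layer_index, component) for a decoder-layer parameter, else None."""
    match = LAYER_PATTERN.search(name)
    if match is None:
        return None  # embeddings, final norm, lm_head: outside the decoder layers
    layer = int(match.group(1))
    if "norm" in name:
        # Checked first so attention q_norm / k_norm count as norm (an assumption).
        component = "norm"
    elif ".self_attn." in name:
        component = "attention"
    elif ".mlp.experts." in name:
        component = "expert"
    elif ".mlp.gate." in name:
        component = "router"
    else:
        return None
    return layer, component


def effective_strength(name, coefficients, global_strength):
    """Global strength times the (layer, component) coefficient; 0.0 if ungated."""
    key = classify_parameter(name)
    if key is None:
        return 0.0
    return global_strength * coefficients.get(key, 0.0)


# Hypothetical coefficient table for illustration.
example_coefficients = {(12, "expert"): 0.8, (12, "attention"): 0.5}
strength = effective_strength(
    "model.layers.12.mlp.experts.5.gate_proj.weight", example_coefficients, 0.25
)
```

Parameters that fall outside the table get zero strength here, which leaves them at their Instruct values; whether CRANE treats unlisted parameters this way is an assumption of the sketch.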
Architecture
The merge preserves the standard Qwen3-30B-A3B (MoE) topology unchanged:
| Property | Value |
|---|---|
| model_type | qwen3_moe |
| Architecture class | Qwen3MoeForCausalLM |
| Total params | ~30B |
| Active params | ~3B |
| hidden_size | 2048 |
| num_hidden_layers | 48 |
| num_experts | 128 |
| num_experts_per_tok | 8 |
| num_attention_heads | 32 |
| num_key_value_heads | 4 |
| head_dim | 128 |
| moe_intermediate_size | 768 |
| max_position_embeddings | 262144 |
| vocab_size | 151936 |
| dtype | bfloat16 |
| rope_theta | 10000000 |
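The total and active parameter figures can be cross-checked from the other rows of the table. The estimate below is a rough sketch that assumes gated expert MLPs with three projection matrices each, untied input and output embeddings, no bias terms, and negligible norm parameters; it also counts both embedding matrices as active.

```python
config = {
    "hidden_size": 2048,
    "num_hidden_layers": 48,
    "num_experts": 128,
    "num_experts_per_tok": 8,
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "head_dim": 128,
    "moe_intermediate_size": 768,
    "vocab_size": 151936,
}


def estimate_parameters(cfg):
    """Return (total, active_per_token) parameter estimates for the MoE decoder."""
    hidden = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    # Attention: q and o projections plus the smaller grouped k and v projections.
    attention = 2 * hidden * q_dim + 2 * hidden * kv_dim
    # One expert: gate, up, and down projections (assumed three-matrix MLP).
    expert = 3 * hidden * cfg["moe_intermediate_size"]
    # Router: one logit per expert.
    router = hidden * cfg["num_experts"]
    # Input embedding plus output head (assumed untied).
    embeddings = 2 * cfg["vocab_size"] * hidden

    layers = cfg["num_hidden_layers"]
    total = layers * (attention + cfg["num_experts"] * expert + router) + embeddings
    active = layers * (attention + cfg["num_experts_per_tok"] * expert + router) + embeddings
    return total, active


total_params, active_params = estimate_parameters(config)
print(f"total ~ {total_params / 1e9:.2f}B, active ~ {active_params / 1e9:.2f}B")
```

Under these assumptions the estimate comes to about 30.5B total and about 3.35B active parameters, in line with the ~30B and ~3B rows above.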
A config_1m.json is also included for the extended long-context variant. It keeps the same rope_scaling (null) and max_position_embeddings (262144) but adds a dual_chunk_attention_config (dual chunk attention with original_max_position_embeddings = 131072, plus sparse-attention settings) for longer-context inference.
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zmzfpc/crane-30b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."},
]

# return_dict=True yields input_ids plus attention_mask for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```
Requires a recent transformers release with Qwen3-MoE support (transformers >= 4.51).
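The version requirement can be checked before loading the model. This is a minimal sketch using only the standard library; the helper names are hypothetical, and pre-release suffixes such as `rc1` or `.dev0` are ignored rather than ordered properly.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Leading numeric components of a version string, e.g. '4.51.0.dev0' -> (4, 51, 0)."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="4.51"):
    """True when the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def transformers_is_recent_enough(minimum="4.51"):
    """Check the installed transformers package; False when it is not installed."""
    try:
        return meets_minimum(version("transformers"), minimum)
    except PackageNotFoundError:
        return False
```

Calling `transformers_is_recent_enough()` before `from_pretrained` gives a clearer failure than an unknown-architecture error from an older install.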
Citation / attribution
If you use this model or the CRANE method, please cite:
```bibtex
@misc{zhu2026crane,
  title         = {CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing},
  author        = {Zhu, Mingzhi and Merler, Michele and Pavuluri, Raju and Patterson, Stacy},
  year          = {2026},
  eprint        = {2605.14084},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2605.14084}
}
```
Base models — built from two Apache-2.0 checkpoints:
- Qwen/Qwen3-30B-A3B-Instruct-2507 (base / backbone)
- Qwen/Qwen3-30B-A3B-Thinking-2507 (reasoning donor)
License: Apache-2.0 (consistent with both base models and the CRANE code).