Jan-nano: LiteRT-LM (blockwise int4)
Menlo/Jan-nano converted to the LiteRT-LM
(.litertlm) format for on-device inference with Google's
LiteRT-LM runtime (the engine behind the
official litert-community/* models).
Jan-nano is a 4B deep-research agent fine-tuned from Qwen3-4B (Qwen3ForCausalLM)
with a multi-stage RLVR recipe, optimized for tool use via the Model Context Protocol (MCP).
It is a reasoning model (it emits a <think>…</think> chain before its answer), and because it
is a standard Qwen3 architecture it uses the existing Qwen3 converter and runtime directly.
| Property | Value |
|---|---|
| Files | model.litertlm: int4 block 128 (recommended, on-device) · model_block32.litertlm: int4 block 32 (finer-grain, desktop/Android) |
| Quantization | int4 weights (symmetric) + OCTAV optimal-clipping; embeddings INT8 (externalized section) |
| Compute | integer |
| Context (KV cache) | 4096 tokens |
| Base model | Menlo/Jan-nano (Qwen3-4B) |
⚠️ It's a reasoning model: give it room to think
Jan-nano generates a <think>…</think> reasoning chain, then the answer. Run it with
max_tokens ≥ 2048; at a shorter limit it can get cut off mid-thought before it reaches the
answer. (All quality numbers below were measured at 2048.)
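One way to notice that a response was cut off mid-thought is to check whether the reasoning block was ever closed. The helper below is a minimal sketch, not part of LiteRT-LM: the name `reasoning_was_truncated` is hypothetical, and it assumes the raw output text contains the literal `<think>` and `</think>` tags.

```python
def reasoning_was_truncated(output: str) -> bool:
    """Return True if a <think> block was opened but never closed."""
    opened = output.rfind("<think>")
    if opened == -1:
        return False  # no reasoning block at all
    # Truncated if no closing tag follows the last opening tag.
    return output.find("</think>", opened) == -1


complete = "<think>2 + 2 = 4</think>The answer is 4."
cut_off = "<think>First, add the two numbers, then"
print(reasoning_was_truncated(complete))  # False
print(reasoning_was_truncated(cut_off))   # True
```

If this returns True for a response, raising the token limit and re-running is the likely fix.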
Which file?
| File | int4 granularity | GSM8K (max_tokens 2048) | iPhone 17 Pro | Mac (M-series, GPU) |
|---|---|---|---|---|
| model.litertlm | block 128 | 88.0% | ~14 tok/s, loads | ~67 tok/s |
| model_block32.litertlm | block 32 | 85.0% | 2.11 GiB section, near the iOS memory ceiling, may not load | ~67 tok/s |
Use model.litertlm (block 128). For a reasoning model that emits long <think> chains,
faster decode matters, and block 128 (¼ the scales, so lighter GPU dequantization) is ~40% faster
while scoring no worse than block 32 on GSM8K here (88.0% vs 85.0%). It is also the build that
loads reliably on iPhone (the block-32 build's larger section sits at the device memory edge).
Block 32 is provided for desktop/Android, where the extra granularity is free.
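To make the "¼ the scales" point concrete, here is a minimal pure-Python sketch of symmetric blockwise int4 quantization. It is an illustration only, not the litert-torch implementation: it uses plain absmax scaling (the OCTAV clipping step is not implemented), the function names are hypothetical, and the weights are synthetic.

```python
def quantize_blockwise_int4(weights, block_size):
    """Symmetric int4 quantization with one scale per block of weights.

    Plain absmax scaling is used; OCTAV optimal clipping is NOT implemented.
    """
    scales, codes = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / 7 if absmax > 0 else 1.0  # largest magnitude maps to 7
        scales.append(scale)
        # Round to the nearest integer code and clamp to the int4 range.
        codes.extend(max(-8, min(7, round(w / scale))) for w in block)
    return codes, scales


def dequantize(codes, scales, block_size):
    """Rebuild approximate weights from codes and per-block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(codes)]


# Synthetic weights in [-1, 1]; real tensors are far larger.
weights = [((i * 37) % 101 - 50) / 50 for i in range(1024)]
for block_size in (128, 32):
    codes, scales = quantize_blockwise_int4(weights, block_size)
    restored = dequantize(codes, scales, block_size)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    # Prints block size, number of scales, and worst-case reconstruction error.
    print(block_size, len(scales), max_err)
```

For 1024 weights, block 128 stores 8 scales and block 32 stores 32, so the smaller block carries four times as many scales to store and apply during dequantization.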
Quality: GSM8K parity
Measured on GSM8K (n=100, greedy, 0-shot chain-of-thought, max_tokens 2048, identical prompt and answer-extraction for every row).
| Configuration | GSM8K |
|---|---|
| bf16 (reference) | 92.0% |
| LiteRT int4, block 128 | 88.0% (−4 pt) |
| LiteRT int4, block 32 | 85.0% (−7 pt) |
int4 is at parity (−4 pt for the recommended block-128 build). Note: evaluating a reasoning
model at a short token budget badly understates int4. At max_tokens 1024 the same block-32
build scored only 63%, purely because the longer int4 reasoning chains were truncated before the
answer; at 2048 it recovers to 85%. Always benchmark reasoning models with enough headroom.
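For a sense of how much sampling noise n=100 carries, the sketch below computes a binomial standard error for each score in the table. This is a generic back-of-the-envelope calculation added for illustration, not something the evaluation reported; it assumes the items are independent and uses a normal approximation for the interval.

```python
import math


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent items."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


n = 100
scores = [("bf16", 0.92), ("int4 block 128", 0.88), ("int4 block 32", 0.85)]
for name, acc in scores:
    half_width = 1.96 * standard_error(acc, n)  # 95% normal-approximation interval
    print(f"{name}: {acc:.1%} +/- {half_width:.1%}")
```

The half-widths come out at roughly 5 to 7 points, so the 4-point gap of the block-128 build sits within the noise of a 100-question sample.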
Usage
# build litert-lm from https://github.com/google-ai-edge/litert-lm, then:
litert_lm_main \
--model_path model.litertlm \
--backend gpu \
--input_prompt "Plan how to find where HTTP retries are configured in a Python repo."
The .litertlm bundle carries the tokenizer and prompt template (Qwen3 ChatML:
<|im_start|>role\n…<|im_end|>, stop token <|im_end|>), so no separate tokenizer files are
needed. The model will produce a <think>…</think> block followed by its answer.
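The bundle applies the template for you, so the following is only a sketch of what the ChatML layout looks like and how the reasoning block might be separated from the answer downstream. Both function names are hypothetical, and the newline placed after each `<|im_end|>` is an assumption about the exact whitespace.

```python
def build_chatml_prompt(messages):
    """Render (role, content) pairs as ChatML and open the assistant turn."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def split_reasoning(output: str):
    """Split raw output into (reasoning, answer); reasoning is '' if no block closes."""
    head, sep, tail = output.partition("</think>")
    if not sep:
        return "", output.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


prompt = build_chatml_prompt([("user", "Where are HTTP retries configured?")])
reasoning, answer = split_reasoning("<think>Search for 'Retry'.</think>Check the session setup.")
print(repr(prompt))
print(reasoning)
print(answer)
```

Keeping the reasoning separate lets an application show or log it independently of the final answer.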
Run on Android
The official Google AI Edge Gallery app runs
.litertlm models on-device:
- Install a recent Gallery (package com.google.ai.edge.gallery; 1.0.15+ supports .litertlm).
- Download model.litertlm and push it: adb push model.litertlm /sdcard/Download/
- In the app tap +, pick the file, choose the GPU backend, and raise the max-tokens setting.
- Chat: the bundle already carries the tokenizer and Qwen3 chat template.
A 4B int4 build needs ~2.5 GB free RAM; reboot the phone first if memory is tight.
Run on iPhone
Verified on iPhone 17 Pro (LiteRT-LM Swift runtime): model.litertlm (block 128, 1.94 GiB
section) loads and generates at ~14 tok/s. The block-32 build's section (2.11 GiB) sits at the
device memory ceiling and may fail to load, so prefer block 128 on iPhone.
Conversion
Converted with the official litert-torch converter. Jan-nano is a standard Qwen3ForCausalLM,
so it uses the existing Qwen3 path with no custom graph code. Recipe: blockwise int4 + OCTAV
(INT4 weights, block 128 or 32, symmetric, OCTAV optimal-clipping), embeddings INT8,
KV cache 4096.
from litert_torch.generative.export_hf.export import export

export(
    model="Menlo/Jan-nano",
    output_dir="out",
    quantization_recipe="qwen3_int4_block128_octav.json",  # blockwise-128 int4 + OCTAV, int8 embeddings
    cache_length=4096,
    externalize_embedder=True,
)
License
Apache-2.0, inherited from the base model Menlo/Jan-nano (itself fine-tuned from Qwen/Qwen3-4B, also Apache-2.0).