|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
# Isaac-0.2-2B by Perceptron |
|
|
|
|
|
Introducing the 2B-parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.
|
|
|
|
|
This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.
|
|
|
|
|
[Try it on our demo!](https://www.perceptron.inc/demo) · [API Docs](https://docs.perceptron.inc/) · [Discord](https://discord.gg/fgBeaACQzE)
|
|
|
|
|
## Extending the efficient frontier of perception |
|
|
|
|
|
Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices.
|
|
From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint. |
|
|
|
|
|
 |
|
|
|
|
|
## What's New in Isaac 0.2 |
|
|
|
|
|
* **Reasoning via Thinking Traces**: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks. |
|
|
|
|
|
* **Perceptive Tool Calling + Focus (Zoom & Crop)**: Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception.
|
|
|
|
|
* **Structured Outputs**: More reliable structured output generation for consistent JSON and predictable downstream integration. |
|
|
|
|
|
* **Complex OCR**: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
|
|
|
|
|
* **Desktop Use**: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases. |
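The structured-output improvements above pair naturally with schema-in-prompt requests. Below is a minimal, model-agnostic sketch of that pattern: the schema is embedded in the user message, and the reply is validated with `json.loads` before downstream use. The schema fields and the mocked reply are illustrative assumptions, not part of Isaac's API.

```python
import json

# Hypothetical schema purely for illustration; choose fields for your task.
schema = {"crosswalk_visible": "bool", "traffic_light_state": "str"}

document = [
    {
        "type": "text",
        "role": "user",
        "content": "Describe the scene as JSON matching this schema: "
        + json.dumps(schema),
    },
]

# Mocked model reply; in practice this comes from model.generate(...).
reply = '{"crosswalk_visible": true, "traffic_light_state": "red"}'
parsed = json.loads(reply)

# Validate the keys before handing the result to downstream code.
assert set(parsed) == set(schema)
print(parsed["traffic_light_state"])
```

Validating replies this way keeps integrations predictable even when a generation occasionally drifts from the requested schema.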
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Performance Benchmarks |
|
|
|
|
|
 |
|
|
|
|
|
## Chatting with Isaac in 🤗 Transformers
|
|
Learn more at our [Hugging Face example repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface), where we demo extracting and rendering points.
|
|
|
|
|
```bash |
|
|
pip install perceptron |
|
|
``` |
|
|
|
|
|
### Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForCausalLM, AutoProcessor |
|
|
from transformers.image_utils import load_image |
|
|
from transformers.utils.import_utils import is_torch_cuda_available |
|
|
|
|
|
def document_to_messages(document: list[dict]): |
|
|
messages, images = [], [] |
|
|
for item in document: |
|
|
if not (content := item.get("content")): |
|
|
continue |
|
|
role = item.get("role", "user") |
|
|
if item.get("type") == "image": |
|
|
images.append(load_image(content)) |
|
|
messages.append({"role": role, "content": "<image>"}) |
|
|
elif item.get("type") == "text": |
|
|
messages.append({"role": role, "content": content}) |
|
|
return messages, images |
|
|
|
|
|
hf_path = "PerceptronAI/Isaac-0.2-2B-Preview" |
|
|
device, dtype = ("cuda", torch.bfloat16) if is_torch_cuda_available() else ("cpu", torch.float32)
|
|
|
|
|
# Load model/processor from the checkpoint |
|
|
processor = AutoProcessor.from_pretrained(hf_path, trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
hf_path, trust_remote_code=True, vision_attn_implementation="flash_attention_2" |
|
|
) |
|
|
model = model.to(device=device, dtype=dtype) |
|
|
model.eval() |
|
|
|
|
|
# Prepare input for generation |
|
|
document = [ |
|
|
{ |
|
|
"type": "text", |
|
|
"content": "<hint>BOX</hint>", |
|
|
"role": "user", |
|
|
}, |
|
|
{ |
|
|
"type": "image", |
|
|
"content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp", |
|
|
"role": "user", |
|
|
}, |
|
|
{ |
|
|
"type": "text", |
|
|
"content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.", |
|
|
"role": "user", |
|
|
}, |
|
|
] |
|
|
messages, images = document_to_messages(document) |
|
|
text = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
inputs = processor(text=text, images=images, return_tensors="pt") |
|
|
|
|
|
# Generate text using the model |
|
|
generated_ids = model.generate( |
|
|
tensor_stream=inputs["tensor_stream"].to(next(model.parameters()).device), |
|
|
max_new_tokens=256, |
|
|
do_sample=False, |
|
|
) |
|
|
generated_text = processor.tokenizer.decode( |
|
|
generated_ids[0], skip_special_tokens=False |
|
|
) |
|
|
print(f"\nFull generated output:\n{generated_text}") |
|
|
``` |
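With `skip_special_tokens=False`, the decoded output retains the model's grounding markup. The exact tag syntax Isaac emits is model-specific (see the example repo for the real parsing and rendering code); the sketch below assumes a hypothetical `<box>x1,y1,x2,y2</box>` format purely to illustrate post-processing the decoded string.

```python
import re

# Assumes a hypothetical <box>x1,y1,x2,y2</box> tag format for illustration;
# consult the example repo for the actual grounding syntax.
def extract_boxes(text: str) -> list[tuple[int, int, int, int]]:
    pattern = r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
    return [tuple(map(int, m)) for m in re.findall(pattern, text)]

sample = "Pedestrian signal at <box>120,40,180,110</box>; no moving cars."
print(extract_boxes(sample))  # [(120, 40, 180, 110)]
```

Extracted coordinates can then be drawn over the input image to verify the model's grounding visually.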