> This is the 300M parameter variant of Falcon Perception. It supports detection only (bounding boxes). For the full model with segmentation masks, see `tiiuae/Falcon-Perception`.
# Falcon Perception 300M
Falcon Perception 300M is a 0.3B parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural language query, it returns zero, one, or many matching instances with accurate bounding boxes.
The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally, conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens, `<|coord|>` then `<|size|>`, producing a center point and bounding box size in normalized coordinates.
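The hybrid attention mask described above can be sketched in a few lines. This is an illustrative reconstruction, not the model's actual implementation: the token layout (image tokens first, then text/task tokens) and the choice to keep image tokens from attending to text are assumptions for the sketch.

```python
def hybrid_attention_mask(n_image: int, n_text: int) -> list[list[bool]]:
    """Build a boolean attention mask (True = query may attend to key).

    Layout assumption: positions [0, n_image) are image tokens, followed
    by n_text text/task tokens.
    """
    n = n_image + n_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_image:
                # Image tokens attend bidirectionally, among image tokens only.
                mask[q][k] = k < n_image
            else:
                # Text/task tokens see the full image plus earlier text (causal).
                mask[q][k] = k < n_image or k <= q
    return mask

mask = hybrid_attention_mask(n_image=4, n_text=3)
assert mask[0][3]      # image token sees a *later* image token (bidirectional)
assert not mask[0][4]  # image token does not attend to text
assert mask[5][2]      # text token conditions on the image
assert not mask[4][5]  # text decoding stays causal
```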
## Links

- Full model (with segmentation): tiiuae/Falcon-Perception
- Code and inference engine: github.com/tiiuae/Falcon-Perception
- Tech report: arXiv link coming soon
- PBench dataset: tiiuae/PBench
- OCR model: tiiuae/Falcon-OCR
## Quickstart

### Installation

```shell
pip install "torch>=2.5" transformers pillow einops
```

This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
### Run open-vocabulary detection

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-Perception-300M",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg")
preds = model.generate(image, "cat")[0]

for p in preds:
    print(p["xy"], p["hw"])
```
Each prediction is a dict with normalized bounding box coordinates:

```python
{
    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```
### Visualize detections

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
W, H = image.size
for p in preds:
    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
    x0, y0 = cx - bw / 2, cy - bh / 2
    x1, y1 = cx + bw / 2, cy + bh / 2
    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)
image.save("output.jpg")
```
## API

### `model.generate(images, queries, **kwargs)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `images` | `PIL.Image` or `list` | required | Single image or list of images |
| `queries` | `str` or `list[str]` | required | Query string(s), one per image |
| `task` | `str` | `"detection"` | Task type. Only `"detection"` is supported by this model. |
| `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
| `min_dimension` | `int` | `256` | Minimum image side after resize |
| `max_dimension` | `int` | `1024` | Maximum image side after resize |
| `compile` | `bool` | `True` | Run `torch.compile` on first call |
Returns: `list[list[dict]]`, one list per image. Each detection dict contains:

```python
{
    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```
Requesting `task="segmentation"` on this model will raise a `ValueError`. Use the full `tiiuae/Falcon-Perception` model for segmentation masks.
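For downstream pipelines, the normalized center/size dicts are usually converted to pixel-space corner boxes. A minimal sketch of that conversion over a batched result — the `to_xyxy` helper name and the sample predictions below are illustrative, not part of the API:

```python
def to_xyxy(pred: dict, width: int, height: int) -> tuple[float, float, float, float]:
    """Convert one normalized {"xy", "hw"} detection to pixel (x0, y0, x1, y1)."""
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    bw, bh = pred["hw"]["w"] * width, pred["hw"]["h"] * height
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2

# generate() returns one list of detections per input image; a query may
# legitimately match zero instances, giving an empty list.
batch_preds = [
    [{"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.5, "w": 0.25}}],  # image 0: one hit
    [],                                                            # image 1: no match
]
for image_preds in batch_preds:
    for p in image_preds:
        print(to_xyxy(p, width=640, height=480))  # -> (240.0, 120.0, 400.0, 360.0)
```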
## What the model is for
Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:
- Natural language driven object selection in images
- Lightweight bounding-box detection for downstream pipelines
- Crowded scenes where the number of instances is large and variable
- Edge or resource-constrained deployments where the full model is too large
It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
## Model details (high level)
The architecture follows a single-stack early-fusion recipe:
- One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
- Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` per instance
- Specialized heads for coordinates and size, with geometry conditioning via Fourier features
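The geometry conditioning can be pictured with the standard sin/cos Fourier encoding of a normalized scalar. This is an illustrative sketch only: the power-of-two frequency schedule and the number of frequencies below are assumptions, not the model's actual mapping (see the tech report for that).

```python
import math

def fourier_features(value: float, n_freqs: int = 4) -> list[float]:
    """Encode a normalized scalar (0..1) as [sin(2^k * pi * v), cos(2^k * pi * v), ...].

    Multi-frequency features give the coordinate/size heads a smooth,
    multi-scale representation of a position instead of a raw float.
    """
    feats = []
    for k in range(n_freqs):
        freq = (2 ** k) * math.pi
        feats.append(math.sin(freq * value))
        feats.append(math.cos(freq * value))
    return feats

# A box center (x=0.5, y=0.25) becomes a 16-dimensional conditioning vector.
vec = fourier_features(0.5) + fourier_features(0.25)
assert len(vec) == 16
```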
## Comparison with the full model

| | Falcon-Perception | Falcon-Perception-300M |
|---|---|---|
| Parameters | ~7B | ~0.3B |
| Tasks | Detection + Segmentation | Detection only |
| Output | Bounding boxes + pixel masks | Bounding boxes |
| Token sequence | <\|coord\|> <\|size\|> <\|seg\|> | <\|coord\|> <\|size\|> |
## Limitations
- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like detection models.
- OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
- Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
- This variant does not produce segmentation masks. Use the full model if pixel-level masks are needed.
## Citation

If you use Falcon Perception, please cite:
```bibtex
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
```