> This is the 300M parameter variant of Falcon Perception. It supports detection only (bounding boxes). For the full model with segmentation masks, see `tiiuae/Falcon-Perception`.
# Falcon Perception 300M
Falcon Perception 300M is a 0.3B parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural language query, it returns zero, one, or many matching instances with accurate bounding boxes.
The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally, conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens, `<|coord|>` then `<|size|>`, producing a center point and bounding box size in normalized coordinates.
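The hybrid attention mask described above can be sketched in a few lines. This is an illustrative reconstruction, not the model's actual implementation: the token layout (image tokens first, then text/task tokens) and the choice to keep image tokens from attending to text are assumptions for the sketch.

```python
def hybrid_attention_mask(n_image: int, n_text: int) -> list[list[bool]]:
    """Build a boolean attention mask (True = query may attend to key).

    Layout assumption: positions [0, n_image) are image tokens, followed
    by n_text text/task tokens.
    """
    n = n_image + n_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_image:
                # Image tokens attend bidirectionally, among image tokens only.
                mask[q][k] = k < n_image
            else:
                # Text/task tokens see the full image plus earlier text (causal).
                mask[q][k] = k < n_image or k <= q
    return mask

mask = hybrid_attention_mask(n_image=4, n_text=3)
assert mask[0][3]      # image token sees a *later* image token (bidirectional)
assert not mask[0][4]  # image token does not attend to text
assert mask[5][2]      # text token conditions on the image
assert not mask[4][5]  # text decoding stays causal
```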
## Links

- Full model (with segmentation): tiiuae/Falcon-Perception
- Code and inference engine: github.com/tiiuae/Falcon-Perception
- Tech report: arXiv link coming soon
- PBench dataset: tiiuae/PBench
- OCR model: tiiuae/Falcon-OCR
## Quickstart

### Installation

```shell
pip install "torch>=2.5" transformers pillow einops
```

This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
### Run open-vocabulary detection

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-Perception-300M",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg")
preds = model.generate(image, "cat")[0]

for p in preds:
    print(p["xy"], p["hw"])
```
Each prediction is a dict with normalized bounding box coordinates:

```python
{
    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```
### Visualize detections

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
W, H = image.size
for p in preds:
    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
    x0, y0 = cx - bw / 2, cy - bh / 2
    x1, y1 = cx + bw / 2, cy + bh / 2
    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)
image.save("output.jpg")
```
## API

### `model.generate(images, queries, **kwargs)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `images` | `PIL.Image` or `list` | required | Single image or list of images |
| `queries` | `str` or `list[str]` | required | Query string(s), one per image |
| `task` | `str` | `"detection"` | Task type. Only `"detection"` is supported by this model. |
| `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
| `min_dimension` | `int` | `256` | Minimum image side after resize |
| `max_dimension` | `int` | `1024` | Maximum image side after resize |
| `compile` | `bool` | `True` | Run `torch.compile` on first call |
Returns: `list[list[dict]]`, one list per image. Each detection dict contains:

```python
{
    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```
Requesting `task="segmentation"` on this model will raise a `ValueError`. Use the full `tiiuae/Falcon-Perception` model for segmentation masks.
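For downstream pipelines, the normalized center/size dicts are usually converted to pixel-space corner boxes. A minimal sketch of that conversion over a batched result — the `to_xyxy` helper name and the sample predictions below are illustrative, not part of the API:

```python
def to_xyxy(pred: dict, width: int, height: int) -> tuple[float, float, float, float]:
    """Convert one normalized {"xy", "hw"} detection to pixel (x0, y0, x1, y1)."""
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    bw, bh = pred["hw"]["w"] * width, pred["hw"]["h"] * height
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2

# generate() returns one list of detections per input image; a query may
# legitimately match zero instances, giving an empty list.
batch_preds = [
    [{"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.5, "w": 0.25}}],  # image 0: one hit
    [],                                                            # image 1: no match
]
for image_preds in batch_preds:
    for p in image_preds:
        print(to_xyxy(p, width=640, height=480))  # -> (240.0, 120.0, 400.0, 360.0)
```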
## What the model is for
Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:
- Natural language driven object selection in images
- Lightweight bounding-box detection for downstream pipelines
- Crowded scenes where the number of instances is large and variable
- Edge or resource-constrained deployments where the full model is too large
It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
## Model details (high level)
The architecture follows a single-stack early-fusion recipe:
- One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
- Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` per instance
- Specialized heads for coordinates and size, with geometry conditioning via Fourier features
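The geometry conditioning can be pictured with the standard sin/cos Fourier encoding of a normalized scalar. This is an illustrative sketch only: the power-of-two frequency schedule and the number of frequencies below are assumptions, not the model's actual mapping (see the tech report for that).

```python
import math

def fourier_features(value: float, n_freqs: int = 4) -> list[float]:
    """Encode a normalized scalar (0..1) as [sin(2^k * pi * v), cos(2^k * pi * v), ...].

    Multi-frequency features give the coordinate/size heads a smooth,
    multi-scale representation of a position instead of a raw float.
    """
    feats = []
    for k in range(n_freqs):
        freq = (2 ** k) * math.pi
        feats.append(math.sin(freq * value))
        feats.append(math.cos(freq * value))
    return feats

# A box center (x=0.5, y=0.25) becomes a 16-dimensional conditioning vector.
vec = fourier_features(0.5) + fourier_features(0.25)
assert len(vec) == 16
```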
## Comparison with the full model

| | Falcon-Perception | Falcon-Perception-300M |
|---|---|---|
| Parameters | ~7B | ~0.3B |
| Tasks | Detection + Segmentation | Detection only |
| Output | Bounding boxes + pixel masks | Bounding boxes |
| Token sequence | <\|coord\|> <\|size\|> <\|seg\|> | <\|coord\|> <\|size\|> |
## Limitations
- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like detection models.
- OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
- Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
- This variant does not produce segmentation masks. Use the full model if pixel-level masks are needed.
## Citation

If you use Falcon Perception, please cite:
```bibtex
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
```