Falcon Perception

This is the 300M parameter variant of Falcon Perception. It supports detection only (bounding boxes). For the full model with segmentation masks, see tiiuae/Falcon-Perception.

Falcon Perception 300M

Falcon Perception 300M is a 0.3B parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural language query, it returns zero, one, or many matching instances with accurate bounding boxes.

The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens: <|coord|> then <|size|>, producing a center point and bounding box size in normalized coordinates.
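The structured output can be pictured as a flat stream of task tokens and values. A minimal parsing sketch for intuition — the token names `<|coord|>` and `<|size|>` come from the description above, but the exact stream layout shown here is an assumption, not the model's internal format:

```python
# Sketch: group (<|coord|>, x, y) and (<|size|>, h, w) runs into instances.
# The stream layout is illustrative; the real decoder emits these tokens
# through specialized heads rather than as plain Python values.

def parse_instances(stream):
    """Parse a flat token stream into a list of detection dicts."""
    instances, i = [], 0
    while i < len(stream):
        assert stream[i] == "<|coord|>" and stream[i + 3] == "<|size|>"
        x, y = stream[i + 1], stream[i + 2]
        h, w = stream[i + 4], stream[i + 5]
        instances.append({"xy": {"x": x, "y": y}, "hw": {"h": h, "w": w}})
        i += 6
    return instances

stream = ["<|coord|>", 0.42, 0.55, "<|size|>", 0.20, 0.15]
print(parse_instances(stream))
```

Because each instance is a fixed `<|coord|>`/`<|size|>` pair, the model can emit zero, one, or many instances simply by continuing or stopping the sequence.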

Quickstart

Installation

pip install "torch>=2.5" transformers pillow einops

This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because torch.compile may build optimized kernels.

Run open-vocabulary detection

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-Perception-300M",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg").convert("RGB")  # ensure 3-channel input
preds = model.generate(image, "cat")[0]

for p in preds:
    print(p["xy"], p["hw"])

Each prediction is a dict with normalized bounding box coordinates:

{
  "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
  "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
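Downstream pipelines usually want absolute corner coordinates rather than normalized center/size. A small helper for that conversion — `to_pixel_box` is a hypothetical utility, not part of the model's API:

```python
def to_pixel_box(pred, width, height):
    """Convert a normalized center/size prediction dict into absolute
    (x0, y0, x1, y1) pixel corners, clamped to the image bounds."""
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    bw, bh = pred["hw"]["w"] * width, pred["hw"]["h"] * height
    x0 = max(cx - bw / 2, 0)
    y0 = max(cy - bh / 2, 0)
    x1 = min(cx + bw / 2, width)
    y1 = min(cy + bh / 2, height)
    return x0, y0, x1, y1

pred = {"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.5, "w": 0.5}}
print(to_pixel_box(pred, 100, 200))  # (25.0, 50.0, 75.0, 150.0)
```

Clamping matters for instances near the image border, where the predicted box can extend past the frame.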

Visualize detections

from PIL import ImageDraw

draw = ImageDraw.Draw(image)
W, H = image.size

for p in preds:
    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
    x0, y0 = cx - bw / 2, cy - bh / 2
    x1, y1 = cx + bw / 2, cy + bh / 2
    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)

image.save("output.jpg")

API

model.generate(images, queries, **kwargs)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| images | PIL.Image or list | required | Single image or list of images |
| queries | str or list[str] | required | Query string(s), one per image |
| task | str | "detection" | Task type. Only "detection" is supported by this model. |
| max_new_tokens | int | 2048 | Maximum decoding steps |
| min_dimension | int | 256 | Minimum image side after resize |
| max_dimension | int | 1024 | Maximum image side after resize |
| compile | bool | True | Run torch.compile on first call |

Returns: list[list[dict]], one list per image.

Each detection dict contains:

{
  "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
  "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
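For batched calls, the outer list aligns one-to-one with the input images, and an image with no matches yields an empty inner list. A quick illustration of unpacking that shape — the prediction values here are mock data, not real model output:

```python
# Mock result shaped like generate()'s return value: list[list[dict]],
# one inner list per input image (values fabricated for illustration).
batch_preds = [
    [{"xy": {"x": 0.3, "y": 0.4}, "hw": {"h": 0.2, "w": 0.1}}],  # image 0: 1 match
    [],                                                          # image 1: no matches
]

counts = [len(preds) for preds in batch_preds]
print(counts)  # [1, 0]
```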

Requesting task="segmentation" on this model will raise a ValueError. Use the full tiiuae/Falcon-Perception model for segmentation masks.

What the model is for

Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:

  • Natural language driven object selection in images
  • Lightweight bounding-box detection for downstream pipelines
  • Crowded scenes where the number of instances is large and variable
  • Edge or resource-constrained deployments where the full model is too large

It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.

Model details (high level)

The architecture follows a single-stack early-fusion recipe:

  • One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
  • Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
  • Chain-of-Perception decoding: <|coord|> then <|size|> per instance
  • Specialized heads for coordinates and size, with geometry conditioning via Fourier features
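The card does not spell out the Fourier feature formulation used for geometry conditioning. For intuition, a standard sinusoidal feature map over a normalized coordinate — a common recipe, not necessarily the one used here — looks like:

```python
import math

def fourier_features(x, num_bands=4):
    """Map a normalized scalar coordinate x in [0, 1] to sinusoidal
    features at geometrically spaced frequencies. Standard recipe;
    the model's exact formulation may differ."""
    feats = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        feats.append(math.sin(freq * x))
        feats.append(math.cos(freq * x))
    return feats

print(len(fourier_features(0.5)))  # 8
```

Features like these give the coordinate and size heads a smooth, multi-scale representation of position, which is easier to regress against than a raw scalar.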

Comparison with the full model

| | Falcon-Perception | Falcon-Perception-300M |
| --- | --- | --- |
| Parameters | ~7B | ~0.3B |
| Tasks | Detection + Segmentation | Detection only |
| Output | Bounding boxes + pixel masks | Bounding boxes |
| Token sequence | `<\|coord\|> <\|size\|> <\|seg\|>` | `<\|coord\|> <\|size\|>` |

Limitations

  • Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like detection models.
  • OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
  • Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
  • This variant does not produce segmentation masks. Use the full model if pixel-level masks are needed.

Citation

If you use Falcon Perception, please cite:

@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}