LocateAnything

NVIDIA's advanced 3B vision-language model. Locate any object, UI target, or text in images and videos using natural-language queries.

Note: inputs larger than 1K are auto-resized in this Space demo. For full-resolution inference, download the weights and run locally.
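To make the auto-resize note concrete, here is a minimal sketch of how a resize cap could be applied. It assumes "1K" means 1024 px on the longer side and that aspect ratio is preserved. The demo does not document either detail, and `capped_size` is a hypothetical helper, not part of the Space.

```python
def capped_size(width: int, height: int, cap: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most `cap`.

    Assumption: the cap applies to the longer side and the aspect ratio is
    kept. Images already within the cap are returned unchanged.
    """
    if width <= 0 or height <= 0 or cap <= 0:
        raise ValueError("width, height, and cap must be positive")
    longest = max(width, height)
    if longest <= cap:
        return width, height
    scale = cap / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(capped_size(2048, 1024))  # (1024, 512)
print(capped_size(800, 600))    # (800, 600), already within the cap
```

If small targets matter, checking the capped size first shows how much detail the demo would discard compared with running the weights locally at full resolution.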

Media Type

Task Type

⚙️ Advanced parameters

Inference Mode

Resize Cap (px)

Temperature: 0.7

Top P: 0.9

Top K: 20

Max Video Frames: 4
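The Temperature, Top P, and Top K sliders control how tokens are sampled during decoding. The sketch below shows one common way these three settings combine: temperature scaling, then top-k truncation, then nucleus (top-p) truncation. The order of operations and the helper names `filter_distribution` and `sample_token` are assumptions for illustration; the demo's actual decoding code is not shown on this page.

```python
import math
import random


def filter_distribution(logits, temperature=0.7, top_k=20, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    renorm = {i: probs[i] / mass for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += renorm[i]
        if cumulative >= top_p:
            break
    mass = sum(renorm[i] for i in kept)
    return {i: renorm[i] / mass for i in kept}


def sample_token(logits, rng=None, **kwargs):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]
```

Lower temperature or smaller Top P / Top K values narrow the set of candidate tokens, which tends to make box coordinates and labels more repeatable across runs.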

📖 How to Use

1. Upload an Image or Video, or pick a Quick Sandbox example below.
2. Choose a Task Type: Detection · Grounding · OCR · GUI · Pointing.
3. Enter Categories in the search bar (comma-separated, e.g. car, person).
4. Optionally tune Advanced parameters above (mode, resize, temperature, etc.).
5. Click Run Inference or press Enter in the search bar.
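The Categories field takes a comma-separated list of targets. Below is a minimal sketch of how such input could be split into clean target names. It assumes that Chinese input may use the fullwidth comma (，) alongside the ASCII comma, and that blank or duplicate entries are dropped. `parse_categories` is a hypothetical helper; the demo's own parsing is not documented here.

```python
import re


def parse_categories(text: str) -> list[str]:
    """Split a comma-separated query into unique, trimmed target names.

    Assumption: both the ASCII comma and the fullwidth comma act as
    separators. Order of first appearance is preserved.
    """
    seen = set()
    targets = []
    for part in re.split(r"[,，]", text):
        name = part.strip()
        if name and name not in seen:
            seen.add(name)
            targets.append(name)
    return targets


print(parse_categories("car, person,, car"))  # ['car', 'person']
print(parse_categories("汽车，行人"))           # ['汽车', '行人']
```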

Comma-separated targets · supports English & Chinese · press Enter to run


🖼️ Interactive Quick Sandbox

Book

Sushi

People

OCR


📊 Metrics Log

Status: Idle

Tokens/Frames: -

Detections: -

TPS / BPS: - / -

Time: -

🎯 Detected Target Overlays

Run inference to populate detected targets here — each result will pop in one by one.

Adjustable: Task Type · Categories · Inference Mode · Resize Cap · Temperature · Top P/K · Max Video Frames

Decoding Trace

Run inference to watch model tokens pop in here — ref labels, box coords, and stats shown in full without scrolling sideways.