AI Research from NVIDIA

LocateAnything

NVIDIA's advanced 3B-parameter vision-language model. Locate any object, UI target, or text in images and videos using natural-language prompts.

Note: inputs larger than 1K are auto-resized in this Space demo. For full-resolution inference, download the weights and run locally.
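
The auto-resize mentioned in the note can be pictured with a small sketch. It assumes that "1K" means 1024 pixels on the longer side and that the aspect ratio is preserved; neither detail is confirmed by this page, and `fit_within` is a hypothetical helper, not part of the demo or the model code.

```python
def fit_within(width: int, height: int, cap: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most `cap`.

    Assumption: the cap applies to the longer side and aspect ratio is kept.
    Images already within the cap are returned unchanged.
    """
    longest = max(width, height)
    if longest <= cap:
        return width, height
    scale = cap / longest
    # Round to whole pixels and never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(4000, 3000))  # a large photo gets scaled down
    print(fit_within(800, 600))    # a small image is left as-is
```

Running locally with the downloaded weights avoids this step entirely, which matters most for small text and dense GUI screenshots.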

⚙️ Advanced parameters
Temperature 0.7
Top P 0.9
Top K 20
Max Video Frames 4
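
For readers unfamiliar with the three sampling knobs, here is a minimal, generic sketch of how temperature, Top K, and Top P typically shape the next-token distribution. It uses the defaults listed above but is not the model's actual decoder; the function names are hypothetical, and real libraries may apply the filters in a different order.

```python
import math
import random


def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top K: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng=None, **kwargs):
    """Draw one token id from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Lower temperature and smaller Top K or Top P make outputs more deterministic, which is usually preferable for box coordinates; higher values add variety at the cost of stability.
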
📖 How to Use
  1. Upload an Image or Video, or pick a Quick Sandbox example below.
  2. Choose a Task Type: Detection · Grounding · OCR · GUI · Pointing.
  3. Enter Categories in the search bar (comma-separated, e.g. car, person).
  4. Optionally tune Advanced parameters above (mode, resize, temperature, etc.).
  5. Click Run Inference or press Enter in the search bar.

Comma-separated targets · supports English & Chinese · press Enter to run
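
As an illustration of the comma-separated input, the sketch below splits a query into a clean category list. Accepting the full-width comma (,) alongside the ASCII comma is an assumption made here because Chinese input is supported; how the demo actually tokenizes the search bar is not documented on this page, and `parse_categories` is a hypothetical helper.

```python
import re


def parse_categories(query: str) -> list[str]:
    """Split a query on ASCII or full-width commas into unique, trimmed names."""
    seen = set()
    categories = []
    for part in re.split(r"[,,]", query):
        name = part.strip()
        # Skip empty entries (e.g. from ",,") and repeated names.
        if name and name not in seen:
            seen.add(name)
            categories.append(name)
    return categories


if __name__ == "__main__":
    print(parse_categories("car, person"))
    print(parse_categories("汽车,行人, car,,car"))
```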

🖼️ Interactive Quick Sandbox
Book
Sushi
People
OCR
📊 Metrics Log
After a run, reports: Status · Tokens/Frames · Detections · TPS / BPS · Time
🎯 Detected Target Overlays

Run inference to populate detected targets here. Results appear one at a time.

Adjustable: Task Type · Categories · Inference Mode · Resize Cap · Temperature · Top P/K · Max Video Frames
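
Of the adjustable settings, Max Video Frames caps how many frames of a video are passed to the model. The page does not say how frames are chosen, so the sketch below assumes evenly spaced sampling from the middle of equal-length segments; `sample_frame_indices` is a hypothetical name used only for illustration.

```python
def sample_frame_indices(total_frames: int, max_frames: int = 4) -> list[int]:
    """Pick at most `max_frames` evenly spaced frame indices from a video.

    Assumption: uniform sampling, taking the middle frame of each segment.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: use every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]


if __name__ == "__main__":
    print(sample_frame_indices(100))     # 4 indices spread across 100 frames
    print(sample_frame_indices(3))       # fewer frames than the cap
```

Raising the cap gives the model more temporal coverage but increases the token count and run time reported in the metrics.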

Decoding Trace

Run inference to watch the model's tokens stream in here: ref labels, box coordinates, and stats are shown in full, with no horizontal scrolling.