AI Research from NVIDIA
LocateAnything
NVIDIA's advanced 3B vision-language model. Locate any object, UI target, or text in images and videos with natural language.
Note: inputs larger than 1K are auto-resized in this Space demo. For full-resolution inference, download the weights and run locally.
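As a rough illustration of the auto-resize note, the sketch below scales an image's dimensions down while preserving aspect ratio. It assumes "1K" means 1024 pixels on the longer side; the demo's actual limit, rounding rule, and resampling filter are not stated on this page, and `fit_within` is a hypothetical helper name.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side.

    max_side=1024 is an assumed reading of "1K", not a documented value.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # small inputs pass through unchanged
    # Multiply before dividing so exact ratios stay exact.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


if __name__ == "__main__":
    print(fit_within(4000, 3000))
    print(fit_within(800, 600))
```

Running locally at full resolution, as the note suggests, would skip a step like this entirely.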
⚙️ Advanced parameters
Temperature: 0.7
Top P: 0.9
Top K: 20
Max Video Frames: 4
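To make the three sampling parameters concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are commonly combined when choosing the next token. It is a generic illustration, not the demo's decoding code: the function names are hypothetical, and real libraries differ in filter order and in whether they renormalize between steps.

```python
import math
import random


def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p.

    Assumes temperature > 0. Defaults mirror the values shown above.
    """
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    print(filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=2, top_p=1.0))
```

Lowering any of the three values narrows the set of candidate tokens, which tends to make box coordinates and labels more repeatable across runs.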
📖 How to Use
- Upload an Image or Video, or pick a Quick Sandbox example below.
- Choose a Task Type: Detection · Grounding · OCR · GUI · Pointing.
- Enter Categories in the search bar (comma-separated, e.g. car, person).
- Optionally tune Advanced parameters above (mode, resize, temperature, etc.).
- Click Run Inference or press Enter in the search bar.
Comma-separated targets · supports English & Chinese · press Enter to run
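The comma-separated category input can be pictured with a small parser like the one below. This is only a sketch of one plausible approach: `parse_categories` is a hypothetical helper, and accepting the full-width comma (，) for Chinese input is an assumption rather than documented behavior of the demo.

```python
import re


def parse_categories(query: str) -> list[str]:
    """Split a search-bar string into a clean, ordered list of targets."""
    # Split on ASCII commas and (assumed) full-width commas.
    parts = re.split(r"[,，]", query)
    seen = set()
    categories = []
    for part in parts:
        name = part.strip()
        # Skip empty entries and repeats, keeping first-seen order.
        if name and name not in seen:
            seen.add(name)
            categories.append(name)
    return categories


if __name__ == "__main__":
    print(parse_categories("car, person"))
```

Trimming whitespace and dropping duplicates keeps a query such as `car, person, car` from asking for the same target twice.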
🖼️ Interactive Quick Sandbox
Book
Sushi
People
OCR
Decoding Trace
Run inference to watch the model's tokens appear here. Ref labels, box coordinates, and stats are shown in full, with no horizontal scrolling.