Instructions to use nvidia/LocateAnything-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/LocateAnything-3B with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nvidia/LocateAnything-3B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/LocateAnything-3B", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use nvidia/LocateAnything-3B with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/LocateAnything-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```
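
For callers who prefer Python over curl, the same request can be sketched with only the standard library. This is a minimal sketch, assuming a server is already listening at the base URL you pass in and that it returns the usual OpenAI-style `choices[0].message.content` field; the helper names `build_chat_payload` and `chat` are hypothetical and not part of vLLM or SGLang.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to <base_url>/v1/chat/completions and return the reply text.

    Assumes an OpenAI-compatible response shape; raises on HTTP errors.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server, so it is left commented out):
# payload = build_chat_payload(
#     "nvidia/LocateAnything-3B",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# print(chat("http://localhost:8000", payload))
```

Passing `"http://localhost:30000"` as `base_url` would target the SGLang port used further down this page instead.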

Use Docker

```
docker model run hf.co/nvidia/LocateAnything-3B
```

SGLang

How to use nvidia/LocateAnything-3B with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/LocateAnything-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/LocateAnything-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/LocateAnything-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Docker Model Runner
How to use nvidia/LocateAnything-3B with Docker Model Runner:
```
docker model run hf.co/nvidia/LocateAnything-3B
```

Inference support for vLLM and SGLang OpenAI endpoints

by Vishva007 - opened 6 days ago

Discussion

Vishva007

6 days ago

Hi NVIDIA Team,

I'm interested in deploying LocateAnything-3B using high-throughput inference engines like vLLM or SGLang.

Are there any specific configuration flags required to handle the Parallel Box Decoding (PBD) architecture when serving via an OpenAI-compatible endpoint?
Does the current implementation in these engines support the custom MLP projector and MoonViT encoder natively, or is a specific trust-remote-code setup required?
If not currently supported, are there plans for an official integration or a recommended Docker container for scalable production serving?

Thanks for this impressive grounding model!

ShihaoW

NVIDIA org 3 days ago

Hi @Vishva007 ,

Thanks for your interest in LocateAnything-3B and for the kind words about our grounding model!

Regarding your deployment questions, to be completely transparent, our team currently lacks extensive experience in deploying models on high-throughput inference engines like vLLM or SGLang. Here is the current status regarding your questions:

Our Recent Attempts: I actually tried using GPT-5.5 xhigh to do some "vibe coding" to hack together a vLLM-compatible version recently, but I ran into a ton of issues and roadblocks.

Potential Reference: During my exploration, one resource that seemed somewhat relevant as a reference point is the vllm-project/dllm-plugin (vLLM plugin for block-based diffusion language model support).

Future Plans: Honestly, adapting our complex multi-modal architecture to fit perfectly into these engines feels like a very difficult path right now. Because of this, we don't have immediate plans for an official integration or a recommended Docker container for scalable production serving at this exact moment.

Since we are still exploring this space ourselves, if you decide to dive into it or make any progress, we would absolutely love to hear your insights or welcome a community PR! We are very open to collaborating with the community to figure out the best deployment path.

Thanks again!

Vishva007

3 days ago

Hi Shihao,

Thank you for the incredibly candid response! I really appreciate the transparency regarding your recent experiments and the current state of vLLM/SGLang support.

To be honest, modifying core engine architectures to support your custom MLP and MoonViT setup is likely a bit out of my depth as well! That is definitely a massive undertaking.

I'll check out the dllm-plugin reference you mentioned. If I happen to hack together a workable workaround or make any breakthroughs, I’ll gladly share them here.

Thanks again to you and the team for an amazing grounding model!

Best regards,
Vishva

Columbus688

2 days ago

You can refer to Kimi-VL's code for most of the vLLM adaptation, but there is still a lot to modify: 1) it is not compatible with transformers v5; 2) the processor may be a big problem. I'm stuck at _get_prompt_updates because the processor cannot correctly handle the <image-1> to <img><IMG_CONTEXT></img> conversion, whereas Kimi-VL's processor handles this correctly.

It seems OK to reuse vLLM's version of MoonViT, and the MLP can go through an hf_to_vllm_mapper defined in a custom LocateAnythingForConditionalGeneration.
For PBD MTP, I haven't yet worked out how to adapt it to vLLM's framework (only the AR path for now, and I'm stuck at that f*cking processor).
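
The placeholder conversion mentioned above can be sketched in plain Python. This is only an illustration, not vLLM's `_get_prompt_updates` API or the model's actual processor: the function name `expand_image_placeholders`, the 1-based reading of `<image-N>`, and the idea that `<IMG_CONTEXT>` is repeated once per visual token are all assumptions.

```python
import re

IMG_START = "<img>"
IMG_CONTEXT = "<IMG_CONTEXT>"
IMG_END = "</img>"

# Matches user-facing placeholders such as <image-1>, <image-2>, ...
_PLACEHOLDER = re.compile(r"<image-(\d+)>")


def expand_image_placeholders(prompt: str, tokens_per_image: list[int]) -> str:
    """Replace each <image-N> with <img> + k * <IMG_CONTEXT> + </img>.

    tokens_per_image[N - 1] is the assumed number of context tokens for
    image N (hypothetical: in a real processor it would come from the
    vision encoder's output length for that image).
    """

    def _expand(match: re.Match) -> str:
        index = int(match.group(1)) - 1  # assume placeholders are 1-based
        if not 0 <= index < len(tokens_per_image):
            raise ValueError(f"no image supplied for placeholder {match.group(0)}")
        return IMG_START + IMG_CONTEXT * tokens_per_image[index] + IMG_END

    return _PLACEHOLDER.sub(_expand, prompt)


expanded = expand_image_placeholders("<image-1> Find the cat.", [3])
print(expanded)
```

Doing the expansion in one regex pass keeps text that merely resembles the inserted tokens from being expanded a second time.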

ShihaoW

NVIDIA org 2 days ago

@Columbus688 I sincerely apologize for the headache and frustration this has caused you... I truly appreciate your patience and all your great support on the infra adaptation!

Columbus688

2 days ago

I just figured out that _get_prompt_updates can follow Qwen2.5VL's logic and got it working at noon, then I found this PR: https://github.com/vllm-project/vllm/pull/44182
I hadn't thought of writing a new LogitsProcessor to handle the outputs the way the author did. I tried it on my laptop and it works really well.
@ShihaoW I think you could either turn this PR into a vLLM plugin and publish it under the GitHub link, or push that PR forward to provide vLLM support.

Liuwang971

1 day ago

@Columbus688 This PR doesn't implement MTP (Multi-Token Prediction). Without it, you leave a lot of performance on the table: in my experience MTP gives roughly a 2x–2.5x speedup.
If you're focused on speeding things up, feel free to take a look at my project. I've got the MTP path working there and it might save you some time.
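
As a rough way to read that speedup figure, the relation between tokens emitted per decoding step and the number of forward passes can be worked through in a few lines. This is a back-of-the-envelope sketch with hypothetical numbers (400 output tokens, 2 or 2.5 tokens per step), not a measurement of LocateAnything-3B or of any vLLM implementation, and it ignores any extra per-step cost that multi-token prediction adds.

```python
import math


def decode_steps(num_tokens: int, tokens_per_step: float) -> int:
    """Forward passes needed to emit num_tokens at a given average rate."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    return math.ceil(num_tokens / tokens_per_step)


def ideal_speedup(num_tokens: int, tokens_per_step: float) -> float:
    """Upper bound on speedup versus one-token-per-step autoregressive decoding."""
    return decode_steps(num_tokens, 1) / decode_steps(num_tokens, tokens_per_step)


# Hypothetical workload: 400 output tokens.
for rate in (1, 2, 2.5):
    print(rate, decode_steps(400, rate), ideal_speedup(400, rate))
```

Under these assumptions, a 2x–2.5x speedup corresponds to averaging about 2 to 2.5 tokens per forward pass.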

Columbus688

1 day ago

@Liuwang971 Thanks, bro, it's really helpful.

Giddycrypt

1 day ago

It would be great to see more details on integrating this with vLLM to leverage multi-token prediction.

shigureui

about 16 hours ago

I made a working example here, using vLLM's embedding input:

https://github.com/WuNein/LocateAnything-vLLM/blob/main/locateanything_vllm.ipynb

Stars are welcome.
