Inference support for vLLM and SGLang OpenAI endpoints

#3
by Vishva007 - opened

Hi NVIDIA Team,

I'm interested in deploying LocateAnything-3B using high-throughput inference engines like vLLM or SGLang.

  1. Are there any specific configuration flags required to handle the Parallel Box Decoding (PBD) architecture when serving via an OpenAI-compatible endpoint?
  2. Does the current implementation in these engines support the custom MLP projector and MoonViT encoder natively, or is a specific trust-remote-code setup required?
  3. If not currently supported, are there plans for an official integration or a recommended Docker container for scalable production serving?

Thanks for this impressive grounding model!

NVIDIA org

Hi @Vishva007,

Thanks for your interest in LocateAnything-3B and for the kind words about our grounding model!

Regarding your deployment questions, to be completely transparent, our team currently lacks extensive experience in deploying models on high-throughput inference engines like vLLM or SGLang. Here is the current status regarding your questions:

Our Recent Attempts: I actually tried using GPT-5.5 xhigh to do some "vibe coding" to hack together a vLLM-compatible version recently, but I ran into a ton of issues and roadblocks.

Potential Reference: During my exploration, one resource that seemed somewhat relevant as a reference point is the vllm-project/dllm-plugin (vLLM plugin for block-based diffusion language model support).

Future Plans: Honestly, adapting our complex multi-modal architecture to fit perfectly into these engines feels like a very difficult path right now. Because of this, we don't have immediate plans for an official integration or a recommended Docker container for scalable production serving at this exact moment.

Since we are still exploring this space ourselves, if you decide to dive into it or make any progress, we would absolutely love to hear your insights or welcome a community PR! We are very open to collaborating with the community to figure out the best deployment path.

Thanks again!

Hi Shihao,

Thank you for the incredibly candid response! I really appreciate the transparency regarding your recent experiments and the current state of vLLM/SGLang support.

To be honest, modifying core engine architectures to support your custom MLP and MoonViT setup is likely a bit out of my depth as well! That is definitely a massive undertaking.

I'll check out the dllm-plugin reference you mentioned. If I happen to hack together a workaround or make any breakthroughs, I'll gladly share them here.

Thanks again to you and the team for an amazing grounding model!

Best regards,
Vishva

You can reuse most of Kimi-VL's code for the vLLM adaptation, but there is still a lot to modify: 1) not compatible with transformers v5; 2) the processor may be a big problem. I'm stuck at _get_prompt_updates because the processor can't correctly handle the <image-1> to <img><IMG_CONTEXT></img> conversion, whereas Kimi-VL's processor handles it correctly.
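To make the conversion above concrete, here is a minimal, framework-free sketch of the string transformation. It does not use vLLM's `_get_prompt_updates` interface. The function name, the 1-based reading of `<image-N>`, and the idea that `<IMG_CONTEXT>` is repeated once per image feature are all assumptions for illustration; the real token strings and counts come from the model's processor.

```python
import re

# Assumed token strings, copied from the placeholders named in the comment above.
IMG_START, IMG_CONTEXT, IMG_END = "<img>", "<IMG_CONTEXT>", "</img>"
PLACEHOLDER = re.compile(r"<image-(\d+)>")


def expand_image_placeholders(prompt: str, tokens_per_image: list[int]) -> str:
    """Replace each <image-N> with <img> + repeated <IMG_CONTEXT> + </img>.

    tokens_per_image[i] is the (assumed) number of context tokens for image i+1.
    """

    def _expand(match: re.Match) -> str:
        index = int(match.group(1)) - 1  # assume <image-1> means the first image
        if not 0 <= index < len(tokens_per_image):
            raise ValueError(f"no image supplied for placeholder {match.group(0)}")
        return IMG_START + IMG_CONTEXT * tokens_per_image[index] + IMG_END

    return PLACEHOLDER.sub(_expand, prompt)
```

In a real integration the per-image count would be derived from the image size and the vision encoder's patching, not passed in by hand.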

It seems OK to reuse vLLM's version of MoonViT, and the MLP can be handled with a hf_to_vllm_mapper defined in a custom LocateAnythingForConditionalGeneration.
For PBD MTP, I haven't tried to work out how to adapt it to vLLM's framework (just the AR path for now, and I'm stuck at that f*cking processor).
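The weight-mapping idea mentioned above boils down to renaming checkpoint keys so they match the module names of the serving-side model class. In vLLM this is normally declared through a `WeightsMapper` assigned to `hf_to_vllm_mapper`; the sketch below only mimics the prefix renaming in plain Python and does not call vLLM. The prefixes in `HYPOTHETICAL_PREFIX_MAP` are invented for illustration and are not the real LocateAnything-3B parameter names.

```python
def remap_weight_names(names, prefix_map):
    """Rename checkpoint keys by replacing the first matching prefix, if any."""
    remapped = []
    for name in names:
        for old_prefix, new_prefix in prefix_map.items():
            if name.startswith(old_prefix):
                name = new_prefix + name[len(old_prefix):]
                break  # apply at most one rule per key
        remapped.append(name)
    return remapped


# Hypothetical mapping: left side is a made-up HF checkpoint prefix,
# right side is a made-up module prefix in the serving-side model class.
HYPOTHETICAL_PREFIX_MAP = {
    "mlp_projector.": "multi_modal_projector.",
    "vision_model.": "vision_tower.",
}
```

Keys that match no rule pass through unchanged, which is the behavior you want for language-model weights whose names already line up.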

NVIDIA org

@Columbus688 I sincerely apologize for the headache and frustration this has caused you... I truly appreciate your patience and all your great support on the infra adaptation!

I just figured out that _get_prompt_updates can follow Qwen2.5VL's logic and got it done at noon, then I found this PR: https://github.com/vllm-project/vllm/pull/44182
I hadn't thought of writing a new LogitsProcessor to handle the outputs the way the author did. I tried it on my laptop and it works really well.
@ShihaoW I think you could turn this PR into a vLLM plugin and update the GitHub link with it, or push that PR forward, to provide vLLM support.
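This thread does not show what the PR's LogitsProcessor actually does. As general background only: a logits processor is a hook that edits the next-token scores before sampling, and a common use is restricting sampling to a set of allowed tokens. The sketch below shows that masking step in plain Python. The function name and token ids are made up, and this is not vLLM's processor interface.

```python
import math


def restrict_logits(logits, allowed_token_ids):
    """Return a copy of `logits` with every token outside the allowed set set to -inf."""
    allowed = set(allowed_token_ids)
    return [
        score if token_id in allowed else -math.inf
        for token_id, score in enumerate(logits)
    ]
```

After softmax, tokens at `-inf` get probability zero, so only the allowed ids can be sampled at that step.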

This PR doesn't implement MTP (Multi-Token Prediction). Without it, you leave a lot of performance on the table; in my experience MTP gives roughly a 2x–2.5x speedup.
If you're focused on speeding things up, feel free to take a look at my project; I've got the MTP path working there and it might save you some time.

thanks bro, it's really helpful

It would be great to see more details on integrating this with vLLM to leverage multi-token prediction.

I made a working example here, using the embedding input of vllm.

https://github.com/WuNein/LocateAnything-vLLM/blob/main/locateanything_vllm.ipynb

Stars are welcome.
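The linked notebook is not reproduced here. As a rough illustration of what an embedding-input approach generally involves: the vision features are computed outside the engine, spliced into the text embedding sequence at the image placeholder positions, and the merged sequence is passed to the engine in place of token ids. The sketch below shows only the splicing step with plain lists. The function name, the placeholder token id, and the one-vector-per-placeholder layout are assumptions.

```python
def splice_image_embeddings(token_ids, text_embeddings, image_embeddings, image_token_id):
    """Swap the embedding at each image placeholder for the next image feature vector."""
    image_iter = iter(image_embeddings)
    merged = []
    for token_id, text_vec in zip(token_ids, text_embeddings):
        if token_id == image_token_id:
            try:
                merged.append(next(image_iter))
            except StopIteration:
                raise ValueError("fewer image feature vectors than placeholder tokens") from None
        else:
            merged.append(text_vec)
    # Any leftover image vectors mean the prompt had too few placeholders.
    if next(image_iter, None) is not None:
        raise ValueError("more image feature vectors than placeholder tokens")
    return merged
```

With real tensors the same step is usually a masked assignment, but the bookkeeping (placeholder count must equal feature count) is identical.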
