🎯 GoClick-Base: Super Fast Lightweight GUI Grounding Expert

GoClick is a state-of-the-art two-stage framework for precise UI element grounding. Built on the Florence-2 architecture, it bridges the gap between high-level intent and low-level pixel coordinates by separating the Planning and Grounding tasks.

🏗️ Agent Architecture Overview

Stage 1 (Planning): Analyze UI screenshot + Goal -> Output Function Description.
Stage 2 (Grounding): Screenshot + Function Description -> Output Precise Coordinates.

Note: This model is the specialized Stage 2 Grounder, fine-tuned for extreme precision in locating elements based on their described functionality.
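
The division of labor between the two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_agent_step`, `stub_planner`, and `stub_grounder` are hypothetical names, and both stages are stubbed with plain functions so the data flow is visible without loading any model.

```python
from typing import Callable, Tuple

# Hypothetical type aliases, used only for this illustration.
Planner = Callable[[object, str], str]               # (screenshot, goal) -> function description
Grounder = Callable[[object, str], Tuple[int, int]]  # (screenshot, description) -> (x, y)


def run_agent_step(screenshot: object, goal: str,
                   plan: Planner, ground: Grounder) -> Tuple[int, int]:
    """Run one planning + grounding step and return the predicted point."""
    description = plan(screenshot, goal)      # Stage 1: goal -> function description
    return ground(screenshot, description)    # Stage 2: description -> coordinates


def stub_planner(screenshot: object, goal: str) -> str:
    # Stand-in for a planner model: rephrase the goal as a functionality description.
    return f"This element lets the user {goal}."


def stub_grounder(screenshot: object, description: str) -> Tuple[int, int]:
    # Stand-in for the grounder (the role GoClick plays): return a fixed point.
    return (120, 48)


point = run_agent_step(None, "open the settings menu", stub_planner, stub_grounder)
print(point)
```

In a real agent, the grounder stub would be replaced by a call into the model shown in the Quick Start section, and the planner by whichever model produces the function description.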

🚀 Quick Start (Model Inference)

Prerequisites

pip install transformers==4.45.0 timm

Note: Newer releases of Transformers may fail to load this model. Start with the pinned version above and try a nearby version if model loading still fails.

Usage Example

import re

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def postprocess(text: str, image_size: tuple[int, int]) -> tuple[int, int]:
    """Extract the predicted point from the model's generated text.

    Args:
        text: single generated sample
        image_size: corresponding image size (currently unused)

    Returns:
        The (x, y) values of the first "<loc_x>,<loc_y>" pair in the text,
        or (0, 0) if no such pair is found.
    """
    point_pattern = r"<loc_(\d+)>,<loc_(\d+)>"

    match = re.search(point_pattern, text)
    if match is None:
        # No location tokens in the output: fall back to the origin
        return (0, 0)

    return (int(match.group(1)), int(match.group(2)))

# Load model and processor
model = AutoModelForCausalLM.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)

# Load UI screenshot
image = Image.open("ui_screenshot.png").convert("RGB")
img_size = image.size  # (width, height)

# Build the grounding prompt: pick ONE of the three templates below.
# goal_info is a placeholder; phrase it to fit the template you choose.
goal_info = "open the settings menu"

# Intent Grounding (for RefExp, MOTIF, and VisualWebBench Action Grounding)
prompt = f"I want to {goal_info}. Please locate the target element I should interact with. (Output the center coordinates of the target)"

# Functionality Grounding (for the AutoGUI FuncPred benchmark)
# prompt = f"Locate the element according to its detailed functionality description. {goal_info} (Output the center coordinates of the target)"

# Description Grounding (for ScreenSpot/v2 and VisualWebBench Element Grounding)
# prompt = f"Where is the {goal_info} element? (Output the center coordinates of the target)"

# Example value: the answer is a single "<loc_x>,<loc_y>" pair, so a small budget suffices.
max_new_tokens = 32

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt",
    do_resize=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=max_new_tokens,
    use_cache=True,
)

# Keep special tokens: the <loc_*> tokens carry the predicted coordinates
text_output = processor.tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
point = postprocess(text_output, img_size)
print(point)
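
The prompt templates and the `<loc_*>` parsing above can be wrapped in two small helpers, sketched below. Two things here are assumptions rather than facts from this model card: the mode names (`functionality`, `intent`, `description`) are hypothetical labels, and the conversion to pixels assumes the Florence-2 convention of quantizing each axis into 1000 bins. Verify the bin count against your own outputs before relying on the pixel values.

```python
import re
from typing import Optional, Tuple

# The three templates from the usage example, keyed by hypothetical mode names.
PROMPT_TEMPLATES = {
    "functionality": "Locate the element according to its detailed functionality "
                     "description. {goal_info} (Output the center coordinates of the target)",
    "intent": "I want to {goal_info}. Please locate the target element I should "
              "interact with. (Output the center coordinates of the target)",
    "description": "Where is the {goal_info} element? "
                   "(Output the center coordinates of the target)",
}

LOC_PATTERN = re.compile(r"<loc_(\d+)>,<loc_(\d+)>")
NUM_BINS = 1000  # ASSUMPTION: Florence-2 style quantization, not stated in this card


def build_prompt(mode: str, goal_info: str) -> str:
    """Fill the template for the given mode; raises KeyError on an unknown mode."""
    return PROMPT_TEMPLATES[mode].format(goal_info=goal_info)


def loc_to_pixels(text: str, image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Optional[Tuple[int, int]]:
    """Convert the first "<loc_x>,<loc_y>" pair in text to pixel coordinates.

    Returns None when the text contains no location pair.
    """
    match = LOC_PATTERN.search(text)
    if match is None:
        return None
    width, height = image_size
    x_bin, y_bin = int(match.group(1)), int(match.group(2))
    # Map each bin index to the pixel at the center of that bin.
    x = int((x_bin + 0.5) * width / num_bins)
    y = int((y_bin + 0.5) * height / num_bins)
    return (x, y)


print(build_prompt("description", "search"))
print(loc_to_pixels("<loc_500>,<loc_250>", (1920, 1080)))  # (960, 270)
```

Returning `None` instead of `(0, 0)` for a missing match lets the caller tell a failed parse apart from a genuine prediction at the top-left corner.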

📊 Benchmarks

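GoClick-Base achieves a favorable trade-off between GUI element grounding accuracy and inference latency. At 0.2B parameters it has the lowest TTFT (time to first token, 37.7 ms) and TPOT (time per output token, 4.1 ms/token) of all models in the table: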

| Model | Size | TTFT ↓ (ms) | TPOT ↓ (ms/token) | FuncPred (F; M, W) | ScreenSpot (B; M, W, D) | ScreenSpot-v2 (B; M, W, D) | MOTIF (I; M) | RefExp (I; M) | VWB EG (T; W) | VWB AG (I; W) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | - | - | 9.8 | 17.8 | 20.4 | 30.5 | 21.8 | 5.6 | 6.8 |
| Qwen2VL-7B | 8B | 118.9 | 21.2 | 38.7 | 66.4 | 66.9 | 75.1 | 64.8 | 55.9 | 62.1 |
| CogAgent | 18B | 1253.2 | 208.8 | 29.3 | 47.4 | 49.2 | 46.7 | 35.0 | 55.7 | 59.2 |
| SeeClick | 10B | 160.4 | 184.4 | 19.8 | 53.4 | 54.0 | 11.1 | 58.1 | 39.2 | 27.2 |
| Ferret-UI | 8B | 152.5 | 22.9 | 1.2 | 7.1 | 7.8 | 15.9 | 5.5 | 3.9 | 1.9 |
| UGround | 7B | 1034.6 | 27.9 | 48.8 | 74.8 | 76.5 | 72.4 | 73.6 | 85.2 | 63.1 |
| OS-ATLAS-8B | 8B | 137.5 | 19.9 | 52.1 | 82.5 | 84.1 | 78.8 | 66.5 | 82.6 | 69.9 |
| Aguvis | 8B | 119.7 | 21.2 | 52.0 | 83.8 | 85.6 | 73.8 | 80.9 | 91.3 | 68.0 |
| Qwen2-VL | 2B | 58.8 | 16.4 | 7.1 | 17.9 | 18.6 | 28.8 | 29.2 | 17.9 | 17.5 |
| OS-ATLAS-4B | 4B | 137.3 | 31.4 | 44.6 | 66.8 | 68.7 | 75.4 | 77.1 | 47.7 | 58.3 |
| Ferret-UI | 3B | 69.5 | 9.8 | 1.3 | 2.1 | 1.9 | 5.5 | 1.1 | 0.7 | 1.0 |
| ShowUI | 2B | 79.7 | 14.7 | 39.9 | 76.1 | 77.4 | 72.3 | 58.4 | 64.2 | 55.3 |
| GoClick-L (ours) | 0.8B | 91.1 | 8.3 | 69.5 | 78.5 | 81.1 | 80.4 | 78.2 | 90.3 | 68.0 |
| GoClick-B (ours) | 0.2B | 37.7 | 4.1 | 64.4 | 74.1 | 75.2 | 76.8 | 71.9 | 90.3 | 61.2 |
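
To relate the two latency columns to an end-to-end figure, a common approximation is latency ≈ TTFT + TPOT × (n − 1) for n generated tokens. The sketch below applies it to three rows of the table. Both the formula and the output length of 8 tokens are assumptions chosen for illustration, not measurements reported by the authors.

```python
def estimate_latency_ms(ttft_ms: float, tpot_ms: float, num_tokens: int) -> float:
    """Approximate generation latency: first token, then (n - 1) further tokens."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    return ttft_ms + tpot_ms * (num_tokens - 1)


# (TTFT ms, TPOT ms/token) copied from the benchmark table.
ROWS = {
    "GoClick-B": (37.7, 4.1),
    "GoClick-L": (91.1, 8.3),
    "ShowUI": (79.7, 14.7),
}

NUM_TOKENS = 8  # ASSUMPTION: illustrative output length, not from the source

for name, (ttft, tpot) in ROWS.items():
    print(f"{name}: {estimate_latency_ms(ttft, tpot, NUM_TOKENS):.1f} ms")
```

Because grounding outputs are short, TTFT dominates the estimate for small n, which is why both columns matter when comparing models.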

📝 Citation

If you use GoClick in your research, please cite our paper:

@misc{li2026goclicklightweightelementgrounding,
      title={GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction}, 
      author={Hongxin Li and Yuntao Chen and Zhaoxiang Zhang},
      year={2026},
      eprint={2604.23941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23941}, 
}
