Use with vLLM

Install vLLM from pip and serve the model:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HongxinLi/GoClick-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HongxinLi/GoClick-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
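
The same request can also be issued with the official openai Python client (the vLLM server is OpenAI-compatible; the api_key value below is a placeholder, since vLLM does not verify it by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="HongxinLi/GoClick-Base",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(response.choices[0].text)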
Use Docker
docker model run hf.co/HongxinLi/GoClick-Base
🎯 GoClick-Base: Super Fast Lightweight GUI Grounding Expert

Quick Links: GitHub · Paper · GoClickLarge · GoClickBase · SFTData · SFTZipData

GoClick is a state-of-the-art two-stage framework for precise UI element grounding. Built on the Florence-2 architecture, it bridges the gap between high-level intent and low-level pixel coordinates by separating the Planning and Grounding tasks.

πŸ—οΈ Agent Architecture Overview

  1. Stage 1 (Planning): Analyze UI screenshot + goal -> output a function description of the target element.
  2. Stage 2 (Grounding): Screenshot + function description -> output precise coordinates.

Note: This model is the specialized Stage 2 Grounder, fine-tuned for extreme precision in locating elements based on their described functionality. The sketch below shows how the two stages compose.
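
A minimal sketch of the composed pipeline, assuming hypothetical plan() and ground() helpers (only the grounder corresponds to this checkpoint; the planner is a separate Stage 1 model):

from PIL import Image

def plan(image: Image.Image, goal: str) -> str:
    # Stage 1 (hypothetical helper): turn a high-level goal into a
    # functionality description of the target element.
    ...

def ground(image: Image.Image, description: str) -> tuple[int, int]:
    # Stage 2 (this model): turn the description into precise coordinates,
    # e.g. via the Usage Example below.
    ...

screenshot = Image.open("ui_screenshot.png")
description = plan(screenshot, "mute the video")
x, y = ground(screenshot, description)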

🚀 Quick Start (Model Inference)

Prerequisites

pip install transformers==4.45.0 timm

Note: Newer Transformers releases may fail to load this model; pin the version above and adjust it downward if loading fails.
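
A quick sanity check that the pinned version is active before loading the model:

import transformers

assert transformers.__version__ == "4.45.0", transformers.__version__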

Usage Example

import re

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def postprocess(text: str, image_size: tuple[int, int]):
    """Decode the model's generation into a click point.

    Args:
        text: a single generated sample.
        image_size: (width, height) of the corresponding image, kept for
            rescaling the point back to pixels (see the note after this example).
    """
    point_pattern = r"<loc_(\d+)>,<loc_(\d+)>"
    point = (0, 0)  # fallback when no location tokens are found

    try:
        location = re.findall(point_pattern, text)[0]
        point = [int(loc) for loc in location]
    except IndexError:
        pass

    return point
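
# Illustrative behavior (token format follows the <loc_*> style shown above):
# postprocess("<loc_432>,<loc_116>", (1920, 1080)) -> [432, 116]
# postprocess("no location here", (1920, 1080)) -> (0, 0)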

# Load model and processor
model = AutoModelForCausalLM.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)

# Load UI screenshot
image = Image.open("ui_screenshot.png")

# Stage 2: Grounding -- pick ONE prompt template below. goal_info is the
# target's description or the user's intent (e.g., produced by a Stage 1 planner).
goal_info = "the search button"  # illustrative placeholder

# Functionality grounding (for the AutoGUI FuncPred benchmark)
prompt = f"Locate the element according to its detailed functionality description. {goal_info} (Output the center coordinates of the target)"

# Intent grounding (for RefExp, MOTIF, and VisualWebBench Action Grounding)
prompt = f"I want to {goal_info}. Please locate the target element I should interact with. (Output the center coordinates of the target)"

# Description grounding (for ScreenSpot/v2 and VisualWebBench Element Grounding)
prompt = f"Where is the {goal_info} element? (Output the center coordinates of the target)"


inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt",
    do_resize=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,  # a short coordinate answer needs far fewer tokens
    use_cache=True,
)

text_output = processor.tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
point = postprocess(text_output, image.size)  # image.size is (width, height)
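
The decoded point lives in the model's location-token space. If GoClick follows Florence-2's convention of quantizing coordinates into roughly 1000 bins (an assumption to verify against the repo; the half-bin offset is omitted here), mapping it back to pixels looks like:

# Assumption: <loc_*> indices are normalized to ~1000 bins, as in Florence-2.
x_bin, y_bin = point
width, height = image.size
pixel_x = x_bin / 1000 * width
pixel_y = y_bin / 1000 * height
print(f"click at ({pixel_x:.0f}, {pixel_y:.0f})")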

📊 Benchmarks

GoClick-Base achieves a strong tradeoff between GUI element grounding accuracy and inference latency (TTFT = time to first token, TPOT = time per output token; lower is better for both):

| Model | Size | TTFT ↓ (ms) | TPOT ↓ (ms/token) | FuncPred (F; M, W) | ScreenSpot (B; M, W, D) | ScreenSpot-v2 (B; M, W, D) | MOTIF (I; M) | RefExp (I; M) | VWB EG (T; W) | VWB AG (I; W) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | - | - | - | 9.8 | 17.8 | 20.4 | 30.5 | 21.8 | 5.6 | 6.8 |
| Qwen2VL-7B | 8B | 118.9 | 21.2 | 38.7 | 66.4 | 66.9 | 75.1 | 64.8 | 55.9 | 62.1 |
| CogAgent | 18B | 1253.2 | 208.8 | 29.3 | 47.4 | 49.2 | 46.7 | 35.0 | 55.7 | 59.2 |
| SeeClick | 10B | 160.4 | 184.4 | 19.8 | 53.4 | 54.0 | 11.1 | 58.1 | 39.2 | 27.2 |
| Ferret-UI | 8B | 152.5 | 22.9 | 1.2 | 7.1 | 7.8 | 15.9 | 5.5 | 3.9 | 1.9 |
| UGround | 7B | 1034.6 | 27.9 | 48.8 | 74.8 | 76.5 | 72.4 | 73.6 | 85.2 | 63.1 |
| OS-ATLAS-8B | 8B | 137.5 | 19.9 | 52.1 | 82.5 | 84.1 | 78.8 | 66.5 | 82.6 | 69.9 |
| Aguvis | 8B | 119.7 | 21.2 | 52.0 | 83.8 | 85.6 | 73.8 | 80.9 | 91.3 | 68.0 |
| Qwen2-VL | 2B | 58.8 | 16.4 | 7.1 | 17.9 | 18.6 | 28.8 | 29.2 | 17.9 | 17.5 |
| OS-ATLAS-4B | 4B | 137.3 | 31.4 | 44.6 | 66.8 | 68.7 | 75.4 | 77.1 | 47.7 | 58.3 |
| Ferret-UI | 3B | 69.5 | 9.8 | 1.3 | 2.1 | 1.9 | 5.5 | 1.1 | 0.7 | 1.0 |
| ShowUI | 2B | 79.7 | 14.7 | 39.9 | 76.1 | 77.4 | 72.3 | 58.4 | 64.2 | 55.3 |
| GoClick-L (ours) | 0.8B | 91.1 | 8.3 | 69.5 | 78.5 | 81.1 | 80.4 | 78.2 | 90.3 | 68.0 |
| GoClick-B (ours) | 0.2B | 37.7 | 4.1 | 64.4 | 74.1 | 75.2 | 76.8 | 71.9 | 90.3 | 61.2 |
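
Reading TTFT and TPOT together gives a rough end-to-end latency of TTFT + TPOT × generated tokens. A back-of-the-envelope comparison from the table above (the 10-token answer length is an illustrative assumption):

# Rough end-to-end latency in ms: TTFT + TPOT * number of generated tokens.
def latency_ms(ttft: float, tpot: float, n_tokens: int = 10) -> float:
    return ttft + tpot * n_tokens

print(latency_ms(37.7, 4.1))    # GoClick-B: ~78.7 ms
print(latency_ms(91.1, 8.3))    # GoClick-L: ~174.1 ms
print(latency_ms(137.5, 19.9))  # OS-ATLAS-8B: ~336.5 ms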

📝 Citation

If you use GoClick in your research, please cite our paper:

@misc{li2026goclicklightweightelementgrounding,
      title={GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction}, 
      author={Hongxin Li and Yuntao Chen and Zhaoxiang Zhang},
      year={2026},
      eprint={2604.23941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23941}, 
}