Update README.md

35fc6d3 verified about 1 month ago

4.14 kB

	---
	license: apache-2.0
	datasets:
	- GUI-Libra/GUI-Libra-81K-RL
	- GUI-Libra/GUI-Libra-81K-SFT
	language:
	- en
	base_model:
	- Qwen/Qwen3-VL-4B-Instruct
	tags:
	- VLM
	- GUI
	- agent
	---

	# Introduction

	The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".


	GitHub: https://github.com/GUI-Libra/GUI-Libra
	Website: https://GUI-Libra.github.io


	# Usage
	## 1) Start an OpenAI-compatible vLLM server

	```bash
	pip install -U vllm
	vllm serve GUI-Libra/GUI-Libra-4B --port 8000 --api-key token-abc123
	````

	* Endpoint: `http://localhost:8000/v1`
	* The `api_key` here must match `--api-key`.


	## 2) Minimal Python example (prompt + image → request)

	Install dependencies:

	```bash
	pip install -U openai
	```

	Create `minimal_infer.py`:

	```python
	import base64
	from openai import OpenAI

	MODEL = "GUI-Libra/GUI-Libra-4B"
	client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

	def b64_image(path: str) -> str:
	with open(path, "rb") as f:
	return base64.b64encode(f.read()).decode("utf-8")

	# 1) Your screenshot path
	img_b64 = b64_image("screen.png")

	system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the the following list:
	action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
	## Explanation: Tap or click a specific UI element and provide its coordinates

	action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
	## Explanation: Select an item from a list or dropdown menu

	action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
	## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

	action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
	## Explanation: Press a specified key on the keyboard

	action_type: Scroll, action_target: None, value: "up" \| "down" \| "left" \| "right", point_2d: None
	## Explanation: Scroll a view or container in the specified direction
	"""

	# 2) Your prompt (instruction + desired output format)

	task_desc = 'Go to Amazon.com and buy a math book'
	prev_txt = ''
	question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
	img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
	query = question_description.format(img_size_string, task_desc, prev_txt)

	query = query + '\n' + '''The response should be structured in the following format:
	<thinking>Your step-by-step thought process here...</thinking>
	<answer>
	{
	"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
	"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
	"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
	"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
	}
	</answer>'''

	resp = client.chat.completions.create(
	model=MODEL,
	messages=[
	{"role": "system", "content": "You are a helpful GUI agent."},
	{"role": "user", "content": [
	{"type": "image_url",
	"image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
	{"type": "text", "text": prompt},
	]},
	],
	temperature=0.0,
	max_completion_tokens=1024,
	)

	print(resp.choices[0].message.content)
	```

	Run:

	```bash
	python minimal_infer.py
	```

	---

	## Notes

	* Replace `screen.png` with your own screenshot file.
	* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
	* The example assumes your vLLM server is running locally on port `8000`.