| --- |
| license: apache-2.0 |
| datasets: |
| - GUI-Libra/GUI-Libra-81K-RL |
| - GUI-Libra/GUI-Libra-81K-SFT |
| language: |
| - en |
| base_model: |
| - Qwen/Qwen3-VL-4B-Instruct |
| tags: |
| - VLM |
| - GUI |
| - agent |
| --- |
| |
| # Introduction |
|
|
| The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL". |
|
|
|
|
| **GitHub:** https://github.com/GUI-Libra/GUI-Libra |
| **Website:** https://GUI-Libra.github.io |
|
|
|
|
| # Usage |
| ## 1) Start an OpenAI-compatible vLLM server |
|
|
| ```bash |
| pip install -U vllm |
| vllm serve GUI-Libra/GUI-Libra-4B --port 8000 --api-key token-abc123 |
| ```` |
|
|
| * Endpoint: `http://localhost:8000/v1` |
| * The `api_key` here must match `--api-key`. |
|
|
|
|
| ## 2) Minimal Python example (prompt + image → request) |
|
|
| Install dependencies: |
|
|
| ```bash |
| pip install -U openai |
| ``` |
|
|
| Create `minimal_infer.py`: |
|
|
| ```python |
| import base64 |
| from openai import OpenAI |
| |
| MODEL = "GUI-Libra/GUI-Libra-4B" |
| client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123") |
| |
| def b64_image(path: str) -> str: |
| with open(path, "rb") as f: |
| return base64.b64encode(f.read()).decode("utf-8") |
| |
| # 1) Your screenshot path |
| img_b64 = b64_image("screen.png") |
| |
| system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the the following list: |
| action_type: Click, action_target: Element description, value: None, point_2d: [x, y] |
| ## Explanation: Tap or click a specific UI element and provide its coordinates |
| |
| action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None |
| ## Explanation: Select an item from a list or dropdown menu |
| |
| action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None |
| ## Explanation: Enter text into a specific input field or at the current focus if coordinate is None |
| |
| action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None |
| ## Explanation: Press a specified key on the keyboard |
| |
| action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None |
| ## Explanation: Scroll a view or container in the specified direction |
| """ |
| |
| # 2) Your prompt (instruction + desired output format) |
| |
| task_desc = 'Go to Amazon.com and buy a math book' |
| prev_txt = '' |
| question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n''' |
| img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1]) |
| query = question_description.format(img_size_string, task_desc, prev_txt) |
| |
| query = query + '\n' + '''The response should be structured in the following format: |
| <thinking>Your step-by-step thought process here...</thinking> |
| <answer> |
| { |
| "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.", |
| "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with", |
| "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'", |
| "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100] |
| } |
| </answer>''' |
| |
| resp = client.chat.completions.create( |
| model=MODEL, |
| messages=[ |
| {"role": "system", "content": "You are a helpful GUI agent."}, |
| {"role": "user", "content": [ |
| {"type": "image_url", |
| "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}}, |
| {"type": "text", "text": prompt}, |
| ]}, |
| ], |
| temperature=0.0, |
| max_completion_tokens=1024, |
| ) |
| |
| print(resp.choices[0].message.content) |
| ``` |
|
|
| Run: |
|
|
| ```bash |
| python minimal_infer.py |
| ``` |
|
|
| --- |
|
|
| ## Notes |
|
|
| * Replace `screen.png` with your own screenshot file. |
| * If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests. |
| * The example assumes your vLLM server is running locally on port `8000`. |
|
|
|
|
|
|
|
|