Cheng Chang committed on
Commit e1d539f · 1 Parent(s): 29baff7

readme edit

Files changed (3)
  1. LICENSE +49 -0
  2. README.md +122 -36
  3. sample_code.py → sanity.py +0 -0
LICENSE ADDED
@@ -0,0 +1,49 @@
+ OpenMDW License Agreement, version 1.0 (OpenMDW-1.0)
+
+ By exercising rights granted to you under this agreement, you accept and agree
+ to its terms.
+
+ As used in this agreement, "Model Materials" means the materials provided to
+ you under this agreement, consisting of: (1) one or more machine learning
+ models (including architecture and parameters); and (2) all related artifacts
+ (including associated data, documentation and software) that are provided to
+ you hereunder.
+
+ Subject to your compliance with this agreement, permission is hereby granted,
+ free of charge, to deal in the Model Materials without restriction, including
+ under all copyright, patent, database, and trade secret rights included or
+ embodied therein.
+
+ If you distribute any portion of the Model Materials, you shall retain in your
+ distribution (1) a copy of this agreement, and (2) all copyright notices and
+ other notices of origin included in the Model Materials that are applicable to
+ your distribution.
+
+ If you file, maintain, or voluntarily participate in a lawsuit against any
+ person or entity asserting that the Model Materials directly or indirectly
+ infringe any patent, then all rights and grants made to you hereunder are
+ terminated, unless that lawsuit was in response to a corresponding lawsuit
+ first brought against you.
+
+ This agreement does not impose any restrictions or obligations with respect to
+ any use, modification, or sharing of any outputs generated by using the Model
+ Materials.
+
+ THE MODEL MATERIALS ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE, TITLE, NONINFRINGEMENT, ACCURACY, OR THE
+ ABSENCE OF LATENT OR OTHER DEFECTS OR ERRORS, WHETHER OR NOT DISCOVERABLE, ALL
+ TO THE GREATEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW.
+
+ YOU ARE SOLELY RESPONSIBLE FOR (1) CLEARING RIGHTS OF OTHER PERSONS THAT MAY
+ APPLY TO THE MODEL MATERIALS OR ANY USE THEREOF, INCLUDING WITHOUT LIMITATION
+ ANY PERSON'S COPYRIGHTS OR OTHER RIGHTS INCLUDED OR EMBODIED IN THE MODEL
+ MATERIALS; (2) OBTAINING ANY NECESSARY CONSENTS, PERMISSIONS OR OTHER RIGHTS
+ REQUIRED FOR ANY USE OF THE MODEL MATERIALS; OR (3) PERFORMING ANY DUE
+ DILIGENCE OR UNDERTAKING ANY OTHER INVESTIGATIONS INTO THE MODEL MATERIALS OR
+ ANYTHING INCORPORATED OR EMBODIED THEREIN.
+
+ IN NO EVENT SHALL THE PROVIDERS OF THE MODEL MATERIALS BE LIABLE FOR ANY CLAIM,
+ DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+ OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE MODEL MATERIALS, THE
+ USE THEREOF OR OTHER DEALINGS THEREIN.
README.md CHANGED
@@ -43,7 +43,7 @@ library_name: transformers
  text-decoration:none;
  font-weight:600;
  font-size:16px;">
- 🌐 Website
+ 🌐 Website (Coming Soon!)
  </a>
  <a href="https://arxiv.org/abs/2510.09872" style="
  display:inline-block;
@@ -65,7 +65,7 @@ library_name: transformers
  text-decoration:none;
  font-weight:600;
  font-size:16px;">
- 💻 Code
+ 💻 Code (Coming Soon!)
  </a>
  </div>
 
@@ -96,7 +96,7 @@ ActIO-UI is developed by [Orby AI](https://www.orby.ai/), a [Uniphore](https://w
 
 
 
- # Models
+ # Model Family
 
  - [ActIO-UI-7B-SFT](https://huggingface.co/Uniphore/actio-ui-7b-sft): a 7B model trained with supervised finetuning (SFT) using distilled subtask data.
  - [ActIO-UI-7B-RLVR](?????(model_link)): a 7B model trained with Reinforcement Learning with Verifiable Rewards (RLVR) over the ActIO-UI-7B-SFT checkpoint.
@@ -139,8 +139,6 @@ ActIO-UI models are specifically trained to solve GUI subtask problems. Both the
  </div>
 
 
-
-
  ## Other Benchmarks
 
  To assess the generalizability of GUI subtask execution as a model capability, we compare the performance of ActIO-UI on GUI subtasks (WARC-Bench), long-horizon tasks (WebArena), short-horizon tasks (Miniwob++), and GUI visual grounding (ScreenSpot V2). Without access to any long-horizon or grounding data in their training dataset, our models show improved performance over their base models (except for grounding performance when compared to Qwen 2.5 VL 72B).
@@ -165,16 +163,17 @@ To assess the generalizability of GUI subtask execution as a model capability, we co
  </div>
 
 
- ## Usage
+ # Usage
 
- ### Image Input Size
+ ## Image Input Size
 
  To maintain optimal model performance, each input image should be set at **1280 (pixel width) \\(\times\\) 720 (pixel height)**.
 
 
- ### Setup
+ ## Setup
 
- To run all the code snippets below, we recommend that you install everything in `requirements.txt` in a python environment.
+ To run the code snippets below, we recommend that you install everything in `requirements.txt` in a Python environment.
  ```bash
  python -m venv ./venv
  source venv/bin/activate
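
For reference, a minimal sketch of preparing a screenshot at the recommended 1280 × 720 input resolution with Pillow before inference; the file names `screenshot.png` and `screenshot_1280x720.png` are illustrative assumptions, not files from the repository:

```python
from PIL import Image

# Resize a captured screenshot to the recommended 1280x720 model input size.
# "screenshot.png" is a hypothetical local file used only for illustration.
img = Image.open("screenshot.png").convert("RGB")
img = img.resize((1280, 720), Image.LANCZOS)
img.save("screenshot_1280x720.png")
```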
@@ -182,30 +181,118 @@ pip install -r requirements.txt
  ```
 
 
- ### Quick start
-
- You can use [vLLM](https://docs.vllm.ai/en/latest/index.html) to serve the model.
- ```bash
- vllm serve Uniphore/actio-ui-7b-sft
- ```
-
- Then you can use the `demo.py` we provide to check out a sample response of the model with the training prompt.
- ```
- python demo.py
- ```
-
- ### Sample Code
-
- (Peng 10/06)
- - setup code
- - quickly run example (5-10 code line).
- - important results / hierachical results.
-
- ```
- ?????(sample code)
+ ## Sanity test
+
+ Note that this is only a sanity test to ensure the model is working properly.
+ To replicate the evaluation results or use the model in your own project, please refer to our code repository on [GitHub](?????(repository)).
+
+ The following code snippet is also available in the attached `sanity.py`.
+
+ ```python
+ import base64
+ import torch
+ from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
+ from PIL import Image
+
+
+ def encode_image(image_path: str) -> str:
+     """Encode image to base64 string for model input."""
+     with open(image_path, "rb") as f:
+         return base64.b64encode(f.read()).decode()
+
+
+ def load_model(
+     model_path: str,
+ ) -> tuple[AutoModel, AutoTokenizer, AutoImageProcessor]:
+     """Load OpenCUA model, tokenizer, and image processor."""
+     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+     model = AutoModel.from_pretrained(
+         model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
+     )
+     image_processor = AutoImageProcessor.from_pretrained(
+         model_path, trust_remote_code=True
+     )
+
+     return model, tokenizer, image_processor
+
+
+ def create_grounding_messages(image_path: str, instruction: str) -> list[dict]:
+     """Create chat messages for GUI grounding task."""
+     system_prompt = (
+         "You are a GUI agent. You are given a task and a screenshot of the screen. "
+         "You need to perform a series of pyautogui actions to complete the task."
+     )
+
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "image",
+                     "image": f"data:image/png;base64,{encode_image(image_path)}",
+                 },
+                 {"type": "text", "text": instruction},
+             ],
+         },
+     ]
+     return messages
+
+
+ def run_inference(
+     model: AutoModel,
+     tokenizer: AutoTokenizer,
+     image_processor: AutoImageProcessor,
+     messages: list[dict],
+     image_path: str,
+ ) -> str:
+     """Run inference on the model."""
+     # Prepare text input
+     input_ids = tokenizer.apply_chat_template(
+         messages, tokenize=True, add_generation_prompt=True
+     )
+     input_ids = torch.tensor([input_ids]).to(model.device)
+
+     # Prepare image input
+     image = Image.open(image_path).convert("RGB")
+     image_info = image_processor.preprocess(images=[image])
+     pixel_values = torch.tensor(image_info["pixel_values"]).to(
+         dtype=torch.bfloat16, device=model.device
+     )
+     grid_thws = torch.tensor(image_info["image_grid_thw"])
+
+     # Generate response
+     with torch.no_grad():
+         generated_ids = model.generate(
+             input_ids,
+             pixel_values=pixel_values,
+             grid_thws=grid_thws,
+             max_new_tokens=2048,
+             temperature=0,
+         )
+
+     # Decode output
+     prompt_len = input_ids.shape[1]
+     generated_ids = generated_ids[:, prompt_len:]
+     output_text = tokenizer.batch_decode(
+         generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+     )[0]
+
+     return output_text
+
+
+ # Example usage
+ model_path = "Uniphore/actio-ui-7b-sft"  # or other model variants
+ image_path = "screenshot.png"
+ instruction = "Click on the submit button"
+
+ # Load model
+ model, tokenizer, image_processor = load_model(model_path)
+
+ # Create messages and run inference
+ messages = create_grounding_messages(image_path, instruction)
+ result = run_inference(model, tokenizer, image_processor, messages, image_path)
+
+ print("Model output:", result)
  ```
 
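
The quick start removed above served the model with vLLM via `vllm serve Uniphore/actio-ui-7b-sft` and then ran `demo.py`. As a rough sketch only, assuming vLLM's default OpenAI-compatible endpoint on port 8000 and a hypothetical local `screenshot.png`, a served model could be queried like this (this is not the project's documented demo script):

```python
import base64
from openai import OpenAI

# Talk to a local vLLM server started with: vllm serve Uniphore/actio-ui-7b-sft
# (http://localhost:8000/v1 is vLLM's default OpenAI-compatible endpoint; the
# API key is unused by default and can be any placeholder string.)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a base64 data URL ("screenshot.png" is assumed).
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Uniphore/actio-ui-7b-sft",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a GUI agent. You are given a task and a screenshot of the screen. "
                "You need to perform a series of pyautogui actions to complete the task."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Click on the submit button"},
            ],
        },
    ],
    temperature=0,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```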
 
@@ -215,14 +302,13 @@ python demo.py
  ## License
  This project is licensed under the Open Model, Data, & Weights License Agreement (OpenMDW). See the LICENSE file in the root folder for details.
 
- ## Research Use and Disclaimer
- ActIO-UI are intended for research and educational purposes only.
-
  ## Prohibited Uses
  The model may not be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction.
  Use for illegal, unethical, or harmful activities is strictly prohibited.
 
  ## Disclaimer
+ ActIO-UI models are intended for research and educational purposes only.
+
  The authors, contributors, and copyright holders are not responsible for any illegal, unethical, or harmful use of the Software, nor for any direct or indirect damages resulting from such use.
  Use of the name, logo, or trademarks of "ActIO", "ActIO-UI", "WARC-Bench", or "Uniphore" does not imply any endorsement or affiliation unless separate written permission is obtained.
  Users are solely responsible for ensuring their use complies with applicable laws and regulations.
 
sample_code.py → sanity.py RENAMED
File without changes