Microsoft recently released OmniParser V2.0, a parsing tool that converts user interface (UI) screenshots into structured formats. OmniParser improves the performance of UI agents built on large language models (LLMs), helping users better understand and manipulate the information on the screen.
The tool's training data includes an interactive icon detection dataset, curated from popular web pages and automatically annotated to highlight clickable and actionable regions. It also includes an icon description dataset that associates each UI element with its corresponding function.
V2.0 brings significant improvements. The icon description and grounding dataset is larger and cleaner, and latency is reported to be about 60% lower than in V1. According to tests, average latency is about 0.6 seconds/frame on an A100 and 0.8 seconds/frame on a single RTX 4090 graphics card. In terms of accuracy, OmniParser achieved an average score of 39.6 on the ScreenSpot Pro benchmark.
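To put the per-frame latency in perspective, the short Python sketch below converts the two reported figures into frames per second. The helper name `frames_per_second` is purely illustrative, and the calculation covers screenshot parsing only, not the time an LLM needs to choose an action.

```python
# Per-frame parsing latencies as reported in the article (seconds per frame).
latency_s = {"A100": 0.6, "RTX 4090": 0.8}


def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame latency into a throughput in frames per second."""
    return 1.0 / seconds_per_frame


for device, seconds in latency_s.items():
    print(f"{device}: {frames_per_second(seconds):.2f} frames/s")
```

An agent that parses one screenshot per step would therefore be limited to between one and two steps per second by parsing alone on either device.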
To control a Windows 11 virtual machine, users need only one tool: OmniTool. OmniTool works in combination with OmniParser and a vision model of the user's choice. It currently supports a range of large language models, including several OpenAI models, DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use, making it easier for users to carry out a variety of operations.
OmniParser aims to convert unstructured screenshot images into structured lists of elements, including the locations of interactive regions and descriptions of the likely functions of icons. Users need basic analytical skills and critical thinking: OmniParser can extract information, but the final judgment still rests with the user. The tool works on a variety of screenshots, including PC and mobile interfaces, and is highly adaptable.
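As a rough illustration of what such a structured element list could look like, here is a minimal Python sketch. The `ParsedElement` class, its field names, the normalized box coordinates, and the two helper functions are assumptions made for this example and do not reflect OmniParser's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class ParsedElement:
    """One UI element in a hypothetical parsed-screenshot result."""
    label: str        # description of the element's likely function
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed normalized to 0..1
    interactive: bool  # whether the element is clickable or otherwise actionable


def click_point(el: ParsedElement, width: int, height: int) -> tuple[int, int]:
    """Return the pixel center of an element's box for a given screen size."""
    x_min, y_min, x_max, y_max = el.bbox
    cx = (x_min + x_max) / 2 * width
    cy = (y_min + y_max) / 2 * height
    return round(cx), round(cy)


def to_prompt(elements: list[ParsedElement]) -> str:
    """Render only the interactive elements as a numbered text list for an LLM."""
    lines = []
    for i, el in enumerate(elements):
        if el.interactive:
            lines.append(f"[{i}] {el.label}")
    return "\n".join(lines)


# Made-up sample data standing in for a parsed screenshot.
elements = [
    ParsedElement("Settings gear icon", (0.90, 0.02, 0.96, 0.08), True),
    ParsedElement("Page title text", (0.10, 0.02, 0.50, 0.08), False),
    ParsedElement("Search button", (0.50, 0.25, 0.75, 0.50), True),
]

print(to_prompt(elements))
print(click_point(elements[2], 1920, 1080))
```

In this sketch, the text list is what a language model would read, and the index it picks is mapped back to a pixel coordinate where an agent could click.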
OmniParser's limitations are also worth noting. The tool does not detect harmful content in its input, so users should take care that the screenshots they provide do not contain harmful information. In addition, although OmniParser itself only converts screenshots into text, it can be used to build GUI agents that take real actions. Developers must follow security standards and ethical guidelines when building and operating agents with OmniParser.
Model: https://huggingface.co/microsoft/OmniParser-v2.0
Project: https://github.com/microsoft/OmniParser/tree/master