ScreenSpot-Pro: a multi-modal LLM benchmark tool designed for high-resolution environments!

Author: LoRA | Time: 06 Jan 2025

In professional environments, graphical user interface (GUI) agents face three key challenges. First, professional applications are far more complex than general-purpose software and require an in-depth understanding of intricate layouts. Second, professional tools usually run at higher resolutions, so targets are smaller relative to the screen and positioning accuracy drops. Finally, workflows often rely on additional tools and documentation, which adds to operational complexity. These challenges point to the need for more advanced benchmarks and solutions to improve the performance of GUI agents in these demanding scenarios.

Current GUI positioning models and benchmarks fall short of the requirements of professional environments. For example, benchmarks such as ScreenSpot are designed primarily for low-resolution tasks and lack the versatility to simulate real-world scenarios accurately. Models such as OS-Atlas and UGround, meanwhile, are computationally inefficient and often fail, especially when the target is small or the interface is dense with icons. The lack of multi-language support further limits the use of these models in global workflows. These shortcomings call for more comprehensive and realistic benchmarks to advance the field.

To address these issues, a research team from the National University of Singapore, East China Normal University, and Hong Kong Baptist University launched ScreenSpot-Pro, a new benchmark tailored for high-resolution professional environments. The benchmark contains 1,581 tasks drawn from 23 applications across five industries, including development, creative tools, CAD, scientific platforms, and office suites. It features high-resolution, full-screen screenshots with expert annotations to ensure accuracy and realism. ScreenSpot-Pro also provides instructions in both English and Chinese to widen evaluation coverage. Unlike earlier benchmarks, ScreenSpot-Pro records actual workflows to ensure high-quality annotations, making it an effective tool for the comprehensive evaluation and development of GUI positioning models.

The dataset captures realistic and challenging scenes in high-resolution images, with the target area occupying only 0.07% of the screen on average, which shows how small GUI elements are in these applications. Data is collected by professional users with extensive experience in the relevant applications, using specialized tools to keep annotations accurate. The dataset also provides bilingual instructions for testing bilingual ability and covers multiple workflows to capture the nuances of professional tasks. These characteristics make it particularly useful for evaluating and improving the accuracy and flexibility of GUI agents.
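To put the 0.07% figure in perspective, a quick back-of-the-envelope calculation helps. The sketch below assumes a 3840x2160 screenshot and a square target; both are illustrative assumptions, not figures taken from the dataset, and the function name is hypothetical.

```python
import math


def target_side_px(width: int, height: int, area_fraction: float) -> float:
    """Side length, in pixels, of a square covering `area_fraction` of the screen."""
    return math.sqrt(width * height * area_fraction)


# Assumed 4K screenshot; 0.0007 is the 0.07% average reported for the dataset.
side = target_side_px(3840, 2160, 0.0007)
print(f"about {side:.0f} x {side:.0f} px")  # prints "about 76 x 76 px"
```

Under these assumptions, the average target is roughly the size of a single toolbar icon on a screen of more than eight million pixels.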

An analysis of existing GUI positioning models on ScreenSpot-Pro reveals severe weaknesses in handling high-resolution professional environments. The best-performing model, OS-Atlas-7B, reaches an accuracy of only 18.9%. ReGround, which takes an iterative approach and refines its prediction over multiple steps, raises this to 40.2%. Small components such as icons are especially hard to recognize, and bilingual tasks expose further limitations of the models. These findings show the need for improved techniques that strengthen contextual understanding and adaptability in complex GUI environments.
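Accuracy for GUI positioning (also called GUI grounding) is commonly scored by checking whether the model's predicted click point lands inside the annotated target box. The sketch below assumes that criterion and a (left, top, right, bottom) pixel box format; the function names and sample data are hypothetical, and the benchmark's own evaluation script may differ in detail.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that fall inside their target boxes."""
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)


# Made-up predictions and target boxes, for illustration only.
preds = [(105.0, 52.0), (900.0, 400.0), (30.0, 30.0)]
boxes = [(100, 40, 120, 60), (10, 10, 50, 50), (20, 20, 40, 40)]
print(f"accuracy: {grounding_accuracy(preds, boxes):.1%}")
```

In this made-up sample, two of the three points fall inside their boxes. With targets as small as those in this benchmark, a prediction that is off by a few dozen pixels already counts as a miss under such a criterion.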

ScreenSpot-Pro sets a new benchmark for evaluating GUI agents in high-resolution professional environments. It addresses challenges specific to complex workflows and provides a diverse, precisely annotated dataset to guide progress in GUI positioning. This contribution could lay the foundation for smarter, more efficient agents that carry out professional tasks smoothly and raise productivity across industries.

Paper: https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf

Data: https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
