OpenAI launches PaperBench, a benchmark that tests AI agents' ability to reproduce cutting-edge AI research

Author: LoRA Time: 03 Apr 2025 1023

On April 2, 2025, OpenAI announced the launch of PaperBench, a new benchmark for evaluating AI agents’ ability to reproduce cutting-edge AI research. PaperBench requires agents to reproduce, from scratch, 20 papers that were selected as Spotlight and Oral papers at ICML 2024. The tasks include understanding each paper's contributions, developing a code base, and successfully executing the experiments.

In OpenAI's tests on PaperBench, Claude 3.5 Sonnet (New), paired with an open-source agent scaffold, performed best among the agents tested, with an average reproduction score of 21.0%. Even so, OpenAI found that it did not surpass the human baseline. Top machine learning PhDs were recruited to attempt the benchmark for comparison, and the results show that agents still have room to improve at reproducing research.

According to foreign media reports, OpenAI's ChatGPT now has more than 20 million paid subscribers, up about 30% from 15.5 million at the end of 2024.

ChatGPT now brings in at least $415 million in monthly revenue, or about $5 billion on an annualized basis. OpenAI is also promoting its $200/month Pro plan, so actual revenue may be higher.
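
The growth and annual revenue figures above can be checked with simple arithmetic. The following is a minimal sketch, assuming the "more than 20 million" subscriber count is taken as exactly 20 million and that the annual figure is the monthly figure multiplied by 12 (a run-rate assumption, not something the reports state explicitly):

```python
# Figures quoted in the article (subscribers in millions, revenue in USD).
subscribers_end_2024 = 15.5
subscribers_now = 20.0        # assumption: "more than 20 million" treated as 20.0
monthly_revenue = 415_000_000

# Relative growth in paid subscribers since the end of 2024.
growth_pct = (subscribers_now - subscribers_end_2024) / subscribers_end_2024 * 100

# Annualized run rate: assumes every month matches the quoted monthly figure.
annualized_revenue = monthly_revenue * 12

print(f"Subscriber growth: {growth_pct:.1f}%")
print(f"Annualized revenue: ${annualized_revenue / 1e9:.2f} billion")
```

Under these assumptions the growth comes to roughly 29%, in line with the reported "about 30%", and the run rate comes to just under $5 billion.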
