Historical knowledge is AI's weak spot: large language models struggle with complex historical questions

Author: LoRA Time: 20 Jan 2025

New research shows that although artificial intelligence excels at tasks such as programming and content creation, it still falls short on complex historical questions. A recent study presented at the NeurIPS conference found that even the most advanced large language models (LLMs) struggle to achieve satisfactory results on tests of historical knowledge.

The research team developed a benchmark called Hist-LLM to evaluate three leading model families: OpenAI's GPT-4, Meta's Llama, and Google's Gemini. The benchmark is built on the Seshat Global History Databank, and the results were disappointing: the best performer, GPT-4 Turbo, reached an accuracy of only 46%.


Maria del Rio-Chanona, an associate professor at University College London, explained: "These models perform well when it comes to basic historical facts, but fall short when it comes to in-depth historical research at the PhD level." The study found that the models often get details wrong, for example misjudging whether ancient Egypt had certain military technologies or a standing army during particular periods.

The researchers believe this poor performance arises because AI models tend to extrapolate from mainstream historical narratives and have difficulty grasping finer historical details. The study also found that the models performed worse on questions about regions such as sub-Saharan Africa, which points to possible bias in the training data.

Peter Turchin, head of research at the Complexity Science Hub (CSH), said the finding shows that in some specialized fields AI cannot yet replace human experts. Nevertheless, the research team remains optimistic about AI's potential in historical research and is refining the benchmark to help develop better models.
