
AI performs poorly on advanced history exam: GPT-4 Turbo scores only 46% accuracy

Author: LoRA Time: 21 Jan 2025

Recently, a study led by the Complexity Science Hub (CSH) in Austria showed that although large language models (LLMs) perform well across many tasks, they fall short when dealing with advanced historical questions. The research team tested three top models, including OpenAI's GPT-4, Meta's Llama, and Google's Gemini, and the results were disappointing.


To evaluate the models' performance on historical knowledge, the researchers developed a benchmark called "Hist-LLM". Built on the Seshat Global History Databank, it is designed to check the accuracy of AI answers to historical questions. The results were presented at the well-known artificial intelligence conference NeurIPS. The data showed that the best-performing model, GPT-4 Turbo, achieved an accuracy of only 46% — only slightly better than random guessing.

Maria del Rio-Chanona, associate professor of computer science at University College London, said: "While large language models are impressive, their depth of understanding of advanced historical knowledge falls short. They are good at handling simple facts, but struggle with more complex ones." For example, when asked whether scale armor existed in ancient Egypt during a specific period, GPT-4 Turbo incorrectly answered "yes," when in fact the technology did not appear there until 1,500 years later. Likewise, when researchers asked whether ancient Egypt had a professional standing army, GPT-4 again incorrectly answered "yes" when the actual answer was no.

The study also revealed that the models performed especially poorly on certain regions, such as sub-Saharan Africa, suggesting that their training data may be biased. Study leader Peter Turchin noted that these results show that, in some domains, LLMs are still no substitute for humans.