
Research reveals: It only takes 0.001% of false data to render an AI model ineffective

Author: LoRA | Time: 15 Jan 2025

Recently, a research team from New York University published a study revealing a vulnerability of large language models (LLMs) during training. They found that even a tiny amount of false information, accounting for only 0.001% of the training data, can cause significant errors across the entire model. This finding is of particular concern for the medical field, where misinformation can directly impact patient safety.


In a paper published in the journal Nature Medicine, the researchers pointed out that although LLMs perform well, a model whose training data has been injected with false information can still score just as well as an unaffected model on common open-source evaluation benchmarks. This means that routine testing may fail to detect the hidden risks in these models.

To test this, the research team ran experiments on a training dataset called "The Pile", into which they deliberately injected 150,000 AI-generated fake medical articles. Generating this content took only 24 hours, and the study showed that replacing just 0.001% of the dataset's content (roughly 1 million training tokens) led to a 4.8% increase in harmful output. The process is also extremely cheap, costing only about $5.

This data poisoning attack does not require direct access to the model's weights: an attacker can weaken an LLM simply by publishing harmful information on the web, where it may be scraped into training data. The research team emphasizes that this finding highlights significant risks in using AI tools in the medical field. They also note that related cases already exist: some AI medical platforms, such as MyChart, have been reported to generate incorrect information when automatically responding to patient questions, causing problems for patients.
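Because the attack enters through the data-ingestion step, one natural mitigation is to screen documents before they reach training. The following is a hypothetical sketch only, not the paper's method: the flagged-phrase list and threshold are invented for illustration, and real pipelines would use trained classifiers or knowledge-based checks rather than keyword matching.

```python
# Hypothetical pre-training data-curation filter (illustrative only).
# FLAGGED_TERMS and the threshold are invented for this sketch.

FLAGGED_TERMS = {"miracle cure", "vaccines cause", "secret remedy"}

def is_suspect(document: str, max_hits: int = 0) -> bool:
    """Flag a document containing more than max_hits flagged phrases."""
    text = document.lower()
    hits = sum(term in text for term in FLAGGED_TERMS)
    return hits > max_hits

corpus = [
    "Randomized trials show the vaccine reduced severe illness.",
    "This miracle cure is the secret remedy doctors hide from you.",
]
clean = [doc for doc in corpus if not is_suspect(doc)]
print(len(clean))  # 1
```

A keyword filter like this is easy to evade, which is part of why the study's authors argue that benchmark scores alone cannot certify a medical model as safe.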

Therefore, the researchers call on AI developers and medical providers to clearly recognize this vulnerability when developing medical LLMs. They recommend that LLMs not be used for critical tasks such as diagnosis or treatment until their safety can be ensured.
