Zuckerberg knew Meta was using pirated library data to train AI

Author: LoRA Time: 15 Jan 2025

Documents disclosed by Meta in a copyright class action lawsuit have recently surfaced, and the revelation that the company used a pirated e-book library called Library Genesis (LibGen) to train its latest AI model, Llama 3, has attracted widespread attention. The documents show that Meta engineers discussed the potential risks of using LibGen, a "shadow library," especially amid growing concerns about copyright and data ownership. Despite the risk of negative publicity, Meta CEO Mark Zuckerberg approved the decision.


At the request of the court, records of Meta's confidential internal conversations about the use of the LibGen data set were unsealed. The documents show that, in discussions with the AI research team, Meta executives described LibGen as a data set "we know to be pirated," yet agreed to use it to improve the performance of Llama 3. In an email, Meta's director of product management, Sony Theakanath, pointed out that although the decision to use LibGen carried public relations risks, other AI companies were also using similar data, which led Meta's team to feel that it was not alone in taking this path.

More worryingly, Meta staff also discussed how to process and filter text in LibGen to remove copyright markings such as ISBNs and copyright notices, which suggests that Meta may have been trying to conceal its use of unauthorized content. An internal memo said the materials provided by LibGen were "high quality and long-format, making them ideal for learning particularly specialized subjects."

In addition, Meta employees wrote in emails that torrenting directly from the company's IP addresses might be inappropriate, and they expressed concerns about the practice. However, with Zuckerberg "pushing from the top" to use the LibGen data set, the documents make clear how determined Meta was to win the AI race. The incident has once again drawn attention to, and raised doubts about, how large technology companies handle copyright.

The outcome of this copyright lawsuit may have important implications for other similar ongoing cases, particularly regarding the use of creative works such as images, music and literature. As technology companies’ demand for original content continues to increase, the rights of original content creators will become the focus of attention.
