Open-source dataset used to train LLaMA taken down: it contains nearly 200,000 books and was built to rival OpenAI's datasets

巴比特_

2023-08-21 06:22:01

Original source: Qubit


An open-source dataset has been taken down over copyright infringement.

Models such as LLaMA and GPT-J were trained on it.

Now the website that hosted it for three years has deleted all related content overnight.

The dataset is Books3, a collection of nearly 200,000 books totaling nearly 37 GB.

A Danish anti-piracy organization said it had found 150 books by its members in the dataset, which constitutes infringement, and asked the platform to remove it.

The Books3 link on the platform now returns a 404 error.

The dataset's original developer said with resignation that the removal of Books3 is a tragedy for the open-source community.

What is Books3?

Books3 was released in 2020 by AI developer Shawn Presser and is included in the Pile, EleutherAI's open-source dataset.

It contains 197,000 books in total, including all the books from the piracy site Bibliotik. It was intended to match OpenAI's datasets while being open source.

This is where the name Books3 comes from:

When GPT-3 was released, OpenAI disclosed that 15% of its training data came from two e-book corpora named "Books1" and "Books2", but their contents have never been made public.

The open-source Books3 gave more projects a chance to compete with OpenAI.

For example, LLaMA, which took off this year, and EleutherAI's GPT-J both used Books3.

Book data has always been core corpus material for pre-training large models, because it gives a model a reference for producing high-quality long-form text.

The book datasets used by many AI giants are not open source, and some are outright opaque. For Books1 and Books2, for example, what is known about their source and scale is mostly outside speculation.

This is why open-source datasets matter so much to the AI community.

For easier access, Books3 was hosted on The Eye, a platform that archives information and public data.

It is this platform that has now taken the dataset down.

The Danish anti-piracy group Rights Alliance asked The Eye to remove it, and The Eye complied.

The good news is that Books3 has not disappeared completely; there are still other ways to get it.

Backups exist on the Wayback Machine, and it can also be downloaded with a torrent client.

The dataset's author listed several methods on Twitter.

"Without Books3, you can't do your own ChatGPT"

The dataset's author has a lot to say about the takedown.

He said that the only way to make a model like ChatGPT is to create a dataset like Books3.

Every for-profit company is quietly building its own datasets. Without Books3, only tech giants such as OpenAI would have access to this book data, and you would not be able to build your own ChatGPT.

In the author's view, ChatGPT is like a personal website in the 1990s: it is very important that anyone can make one.

However, since much of the Books3 data comes from piracy sites, the author also said he hopes someone will build a better dataset than Books3 in the future, one that both improves data quality and respects book copyrights.

OpenAI has faced a similar situation.

More than a month ago, two full-time authors sued OpenAI for using their works to train ChatGPT without permission.

The suit arose because OpenAI's Books2 dataset allegedly drew much of its data from shadow libraries (piracy websites).

Some have therefore joked that AI has brought not only new technological breakthroughs but also new work for anti-piracy organizations.
