The open-source dataset used to train LLaMA has been taken down: it contains nearly 200,000 books and was built to rival OpenAI's book data
Original source: Qubit
An open-source dataset has been taken down over copyright infringement.
Models such as LLaMA and GPT-J were trained on it.
Now the website that hosted it for three years has deleted all related content overnight.
The dataset is Books3, a collection of nearly 200,000 books totaling roughly 37 GB.
The Books3 page on the hosting platform now returns a 404.
The dataset's original creator lamented that the removal of Books3 is a tragedy for the open-source community.
**What is Books3?**
Books3 was released in 2020 by AI developer Shawn Presser and was included in EleutherAI's open-source dataset The Pile.
It contains 197,000 books in total, including every book from the shadow-library site Bibliotik, and was intended to match OpenAI's book data while remaining open source.
This is where the name Books3 comes from—
After GPT-3 was released, OpenAI disclosed that 15% of its training data came from two e-book corpora named "Books1" and "Books2", but their specific contents were never revealed.
Models such as LLaMA, which took off this year, and EleutherAI's GPT-J were trained on Books3.
Book data has long been a core corpus for large-model pre-training, because it gives the model coherent long-form text to learn from, which helps it produce high-quality long outputs.
The book datasets used by many AI giants are not open source and are often quite opaque; in the case of Books1/2, most of what is "known" about their sources and scale is speculation.
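To make the role of book data concrete, here is a minimal, hypothetical Python sketch (not the actual Books3 or Pile pipeline) of how a single book might be cut into fixed-length token sequences for pre-training. The whitespace split stands in for a real subword tokenizer, and the sequence length of 2048 is just an assumed context window.

```python
# Minimal sketch: preparing book-length text for LLM pre-training.
# The whitespace split is a stand-in for a real subword tokenizer (e.g. BPE),
# and seq_len=2048 is an assumed context-window size.

def chunk_book(text: str, seq_len: int = 2048) -> list[list[str]]:
    """Split one book into contiguous fixed-length token sequences.

    Long, coherent documents like books yield training examples that fill
    an entire context window, which is why book corpora such as Books3
    matter for long-form generation quality.
    """
    tokens = text.split()  # stand-in for real tokenization
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]


if __name__ == "__main__":
    fake_book = "word " * 5000  # placeholder text; a real book is far longer
    chunks = chunk_book(fake_book, seq_len=2048)
    print(f"{len(chunks)} training sequences of up to 2048 tokens each")
```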
For easier access, Books3 was hosted on The Eye, a platform that archives information and scrapes public data.
That platform is where the takedown happened: the Danish anti-piracy group Rights Alliance asked The Eye to remove the dataset, and the request was granted.
The good news is that Books3 has not disappeared completely; there are still other ways to obtain it.
Backups remain on the Wayback Machine, and the dataset can still be downloaded via torrent.
The dataset's author has shared several methods on Twitter.
"Without Books3, you can't do your own ChatGPT"
The dataset's author has plenty to say about the takedown.
He argues that the only way to build a model like ChatGPT is to create a dataset like Books3.
In his view, ChatGPT is like a personal website in the '90s: what matters is that anyone can build one.
However, since a large part of Books3 comes from pirated sources, he also hopes someone will eventually build a better dataset, one that improves data quality while respecting book copyrights.
More than a month ago, two full-time authors sued OpenAI for using their works to train ChatGPT without permission.
That lawsuit alleges that OpenAI's Books2 dataset drew a large amount of data from shadow libraries (piracy sites).
Some have joked that AI has brought not only new technological breakthroughs but also new work for anti-piracy organizations.