GPT-5 is not far away! OpenAI launches the web crawler GPTBot, which automatically scrapes website data and can be blocked by site owners

2023-08-08 06:14:41

Edit: Peach is so sleepy

Source: Xinzhiyuan

Guide: Just now, OpenAI launched GPTBot - a web crawler that can automatically grab data from the entire Internet. The resulting data will be used to train AI models like GPT-4 and GPT-5!

Not long ago, the scraping of platform user data caused an uproar, with Reddit users arguing heatedly over it.

Today, OpenAI launched a web crawler tool GPTBot, which can automatically scrape website data.

How is it used?

OpenAI says in its published documentation that the crawler filters out sources that require paid access, and also removes personally identifiable information (PII) and text that violates its policies.

The data captured by GPTBot is used to train GPT-4 or GPT-5, which can improve the accuracy and capabilities of future artificial intelligence systems.

The crawler can be identified by the following user agent token and string:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +
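
To illustrate how a site might recognise the crawler from the token above, here is a minimal Python sketch. The function name `is_gptbot` is hypothetical, and the check is only a case-insensitive substring match on the request's User-Agent header. This is an assumption about how one might do it, not a method described by OpenAI.

```python
def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token.

    Hypothetical helper: a plain, case-insensitive substring match.
    """
    return "gptbot" in user_agent.lower()
```

A User-Agent header can be set to anything by the client, so a match only shows that a request claims to come from GPTBot.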

Blocking GPTBot

Site owners can block GPTBot from accessing a website by adding a rule for it to the site's robots.txt file.

In other words, the scheme is opt-out: website owners must take action themselves to stop OpenAI from accessing their sites and using their data for training.

User-agent: GPTBot
Disallow: /

Custom GPTBot Access

You can also allow GPTBot to access only some parts of a website with the following rules.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
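
One way to check what such robots.txt rules actually permit is to feed them to the parser in Python's standard library. The sketch below is only an illustration: the helper `allowed`, the constants `BLOCK_ALL` and `PARTIAL`, and the example.com URLs are all hypothetical. It also shows how `urllib.robotparser` interprets the rules, which is not necessarily how GPTBot itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets quoted in the article, one directive per line.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(agent, url)
```

For example, `allowed(PARTIAL, "GPTBot", "https://example.com/directory-2/page.html")` asks whether GPTBot may fetch a page under the disallowed directory, while the same call with `BLOCK_ALL` and a different agent name shows that the rule applies to GPTBot only.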

IP egress ranges

OpenAI's crawler makes its requests to websites from a block of IP addresses documented on the OpenAI website.
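
Because a user agent string can be forged, a site could also compare the source address of a request with the published ranges. The following Python sketch shows one way to do that with the standard `ipaddress` module. The function `ip_in_ranges` is hypothetical, and `EXAMPLE_RANGES` holds placeholder blocks reserved for documentation (RFC 5737), not OpenAI's real ranges, which would have to be taken from OpenAI's own page.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDR blocks reserved for documentation (RFC 5737).
# These are NOT OpenAI's ranges; substitute the published ones.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def ip_in_ranges(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)
```

A request would then be treated as coming from the crawler only if both the user agent token and the source address match.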

Heated discussion among netizens

OpenAI's move has triggered discussions among netizens on the ethical issues of web crawlers used to train AI models.

“OpenAI does not even cite its sources in moderation. It is making a derivative work without citing it, thus obscuring the fact that it is one.”

Other netizens said that site owners finally have a way to stop OpenAI from scraping their web data to train its models.

Some also pointed out that ChatGPT's web browsing plug-in had been taken offline for some time, in part because it allowed access to content behind paywalls.

On July 18, OpenAI filed a trademark application for GPT-5 with the United States Patent and Trademark Office, suggesting that the company is training a more advanced AI system.

GPTBot will apparently help OpenAI gather more data from the internet to train the model.
