Last week, reports emerged that the New York Times may take legal action against ChatGPT maker OpenAI, alleging that the company uses content published on the NYT's website, which is its intellectual property, to train AI models. While a lawsuit may not have been filed so far, the major news publisher has now decided to ban OpenAI's web crawler from viewing content on its website. The move means that the website's content cannot be used to train any of OpenAI's foundation models.
As per a report by The Verge, the NYT has blocked OpenAI's web crawler, GPTBot, from crawling and indexing the contents of its website. The report points to the publication's robots.txt file, which clearly shows that the bot has been disallowed. Snapshots from the Internet Archive's Wayback Machine, which lets users view webpages as they appeared on past dates, indicate that the bot was blocked on August 17.
This move comes after OpenAI gave website owners an “opt-out” option to stop their site's content from being used by the company to train its AI models. On August 7, the company explained that GPTBot can be blocked through a site's robots.txt file. Highlighting how crawled content is used, its blog post said, “Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies”.
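In practice, the opt-out OpenAI describes amounts to two lines in a site's robots.txt file. The snippet below is an illustrative example of that directive (`GPTBot` is OpenAI's documented user-agent name; the file contents shown here are an assumption for illustration, not a copy of the NYT's actual robots.txt):

```
User-agent: GPTBot
Disallow: /
```

The `Disallow: /` rule tells the named crawler that every path on the site is off-limits; a narrower path, such as `Disallow: /archive/`, would block only part of the site.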
For the unaware, a web crawler, also known as a web spider, is essentially a computer program that automatically fetches and indexes website content. It follows the URLs it finds on a website and collects the data on each page. Such web crawlers are being used by AI companies to gather the text on which their models are trained.
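A well-behaved crawler is expected to consult robots.txt before fetching any page. The minimal sketch below, using Python's standard-library `urllib.robotparser`, shows how a crawler would honor a GPTBot block; the robots.txt contents are an illustrative assumption, not the NYT's actual file:

```python
from urllib import robotparser

# Hypothetical robots.txt contents blocking OpenAI's crawler.
# "GPTBot" is OpenAI's documented user-agent name; the file itself
# is an assumption for this example.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks permission for its user agent before fetching.
url = "https://www.nytimes.com/section/world"
print(parser.can_fetch("GPTBot", url))        # False: GPTBot is disallowed
print(parser.can_fetch("SomeOtherBot", url))  # True: no rule applies to it
```

Nothing technically forces a crawler to obey these rules; robots.txt is a convention, which is why OpenAI's public commitment to honoring it matters here.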
Read more on tech.hindustantimes.com