Earlier this month, in response to mounting criticism around how OpenAI scoops up data to train ChatGPT, its groundbreaking chatbot, the company made it possible for websites to block it from scraping their content. A short piece of code would tell OpenAI to go away (and it would kindly obey).
Since then, hundreds of sites have shut the door. A Google search reveals many of them: Major online properties such as Amazon, Airbnb, Glassdoor and Quora have added the code to their “robots.txt” file, a kind of rules of engagement for the many bots — or spiders as they are also known — that scour the internet.
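For readers curious what that "short piece of code" looks like in practice, here is a minimal sketch, not taken from any of the sites named above. It assumes the crawler identifies itself with the user-agent token "GPTBot", the name OpenAI documents for its crawler; the article itself does not name it. The robots.txt contents, the example.com URL and the is_blocked helper are hypothetical, and the check uses Python's standard-library robots.txt parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. "GPTBot" is assumed to be the
# user-agent token of OpenAI's crawler; the article does not name it.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow user_agent from url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The named crawler is shut out; any other bot falls through to the "*" rule.
gptbot_blocked = is_blocked(ROBOTS_TXT, "GPTBot", "https://example.com/page")
other_blocked = is_blocked(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")
print(gptbot_blocked, other_blocked)
```

Note that nothing here enforces anything. The file only states a request, and it works only because the crawler chooses to honor it.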
When I got in touch with the companies, none were willing to discuss their reasoning, but it's quite obvious: They want to put a stop to OpenAI taking content that doesn't belong to it in order to train its artificial intelligence. Unfortunately, it's going to take a lot more than a line of code to stop that from happening.
Other online resources with the kind of data that an AI system would love have also moved to block the crawler: Furniture store Ikea, jobs site Indeed.com, vehicle comparison resource Kelley Blue Book, and BAILII, the UK's court records system, similar to the US's PACER (which doesn't appear to be blocking the bot).
Coding resources website Stack Overflow is blocking the crawler, but its rival GitHub is not, which is perhaps unsurprising given that GitHub's owner, Microsoft, is a major investor in OpenAI. And, as major media companies begin negotiating with (or possibly suing) the likes of OpenAI over access to their archives, many have also taken the step of blocking the bot. Research reported by Business Insider suggested 70 of the top 1,000 websites globally have added the code. We can expect that number to grow.