Yup. The robots.txt file is not only meant to block robots from accessing the site, it's also meant to block bots from accessing resources that are not interesting for human readers, even indirectly.
For example, MediaWiki installations are pretty clever in that by default, /w/ is blocked and /wiki/ is encouraged, because nobody wants technical pages and wiki histories in search results. They only want the current versions of the pages.
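For anyone curious what that split looks like to a crawler, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `wiki.example.org` host, and the `ExampleBot` user agent are all made up for illustration; the rules on a real wiki may well differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit of the setup described above:
# script URLs under /w/ are off limits, article URLs under /wiki/ are not.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Assumed example URLs: the readable article and its page history.
article = "https://wiki.example.org/wiki/Main_Page"
history = "https://wiki.example.org/w/index.php?title=Main_Page&action=history"

print(parser.can_fetch("ExampleBot", article))  # the article view may be crawled
print(parser.can_fetch("ExampleBot", history))  # the history view is disallowed
```

Keep in mind this only tells you what a polite bot is asked to do. Nothing here enforces it.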
Fun tidbit: in the late 1990s, there was a real epidemic of spammers scraping the web pages for email addresses. Some people developed wpoison.cgi, a script whose sole purpose was to generate garbage web pages with bogus email addresses. Real search engines ignored these, thanks to robots.txt. Guess what the spam bots did?
Do the AI bros really want to go there? Are they asking for model collapse?
Of course they want the model collapse. Literally no American tech company has been about reliably, sustainably supplying a good or service or stewarding some public good.
They’re doing the VC -> juice stock -> gut resources cycle. Nobody cares about the model.
Considering Reddit has decided to start selling user content for training, yeah, I guess they want their models to collapse. There’s so much bot-generated content nowadays.
The basic social contract of the web was to keep things accessible, including to bots. No one has the storage capacity to rip the entire web like all these jokers are pretending. Even Google merely indexes it, except for the most popular pages.
The thing ruining the social contract of the web is the profit motive of all these companies trying to convince people that they should be able to sell data that is otherwise publicly accessible, for the purposes of allowing bots to look at it. The bots can't memorize it.
Of course ChatGPT and the other AI companies ARE partially to blame: it seems they've poisoned the pot by giving their AIs continual access to the training sets and/or even the broader internet on the backend without making this clear to users, allowing journalists to claim that these AIs have somehow memorized petabytes of data into a few gigabytes. That is an ABSURD, basically impossible compression ratio to anyone with even the slightest comprehension of the topic.
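To put rough numbers on that ratio, here's a back-of-the-envelope sketch. The sizes are illustrative assumptions picked only to match the units used above (1 petabyte in, 5 gigabytes out). They are not measurements of any real dataset or model.

```python
# Decimal (SI) byte units.
GIGABYTE = 10**9
PETABYTE = 10**15


def compression_ratio(source_bytes: int, stored_bytes: int) -> float:
    """How many bytes of source each stored byte would have to stand in for."""
    return source_bytes / stored_bytes


# Assumed, illustrative sizes only: 1 PB of text squeezed into 5 GB.
ratio = compression_ratio(1 * PETABYTE, 5 * GIGABYTE)
print(f"{ratio:,.0f} : 1")
```

Under those assumptions, every stored byte would have to account for a couple hundred thousand bytes of source material.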
No, your random article you tricked ChatGPT into spitting out is not worth memorizing, not even to the lie- and hallucination-prone AI chatbots we have available to prod for free or otherwise. Oh, you paid for it, and your complaint is that it's spitting out accurate information? YOU'RE PAYING FOR THEM TO HOST THE CHATBOT FOR YOU AND PROVIDE IT ACCESS TO INFORMATION IT WOULD OTHERWISE NOT HAVE ACCESS TO ON THE BACK-END.
By all means, sue the companies into paying for their data, and force them to divulge the datasets they keep on hand so that they can be charged for the information in them, but stop pretending the AIs themselves contain copies of it, or that it's impossible to make them pay ex post facto (which is how the ENTIRETY of the rest of our legal system and enforcement works) ...
AND PEOPLE, stop letting all these companies trick you into thinking that this is a valid excuse to further lock down the web, or that you must poison your fanart with methods that WILL be bypassed. It's just another potential expense and technical burden these companies want you to believe you must bear rather than sticking to the things you enjoy and/or that put food on your table.
As I always write, trying to restrict AI training on the grounds of copyright will only backfire. The sad truth is that malicious parties (dictatorships) will get more training materials because they won't abide by the rules. The end result is that dictators would outperform democracies in future generations of AI, unless we treat AI training like human reading.
"The bad guys will do it anyway so we need to do it, too" is the worst kind of fatalism. That kind of logic can be used to justify any number of heinous acts, and I refuse to live in a world where the worst of us are allowed to drag down the rest of us.
But if we make it illegal to train AI on copyrighted material without permission, it will hamper open-source models while not affecting closed-source ones, because they could just buy the data from big social media conglomerates.
"Bad guys are going to do bad things, so we shouldn't even bother trying to do anything to make things better, and just let the dystopia happen" is not the answer
🤖 I'm a bot that provides automatic summaries for articles:
If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.
AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.
In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities.
You might build a totally innocent one to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find.
The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.
“We recognize that existing web publisher controls were developed before new AI and research use cases,” Google’s VP of trust Danielle Romain wrote last year.
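For context on what "blocked GPTBot in their robots.txt file" means in practice, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `news.example.com` URL, and the `ExampleSearchBot` user agent are assumptions made up for illustration. Only the `GPTBot` name and the 606-of-1,156 figure come from the summary above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher that opts out of GPTBot only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Assumed example URL on a made-up news site.
url = "https://news.example.com/some-article"

gptbot_allowed = parser.can_fetch("GPTBot", url)           # asked to stay out
other_allowed = parser.can_fetch("ExampleSearchBot", url)  # falls under the * rule

# Share of surveyed publishers blocking GPTBot, from the figures quoted above.
blocked, surveyed = 606, 1156
share = round(100 * blocked / surveyed, 1)

print(gptbot_allowed, other_allowed, f"{share}%")
```

As with any robots.txt rule, this is a request that a crawler can choose to honor, not an access control.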