It’s too early to tell how the wave of deals between AI companies and publishers will play out. However, OpenAI has already scored one clear victory: its web crawlers are not being blocked by leading news outlets at the rate they once were.
The generative AI boom sparked a data gold rush, and then a data-protection rush (at least for news outlets), in which publishers sought to block AI crawlers and keep their work from becoming training data without consent. When Apple debuted a new AI agent this summer, for example, a slew of leading news outlets swiftly opted out of Apple's web scraping using the Robots Exclusion Protocol, or robots.txt, the file that lets webmasters control which bots may access their sites. There are so many new AI bots on the scene that keeping up with them can feel like playing whack-a-mole.
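As a concrete illustration, the opt-out is just a few lines in a site's robots.txt file. The sketch below uses OpenAI's documented GPTBot token and Applebot-Extended, the token Apple introduced for AI-training opt-outs; the rules themselves are a hypothetical example, not any particular outlet's file:

```
# Hypothetical robots.txt for a news site opting out of AI training crawlers.

User-agent: GPTBot             # OpenAI's crawler
Disallow: /                    # disallow the entire site

User-agent: Applebot-Extended  # Apple's AI-training opt-out token
Disallow: /

User-agent: *                  # all other bots (e.g., search crawlers)
Allow: /
```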
OpenAI's GPTBot has the most name recognition and is blocked more often than competitors such as Google AI. The number of leading media websites using robots.txt to "disallow" GPTBot rose dramatically from its launch in August 2023 until that fall, then climbed steadily (but more gradually) from November 2023 to April 2024, according to an analysis of 1,000 popular news sites by Originality AI, an AI-detection startup based in Ontario. At its peak, just over a third of the sites blocked GPTBot; that share has since fallen to about a quarter. Within a smaller group of the most prominent news outlets, the block rate is still above 50 percent, but it is down from heights of nearly 90 percent earlier this year.
In May, though, after Dotdash Meredith announced a licensing deal with OpenAI, that number dipped significantly. It dipped again in late May, when Vox announced its own arrangement, and once more in August, when WIRED's parent company, Condé Nast, struck a deal. The trend toward increased blocking appears to be over, at least for now.
These declines make obvious sense. When companies enter into partnerships and consent to having their data used, they no longer have an incentive to barricade it, so it follows that they would update their robots.txt files to permit crawling; make enough deals, and the overall percentage of sites blocking crawlers will almost certainly drop. Some outlets, such as The Atlantic, unblocked OpenAI's bots the same day they announced a deal. Others took anywhere from a few days to a few weeks, such as Vox, which announced its partnership in late May but did not unblock GPTBot on its properties until late June.
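Unblocking is an equally small edit. A minimal before-and-after sketch, assuming a site that had fully disallowed GPTBot (actual files vary by outlet):

```
# Before a licensing deal:
User-agent: GPTBot
Disallow: /

# After the deal, the group is deleted or relaxed:
User-agent: GPTBot
Allow: /
```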
The robots.txt file is not legally binding, but it has long functioned as the standard governing web-crawler behavior, and for most of the internet's existence, people running websites have expected one another to abide by it. When a WIRED investigation earlier this summer found that the AI startup Perplexity was likely ignoring robots.txt commands, Amazon's cloud division launched an inquiry into whether Perplexity had violated its rules. Ignoring robots.txt is not a good look, which likely explains why so many prominent AI companies, including OpenAI, explicitly state that they use it to determine what to crawl. Originality AI CEO Jon Gillham believes this adds urgency to OpenAI's push to strike deals. "It's clear that they see being blocked as a threat to their future ambitions," says Gillham.
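For a sense of what "using robots.txt to determine what to crawl" looks like in practice, here is a minimal sketch of a compliant check using Python's standard-library robotparser. The domain and URLs are hypothetical, and real crawlers layer caching and rate limiting on top of a check like this:

```python
from urllib import robotparser

# Fetch and parse a site's robots.txt, as a well-behaved crawler would.
rp = robotparser.RobotFileParser()
rp.set_url("https://news.example.com/robots.txt")  # hypothetical site
rp.read()

# A compliant bot checks its own user-agent token before fetching each URL.
for url in ["https://news.example.com/2024/08/some-article",
            "https://news.example.com/about"]:
    if rp.can_fetch("GPTBot", url):
        print(f"allowed: {url}")
    else:
        print(f"blocked by robots.txt: {url}")
```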