Amazon’s cloud division has launched an investigation into AI startup Perplexity. At issue is whether the AI search startup violated Amazon Web Services policies by scraping websites that tried to block it, WIRED has learned.
An AWS spokesperson, who spoke to WIRED on the condition of anonymity, confirmed the company’s investigation into Perplexity. WIRED previously reported that the startup, which has backing from Jeff Bezos’ family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that had blocked access via the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.
The Robots Exclusion Protocol is a decades-old web standard in which a plain text file (such as wired.com/robots.txt) is placed on a domain to indicate which pages should not be accessed by automated bots and crawlers. Although companies that use scrapers can choose to ignore the protocol, most have traditionally followed it. An Amazon spokesperson told WIRED that AWS customers must follow the robots.txt standard when crawling websites.
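To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard urllib.robotparser module. The robots.txt rules, the bot names (ExampleBot, OtherBot), and the example.com URLs are hypothetical and chosen only for illustration; they do not reproduce any real site's file or any crawler named in this story.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (illustrative only, not any real site's file).
# It bars a crawler called "ExampleBot" from the whole site and bars all
# other crawlers from the /private/ directory.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for url in ("https://example.com/story", "https://example.com/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in the protocol enforces the answer: a crawler that skips this check, or that identifies itself under a different name, can still request the page, which is why compliance is voluntary.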
“The AWS Terms of Service prohibits customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws,” a spokesperson said in a statement.
Perplexity’s practices came under scrutiny after a June 11 report from Forbes accused the startup of stealing at least one of its articles. WIRED investigations confirmed the practice and found further evidence of scraping abuse and plagiarism by systems associated with Perplexity’s AI-powered search chatbot. Engineers at Condé Nast, WIRED’s parent company, block Perplexity’s crawler on all of its websites using a robots.txt file. WIRED, however, determined that the company had access to a server using an unpublished IP address, 44.221.181.252, that visited Condé Nast properties at least hundreds of times over the past three months, apparently to scrape Condé Nast sites.
The Perplexity-linked machine appears to be engaged in extensive crawling of news sites that prohibit bots from accessing their content. Spokespeople for the Guardian, Forbes, and The New York Times also say they have repeatedly detected the IP address on their servers.
WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud (EC2) instance hosted on AWS, which launched an investigation after we asked whether using AWS infrastructure to scrape websites that prohibit it violates the company’s terms of service.
Last week, Perplexity CEO Aravind Srinivas first responded to WIRED’s investigation by saying that the questions we asked the company “reflect a deep and fundamental misunderstanding of how Perplexity and the Internet work.” Srinivas then told Fast Company that the secret IP address that WIRED observed scraping Condé Nast’s websites and a test site we created was operated by an outside company that performs web indexing and crawling services. He declined to name the company, citing a nondisclosure agreement. Asked if he would tell the outside company to stop indexing WIRED, Srinivas said, “It’s complicated.”
