Amazon’s cloud division has launched an investigation into Perplexity AI over whether the AI search startup is violating Amazon Web Services rules by scraping websites that have tried to prevent such scraping, WIRED reports.
An AWS spokesperson, who spoke to WIRED on the condition of anonymity, confirmed that the company is looking into Perplexity. WIRED previously reported that the startup, which is backed by the Jeff Bezos family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that had forbidden such access through the Robots Exclusion Protocol, a common web standard. The protocol is not legally binding, but terms of service generally are.
The Robots Exclusion Protocol is a decades-old web standard in which a plain text file (e.g., wired.com/robots.txt) is placed on a domain to specify pages that automated bots and crawlers shouldn’t visit. Companies that use scrapers can choose to ignore the protocol, but most have traditionally respected it. An Amazon spokesperson told WIRED that AWS customers must follow the robots.txt standard when crawling websites.
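To illustrate how a well-behaved crawler might consult such a file before fetching a page, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot`, and the `example.com` URLs are hypothetical and chosen only for illustration; they are not taken from any real site's file or from any company's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# ExampleBot is barred from the whole site.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/story/some-article"))
# Other bots fall under the wildcard rule, which only blocks /private/.
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/story/some-article"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page"))
```

The check is purely advisory: nothing in the protocol stops a crawler from skipping it and requesting the page anyway, which is why compliance depends on the crawler's operator.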
“AWS’ terms of service prohibit customers from using our services for any illegal activity, and customers are responsible for complying with our terms and all applicable laws,” the spokesperson said in a statement.
Scrutiny of Perplexity’s practices follows a June 11 report from Forbes that accused the startup of stealing at least one of its articles. WIRED’s investigation confirmed the practice and found further evidence of scraping abuse and theft by a system linked to Perplexity’s AI-powered search chatbot. Engineers at WIRED’s parent company, Condé Nast, block Perplexity’s crawlers on all of its websites using robots.txt files. But WIRED found that the company had accessed its servers using an undisclosed IP address (44.221.181.252) and had visited Condé Nast sites at least hundreds of times in the past three months, apparently to scrape them.
Machines associated with Perplexity appear to be conducting extensive crawls of news websites that ban bots from accessing their content, and spokespeople for The Guardian, Forbes and The New York Times said they have also found the IP address on their servers multiple times.
WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud (EC2) instance hosted on AWS, which launched its investigation after we asked whether using AWS infrastructure to scrape websites that prohibit it violates the company’s terms of service.
Last week, Perplexity CEO Aravind Srinivas initially told WIRED that the questions posed to the company “reflect a deep, fundamental misunderstanding of how Perplexity and the Internet work.” He then told Fast Company that the covert IP address WIRED observed scraping Condé Nast’s websites and our test site was operated by a third-party company that provides web crawling and indexing services. He declined to name the company, citing non-disclosure agreements. Asked if he would tell the third-party company to stop crawling WIRED, Srinivas said, “It’s complicated.”