robots.txt: Is This Standard Soon to be a Thing of the Past?

AI Magazine speaks exclusively with Julius Cerniauskas, CEO of Oxylabs.
Oxylabs CEO Julius Cerniauskas explains the impact of robots.txt on the AI industry and how the standard fares in the era of intelligent machines

As AI continues to evolve, particularly within a business context, how the technology interacts with the web is becoming more complex. Notably, challenges are emerging in how AI systems treat protocols such as robots.txt.

The Robots Exclusion Protocol (REP), or “robots.txt,” is a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

In considering this, AI Magazine speaks exclusively with Julius Cerniauskas, CEO of web data gathering company Oxylabs, about robots.txt and its impact on the AI industry.

“robots.txt has provided some guidance for bot internet access for the last 30 years,” he explains. “Since 1994, when Martijn Koster came up with the idea, robots.txt has acted as a machine-readable way to inform robots where they may wander (for indexing, data gathering, or other purposes) and which pages or sites should be avoided.”
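The mechanism itself is deliberately lightweight: a site publishes a plain-text file at /robots.txt, and each visiting bot is expected to read it and honour whichever rules match its user-agent string. As a rough sketch of that exchange, using Python’s standard-library urllib.robotparser with hypothetical bot names and paths, a well-behaved crawler would check the file before requesting a page:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; the bot names and paths are hypothetical.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks before fetching.
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))     # False: barred from the whole site
print(parser.can_fetch("SearchIndexer", "https://example.com/article"))    # True: only /private/ is off limits
print(parser.can_fetch("SearchIndexer", "https://example.com/private/x"))  # False
```

Nothing in that exchange is enforced: the file is a request, and it is up to each crawler’s operator to respect it.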

What is robots.txt and what is its impact on the AI industry?

“Though robots.txt has no legal power, some have accepted and promoted it, while others see it as something like instructions on how to act in public spaces. Such instructions are criticised for being too easy to apply unreasonably, without properly balancing legitimate business and public interests.

An example of robots.txt (Image credit: Cloudflare)

“The AI ‘arms race’ added urgency to the question of whether REP is the most feasible or beneficial solution, with some saying that trying to block AI crawlers might be a long-term disaster. Google, which planned to make robots.txt a widely accepted internet standard five years ago, recently issued a call for industry stakeholders to explore alternatives to the decades-old REP-based order.”

Does robots.txt still live up to its expectations?

“Robots.txt doesn’t live up to its expectations anymore. The question is whether those expectations were ever realistic. AI advancements depend on two main technical factors: computational power and data availability. If diverse data becomes unavailable, AI developers are forced to rely on synthetic datasets, which are ineffective for general-purpose model training.

“If millions of websites prohibited AI crawlers via robots.txt and expected this prohibition to be respected, the development of machine-learning-based AI would be stalled for years.

“It is important to note that the decision to ignore robots.txt when collecting public data isn’t new or unprecedented. The biggest internet archival project, the Internet Archive, has openly acknowledged that it does not always follow robots.txt, because doing so contradicts its public mission of preserving the internet as it is for future generations.

“Bots are an indigenous species of the internet; they might cause trouble, but if used responsibly, they perform beneficial functions. Without advancements in web scraping technologies, we wouldn’t have witnessed the recent AI boom. Who should reap the monetary benefits of machine-powered automation is, however, a different question.”

How can the fight between AI companies and public data sources be resolved?

“It would be naive to treat the ongoing battles between AI companies and public data sources as a fight over data ownership and fairness; first and foremost, it is a fight over the distribution of monetary gains. There is a widespread argument that the conflict can be resolved by establishing a compensation mechanism for those whose data is being used.

Julius Cerniauskas: “Robots.txt doesn’t live up to its expectations anymore.”

“The first broad steps to ensure compliance with copyright regulations and compensate copyright holders have already been taken. In the EU, the newly adopted AI Act lays down specific transparency requirements for general-purpose AI systems, obliging AI companies to provide ‘sufficiently detailed’ summaries of their training data. This obligation makes it easier for copyright holders to exercise and enforce their rights.

“Unfortunately, compensating all data creators and/or owners is simply impossible. The internet is a vast repository of public data, much of which lacks attributed rights and is subject to ongoing legal debate regarding copyright. For AI firms, it would be impossible to identify everyone who could even potentially claim copyright, let alone compensate them all.

“The escalating costs of such proactive compensation could render AI development unfeasible for smaller companies, thereby consolidating AI technology and its benefits in the hands of major tech players. 

“This situation could undermine the positive societal impact that AI might otherwise deliver.

“Major social media platforms, forums, and media publishers already act as gatekeepers, trying to paywall and monetise public data that often doesn’t even belong to them, since it is created by the millions of people who use those platforms. Whether it is fair to use this data for AI training should be answered by the fair use doctrine rather than by a Robots Exclusion Protocol regime that gives anyone on the internet a questionable right to make a one-sided decision to lock away public data using robots.txt.”

What does the future hold for robots.txt in the age of intelligent machines?

“AI technology raises questions we, as humans, haven’t answered yet — is there a difference between human and machine creativity? Can a machine be held responsible and liable? Where is the line between humans and creative machines? 

Julius Cerniauskas: “AI technology raises questions we, as humans, haven’t answered yet.” (Image credit: Oxylabs)

“Answering these questions will undoubtedly result in modifying decades-old legal regimes, such as copyright law and rules surrounding bot responsibility.

“The internet is already very different from what it was a few years ago: search engines are employing multiple AI-driven functionalities, while ChatGPT itself is turning into an alternative search engine. Web data is becoming the backbone of the digital economy, and there is no way to reverse the tide without killing major technological innovations.

“Robots.txt doesn’t have a mechanism to address the variety of bots and crawlers that travel the web today, nor can it control the many different AI use cases and potential ways in which data might be used. Fighting the AI revolution seems like Don Quixote’s quest, and so do attempts to save the decades-old text file.
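That limitation is structural. The protocol’s vocabulary covers only user-agent names and URL paths, so a rule binds only the bots a site operator thinks to name, says nothing about what fetched data may be used for, and depends entirely on voluntary compliance. A minimal sketch of the point, again using Python’s urllib.robotparser with hypothetical bot names:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: the site refuses one named AI crawler.
parser = RobotFileParser()
parser.parse("""\
User-agent: ExampleAIBot
Disallow: /
""".splitlines())

# The rule binds only the user-agent string it names; a bot the operator
# never thought to list matches no rule and is allowed by default, and
# nothing here constrains how fetched pages may be used.
print(parser.can_fetch("ExampleAIBot", "https://example.com/page"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))  # True
```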

“If we are to move into the next industrial revolution — the age of intelligent machines — we will have to rethink the main legal and social frameworks that set the ways in which the digital space is organised.”
