AI Crawlers are ruining the internet as we know it

13 October 2025

AI crawlers are ruining the internet. Recently there has been a lot of talk about AI crawlers increasing hosting costs for organisations and ignoring traditional crawling directives, whilst also lowering the amount of traffic driven to the websites they harvest data from. It raises the question: why should we allow these AI crawlers to visit our websites when they ultimately ignore copyright law and monetise the information we have created? The current model is a relationship of all take and no give for most website owners. If I were a relationship expert I would be telling you to run a mile.

Hosting increases

The obvious initial impact on organisations is that the additional traffic to your website directly increases your hosting costs. There are ways to reduce this impact, such as having a good caching strategy and making use of a content delivery network (CDN), but if your website is not configured correctly these AI crawlers can make a lot of uncached requests, increasing the load on your hosting.

In my recent research I’ve found that AI crawlers love to follow links. A lot of websites still have faceted search pages that are implemented as links, allowing users to follow a link to display results filtered by the selected taxonomy term. If these filters are not set up in a way that limits the number of possible links, your website is opened up to many more uncached URL combinations.

If left unchecked, the additional traffic not only impacts the performance and uptime of your website, but is also likely to increase your hosting bill. In the examples I’ve seen, the increase was around 20%.
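
As a rough illustration of how quickly faceted links multiply, here is a minimal Python sketch. It assumes every facet term can be switched on or off independently and ignores parameter ordering (which would make the numbers worse). The facet names and term counts are made up for the example, not taken from a real site.

```python
from math import comb


def unrestricted_facet_urls(term_counts: dict[str, int]) -> int:
    """Count filter URLs when any non-empty set of terms can be combined."""
    total_terms = sum(term_counts.values())
    # Each term is either selected or not; subtract the unfiltered page.
    return 2 ** total_terms - 1


def capped_facet_urls(term_counts: dict[str, int], max_active: int) -> int:
    """Count filter URLs when at most `max_active` terms can be combined."""
    total_terms = sum(term_counts.values())
    upper = min(max_active, total_terms)
    return sum(comb(total_terms, k) for k in range(1, upper + 1))


# Hypothetical facets: 10 topics, 5 content types, 8 years.
facets = {"topic": 10, "type": 5, "year": 8}

print(unrestricted_facet_urls(facets))  # 8388607
print(capped_facet_urls(facets, 2))     # 276
```

Under these assumptions, capping the number of combinable filters turns millions of crawlable URLs into a few hundred, which is the kind of reduction that keeps the cache useful.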

Ignoring traditional crawling directives

Another issue is that a lot of these AI crawlers ignore the traditional controls that were put in place for search engines to make sure your website did not become overwhelmed by crawlers. Even traditional search engine crawlers ignore a lot of these now.

The Crawl-Delay directive in your robots.txt file was initially put in place to protect your website from being flooded by requests, yet it is ignored by search engines such as Google. From what I have seen in my research, the AI crawlers ignore it too.
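
For reference, this is roughly what honouring the directive looks like from a well-behaved crawler's side. It is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content and the `ExampleBot` name are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to wait 10 seconds
# between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def polite_delay(rules: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Seconds a compliant crawler should wait between requests."""
    delay = rules.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


rules = load_rules(ROBOTS_TXT)
print(polite_delay(rules, "ExampleBot"))  # 10.0
```

Nothing on the server enforces this value. The crawler has to read it and choose to sleep between requests, which is why the directive offers no protection when it is ignored.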

Even the nofollow directive on links is ignored by major website crawlers, including AI crawlers. I’ve been working on a website which had search filters set up using facets displayed as links. The crawlers loved these, and they were the source of a lot of outages caused by the servers being overloaded with requests. I had to temporarily block all requests with f[0] in their query string from specific countries to keep the websites online until I could deploy a fix, which isn’t ideal as it also blocked some legitimate traffic and search engine crawlers.
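
The logic of that kind of temporary block can be sketched in Python as below. This is an illustration only, not the actual firewall rule: the function names are hypothetical, the country code is assumed to be supplied by your CDN or WAF, and "XX" and "YY" are placeholders rather than real countries.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Matches facet parameter names such as f[0], f[1], f[12].
FACET_KEY = re.compile(r"^f\[\d+\]$")


def facet_params(url: str) -> list[tuple[str, str]]:
    """Return the facet parameters found in a URL's query string."""
    query = urlsplit(url).query
    # parse_qsl decodes percent-encoding, so f%5B0%5D is seen as f[0].
    pairs = parse_qsl(query, keep_blank_values=True)
    return [(key, value) for key, value in pairs if FACET_KEY.match(key)]


def should_block(url: str, country: str, blocked_countries: set[str]) -> bool:
    """Block faceted requests, but only from the listed countries."""
    return country in blocked_countries and bool(facet_params(url))


url = "https://example.com/search?f%5B0%5D=topic%3Anews&page=2"
print(facet_params(url))                      # [('f[0]', 'topic:news')]
print(should_block(url, "XX", {"XX", "YY"}))  # True
print(should_block(url, "ZZ", {"XX", "YY"}))  # False
```

Because the rule keys on country rather than on who is making the request, it cannot tell a real visitor from a crawler, which is where the blocked legitimate traffic comes from.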

Cloudflare reported some crawlers altering their user agents to bypass the no-crawl directive in robots.txt. This isn’t the first time it’s been reported that AI crawlers are ignoring the limits websites place on them: over a year ago, in June 2024, it was reported that AI crawlers were ignoring robots.txt exclusions and scraping content regardless. This behaviour makes it feel like the wild west, where the accepted norms of the web are being ignored in the name of advancing technology and making one product better than its competitors.
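
To show why changing the user agent is enough to sidestep a robots.txt exclusion, here is a small sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the rules are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes one named AI crawler entirely.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# The crawler announcing its real name is told to stay away.
honest = rules.can_fetch("ExampleAIBot/1.0", "/articles/my-post")

# The same crawler presenting a browser-like user agent matches no rule,
# so the exclusion no longer applies to it.
disguised = rules.can_fetch("Mozilla/5.0 (X11; Linux x86_64)", "/articles/my-post")

print(honest)     # False
print(disguised)  # True
```

The exclusion is matched purely on the name the client declares, so it only restrains crawlers that identify themselves truthfully and choose to comply.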

What benefits does allowing AI crawlers to harvest my website's data bring?

Let’s be honest, people are lazy. We will use the tool that makes our life easier and gets us the information with the least amount of effort. This means that AI tools are not going away, and if we want our brands to stay relevant, we all need to lean in to making our websites easier for AI crawlers to consume.

Now I’ll admit, I’m not an expert in this area, but for brands that have products to sell there are a lot of opportunities to be had. As AI chatbots become another interface that people use in their purchasing pipeline, the information that large language models (LLMs) suggest to their users has a lot of value. If you’re able to get your product suggested to the user, how much additional research are they going to do to compare products? The likelihood is they’re going to ask the chatbot for other options and to compare the different recommendations. In order to come out on top you’re going to need enough information about your products, user-generated reviews and product comparisons on more than just your own website.

Unless there is going to be a transaction at the end of the AI crawlers harvesting your data, I’m struggling to see the benefit of the extra traffic to your website as once the data has been harvested, the user is rarely directed back to the original source of the information.

How do we win?

Realistically there need to be more controls in place, and AI crawlers need to be more mindful of organisations’ websites and copyright. We need to be able to own our data, and I’m in two minds about whether to allow AI crawlers to access this site at the moment.

How do we get that control back?

There are a few methods, depending on how much you would like to restrict these bots’ access to your content. The first method is to enforce rate limiting in your Web Application Firewall (WAF). This may be Cloudflare, Amazon’s Web Application Firewall or another firewall, depending on your hosting set up. If you do enforce rate limiting, I would recommend applying the limit per individual IP address and responding with a method such as CAPTCHA or Challenge, so that real users can still access your website.
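
The idea behind per-IP rate limiting with a challenge can be sketched in a few lines of Python. This is a simplified sliding-window model, not how any particular WAF implements it. The class name, the thresholds and the "allow"/"challenge" return values are assumptions for the example, and the IP addresses come from the documentation-only ranges.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter that challenges an IP once it exceeds the limit."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def check(self, ip: str, now: float) -> str:
        """Return "allow" or "challenge" for a request from `ip` at time `now`."""
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Over the limit: serve a CAPTCHA or challenge instead of the page.
            return "challenge"
        hits.append(now)
        return "allow"


limiter = PerIpRateLimiter(max_requests=3, window_seconds=10.0)
results = [limiter.check("203.0.113.5", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(results)                               # ['allow', 'allow', 'allow', 'challenge']
print(limiter.check("198.51.100.7", 3.0))    # allow (a different IP has its own budget)
print(limiter.check("203.0.113.5", 12.5))    # allow (the earlier requests have expired)
```

The timestamp is passed in explicitly to keep the example deterministic; a real deployment would use the firewall's own clock and counters. Returning a challenge rather than a hard block is what lets a real person who trips the limit carry on browsing.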

The second method, if you use Cloudflare, is to change your settings to make AI crawlers pay to access your content. At the moment I would probably advise against enabling this until more is known about the effects it would have on your rankings.

The last method is to implement an LLMs.txt file. The LLMs.txt file is a combination of a robots.txt file and a sitemap.xml file, directing AI crawlers to content that has been written in Markdown. The reason it is in Markdown is that it is much easier for the crawlers to parse and understand than traditional HTML. That being said, the LLMs.txt format is still very new, and the access logs of a large website I maintain show there have only been 6 calls to this URL over a 4 week period. At the moment it doesn’t appear to be widely used, but it’s certainly something to keep in mind to stay at the forefront.
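
As a minimal sketch of what such a file could contain, the Python below builds one following the layout in the public llms.txt proposal as I understand it: a title, a short summary in a blockquote, then sections of Markdown links. The function name, site name and URLs are placeholders for the example, and the exact format may change as the proposal evolves.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Build llms.txt content from sections of (name, url, note) links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{name}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content, pointing at Markdown versions of each page.
content = build_llms_txt(
    title="Example Site",
    summary="Articles about web development and hosting.",
    sections={
        "Articles": [
            ("Caching guide", "https://example.com/caching.md", "How the site is cached"),
            ("About", "https://example.com/about.md", ""),
        ],
    },
)
print(content)
```

The output would be served as plain text from the root of the site, alongside robots.txt and sitemap.xml.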

In reality there is no way to fully win at the moment and the industry is moving at an extremely fast pace, but there are a lot of ethical considerations that need to be resolved before the AI overlords and content creators are on an equal footing. 

Disclaimer: I do use AI in my day-to-day work to research and generate code to speed up development, but I don’t agree with how the data has been harvested.