Keep track of how AI user agents are accessing and navigating your website
Itamar says: “In the age of AI platforms and LLMs, it's important to really understand how these different platforms, and more specifically user agents, are accessing and navigating your website.”
Which user agents are we talking about, then?
“Pretty much everything.
Part of this tip goes into log file analysis, where you're looking into the log files of your website to understand which types of user agents/IP addresses are accessing which pages of your website, and how frequently they do that.
The reason why this is important is because, in the age of AI platforms and all of these different LLMs, people are wondering, ‘Am I showing up? How are these LLMs talking about my brand on my website?’
It's really important to know what they're actually seeing and what types of pages they've managed to get to. That gives you a really good starting point to see whether or not you're in a position where these LLMs or AI platforms can even get to your website or get to the pages you want them to see.”
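To make that concrete, here is a minimal sketch of that first pass over a raw access log in Python. It assumes the common Apache/Nginx "combined" log format and a local file named access.log; both are assumptions, so adjust the pattern and path to match what your server actually writes.

```python
import re
from collections import defaultdict

# Assumes the Apache/Nginx "combined" log format; adjust the pattern for your server.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

pages_by_agent = defaultdict(set)  # user agent -> set of URLs it requested

with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE_RE.match(line)
        if m:
            pages_by_agent[m.group("agent")].add(m.group("path"))

for agent, paths in sorted(pages_by_agent.items(), key=lambda kv: -len(kv[1])):
    print(f"{len(paths):>6} unique URLs  {agent}")
```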
What additional data can looking at your log files provide, compared with regular analytics?
“Regular analytics really won't give you too much in terms of what types of user agents are going on your website and how frequently. The closest thing you can get is your crawl stats in Google Search Console, but that's specific to Googlebot.
What we're looking at is the entire range of user agents. This could be anything from traditional search engines like Google, Bing, and Yandex, to AI platforms and LLMs like ChatGPT.
You can see how frequently they're going onto your website and what pages they're seeing. If you're creating pages you hope to be surfaced on AI platforms, you first want to make sure that these user agents or platforms have actually accessed those pages.”
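As a rough illustration of that comparison, the sketch below tallies requests for a handful of publicly documented crawler tokens, traditional search bots alongside OpenAI's crawlers. The token list is only a starting point, user-agent strings can be spoofed, and the access.log path is again an assumption.

```python
from collections import Counter

# A few publicly documented crawler tokens; extend and verify against each vendor's docs.
# User-agent strings can be spoofed, so cross-check surprising traffic against published IP ranges.
BOT_TOKENS = {
    "googlebot": "Googlebot",
    "bingbot": "Bingbot",
    "yandex": "Yandex",
    "gptbot": "GPTBot (OpenAI)",
    "oai-searchbot": "OAI-SearchBot (OpenAI)",
    "chatgpt-user": "ChatGPT-User (OpenAI)",
}

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        fields = line.rstrip().split('"')
        if len(fields) < 6:
            continue
        agent = fields[-2].lower()  # the user agent is the last quoted field in the combined format
        for token, label in BOT_TOKENS.items():
            if token in agent:
                counts[label] += 1
                break

for label, n in counts.most_common():
    print(f"{label}: {n} requests")
```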
How do you get access to these log files, and how do you conduct that initial analysis?
“Usually, the log files will be accessible from the host that you have. Let's say you're a WordPress website that's hosted on a platform like SiteGround. You should be able to go into their different tools to extract those logs. You could also go into your host via FTP to navigate to them. It also varies depending on the CMS that you're using.
For example, Shopify isn't going to directly give you access to the logs, but if you're using Cloudflare, it acts as a proxy, and you can use something called Logflare to access those logs and that data in real time. It depends on your setup, but most of the time, you can get access via your host.
Once you have them, you can use a tool like the Screaming Frog Log File Analyser to input those logs. Ideally, you have multiple log files, one per day, covering the time period you're looking at. Then, when you put all of that together, you can see a timeline of the activity that certain IPs or user agents have had on your site and which pages they've accessed – and then you can dig deeper into the analysis from that point.”
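To give a feel for that timeline view, here is a sketch that aggregates a folder of daily log files into per-day request counts by user agent. The logs/access-*.log naming is hypothetical, and Screaming Frog's Log File Analyser does this aggregation for you; this is just the same idea in a few lines.

```python
import glob
import re
from collections import defaultdict

# Hypothetical layout: one combined-format log per day, e.g. logs/access-2024-05-01.log.
DAY_RE = re.compile(r'\[(?P<day>\d{2}/\w{3}/\d{4})')
AGENT_RE = re.compile(r'"(?P<agent>[^"]*)"\s*$')

timeline = defaultdict(lambda: defaultdict(int))  # day -> user agent -> request count

# Files named by date sort chronologically, so the timeline builds up in order.
for path in sorted(glob.glob("logs/access-*.log")):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            day_match, agent_match = DAY_RE.search(line), AGENT_RE.search(line)
            if day_match and agent_match:
                timeline[day_match.group("day")][agent_match.group("agent")] += 1

for day, agents in timeline.items():
    busiest = sorted(agents.items(), key=lambda kv: -kv[1])[:5]
    print(day, busiest)
```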
What types of user agents tend to be navigating through your site, and what are the signs that you should be concerned about?
“Signs for concern will be when you've got user agents that have not visited, say, 200 status code pages on your website that are quite important. If they haven't done that consistently over a period of time, you might say, ‘We need to improve our internal linking.’ Maybe backlink acquisition could help those pages and get them more to the forefront in terms of how different user agents or search bots crawl them.
That's part of what to do and things to look out for. The type of analysis you can get really varies depending on what your goal is, because the data that you're getting is very particular.
It's really just seeing all of the accessible URLs on your website, their different status codes, and then what types of IP addresses and user agents were able to access them.”
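A quick way to surface that kind of gap is to diff a list of priority URLs against what a specific agent has actually requested. The URL list and the gptbot token below are placeholders; swap in your own pages and whichever user agent matters for your strategy.

```python
# Hypothetical priority URLs; in practice, export these from your sitemap or CMS.
important_urls = {"/", "/pricing", "/guides/what-is-llm-visibility"}
target_token = "gptbot"  # substring of the user agent you care about

visited = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        fields = line.rstrip().split('"')
        if len(fields) < 6 or target_token not in fields[-2].lower():
            continue
        request = fields[1].split()  # e.g. ['GET', '/pricing', 'HTTP/1.1']
        if len(request) >= 2:
            visited.add(request[1])

missing = sorted(important_urls - visited)
print("Not requested by that agent in these logs:", missing or "none")
```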
What kind of time periods are you talking about there?
“That could be anything more than a couple of weeks, where you're consistently not seeing any visits from that user agent. It means that either it doesn't care enough to access that page, or it may not even be able to get there in the first place.
Now, the particular evaluations from that will vary depending on the user agent and what it's programmed to do, how it's meant to think, and its logic. On the whole, though, if you've got important pages on your website that aren't being visited by ChatGPT's user agent, then you've got a bit of a problem (if your whole strategy is around appearing within LLM searches).
Those are the sorts of things that you could be looking at and thinking about.”
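Putting a number on "a couple of weeks", the sketch below records the most recent visit a chosen agent made to each watched URL and flags anything older than 14 days. It assumes the combined log format; the watchlist, the gptbot token, and the two-week threshold are all assumptions to adjust.

```python
import re
from datetime import datetime, timezone

LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"\s*$'
)
watchlist = {"/pricing", "/guides/what-is-llm-visibility"}  # hypothetical important pages
target_token = "gptbot"
stale_after_days = 14

last_seen = {}
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE_RE.search(line)
        if not m or target_token not in m.group("agent").lower():
            continue
        path = m.group("path")
        if path in watchlist:
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            if path not in last_seen or ts > last_seen[path]:
                last_seen[path] = ts

now = datetime.now(timezone.utc)
for path in sorted(watchlist):
    seen = last_seen.get(path)
    if seen is None:
        print(f"{path}: never visited by that agent in these logs")
    elif (now - seen).days > stale_after_days:
        print(f"{path}: last visit was {(now - seen).days} days ago")
```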
Are different user agents looking for different things, and do we need to try and manage them differently?
“I am not sure what it is that they're looking at. I do know that some user agents (at least for sites that I've looked at within log files) are doing some weird things.
For example, Yandex seems to be accessing a lot of really random 301 pages very frequently, and I'm not really sure why. I don't know if we should call them anomalies, but it's quite interesting to get an idea of where you want to be visible and what the behaviour looks like from those types of user agents on your website.
That can give you enough to change up your strategy a little bit. For example, internal linking and backlink acquisition – or you could make some improvements to the content on those pages specifically.
There are a bunch of different things you could do there, but in terms of fully understanding how they work, that's really not an answer I could give.”
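To spot that kind of oddity yourself, for example a crawler repeatedly hitting redirects, a per-agent status-code breakdown is usually enough. This sketch groups requests by user agent and HTTP status, using the same illustrative bot tokens as above.

```python
from collections import Counter, defaultdict

BOT_TOKENS = ["googlebot", "bingbot", "yandex", "gptbot", "oai-searchbot", "chatgpt-user"]
status_by_bot = defaultdict(Counter)  # bot token -> Counter of HTTP status codes

with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        fields = line.rstrip().split('"')
        if len(fields) < 6:
            continue
        agent = fields[-2].lower()
        status_fields = fields[2].split()  # e.g. ['301', '1234']
        if not status_fields:
            continue
        for token in BOT_TOKENS:
            if token in agent:
                status_by_bot[token][status_fields[0]] += 1
                break

# A bot whose counts skew heavily towards 301s or 404s is worth a closer look.
for token, statuses in status_by_bot.items():
    print(token, dict(statuses.most_common()))
```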
Are there any resources for understanding how your strategy may have changed and how that might impact the way different agents crawl and interpret a website?
“I'm sure there are going to be resources out there. Something that you could do as a test yourself is some log file analysis. Get those log files and put them into a tool like Screaming Frog’s Log File Analyser. Then, if you’re specifically looking at ChatGPT's user agent, try and identify whether or not it has been visiting a particular URL for a while.
Then, you can do some of your own experiments and tests on that and see what that looks like after a couple of weeks or a month, to see if that's made any change in terms of them actually accessing these URLs. That's something a bit more practical that you could do, but I'm sure there are case studies out there that you could look at as well.”
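One lightweight way to run that kind of test is to count a single agent's requests to one URL before and after the date you made your change. Everything specific below (the URL, the change date, the gptbot token, and the one-log-per-day naming) is a placeholder for your own experiment.

```python
import glob
import re
from datetime import datetime

# Hypothetical experiment: did this agent start requesting the URL after our change on 1 June?
target_path = "/guides/what-is-llm-visibility"
target_token = "gptbot"
change_date = datetime(2024, 6, 1)

LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4})[^\]]*\] "\S+ (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"\s*$'
)
before = after = 0

for log_path in glob.glob("logs/access-*.log"):  # hypothetical one-file-per-day layout
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m or m.group("path") != target_path:
                continue
            if target_token not in m.group("agent").lower():
                continue
            day = datetime.strptime(m.group("day"), "%d/%b/%Y")
            if day < change_date:
                before += 1
            else:
                after += 1

print(f"{target_token} requests to {target_path}: {before} before the change, {after} after")
```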
Can this help you identify pages that may be perceived as more important or more likely to be crawled in the future, and therefore decide which pages to optimize?
“Absolutely. There are always going to be more important pages of your website, by default. Your homepage, for example, is probably the most important page.
It's definitely a case of understanding what is deemed to be more important in terms of how frequently these user agents are looking at these pages. There's so much analysis that you can do, but it can give you a way to prioritise – which is obviously important when you've got a really big site with thousands or hundreds of thousands of URLs.”
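As one simple prioritisation view, this sketch ranks URLs by how often AI-related agents requested them, using the same illustrative OpenAI crawler tokens as before; comparing the result against your full URL inventory shows which pages are getting no attention at all.

```python
from collections import Counter

AI_TOKENS = ["gptbot", "oai-searchbot", "chatgpt-user"]  # illustrative OpenAI crawler tokens
hits_by_url = Counter()

with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        fields = line.rstrip().split('"')
        if len(fields) < 6:
            continue
        agent = fields[-2].lower()
        if any(token in agent for token in AI_TOKENS):
            request = fields[1].split()  # e.g. ['GET', '/pricing', 'HTTP/1.1']
            if len(request) >= 2:
                hits_by_url[request[1]] += 1

# The most requested URLs are the ones these agents return to most often;
# compare the full counter against your sitemap to find pages with zero hits.
for url, n in hits_by_url.most_common(10):
    print(f"{n:>5}  {url}")
```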
How often should you be analysing your log files?
“You're getting your log files every day, so they're always going to be unique for each day. It really is just constantly having a way to download them every day. Then, over time, you can import them in bulk, for stretches of 30 days, 60 days, or however long you want. Then, you can perform these analyses.
It's just important to have that data and have those logs. If you're downloading one a day, that should be enough.”
Do you recommend analysing this regularly, setting priorities for which pages to optimize, then liaising with the content production team to produce content for those pages?
“Absolutely. That could definitely be a workflow that you have with it.
It's not something to look into too religiously, but if you at least have a certain cadence that you've set out to be analysing your log files, that can give you good data to reinforce your thinking about how important certain types of pages are, and what you should do as part of your overall strategy – which obviously can involve content as well.”
Would this have any impact on your social media marketing strategy, or is it just SEO-oriented?
“That's a good question. I would assume that social media could be more of an indirect factor that can impact perceived importance, or at least the routes that different platforms use to access these pages. LLMs, for example, can use social media sites to extract data and then follow links through those. It could be an indirect factor with that.
You have to really think about your business, your site, and what's important. Try and dig deeper into which are the most important commercial pages or what information you want to get across, to make sure it is being found by all of the user agents that are most relevant to you.
If you're in a niche where it's quite informational, there might be a lot of people searching for information with LLMs. At that point, you want to make sure that your informational content is prevalent and is being accessed frequently by LLM user agents like ChatGPT.”
Would you mention log files as part of a training session or to stakeholders within a business, or do you avoid mentioning log files by name?
“Generally speaking, if they're not very technical people, then I would say the best approach is probably to avoid naming them. They can understand when you're saying something like, ‘ChatGPT hasn't even been looking into this piece of content, we need to do something about it.’
That's a much more potent way of describing an issue, as opposed to getting into the specifics and talking about the IP addresses or the particular user agents.”
Are there any hosting services that don't provide log files, and any that are a better source for them?
“Again, it depends on your CMS.
If you're an e-commerce site on Shopify, you're not going to be able to get these log files natively. That's where you can proxy it with Cloudflare and then use Logflare, for example, to get those logs.
If you're on a WordPress site or a custom-built site, depending on what your host is, you should be able to navigate through the directories of your website via FTP to access the log files in that directory – or, if you've got a host like SiteGround, it's easy to use their file manager to access them.
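Where a host does expose logs over FTP, something like the standard-library sketch below can pull a day's file down for analysis. The hostname, credentials, directory, and filename are all placeholders, and many hosts only expose logs through their control panel, so check your provider's documentation first; prefer FTPS or SFTP where the host supports it.

```python
from ftplib import FTP

# All values below are placeholders; check your host's docs for the real hostname,
# credentials, and log directory. Prefer FTPS/SFTP (e.g. ftplib.FTP_TLS) if available.
HOST = "ftp.example.com"
USER = "your-username"
PASSWORD = "your-password"
REMOTE_DIR = "/logs"
FILENAME = "access-2024-05-01.log"

with FTP(HOST) as ftp:
    ftp.login(USER, PASSWORD)
    ftp.cwd(REMOTE_DIR)
    with open(FILENAME, "wb") as local_file:
        ftp.retrbinary(f"RETR {FILENAME}", local_file.write)

print(f"Downloaded {FILENAME} for analysis")
```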
A lot of these different hosting providers will have documentation for how to do that. If you're not sure, you can always reach out to them and say, ‘I want to access the log files for these websites.’
A lot of the time, that data can be important for other things that transcend the SEO aspect. Certain businesses and websites have these log files to hand when they're trying to flip, sell, or exit, so they can share them with the people buying the business or website. They can be helpful because it's a way to add value to the asset being sold. There are plenty of different reasons why it can be important, not just within the SEO setting.
Definitely try and make sure that you at least have access to your log files. Whether you do something with them or you don't, it's still good to be able to obtain them.”
Itamar, what's the key takeaway from the tip you shared today?
“Keep track of what user agents are actually seeing, how they're accessing your website, and how frequently they're doing so.”
Itamar Blauer is a Senior SEO Director at StudioHawk. Find out more over at ItamarBlauer.com.