
What Is a Robots.txt File and What Should Yours Say?

3 min read · Published December 2020 · By the Xpose team


A robots.txt file is a plain text file that sits at the root of your website (e.g. example.com/robots.txt) and contains instructions for search engine crawlers — the automated bots that scan your site to index its content. The file tells crawlers which parts of your site they are and aren’t allowed to access.

Most websites benefit from a robots.txt file, though for many small sites its content is straightforward. Understanding what it does, and what it doesn’t do, helps you avoid both the minor mistake of leaving it empty and the far more serious mistake of accidentally blocking search engines from crawling your site.

How robots.txt works

The robots.txt file follows a standard called the Robots Exclusion Protocol (formalised as RFC 9309 in 2022). It contains one or more groups, each starting with a “User-agent” line that identifies which crawler the rules apply to, followed by “Disallow” or “Allow” directives that specify which URLs that crawler may or may not access.

“User-agent: *” applies the rule to all crawlers. “User-agent: Googlebot” applies it only to Google’s crawler. “Disallow: /admin/” prevents crawlers from accessing any URL that starts with /admin/. “Disallow:” with nothing after it means no restrictions apply.
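As a rough illustration of how these directives combine, the sketch below feeds a small robots.txt to Python's standard-library parser and asks which URLs each crawler may fetch. The file contents, the `ExampleBot` name and the example.com URLs are made up for the example. Python's parser uses simple prefix matching, so it approximates rather than reproduces how Google evaluates rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is unrestricted, every other
# crawler is asked to stay out of /admin/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group; an empty Disallow means no restrictions.
print(parser.can_fetch("Googlebot", "https://example.com/admin/login"))   # True

# Any other crawler falls back to the "*" group.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/login"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/about"))        # True
```

A crawler that finds a group naming it specifically uses that group and ignores the “*” group, which is why Googlebot is not bound by the /admin/ rule here.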

It’s important to understand what robots.txt does not do. It does not prevent pages from appearing in search results: a page that’s disallowed from crawling can still be indexed if other sites link to it. For pages you genuinely don’t want in Google’s index, use a “noindex” meta tag on the page itself, and make sure the page is not also disallowed in robots.txt, because a crawler that can’t fetch the page never sees the tag. Robots.txt is also a request, not a lock: malicious bots can and do ignore it.
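To make the noindex tag concrete, here is a minimal sketch that checks whether a page's HTML carries a robots meta tag containing “noindex”. The helper names (`RobotsMetaFinder`, `has_noindex`) and the sample pages are hypothetical. It only inspects the meta tag, not the equivalent `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag includes noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        directives = [d.strip().lower() for d in attributes.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


# Hypothetical pages for illustration.
PRIVATE_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
PUBLIC_PAGE = '<html><head><meta name="description" content="Our services"></head></html>'

print(has_noindex(PRIVATE_PAGE))  # True
print(has_noindex(PUBLIC_PAGE))   # False
```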

What a sensible robots.txt looks like

For most small business websites, a simple robots.txt is sufficient: allow all crawlers access to everything, and include a reference to the XML sitemap location. This looks like: User-agent: * followed by Disallow: (blank) and Sitemap: https://example.com/sitemap.xml.
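Laid out line by line, that minimal file can be checked with a short script. The sketch below assumes Python 3.8 or later (for `site_maps()`) and uses example.com as a placeholder domain. `load_rules` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# The minimal file described above, one directive per line.
MINIMAL_ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(MINIMAL_ROBOTS_TXT)

# Every crawler may fetch every page.
print(rules.can_fetch("Googlebot", "https://example.com/any/page"))  # True

# The sitemap location is exposed to crawlers that look for it.
print(rules.site_maps())  # ['https://example.com/sitemap.xml']
```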

Common things worth disallowing include: admin panels (/wp-admin/ for WordPress sites, though WordPress’s default robots.txt already disallows it), staging or test directories, internal search result pages (which create many similar URLs and waste crawl budget), and private user account areas.
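One way to keep such a list manageable is to generate the file from it. The sketch below is an assumption-laden illustration rather than a recommended rule set: the paths (`/staging/`, `/search/`, `/account/`), the `build_robots_txt` helper and the `ExampleBot` name are all hypothetical, and your own site's directories will differ.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths: Iterable[str], sitemap_url: str) -> str:
    """Build a robots.txt that applies the same rules to all crawlers."""
    rules = [f"Disallow: {path}" for path in disallowed_paths]
    if not rules:
        rules = ["Disallow:"]  # nothing to block: allow everything
    lines = ["User-agent: *", *rules, "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Hypothetical paths: a staging copy, internal search results, account pages.
robots_txt = build_robots_txt(
    ["/staging/", "/search/", "/account/"],
    "https://example.com/sitemap.xml",
)
print(robots_txt)

# Sanity-check the generated rules before uploading the file.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("ExampleBot", "https://example.com/search/?q=shoes"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/products/"))        # True
```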

Avoid disallowing your CSS or JavaScript files. An old recommendation to block these has been reversed — Google needs to be able to render your pages properly, which requires accessing your stylesheets and scripts. Blocking them can cause Google to misunderstand how your pages look and function, which can affect rankings.

Checking and maintaining your robots.txt

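You can view your robots.txt file by typing your domain followed by /robots.txt in a browser. Google Search Console’s robots.txt report shows the version of the file Google last fetched and any problems parsing it, and its URL Inspection tool tells you whether a specific URL is blocked by your current rules. (The standalone robots.txt Tester that older guides mention was retired in 2023.) To test a rule before publishing it, run the draft file through a robots.txt parser or validator, then confirm the effect in Search Console once it is live.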

The most dangerous robots.txt error is “Disallow: /” under “User-agent: *”, which blocks all crawlers from all pages on your site. This mistake, occasionally made during site migrations or when copying from a staging environment, can cause your pages to start dropping out of Google within days. Check your robots.txt after any site migration or CMS update, and make sure Search Console’s email notifications reach someone who will act on them, so you’re alerted quickly if something goes wrong.
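A check like this is easy to automate after a deployment. The sketch below tests whether a given robots.txt would stop a crawler from fetching the homepage, which is a reasonable proxy for a site-wide “Disallow: /”. The function name `blocks_whole_site` and the two sample files are hypothetical, and it works on text you supply rather than fetching the live file.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given crawler is not allowed to fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# A typical staging file that must never reach the live site.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# What the live site is expected to serve.
LIVE_ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""

print(blocks_whole_site(STAGING_ROBOTS_TXT))  # True
print(blocks_whole_site(LIVE_ROBOTS_TXT))     # False
```

Running a check of this kind as part of a migration checklist would flag a copied staging file before search engines notice it.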

FAQs

Common questions.

Does every website need a robots.txt file?

Not strictly — if the file is missing, crawlers will simply crawl everything. But having one is best practice because it lets you control crawl behaviour, include your sitemap URL, and prevent unnecessary crawling of admin or duplicate pages. Creating a minimal robots.txt takes five minutes and is worth doing for any site that cares about SEO.

Can robots.txt hurt my SEO?

Yes, if configured incorrectly. The most common errors are blocking crawlers from your main content pages, your CSS and JavaScript files, or your XML sitemap, all of which harm how Google crawls and understands your site. Always test changes against a draft of the file before publishing them, and confirm the result in Google Search Console afterwards.

Should I block AI web crawlers with robots.txt?

This is an increasingly common question. If you want to prevent AI companies from using your content to train models, you can add disallow rules for known AI crawlers such as GPTBot (OpenAI), ClaudeBot (Anthropic), and others; check each company’s documentation for its current user-agent names, as they change over time. Note that compliance is voluntary: these companies state they respect robots.txt, but there’s no technical enforcement mechanism. The rules only apply to future crawls, not content already collected.
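As a sketch of what such rules might look like, the file below asks two AI crawlers to stay away while leaving everything open to other bots. The user-agent tokens shown (GPTBot, and ClaudeBot as the name Anthropic documents for its crawler) are assumptions to verify against each vendor's current documentation before use.

```python
from urllib.robotparser import RobotFileParser

# Several User-agent lines can share one group of rules.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named AI crawlers are asked not to fetch anything.
print(parser.can_fetch("GPTBot", "https://example.com/blog/"))     # False
print(parser.can_fetch("ClaudeBot", "https://example.com/blog/"))  # False

# Search engine crawlers are unaffected.
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))  # True
```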
