What Is a Robots.txt File and What Should Yours Say?
A robots.txt file is a plain text file that sits at the root of your website (e.g. example.com/robots.txt) and contains instructions for search engine crawlers — the automated bots that scan your site to index its content. The file tells crawlers which parts of your site they are and aren’t allowed to access.
Most websites need a robots.txt file, though for many small sites its content is straightforward. Understanding what it does — and what it doesn’t do — prevents both the mistake of leaving it empty and the far more serious mistake of accidentally blocking search engines from indexing your site.
How robots.txt works
The robots.txt file follows a standard called the Robots Exclusion Protocol. It contains one or more “User-agent” entries that identify which crawler the rule applies to, followed by “Disallow” or “Allow” directives that specify which URLs the crawler may or may not access.
“User-agent: *” applies the rule to all crawlers. “User-agent: Googlebot” applies it only to Google’s crawler. “Disallow: /admin/” prevents crawlers from accessing any URL that starts with /admin/. “Disallow:” with nothing after it means no restrictions apply.
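As a rough illustration of how these rules are read, the sketch below uses Python’s standard-library urllib.robotparser to test a few URLs against the directives described above. The helper name is_allowed, the bot names AnyBot and OtherBot, and the example.com URLs are made up for the example. Python’s parser is only an approximation of how any particular search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Three tiny rule sets matching the directives described above.
RULES_ADMIN = "User-agent: *\nDisallow: /admin/"
RULES_OPEN = "User-agent: *\nDisallow:"
RULES_GOOGLE_ONLY = "User-agent: Googlebot\nDisallow: /admin/"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules let this crawler fetch the URL (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


admin_url = "https://example.com/admin/login"

# "Disallow: /admin/" under "User-agent: *" blocks every crawler from /admin/...
print(is_allowed(RULES_ADMIN, "AnyBot", admin_url))                    # False
# ...but leaves the rest of the site open.
print(is_allowed(RULES_ADMIN, "AnyBot", "https://example.com/blog/"))  # True
# An empty "Disallow:" means no restrictions at all.
print(is_allowed(RULES_OPEN, "AnyBot", admin_url))                     # True
# A "User-agent: Googlebot" group applies only to that crawler.
print(is_allowed(RULES_GOOGLE_ONLY, "Googlebot", admin_url))           # False
print(is_allowed(RULES_GOOGLE_ONLY, "OtherBot", admin_url))            # True
```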
It’s important to understand what robots.txt does not do. It does not prevent pages from appearing in search results: a page that’s disallowed from crawling can still be indexed if other sites link to it. For pages you genuinely don’t want in Google’s index, you need a “noindex” meta tag on the page itself, and the page must not be disallowed in robots.txt, because a crawler that is blocked from fetching the page never sees the tag. Robots.txt is also a request, not a lock: malicious bots can and do ignore it.
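A crawler can only act on a noindex tag if it is allowed to fetch the page that carries it, so the two mechanisms can work against each other. The sketch below checks for that conflict using only Python’s standard library. The class and function names, the sample HTML and the /private/ path are all invented for illustration, and a real audit would need to fetch live pages.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


def noindex_is_visible(robots_txt: str, url: str, html: str,
                       user_agent: str = "Googlebot") -> bool:
    """True only if the page has noindex AND crawlers may fetch it to see the tag."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return has_noindex(html) and parser.can_fetch(user_agent, url)


# Hypothetical page and rules, for illustration only.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
URL = "https://example.com/private/page"

# Disallowed page: the noindex tag is there, but crawlers never get to read it.
print(noindex_is_visible("User-agent: *\nDisallow: /private/", URL, PAGE))  # False
# Crawlable page: the tag can be seen and honoured.
print(noindex_is_visible("User-agent: *\nDisallow:", URL, PAGE))            # True
```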
What a sensible robots.txt looks like
For most small business websites, a simple robots.txt is sufficient: allow all crawlers access to everything, and include a reference to the XML sitemap location. This looks like: User-agent: * followed by Disallow: (blank) and Sitemap: https://example.com/sitemap.xml.
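Written out line by line, that minimal file is easier to see. The sketch below builds it as text and then reads it back with Python’s urllib.robotparser as a sanity check. The function name minimal_robots_txt is hypothetical, the sitemap URL is the placeholder used above, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def minimal_robots_txt(sitemap_url: str) -> str:
    """Build an allow-everything robots.txt that points at the sitemap."""
    lines = [
        "User-agent: *",   # the rules below apply to every crawler
        "Disallow:",       # blank value: nothing is off limits
        "",
        f"Sitemap: {sitemap_url}",
    ]
    return "\n".join(lines) + "\n"


text = minimal_robots_txt("https://example.com/sitemap.xml")
print(text)

# Read the file back to confirm it says what we intended.
parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("*", "https://example.com/any/page"))  # True
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```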
Common things worth disallowing include: admin panels (/wp-admin/ for WordPress sites, though WordPress’s default robots.txt already disallows it and the admin area requires a login anyway), staging or test directories, internal search result pages (which create many similar URLs and waste crawl budget), and private user account areas.
Avoid disallowing your CSS or JavaScript files. An old recommendation to block these has been reversed — Google needs to be able to render your pages properly, which requires accessing your stylesheets and scripts. Blocking them can cause Google to misunderstand how your pages look and function, which can affect rankings.
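One way to guard against blocking the wrong things is to test a handful of representative URLs against your rules. The sketch below does that with Python’s urllib.robotparser. The rule set, the /staging/, /search/ and /account/ paths, the asset URLs and the blocked_urls helper are all assumptions made up for the example, so substitute the paths your own site uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a small WordPress-style site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /staging/
Disallow: /search/
Disallow: /account/

Sitemap: https://example.com/sitemap.xml
"""


def blocked_urls(robots_txt: str, urls: list, user_agent: str = "Googlebot") -> list:
    """Return the URLs from the list that the rules stop this crawler fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Stylesheets and scripts should stay crawlable so pages can be rendered.
ASSETS = [
    "https://example.com/wp-content/themes/site/style.css",
    "https://example.com/wp-includes/js/jquery/jquery.min.js",
]
# Areas we do want to keep crawlers out of.
PRIVATE = [
    "https://example.com/wp-admin/options.php",
    "https://example.com/search/lamps",
    "https://example.com/account/orders",
]

print(blocked_urls(ROBOTS_TXT, ASSETS))   # [] means no assets are blocked
print(blocked_urls(ROBOTS_TXT, PRIVATE))  # all three private URLs are listed
```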
Checking and maintaining your robots.txt
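You can view your robots.txt file by typing your domain followed by /robots.txt in a browser. In Google Search Console, the robots.txt report shows the version of the file Google last fetched and any problems it had reading it, and the URL Inspection tool shows whether a specific URL is blocked by your current rules. (The older standalone robots.txt Tester has been retired.) Check both after making any changes to confirm the effect.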
The most dangerous robots.txt error is “Disallow: /”, which blocks all crawlers from all pages on your site. This mistake, occasionally made during site migrations or when copying from a staging environment, can cause your pages to drop out of Google’s results, sometimes within days. Check your robots.txt after any site migration or CMS update, and make sure email notifications are switched on in Google Search Console so you’re told quickly if Google reports a problem.
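A check like this is simple to automate after a migration. The sketch below is one possible approach using Python’s urllib.robotparser: it reports whether a given robots.txt stops a crawler from fetching the homepage, which is the symptom of a stray “Disallow: /”. The function name blocks_entire_site and the sample rule sets are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules stop this crawler fetching even the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# A staging file copied to production by mistake.
STAGING_RULES = "User-agent: *\nDisallow: /"
# The file the live site should have had.
LIVE_RULES = "User-agent: *\nDisallow: /admin/"

print(blocks_entire_site(STAGING_RULES))  # True: the whole site is blocked
print(blocks_entire_site(LIVE_RULES))     # False: only /admin/ is blocked
```

The text passed in could come from anywhere, for instance the file as served by your live site after a deployment, so the same function can run as part of a post-launch checklist.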