Robots.txt Setup
Set up your robots.txt file in 2 minutes to control search engine crawling effectively.
What this covers: Setting up a robots.txt file in about 2 minutes to control search engine crawling effectively, including whether you need a custom robots.txt and what the file looks like.
Who it’s for: Site owners and content creators who want to improve their search rankings and organic traffic.
Key outcome: Visiting yoursite.com/robots.txt shows your file, and Google Search Console > Settings > robots.txt shows “Fetched” status.
Time to read: 5 minutes
Part of: SEO + Discoverability series
Do You Need a Custom Robots.txt?
Every website already has an implicit robots.txt — if the file doesn’t exist, search engines assume everything is allowed. You only need a custom one if:
- You have pages that waste crawl budget — admin panels, internal search results, faceted navigation with thousands of parameter combinations
- You run WordPress, Shopify, or a CMS — these platforms generate dozens of URL patterns (tag archives, author pages, query parameters) that dilute crawl efficiency
- You’re preparing for launch — you need to block crawling during development, then open it up on go-live day
If your site has fewer than 50 pages and no CMS, the default (allow everything) is usually fine. For sites with hundreds or thousands of URLs, robots.txt becomes a meaningful lever for crawl budget.
What a Robots.txt File Looks Like
robots.txt controls what search engines can crawl. It’s one of the first files Google checks. Misconfigure it and you can accidentally hide your entire site.
Google crawls each site with a finite budget — typically hundreds to tens of thousands of pages per day, depending on your site’s authority and server speed. Every URL Googlebot spends time on is a URL it’s not spending on your important pages. Blocking low-value paths means those important pages get crawled more frequently.
Create a file called robots.txt in your site root (example.com/robots.txt).
Standard Setup
For most sites, the simplest possible configuration is all you need:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
That’s it for most sites. Allow everything, point to your sitemap.
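To sanity-check that this minimal policy behaves as intended, you can parse it with Python’s built-in urllib.robotparser — a quick sketch, with example.com standing in for your domain:

```python
from urllib.robotparser import RobotFileParser

# The minimal "allow everything" policy from above, line by line.
rules = [
    "User-agent: *",
    "Allow: /",
    "Sitemap: https://example.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(rules)

# Under this policy, every URL should be crawlable by any bot.
print(parser.can_fetch("Googlebot", "https://example.com/any/page"))  # True
```

The same parser is what a quick pre-deploy script can use: feed it the file you are about to publish and assert that your key pages remain fetchable.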
Block Stuff You Don’t Want Indexed
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /thank-you/
Disallow: /*?* # Block URL parameters
Sitemap: https://example.com/sitemap.xml
Each line has a reason:
| Disallow path | Why block it | What happens if you don’t |
|---|---|---|
| /admin/ | No search value, exposes attack surface | Login pages appear in search results, inviting brute-force attempts |
| /cart/ and /checkout/ | Session-specific, empty for crawlers | Googlebot indexes empty cart pages that return soft 200s |
| /thank-you/ | Post-conversion page with no entry context | Users land on a confirmation page with nothing to do |
| /*?* | Parameter URLs create near-duplicate content | Faceted nav or UTM parameters can generate thousands of duplicate URLs |
For WordPress specifically, also consider blocking /wp-admin/ (but allow /wp-admin/admin-ajax.php), /?s= (internal search results), and /?replytocom= (comment reply URLs). These paths collectively can account for 30-40% of Googlebot’s requests on a typical WordPress site.
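Putting those rules together, a typical WordPress robots.txt might look like the following sketch (the /*?replytocom= pattern assumes Google-style wildcard matching; swap example.com for your domain):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /*?replytocom=
Sitemap: https://example.com/sitemap.xml

The Allow line is more specific than the Disallow line above it, so admin-ajax.php stays reachable for themes and plugins that call it from the front end.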
Common Robots.txt Mistakes
Avoid these—they cause most of the problems.
- Blocking CSS/JS: Don’t. Google needs to render your pages.
- Thinking robots.txt is security: It’s not. It’s a suggestion. Anyone can ignore it.
- Blocking your whole site during dev: Fine. But remember to remove it before launch.
The single most common disaster: launching a site with Disallow: / still in place from staging. Google will de-index your entire site within days and re-indexing takes weeks. Add a calendar reminder on launch day. Better yet, use a pre-launch checklist that includes checking yoursite.com/robots.txt in a browser.
Test It
Testing is necessary because a misconfigured robots.txt can hide your entire site from search engines. The impact is invisible until you check Search Console and see pages mysteriously missing from the index. Always verify changes in a staging environment first.
Google Search Console → URL Inspection → enter a URL → see if it’s blocked.
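You can also spot-check specific paths locally before touching Search Console. Here’s a minimal sketch using Python’s urllib.robotparser — note that the standard-library parser implements the original exclusion protocol and does not understand Google’s * wildcard extension, so the /*?* rule is omitted and plain path prefixes are used:

```python
from urllib.robotparser import RobotFileParser

# The blocking policy from earlier, minus the /*?* wildcard rule,
# which the standard-library parser cannot evaluate.
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /cart/",
    "Disallow: /checkout/",
    "Disallow: /thank-you/",
]

parser = RobotFileParser()
parser.parse(rules)

for url in (
    "https://example.com/products/widget",
    "https://example.com/admin/login",
    "https://example.com/cart/",
):
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{url}: {verdict}")
```

For wildcard rules, verify against Google’s own tools, since matching behavior differs between parsers.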
Sources
- Google Search Central – robots.txt Introduction
- robotstxt.org – The Robots Exclusion Protocol
- Google – robots.txt Specifications
Robots.txt Questions Answered
Does robots.txt prevent pages from appearing in Google?
Robots.txt blocks crawling, not indexing. If other sites link to a disallowed URL, Google can still index it based on anchor text and link context, displaying the URL in search results without a snippet. To prevent indexing, use a noindex meta tag or X-Robots-Tag HTTP header instead.
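For example, to keep a page out of the index you would leave it crawlable and add one of these equivalent directives — in the page’s HTML head:

<meta name="robots" content="noindex">

or, for non-HTML resources such as PDFs, as an HTTP response header:

X-Robots-Tag: noindex

Either way, the page must not be disallowed in robots.txt, or Google will never see the directive.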
Where should robots.txt be placed?
Robots.txt must be at your domain root: yoursite.com/robots.txt. It will not work in subdirectories. Each subdomain needs its own robots.txt file. If the file returns a 404, search engines treat all URLs as allowed; persistent server errors (5xx) can instead cause Google to stop crawling the site entirely.
What should you block in WordPress robots.txt?
Block /wp-admin/ (but allow /wp-admin/admin-ajax.php for theme functionality), and any parameter-based duplicate content paths like ?replytocom= and ?s=. Do not block /wp-content/uploads/, CSS, or JavaScript files, as Google needs these to render and understand your pages.
How do you test your robots.txt file?
Use Google Search Console’s robots.txt Tester (under Settings > robots.txt) to validate syntax and test whether specific URLs are blocked or allowed. Also test with different user agents since rules are agent-specific and Googlebot, Googlebot-Image, and Bingbot may behave differently.
✓ Verifying Your Robots.txt Is Live
- Visiting yoursite.com/robots.txt shows your file
- Google Search Console > Settings > robots.txt shows “Fetched” status
- Your sitemap URL is listed in the robots.txt file
Verify: Use Google’s robots.txt Tester in Search Console to confirm important pages aren’t accidentally blocked.