Robots.txt: How to Control Search Engine Crawling

Learn how to create and configure your robots.txt file to control which pages search engines can and cannot access on your website.

What is Robots.txt?

Robots.txt is a plain text file placed in your website's root directory (e.g., https://yourdomain.com/robots.txt) that tells search engine crawlers which pages or sections of your site they should or shouldn't visit. It follows the Robots Exclusion Protocol, a standard that all major search engines respect.

Important: robots.txt is a suggestion, not a command. Well-behaved crawlers (Google, Bing, etc.) will follow it, but malicious bots may ignore it entirely. Robots.txt should not be used as a security measure — it does not prevent pages from being indexed if they're linked from other sites.

How Robots.txt Works

When a crawler visits your site, the first thing it checks is your robots.txt file. The file contains rules (called "directives") that specify which user-agents (crawlers) can access which paths.

Basic syntax:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public-page

Sitemap: https://yourdomain.com/sitemap.xml

Key Directives

  • User-agent — Specifies which crawler the rules apply to. * means all crawlers.
  • Disallow — Tells crawlers not to access the specified path.
  • Allow — Overrides a Disallow rule for a specific path (useful for allowing a page within a blocked directory).
  • Sitemap — Points crawlers to your XML sitemap.
  • Crawl-delay — Asks crawlers to wait a specified number of seconds between requests (respected by Bing, ignored by Google).
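To experiment with how these directives combine for a given URL, Python's standard-library urllib.robotparser is a quick sandbox. Two caveats: the stdlib parser applies rules in file order (first match wins) rather than Google's longest-match rule, so place Allow lines before the Disallow lines they override, and it does not understand * wildcards inside paths. A minimal sketch, with illustrative paths:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public-page
Disallow: /admin/
Disallow: /private/

User-agent: Bingbot
Crawl-delay: 5
"""

# Parse rules from a string; set_url() + read() would fetch a live file instead.
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyBot", "https://example.com/admin/"))             # blocked
print(rp.can_fetch("MyBot", "https://example.com/admin/public-page"))  # allowed by the Allow rule
print(rp.can_fetch("MyBot", "https://example.com/blog/post"))          # allowed by default
print(rp.crawl_delay("Bingbot"))                                       # 5
```

Because of the first-match behavior, swapping the Allow and Disallow lines would make the stdlib parser report /admin/public-page as blocked, even though Google's longest-match rule would still allow it.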

Common Robots.txt Configurations

Allow Everything (Default)

User-agent: *
Disallow:

An empty Disallow value means "nothing is disallowed." This is functionally identical to not having a robots.txt at all.

Block Everything

User-agent: *
Disallow: /

Blocks all crawlers from all pages. Use this only for staging or development sites that should not be indexed, and note that blocking crawling alone does not keep URLs out of the index if other sites link to them; password protection is the more reliable safeguard for staging environments.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=

This example blocks administrative areas, user-specific pages, and URL parameters that create duplicate content.
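The wildcard patterns above follow Google's matching rules: * matches any run of characters, a trailing $ anchors the end of the URL, and otherwise a rule matches any path that begins with the pattern. As a rough sketch (not Google's actual implementation), that matching can be approximated with a regular-expression translation:

```python
import re

def robots_rule_matches(pattern: str, path: str) -> bool:
    """Approximate Google-style robots.txt path matching.

    '*' matches any run of characters; a trailing '$' anchors the
    end of the path; otherwise the rule is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(robots_rule_matches("/*?sort=", "/products?sort=price"))  # True
print(robots_rule_matches("/admin/", "/admin/settings"))        # True: prefix match
print(robots_rule_matches("/search?", "/search-tips"))          # False: no literal '?'
print(robots_rule_matches("/*.pdf$", "/files/report.pdf"))      # True
```

This is why Disallow: /search? blocks only URLs containing a literal query string after /search, while Disallow: /*?sort= blocks a ?sort= parameter anywhere on the site.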

Target Specific Crawlers

User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/
Crawl-delay: 5

What to Block with Robots.txt

  • Admin and login pages — No need for search engines to crawl these.
  • Internal search results — These create near-infinite URL combinations that waste crawl budget.
  • Shopping cart and checkout pages — User-specific pages that shouldn't be indexed.
  • Duplicate content from URL parameters — Filter, sort, and pagination parameters that create duplicate versions of pages.
  • Development/staging areas — Test environments that shouldn't appear in search results.
  • API endpoints — Backend services not meant for public indexing.
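Putting that checklist together, a typical e-commerce robots.txt might look like this (the paths are illustrative; adjust them to your site's actual structure):

```
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /api/

Sitemap: https://yourdomain.com/sitemap.xml
```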

What NOT to Block

  • CSS and JavaScript files — Google needs to render your pages to understand them. Blocking these files prevents proper rendering.
  • Pages you want indexed — Obvious, but misconfigured robots.txt is one of the most common technical SEO mistakes.
  • Images — Unless you specifically don't want them appearing in image search.

Robots.txt vs. Noindex

A common misconception: robots.txt does NOT remove pages from search results. If a blocked page has external links pointing to it, Google may still index the URL (showing it without a description). To truly prevent indexing, use the <meta name="robots" content="noindex"> tag on the page itself — but don't block the page in robots.txt, or crawlers won't see the noindex tag.
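To illustrate the distinction, here is a small sketch using Python's standard-library html.parser to check whether a page carries a noindex robots meta tag, the signal a crawler can only see when the page is not blocked in robots.txt. The HTML string is a stand-in for a real response body:

```python
from html.parser import HTMLParser

class RobotsMetaChecker(HTMLParser):
    """Detect <meta name="robots" content="...noindex..."> in a page."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = {k.lower(): (v or "") for k, v in attrs}
            if (attr.get("name", "").lower() == "robots"
                    and "noindex" in attr.get("content", "").lower()):
                self.noindex = True

html = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
checker = RobotsMetaChecker()
checker.feed(html)
print(checker.noindex)  # True
```

If this page were also listed under Disallow in robots.txt, a compliant crawler would never fetch the HTML, and the noindex directive would go unseen.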

Testing Your Robots.txt

Always test changes before deploying:

  • Google Search Console's robots.txt report shows how Google fetched and interprets your file (it replaced the older robots.txt Tester tool)
  • Test specific URLs against your rules to verify they're blocked or allowed as intended
  • Confirm your robots.txt is accessible at the root of your domain (https://yourdomain.com/robots.txt); crawlers will not look for it anywhere else

How AI SEO Powered by CGMIMM Helps

AI SEO powered by CGMIMM automatically analyzes your robots.txt file during every site audit. It checks for common misconfigurations — like accidentally blocking important pages, CSS, or JavaScript files — and verifies that your robots.txt aligns with your XML sitemap. If issues are found, the AI generates specific fix recommendations so you can correct them confidently.

Ready to Improve Your SEO?

Stop reading, start ranking. AI SEO powered by CGMIMM gives you the tools to put everything you just learned into practice — automatically.

Start Your 48-Hour Free Trial

Related Articles

  • Technical SEO: The Complete Guide
  • Site Speed Optimization: Why It Matters for SEO
  • Mobile-First Indexing: How to Prepare Your Site
  • XML Sitemaps: What They Are and How to Create Them