How to Configure robots.txt for AI Crawlers (Without Going Invisible)

robots.txt is the first gate every compliant AI crawler checks—and the easiest place to accidentally erase yourself from AI answers. This part covers how to configure it per bot, the blanket-rule mistakes that quietly cost visibility, and the CDN/WAF layer that can block AI crawlers even when your robots.txt says ‘come on in.’

Executive summary

Two layers decide whether a crawler reaches your content: robots.txt (a polite request compliant bots honor) and your CDN/WAF (a hard gate that can block bots regardless of robots.txt). Most accidental AI-invisibility lives in one of these two places.

Configure robots.txt per user-agent so you can allow retrieval bots while making a separate call on training bots.
A single broad Disallow, or a staging rule shipped to production, can remove you from the answers of every AI crawler that honors robots.txt.
Even a perfect robots.txt won’t help if Cloudflare/Akamai bot mitigation is returning 403s to AI user-agents.

How robots.txt governs AI crawlers

robots.txt lives at the root of your domain and lists rules grouped by user-agent. Compliant AI crawlers read it before fetching and obey the most specific matching group. It’s a request, not a lock—well-behaved bots honor it; bad actors ignore it.

Anatomy

# https://example.com/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Specificity matters: a named user-agent group (e.g. GPTBot) overrides the wildcard (*) group for that bot. Bots do not merge the two.
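The group-selection rule can be checked offline with Python's standard-library urllib.robotparser. This is a minimal sketch rather than any crawler's actual implementation: the helper name is_allowed, the probe URL, and the SomeOtherBot agent are illustrative assumptions. Python's parser matches user-agent tokens by substring and applies rules in file order, so it may disagree with a specific crawler on more elaborate files.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the Anatomy example, held as a string so no network is needed.
ANATOMY = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if Python's parser would let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/blog/post"  # hypothetical page
for agent in ("GPTBot", "OAI-SearchBot", "SomeOtherBot"):
    print(agent, is_allowed(ANATOMY, agent, url))
```

GPTBot is refused by its own group even though the wildcard group allows everything, while an unlisted bot falls through to the wildcard group.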

Per-bot configuration patterns

Because training and retrieval are different decisions, your robots.txt should treat them differently. Two common, deliberate postures:

Opt out of training, stay visible in AI answers

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
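One way to keep a posture like this honest is to write the intended outcome down as data and compare it with what the file actually does. The sketch below assumes Python's urllib.robotparser is a close enough stand-in for the crawlers' own parsers. The names INTENDED_POLICY and policy_mismatches are hypothetical, and the probe URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

OPT_OUT_OF_TRAINING = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
"""

# Intended posture: True = should be able to fetch, False = should be refused.
INTENDED_POLICY = {
    "GPTBot": False,
    "CCBot": False,
    "Google-Extended": False,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
}


def policy_mismatches(robots_txt, intended, probe_url="https://example.com/"):
    """Return {agent: actual_result} for every agent whose result differs from intent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for agent, should_allow in intended.items():
        actual = parser.can_fetch(agent, probe_url)
        if actual != should_allow:
            mismatches[agent] = actual
    return mismatches


print(policy_mismatches(OPT_OUT_OF_TRAINING, INTENDED_POLICY))
```

An empty result means the file matches the stated posture. Any entry names a bot that is being treated differently from what was decided.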

Fully open (maximize AI visibility)

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

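Note that Google-Extended is a control token, not a separate crawler: it governs whether content Google has already crawled may be used for Gemini training and grounding. It does not control Googlebot, which is still how Google’s AI Overviews access your pages. Disallowing Googlebot to ‘block AI’ also removes you from classic Search.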

The accidental block

Most lost AI visibility isn’t a policy choice—it’s a leftover. The usual suspects:

A site-wide Disallow: / that shipped from a staging environment to production.
A blanket ‘block all bots’ rule added during a scraping scare that also catches retrieval crawlers.
Disallowing the directories where your real content lives (e.g. /blog, /resources) while leaving marketing pages open.
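A noindex or nosnippet directive (robots meta tag or X-Robots-Tag header) left on real content. These are not robots.txt rules, but the effect is similar: noindex drops the page from the index and nosnippet suppresses snippet generation, and with it AI snippet eligibility.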
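The last item in that list is not visible in robots.txt at all, so it needs its own check. The sketch below scans a page's HTML and response headers for the two directives using only the standard library. It is deliberately simplified: it reads only the generic robots meta tag and plain X-Robots-Tag values, and ignores bot-specific meta names or header values prefixed with a user-agent. The function name suppressing_directives is hypothetical.

```python
from html.parser import HTMLParser

SUPPRESSING = {"noindex", "nosnippet"}


def _split(value):
    """Split a comma-separated directive list into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            self.directives |= _split(attributes.get("content", ""))


def suppressing_directives(html, headers):
    """Return the sorted noindex/nosnippet directives found in HTML or headers."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    found = set(parser.directives)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= _split(value)
    return sorted(found & SUPPRESSING)


page = '<html><head><meta name="robots" content="index, nosnippet"></head></html>'
print(suppressing_directives(page, {"X-Robots-Tag": "noindex"}))
```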

The most expensive two lines on the web

User-agent: *
Disallow: /

After any deploy that touches infrastructure, re-check robots.txt in production. It’s a 30-second check that prevents a category of silent, total AI invisibility.
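That post-deploy check can be scripted so it does not depend on someone remembering it. A minimal sketch, assuming the content paths listed are the ones that matter for your site (they are placeholders here) and that Python's urllib.robotparser approximates how the crawler reads the file. The helper name blocked_paths is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, user_agent, paths):
    """Return the subset of `paths` that `user_agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


# Placeholder paths: substitute the directories where your real content lives.
CONTENT_PATHS = ["/", "/blog/", "/resources/"]

# The staging leftover: two lines that block every compliant bot.
STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"

print(blocked_paths(STAGING_LEFTOVER, "OAI-SearchBot", CONTENT_PATHS))
```

In a release pipeline the robots.txt text would be fetched from the production domain rather than hard-coded, and a non-empty result for any crawler you intend to welcome would fail the check.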

For leadership

Make ‘verify production robots.txt’ a release checklist item. The failure mode is invisible in dashboards and can persist for months before anyone notices the missing demand.

When robots.txt isn’t the blocker: CDN & WAF

You can have a flawless robots.txt and still be invisible. CDNs and web application firewalls sit in front of your origin and make their own decisions about automated traffic—often denying AI crawlers before robots.txt is ever read.

Common hard blocks:

Cloudflare’s ‘Block AI bots’ / managed bot rules returning 403 to AI user-agents.
Bot-fight / managed-challenge modes that serve a JavaScript or CAPTCHA challenge a crawler can’t solve.
Rate limiting that trips on a crawler’s normal request pace and returns 429.
Firewall rules that allow only known search engines and silently drop newer AI bots.

The fix is to explicitly allow the retrieval crawlers you’ve decided to welcome—by verified user-agent and, ideally, published IP range—so the firewall stops treating them as hostile.
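The logic of "verified user-agent plus published IP range" can be sketched as below. This illustrates the check itself, not any CDN's rule syntax. The CIDR ranges are placeholders taken from the address blocks reserved for documentation, so real values must come from each crawler operator's published list. The function name is_verified_crawler is hypothetical.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real crawler ranges.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified_crawler(user_agent, remote_ip, published_ranges=PUBLISHED_RANGES):
    """True only if the UA names a welcomed crawler AND the IP is in its range."""
    address = ipaddress.ip_address(remote_ip)
    for token, cidrs in published_ranges.items():
        if token.lower() in user_agent.lower():
            return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # unknown user-agent: leave it to the normal bot rules


ua = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"  # illustrative UA string
print(is_verified_crawler(ua, "192.0.2.10"))   # claimed crawler, IP inside the range
print(is_verified_crawler(ua, "203.0.113.5"))  # same claim, IP outside the range
```

Requiring both conditions means a scraper that merely spoofs the user-agent string does not inherit the allow rule.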

Decide your bot policy once, then make robots.txt and your CDN/WAF agree. Most AI-visibility incidents come from these two layers disagreeing.

Verify it actually works

Don’t trust the config—test the response a crawler gets. Request your page while presenting an AI user-agent and confirm a 200 with your real HTML.

Check the response an AI crawler receives

curl -A "OAI-SearchBot" -I https://example.com/your-page
# expect: HTTP/2 200  (not 403 / 429 / 503)

curl -A "PerplexityBot" -s https://example.com/your-page | head -40
# expect: your actual content in the HTML

Test from outside your network. Internal requests often bypass the very CDN/WAF rules that block crawlers, hiding the problem.
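The same check can be scripted for a list of pages and bots. The sketch below uses only the standard library. The names fetch_as and classify, and the idea of looking for an expected marker string in the body, are assumptions made for illustration. A 200 that lacks the marker is flagged because challenge pages often return 200 with no real content.

```python
import urllib.error
import urllib.request


def fetch_as(user_agent, url, timeout=10):
    """Fetch `url` presenting `user_agent`; return (status_code, body_text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions; they still carry a status and a body.
        return error.code, error.read().decode("utf-8", errors="replace")


def classify(status, body, expected_marker):
    """Label the response a crawler received."""
    if status in (401, 403):
        return "blocked"
    if status == 429:
        return "rate-limited"
    if status >= 500:
        return "server-error"
    if status == 200 and expected_marker in body:
        return "ok"
    return "unexpected-content"


# Offline illustration with canned responses; no request is sent here.
print(classify(200, "<h1>Pricing</h1>", "Pricing"))
print(classify(200, "<title>Just a moment...</title>", "Pricing"))
```

A real run would pass the result of fetch_as("OAI-SearchBot", "https://example.com/your-page") into classify, executed from a machine outside your own network. As with the curl commands, this only presents the user-agent string, so rules keyed on verified IP ranges will not be exercised.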

Part 3 of 6

A part of Why AI Search Engines Can’t See Your Website — and How to Fix It.

About the author

Mihir Naik — Senior Product Manager (AI) at seoClarity, building Clarity ArcAI. Born in Surat, India; based in Toronto. In SEO since 2011. Available for consulting.
