Guide · Published: June 30, 2026 · Last updated: June 30, 2026 · ~4 min read
Mihir Naik, Senior PM (AI) at seoClarity

How to Configure robots.txt for AI Crawlers (Without Going Invisible)

robots.txt is the first gate every compliant AI crawler checks—and the easiest place to accidentally erase yourself from AI answers. This part covers how to configure it per bot, the blanket-rule mistakes that quietly cost visibility, and the CDN/WAF layer that can block AI crawlers even when your robots.txt says ‘come on in.’
Executive summary
Two layers decide whether a crawler reaches your content: robots.txt (a polite request compliant bots honor) and your CDN/WAF (a hard gate that can block bots regardless of robots.txt). Most accidental AI-invisibility lives in one of these two places.
  • Configure robots.txt per user-agent so you can allow retrieval bots while making a separate call on training bots.
  • A single broad Disallow—or a staging rule shipped to production—can remove you from every AI answer.
  • Even a perfect robots.txt won’t help if Cloudflare/Akamai bot mitigation is returning 403s to AI user-agents.

How robots.txt governs AI crawlers

robots.txt lives at the root of your domain and lists rules grouped by user-agent. Compliant AI crawlers read it before fetching and obey the most specific matching group. It’s a request, not a lock—well-behaved bots honor it; bad actors ignore it.
Anatomy
# https://example.com/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
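Specificity matters: a named user-agent group (e.g. GPTBot) overrides the wildcard (*) group for that bot. Bots do not merge the two, so a bot that matches a named group ignores the wildcard group entirely, and any wildcard rule you still want applied to it must be repeated in its own group.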
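To see that precedence in action, here is a minimal sketch that runs the Anatomy file through Python's standard-library urllib.robotparser. The page URL and the SomeOtherBot name are placeholders, and Python's parser only approximates how any particular crawler reads the file, so treat it as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the Anatomy example.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "https://example.com/blog/post"  # hypothetical URL
for bot in ("GPTBot", "OAI-SearchBot", "SomeOtherBot"):
    # Named groups win for their own bot; everyone else falls back to "*".
    print(bot, can_fetch(ROBOTS_TXT, bot, PAGE))
```

GPTBot matches its own group and is refused, even though the wildcard group allows everything. SomeOtherBot has no named group, so the wildcard group decides.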

Per-bot configuration patterns

Because training and retrieval are different decisions, your robots.txt should treat them differently. Two common, deliberate postures:
Opt out of training, stay visible in AI answers
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
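One way to confirm a file like this matches the policy you intended is to write the policy down and compare. The sketch below is an illustration using Python's urllib.robotparser; the EXPECTED table and the audit helper are hypothetical names, and the parser treats Google-Extended like any other user-agent token.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /
"""

# Intended policy: True = may crawl, False = should be blocked.
EXPECTED = {
    "GPTBot": False,
    "CCBot": False,
    "Google-Extended": False,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
}


def audit(robots_txt: str, expected: dict, url: str = "https://example.com/") -> dict:
    """Return {bot: actual_result} for every bot whose result differs from the policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for bot, should_allow in expected.items():
        actual = parser.can_fetch(bot, url)
        if actual != should_allow:
            mismatches[bot] = actual
    return mismatches


print(audit(ROBOTS_TXT, EXPECTED))  # an empty dict means the file matches the policy
```

Keeping the expected table next to the file makes the training/retrieval decision explicit, so a later edit that flips one bot shows up as a mismatch instead of a silent change.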
Fully open (maximize AI visibility)
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

The accidental block

Most lost AI visibility isn’t a policy choice—it’s a leftover. The usual suspects:
  • A site-wide Disallow: / that shipped from a staging environment to production.
  • A blanket ‘block all bots’ rule added during a scraping scare that also catches retrieval crawlers.
  • Disallowing the directories where your real content lives (e.g. /blog, /resources) while leaving marketing pages open.
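  • A noindex or nosnippet directive (robots meta tag or X-Robots-Tag header) left in place: noindex drops the page from the index, nosnippet suppresses snippet generation, and either one costs AI snippet eligibility.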
The most expensive two lines on the web
User-agent: *
Disallow: /
After any deploy that touches infrastructure, re-check robots.txt in production. It’s a 30-second check that prevents a category of silent, total AI invisibility.
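That check can be automated. The following is a rough sketch of a release-time test, assuming you can hand it the robots.txt text that production actually serves; the blocked_paths helper, the bot list, and the sample paths are all hypothetical and should be replaced with your own.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, bots, paths):
    """List every (bot, path) pair that the robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (bot, path)
        for bot in bots
        for path in paths
        if not parser.can_fetch(bot, path)
    ]


RETRIEVAL_BOTS = ("OAI-SearchBot", "PerplexityBot")
CONTENT_PATHS = ("/", "/blog/post", "/resources/guide")  # hypothetical paths

# A staging leftover: the two-line blanket block.
STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"
# A narrower mistake: the content directory is closed, the rest is open.
BLOG_CLOSED = "User-agent: *\nDisallow: /blog\n"

for label, text in (("staging leftover", STAGING_LEFTOVER), ("blog closed", BLOG_CLOSED)):
    print(label, blocked_paths(text, RETRIEVAL_BOTS, CONTENT_PATHS))
```

A non-empty result for a bot you meant to welcome is a reason to fail the release. In a pipeline, RobotFileParser.set_url() followed by read() can load the live file instead of a string.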
For leadership
Make ‘verify production robots.txt’ a release checklist item. The failure mode is invisible in dashboards and can persist for months before anyone notices the missing demand.

When robots.txt isn’t the blocker: CDN & WAF

You can have a flawless robots.txt and still be invisible. CDNs and web application firewalls sit in front of your origin and make their own decisions about automated traffic—often denying AI crawlers before robots.txt is ever read.
Common hard blocks:
  • Cloudflare’s ‘Block AI bots’ / managed bot rules returning 403 to AI user-agents.
  • Bot-fight / managed-challenge modes that serve a JavaScript or CAPTCHA challenge a crawler can’t solve.
  • Rate limiting that trips on a crawler’s normal request pace and returns 429.
  • Firewall rules that allow only known search engines and silently drop newer AI bots.
The fix is to explicitly allow the retrieval crawlers you’ve decided to welcome—by verified user-agent and, ideally, published IP range—so the firewall stops treating them as hostile.
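The logic behind such an allow rule can be sketched in a few lines. This is only an illustration of the user-agent plus IP-range check; real rules are configured in your CDN or WAF product, the is_welcome_crawler helper is hypothetical, and the CIDR ranges below are documentation placeholders (RFC 5737), not real crawler ranges. Use the ranges each vendor publishes.

```python
import ipaddress

# Placeholder ranges from the RFC 5737 documentation blocks.
# They are NOT real crawler ranges; substitute each vendor's published list.
ALLOWED_CRAWLERS = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_welcome_crawler(user_agent: str, remote_ip: str, allowed=ALLOWED_CRAWLERS) -> bool:
    """True only if the UA names an allowlisted bot AND the IP is in that bot's ranges."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in allowed.items():
        if token.lower() in ua:
            # The user-agent alone is easy to spoof, so the IP must match too.
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False


# Illustrative user-agent string; real crawlers send longer ones.
UA = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"
print(is_welcome_crawler(UA, "192.0.2.15"))   # UA and IP both match
print(is_welcome_crawler(UA, "203.0.113.9"))  # right UA, wrong network
```

Requiring both signals is the point: a user-agent match alone would let anyone who copies the string through, while an IP match alone says nothing about which bot is calling.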

Verify it actually works

Don’t trust the config—test the response a crawler gets. Request your page while presenting an AI user-agent and confirm a 200 with your real HTML.
Check the response an AI crawler receives
curl -A "OAI-SearchBot" -I https://example.com/your-page
# expect: HTTP/2 200  (not 403 / 429 / 503)

curl -A "PerplexityBot" -s https://example.com/your-page | head -40
# expect: your actual content in the HTML
Test from outside your network. Internal requests often bypass the very CDN/WAF rules that block crawlers, hiding the problem.
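The same check can be scripted so it runs across several bots and pages. Below is a minimal sketch using only Python's standard library; build_request, fetch, and diagnose are hypothetical helper names, and the short user-agent tokens are the same simplification the curl commands use. If your firewall verifies crawler IPs, a spoofed user-agent from your own machine may be refused even though the real crawler gets through, so read a 403 here as a prompt to investigate, not as proof.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that presents the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str, timeout: float = 10.0):
    """Return (status, body) as the server answers this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code and body are still readable.
        return err.code, err.read().decode("utf-8", errors="replace")


def diagnose(status: int, body: str, expected_phrase: str) -> str:
    """Map a response to the failure modes described in this section."""
    if status == 403:
        return "403: likely a CDN/WAF block"
    if status == 429:
        return "429: rate limited"
    if status != 200:
        return f"{status}: unexpected status"
    if expected_phrase not in body:
        return "200 but expected content missing (possible challenge page)"
    return "ok"
```

To use it, call fetch("https://example.com/your-page", "OAI-SearchBot") from a machine outside your network and pass the result to diagnose along with a phrase you know appears in the page's HTML. The phrase check covers the case where a challenge page comes back in place of your content.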
About the author
Mihir Naik — Senior Product Manager (AI) at seoClarity, building Clarity ArcAI. Born in Surat, India; based in Toronto. In SEO since 2011. Available for consulting.