Guide · Published: June 30, 2026 · Last updated: June 30, 2026 · ~5 min read
Mihir Naik, Senior PM (AI) at seoClarity

How to Audit Your Site for AI Search Crawlability (Step-by-Step)

This is the part to start with if you’d rather diagnose than theorize. It’s a repeatable audit that finds why AI crawlers can’t see your site—log analysis, per-bot robots checks, render-diff testing, and CDN/WAF review—plus a scorecard to turn findings into a prioritized fix list.
Executive summary
Run this audit in order; each step assumes the previous one passed. The goal is a short, ranked list of fixes—access problems first, because they have the highest leverage and are usually the cheapest to fix.
  • Step 1 — Logs: which AI bots hit you, and what status codes they get.
  • Step 2 — robots.txt: per-bot allow/disallow, checked in production.
  • Step 3 — Rendering: is your content in the raw HTML, or only after JavaScript?
  • Step 4 — Edge & crawl health: CDN/WAF rules, status codes, canonicals, sitemap.
  • Then score each area and fix the reds before the yellows.

What you’re auditing for

You’re checking the three things every AI citation depends on: can a crawler reach the page, read the content, and reuse it? This audit walks the first two—reach and read—because that’s where technical invisibility lives.
Pick your sample before you start:
  1. Your highest-value pages (key product, solution, and resource pages).
  2. One page per template type (you’re testing templates, not individual URLs).
  3. Any page you know should be cited in AI answers but isn’t.

Step 1 — Log-file analysis: who’s crawling, what they get

Server or CDN access logs are the only ground truth for crawler behavior. Filter for AI user-agents and look at the status codes they receive.
Find AI bot hits and their status codes
# Count status codes returned to major AI crawlers
# (assumes the combined log format, where whitespace-separated field 9 is the HTTP status)
grep -Ei "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot" access.log \
  | awk '{print $9}' | sort | uniq -c | sort -rn
Read the results:
  • Lots of 200s for retrieval bots → access is healthy; focus later steps on rendering and extractability.
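  • 403 / 429 / 503 for retrieval bots → a hard block or throttle, almost always at the CDN/WAF or origin (robots.txt is advisory and never changes a status code). Highest-priority fix.
  • No hits at all, or hits on /robots.txt only → the bots can’t discover you (sitemap/links), robots.txt disallows them, or something upstream is dropping them before the request is logged.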
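The one-liner above pools every crawler into a single count, so a per-bot breakdown can make the results easier to read. The following is a minimal sketch, assuming the combined log format; the function name `tally_bot_statuses` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# User-agent substrings taken from the grep pattern in this step.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]

# Assumed layout (combined log format):
#   ... "METHOD /path HTTP/x" STATUS BYTES "referer" "user-agent"
LINE_RE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def tally_bot_statuses(lines):
    """Return a Counter keyed by (bot, status) for lines sent by known AI bots."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not in the assumed format; skip rather than guess
        user_agent = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                counts[(bot, match.group("status"))] += 1
                break
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [30/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [30/Jun/2026:10:00:05 +0000] "GET /docs HTTP/1.1" 403 162 "-" '
    '"Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.5 - - [30/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0"',
]

if __name__ == "__main__":
    for (bot, status), n in sorted(tally_bot_statuses(SAMPLE_LOG).items()):
        print(f"{bot:15} {status} {n}")
```

To run it against a real file, pass an open file handle instead of `SAMPLE_LOG`; logs in another format would need a different regular expression.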
Verify suspicious user-agents by IP before trusting them—plenty of scrapers spoof GPTBot. Real crawlers come from the operator’s published ranges.
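One way to do the IP check is to test each logged address against the operator's published CIDR ranges. The sketch below uses Python's standard `ipaddress` module; the `PUBLISHED_RANGES` table is a hypothetical placeholder filled with reserved documentation networks, so it would need to be replaced with the ranges each operator actually publishes.

```python
import ipaddress

# PLACEHOLDER data: these are reserved documentation networks (RFC 5737),
# NOT real crawler ranges. Substitute each operator's published ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified(bot, ip, ranges=PUBLISHED_RANGES):
    """True if `ip` falls inside any range listed for `bot`."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr) for cidr in ranges.get(bot, [])
    )


if __name__ == "__main__":
    print(is_verified("GPTBot", "192.0.2.44"))   # inside the placeholder range
    print(is_verified("GPTBot", "203.0.113.9"))  # outside it: treat as spoofed
```

A bot with no entry in the table is reported as unverified, which errs on the side of distrust.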

Step 2 — robots.txt and directives, per bot

Confirm what your production robots.txt actually says for each crawler, and check page-level directives that suppress snippets.
Pull production robots.txt and snippet directives
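# Fetch the live robots.txt (production, not staging)
curl -s https://example.com/robots.txt

# Check for snippet-suppressing directives in the page markup
curl -s https://example.com/your-page \
  | grep -iE "noindex|nosnippet|max-snippet|data-nosnippet"

# The same directives can also arrive as an HTTP response header
curl -sI https://example.com/your-page | grep -i "x-robots-tag"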
  • Confirm retrieval bots are allowed and that no broad Disallow: / is present.
  • Confirm you’re not disallowing the directories your real content lives in.
  • Flag any noindex/nosnippet that would make a page ineligible for AI snippets.
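The per-bot allow/disallow check in the list above can be sketched with Python's standard `urllib.robotparser`. The helper name `robots_verdicts` and the sample robots.txt are illustrative assumptions. The standard-library parser may also resolve overlapping rules differently from an individual crawler's own parser, so treat the output as a first pass.

```python
from urllib.robotparser import RobotFileParser


def robots_verdicts(robots_txt, bots, url):
    """Map each bot name to True (allowed) or False (disallowed) for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Illustrative robots.txt: GPTBot is blocked everywhere, others only from /admin/.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    bots = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]
    verdicts = robots_verdicts(
        SAMPLE_ROBOTS, bots, "https://example.com/products/widget"
    )
    for bot, allowed in verdicts.items():
        print(f"{bot:15} {'allowed' if allowed else 'DISALLOWED'}")
```

Paste in the text returned by the `curl` command for robots.txt and test one URL per template from your sample.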

Step 3 — Render-diff testing

Compare what’s in the raw HTML against what’s visible in the browser. Crawlers that don’t execute JavaScript receive only the raw HTML, so the gap is the content they can’t see.
Raw vs. rendered
# Is your key content in the raw HTML?
curl -s https://example.com/your-page | grep -i "your key phrase"

# Missing here but present in the browser DOM = client-side only.
Do this once per template. A single CSR template can hide the content of thousands of URLs.
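A plain `grep` can report a false positive when the phrase appears only inside a `<script>` payload, such as a JSON blob used for hydration. The sketch below checks the text outside `<script>` and `<style>` tags instead. Treating script contents as unreadable is an assumption, and the helper name `missing_from_raw_html` and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class _TextOutsideScripts(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_raw_html(raw_html, phrases):
    """Return the phrases that do not appear in the raw HTML's readable text."""
    parser = _TextOutsideScripts()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [p for p in phrases if " ".join(p.split()).lower() not in text]


# Hypothetical pages: one server-rendered, one client-side-rendered shell.
SSR_PAGE = "<main><h1>Acme Widget pricing</h1><p>Plans start at $10.</p></main>"
CSR_PAGE = (
    '<div id="root"></div>'
    '<script>window.__DATA__ = {"title": "Acme Widget pricing"}</script>'
)

if __name__ == "__main__":
    print(missing_from_raw_html(SSR_PAGE, ["Acme Widget pricing"]))
    print(missing_from_raw_html(CSR_PAGE, ["Acme Widget pricing"]))
```

Feed it the output of `curl -s` for one URL per template, along with two or three phrases that must be citable.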

Step 4 — Edge rules and crawl health

If logs showed blocks but robots.txt is clean, the culprit is almost always the edge. Then sweep the basics that quietly break crawlability.
  • CDN/WAF: review bot-management rules for anything challenging or blocking AI user-agents; allowlist the retrieval bots you’ve chosen.
  • Status codes: key pages return 200 (not soft-404s, 302 chains, or 5xx under crawler load).
  • Canonicals: each page points to itself or the right target—no conflicting or cross-domain canonicals.
  • Sitemap: present, current, listed in robots.txt, and free of redirected or noindex URLs.
  • Performance: responses fast enough that crawlers don’t time out before receiving content.
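The canonical check in the list above lends itself to a small script. This is a sketch under stated assumptions: it flags only the two problems named there (conflicting and cross-domain canonicals), compares hosts literally, and uses a hypothetical helper name, `canonical_issues`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def canonical_issues(page_url, raw_html):
    """Return a list of canonical problems found in the page's raw HTML."""
    finder = _CanonicalFinder()
    finder.feed(raw_html)
    finder.close()
    # Resolve relative hrefs against the page URL before comparing.
    targets = {urljoin(page_url, href) for href in finder.hrefs}
    issues = []
    if len(targets) > 1:
        issues.append("conflicting canonicals")
    page_host = urlsplit(page_url).netloc.lower()
    if any(urlsplit(t).netloc.lower() != page_host for t in targets):
        issues.append("cross-domain canonical")
    return issues


if __name__ == "__main__":
    html = '<head><link rel="canonical" href="https://other.example/p"></head>'
    print(canonical_issues("https://example.com/p", html))
```

A cross-domain canonical is sometimes intentional (syndicated content, for example), so the output is a review list and each flag still needs a human decision.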
For leadership
Most teams discover the biggest fix is one or two edge/robots changes—not a content program. Audit before you budget.

The AI-readiness scorecard

Score each area red / yellow / green, then fix top-down. Access beats rendering beats extractability for leverage.
Score each dimension:
  • Access — Green: retrieval bots get 200s. Yellow: inconsistent. Red: 403/429 or blanket disallow.
  • robots.txt — Green: retrieval allowed, training decided deliberately. Yellow: overly broad rules. Red: Disallow: / or content dirs blocked.
  • Rendering — Green: content in raw HTML. Yellow: partial. Red: client-side only.
  • Extractability — Green: semantic, content in main document. Yellow: buried in tabs/boilerplate. Red: text in images / interaction-gated.
  • Crawl health — Green: clean status/canonical/sitemap. Yellow: minor issues. Red: 5xx, redirect chains, broken sitemap.
Fix order:
  1. Clear every red in Access and robots.txt first—these can zero out AI visibility entirely.
  2. Resolve Rendering reds next—no content in HTML means nothing to cite.
  3. Then Extractability and Crawl-health yellows for accuracy and coverage.
  4. Re-run Steps 1 and 3 to confirm bots now get 200s with your content in the raw HTML.
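The fix order above can be expressed as a sort: severity first (reds before yellows), then area in leverage order. This sketch is one interpretation of the scorecard, and the function name `fix_list` and the sample scores are invented for illustration.

```python
# Areas in leverage order, as listed in the scorecard.
AREA_ORDER = ["Access", "robots.txt", "Rendering", "Extractability", "Crawl health"]
SEVERITY_ORDER = {"red": 0, "yellow": 1, "green": 2}


def fix_list(scores):
    """Turn {area: 'red'|'yellow'|'green'} into a ranked list of (area, score).

    Greens are dropped. Reds sort before yellows, and ties are broken by
    the area's position in AREA_ORDER.
    """
    todo = [(area, score) for area, score in scores.items() if score != "green"]
    return sorted(
        todo,
        key=lambda item: (SEVERITY_ORDER[item[1]], AREA_ORDER.index(item[0])),
    )


if __name__ == "__main__":
    # Made-up audit result for illustration.
    audit = {
        "Access": "green",
        "robots.txt": "red",
        "Rendering": "yellow",
        "Extractability": "red",
        "Crawl health": "yellow",
    }
    for rank, (area, score) in enumerate(fix_list(audit), start=1):
        print(f"{rank}. {area} ({score})")
```

Keeping the ranking in code makes it easy to re-score after each fix and confirm the list is shrinking.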