Guide · Published: June 30, 2026 · Last updated: June 30, 2026 · ~4 min read
Mihir Naik, Senior PM (AI) at seoClarity

AI Search Crawlers: GPTBot, ClaudeBot, PerplexityBot & the Bots Reading Your Site

Before you can decide what to allow or block, you need to know who’s actually crawling, and why. AI crawlers do one of two jobs that look alike in your logs but mean very different things for visibility: training and retrieval. This reference maps the major bots to their job and user-agent so you can make deliberate choices instead of inheriting defaults.
Executive summary
Not every AI crawler affects whether you show up in AI answers. Training crawlers feed models; retrieval crawlers fetch live content to build and cite answers. Confusing the two is how sites accidentally disappear from AI search.
  • Retrieval bots (OAI-SearchBot, PerplexityBot, and the systems behind Google’s AI surfaces) are the ones that decide your citation visibility.
  • Training bots and controls (GPTBot, CCBot, and the Google-Extended robots.txt token) are a separate, content-licensing decision. Blocking them does not remove you from AI answers.
  • User-agents and IP ranges change; verify against each provider’s published documentation before writing rules.

Training vs. retrieval: the distinction that decides visibility

Every AI crawler is doing one of two jobs. Telling them apart is the most important thing on this page.
Training crawlers
  • Collect content to train or fine-tune large language models.
  • Blocking them is a legitimate intellectual-property choice and has no direct effect on whether you’re cited in live AI answers.
  • Examples: GPTBot (OpenAI), CCBot (Common Crawl, which feeds many models), Google-Extended (Gemini training control), Applebot-Extended.
Retrieval / search crawlers
  • Fetch live content at answer time (or to maintain a fresh answer index) and are the path to being cited.
  • Blocking them directly removes you from those products’ AI answers.
  • Examples: OAI-SearchBot (ChatGPT search), PerplexityBot (Perplexity), and the crawling behind Google’s AI Overviews / AI Mode, which still relies on Googlebot-class access.
Rule of thumb: if a bot’s job is to answer a user right now, treat blocking it as a revenue decision—not a privacy default.
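One way to keep the two jobs separate is to test a robots.txt policy before shipping it. The sketch below is a minimal illustration using Python's standard urllib.robotparser. The policy text, the example.com URL, and the can_fetch_map helper are assumptions made up for this example, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training, stay open to retrieval.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch_map(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Illustrative check against a made-up URL.
agents = ["GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "Googlebot"]
for agent, allowed in can_fetch_map(
    ROBOTS_TXT, agents, "https://example.com/pricing"
).items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

Under this assumed policy the training crawlers are disallowed while the retrieval crawlers fall through to the wildcard group. Python's parser matches user-agents more loosely than some crawlers do, so treat the result as a sanity check rather than proof of how each bot will behave.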
For leadership
When someone proposes ‘blocking AI bots,’ ask which job they mean. Opting out of training is reasonable; opting out of retrieval means opting out of AI-search demand.

The crawlers worth knowing by name

A working roster of the agents most likely to show up in your logs, grouped by who operates them. Always confirm the current user-agent token and IP ranges against the operator’s official docs before relying on them.

OpenAI

  • GPTBot — training crawler. Controlled in robots.txt via the GPTBot user-agent.
  • OAI-SearchBot — retrieval crawler for ChatGPT search results and citations.
  • ChatGPT-User — fetches a page in response to a specific user action inside ChatGPT.

Anthropic

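  • ClaudeBot — Anthropic’s crawler for collecting training data. Controlled in robots.txt via the ClaudeBot user-agent.
  • Claude-User / Claude-SearchBot — Claude-User fetches a page in response to a user’s request; Claude-SearchBot handles search-style retrieval.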

Perplexity, Google, Microsoft & others

  • PerplexityBot — Perplexity’s retrieval crawler for its answer engine.
  • Googlebot — still the access path for Google’s AI Overviews and AI Mode; Google-Extended only governs Gemini training, not ranking or retrieval.
  • Bingbot — powers Microsoft Copilot answers via the Bing index.
  • Amazonbot, Meta-ExternalAgent, Bytespider, CCBot — additional crawlers worth identifying so you can decide on each.
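The roster above can be kept as data so that log tooling and policy reviews share one source of truth. This is a rough sketch: the Crawler class, the job labels, and the sample user-agent strings are assumptions for illustration, and the real user-agent strings should be taken from each operator's documentation. Google-Extended is left out because it is a robots.txt control token rather than something that appears in request headers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Crawler:
    token: str     # substring expected in the User-Agent header
    operator: str
    job: str       # "training", "retrieval", "user-fetch", or "unclassified"


# Job labels follow this guide; "unclassified" means the guide does not assign one.
ROSTER = [
    Crawler("GPTBot", "OpenAI", "training"),
    Crawler("OAI-SearchBot", "OpenAI", "retrieval"),
    Crawler("ChatGPT-User", "OpenAI", "user-fetch"),
    Crawler("ClaudeBot", "Anthropic", "training"),
    Crawler("Claude-User", "Anthropic", "user-fetch"),
    Crawler("Claude-SearchBot", "Anthropic", "retrieval"),
    Crawler("PerplexityBot", "Perplexity", "retrieval"),
    Crawler("Googlebot", "Google", "retrieval"),
    Crawler("Bingbot", "Microsoft", "retrieval"),
    Crawler("CCBot", "Common Crawl", "training"),
    Crawler("Amazonbot", "Amazon", "unclassified"),
    Crawler("Meta-ExternalAgent", "Meta", "unclassified"),
    Crawler("Bytespider", "ByteDance", "unclassified"),
]


def classify(user_agent: str) -> Optional[Crawler]:
    """Return the first roster entry whose token appears in the header."""
    ua = user_agent.lower()
    for crawler in ROSTER:
        if crawler.token.lower() in ua:
            return crawler
    return None


# Made-up header for illustration, not a verbatim real user-agent string.
hit = classify("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)")
print(hit.operator, hit.job)
```

Substring matching is deliberately loose and says nothing about whether the request is genuine; it only tells you which bot the request claims to be.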

How to see which bots actually reach you

The roster above is theory until you check your own logs. Your server or CDN access logs are the ground truth for who’s crawling and what they receive.
From your logs, answer three questions per bot:
  1. Is it hitting you at all? Filter by user-agent for each crawler above.
  2. What status code does it get? 200 means access; 403/429/503 means something is turning it away.
  3. Is the traffic real? Spoofed user-agents are common, so verify with a reverse-DNS lookup (confirmed by a forward lookup of the returned hostname) or against the operator’s published IP ranges before trusting or blocking.
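For the third question, a published-range check can be sketched with Python's standard ipaddress module. The ranges below are placeholders from the reserved documentation blocks, not any operator's real addresses, and the PUBLISHED_RANGES table and is_verified helper are hypothetical names for this example. In practice you would load the current ranges from each operator's documentation.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), NOT real operator ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/25"],
}


def is_verified(token, client_ip, ranges=PUBLISHED_RANGES):
    """True if client_ip falls inside a range listed for the claimed bot."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address in the log line
    return any(
        ip in ipaddress.ip_network(cidr) for cidr in ranges.get(token, [])
    )


print(is_verified("GPTBot", "192.0.2.44"))      # inside the placeholder range
print(is_verified("GPTBot", "203.0.113.9"))     # claims GPTBot, wrong network
```

A request that names a known bot but fails this check is a candidate spoof. A bot with no entry in the table is treated as unverified here, which is a conservative assumption rather than a rule from any operator.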
A retrieval bot consistently receiving 403s is a direct, measurable loss of AI visibility—and one of the fastest wins you’ll find.
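To find that kind of loss, the first two questions can be answered with a small tally over access logs. The sketch below assumes the common "combined" log format. The sample lines, the bot list, and the 50% threshold are invented for illustration, so adjust the pattern to whatever your server or CDN actually writes.

```python
import re
from collections import Counter, defaultdict

BOT_TOKENS = ["OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot"]
RETRIEVAL = {"OAI-SearchBot", "PerplexityBot"}
BLOCKING_STATUSES = (403, 429, 503)

# Request line, status, size, referrer, user-agent (combined log format).
LOG_LINE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def tally(lines):
    """Count responses per bot token and status code."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token.lower() in ua:
                counts[token][int(match.group("status"))] += 1
                break
    return counts


def blocked_share(counter):
    """Fraction of a bot's requests answered with 403, 429, or 503."""
    total = sum(counter.values())
    blocked = sum(n for s, n in counter.items() if s in BLOCKING_STATUSES)
    return blocked / total if total else 0.0


def blocked_retrieval_bots(counts, threshold=0.5):
    """Retrieval bots whose blocked share meets the (assumed) threshold."""
    return sorted(
        t for t in counts if t in RETRIEVAL and blocked_share(counts[t]) >= threshold
    )


# Invented sample lines; user-agent strings are simplified.
SAMPLE = [
    '203.0.113.7 - - [30/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [30/Jun/2026:10:00:05 +0000] "GET /docs HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [30/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.8 - - [30/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.9 - - [30/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

counts = tally(SAMPLE)
print(blocked_retrieval_bots(counts))
```

On the sample data only the bot with mostly 403 responses is flagged. The tally trusts the user-agent header as written, so it is worth pairing with an identity check before acting on the numbers.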
What’s next
Becoming visible in AI search is a technical problem with a revenue outcome. If you'd rather have it diagnosed and fixed, here's how I can help.
About the author
Mihir Naik — Senior Product Manager (AI) at seoClarity, building Clarity ArcAI. Born in Surat, India; based in Toronto. In SEO since 2011. Available for consulting.