AI Search Crawlers: GPTBot, ClaudeBot, PerplexityBot & the Bots Reading Your Site

Before you can decide what to allow or block, you need to know who is actually crawling your site, and why. AI crawlers do one of two jobs, training or retrieval, that can look alike in your logs but mean very different things for visibility. This reference maps the major bots to their job and user-agent so you can make deliberate choices instead of inheriting defaults.

Executive summary

Not every AI crawler affects whether you show up in AI answers. Training crawlers feed models; retrieval crawlers fetch live content to build and cite answers. Confusing the two is how sites accidentally disappear from AI search.

Retrieval bots (OAI-SearchBot, PerplexityBot, and the systems behind Google’s AI surfaces) are the ones that decide your citation visibility.
Training bots (GPTBot, CCBot, Google-Extended) are a separate, content-licensing decision—blocking them does not remove you from AI answers.
User-agents and IP ranges change; verify against each provider’s published documentation before writing rules.

Training vs. retrieval: the distinction that decides visibility

Every AI crawler is doing one of two jobs. Telling them apart is the most important thing on this page.

Training crawlers

Collect content to train or fine-tune large language models.
Blocking them is a legitimate intellectual-property choice and has no direct effect on whether you’re cited in live AI answers.
Examples: GPTBot (OpenAI), CCBot (Common Crawl, which feeds many models), Google-Extended (Gemini training control), Applebot-Extended.

Retrieval / search crawlers

Fetch live content at answer time (or to maintain a fresh answer index) and are the path to being cited.
Blocking them directly removes you from those products’ AI answers.
Examples: OAI-SearchBot (ChatGPT search), PerplexityBot (Perplexity), and Googlebot, whose regular crawl is what Google’s AI Overviews and AI Mode draw on.

Rule of thumb: if a bot’s job is to answer a user right now, treat blocking it as a revenue decision—not a privacy default.
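The split above can be checked mechanically before a robots.txt change ships. The sketch below uses Python's standard-library `urllib.robotparser` against a hypothetical robots.txt that opts out of training crawlers while leaving retrieval open. The file contents, the `allowed_agents` helper, and the example URL are assumptions for illustration, and Python's parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of training tokens, leave everything else open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: may_fetch} for the given robots.txt text and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "CCBot", "Google-Extended",
              "OAI-SearchBot", "PerplexityBot", "Googlebot"]
    results = allowed_agents(ROBOTS_TXT, agents, "https://example.com/pricing")
    for agent, ok in results.items():
        print(f"{agent:16} {'allowed' if ok else 'blocked'}")
```

If a retrieval agent comes back as blocked here, the rule that catches it deserves a second look before deployment.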

For leadership

When someone proposes ‘blocking AI bots,’ ask which job they mean. Opting out of training is reasonable; opting out of retrieval means opting out of AI-search demand.

The crawlers worth knowing by name

A working roster of the agents most likely to show up in your logs, grouped by who operates them. Always confirm the current user-agent token and IP ranges against the operator’s official docs before relying on them.

OpenAI

GPTBot: training crawler. Controlled in robots.txt via the GPTBot user-agent.
OAI-SearchBot: retrieval crawler for ChatGPT search results and citations.
ChatGPT-User: fetches a page in response to a specific user action inside ChatGPT.

Anthropic

ClaudeBot: Anthropic’s crawler for collecting training data. Controlled in robots.txt via the ClaudeBot user-agent.
Claude-User: fetches a page when a Claude user’s request calls for it.
Claude-SearchBot: crawls to support search-style retrieval in Claude.

Perplexity, Google, Microsoft & others

PerplexityBot: Perplexity’s retrieval crawler for its answer engine.
Googlebot: still the access path for Google’s AI Overviews and AI Mode. Google-Extended is a robots.txt control token rather than a separate crawler; it governs whether your content is used for Gemini training and grounding, and has no effect on Google Search inclusion or ranking.
Bingbot: builds the Bing index that Microsoft Copilot answers draw on.
Amazonbot (Amazon), Meta-ExternalAgent (Meta), Bytespider (ByteDance), CCBot (Common Crawl): additional crawlers worth identifying so you can decide on each.

User-agent strings, IP ranges, and even bot names change as providers split training from retrieval. Treat any static list (including this one) as a starting point to verify, not gospel.
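One way to keep the roster usable is to hold it as data and match it against the user-agent strings in your logs. The sketch below is a minimal version of that idea. The job labels follow this page's grouping rather than any official metadata; labelling ClaudeBot as training and the last three bots as "unclassified" are assumptions; and the sample user-agent string is illustrative. Google-Extended and Applebot-Extended are left out on the assumption that they act as robots.txt tokens and do not appear as user-agents in request logs.

```python
from typing import Optional

# token -> (operator, job). Labels follow this page's grouping; verify before use.
CRAWLER_ROSTER = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "retrieval"),
    "ChatGPT-User": ("OpenAI", "user-initiated"),
    "ClaudeBot": ("Anthropic", "training"),  # assumption: check Anthropic's docs
    "Claude-SearchBot": ("Anthropic", "retrieval"),
    "Claude-User": ("Anthropic", "user-initiated"),
    "PerplexityBot": ("Perplexity", "retrieval"),
    "Googlebot": ("Google", "retrieval"),
    "Bingbot": ("Microsoft", "retrieval"),
    "CCBot": ("Common Crawl", "training"),
    "Amazonbot": ("Amazon", "unclassified"),
    "Meta-ExternalAgent": ("Meta", "unclassified"),
    "Bytespider": ("ByteDance", "unclassified"),
}


def classify_user_agent(user_agent: str) -> Optional[tuple[str, str, str]]:
    """Return (token, operator, job) for the first roster token found, else None."""
    ua = user_agent.lower()
    # Check longer tokens first so a short token never shadows a longer one.
    for token in sorted(CRAWLER_ROSTER, key=len, reverse=True):
        if token.lower() in ua:
            operator, job = CRAWLER_ROSTER[token]
            return token, operator, job
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
    print(classify_user_agent(sample))
```

A user-agent match only tells you what a request claims to be, so it is a first filter rather than proof of identity.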

How to see which bots actually reach you

The roster above is theory until you check your own logs. Your server or CDN access logs are the ground truth for who’s crawling and what they receive.

From your logs, answer three questions per bot:

Is it hitting you at all? Filter by user-agent for each crawler above.
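What status code does it get? 200 means the bot received the page; 403 (forbidden, often a firewall or CDN bot rule), 429 (rate limited), or 503 (service unavailable) means something is turning it away.
Is the traffic real? Spoofed user-agents are common, so verify with a forward-confirmed reverse DNS lookup or against the operator’s published IP ranges before trusting or blocking.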
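For the third question, checking a request's source IP against published ranges could look like the sketch below, which uses the standard-library `ipaddress` module. The ranges shown are placeholders drawn from reserved documentation blocks, not real crawler addresses, and `PUBLISHED_RANGES` and `is_verified` are hypothetical names. In practice you would load each operator's current published list instead.

```python
import ipaddress

# PLACEHOLDER ranges from reserved documentation blocks, NOT real crawler IPs.
# Replace with the ranges each operator currently publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/25", "2001:db8::/32"],
}


def is_verified(claimed_bot: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside a range listed for the claimed bot."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # malformed address in the log line
    for cidr in PUBLISHED_RANGES.get(claimed_bot, []):
        network = ipaddress.ip_network(cidr)
        if ip.version == network.version and ip in network:
            return True
    return False


if __name__ == "__main__":
    print(is_verified("GPTBot", "192.0.2.44"))   # inside the placeholder range
    print(is_verified("GPTBot", "203.0.113.9"))  # outside it: likely spoofed
```

A request that claims a bot's user-agent but fails this check is a candidate for spoofing, and should not count toward that bot's traffic.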

A retrieval bot consistently receiving 403s is a direct, measurable loss of AI visibility—and one of the fastest wins you’ll find.
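The first two questions can be answered with a short pass over the access log. The sketch below assumes the common "combined" log format used by Apache and nginx; the sample lines, the token list, and the helper names are illustrative assumptions, so adjust the pattern if your server or CDN logs in a different layout.

```python
import re
from collections import Counter, defaultdict

# Tokens to look for in the user-agent field (a subset of the roster above).
BOT_TOKENS = ["OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot",
              "Googlebot", "Bingbot"]

# Matches: "METHOD PATH PROTO" STATUS BYTES "REFERER" "USER-AGENT"
LOG_LINE = re.compile(
    r'"\S+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

BLOCKING_STATUSES = (403, 429, 503)

SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [10/Oct/2024:13:56:02 +0000] "GET /docs HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [10/Oct/2024:13:57:40 +0000] "GET / HTTP/1.1" 200 8841 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.3 - - [10/Oct/2024:14:01:11 +0000] "GET / HTTP/1.1" 200 8841 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.10 - - [10/Oct/2024:14:02:00 +0000] "GET / HTTP/1.1" 200 8841 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]


def tally_bot_statuses(lines):
    """Count HTTP status codes per bot token found in the user-agent field."""
    tallies = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token.lower() in ua:
                tallies[token][int(match.group("status"))] += 1
                break
    return tallies


def blocked_share(counter):
    """Fraction of a bot's requests that received 403, 429, or 503."""
    total = sum(counter.values())
    blocked = sum(n for status, n in counter.items() if status in BLOCKING_STATUSES)
    return blocked / total if total else 0.0


if __name__ == "__main__":
    for bot, counts in tally_bot_statuses(SAMPLE_LOG).items():
        print(bot, dict(counts), f"blocked share: {blocked_share(counts):.0%}")
```

A high blocked share for a retrieval bot is the signal to chase first, usually in firewall, CDN, or rate-limit settings rather than in robots.txt.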

Part 2 of 6

A part of Why AI Search Engines Can’t See Your Website — and How to Fix It.

About the author

Mihir Naik — Senior Product Manager (AI) at seoClarity, building Clarity ArcAI. Born in Surat, India; based in Toronto. In SEO since 2011. Available for consulting.