Training vs. retrieval: the distinction that decides visibility
Every AI crawler is doing one of two jobs. Telling them apart is the most important thing on this page.
Training crawlers
- Collect content to train or fine-tune large language models.
- Blocking them is a legitimate intellectual-property choice and has no direct effect on whether you’re cited in live AI answers.
- Examples: GPTBot (OpenAI), CCBot (Common Crawl, which feeds many models), Google-Extended (Gemini training control), Applebot-Extended.
Retrieval / search crawlers
- Fetch live content at answer time, or keep a fresh answer index up to date. This is the path to being cited.
- Blocking them directly removes you from those products’ AI answers.
- Examples: OAI-SearchBot (ChatGPT search), PerplexityBot (Perplexity), and the crawling behind Google’s AI Overviews / AI Mode, which still relies on Googlebot-class access.
Rule of thumb: if a bot’s job is to answer a user right now, treat blocking it as a revenue decision, not a privacy default.
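To make the split concrete, here is a minimal sketch of a robots.txt policy that opts out of the training crawlers named above while leaving the retrieval crawlers allowed, checked with Python's standard-library `urllib.robotparser`. The policy text, the `check_access` helper, and the `https://example.com/pricing` URL are illustrative assumptions, not a recommended configuration. Also note that robots.txt is advisory, and each crawler's own matching rules may differ from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block training crawlers, allow retrieval crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def check_access(robots_txt, agents, url="https://example.com/pricing"):
    """Return {agent: allowed} for a placeholder URL under the given policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = [
        "GPTBot", "CCBot", "Google-Extended", "Applebot-Extended",  # training
        "OAI-SearchBot", "PerplexityBot", "Googlebot",              # retrieval
    ]
    for agent, allowed in check_access(ROBOTS_TXT, agents).items():
        print(f"{agent:20} {'allowed' if allowed else 'blocked'}")
```

Running a check like this before deploying a robots.txt change is a cheap way to confirm that a "block AI bots" edit did not also catch a retrieval crawler.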
For leadership
When someone proposes ‘blocking AI bots,’ ask which job they mean. Opting out of training is reasonable; opting out of retrieval means opting out of AI-search demand.
