AI Crawler Policy Builder

Build robots.txt Disallow rules for AI training crawlers, AI search crawlers, and user-triggered fetchers.

Build a crawler policy

Choose the AI training crawlers, search crawlers, and user-triggered fetchers you want to disallow, then copy or download the generated block for your root robots.txt file.

Blocked AI crawlers
GPTBot (Training): OpenAI crawler for content that may be used to train generative AI foundation models.
Google-Extended (Training): Google control token for AI training and Gemini grounding; not a separate HTTP crawler.
Applebot-Extended (Training): Apple usage-control token for foundation model training; Apple says it does not crawl pages.
CCBot (Training): Common Crawl crawler for the public web crawl corpus used by researchers and data teams.
ClaudeBot (Training): Anthropic crawler for web content that could contribute to future model training datasets.
Amazonbot (Training): Amazon crawler used to improve products and services, and possibly train Amazon AI models.
Bytespider (Training): ByteDance crawler token commonly included in AI crawler blocklists.
FacebookBot (Training): Meta-associated crawler token commonly tracked in AI crawler lists.
meta-externalagent (Training): Meta external agent token commonly listed for AI training and product indexing controls.
cohere-training-data-crawler (Training): Cohere training data crawler token for model dataset collection.
Applebot (Search): Apple search crawler for Spotlight, Siri, Safari, and related Apple search experiences.
Claude-SearchBot (Search): Anthropic search crawler used to improve relevance and accuracy in search responses.
Amzn-SearchBot (Search): Amazon search crawler for search experiences such as Alexa; not for generative AI training.
PerplexityBot (Search): Perplexity search crawler for surfacing and linking websites; not used for foundation model training.
OAI-SearchBot (Search): OpenAI search crawler that controls whether pages can appear in ChatGPT search answers.
Amzn-User (User fetch): Amazon user-triggered fetcher for live answers, such as Alexa queries.
ChatGPT-User (User fetch): User-triggered OpenAI fetcher; OpenAI notes robots.txt may not apply to these requests.
Claude-User (User fetch): User-triggered Claude fetcher; blocking it may reduce visibility in user-directed web search.
Perplexity-User (User fetch): User-requested Perplexity fetcher that may visit pages to answer questions and cite sources.

Generated robots.txt

robots.txt is a voluntary crawler signal. It is not authentication, legal advice, or a guaranteed block against clients that ignore or spoof crawler rules.

How to use this AI crawler policy builder

Use this builder when you want a clear starting point for controlling how AI crawlers, AI search systems, and training-data crawlers interact with public pages on your site.

  1. Review each crawler token and leave checked anything you want to block. Training tokens are selected by default; search and user-triggered fetchers are optional.
  2. Use / to apply the policy to the whole site, or enter specific paths if only part of your site needs restrictions.
  3. Add extra crawler tokens from your server logs or vendor documentation if you want to block them too.
  4. Copy or download the generated block, merge it into the robots.txt file at the root of your domain, and recheck vendor documentation when policies change.
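Step 3 mentions pulling extra tokens from server logs. The following is a minimal sketch of one way to do that, assuming the access log uses the common "combined" format where the user agent is the last double-quoted field. The function name `count_crawler_hits`, the token subset, and the sample log lines are all illustrative and not part of the builder. Because user-agent strings can be spoofed, treat the counts as a hint rather than proof of who made the request.

```python
import re
from collections import Counter

# Illustrative subset of the tokens listed by the builder.
KNOWN_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot"]

# Assumption: combined log format, user agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_crawler_hits(log_lines, tokens=KNOWN_TOKENS):
    """Count requests whose user-agent string contains a known token."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not end with a quoted field
        user_agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Made-up log lines for illustration only.
SAMPLE_LOG = [
    '198.51.100.7 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '198.51.100.8 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '198.51.100.9 - - [01/Jan/2025:00:00:03 +0000] "GET /c HTTP/1.1" 200 512 "-" "CCBot/2.0"',
    '198.51.100.7 - - [01/Jan/2025:00:00:04 +0000] "GET /d HTTP/1.1" 200 512 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
]

print(dict(count_crawler_hits(SAMPLE_LOG)))  # {'GPTBot': 2, 'CCBot': 1}
```

Any token that shows up in your own logs but is missing from the builder's list can be entered as an extra crawler token.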

AI Crawler Policy Builder features

  • Generate robots.txt Disallow blocks for common AI crawler and AI search user-agent tokens.
  • Apply rules to the whole site or to specific public paths.
  • Add extra user-agent tokens from logs, vendor documentation, or publisher blocklists.
  • Copy or download a ready-to-review robots.txt block without sending your site data to a server.

What an AI crawler policy can and cannot do

An AI crawler policy is usually expressed through robots.txt, the plain-text file placed at the root of a website to give automated crawlers instructions about which URLs they may fetch. Search engines have used this convention for decades. AI companies, search engines, data providers, and research crawlers now publish their own user-agent tokens so site owners can signal whether public content should be crawled for search, model training, retrieval, summaries, or other product features.

The important word is signal. A robots.txt rule is not a password, firewall, or access-control system. Well-behaved crawlers fetch the file, parse the matching User-agent group, and follow its Allow and Disallow directives. A crawler that does not check robots.txt, spoofs a user agent, uses a third-party fetcher, or accesses cached copies may not behave the way your file asks it to behave. Sensitive, private, paid, or legally restricted material should be protected with real access controls instead of relying on crawler instructions.
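To make the "parse the matching User-agent group" step concrete, here is a small sketch that uses Python's standard-library `urllib.robotparser` to play the role of a well-behaved crawler. The sample file contents, the `is_allowed` helper, and the example.com URLs are assumptions for illustration. Real crawlers use their own parsers, and their matching rules may differ from this module in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block GPTBot everywhere, keep /private/ off-limits
# for every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if this parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot matches its own group and is blocked site-wide.
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1"))

# OAI-SearchBot has no group of its own, so it falls back to "*".
print(is_allowed(ROBOTS_TXT, "OAI-SearchBot", "https://example.com/articles/1"))
print(is_allowed(ROBOTS_TXT, "OAI-SearchBot", "https://example.com/private/x"))
```

Nothing in this check stops a client that never runs it, which is why the file works only as a signal.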

The best policy depends on your tradeoff. A publisher may block training crawlers while still allowing search crawlers that send referral traffic. A SaaS company may allow broad indexing for documentation but block training use for gated templates. A community site may block archive-style crawlers to reduce load. A product catalog may permit traditional search but review AI answer crawlers carefully if summaries could replace visits. This builder makes those choices visible by separating crawler tokens and paths before you publish anything.

Crawler names also change. OpenAI documents separate tokens for search, training, and user-triggered requests. Google uses Google-Extended as a robots.txt control token rather than a separate HTTP user-agent string. Apple documents Applebot-Extended for foundation-model training opt-out, while Applebot itself is the search crawler. Anthropic and Amazon document separate search and user-triggered agents alongside training crawlers. Treat the generated policy as a starting point that should be checked against current vendor documentation and your own server logs.

How the policy is generated

The builder collects selected crawler tokens, additional crawler tokens, target paths, and an optional sitemap URL. It removes duplicate tokens, normalizes blank lines, and writes one robots.txt group per crawler. Each group outputs Disallow directives for the selected paths. The default selections favor blocking training and data-use tokens while leaving search and user-triggered fetchers available unless you choose to block them.
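The generation steps described above can be sketched roughly as follows. This is not the builder's actual source. The function name `build_policy`, the case-insensitive de-duplication, and the choice to prefix a missing leading slash are assumptions made for the example.

```python
def build_policy(tokens, paths=("/",), sitemap=None):
    """Return a robots.txt block with one Disallow group per crawler token."""
    # De-duplicate tokens, keeping the first spelling seen.
    # Assumption: tokens are compared case-insensitively.
    seen = set()
    unique_tokens = []
    for token in tokens:
        token = token.strip()
        if token and token.lower() not in seen:
            seen.add(token.lower())
            unique_tokens.append(token)

    # Drop blank paths and make sure each path starts with "/".
    clean_paths = []
    for path in paths:
        path = path.strip()
        if not path:
            continue
        if not path.startswith("/"):
            path = "/" + path
        if path not in clean_paths:
            clean_paths.append(path)
    if not clean_paths:
        clean_paths = ["/"]  # fall back to the whole site

    # One group per crawler, each with the same Disallow lines.
    groups = []
    for token in unique_tokens:
        lines = [f"User-agent: {token}"]
        lines += [f"Disallow: {path}" for path in clean_paths]
        groups.append("\n".join(lines))

    if sitemap:
        groups.append(f"Sitemap: {sitemap}")
    if not groups:
        return ""
    # Exactly one blank line between groups, one trailing newline.
    return "\n\n".join(groups) + "\n"


print(build_policy(["GPTBot", "CCBot", "gptbot", " "], ["/drafts/", "premium"]))
```

In this sketch the duplicate `gptbot` and the blank token are dropped, so the output contains two groups (GPTBot and CCBot), each disallowing `/drafts/` and `/premium`.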

When you append the generated block to an existing robots.txt file, keep matching behavior in mind. Crawlers generally follow the group whose User-agent line matches their token most specifically and fall back to the * group only when no named group matches. The Robots Exclusion Protocol (RFC 9309) says that multiple groups for the same token should be combined, but not every parser does this, and duplicate groups make a file harder to reason about either way. If your current file already has a group for one of these tokens, merge the generated directives into that existing group instead of publishing conflicting sections. After publishing, use search-console tools, crawler documentation, or server logs to confirm the file is reachable at /robots.txt.
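Before merging, it can help to check which of the tokens you are about to add already have a group in the current file. The sketch below does a simple line scan. The helper names `existing_group_tokens` and `find_conflicts` and the sample file are hypothetical, and the scan only looks at User-agent lines rather than fully parsing the file.

```python
def existing_group_tokens(robots_txt):
    """Return the lowercased user-agent tokens that already have a group."""
    tokens = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent":
            tokens.add(value.strip().lower())
    return tokens


def find_conflicts(existing_robots_txt, new_tokens):
    """List the new tokens that the existing file already addresses."""
    existing = existing_group_tokens(existing_robots_txt)
    return [token for token in new_tokens if token.lower() in existing]


# Illustrative existing file.
CURRENT_FILE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot  # added earlier
Disallow: /drafts/
"""

print(find_conflicts(CURRENT_FILE, ["gptbot", "CCBot"]))  # ['gptbot']
```

Any token this reports is a candidate for merging into its existing group rather than appending a second one.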

The builder runs in your browser. It does not fetch your domain, inspect your server logs, verify IP ranges, or make claims about legal enforceability. Those steps may still matter for larger publishers, high-value content sites, or domains with unusual crawler traffic. For production policies, keep a dated internal note explaining why each crawler is allowed or blocked so future teams can update the file deliberately instead of copying old rules forward forever.

AI crawler policy FAQ

Should I block every AI crawler?
Not always. Blocking every AI crawler may reduce some forms of AI training or answer exposure, but it can also affect AI search visibility depending on the crawler. Many sites prefer to block training tokens while allowing search-oriented crawlers that can send users back to source pages.
Is robots.txt enough to protect private content?
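No. robots.txt is public and voluntary. Use authentication, authorization, paywall controls, or server-level protections for content that must not be accessed. A noindex header only asks search engines to keep a page out of their index; it does not stop the page from being fetched.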
Where does the generated policy go?
Publish it in the robots.txt file at the root of your domain, such as https://example.com/robots.txt. If you already have a robots.txt file, merge the generated groups with your existing rules.
Why include AI search crawlers separately from training crawlers?
Some providers separate crawlers used for search visibility from crawlers used for model training. Keeping those tokens separate lets you make a more precise tradeoff than blocking every AI-related agent.
Why is Applebot-Extended checked but Applebot is not?
Apple says Applebot-Extended is a robots.txt control token for opting out of Apple foundation-model training use, and that it does not crawl pages itself. Applebot is the actual search crawler for Apple experiences such as Spotlight, Siri, and Safari, so it is left unchecked by default.
Why are user-triggered fetchers unchecked?
User-triggered agents are typically used when a person asks an AI product to fetch or cite a page. Blocking them can reduce visibility in answer experiences, and some vendors note that robots.txt may not apply to user-initiated fetches the same way it applies to automated crawlers.
How often should I review the file?
Review it whenever you change content strategy, see new crawler names in logs, add gated content, migrate domains, or notice a vendor changing crawler documentation. AI crawler naming is still evolving.

Built and maintained by utilkit. Found an issue? Send corrections to contact@utilkit.com