AI Crawler Access: robots.txt and llms.txt

The first step in any GEO program is confirming that AI systems can actually reach your content. This sounds obvious, but the landscape of AI crawlers has expanded dramatically since 2023, and many sites are inadvertently blocking citation-eligible crawlers with overly broad robots.txt rules written before those bots existed. This section covers the full roster of AI crawlers, the critical distinction between training and inference crawlers, a recommended robots.txt configuration, and the emerging llms.txt convention — with an honest assessment of what it actually does.

The AI Crawler Taxonomy

AI bots fall into two functionally distinct categories, and conflating them is the most consequential mistake in access configuration.

Training crawlers collect content to train or fine-tune model weights. Blocking them prevents your content from influencing the model's parametric knowledge — what the model "knows" without looking anything up. The primary training crawlers are:

  • GPTBot — OpenAI's training crawler
  • ClaudeBot / anthropic-ai — Anthropic's training crawler (also appears as Claude-Web)
  • Google-Extended — Google's crawler for Gemini AI training (separate from regular Googlebot)
  • Applebot-Extended — Apple's AI training variant
  • Amazonbot — Amazon's training crawler

Inference and search crawlers retrieve content at query time to include in RAG (Retrieval Augmented Generation) responses. Blocking these directly prevents your content from being cited in live AI answers. This is the category that matters for GEO:

  • OAI-SearchBot — OpenAI's SearchGPT index crawler
  • ChatGPT-User — OpenAI's browsing mode user agent (real-time browsing during a conversation)
  • PerplexityBot — Perplexity's live index crawler
  • YouBot — You.com's crawler
  • PhindBot — Phind's crawler
  • ExaBot — Exa.ai's crawler
  • AndiBot — Andi Search crawler

Blocking a training crawler does not block AI search visibility. A site that blocks GPTBot but allows OAI-SearchBot and PerplexityBot can still appear in ChatGPT and Perplexity responses. This distinction is poorly understood, and many sites that believe they are "blocking AI" have only blocked training — which is a meaningful IP decision, but orthogonal to GEO.
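
That claim is easy to sanity-check with Python's standard-library robots.txt parser. A minimal sketch, using an illustrative policy that blocks GPTBot but allows OAI-SearchBot:

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: training crawler blocked, search-index crawler allowed.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(policy.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/guide/"))        # → False (training blocked)
print(rp.can_fetch("OAI-SearchBot", "https://example.com/guide/")) # → True  (inference allowed)
```

The same two-line check works against any live site by pointing `RobotFileParser.set_url` at its /robots.txt before calling `read()`.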

robots.txt Configuration

Strategy 1: Full AI Visibility (Recommended for GEO)

Allow all AI crawlers — both training and inference. This maximizes both parametric knowledge contribution and real-time citation eligibility.

# robots.txt — Full AI visibility configuration
# Updated: 2026-Q1 — audit quarterly as new AI crawlers emerge

User-agent: *
Allow: /

# OpenAI — training crawler
User-agent: GPTBot
Allow: /

# OpenAI — browsing / real-time retrieval
User-agent: ChatGPT-User
Allow: /

# OpenAI — SearchGPT index
User-agent: OAI-SearchBot
Allow: /

# Anthropic — training
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: anthropic-ai
Allow: /

# Perplexity — live index
User-agent: PerplexityBot
Allow: /

# Google — AI training (separate from Googlebot)
User-agent: Google-Extended
Allow: /

# Apple — AI training
User-agent: Applebot-Extended
Allow: /

# Amazon
User-agent: Amazonbot
Allow: /

# Other AI search crawlers
User-agent: YouBot
Allow: /
User-agent: PhindBot
Allow: /
User-agent: ExaBot
Allow: /
User-agent: AndiBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Strategy 2: Allow Search, Block Training

If your organization has legal or IP concerns about training data contribution but still wants AI search visibility, allow the inference crawlers while blocking training crawlers:

# robots.txt — Block training, allow AI search inference

# Block OpenAI training crawler, allow browsing and SearchGPT
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block Anthropic training crawlers
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /

# Block Google AI training (does not affect organic Googlebot)
User-agent: Google-Extended
Disallow: /

# Allow live search crawlers
User-agent: PerplexityBot
Allow: /
User-agent: YouBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Maintenance Cadence

The AI crawler landscape is not static. New bots appear every quarter. OpenAI alone has introduced three distinct user agents (GPTBot, ChatGPT-User, OAI-SearchBot) that serve different pipeline stages. Set a calendar reminder to audit your robots.txt against the current list of known AI crawlers every quarter.
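
The quarterly audit can be made mechanical with a short script that replays your policy against the crawler roster from this section. A sketch using the standard-library parser, with an illustrative policy string (in practice, fetch your live /robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Crawler roster from this section; extend it as new bots appear.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Amazonbot", "YouBot", "PhindBot", "ExaBot", "AndiBot",
]

def audit(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {crawler_name: allowed?} for each known AI user agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, url) for agent in AI_CRAWLERS}

# Illustrative policy: one training crawler blocked, everything else allowed.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

report = audit(policy)
for agent, allowed in sorted(report.items()):
    print(f"{agent:20} {'ALLOW' if allowed else 'BLOCK'}")
```

Running this after each robots.txt change makes an accidental block of an inference crawler visible immediately rather than at the next manual review.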

robots.txt is advisory, not enforced. Legitimate commercial AI bots (OpenAI, Anthropic, Google, Perplexity) respect the robots.txt standard. Scrapers and non-compliant bots do not. For stricter access control — particularly around training data — supplement robots.txt with WAF rules that block on User-Agent strings and apply rate limiting per crawler. This adds operational complexity but is the only mechanism that actually blocks non-compliant crawlers.
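
As a sketch of that supplementation, an nginx fragment along these lines blocks training crawlers that identify themselves honestly. The bot list here is illustrative; note that Google-Extended is a robots.txt-only product token that never appears as a User-Agent string, so it cannot be blocked at the WAF layer:

```nginx
# http {} context: flag self-identified training crawlers
# (~* is a case-insensitive substring/regex match on the User-Agent header)
map $http_user_agent $is_training_bot {
    default       0;
    ~*GPTBot      1;
    ~*ClaudeBot   1;
    ~*Amazonbot   1;
}

server {
    # ... existing server configuration ...

    # Return 403 to flagged crawlers regardless of robots.txt compliance
    if ($is_training_bot) {
        return 403;
    }
}
```

Per-crawler rate limiting can be layered on top with a `limit_req_zone` keyed on `$http_user_agent`. Neither mechanism stops a scraper that forges its User-Agent; only IP-range verification is stricter than this.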

llms.txt: Structured Documentation for AI Agents

llms.txt is a Markdown file placed at the root of your domain (/llms.txt) that provides a human-and-machine-readable index of your site's content, specifically structured for LLM consumption. It was proposed by Jeremy Howard of Answer.AI in late 2024 and gained mainstream traction in November 2024 when Mintlify added automated llms.txt generation to its documentation platform, driving rapid adoption across developer-facing products.

Notable real-world implementations include GitHub Docs, Cursor, and the Anthropic developer platform.

What llms.txt Is (and Isn't)

Before implementing it, be precise about what the file actually does:

  • It is: a structured content index that makes it easier for AI agents and crawlers to discover and navigate your documentation without parsing your full HTML site.
  • It is not: a robots.txt replacement — it has no access control semantics.
  • It is not: a sitemap — search engines do not use it for URL discovery.
  • Honest caveat: as of early 2026, no major AI provider has publicly documented reading llms.txt at inference time. There is no peer-reviewed evidence of a direct correlation between having llms.txt and higher citation rates. Its clearest documented value is for AI coding assistants (Cursor, Windsurf, Claude Projects) that explicitly load documentation context.

Implement it for documentation-heavy products where AI coding assistants are a meaningful part of your user acquisition funnel. Do not treat it as a citation optimization lever with proven ROI.

llms.txt File Structure

# Your Product Name

> A concise one-to-three sentence summary of what your product does and who it's for. Write this as
> if briefing an LLM that has never encountered your product.

## Documentation

- [Getting Started](https://yourdomain.com/docs/getting-started/): Installation and initial setup
- [Core Concepts](https://yourdomain.com/docs/concepts/): Architecture overview and key abstractions
- [API Reference](https://yourdomain.com/docs/api/): Full endpoint and SDK documentation
- [Configuration](https://yourdomain.com/docs/config/): All configuration options with defaults

## Guides

- [Authentication](https://yourdomain.com/docs/guides/auth/): OAuth, API keys, and JWT setup
- [Deployment](https://yourdomain.com/docs/guides/deploy/): Production deployment checklist
- [Troubleshooting](https://yourdomain.com/docs/guides/troubleshooting/): Common errors and fixes

## Changelog

- [v2.4.0](https://yourdomain.com/changelog/v2-4-0/): Released 2026-03-15 — breaking changes
- [v2.3.0](https://yourdomain.com/changelog/v2-3-0/): Released 2026-01-20

## Optional

- [GitHub Repository](https://github.com/yourorg/yourproject): Source code and issues
- [Community Forum](https://community.yourdomain.com): User discussions

The structure is intentional:

  • H1 (# Your Product Name): The project identifier. Keep it exactly matching your brand name.
  • Blockquote summary (>): The LLM-facing description. This is what a coding assistant reads to understand what your product does before loading any linked pages.
  • H2 sections (## Documentation): Logical groupings with descriptive link anchor text. Anchor text is more important than URL structure here — it is the signal the LLM uses to decide which page to fetch.
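
Because the structure is this regular, the file is easy to generate at build time. A minimal sketch — `render_llms_txt` and the manifest shape are illustrative placeholders, not an established tool:

```python
# Render an llms.txt index from a {section: [(title, url, description), ...]} manifest.
def render_llms_txt(name: str, summary: str, sections: dict) -> str:
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        lines.extend(f"- [{title}]({url}): {desc}" for title, url, desc in links)
        lines.append("")
    return "\n".join(lines)

manifest = {
    "Documentation": [
        ("Getting Started", "https://yourdomain.com/docs/getting-started/",
         "Installation and initial setup"),
    ],
}

print(render_llms_txt(
    "Your Product Name",
    "A concise summary of what your product does and who it's for.",
    manifest,
))
```

Driving the manifest from the same source of truth as your docs navigation keeps llms.txt from drifting out of date as pages move.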

llms-full.txt

A companion file at /llms-full.txt is the concatenated, plain-text version of every page linked from llms.txt, giving AI agents a single file to ingest instead of crawling each page individually. In server-log observations, llms-full.txt is fetched roughly twice as often as the index file, which suggests agents use llms.txt for navigation and llms-full.txt for bulk ingestion.

For large documentation sites, llms-full.txt can become multi-megabyte. Use a build step to generate it, with a maximum size cap if necessary:

# Example: Generate llms-full.txt from your docs build output
# Concatenate all Markdown source files in documentation order
# (-print0 / sort -z / xargs -0 keep filenames with spaces intact; GNU tools assumed)
find docs/ -name "*.md" -print0 | sort -z | xargs -0 cat > public/llms-full.txt

# Or with a size limit (truncate at 2 MB for performance; head -c cuts
# at a byte boundary, so the last document may be clipped mid-sentence)
find docs/ -name "*.md" -print0 | sort -z | xargs -0 cat | head -c 2097152 > public/llms-full.txt

Serving Both Files

Ensure both files are served with appropriate content types and are accessible to crawlers:

# nginx: serve llms.txt and llms-full.txt as plain text
location ~ ^/llms(-full)?\.txt$ {
    # .txt already maps to text/plain via mime.types; charset appends the encoding.
    # (Setting Content-Type via add_header would emit a duplicate header.)
    charset utf-8;
    add_header Cache-Control "public, max-age=86400";
}

The combination of a correct robots.txt allowing inference crawlers and a well-structured llms.txt for documentation navigation gives AI systems the clearest possible access path to your content. Everything else in GEO depends on this foundation being correctly in place.