
Blocking AI Scrapers with AWS WAF: A Technical Guide

AI agents were hitting paths my robots.txt disallows ~72x/day. Here's how AI crawlers, user-triggered fetchers, and search indexers actually differ - and how to block the ones that self-identify with AWS WAF.

November 14, 2025 · 11 min read · AWS · WAF · CloudFront · AI · Web Scraping · Security
The Discovery

While analyzing my CloudFront access logs for honeypot activity, I noticed steady traffic from AI vendors’ agents - ChatGPT, Claude, Perplexity, and others - hitting paths my robots.txt disallows, roughly 72 times per day.

The nuance matters, and it’s the part most “block the AI bots” posts get wrong. Each major vendor now runs multiple, separately-controllable agents:

  • Training crawlers - OpenAI’s GPTBot, Anthropic’s ClaudeBot. These are documented to honor robots.txt, and in my logs they largely did.
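  • User-triggered fetchers - ChatGPT-User, Perplexity-User, Claude-User. These fire when a person asks the assistant about a URL.
  • Search indexers - OAI-SearchBot, Claude-SearchBot, PerplexityBot. These crawl to build the vendor's AI-search index, so blocking them costs you AI-search visibility.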

Almost all of my disallowed-path hits came from the user-triggered fetchers, not the training crawlers. And here’s the detail worth internalizing: OpenAI and Perplexity openly state that robots.txt rules may not apply to these user-initiated fetches. So it isn’t necessarily a crawler “going rogue” - for the user-triggered agents it’s a deliberate carve-out in how vendors treat “a human asked for this.” Whether that distinction is reasonable is a fair debate; either way, if you want those requests stopped, robots.txt won’t do it.

The Evidence

Let me show you what I found in my CloudFront logs:

# ChatGPT-User (the user-triggered fetcher) reads robots.txt
2025-11-13 14:23:15 GET /robots.txt 200 "ChatGPT-User/1.0"

# ...then fetches a disallowed path anyway
2025-11-13 14:23:17 GET /posts/cisco-asa-zero-days/ 200 "ChatGPT-User/1.0"
2025-11-13 14:23:19 GET /docs/zero-trust-architecture/ 200 "ChatGPT-User/1.0"

Pattern observed (for the user-triggered fetchers specifically):

  1. Agent reads /robots.txt
  2. Disallow: / is set for its user-agent
  3. It fetches the content anyway - consistent with the vendors’ stated position that robots.txt may not govern user-initiated requests

Over a 24-hour period, I logged roughly:

  • ChatGPT agents: ~22 requests to disallowed paths
  • Claude agents: ~12
  • Perplexity agents: ~21
  • Other AI agents: ~17

Total: ~72 requests to disallowed paths in 24 hours. Small numbers for one site, but the point is that polite directives weren’t being honored by the user-triggered agents.
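For anyone who wants to reproduce a tally like this, here is a minimal sketch of the counting step. It assumes CloudFront's standard tab-separated log layout (field 8 is cs-uri-stem, field 11 is cs(User-Agent)), which is more detailed than the simplified excerpt above; the function name count_ai_hits and the example file name are hypothetical.

```shell
#!/bin/sh
# Sketch: count requests per self-identifying AI agent from CloudFront
# standard logs read on stdin. Assumes the tab-separated standard log layout
# (field 8 = cs-uri-stem, field 11 = cs(User-Agent)); adjust the field
# numbers if your log format differs.

count_ai_hits() {
  awk -F'\t' '
    /^#/ { next }                    # skip the #Version / #Fields header lines
    $8 == "/robots.txt" { next }     # fetching robots.txt itself is expected
    {
      ua = tolower($11)              # case-insensitive match, like the WAF rule
      if (match(ua, /gptbot|chatgpt-user|oai-searchbot|claudebot|claude-user|claude-searchbot|perplexity-user|perplexitybot|ccbot|bytespider|amazonbot|meta-externalagent/))
        hits[substr(ua, RSTART, RLENGTH)]++
    }
    END { for (agent in hits) printf "%s\t%d\n", agent, hits[agent] }
  ' | sort
}

# Usage (hypothetical log file name):
#   gunzip -c E123EXAMPLE.2025-11-13-14.abcd1234.gz | count_ai_hits
```

The output is one line per agent token with its request count. Because my robots.txt sets Disallow: / for these agents, every path except /robots.txt counts as disallowed; if your rules are narrower, filter on field 8 accordingly.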

Why This Matters

It’s tempting to call this illegal. The honest answer in 2026 is: it’s unsettled, and mostly not the slam-dunk people assume.

  • robots.txt is voluntary. It’s a convention (standardized as RFC 9309), not a legally binding access control. Ignoring it isn’t, by itself, a crime.
  • CFAA probably doesn’t apply to a public site. In hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping data that’s publicly accessible (no login/auth to bypass) generally isn’t “access without authorization” under the Computer Fraud and Abuse Act. If anyone can curl your page, a CFAA claim is weak.
  • Copyright/fair use is actively being litigated. Whether training on copyrighted text is fair use is being fought out right now (NYT v. OpenAI, Bartz v. Anthropic, Kadrey v. Meta, Thomson Reuters v. Ross), with mixed and partial rulings. There’s no clean precedent to rely on yet - though how content was acquired (e.g. pirated datasets) has drawn courts’ attention.

So treat a LICENSE and Terms as stating your terms and creating a paper trail, not as a guaranteed cause of action. I’m an engineer, not a lawyer - if you’re considering actual legal action, talk to one.

The Bigger Picture

Regardless of how the law shakes out, AI vendors are scraping aggressively while precedent is still forming. Your security research, technical writing, and code examples can end up in training datasets without permission or compensation. If you’d rather not feed that pipeline, the practical move is technical enforcement plus clearly-stated terms - not relying on goodwill.

The Solution: AWS WAF

Since robots.txt doesn't bind the user-triggered fetchers (and nothing forces any agent to honor it), we need technical enforcement, not polite requests. Enter AWS WAF (Web Application Firewall).

Architecture Overview

CloudFront + WAF Architecture

The request flow is simple: user requests hit CloudFront, WAF checks the user-agent header, and either allows legitimate traffic through to S3 or blocks AI bots with a 403 response.

Step 1: Create the WAF Web ACL

First, create a WAF rule that matches AI crawler user-agents:

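# Create WAF Web ACL (CloudFront scope, so it must live in us-east-1)
# --visibility-config is required at the Web ACL level as well as per rule
aws wafv2 create-web-acl \
  --name glyph-sh-ai-blocker \
  --scope CLOUDFRONT \
  --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=glyph-sh-ai-blocker \
  --region us-east-1 \
  --profile your-profile \
  --rules file://waf-rules.json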

WAF Rules Configuration (waf-rules.json):

[
  {
    "Name": "BlockAIScrapers",
    "Priority": 0,
    "Statement": {
      "RegexMatchStatement": {
        "RegexString": "(gptbot|chatgpt-user|oai-searchbot|claudebot|claude-user|claude-searchbot|anthropic|perplexity|amazonbot|bytespider|ccbot|meta-externalagent)",
        "FieldToMatch": {
          "SingleHeader": {
            "Name": "user-agent"
          }
        },
        "TextTransformations": [
          {
            "Priority": 0,
            "Type": "LOWERCASE"
          }
        ]
      }
    },
    "Action": {
      "Block": {}
    },
    "VisibilityConfig": {
      "SampledRequestsEnabled": true,
      "CloudWatchMetricsEnabled": true,
      "MetricName": "BlockAIScrapers"
    }
  }
]

Key details:

  • Regex pattern: matches the current self-identifying AI agent user-agents (the LOWERCASE transform makes it case-insensitive). Note these UA strings change - check each vendor’s docs and a source like Dark Visitors periodically.
  • Action: Block (returns HTTP 403)
  • Metrics: Track blocked requests in CloudWatch
  • No google-extended / applebot-extended here: those are robots.txt-only opt-out tokens, not user-agents - no request ever carries them, so matching them in WAF does nothing. Control those via robots.txt (below); Google and Apple honor them for AI training.
  • UA matching only stops honest agents. Anything that spoofs a browser UA or uses residential proxies sails right through. This raises the cost of casual scraping; it is not a wall.
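Before deploying, it can help to check locally which user-agent strings the pattern would catch. The sketch below approximates the rule with tr and grep -E: it lowercases the input (mirroring the LOWERCASE transform) and applies the same alternation. The helper name would_block and the sample user-agent strings are illustrative, and grep's regex engine is not identical to WAF's, though for a plain alternation like this one they should agree.

```shell
#!/bin/sh
# Sketch: approximate the BlockAIScrapers rule locally.
# The pattern is copied from waf-rules.json; keep the two in sync by hand.
AI_UA_REGEX='(gptbot|chatgpt-user|oai-searchbot|claudebot|claude-user|claude-searchbot|anthropic|perplexity|amazonbot|bytespider|ccbot|meta-externalagent)'

# would_block UA -> exit status 0 if the rule would match, 1 otherwise
would_block() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -Eq "$AI_UA_REGEX"
}

# Illustrative user-agent strings (simplified, not copied from vendor docs)
for ua in \
  "ChatGPT-User/1.0" \
  "Mozilla/5.0 (compatible; ClaudeBot/1.0)" \
  "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"
do
  if would_block "$ua"; then echo "BLOCK  $ua"; else echo "allow  $ua"; fi
done
```

The last string is a plain browser UA and passes, which is exactly the limitation described above: anything that doesn't announce itself gets through.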

Step 2: Attach WAF to CloudFront

CloudFront has no --web-acl-id flag - the Web ACL is a field (WebACLId) inside the distribution config, and updating a distribution requires sending the whole config back with its current ETag. For WAFv2, that field must be the Web ACL ARN (not the ID):

# Get your CloudFront distribution ID
aws cloudfront list-distributions \
  --profile your-profile \
  --query "DistributionList.Items[].{ID:Id,Domain:DomainName}" \
  --output table

DIST_ID=YOUR_DISTRIBUTION_ID
WAF_ARN="arn:aws:wafv2:us-east-1:ACCOUNT:global/webacl/glyph-sh-ai-blocker/ID"

# 1. Fetch the current config + ETag
aws cloudfront get-distribution-config --id "$DIST_ID" --profile your-profile > dist.json
ETAG=$(jq -r '.ETag' dist.json)

# 2. Set WebACLId to the WAF ARN inside the DistributionConfig
jq --arg arn "$WAF_ARN" '.DistributionConfig.WebACLId = $arn | .DistributionConfig' dist.json > dist-config.json

# 3. Push the updated config back, passing the ETag as --if-match
aws cloudfront update-distribution \
  --id "$DIST_ID" \
  --distribution-config file://dist-config.json \
  --if-match "$ETAG" \
  --profile your-profile

Important: a CloudFront-scope WAF must be created in us-east-1 (it’s a global resource), and the WebACLId you set must be the WAFv2 ARN.
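Because a wrong value here (a bare ID, or a regional ARN) is an easy mistake, a quick format check on WAF_ARN before running the update can save a failed deploy. This is a minimal sketch: the helper name is_cloudfront_waf_arn is hypothetical, the account number and ID in the example are made up, and the pattern reflects the ARN shape shown above rather than an official validator.

```shell
#!/bin/sh
# Sketch: sanity-check that a value looks like a CLOUDFRONT-scope WAFv2
# Web ACL ARN (us-east-1, "global/webacl/<name>/<uuid>").
# The real ARN can be listed with:
#   aws wafv2 list-web-acls --scope CLOUDFRONT --region us-east-1

is_cloudfront_waf_arn() {
  printf '%s\n' "$1" \
    | grep -Eq '^arn:aws:wafv2:us-east-1:[0-9]{12}:global/webacl/[A-Za-z0-9_-]+/[0-9a-f-]{36}$'
}

# Made-up example value; substitute your own ARN
EXAMPLE_ARN="arn:aws:wafv2:us-east-1:123456789012:global/webacl/glyph-sh-ai-blocker/a1b2c3d4-5678-90ab-cdef-111122223333"

if is_cloudfront_waf_arn "$EXAMPLE_ARN"; then
  echo "looks like a CloudFront-scope WAFv2 ARN"
else
  echo "not a CloudFront-scope WAFv2 ARN" >&2
fi
```

A regional Web ACL ARN (regional/webacl/...) or a bare Web ACL ID fails the check, which matches the two mistakes called out above.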

Step 3: Update robots.txt

Add an enforcement notice to your robots.txt:

# NOTICE: AI/ML training and scraping is not permitted on this site.
# Self-identifying AI agents are also blocked at the edge (CloudFront WAF, HTTP 403).
# Requests are logged with timestamps, IPs, and user-agents.
# Note: Google-Extended and Applebot-Extended are honored here as robots.txt
# opt-out tokens - they are not request user-agents and can't be blocked by WAF.

# AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Amazonbot
User-agent: Bytespider
User-agent: meta-externalagent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# User-triggered fetchers (vendors say robots.txt may not bind these)
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
Disallow: /

# AI search indexers (block only if you don't want AI-search visibility)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow: /

# Copyright Notice
# All content © glyph.sh - All Rights Reserved
# AI/ML training use is not permitted. See /LICENSE.md for full terms.

Note the deliberate grouping. Anthropic and OpenAI let you control each agent independently, so you can allow AI-search visibility while still refusing training - just don’t Disallow the *-SearchBot agents. Also note Anthropic’s old Claude-Web and anthropic-ai tokens are deprecated; use ClaudeBot/Claude-User/Claude-SearchBot.

Step 4: Verify It’s Working

Test the WAF with curl:

# Normal user - should work
curl -I https://yourdomain.com/
# HTTP/2 200 ✅

# AI bot - should be blocked
curl -I -A "ChatGPT-User/1.0" https://yourdomain.com/
# HTTP/2 403 ❌

Result:

HTTP/2 403
server: CloudFront
content-type: text/html
x-cache: Error from cloudfront

Perfect! AI bots now get HTTP 403 Forbidden.

Defense in Depth: Beyond WAF

Technical blocking is only one layer. I also implemented:

1. License

Created a comprehensive LICENSE.md:

# All Rights Reserved - Not for AI/ML Training

You may NOT use any content from this repository for:
- Training artificial intelligence or machine learning models
- Fine-tuning large language models (LLMs)
- Creating training datasets for AI systems
- Any form of automated content extraction for AI purposes

2. GitHub Repository Privacy

Made my repository private to prevent AI companies from scraping source code and markdown content directly from GitHub.

Why this matters: CloudFront WAF only protects the live website. Public GitHub repos can be cloned by AI companies (and likely already were during 2020-2023’s massive scraping campaigns).

3. Opt-Out Signals

Added multiple opt-out signals. As far as I can tell, no AI vendor treats either file below as a training opt-out, so they are a statement of intent rather than a control:

.aiexclude:

# AI Training Exclusion File
**/*

.github/copilot-training.yml:

exclude: true
license: "All Rights Reserved - Not for AI/ML Training"
telemetry: false

4. Terms of Service Update

Updated /terms/ with explicit AI training prohibition and legal consequences.

Monitoring and Enforcement

CloudWatch Metrics

Monitor blocked requests in CloudWatch:

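# CloudFront-scope WAF metrics are published in us-east-1.
# CloudWatch only returns data when the dimension set matches the metric
# exactly; for a CloudFront Web ACL that is typically WebACL + Rule.
# If this returns nothing, check the real dimensions with:
#   aws cloudwatch list-metrics --namespace AWS/WAFV2 --region us-east-1
aws cloudwatch get-metric-statistics \
  --namespace AWS/WAFV2 \
  --metric-name BlockedRequests \
  --dimensions Name=WebACL,Value=glyph-sh-ai-blocker Name=Rule,Value=BlockAIScrapers \
  --start-time 2025-11-14T00:00:00Z \
  --end-time 2025-11-14T23:59:59Z \
  --period 3600 \
  --statistics Sum \
  --region us-east-1 \
  --profile your-profile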

WAF Logs

Enable WAF logging to S3 for detailed analysis:

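# The destination bucket name must start with "aws-waf-logs-".
# Keep the shorthand value on one quoted line: a line break inside it
# splits it into two arguments and the call fails.
aws wafv2 put-logging-configuration \
  --logging-configuration "ResourceArn=$WAF_ARN,LogDestinationConfigs=arn:aws:s3:::aws-waf-logs-your-bucket" \
  --region us-east-1 \
  --profile your-profile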

Testing Script

I created a quick test script to verify blocking:

#!/bin/bash
# test-waf-blocking.sh

echo "Testing WAF AI Bot Blocking..."
echo ""

# Test normal user
echo "1. Normal User (should be 200):"
curl -s -o /dev/null -w "%{http_code}\n" https://glyph.sh/

# Test AI bots (should all be 403)
echo "2. ChatGPT (should be 403):"
curl -s -o /dev/null -w "%{http_code}\n" -A "ChatGPT-User/1.0" https://glyph.sh/

echo "3. Claude (should be 403):"
curl -s -o /dev/null -w "%{http_code}\n" -A "ClaudeBot/1.0" https://glyph.sh/

echo "4. Perplexity (should be 403):"
curl -s -o /dev/null -w "%{http_code}\n" -A "PerplexityBot/1.0" https://glyph.sh/

Cost Analysis

AWS WAF Pricing (as of November 2025):

  • Web ACL: $5.00/month
  • Rules: $1.00/rule/month (1 rule = $1.00)
  • Requests: $0.60 per 1 million requests

My costs:

  • Web ACL: $5.00/month
  • 1 regex rule: $1.00/month
  • ~100,000 requests/month: $0.06/month

Total: ~$6.06/month

⚠️ Cost Warning for High-Traffic Sites

This pricing is for low-traffic websites like mine (~100k requests/month).

WAF costs scale with traffic:

  • 1 million requests: ~$6.60/month
  • 10 million requests: ~$12.00/month
  • 100 million requests: ~$66.00/month

For very high-traffic sites (on the order of a billion requests a month and up), this can add hundreds or thousands to your AWS bill. Calculate costs based on your actual traffic volume before implementing. See Terms of Service for important disclaimers about cloud costs.
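To estimate the bill for your own traffic, the arithmetic is just a fixed part plus a per-request part. Here is a small sketch using the prices listed above ($5.00 per Web ACL, $1.00 per rule, $0.60 per million requests); the function name waf_monthly_cost is hypothetical, and the prices are hard-coded assumptions you should check against current AWS pricing.

```shell
#!/bin/sh
# Sketch: estimate monthly AWS WAF cost from request volume.
# Prices are the November 2025 figures quoted in this post, hard-coded here.

# waf_monthly_cost REQUESTS [RULES] -> estimated USD per month (default 1 rule)
waf_monthly_cost() {
  awk -v req="$1" -v rules="${2:-1}" 'BEGIN {
    acl         = 5.00   # per Web ACL per month
    per_rule    = 1.00   # per rule per month
    per_million = 0.60   # per 1 million requests
    printf "%.2f\n", acl + rules * per_rule + (req / 1000000) * per_million
  }'
}

waf_monthly_cost 100000     # prints 6.06
waf_monthly_cost 1000000    # prints 6.60
```

The fixed $6.00 dominates at low volume; the per-request charge only becomes the larger share once you pass 10 million requests a month.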

The Harsh Reality

What’s Already Happened

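If your content was public between 2020 and 2023, it has probably already been used as training data. Vendors don't publish exact crawl dates, so treat these as rough windows rather than confirmed facts:

  • GitHub Copilot: trained on public GitHub repos (2021)
  • OpenAI GPT-3/4: GitHub content, roughly 2020-2023
  • Anthropic Claude: roughly 2022-2023
  • Google Gemini: roughly 2022-2023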

You can’t un-train a model. Past scraping is done.

What You Can Do Now

  1. Stop future scraping: WAF blocks self-identifying AI agents today
  2. State your terms: LICENSE.md and ToS put your terms on record (a paper trail, not a guaranteed cause of action)
  3. Document violations: Logs provide evidence
  4. Private development: Keep new content in private repos

Lessons Learned

What Worked

AWS WAF: reliably blocks agents that honestly identify themselves (effectively all of them today)

CloudWatch metrics: Easy to monitor blocked requests

Multi-layer defense: Technical enforcement + clearly-stated terms

Documentation: a logged record of requests to disallowed paths

What Didn’t Work

robots.txt alone: training crawlers largely honored it, but user-triggered fetchers did not

.aiexclude file: No evidence anyone respects it

Polite requests: They don’t care

What’s Still Unknown

Rotating user-agents: Will they disguise themselves?

Residential proxies: Will they use proxy networks?

Legal precedents: Will courts hold AI companies accountable?

Conclusion

AI vendors are scraping content aggressively, and their user-triggered fetchers don't treat robots.txt as binding. robots.txt is not enough - you need technical enforcement.

AWS WAF provides that enforcement layer:

  • Blocks AI bots at the edge (CloudFront)
  • Costs ~$6/month
  • Logs requests for your own records
  • Works alongside your stated terms (LICENSE, ToS)

Just be clear-eyed about what it is: a UA filter raises the cost of casual, honest scraping. It won’t stop a determined scraper that spoofs a browser UA and rotates IPs - for that you’d need behavioral rate-limiting, bot-detection services, or auth. But for keeping well-behaved AI agents out of content you’ve asked them not to take, it’s cheap and it works.

If you value your content, don’t rely on AI vendors to opt themselves out. State your terms and back them with technical controls.


Update (November 14, 2025): Since deploying this WAF configuration, I’ve blocked hundreds of requests from self-identified AI agents, and seen no further hits to disallowed paths from agents using those user-agents.

Update (June 2026): Refreshed for accuracy. The big changes since the original post: vendors now run separate, independently-controllable agents (training crawler vs. user-triggered fetcher vs. search indexer), so the bot list and robots.txt here are grouped accordingly; Anthropic’s Claude-Web/anthropic-ai are deprecated in favor of ClaudeBot/Claude-User/Claude-SearchBot; Google-Extended/Applebot-Extended are robots.txt-only tokens (not WAF-matchable); the CloudFront attach step now uses the correct get-distribution-config → edit WebACLId → update-distribution --if-match flow; and the legal framing is tempered to reflect that scraping public pages generally isn’t a CFAA violation (hiQ v. LinkedIn) and that AI-training fair use is still being litigated.

Want to protect your content? The code and configuration are above - adapt the user-agent list to whatever’s current when you read this.