mattedwards.org
robots.txt

Robots Exclusion Standard data for mattedwards.org

Resource Scan

Scan Details

Site Domain mattedwards.org
Base Domain mattedwards.org
Scan Status Ok
Last Scan 2025-10-25T07:07:33+00:00
Next Scan 2025-11-08T07:07:33+00:00

Last Scan

Scanned 2025-10-25T07:07:33+00:00
URL https://mattedwards.org/robots.txt
Domain IPs 104.21.53.41, 172.67.208.189, 2606:4700:3032::ac43:d0bd, 2606:4700:3034::6815:3529
Response IP 104.21.53.41
Found Yes
Hash 453595133bf535e2f98a355637b7fa6c214d2866104219a2bb44ee45935b1833
SimHash 023257558e62
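
The Hash value above is 64 hexadecimal digits, which is the length of a SHA-256 digest. The scan report does not name the algorithm or say which bytes were hashed, so the following is only a sketch under the assumption that the value is SHA-256 over the raw response body. The names `sha256_hex` and `matches_recorded_hash` are hypothetical helpers, not part of any scanner.

```python
import hashlib

# Hash value copied from the scan listing above.
RECORDED_HASH = "453595133bf535e2f98a355637b7fa6c214d2866104219a2bb44ee45935b1833"


def sha256_hex(body: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as lowercase hex."""
    return hashlib.sha256(body).hexdigest()


def matches_recorded_hash(body: bytes, recorded: str = RECORDED_HASH) -> bool:
    """Compare a fetched body against the recorded value.

    ASSUMPTION: the scanner hashed the raw body with SHA-256; the report
    itself does not confirm this.
    """
    return sha256_hex(body) == recorded.lower()
```

A mismatch would not by itself prove the file changed; it could equally mean the assumption about the algorithm or the hashed bytes is wrong.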

Groups

ccbot

Rule Path
Disallow /

img2dataset

Rule Path
Disallow /

gptbot

Rule Path
Disallow /

chatgpt-user

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

anthropic-ai

Rule Path
Disallow /

claude-web

Rule Path
Disallow /

omgilibot

Rule Path
Disallow /

omgili

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

bytespider

Rule Path
Disallow /

magpie-crawler

Rule Path
Disallow /

Other Records

Field Value
sitemap https://www.mattedwards.org/sitemap.xml
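
The groups and the sitemap record listed above can be checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the robots.txt text is rebuilt from the scan listing rather than copied from the site, so it will not reproduce the recorded hash, and `ExampleBot` is a hypothetical user agent used only for contrast. `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# User agents taken from the Groups section of the scan listing.
BLOCKED_AGENTS = [
    "ccbot", "img2dataset", "gptbot", "chatgpt-user", "google-extended",
    "anthropic-ai", "claude-web", "omgilibot", "omgili", "facebookbot",
    "bytespider", "magpie-crawler",
]
SITEMAP = "https://www.mattedwards.org/sitemap.xml"


def build_robots_txt(agents, sitemap):
    """Rebuild an equivalent robots.txt: one 'Disallow: /' group per agent."""
    lines = []
    for agent in agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines) + "\n"


def make_parser(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(build_robots_txt(BLOCKED_AGENTS, SITEMAP))

# A listed agent versus a hypothetical agent that has no group.
gptbot_allowed = parser.can_fetch("gptbot", "https://mattedwards.org/")
other_allowed = parser.can_fetch("ExampleBot", "https://mattedwards.org/")
sitemaps = parser.site_maps()
```

Because the listing contains no `*` group, an agent that matches none of the named groups is not restricted by this file.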

Comments

  • The Common Crawl dataset. Original source for GPT and others.
  • The example for img2dataset, although the default is *None*
  • GPTBot is OpenAI's web crawler
  • ChatGPT-User takes direct actions on behalf of ChatGPT users
  • Google's Bard and Vertex AI generative APIs
  • Speculative blocks for Anthropic
  • webz.io - they sell data for training LLMs.
  • Meta's bot that crawls public web pages to improve language models
  • ByteDance's bot used to gather data for their LLMs, including Doubao.
  • Brandwatch - "AI to discover new trends"