cromwell-intl.com
robots.txt

Robots Exclusion Standard data for cromwell-intl.com

Resource Scan

Scan Details

Site Domain cromwell-intl.com
Base Domain cromwell-intl.com
Scan Status Ok
Last Scan 2024-09-29T21:00:39+00:00
Next Scan 2024-10-06T21:00:39+00:00

Last Scan

Scanned 2024-09-29T21:00:39+00:00
URL https://cromwell-intl.com/robots.txt
Domain IPs 35.203.182.32
Response IP 35.203.182.32
Found Yes
Hash 1f12f87c3fea33664bbcd9682cb2ac992f32dcfbb72463b80ea2a9c3b62209fb
SimHash 1316d371cc72

Groups

*

Rule Path
Disallow /roots/
Disallow /tcpip/class-a-nets.html
Disallow /cybersecurity/attack-study/analysis-01.html
Disallow /cybersecurity/attack-study/analysis-02.html
Disallow /cybersecurity/attack-study/analysis-03.html
Disallow /cybersecurity/attack-study/analysis-04.html
Disallow /cybersecurity/attack-study/analysis-05.html
Disallow /cybersecurity/attack-study/analysis-06.html
Disallow /cybersecurity/attack-study/analysis-07.html
Disallow /cybersecurity/attack-study/analysis-08.html
Disallow /cybersecurity/attack-study/analysis-09.html
Disallow /cybersecurity/attack-study/analysis-10.html
Disallow /cybersecurity/attack-study/analysis-11.html
Disallow /cybersecurity/attack-study/analysis-12.html
Disallow /cybersecurity/attack-study/botnet-log-1-c193.html
Disallow /cybersecurity/attack-study/botnet-log-1-i192.html
Disallow /cybersecurity/attack-study/botnet-log-2-c193.html
Disallow /cybersecurity/attack-study/botnet-log-2-i192.html
Disallow /cybersecurity/attack-study/botnet-log.html

anthropic-ai

Rule Path
Disallow /

claude-web

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

ccbot

Rule Path
Disallow /

img2dataset

Rule Path
Disallow /

gptbot

Rule Path
Disallow /

chatgpt-user

Rule Path
Disallow /

omgilibot

Rule Path
Disallow /

omgili

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

bytespider

Rule Path
Disallow /

magpie-crawler

Rule Path
Disallow /

applebot-extended

Rule Path
Disallow /

perplexitybot

Rule Path
Disallow /
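The groups above can be checked mechanically. The following is a minimal sketch, assuming Python's standard urllib.robotparser module and an abbreviated, hand-copied subset of the rules listed in this scan (not the full file). The agent name "ExampleBot" and the helper names build_parser and allowed are hypothetical, used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated subset of the rules reported in the scan (assumption:
# copied by hand from the groups listed above, not fetched live).
ROBOTS_LINES = """\
User-agent: *
Disallow: /roots/
Disallow: /tcpip/class-a-nets.html

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
""".splitlines()

BASE = "https://cromwell-intl.com"


def build_parser(lines):
    """Parse robots.txt lines without any network access."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


parser = build_parser(ROBOTS_LINES)


def allowed(agent, path):
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    checks = [
        ("GPTBot", "/"),                        # named group, fully blocked
        ("ClaudeBot", "/tcpip/"),               # named group, fully blocked
        ("ExampleBot", "/roots/index.html"),    # falls back to the * group
        ("ExampleBot", "/tcpip/"),              # not listed under *, so permitted
    ]
    for agent, path in checks:
        print(agent, path, allowed(agent, path))
```

A crawler named in its own group is matched against that group only. Any other agent falls back to the * group, which blocks a handful of specific paths rather than the whole site.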

Comments

  • Asking AI content scrapers to not scrape my content, from:
  • https://github.com/healsdata/ai-training-opt-out/blob/main/robots.txt
  • https://github.com/zcutlip/gen-ai-robots.txt/blob/main/robots.txt
  • However, it seems that blocking Google's AI scraper also excludes
  • a site from Google search results, or it soon will:
  • https://www.osnews.com/story/140536/google-to-websites-let-us-train-our-ai-on-your-content-or-well-remove-you-from-google-search/
  • ClaudeBot, Claude-Web, anthropic-ai = speculative blocks for Anthropic
  • CCBot = Common Crawl dataset, original source for GPT and others
  • The example for img2dataset, although the default is *None*
  • GPTBot = OpenAI's web crawler
  • ChatGPT-User takes direct actions on behalf of ChatGPT users
  • Google-Extended = Google's Bard and Vertex AI generative APIs
  • User-agent: Google-Extended
  • Disallow: /
  • Omgilibot, Omgili = webz.io = they sell data for training LLMs.
  • FacebookBot = Meta's bot that crawls public web pages
  • Bytespider = ByteDance's bot gathering data for their LLMs, including Doubao.
  • magpie-crawler = Brandwatch, "AI to discover new trends"
  • Apple's AI system
  • Perplexity AI
  • https://archive.is/22gCl (wired.com)
  • https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/
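The per-agent groups in this file all follow the same two-line pattern, so they could be generated from a list of names. The sketch below is one possible way to do that, not the site owner's method. The agent names are taken from the groups in this scan in the lowercase form the scan reports, and the function name render_groups is hypothetical.

```python
# Agent names as reported by the scan, in the order the groups appear.
AI_AGENTS = [
    "anthropic-ai", "claude-web", "claudebot", "ccbot", "img2dataset",
    "gptbot", "chatgpt-user", "omgilibot", "omgili", "facebookbot",
    "bytespider", "magpie-crawler", "applebot-extended", "perplexitybot",
]


def render_groups(agents):
    """Render one 'Disallow: /' group per agent, separated by blank lines."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render_groups(AI_AGENTS), end="")
```

Each group asks the named crawler to stay out of the whole site. Compliance is voluntary on the crawler's side, which is why the comments above call this asking.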