toilet-guru.com
robots.txt

Robots Exclusion Standard data for toilet-guru.com

Resource Scan

Scan Details

Site Domain toilet-guru.com
Base Domain toilet-guru.com
Scan Status Ok
Last Scan 2024-09-30T06:05:58+00:00
Next Scan 2024-10-07T06:05:58+00:00

Last Scan

Scanned 2024-09-30T06:05:58+00:00
URL https://toilet-guru.com/robots.txt
Domain IPs 35.203.182.32
Response IP 35.203.182.32
Found Yes
Hash 217d6201faf06b23d0d9eb036a991b370dc4b085dbf6a906cddeb1ea8c12993a
SimHash 03145750cd72
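
A content hash like the one above lets a scanner tell whether robots.txt changed between scans. The report does not name the algorithm; the 64 hex digits are consistent with SHA-256, so the minimal sketch below assumes that. The function names robots_fingerprint and has_changed are hypothetical, and the exact bytes the scanner hashes (raw body, normalized text, and so on) are unknown, so this will not necessarily reproduce the value shown.

```python
import hashlib


def robots_fingerprint(body: bytes) -> str:
    """Return the hex SHA-256 digest of the raw robots.txt bytes.

    Assumption: the scanner hashes the unmodified response body.
    """
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_hash: str, body: bytes) -> bool:
    """Report whether the current body differs from the last recorded hash."""
    return robots_fingerprint(body) != previous_hash
```

An exact hash flips on any single-byte edit, which is presumably why the report also lists a separate SimHash for judging near-duplicates.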

Groups

*

Rule Path
Disallow /quiz1.php?q=*
Disallow /quiz2.php?q=*

anthropic-ai

Rule Path
Disallow /

claude-web

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

ccbot

Rule Path
Disallow /

img2dataset

Rule Path
Disallow /

gptbot

Rule Path
Disallow /

chatgpt-user

Rule Path
Disallow /

omgilibot

Rule Path
Disallow /

omgili

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

bytespider

Rule Path
Disallow /

magpie-crawler

Rule Path
Disallow /

applebot-extended

Rule Path
Disallow /

perplexitybot

Rule Path
Disallow /

Other Records

Field Value
sitemap https://toilet-guru.com/sitemap.txt
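
The groups above can be checked with Python's standard-library urllib.robotparser. The sketch below rebuilds an approximate robots.txt from the report (an assumption: the real file's casing, ordering, and grouping may differ) and asks whether a given user agent may fetch the site root. The names BLOCKED_AGENTS, build_robots_txt, and is_allowed are hypothetical helpers for illustration.

```python
from urllib.robotparser import RobotFileParser

# User agents that the report lists with "Disallow /".
BLOCKED_AGENTS = [
    "anthropic-ai", "claude-web", "claudebot", "ccbot", "img2dataset",
    "gptbot", "chatgpt-user", "omgilibot", "omgili", "facebookbot",
    "bytespider", "magpie-crawler", "applebot-extended", "perplexitybot",
]


def build_robots_txt() -> str:
    """Reconstruct an approximate robots.txt from the scan report."""
    lines = [
        "User-agent: *",
        "Disallow: /quiz1.php?q=*",
        "Disallow: /quiz2.php?q=*",
        "",
    ]
    for agent in BLOCKED_AGENTS:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("Sitemap: https://toilet-guru.com/sitemap.txt")
    return "\n".join(lines)


def is_allowed(agent: str, url: str) -> bool:
    """Return True if the reconstructed rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(build_robots_txt().splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
        print(agent, is_allowed(agent, "https://toilet-guru.com/"))
```

The sketch only tests the site root because the standard-library parser may not interpret the * wildcard in the quiz paths the way major crawlers do. Agent names are matched case-insensitively, so "GPTBot" matches the "gptbot" group.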

Comments

  • Asking AI content scrapers to not scrape my content, from:
  • https://github.com/healsdata/ai-training-opt-out/blob/main/robots.txt
  • https://github.com/zcutlip/gen-ai-robots.txt/blob/main/robots.txt
  • However, it seems that blocking Google's AI scraper also excludes
  • a site from Google search results, or it soon will:
  • https://www.osnews.com/story/140536/google-to-websites-let-us-train-our-ai-on-your-content-or-well-remove-you-from-google-search/
  • ClaudeBot, Claude-Web, anthropic-ai = speculative blocks for Anthropic
  • CCBot = Common Crawl dataset, original source for GPT and others
  • The example for img2dataset, although the default is *None*
  • GPTBot = OpenAI's web crawler
  • ChatGPT-User takes direct actions on behalf of ChatGPT users
  • Google-Extended = Google's Bard and Vertex AI generative APIs
  • User-agent: Google-Extended
  • Disallow: /
  • Omgilibot, Omgili = webz.io = they sell data for training LLMs.
  • FacebookBot = Meta's bot that crawls public web pages
  • Bytespider = ByteDance's bot gathering data for their LLMs, including Doubao.
  • magpie-crawler = Brandwatch, "AI to discover new trends"
  • Apple's AI system
  • Perplexity AI
  • https://archive.is/22gCl (wired.com)
  • https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/