toilet-guru.com
robots.txt

Robots Exclusion Standard data for toilet-guru.com

Resource Scan

Scan Details

Site Domain	toilet-guru.com
Base Domain	toilet-guru.com
Scan Status	Ok
Last Scan	2024-09-30T06:05:58+00:00
Next Scan	2024-10-07T06:05:58+00:00

Last Scan

Scanned	2024-09-30T06:05:58+00:00
URL	https://toilet-guru.com/robots.txt
Domain IPs	35.203.182.32
Response IP	35.203.182.32
Found	Yes
Hash	217d6201faf06b23d0d9eb036a991b370dc4b085dbf6a906cddeb1ea8c12993a
SimHash	03145750cd72

Groups

*

Rule	Path
Disallow	/quiz1.php?q=*
Disallow	/quiz2.php?q=*

anthropic-ai

Rule	Path
Disallow	/

claude-web

Rule	Path
Disallow	/

claudebot

Rule	Path
Disallow	/

ccbot

Rule	Path
Disallow	/

img2dataset

Rule	Path
Disallow	/

gptbot

Rule	Path
Disallow	/

chatgpt-user

Rule	Path
Disallow	/

omgilibot

Rule	Path
Disallow	/

omgili

Rule	Path
Disallow	/

facebookbot

Rule	Path
Disallow	/

bytespider

Rule	Path
Disallow	/

magpie-crawler

Rule	Path
Disallow	/

applebot-extended

Rule	Path
Disallow	/

perplexitybot

Rule	Path
Disallow	/
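
The groups above can be checked programmatically. The following is a minimal sketch, assuming a robots.txt reconstructed from the group table in this scan (lowercased agent names, no comments), so it is not the site's actual file and will not match the recorded hash. The names `AI_AGENTS`, `build_robots_txt`, and `make_parser` are hypothetical helpers introduced for illustration. It uses Python's standard `urllib.robotparser` to confirm that each listed agent is refused the site root while an unlisted agent is not.

```python
from urllib.robotparser import RobotFileParser

# Agent names as listed in the scan's Groups section (lowercased by the scanner).
AI_AGENTS = [
    "anthropic-ai", "claude-web", "claudebot", "ccbot", "img2dataset",
    "gptbot", "chatgpt-user", "omgilibot", "omgili", "facebookbot",
    "bytespider", "magpie-crawler", "applebot-extended", "perplexitybot",
]


def build_robots_txt(agents):
    """Rebuild an approximate robots.txt from the scanned groups (assumption)."""
    lines = [
        "User-agent: *",
        "Disallow: /quiz1.php?q=*",
        "Disallow: /quiz2.php?q=*",
        "",
    ]
    for agent in agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("Sitemap: https://toilet-guru.com/sitemap.txt")
    return "\n".join(lines)


def make_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(build_robots_txt(AI_AGENTS))
home = "https://toilet-guru.com/"

# Every listed agent should be denied the site root.
blocked = [agent for agent in AI_AGENTS if not parser.can_fetch(agent, home)]
print(f"{len(blocked)} of {len(AI_AGENTS)} listed agents are blocked")

# An agent with no group of its own falls back to the "*" group.
print("ExampleCrawler allowed:", parser.can_fetch("ExampleCrawler", home))
```

The sketch only tests the site-wide `Disallow: /` groups. The standard-library parser matches paths by prefix and may not interpret the `*` wildcard in the quiz rules the way major crawlers do, so those two rules are left unchecked here.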

Other Records

Field	Value
sitemap	https://toilet-guru.com/sitemap.txt

Comments

Asking AI content scrapers to not scrape my content, from:
https://github.com/healsdata/ai-training-opt-out/blob/main/robots.txt
https://github.com/zcutlip/gen-ai-robots.txt/blob/main/robots.txt
However, it seems that blocking Google's AI scraper also excludes
a site from Google search results, or it soon will:
https://www.osnews.com/story/140536/google-to-websites-let-us-train-our-ai-on-your-content-or-well-remove-you-from-google-search/
ClaudeBot, Claude-Web, anthropic-ai = speculative blocks for Anthropic
CCBot = Common Crawl dataset, original source for GPT and others
The example for img2dataset, although the default is *None*
GPTBot = OpenAI's web crawler
ChatGPT-User takes direct actions on behalf of ChatGPT users
Google-Extended = Google's Bard and Vertex AI generative APIs
User-agent: Google-Extended
Disallow: /
Omgilibot, Omgili = webz.io = they sell data for training LLMs.
FacebookBot = Meta's bot that crawls public web pages
Bytespider = ByteDance's bot gathering data for their LLMs, including Doubao.
magpie-crawler = Brandwatch, "AI to discover new trends"
Apple's AI system
Perplexity AI
https://archive.is/22gCl (wired.com)
https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/
