scup.org
robots.txt

Robots Exclusion Standard data for scup.org

Resource Scan

Scan Details

Site Domain scup.org
Base Domain scup.org
Scan Status Ok
Last Scan 2024-09-18T17:13:03+00:00
Next Scan 2024-10-18T17:13:03+00:00

Last Scan

Scanned 2024-09-18T17:13:03+00:00
URL https://scup.org/robots.txt
Redirect https://www.scup.org/robots.txt
Redirect Domain www.scup.org
Redirect Base scup.org
Domain IPs 23.185.0.2, 2620:12a:8000::2, 2620:12a:8001::2
Redirect IPs 23.185.0.2, 2620:12a:8000::2, 2620:12a:8001::2
Response IP 23.185.0.2
Found Yes
Hash 0dcad598b6c597b7c1f4773b60c72967932d277ab58effe1ab9a455e85e8bd94
SimHash 1a5ed3c48662

Groups

*

Rule Path
Disallow /*.pdf$
Disallow /*.zip$
Disallow /*.mp3$
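
The three rules in the * group rely on the wildcard syntax described in RFC 9309, where * matches any run of characters and a trailing $ anchors the pattern to the end of the URL path. The following is a minimal sketch of how such patterns could be evaluated. The helper names pattern_to_regex and is_disallowed are hypothetical and are not taken from the scanned site or from any particular crawler.

```python
import re

# Patterns copied from the "*" group of the scan above.
STAR_GROUP = ["/*.pdf$", "/*.zip$", "/*.mp3$"]


def pattern_to_regex(path_pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the match at the end of the path. Without '$' the pattern
    is a prefix match.
    """
    anchored = path_pattern.endswith("$")
    body = path_pattern[:-1] if anchored else path_pattern
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_disallowed(path, patterns):
    """Return True if any pattern matches the path from its start."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


pdf_blocked = is_disallowed("/docs/report.pdf", STAR_GROUP)
pdf_with_query_blocked = is_disallowed("/docs/report.pdf?download=1", STAR_GROUP)
page_blocked = is_disallowed("/about/", STAR_GROUP)
```

Under these assumptions a path ending in .pdf is blocked, while the same path followed by a query string is not, because the $ anchor requires the match to end exactly at the extension.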

ccbot

Rule Path
Disallow /

img2dataset

Rule Path
Disallow /

gptbot

Rule Path
Disallow /

chatgpt-user

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

anthropic-ai

Rule Path
Disallow /

claude-web

Rule Path
Disallow /

omgilibot

Rule Path
Disallow /

omgili

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

bytespider

Rule Path
Disallow /

magpie-crawler

Rule Path
Disallow /

applebot-extended

Rule Path
Disallow /private/
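
As an illustration of how the per-agent groups above are applied, the sketch below rebuilds two of them as robots.txt text and queries them with Python's standard urllib.robotparser. The robots.txt text is reconstructed from the scan tables, not copied from the live file, and the paths being tested are made up for the example. The * group is left out on purpose: urllib.robotparser treats rule paths as plain prefixes and does not interpret the * and $ wildcards.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the scan tables (assumption: the live file may differ
# in casing, ordering, and comments).
ROBOTS_SUBSET = """\
User-agent: gptbot
Disallow: /

User-agent: applebot-extended
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_SUBSET.splitlines())


def allowed(agent, path):
    """Check a hypothetical path on www.scup.org for the given user agent."""
    return parser.can_fetch(agent, "https://www.scup.org" + path)


gptbot_home = allowed("GPTBot", "/")
apple_private = allowed("Applebot-Extended", "/private/report")
apple_public = allowed("Applebot-Extended", "/about/")
```

Agent names are matched case-insensitively, so GPTBot falls under the gptbot group. In this subset, gptbot is blocked everywhere, while applebot-extended is blocked only under /private/.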

Other Records

Field Value
sitemap https://www.scup.org/sitemap_index.xml
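
The sitemap record can be read back with the same standard-library parser. This is a small sketch assuming Python 3.8 or later, where RobotFileParser.site_maps() is available; the input line is reconstructed from the record above.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Single line reconstructed from the "Other Records" table.
parser.parse(["Sitemap: https://www.scup.org/sitemap_index.xml"])

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
```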

Comments

  • START YOAST BLOCK
  • END YOAST BLOCK
  • The Common Crawl dataset. Original source for GPT and others.
  • The example for img2dataset, although the default is *None*
  • GPTBot is OpenAI's web crawler
  • ChatGPT-User takes direct actions on behalf of ChatGPT users
  • Google's Bard and Vertex AI generative APIs
  • Speculative blocks for Anthropic
  • webz.io - they sell data for training LLMs.
  • Meta's bot that crawls public web pages to improve language models
  • ByteDance's bot used to gather data for their LLMs, including Doubao.
  • Brandwatch - "AI to discover new trends"
  • Apple