sapphic.site
robots.txt

Robots Exclusion Standard data for sapphic.site

Resource Scan

Scan Details

Site Domain sapphic.site
Base Domain sapphic.site
Scan Status Ok
Last Scan 2025-08-18T20:16:55+00:00
Next Scan 2025-08-25T20:16:55+00:00

Last Scan

Scanned 2025-08-18T20:16:55+00:00
URL https://sapphic.site/robots.txt
Domain IPs 185.14.97.167, 2a03:94e0:ffff:185:14:97:0:167
Response IP 185.14.97.167
Found Yes
Hash 663c38276ef2a8913baa54a8f3fc2cfcdf1ca5c9af7ddaa6e003e4cb894f5a67
SimHash a64cb91a8dc4
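The 64-hex-digit Hash is consistent with a SHA-256 digest of the fetched file, which would let the scanner tell whether the file changed between the weekly scans. A minimal sketch of that change check in Python, assuming (unconfirmed) that the digest is taken over the raw response body:

    import hashlib
    import urllib.request

    # Fetch the live file and hash it; comparing against the digest
    # stored at the previous scan detects any change to the file.
    with urllib.request.urlopen("https://sapphic.site/robots.txt") as resp:
        body = resp.read()

    digest = hashlib.sha256(body).hexdigest()
    changed = digest != "663c38276ef2a8913baa54a8f3fc2cfcdf1ca5c9af7ddaa6e003e4cb894f5a67"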

Groups

*

Rule Path
Disallow /noindex/
Disallow /misc/
Disallow /~strawberry/
Disallow .git

*

Rule Path
Disallow /api/*
Disallow /avatars
Disallow /user/*
Disallow /*/*/src/commit/*
Disallow /*/*/commit/*
Disallow /*/*/*/refs/*
Disallow /*/*/*/star
Disallow /*/*/*/watch
Disallow /*/*/labels
Disallow /*/*/activity/*
Disallow /vendor/*
Disallow /swagger.*.json
Disallow /explore/*?*
Disallow /repo/create
Disallow /repo/migrate
Disallow /org/create
Disallow /*/*/fork
Disallow /*/*/watchers
Disallow /*/*/stargazers
Disallow /*/*/forks
Disallow /*/*/activity
Disallow /*/*/projects
Disallow /*/*/commits/
Disallow /*/*/branches
Disallow /*/*/tags
Disallow /*/*/compare
Disallow /*/*/lastcommit/*
Disallow /*/*/issues/new
Disallow /*/*/issues/?*
Disallow /*/*/issues?*
Disallow /*/*/pulls/?*
Disallow /*/*/pulls?*
Disallow /*/*/pulls/*/files
Disallow /*/tree/
Disallow /*/download
Disallow /*/revisions
Disallow /*/commits/*?author
Disallow /*/commits/*?path
Disallow /*/comments
Disallow /*/blame/
Disallow /*/raw/
Disallow /*/cache/
Disallow /.git/
Disallow */.git/
Disallow /*.git
Disallow /*.atom
Disallow /*.rss
Disallow /*/*/archive/
Disallow *.bundle
Disallow */commit/*.patch
Disallow */commit/*.diff
Disallow /*lang%3D*
Disallow /*source%3D*
Disallow /*ref_cta%3D*
Disallow /*plan%3D*
Disallow /*return_to%3D*
Disallow /*ref_loc%3D*
Disallow /*setup_organization%3D*
Disallow /*source_repo%3D*
Disallow /*ref_page%3D*
Disallow /*source%3D*
Disallow /*referrer%3D*
Disallow /*report%3D*
Disallow /*author%3D*
Disallow /*since%3D*
Disallow /*until%3D*
Disallow /*commits?author=*
Disallow /*tab%3D*
Disallow /*q%3D*
Disallow /*repo-search-archived%3D*
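This second '*' group (under RFC 9309, groups naming the same user-agent are combined with the first) appears to be a stock forge robots.txt, judging by the git.gay source credited in the comments below: it keeps crawlers out of machine-generated, high-cardinality pages such as diffs, raw files, and query-string permutations, while leaving repository home pages crawlable. The rules use Googlebot-style wildcards ('*' matches any run of characters), which RFC 9309 makes optional, so support varies by crawler; the percent-encoded rules ('%3D' is '=') target URLs carrying query parameters. A minimal sketch of how a wildcard-aware crawler might match a URL path against rules like these (hypothetical helpers, not this scanner's code):

    import re

    def rule_to_regex(rule: str) -> re.Pattern:
        # '*' matches any character run; everything else is literal.
        # Rules match as path prefixes unless anchored with a trailing '$'.
        anchored = rule.endswith("$")
        body = rule[:-1] if anchored else rule
        regex = "".join(".*" if c == "*" else re.escape(c) for c in body)
        if not regex.startswith("/"):
            # A few rules here ('*.bundle', '*/commit/*.patch') omit the
            # leading slash; treat them as floating patterns.
            regex = ".*" + regex
        return re.compile(regex + ("$" if anchored else ""))

    def disallowed(path: str, rules: list[str]) -> bool:
        return any(rule_to_regex(r).match(path) for r in rules)

    rules = ["/api/*", "/*/*/commit/*", "/*.atom", "*.bundle"]
    disallowed("/api/v1/repos", rules)                 # True
    disallowed("/alice/project/commit/abc123", rules)  # True
    disallowed("/alice/project/releases", rules)       # False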

Other Records

Field Value
crawl-delay 2
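Crawl-delay is a nonstandard but widely honored extension asking matching crawlers to wait the given number of seconds between requests. Python's urllib.robotparser reads it directly; a minimal polite-fetch sketch ('examplebot' is a placeholder agent; note that this stdlib parser does plain prefix matching and ignores the wildcards above):

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://sapphic.site/robots.txt")
    rp.read()

    # Returns 2 here for any agent that falls through to the '*' group,
    # 10 for mj12bot, and None when no matching group sets a delay.
    delay = rp.crawl_delay("examplebot") or 0

    for url in ("https://sapphic.site/", "https://sapphic.site/misc/x"):
        if rp.can_fetch("examplebot", url):
            ...  # fetch the page here
        time.sleep(delay)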

adsbot
adsbot-google
adsbot-google-mobile

Rule Path
Disallow /
Allow /ads.txt
Allow /app-ads.txt
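This group shuts the ad bots out of everything except the two ads.txt files, which works because of rule precedence: under RFC 9309 the longest matching rule wins, and Allow wins a length tie. A minimal sketch of that resolution (prefix matching only, for brevity):

    def allowed(path: str, rules: list[tuple[str, str]]) -> bool:
        # Collect every rule whose path is a prefix of the request path.
        hits = [(len(p), verb == "allow")
                for verb, p in rules if path.startswith(p)]
        if not hits:
            return True  # nothing matched: crawling is allowed by default
        # Longest match wins; on a length tie, True ('allow') beats False.
        return max(hits)[1]

    rules = [("disallow", "/"), ("allow", "/ads.txt"), ("allow", "/app-ads.txt")]
    allowed("/ads.txt", rules)     # True: '/ads.txt' outranks '/'
    allowed("/index.html", rules)  # False: only '/' matches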

peer39_crawler
peer39_crawler/1.0

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

npbot

Rule Path
Disallow /

slysearch

Rule Path
Disallow /

blexbot

Rule Path
Disallow /

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule Path
Disallow /

brandverity/1.0

Rule Path
Disallow /

piplbot

Rule Path
Disallow /

mj12bot

No rules defined. All paths allowed.

Other Records

Field Value
crawl-delay 10

chatgpt-user
gptbot

Rule Path
Disallow /

anthropic-ai
claude-web

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

facebookbot
meta-externalagent

Rule Path
Disallow /

cotoyogi

Rule Path
Disallow /

webzio-extended

Rule Path
Disallow /

img2dataset
omgili
omgilibot
timpibot
velenpublicwebcrawler
facebookexternalhit
icc-crawler
imagesiftbot
petalbot
scrapy
bytespider
amazonbot
diffbot
friendlycrawler
oai-searchbot
applebot-extended

Rule Path
Disallow /

Comments

  • git.girlcock.ceo stuff
  • from https://git.gay/gitgay/assets/src/branch/main/public/robots.txt
  • I opt out of online advertising so malware that injects ads on my site won't
  • get paid. You should do the same. My ads.txt file contains a standard
  • placeholder to forbid any compliant ad networks from paying for ad placement
  • on my domain.
  • Enabling our crawler to access your site offers several significant benefits
  • to you as a publisher. By allowing us access, you enable the maximum number
  • of advertisers to confidently purchase advertising space on your pages. Our
  • comprehensive data insights help advertisers understand the suitability and
  • context of your content, ensuring that their ads align with your audience's
  • interests and needs. This alignment leads to improved user experiences,
  • increased engagement, and ultimately, higher revenue potential for your
  • publication. (https://www.peer39.com/crawler-notice)
  • --> fuck off.
  • IP-violation scanners
  • The next three are borrowed from https://www.videolan.org/robots.txt
  • > This robot collects content from the Internet for the sole purpose of
  • helping educational institutions prevent plagiarism. [...] we compare student
  • papers against the content we find on the Internet to see if we can find
  • similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
  • --> fuck off.
  • > NameProtect engages in crawling activity in search of a wide range of brand
  • and other intellectual property violations that may be of interest to our
  • clients. (http://www.nameprotect.com/botinfo.html)
  • --> fuck off.
  • iThenticate is a new service we have developed to combat the piracy of
  • intellectual property and ensure the originality of written work for
  • publishers, non-profit agencies, corporations, and newspapers.
  • (http://www.slysearch.com/)
  • --> fuck off.
  • BLEXBot assists internet marketers to get information on the link structure
  • of sites and their interlinking on the web, to avoid any technical and
  • possible legal issues and improve overall online experience.
  • (http://webmeup-crawler.com/)
  • --> fuck off.
  • Providing Intellectual Property professionals with superior brand protection
  • services by artfully merging the latest technology with expert analysis.
  • (https://www.checkmarknetwork.com/spider.html/)
  • "The Internet is just way to big to effectively police alone." (ACTUAL quote)
  • --> fuck off.
  • Stop trademark violations and affiliate non-compliance in paid search.
  • Automatically monitor your partner and affiliates’ online marketing to
  • protect yourself from harmful brand violations and regulatory risks. We
  • regularly crawl websites on behalf of our clients to ensure content
  • compliance with brand and regulatory guidelines.
  • (https://www.brandverity.com/why-is-brandverity-visiting-me)
  • --> fuck off.
  • Misc. icky stuff
  • Pipl assembles online identity information from multiple independent sources
  • to create the most complete picture of a digital identity and connect it to
  • real people and their offline identity records. When all the fragments of
  • online identity data are collected, connected, and corroborated, the result
  • is a more trustworthy identity.
  • --> fuck off.
  • Well-known overly-aggressive bot that claims to respect robots.txt: http://mj12bot.com/
  • Gen-AI data scrapers
  • Eat shit, OpenAI.
  • There isn't any public documentation for this AFAICT.
  • Reuters thinks this works so I might as well give it a shot.
  • Extremely aggressive crawling with no documentation. People had to email the
  • company about this for robots.txt guidance.
  • Official way to opt-out of Google's generative AI training:
  • <https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
  • FacebookBot crawls public web pages to improve language models for our speech
  • recognition technology.
  • <https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
  • UPDATE: The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly.
  • <https://developers.facebook.com/docs/sharing/webmasters/web-crawlers>
  • This one doesn't support robots.txt: https://www.allenai.org/crawler
  • block it with your reverse-proxy or WAF or something (see the sketch after these comments).
  • See <https://ds.rois.ac.jp/center8/crawler/>
  • Parent page says it builds LLMs in the infographic: <https://ds.rois.ac.jp/center8/>
  • https://webz.io/bot.html
  • Other AI/hostile shit
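For crawlers that ignore robots.txt entirely, such as the Allen Institute crawler noted above, blocking has to happen server-side. The comment suggests a reverse proxy or WAF; as an illustrative application-layer equivalent, here is a minimal WSGI middleware sketch (the agent substrings are hypothetical examples, not this site's actual block list):

    BLOCKED_AGENTS = ("ai2bot", "bytespider", "img2dataset")

    def block_scrapers(app):
        # Wrap any WSGI app; requests whose User-Agent contains a
        # blocked substring get a 403 instead of reaching the app.
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(bot in ua for bot in BLOCKED_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return app(environ, start_response)
        return middleware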