sapphic.site
robots.txt

Robots Exclusion Standard data for sapphic.site

Resource Scan

Scan Details

Site Domain sapphic.site
Base Domain sapphic.site
Scan Status Ok
Last Scan 2025-08-18T20:16:55+00:00
Next Scan 2025-08-25T20:16:55+00:00

Last Scan

Scanned 2025-08-18T20:16:55+00:00
URL https://sapphic.site/robots.txt
Domain IPs 185.14.97.167, 2a03:94e0:ffff:185:14:97:0:167
Response IP 185.14.97.167
Found Yes
Hash 663c38276ef2a8913baa54a8f3fc2cfcdf1ca5c9af7ddaa6e003e4cb894f5a67
SimHash a64cb91a8dc4
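The 64-hex-digit Hash is consistent with a SHA-256 digest of the fetched file, which would let the scanner tell whether the file changed between the weekly scans. A minimal sketch of that change check in Python, assuming (unconfirmed) that the digest is taken over the raw response body:

    import hashlib
    import urllib.request

    # Fetch the live file and hash it; comparing against the digest
    # stored at the previous scan detects any change to the file.
    with urllib.request.urlopen("https://sapphic.site/robots.txt") as resp:
        body = resp.read()

    digest = hashlib.sha256(body).hexdigest()
    changed = digest != "663c38276ef2a8913baa54a8f3fc2cfcdf1ca5c9af7ddaa6e003e4cb894f5a67"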

Groups

*

Rule Path
Disallow /noindex/
Disallow /misc/
Disallow /~strawberry/
Disallow .git

*

Rule Path
Disallow /api/*
Disallow /avatars
Disallow /user/*
Disallow /*/*/src/commit/*
Disallow /*/*/commit/*
Disallow /*/*/*/refs/*
Disallow /*/*/*/star
Disallow /*/*/*/watch
Disallow /*/*/labels
Disallow /*/*/activity/*
Disallow /vendor/*
Disallow /swagger.*.json
Disallow /explore/*?*
Disallow /repo/create
Disallow /repo/migrate
Disallow /org/create
Disallow /*/*/fork
Disallow /*/*/watchers
Disallow /*/*/stargazers
Disallow /*/*/forks
Disallow /*/*/activity
Disallow /*/*/projects
Disallow /*/*/commits/
Disallow /*/*/branches
Disallow /*/*/tags
Disallow /*/*/compare
Disallow /*/*/lastcommit/*
Disallow /*/*/issues/new
Disallow /*/*/issues/?*
Disallow /*/*/issues?*
Disallow /*/*/pulls/?*
Disallow /*/*/pulls?*
Disallow /*/*/pulls/*/files
Disallow /*/tree/
Disallow /*/download
Disallow /*/revisions
Disallow /*/commits/*?author
Disallow /*/commits/*?path
Disallow /*/comments
Disallow /*/blame/
Disallow /*/raw/
Disallow /*/cache/
Disallow /.git/
Disallow */.git/
Disallow /*.git
Disallow /*.atom
Disallow /*.rss
Disallow /*/*/archive/
Disallow *.bundle
Disallow */commit/*.patch
Disallow */commit/*.diff
Disallow /*lang%3D*
Disallow /*source%3D*
Disallow /*ref_cta%3D*
Disallow /*plan%3D*
Disallow /*return_to%3D*
Disallow /*ref_loc%3D*
Disallow /*setup_organization%3D*
Disallow /*source_repo%3D*
Disallow /*ref_page%3D*
Disallow /*source%3D*
Disallow /*referrer%3D*
Disallow /*report%3D*
Disallow /*author%3D*
Disallow /*since%3D*
Disallow /*until%3D*
Disallow /*commits?author=*
Disallow /*tab%3D*
Disallow /*q%3D*
Disallow /*repo-search-archived%3D*
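This second '*' group (under RFC 9309, groups naming the same user-agent are combined with the first) appears to be a stock forge robots.txt, judging by the git.gay source credited in the comments below: it keeps crawlers out of machine-generated, high-cardinality pages such as diffs, raw files, and query-string permutations, while leaving repository home pages crawlable. The rules use Googlebot-style wildcards ('*' matches any run of characters), which RFC 9309 makes optional, so support varies by crawler; the percent-encoded rules ('%3D' is '=') target URLs carrying query parameters. A minimal sketch of how a wildcard-aware crawler might match a URL path against rules like these (hypothetical helpers, not this scanner's code):

    import re

    def rule_to_regex(rule: str) -> re.Pattern:
        # '*' matches any character run; everything else is literal.
        # Rules match as path prefixes unless anchored with a trailing '$'.
        anchored = rule.endswith("$")
        body = rule[:-1] if anchored else rule
        regex = "".join(".*" if c == "*" else re.escape(c) for c in body)
        if not regex.startswith("/"):
            # A few rules here ('*.bundle', '*/commit/*.patch') omit the
            # leading slash; treat them as floating patterns.
            regex = ".*" + regex
        return re.compile(regex + ("$" if anchored else ""))

    def disallowed(path: str, rules: list[str]) -> bool:
        return any(rule_to_regex(r).match(path) for r in rules)

    rules = ["/api/*", "/*/*/commit/*", "/*.atom", "*.bundle"]
    disallowed("/api/v1/repos", rules)                 # True
    disallowed("/alice/project/commit/abc123", rules)  # True
    disallowed("/alice/project/releases", rules)       # False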

Other Records

Field Value
crawl-delay 2
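Crawl-delay is a nonstandard but widely honored extension asking matching crawlers to wait the given number of seconds between requests. Python's urllib.robotparser reads it directly; a minimal polite-fetch sketch ('examplebot' is a placeholder agent; note that this stdlib parser does plain prefix matching and ignores the wildcards above):

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://sapphic.site/robots.txt")
    rp.read()

    # Returns 2 here for any agent that falls through to the '*' group,
    # 10 for mj12bot, and None when no matching group sets a delay.
    delay = rp.crawl_delay("examplebot") or 0

    for url in ("https://sapphic.site/", "https://sapphic.site/misc/x"):
        if rp.can_fetch("examplebot", url):
            ...  # fetch the page here
        time.sleep(delay)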

adsbot
adsbot-google
adsbot-google-mobile

Rule Path
Disallow /
Allow /ads.txt
Allow /app-ads.txt
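This group shuts the ad bots out of everything except the two ads.txt files, which works because of rule precedence: under RFC 9309 the longest matching rule wins, and Allow wins a length tie. A minimal sketch of that resolution (prefix matching only, for brevity):

    def allowed(path: str, rules: list[tuple[str, str]]) -> bool:
        # Collect every rule whose path is a prefix of the request path.
        hits = [(len(p), verb == "allow")
                for verb, p in rules if path.startswith(p)]
        if not hits:
            return True  # nothing matched: crawling is allowed by default
        # Longest match wins; on a length tie, True ('allow') beats False.
        return max(hits)[1]

    rules = [("disallow", "/"), ("allow", "/ads.txt"), ("allow", "/app-ads.txt")]
    allowed("/ads.txt", rules)     # True: '/ads.txt' outranks '/'
    allowed("/index.html", rules)  # False: only '/' matches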

peer39_crawler
peer39_crawler/1.0

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

npbot

Rule Path
Disallow /

slysearch

Rule Path
Disallow /

blexbot

Rule Path
Disallow /

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule Path
Disallow /

brandverity/1.0

Rule Path
Disallow /

piplbot

Rule Path
Disallow /

mj12bot

No rules defined. All paths allowed.

Other Records

Field Value
crawl-delay 10

chatgpt-user
gptbot

Rule Path
Disallow /

anthropic-ai
claude-web

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

facebookbot
meta-externalagent

Rule Path
Disallow /

cotoyogi

Rule Path
Disallow /

webzio-extended

Rule Path
Disallow /

img2dataset
omgili
omgilibot
timpibot
velenpublicwebcrawler
facebookexternalhit
icc-crawler
imagesiftbot
petalbot
scrapy
bytespider
amazonbot
diffbot
friendlycrawler
oai-searchbot
applebot-extended

Rule Path
Disallow /

Comments

  • git.girlcock.ceo stuff
  • from https://git.gay/gitgay/assets/src/branch/main/public/robots.txt
  • I opt out of online advertising so malware that injects ads on my site won't
  • get paid. You should do the same. My ads.txt file contains a standard
  • placeholder to forbid any compliant ad networks from paying for ad placement
  • on my domain.
  • Enabling our crawler to access your site offers several significant benefits
  • to you as a publisher. By allowing us access, you enable the maximum number
  • of advertisers to confidently purchase advertising space on your pages. Our
  • comprehensive data insights help advertisers understand the suitability and
  • context of your content, ensuring that their ads align with your audience's
  • interests and needs. This alignment leads to improved user experiences,
  • increased engagement, and ultimately, higher revenue potential for your
  • publication. (https://www.peer39.com/crawler-notice)
  • --> fuck off.
  • IP-violation scanners
  • The next three are borrowed from https://www.videolan.org/robots.txt
  • > This robot collects content from the Internet for the sole purpose of
  • helping educational institutions prevent plagiarism. [...] we compare student
  • papers against the content we find on the Internet to see if we can find
  • similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
  • --> fuck off.
  • > NameProtect engages in crawling activity in search of a wide range of brand
  • and other intellectual property violations that may be of interest to our
  • clients. (http://www.nameprotect.com/botinfo.html)
  • --> fuck off.
  • iThenticate is a new service we have developed to combat the piracy of
  • intellectual property and ensure the originality of written work for
  • publishers, non-profit agencies, corporations, and newspapers.
  • (http://www.slysearch.com/)
  • --> fuck off.
  • BLEXBot assists internet marketers to get information on the link structure
  • of sites and their interlinking on the web, to avoid any technical and
  • possible legal issues and improve overall online experience.
  • (http://webmeup-crawler.com/)
  • --> fuck off.
  • Providing Intellectual Property professionals with superior brand protection
  • services by artfully merging the latest technology with expert analysis.
  • (https://www.checkmarknetwork.com/spider.html/)
  • "The Internet is just way to big to effectively police alone." (ACTUAL quote)
  • --> fuck off.
  • Stop trademark violations and affiliate non-compliance in paid search.
  • Automatically monitor your partner and affiliates’ online marketing to
  • protect yourself from harmful brand violations and regulatory risks. We
  • regularly crawl websites on behalf of our clients to ensure content
  • compliance with brand and regulatory guidelines.
  • (https://www.brandverity.com/why-is-brandverity-visiting-me)
  • --> fuck off.
  • Misc. icky stuff
  • Pipl assembles online identity information from multiple independent sources
  • to create the most complete picture of a digital identity and connect it to
  • real people and their offline identity records. When all the fragments of
  • online identity data are collected, connected, and corroborated, the result
  • is a more trustworthy identity.
  • --> fuck off.
  • Well-known overly-aggressive bot that claims to respect robots.txt: http://mj12bot.com/
  • Gen-AI data scrapers
  • Eat shit, OpenAI.
  • There isn't any public documentation for this AFAICT.
  • Reuters thinks this works so I might as well give it a shot.
  • Extremely aggressive crawling with no documentation. People had to email the
  • company about this for robots.txt guidance.
  • Official way to opt-out of Google's generative AI training:
  • <https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
  • FacebookBot crawls public web pages to improve language models for our speech
  • recognition technology.
  • <https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
  • UPDATE: The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly.
  • <https://developers.facebook.com/docs/sharing/webmasters/web-crawlers>
  • This one doesn't support robots.txt: https://www.allenai.org/crawler
  • block it with your reverse-proxy or WAF or something (see the sketch after these comments).
  • See <https://ds.rois.ac.jp/center8/crawler/>
  • Parent page says it builds LLMs in the infographic: <https://ds.rois.ac.jp/center8/>
  • https://webz.io/bot.html
  • Other AI/hostile shit
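For crawlers that ignore robots.txt entirely, such as the Allen Institute crawler noted above, blocking has to happen server-side. The comment suggests a reverse proxy or WAF; as an illustrative application-layer equivalent, here is a minimal WSGI middleware sketch (the agent substrings are hypothetical examples, not this site's actual block list):

    BLOCKED_AGENTS = ("ai2bot", "bytespider", "img2dataset")

    def block_scrapers(app):
        # Wrap any WSGI app; requests whose User-Agent contains a
        # blocked substring get a 403 instead of reaching the app.
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(bot in ua for bot in BLOCKED_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return app(environ, start_response)
        return middleware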