girlboss.ceo
robots.txt

Robots Exclusion Standard data for girlboss.ceo

Resource Scan

Scan Details

Site Domain	girlboss.ceo
Base Domain	girlboss.ceo
Scan Status	Ok
Last Scan	2024-06-23T16:41:01+00:00
Next Scan	2024-06-30T16:41:01+00:00

Last Scan

Scanned	2024-06-23T16:41:01+00:00
URL	https://girlboss.ceo/robots.txt
Domain IPs	170.205.37.36, 2605:4840:2:281f::b00b:babe
Response IP	170.205.37.36
Found	Yes
Hash	14f53ba12876eb32c357ab7219f0babb2171e10ae00ca82ff1566922dd6fc686
SimHash	a6ccfb0a8864

Groups

*

Rule	Path
Disallow	/noindex/
Disallow	/misc/
Disallow	/~strawberry/
Disallow	.git
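
The wildcard group above can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt text is reconstructed from the table (the live file's exact layout is an assumption), and the `ExampleBot` agent name and the `allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the scan table; not a verbatim copy of the live file.
WILDCARD_GROUP = """\
User-agent: *
Disallow: /noindex/
Disallow: /misc/
Disallow: /~strawberry/
Disallow: .git
"""

BASE = "https://girlboss.ceo"

parser = RobotFileParser()
parser.parse(WILDCARD_GROUP.splitlines())


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` (a made-up crawler name) may fetch `path`."""
    return parser.can_fetch(agent, BASE + path)


for path in ("/", "/noindex/page.html", "/misc/"):
    print(path, "allowed" if allowed(path) else "blocked")
```

The `.git` rule is listed without a leading slash. This parser compares rule paths as prefixes of the URL path, which always begins with `/`, so that rule probably never matches here. Other crawlers may interpret it differently.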

adsbot
adsbot-google
adsbot-google-mobile

Rule	Path
Disallow	/
Allow	/ads.txt
Allow	/app-ads.txt
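
This group pairs a blanket `Disallow` with narrower `Allow` rules, so the outcome depends on how a crawler resolves overlapping matches. Under RFC 9309 the longest matching path wins, and a tie goes to `Allow`. The sketch below illustrates that precedence only; it ignores wildcards and `$` anchors, and the `is_allowed` helper and `ADSBOT_RULES` list are hypothetical names, not part of any crawler's actual implementation.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix)

# Rules copied from the table above, in the order shown.
ADSBOT_RULES: List[Rule] = [
    ("disallow", "/"),
    ("allow", "/ads.txt"),
    ("allow", "/app-ads.txt"),
]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; ties go to allow; no match means allowed."""
    best_len = -1
    verdict = True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            verdict = allow
    return verdict


for path in ("/", "/index.html", "/ads.txt", "/app-ads.txt"):
    print(path, "allowed" if is_allowed(ADSBOT_RULES, path) else "blocked")
```

A parser that instead applies the first matching rule in file order would hit `Disallow: /` first and could block `/ads.txt` as well, so results may differ between tools.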

peer39_crawler
peer39_crawler/1.0

Rule	Path
Disallow	/

turnitinbot

Rule	Path
Disallow	/

npbot

Rule	Path
Disallow	/

slysearch

Rule	Path
Disallow	/

blexbot

Rule	Path
Disallow	/

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule	Path
Disallow	/

brandverity/1.0

Rule	Path
Disallow	/

piplbot

Rule	Path
Disallow	/

chatgpt-user
gptbot
ccbot
ccbot/2.0
ccbot/3.1

Rule	Path
Disallow	/

anthropic-ai
claude-web

Rule	Path
Disallow	/

claudebot

Rule	Path
Disallow	/

facebookbot

Rule	Path
Disallow	/

google-extended

Rule	Path
Disallow	/

Comments

I opt out of online advertising so malware that injects ads on my site won't
get paid. You should do the same. my ads.txt file contains a standard
placeholder to forbid any compliant ad networks from paying for ad placement
on my domain.
Enabling our crawler to access your site offers several significant benefits
to you as a publisher. By allowing us access, you enable the maximum number
of advertisers to confidently purchase advertising space on your pages. Our
comprehensive data insights help advertisers understand the suitability and
context of your content, ensuring that their ads align with your audience's
interests and needs. This alignment leads to improved user experiences,
increased engagement, and ultimately, higher revenue potential for your
publication. (https://www.peer39.com/crawler-notice)
--> fuck off.
IP-violation scanners
The next three are borrowed from https://www.videolan.org/robots.txt
> This robot collects content from the Internet for the sole purpose of
helping educational institutions prevent plagiarism. [...] we compare student
papers against the content we find on the Internet to see if we can find
similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
--> fuck off.
> NameProtect engages in crawling activity in search of a wide range of brand
and other intellectual property violations that may be of interest to our
clients. (http://www.nameprotect.com/botinfo.html)
--> fuck off.
iThenticate is a new service we have developed to combat the piracy of
intellectual property and ensure the originality of written work for
publishers, non-profit agencies, corporations, and newspapers.
(http://www.slysearch.com/)
--> fuck off.
BLEXBot assists internet marketers to get information on the link structure
of sites and their interlinking on the web, to avoid any technical and
possible legal issues and improve overall online experience.
(http://webmeup-crawler.com/)
--> fuck off.
Providing Intellectual Property professionals with superior brand protection
services by artfully merging the latest technology with expert analysis.
(https://www.checkmarknetwork.com/spider.html/)
"The Internet is just way to big to effectively police alone." (ACTUAL quote)
--> fuck off.
Stop trademark violations and affiliate non-compliance in paid search.
Automatically monitor your partner and affiliates' online marketing to
protect yourself from harmful brand violations and regulatory risks. We
regularly crawl websites on behalf of our clients to ensure content
compliance with brand and regulatory guidelines.
(https://www.brandverity.com/why-is-brandverity-visiting-me)
--> fuck off.
Misc. icky stuff
Pipl assembles online identity information from multiple independent sources
to create the most complete picture of a digital identity and connect it to
real people and their offline identity records. When all the fragments of
online identity data are collected, connected, and corroborated, the result
is a more trustworthy identity.
--> fuck off.
Gen-AI data scrapers
Eat shit, OpenAI.
There isn't any public documentation for this AFAICT.
Reuters thinks this works so I might as well give it a shot.
Extremely aggressive crawling with no documentation. people had to email the
company about this for robots.txt guidance.
FacebookBot crawls public web pages to improve language models for our speech
recognition technology.
<https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
Official way to opt-out of Google's generative AI training:
<https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
