girlboss.ceo
robots.txt

Robots Exclusion Standard data for girlboss.ceo

Resource Scan

Scan Details

Site Domain girlboss.ceo
Base Domain girlboss.ceo
Scan Status Ok
Last Scan 2024-06-23T16:41:01+00:00
Next Scan 2024-06-30T16:41:01+00:00

Last Scan

Scanned 2024-06-23T16:41:01+00:00
URL https://girlboss.ceo/robots.txt
Domain IPs 170.205.37.36, 2605:4840:2:281f::b00b:babe
Response IP 170.205.37.36
Found Yes
Hash 14f53ba12876eb32c357ab7219f0babb2171e10ae00ca82ff1566922dd6fc686
SimHash a6ccfb0a8864
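
The Hash value above is 64 hexadecimal digits, which is consistent with a SHA-256 digest of the fetched file. The report does not name the algorithm, so that is an assumption. Under that assumption, a minimal sketch for checking a locally saved copy against the reported value could look like this (the helper names `sha256_hex` and `matches_report` are hypothetical):

```python
import hashlib


def sha256_hex(body: bytes) -> str:
    """Return the SHA-256 digest of the raw response body as lowercase hex."""
    return hashlib.sha256(body).hexdigest()


def matches_report(body: bytes, reported_hash: str) -> bool:
    """Compare a fetched body against a hash copied from the scan report.

    Assumes the report hashes the exact bytes of the response body;
    the report itself does not confirm this.
    """
    return sha256_hex(body) == reported_hash.strip().lower()
```

If the digest of a fresh download differs from the reported one, either the file has changed since the scan or the report hashes something other than the raw body.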

Groups

*

Rule Path
Disallow /noindex/
Disallow /misc/
Disallow /~strawberry/
Disallow .git

adsbot
adsbot-google
adsbot-google-mobile

Rule Path
Disallow /
Allow /ads.txt
Allow /app-ads.txt

peer39_crawler
peer39_crawler/1.0

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

npbot

Rule Path
Disallow /

slysearch

Rule Path
Disallow /

blexbot

Rule Path
Disallow /

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule Path
Disallow /

brandverity/1.0

Rule Path
Disallow /

piplbot

Rule Path
Disallow /

chatgpt-user
gptbot
ccbot
ccbot/2.0
ccbot/3.1

Rule Path
Disallow /

anthropic-ai
claude-web

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

google-extended

Rule Path
Disallow /
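
To illustrate how a compliant crawler might evaluate the groups listed above, here is a minimal sketch of longest-match rule evaluation in the style of RFC 9309. It is hand-transcribed from a subset of the listing, not parsed from the live file. The `GROUPS` table and `is_allowed` helper are hypothetical names, and the sketch deliberately ignores wildcards (`*`, `$`) and partial user-agent matching.

```python
# Hand-transcribed subset of the groups in the listing (assumption: the
# report's rule order and paths mirror the actual robots.txt).
GROUPS = {
    "*": [
        ("disallow", "/noindex/"),
        ("disallow", "/misc/"),
        ("disallow", "/~strawberry/"),
        ("disallow", ".git"),  # no leading slash, so it never prefix-matches a URL path
    ],
    "adsbot-google": [
        ("disallow", "/"),
        ("allow", "/ads.txt"),
        ("allow", "/app-ads.txt"),
    ],
    "gptbot": [("disallow", "/")],
}


def is_allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the sketch's rules.

    The longest matching rule path wins; on a tie, allow wins.
    A path that matches no rule is allowed.
    """
    rules = GROUPS.get(agent.lower(), GROUPS["*"])
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_prefers_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_prefers_allow:
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


if __name__ == "__main__":
    for agent, path in [
        ("GPTBot", "/"),
        ("AdsBot-Google", "/ads.txt"),
        ("AdsBot-Google", "/index.html"),
        ("SomeBot", "/misc/notes.html"),
        ("SomeBot", "/.git/config"),
    ]:
        print(agent, path, is_allowed(agent, path))
```

Rule order does not matter under longest-match evaluation, which is why `Allow /ads.txt` still takes effect after `Disallow /`. Parsers that apply the first matching rule instead, such as Python's `urllib.robotparser`, would treat the same group differently.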

Comments

  • I opt out of online advertising so malware that injects ads on my site won't
    get paid. You should do the same. my ads.txt file contains a standard
    placeholder to forbid any compliant ad networks from paying for ad placement
    on my domain.
  • Enabling our crawler to access your site offers several significant benefits
    to you as a publisher. By allowing us access, you enable the maximum number
    of advertisers to confidently purchase advertising space on your pages. Our
    comprehensive data insights help advertisers understand the suitability and
    context of your content, ensuring that their ads align with your audience's
    interests and needs. This alignment leads to improved user experiences,
    increased engagement, and ultimately, higher revenue potential for your
    publication. (https://www.peer39.com/crawler-notice)
  • --> fuck off.
  • IP-violation scanners
  • The next three are borrowed from https://www.videolan.org/robots.txt
  • > This robot collects content from the Internet for the sole purpose of
    helping educational institutions prevent plagiarism. [...] we compare student
    papers against the content we find on the Internet to see if we can find
    similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
  • --> fuck off.
  • > NameProtect engages in crawling activity in search of a wide range of brand
    and other intellectual property violations that may be of interest to our
    clients. (http://www.nameprotect.com/botinfo.html)
  • --> fuck off.
  • iThenticate is a new service we have developed to combat the piracy of
    intellectual property and ensure the originality of written work for
    publishers, non-profit agencies, corporations, and newspapers.
    (http://www.slysearch.com/)
  • --> fuck off.
  • BLEXBot assists internet marketers to get information on the link structure
    of sites and their interlinking on the web, to avoid any technical and
    possible legal issues and improve overall online experience.
    (http://webmeup-crawler.com/)
  • --> fuck off.
  • Providing Intellectual Property professionals with superior brand protection
    services by artfully merging the latest technology with expert analysis.
    (https://www.checkmarknetwork.com/spider.html/)
  • "The Internet is just way to big to effectively police alone." (ACTUAL quote)
  • --> fuck off.
  • Stop trademark violations and affiliate non-compliance in paid search.
    Automatically monitor your partner and affiliates’ online marketing to
    protect yourself from harmful brand violations and regulatory risks. We
    regularly crawl websites on behalf of our clients to ensure content
    compliance with brand and regulatory guidelines.
    (https://www.brandverity.com/why-is-brandverity-visiting-me)
  • --> fuck off.
  • Misc. icky stuff
  • Pipl assembles online identity information from multiple independent sources
    to create the most complete picture of a digital identity and connect it to
    real people and their offline identity records. When all the fragments of
    online identity data are collected, connected, and corroborated, the result
    is a more trustworthy identity.
  • --> fuck off.
  • Gen-AI data scrapers
  • Eat shit, OpenAI.
  • There isn't any public documentation for this AFAICT.
  • Reuters thinks this works so I might as well give it a shot.
  • Extremely aggressive crawling with no documentation. people had to email the
    company about this for robots.txt guidance.
  • FacebookBot crawls public web pages to improve language models for our speech
    recognition technology.
    <https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
  • Official way to opt-out of Google's generative AI training:
    <https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>