
Robots Exclusion Standard data for puppygock.gay

Resource Scan

Scan Details

Site Domain puppygock.gay
Base Domain puppygock.gay
Scan Status Ok
Last Scan 2024-11-18T07:28:25+00:00
Next Scan 2024-11-25T07:28:25+00:00

Last Scan

Scanned 2024-11-18T07:28:25+00:00
URL https://puppygock.gay/robots.txt
Domain IPs 185.14.97.167, 2a03:94e0:ffff:185:14:97:0:167
Response IP 185.14.97.167
Found Yes
Hash 4b0ca2c7d76c90ef07cfd53ec2ae6b7c60297738b06653ec2008271829574cdd
SimHash 064cfb028c44

Groups

*

Rule Path
Disallow /noindex/
Disallow /misc/
Disallow /~strawberry/
Disallow .git

*

Rule Path
Disallow /api/*
Disallow /avatars
Disallow /user/*
Disallow /*/*/src/commit/*
Disallow /*/*/commit/*
Disallow /*/*/*/refs/*
Disallow /*/*/*/star
Disallow /*/*/*/watch
Disallow /*/*/labels
Disallow /*/*/activity/*
Disallow /vendor/*
Disallow /swagger.*.json
Disallow /explore/*?*
Disallow /repo/create
Disallow /repo/migrate
Disallow /org/create
Disallow /*/*/fork
Disallow /*/*/watchers
Disallow /*/*/stargazers
Disallow /*/*/forks
Disallow /*/*/activity
Disallow /*/*/projects
Disallow /*/*/commits/
Disallow /*/*/branches
Disallow /*/*/tags
Disallow /*/*/compare
Disallow /*/*/lastcommit/*
Disallow /*/*/issues/new
Disallow /*/*/issues/?*
Disallow /*/*/issues?*
Disallow /*/*/pulls/?*
Disallow /*/*/pulls?*
Disallow /*/*/pulls/*/files
Disallow /*/tree/
Disallow /*/download
Disallow /*/revisions
Disallow /*/commits/*?author
Disallow /*/commits/*?path
Disallow /*/comments
Disallow /*/blame/
Disallow /*/raw/
Disallow /*/cache/
Disallow /.git/
Disallow */.git/
Disallow /*.git
Disallow /*.atom
Disallow /*.rss
Disallow /*/*/archive/
Disallow *.bundle
Disallow */commit/*.patch
Disallow */commit/*.diff
Disallow /*lang%3D*
Disallow /*source%3D*
Disallow /*ref_cta%3D*
Disallow /*plan%3D*
Disallow /*return_to%3D*
Disallow /*ref_loc%3D*
Disallow /*setup_organization%3D*
Disallow /*source_repo%3D*
Disallow /*ref_page%3D*
Disallow /*source%3D*
Disallow /*referrer%3D*
Disallow /*report%3D*
Disallow /*author%3D*
Disallow /*since%3D*
Disallow /*until%3D*
Disallow /*commits?author=*
Disallow /*tab%3D*
Disallow /*q%3D*
Disallow /*repo-search-archived%3D*
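
Many of the paths above use `*` as a wildcard. As a rough illustration of how a crawler might evaluate such patterns, here is a minimal Python sketch. It assumes the common convention that `*` matches any run of characters, a trailing `$` pins the end of the URL, and everything else is a literal prefix. The function names `rule_to_regex` and `rule_matches` and the sample paths are hypothetical, not taken from the scanned site, and real crawlers may differ in details such as percent-encoding.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex (hypothetical helper).

    '*' becomes '.*'; a trailing '$' anchors the end of the path;
    every other character is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and rejoin them with '.*' where '*' stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if the pattern matches the path from its first character."""
    # re.match anchors at the start only, which gives prefix semantics.
    return rule_to_regex(pattern).match(path) is not None


# Made-up paths checked against rules from the group above.
print(rule_matches("/*/*/commit/*", "/alice/repo/commit/abc123"))
print(rule_matches("/avatars", "/avatars/42"))
print(rule_matches("/*.atom", "/alice/repo.atom"))
print(rule_matches("/api/*", "/explore/repos"))
```

The sketch only answers whether a single rule matches a path. Deciding whether a URL is actually blocked also depends on which group applies to the crawler and on how competing Allow and Disallow rules are ranked.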

Other Records

Field Value
crawl-delay 2
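
A crawl-delay of 2 is conventionally read as a request to wait two seconds between fetches, though support varies by crawler. The sketch below shows one way to read the value with Python's standard `urllib.robotparser`. The sample text is a hypothetical reconstruction: the report does not say which group the record belongs to, so it is placed under `*` here, and `examplebot` is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt; the real file's layout is not shown in the report.
SAMPLE = """\
User-agent: *
Crawl-delay: 2
Disallow: /misc/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# Delay (in seconds) requested for an agent that falls back to the "*" group.
delay = parser.crawl_delay("examplebot")

# Whether that same agent may fetch a path under a disallowed prefix.
allowed = parser.can_fetch("examplebot", "https://puppygock.gay/misc/page")

print(delay, allowed)
```

A polite crawler would sleep for `delay` seconds between requests whenever the value is not `None`.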

adsbot
adsbot-google
adsbot-google-mobile

Rule Path
Disallow /
Allow /ads.txt
Allow /app-ads.txt
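
This group blocks everything except two files, which only works if the more specific Allow rules win over `Disallow /`. RFC 9309 describes that precedence: the longest matching rule applies, and Allow wins a tie. The following is a minimal Python sketch of the idea, handling plain prefixes only (no wildcards). The helper name `is_allowed` and the rule-tuple layout are assumptions made for illustration.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide access by the longest matching rule; Allow wins a tie.

    `rules` holds (kind, prefix) pairs with kind "allow" or "disallow".
    Only plain prefixes are handled; wildcards are out of scope here.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The three rules listed for the adsbot group above.
ADSBOT_RULES = [
    ("disallow", "/"),
    ("allow", "/ads.txt"),
    ("allow", "/app-ads.txt"),
]

print(is_allowed(ADSBOT_RULES, "/ads.txt"))
print(is_allowed(ADSBOT_RULES, "/index.html"))
```

Under this reading, the ad crawlers may fetch only `/ads.txt` and `/app-ads.txt`, and every other path falls back to `Disallow /`.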

peer39_crawler
peer39_crawler/1.0

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

npbot

Rule Path
Disallow /

slysearch

Rule Path
Disallow /

blexbot

Rule Path
Disallow /

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule Path
Disallow /

brandverity/1.0

Rule Path
Disallow /

piplbot

Rule Path
Disallow /

chatgpt-user
gptbot
ccbot
ccbot/2.0
ccbot/3.1

Rule Path
Disallow /

anthropic-ai
claude-web

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

img2dataset
omgili
omgilibot
timpibot
velenpublicwebcrawler
cohere-ai
facebookexternalhit
icc-crawler
imagesiftbot
meta-externalagent
perplexitybot
petalbot
scrapy
bytespider
amazonbot
diffbot
friendlycrawler
oai-searchbot
applebot-extended

Rule Path
Disallow /

Comments

  • git.girlcock.ceo stuff
  • from https://git.gay/gitgay/assets/src/branch/main/public/robots.txt
  • I opt out of online advertising so malware that injects ads on my site won't
  • get paid. You should do the same. my ads.txt file contains a standard
  • placeholder to forbid any compliant ad networks from paying for ad placement
  • on my domain.
  • Enabling our crawler to access your site offers several significant benefits
  • to you as a publisher. By allowing us access, you enable the maximum number
  • of advertisers to confidently purchase advertising space on your pages. Our
  • comprehensive data insights help advertisers understand the suitability and
  • context of your content, ensuring that their ads align with your audience's
  • interests and needs. This alignment leads to improved user experiences,
  • increased engagement, and ultimately, higher revenue potential for your
  • publication. (https://www.peer39.com/crawler-notice)
  • --> fuck off.
  • IP-violation scanners
  • The next three are borrowed from https://www.videolan.org/robots.txt
  • > This robot collects content from the Internet for the sole purpose of
  • helping educational institutions prevent plagiarism. [...] we compare student
  • papers against the content we find on the Internet to see if we can find
  • similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
  • --> fuck off.
  • > NameProtect engages in crawling activity in search of a wide range of brand
  • and other intellectual property violations that may be of interest to our
  • clients. (http://www.nameprotect.com/botinfo.html)
  • --> fuck off.
  • iThenticate is a new service we have developed to combat the piracy of
  • intellectual property and ensure the originality of written work for
  • publishers, non-profit agencies, corporations, and newspapers.
  • (http://www.slysearch.com/)
  • --> fuck off.
  • BLEXBot assists internet marketers to get information on the link structure
  • of sites and their interlinking on the web, to avoid any technical and
  • possible legal issues and improve overall online experience.
  • (http://webmeup-crawler.com/)
  • --> fuck off.
  • Providing Intellectual Property professionals with superior brand protection
  • services by artfully merging the latest technology with expert analysis.
  • (https://www.checkmarknetwork.com/spider.html/)
  • "The Internet is just way to big to effectively police alone." (ACTUAL quote)
  • --> fuck off.
  • Stop trademark violations and affiliate non-compliance in paid search.
  • Automatically monitor your partner and affiliates’ online marketing to
  • protect yourself from harmful brand violations and regulatory risks. We
  • regularly crawl websites on behalf of our clients to ensure content
  • compliance with brand and regulatory guidelines.
  • (https://www.brandverity.com/why-is-brandverity-visiting-me)
  • --> fuck off.
  • Misc. icky stuff
  • Pipl assembles online identity information from multiple independent sources
  • to create the most complete picture of a digital identity and connect it to
  • real people and their offline identity records. When all the fragments of
  • online identity data are collected, connected, and corroborated, the result
  • is a more trustworthy identity.
  • --> fuck off.
  • Gen-AI data scrapers
  • Eat shit, OpenAI.
  • There isn't any public documentation for this AFAICT.
  • Reuters thinks this works so I might as well give it a shot.
  • Extremely aggressive crawling with no documentation. people had to email the
  • company about this for robots.txt guidance.
  • FacebookBot crawls public web pages to improve language models for our speech
  • recognition technology.
  • <https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
  • Official way to opt-out of Google's generative AI training:
  • <https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
  • Other AI/hostile shit