puppygock.gay
robots.txt

Robots Exclusion Standard data for puppygock.gay

Resource Scan

Scan Details

Site Domain	puppygock.gay
Base Domain	puppygock.gay
Scan Status	Ok
Last Scan	2024-11-18T07:28:25+00:00
Next Scan	2024-11-25T07:28:25+00:00

Last Scan

Scanned	2024-11-18T07:28:25+00:00
URL	https://puppygock.gay/robots.txt
Domain IPs	185.14.97.167, 2a03:94e0:ffff:185:14:97:0:167
Response IP	185.14.97.167
Found	Yes
Hash	4b0ca2c7d76c90ef07cfd53ec2ae6b7c60297738b06653ec2008271829574cdd
SimHash	064cfb028c44

Groups

*

Rule	Path
Disallow	/noindex/
Disallow	/misc/
Disallow	/~strawberry/
Disallow	.git
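
The plain-prefix rules in this group can be checked with Python's standard-library parser. This is a minimal sketch, not the scanner's own logic: the agent name `ExampleBot` and the helper `allowed` are hypothetical, and the `.git` rule is left out because it does not start with `/`, so a prefix match against a URL path would never hit it.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the first "*" group above (".git" omitted, see lead-in).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /noindex/",
    "Disallow: /misc/",
    "Disallow: /~strawberry/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (hypothetical) agent may fetch the given path."""
    return parser.can_fetch(agent, "https://puppygock.gay" + path)
```

For example, `allowed("/noindex/page.html")` is rejected by the `/noindex/` prefix, while a path such as `/blog/` matches no rule and is permitted.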

*

Rule	Path
Disallow	/api/*
Disallow	/avatars
Disallow	/user/*
Disallow	/*/*/src/commit/*
Disallow	/*/*/commit/*
Disallow	/*/*/*/refs/*
Disallow	/*/*/*/star
Disallow	/*/*/*/watch
Disallow	/*/*/labels
Disallow	/*/*/activity/*
Disallow	/vendor/*
Disallow	/swagger.*.json
Disallow	/explore/*?*
Disallow	/repo/create
Disallow	/repo/migrate
Disallow	/org/create
Disallow	/*/*/fork
Disallow	/*/*/watchers
Disallow	/*/*/stargazers
Disallow	/*/*/forks
Disallow	/*/*/activity
Disallow	/*/*/projects
Disallow	/*/*/commits/
Disallow	/*/*/branches
Disallow	/*/*/tags
Disallow	/*/*/compare
Disallow	/*/*/lastcommit/*
Disallow	/*/*/issues/new
Disallow	/*/*/issues/?*
Disallow	/*/*/issues?*
Disallow	/*/*/pulls/?*
Disallow	/*/*/pulls?*
Disallow	/*/*/pulls/*/files
Disallow	/*/tree/
Disallow	/*/download
Disallow	/*/revisions
Disallow	/*/commits/*?author
Disallow	/*/commits/*?path
Disallow	/*/comments
Disallow	/*/blame/
Disallow	/*/raw/
Disallow	/*/cache/
Disallow	/.git/
Disallow	*/.git/
Disallow	/*.git
Disallow	/*.atom
Disallow	/*.rss
Disallow	/*/*/archive/
Disallow	*.bundle
Disallow	*/commit/*.patch
Disallow	*/commit/*.diff
Disallow	/*lang%3D*
Disallow	/*source%3D*
Disallow	/*ref_cta%3D*
Disallow	/*plan%3D*
Disallow	/*return_to%3D*
Disallow	/*ref_loc%3D*
Disallow	/*setup_organization%3D*
Disallow	/*source_repo%3D*
Disallow	/*ref_page%3D*
Disallow	/*source%3D*
Disallow	/*referrer%3D*
Disallow	/*report%3D*
Disallow	/*author%3D*
Disallow	/*since%3D*
Disallow	/*until%3D*
Disallow	/*commits?author=*
Disallow	/*tab%3D*
Disallow	/*q%3D*
Disallow	/*repo-search-archived%3D*
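
Most paths in this group rely on the `*` wildcard, which Python's `urllib.robotparser` does not interpret. The sketch below shows one way such patterns could be evaluated, following the wildcard semantics of RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the path). The function name `rule_matches` and the sample paths are hypothetical, chosen only for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each '*' stand for '.*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so the pattern acts as a prefix match.
    return re.match(regex, path) is not None
```

Under these assumptions, `rule_matches("/*/*/commit/*", "/owner/repo/commit/abc123")` is true, while `rule_matches("/api/*", "/explore/repos")` is false. Note that `*` may also span `/`, so `/*/*/commit/*` matches more deeply nested paths as well.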

Other Records

Field	Value
crawl-delay	2
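
`Crawl-delay` is not part of RFC 9309, and crawlers differ in whether they honor it. For a crawler that does, the value can be read with the standard-library parser. This sketch assumes the record belongs to the `*` group, which the report does not state; the helper name `seconds_between_requests` and the fallback of one second are also assumptions.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Assumption: the crawl-delay record applies to the "*" group.
parser.parse(["User-agent: *", "Crawl-delay: 2"])


def seconds_between_requests(agent: str, default: float = 1.0) -> float:
    """Return the delay a polite crawler would wait between requests."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default
```

A crawler loop would then sleep for `seconds_between_requests("ExampleBot")` seconds between fetches.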

adsbot
adsbot-google
adsbot-google-mobile

Rule	Path
Disallow	/
Allow	/ads.txt
Allow	/app-ads.txt
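
This group blocks everything but carves out two files. Under RFC 9309 the most specific (longest) matching rule wins, and an Allow beats a Disallow of equal length, so the carve-outs hold regardless of rule order. Python's `urllib.robotparser` uses first-match order instead, so the sketch below implements the longest-match rule by hand. The names `ADSBOT_RULES` and `is_allowed` are hypothetical, and only plain prefixes are handled.

```python
# (kind, path prefix) pairs copied from the adsbot group above.
ADSBOT_RULES = [
    ("disallow", "/"),
    ("allow", "/ads.txt"),
    ("allow", "/app-ads.txt"),
]


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence; a path with no match is allowed."""
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len, allowed = len(prefix), kind == "allow"
    return allowed
```

With these rules, `/ads.txt` and `/app-ads.txt` are allowed and every other path, such as `/index.html`, is disallowed.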

peer39_crawler
peer39_crawler/1.0

Rule	Path
Disallow	/

turnitinbot

Rule	Path
Disallow	/

npbot

Rule	Path
Disallow	/

slysearch

Rule	Path
Disallow	/

blexbot

Rule	Path
Disallow	/

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule	Path
Disallow	/

brandverity/1.0

Rule	Path
Disallow	/

piplbot

Rule	Path
Disallow	/

chatgpt-user
gptbot
ccbot
ccbot/2.0
ccbot/3.1

Rule	Path
Disallow	/

anthropic-ai
claude-web

Rule	Path
Disallow	/

claudebot

Rule	Path
Disallow	/

facebookbot

Rule	Path
Disallow	/

google-extended

Rule	Path
Disallow	/

img2dataset
omgili
omgilibot
timpibot
velenpublicwebcrawler
cohere-ai
facebookexternalhit
icc-crawler
imagesiftbot
meta-externalagent
perplexitybot
petalbot
scrapy
bytespider
amazonbot
diffbot
friendlycrawler
oai-searchbot
applebot-extended

Rule	Path
Disallow	/
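
Each group above is keyed by one or more user-agent tokens. A crawler picks the group whose token matches its own product name, compared case-insensitively, and falls back to the `*` group otherwise. The sketch below illustrates that selection with a simple substring test. The function `select_group`, the trimmed-down `GROUPS` list, and the sample user-agent strings are hypothetical, and real crawlers may match tokens more strictly.

```python
# (user-agent tokens, disallowed prefixes): a small excerpt of the groups above.
GROUPS = [
    (["*"], ["/noindex/", "/misc/", "/~strawberry/"]),
    (["chatgpt-user", "gptbot", "ccbot"], ["/"]),
    (["claudebot"], ["/"]),
]


def select_group(groups, user_agent: str) -> list[str]:
    """Return the rules of the first specific group matching the agent,
    or the rules of the '*' group if no specific token matches."""
    ua = user_agent.lower()
    for tokens, rules in groups:
        if any(tok != "*" and tok in ua for tok in tokens):
            return rules
    for tokens, rules in groups:
        if "*" in tokens:
            return rules
    return []
```

For instance, `select_group(GROUPS, "GPTBot/1.0")` returns `["/"]`, whereas an unlisted agent such as `"SomeBrowser/1.0"` receives the `*` rules.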

Comments

git.girlcock.ceo stuff
from https://git.gay/gitgay/assets/src/branch/main/public/robots.txt
I opt out of online advertising so malware that injects ads on my site won't
get paid. You should do the same. my ads.txt file contains a standard
placeholder to forbid any compliant ad networks from paying for ad placement
on my domain.
Enabling our crawler to access your site offers several significant benefits
to you as a publisher. By allowing us access, you enable the maximum number
of advertisers to confidently purchase advertising space on your pages. Our
comprehensive data insights help advertisers understand the suitability and
context of your content, ensuring that their ads align with your audience's
interests and needs. This alignment leads to improved user experiences,
increased engagement, and ultimately, higher revenue potential for your
publication. (https://www.peer39.com/crawler-notice)
--> fuck off.
IP-violation scanners
The next three are borrowed from https://www.videolan.org/robots.txt
> This robot collects content from the Internet for the sole purpose of
helping educational institutions prevent plagiarism. [...] we compare student
papers against the content we find on the Internet to see if we can find
similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
--> fuck off.
> NameProtect engages in crawling activity in search of a wide range of brand
and other intellectual property violations that may be of interest to our
clients. (http://www.nameprotect.com/botinfo.html)
--> fuck off.
iThenticate is a new service we have developed to combat the piracy of
intellectual property and ensure the originality of written work for
publishers, non-profit agencies, corporations, and newspapers.
(http://www.slysearch.com/)
--> fuck off.
BLEXBot assists internet marketers to get information on the link structure
of sites and their interlinking on the web, to avoid any technical and
possible legal issues and improve overall online experience.
(http://webmeup-crawler.com/)
--> fuck off.
Providing Intellectual Property professionals with superior brand protection
services by artfully merging the latest technology with expert analysis.
(https://www.checkmarknetwork.com/spider.html/)
"The Internet is just way to big to effectively police alone." (ACTUAL quote)
--> fuck off.
Stop trademark violations and affiliate non-compliance in paid search.
Automatically monitor your partner and affiliates' online marketing to
protect yourself from harmful brand violations and regulatory risks. We
regularly crawl websites on behalf of our clients to ensure content
compliance with brand and regulatory guidelines.
(https://www.brandverity.com/why-is-brandverity-visiting-me)
--> fuck off.
Misc. icky stuff
Pipl assembles online identity information from multiple independent sources
to create the most complete picture of a digital identity and connect it to
real people and their offline identity records. When all the fragments of
online identity data are collected, connected, and corroborated, the result
is a more trustworthy identity.
--> fuck off.
Gen-AI data scrapers
Eat shit, OpenAI.
There isn't any public documentation for this AFAICT.
Reuters thinks this works so I might as well give it a shot.
Extremely aggressive crawling with no documentation. people had to email the
company about this for robots.txt guidance.
FacebookBot crawls public web pages to improve language models for our speech
recognition technology.
<https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
Official way to opt-out of Google's generative AI training:
<https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
Other AI/hostile shit
