criptonautas.co
robots.txt

Robots Exclusion Standard data for criptonautas.co

Resource Scan

Scan Details

Site Domain criptonautas.co
Base Domain criptonautas.co
Scan Status Failed
Failure Stage Fetching resource.
Failure Reason Server returned a server error.
Last Scan 2025-11-16T23:45:44+00:00
Next Scan 2025-12-16T23:45:44+00:00

Last Successful Scan

Scanned 2025-09-25T01:34:57+00:00
URL https://criptonautas.co/robots.txt
Domain IPs 104.21.73.70, 172.67.158.180, 2606:4700:3031::6815:4946, 2606:4700:3031::ac43:9eb4
Response IP 172.67.158.180
Found Yes
Hash 0aad1bd6f2a2ed6d883564a66f32944a4e36b750b609502b767edbb4a1ff19c9
SimHash be9cf1008c64

Groups

*

Rule Path
Disallow /noindex/
Disallow /misc/

adsbot

Rule Path
Disallow /
Allow /ads.txt
Allow /app-ads.txt
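The adsbot group above pairs a blanket Disallow with narrow Allow exceptions for the ads declaration files. One way to sanity-check a policy like this is Python's stdlib `urllib.robotparser`. A caveat: the stdlib parser applies the *first* matching rule rather than the longest match used by Google-style parsers, so the Allow lines are placed before the Disallow in this sketch (rules transcribed from the group above):

```python
from urllib.robotparser import RobotFileParser

# Reconstruction of the "adsbot" group from the scan above. Allow lines
# come first because urllib.robotparser uses first-match semantics.
rules = """
User-agent: adsbot
Allow: /ads.txt
Allow: /app-ads.txt
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("adsbot", "https://criptonautas.co/ads.txt"))   # True
print(rp.can_fetch("adsbot", "https://criptonautas.co/any-page"))  # False
```

With the original order (Disallow first), the stdlib parser would deny everything for adsbot, so the ordering matters when testing locally.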

peer39_crawler
peer39_crawler/1.0

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

npbot

Rule Path
Disallow /

slysearch

Rule Path
Disallow /

blexbot

Rule Path
Disallow /

checkmarknetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

Rule Path
Disallow /

brandverity/1.0

Rule Path
Disallow /

piplbot

Rule Path
Disallow /

mj12bot

No rules defined. All paths allowed.

Other Records

Field Value
crawl-delay 10
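The crawl-delay record attached to the mj12bot group can be read programmatically; `urllib.robotparser` exposes it via `crawl_delay()`. A minimal sketch using the value from the table above:

```python
from urllib.robotparser import RobotFileParser

# The mj12bot group defines no path rules, only a crawl-delay of 10.
rules = """
User-agent: mj12bot
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.crawl_delay("mj12bot"))   # 10
print(rp.crawl_delay("otherbot"))  # None (no matching group, no default)
```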

chatgpt-user
gptbot

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

applebot-extended

Rule Path
Disallow /

claudebot

Rule Path
Disallow /

facebookbot
meta-externalagent

Rule Path
Disallow /

cotoyogi

Rule Path
Disallow /

webzio-extended

Rule Path
Disallow /

kangaroo bot

Rule Path
Disallow /
Disallow /ghost/
Disallow /email/
Disallow /members/api/comments/counts/
Allow /webmentions/receive/

Other Records

Field Value
sitemap https://criptonautas.co/sitemap.xml
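Sitemap records are global rather than tied to a user-agent group, which is why the scan lists this one under "Other Records". `urllib.robotparser` can extract them via `site_maps()` (added in Python 3.8); a sketch with the record from the table above:

```python
from urllib.robotparser import RobotFileParser

# Sitemap directives apply to the whole file, regardless of which
# group they appear next to.
rules = """
User-agent: *
Disallow: /noindex/
Sitemap: https://criptonautas.co/sitemap.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.site_maps())  # ['https://criptonautas.co/sitemap.xml']
```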

Comments

  • I opt out of online advertising so malware that injects ads on my site won't get paid. You should do the same. My ads.txt file contains a standard placeholder to forbid any compliant ad networks from paying for ad placement on my domain.
  • By allowing us access, you enable the maximum number of advertisers to confidently purchase advertising space on your pages. Our comprehensive data insights help advertisers understand the suitability and context of your content, ensuring that their ads align with your audience's interests and needs. This alignment leads to improved user experiences, increased engagement, and ultimately, higher revenue potential for your publication. (https://www.peer39.com/crawler-notice)
  • --> fuck off.
  • IP-violation scanners
  • The next three are borrowed from https://www.videolan.org/robots.txt
  • > This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. [...] we compare student papers against the content we find on the Internet to see if we can find similarities. (http://www.turnitin.com/robot/crawlerinfo.html)
  • --> fuck off.
  • > NameProtect engages in crawling activity in search of a wide range of brand and other intellectual property violations that may be of interest to our clients. (http://www.nameprotect.com/botinfo.html)
  • --> fuck off.
  • iThenticate is a new service we have developed to combat the piracy of intellectual property and ensure the originality of written work for publishers, non-profit agencies, corporations, and newspapers. (http://www.slysearch.com/)
  • --> fuck off.
  • BLEXBot assists internet marketers to get information on the link structure of sites and their interlinking on the web, to avoid any technical and possible legal issues and improve overall online experience. (http://webmeup-crawler.com/)
  • --> fuck off.
  • Providing Intellectual Property professionals with superior brand protection services by artfully merging the latest technology with expert analysis. (https://www.checkmarknetwork.com/spider.html/)
  • "The Internet is just way to big to effectively police alone." (ACTUAL quote)
  • --> fuck off.
  • Stop trademark violations and affiliate non-compliance in paid search. Automatically monitor your partner and affiliates' online marketing to protect yourself from harmful brand violations and regulatory risks. We regularly crawl websites on behalf of our clients to ensure content compliance with brand and regulatory guidelines. (https://www.brandverity.com/why-is-brandverity-visiting-me)
  • --> fuck off.
  • Misc. icky stuff
  • Pipl assembles online identity information from multiple independent sources to create the most complete picture of a digital identity and connect it to real people and their offline identity records. When all the fragments of online identity data are collected, connected, and corroborated, the result is a more trustworthy identity.
  • --> fuck off.
  • Well-known overly-aggressive bot that claims to respect robots.txt: http://mj12bot.com/
  • Gen-AI data scrapers
  • Eat shit, OpenAI.
  • Official way to opt out of Google's generative AI training: <https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers>
  • Official way to opt out of LLM training by Apple: <https://support.apple.com/en-us/119829#datausage>
  • The Anthropic-AI crawler posted guidance after a long period of crawling without opt-out documentation: <https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler>
  • FacebookBot crawls public web pages to improve language models for our speech recognition technology. <https://developers.facebook.com/docs/sharing/bot/?_fb_noscript=1>
  • UPDATE: The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly. <https://developers.facebook.com/docs/sharing/webmasters/web-crawlers>
  • This one doesn't support robots.txt: https://www.allenai.org/crawler. Block it with your reverse proxy or WAF or something.
  • See <https://ds.rois.ac.jp/center8/crawler/>. The parent page says it builds LLMs in its infographic: <https://ds.rois.ac.jp/center8/>
  • https://webz.io/bot.html
  • https://kangaroollm.com.au/kangaroo-bot/
  • I'm not blocking CCBot for now. It publishes a free index for anyone to use. Google used this to train the initial version of Bard (now called Gemini). I allow CCBot since its index is also used for upstart/hobbyist search engines like Alexandria and for genuinely useful academic work I personally like.
  • I allow Owler for similar reasons: <https://openwebsearch.eu/owler/#owler-opt-out>, <https://openwebsearch.eu/common-goals-with-common-crawl/>.
  • Omgilibot/Omgili is similar to CCBot, except it sells the scrape results. I'm not familiar enough with Omgili to make a call here.
  • In the long run, my embedded robots meta-tags and headers could cover gen-AI.
  • I don't block cohere-ai or PerplexityBot: they don't appear to actually scrape data for LLM training; their crawling powers search engines with integrated pre-trained LLMs.
  • TODO: investigate whether YouBot scrapes to train its own in-house LLM.
  • Ghost CMS + Webmentions allowed
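For crawlers that ignore robots.txt (like the AllenAI one mentioned above), the blocking has to happen server-side, e.g. in a reverse proxy, WAF, or the application itself. A minimal WSGI middleware sketch; the user-agent token below is a placeholder for illustration, not the crawler's documented string:

```python
# Hypothetical token list: substitute the real user-agent substrings
# of the crawlers you want to block.
BLOCKED_UA_TOKENS = ("badbot",)

def block_bad_bots(app):
    """Wrap a WSGI app and answer 403 to blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

# Tiny demo app and a simulated request from a blocked crawler:
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

wrapped = block_bad_bots(demo_app)
statuses = []
wrapped({"HTTP_USER_AGENT": "BadBot/1.0"},
        lambda status, headers: statuses.append(status))
print(statuses[0])  # 403 Forbidden
```

User-agent strings are trivially spoofed, so in practice this is a courtesy filter; persistent scrapers need IP- or behavior-based blocking at the proxy layer.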