clagrills.com
robots.txt

Robots Exclusion Standard data for clagrills.com

Resource Scan

Scan Details

Site Domain clagrills.com
Base Domain clagrills.com
Scan Status Ok
Last Scan 2024-09-08T10:32:29+00:00
Next Scan 2024-10-08T10:32:29+00:00

Last Scan

Scanned 2024-09-08T10:32:29+00:00
URL https://clagrills.com/robots.txt
Domain IPs 66.117.4.4
Response IP 66.117.4.4
Found Yes
Hash 2c971b37b363abfd9e01452b18cc49dd3877c28ddd0bd79f0519b5f01a437888
SimHash 3a95db13f34c
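The Hash field above looks like a SHA-256 digest of the raw robots.txt body (64 hex digits), though the scanner does not say so explicitly. A minimal sketch of reproducing it under that assumption; the comparison only holds while the live file is unchanged since the scan:

    import hashlib
    import urllib.request

    SCAN_HASH = "2c971b37b363abfd9e01452b18cc49dd3877c28ddd0bd79f0519b5f01a437888"

    # Fetch the same URL the scanner recorded and hash the raw bytes.
    with urllib.request.urlopen("https://clagrills.com/robots.txt") as resp:
        body = resp.read()

    digest = hashlib.sha256(body).hexdigest()
    print(digest)
    print("matches scan:", digest == SCAN_HASH)  # True only if the file is unchanged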

Groups

gigabot
ia_archiver-web.archive.org
ia_archiver
yandex
yandexbot
moget
ichiro
naverbot
yeti
baiduspider
baiduspider-video
baiduspider-image
sogou spider
youdaobot
yodaobot
ahrefsbot
sistrix
seokicks-robot
seokicks
mj12bot
searchmetricsbot
netseer
semrushbot
discoverybot
backlinkcrawler
ralocobot
yandeximages
a6-indexer
coccoc
apache-httpclient
curious george
webmastercoffee
spbot
whelanlabs
research-scanner
runet-research-crawler
corporatenewssearchengine
spiderling
w3clinemode
netresearchserver
surveybot
gimme60bot
curious george
analyticsseo
genieo
crazywebcrawler
findxbot
domainsigmacrawler
aihitbot
changedetect
changedetection
infominder
sogou
sogou web spider
toweyabot
domainappender
megaindex
deusu
grapeshotcrawler
wotbox
domain re-animator bot
domain re-animator
qwantify
istellabot

Product                    Comment
gigabot                    Gigabot is the name of Gigablast's robot
yandex                     Russian search engine
coccoc                     2-2015 Vietnamese browser
apache-httpclient          2-2015
curious george             2-2015
webmastercoffee            2-2015
spbot                      2-2015
whelanlabs                 2-2015
research-scanner           2-2015
runet-research-crawler     2-2015
corporatenewssearchengine  2-2015
spiderling                 2-2015
w3clinemode                2-2015 HttpClient?
netresearchserver          2-2015
surveybot                  2-2015

Rule      Path
Disallow  /

*

Product  Comment
*        Everybody else

Rule      Path        Comment
Disallow  /part-xref  MCM/MHP cross reference
Disallow  /stayout    Duh
Disallow  /pinnacle   Nothing much here
Allow     /           -
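To see how the two groups behave in practice, here is a sketch using Python's standard urllib.robotparser, with the rules inlined (and the long blocked-bot group abbreviated to a single agent) so the check works even if the live file has changed since this scan:

    import urllib.robotparser

    # Abbreviated copy of the rules shown above: one named bot from the
    # blocked group, plus the wildcard group.
    RULES = """
    User-agent: ahrefsbot
    Disallow: /

    User-agent: *
    Disallow: /part-xref
    Disallow: /stayout
    Disallow: /pinnacle
    Allow: /
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(RULES.splitlines())  # parse() strips each line, so indentation is harmless

    print(rp.can_fetch("ahrefsbot", "https://clagrills.com/"))              # False: blocked everywhere
    print(rp.can_fetch("SomeOtherBot", "https://clagrills.com/part-xref"))  # False
    print(rp.can_fetch("SomeOtherBot", "https://clagrills.com/"))           # True

Note that urllib.robotparser applies rules in file order (first match wins) rather than Google's longest-path precedence; for these rules the outcome is the same because Allow: / comes last.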

Comments

  • Robots.txt file
  • 12-2014 Change philosophy. Block known bad guys. For everyone else, block image directories. Most of the bad guys simply ignore robots.txt anyway.
  • 12-2014 I'll block the bad guys like AmazonAws, Hackers, TopHosts and spammers in our firewall.
  • June 2012 Setup as a common robots.txt for all of my sites. Obviously, some of the directories don't exist on all sites.
  • From Google at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
  • Only one group of group-member records is valid for a particular crawler. The crawler must determine the correct group of records by finding the group with the most specific user-agent that still matches. All other groups of records are ignored by the crawler. The user-agent is non-case-sensitive. All non-matching text is ignored (for example, both googlebot/1.2 and googlebot* are equivalent to googlebot). The order of the groups within the robots.txt file is irrelevant. (A sketch of this selection rule follows this list.)
  • The start-of-group element user-agent is used to specify for which crawler the group is valid. Only one group of records is valid for a particular crawler.
  • Name the specific bot we don't want, they'll probably ignore this
  • 6-2016 User-agent: msnbot-media # Don't steal our images
  • 6-2016 User-agent: Googlebot-Image
  • 6-2016 User-agent: yahoo-MMCrawler # Don't steal our images
  • 6-2016 User-agent: yahoo-MMCrawler/3.x # Don't steal our images
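As noted in the list above, here is a short sketch of the group-selection rule quoted from Google's documentation: match group tokens case-insensitively against the crawler's name, prefer the most specific (longest) match, and fall back to the * group. Treating the token as a name prefix is a simplified reading; real parsers also strip version suffixes like /1.2. The groups in the usage example are an illustrative subset, not the full file:

    def select_group(crawler_ua, groups):
        """Pick the rule group for a crawler, per the quoted Google rule.

        groups: list of (user_agent_token, rules) pairs from robots.txt.
        """
        name = crawler_ua.lower()
        matching = [(tok, rules) for tok, rules in groups
                    if tok != "*" and name.startswith(tok.lower())]
        if matching:
            # Most specific user-agent = longest matching token.
            return max(matching, key=lambda g: len(g[0]))[1]
        # No named group matched; fall back to the wildcard group, if any.
        for tok, rules in groups:
            if tok == "*":
                return rules
        return []

    groups = [("gigabot", ["Disallow: /"]),
              ("*", ["Disallow: /part-xref", "Allow: /"])]
    print(select_group("Gigabot/3.0", groups))     # the gigabot group
    print(select_group("SomeNewBot/1.0", groups))  # the * group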

Warnings

  • 1 invalid line.