nepal.sil.org
robots.txt

Robots Exclusion Standard data for nepal.sil.org

Resource Scan

Scan Details

Site Domain nepal.sil.org
Base Domain sil.org
Scan Status Ok
Last Scan 2025-05-22T04:08:22+00:00
Next Scan 2025-06-05T04:08:22+00:00

Last Scan

Scanned 2025-05-22T04:08:22+00:00
URL https://nepal.sil.org/robots.txt
Domain IPs 104.22.10.254, 104.22.11.254, 172.67.29.248, 2606:4700:10::6816:afe, 2606:4700:10::6816:bfe, 2606:4700:10::ac43:1df8
Response IP 104.22.11.254
Found Yes
Hash fefe3f5d7c3d8b8c7dabc83c7e2ee2c3d7def9e18a84e1c12d4eccc95edb10c0
SimHash bc147d00e678

Groups

*

Rule Path
Allow /misc/*.css$
Allow /misc/*.css?
Allow /misc/*.js$
Allow /misc/*.js?
Allow /misc/*.gif
Allow /misc/*.jpg
Allow /misc/*.jpeg
Allow /misc/*.png
Allow /modules/*.css$
Allow /modules/*.css?
Allow /modules/*.js$
Allow /modules/*.js?
Allow /modules/*.gif
Allow /modules/*.jpg
Allow /modules/*.jpeg
Allow /modules/*.png
Allow /profiles/*.css$
Allow /profiles/*.css?
Allow /profiles/*.js$
Allow /profiles/*.js?
Allow /profiles/*.gif
Allow /profiles/*.jpg
Allow /profiles/*.jpeg
Allow /profiles/*.png
Allow /themes/*.css$
Allow /themes/*.css?
Allow /themes/*.js$
Allow /themes/*.js?
Allow /themes/*.gif
Allow /themes/*.jpg
Allow /themes/*.jpeg
Allow /themes/*.png
Disallow /includes/
Disallow /misc/
Disallow /modules/
Disallow /profiles/
Disallow /scripts/
Disallow /themes/
Disallow /CHANGELOG.txt
Disallow /cron.php
Disallow /INSTALL.mysql.txt
Disallow /INSTALL.pgsql.txt
Disallow /INSTALL.sqlite.txt
Disallow /install.php
Disallow /INSTALL.txt
Disallow /LICENSE.txt
Disallow /MAINTAINERS.txt
Disallow /update.php
Disallow /UPGRADE.txt
Disallow /xmlrpc.php
Disallow /wp-login.php
Disallow %5E.*%5C/wp-includes%5C/wlwmanifest.xml
Disallow /admin/
Disallow /comment/reply/
Disallow /filter/tips/
Disallow /node/add/
Disallow /search/
Disallow /user/register/
Disallow /user/password/
Disallow /user/login/
Disallow /user/logout/
Disallow /wp-json/wp/v2/users/1
Disallow /?q=admin%2F
Disallow /?q=comment%2Freply%2F
Disallow /?q=filter%2Ftips%2F
Disallow /?q=node%2Fadd%2F
Disallow /?q=search%2F
Disallow /?q=user%2Fpassword%2F
Disallow /?q=user%2Fregister%2F
Disallow /?q=user%2Flogin%2F
Disallow /?q=user%2Flogout%2F
Disallow /*?
Allow /*?page=
Disallow /*?page=*&*
Disallow /*?page=0*
Disallow /resources/search/*/*/*
Disallow /*/resources/search/*/*/*

Other Records

Field Value
crawl-delay 10
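
The `*` group above mixes wildcard Allow rules (e.g. `/misc/*.css$`) with broad Disallow rules (e.g. `/misc/`), so the outcome for a given URL depends on which rule wins. Under the longest-match convention used by major crawlers (RFC 9309), the most specific matching pattern decides, with Allow winning ties. A minimal sketch of that evaluation, assuming the standard `*`/`$` semantics (the helper names here are illustrative, not part of any scanner API):

```python
import re

def rule_to_regex(path):
    """Translate a robots.txt path pattern to an anchored regex.
    '*' matches any character sequence; a trailing '$' pins the end."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    pattern = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

def evaluate(url_path, rules):
    """Apply the longest-match rule: the matching pattern with the most
    characters wins, and Allow beats Disallow on a tie.
    `rules` is a list of (verb, pattern) tuples; no match means allowed."""
    best = ("allow", "")
    for verb, pattern in rules:
        if rule_to_regex(pattern).match(url_path):
            if len(pattern) > len(best[1]) or (
                len(pattern) == len(best[1]) and verb == "allow"
            ):
                best = (verb, pattern)
    return best[0] == "allow"

# A few rules from the * group above.
rules = [
    ("allow", "/misc/*.css$"),
    ("disallow", "/misc/"),
    ("disallow", "/*?"),
    ("allow", "/*?page="),
]
print(evaluate("/misc/print.css", rules))  # True: Allow pattern is longer
print(evaluate("/misc/drupal.txt", rules)) # False: only /misc/ matches
print(evaluate("/node/1?page=2", rules))   # True: /*?page= beats /*?
```

This is why `/misc/print.css` stays crawlable even though `/misc/` as a whole is disallowed: the wildcard Allow pattern is more specific, so it takes precedence.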

a6-indexer

Rule Path
Disallow /

alphaseobot

Rule Path
Disallow /

alphaseobot-sa

Rule Path
Disallow /

applebot

Rule Path
Disallow /

aspiegelbot

Rule Path
Disallow /

barkrowler

Rule Path
Disallow /

blackboard safeassign

Rule Path
Disallow /

bingbot/2.0

Rule Path
Disallow /

blexbot

Rule Path
Disallow /

bytespider

Rule Path
Disallow /

crawler4j

Rule Path
Disallow /

dataforseobot

Rule Path
Disallow /

dotbot

Rule Path
Disallow /

gigabot

Rule Path
Disallow /

liebaofast

Rule Path
Disallow /

mauibot

Rule Path
Disallow /

mauibot (crawler.feedback+wc@gmail.com)

Rule Path
Disallow /

megaindex.ru/2.0

Rule Path
Disallow /

mqqbrowser

Rule Path
Disallow /

nimbostratus-bot/v1.3.2

Rule Path
Disallow /

qwant-news

Rule Path
Disallow /

qwantify

Rule Path
Disallow /

seekport crawler

Rule Path
Disallow /

semrushbot

Rule Path
Disallow /

semrushbot-sa

Rule Path
Disallow /

seznambot

Rule Path
Disallow /

sputnikbot/2.3

Rule Path
Disallow /

the knowledge ai

Rule Path
Disallow /

timpibot/0.8

Rule Path
Disallow /

tinytestbot

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

ucbrowser

Rule Path
Disallow /

yacybot

Rule Path
Disallow /

yandexbot

Rule Path
Disallow /

yandexbot/3.0

Rule Path
Disallow /

yeti

Rule Path
Disallow /

yisouspider

Rule Path
Disallow /
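
Each of the per-bot groups above reduces to the same single rule, `Disallow: /`: a full block for that agent, while any unlisted agent falls back to the `*` group. Python's standard `urllib.robotparser` models this group selection; a small sketch against a simplified stand-in for this file (the agent names and paths are examples, not the full rule set):

```python
from urllib import robotparser

# Trimmed-down stand-in for the robots.txt shown above.
ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10

User-agent: semrushbot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("semrushbot", "/about"))  # False: its group blocks everything
print(rp.can_fetch("otherbot", "/about"))    # True: falls back to the * group
print(rp.can_fetch("otherbot", "/admin/x"))  # False: * group disallows /admin/
print(rp.crawl_delay("otherbot"))            # 10
```

One caveat: `robotparser` implements only the original prefix-matching rules, so it would not honor the wildcard `Allow` patterns in the `*` group; a crawler that needs those must use a parser with RFC 9309 semantics.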

Comments

  • robots.txt
  • This file is to prevent the crawling and indexing of certain parts
  • of your site by web crawlers and spiders run by sites like Yahoo!
  • and Google. By telling these "robots" where not to go on your site,
  • you save bandwidth and server resources.
  • This file will be ignored unless it is at the root of your host:
  • Used: http://example.com/robots.txt
  • Ignored: http://example.com/site/robots.txt
  • For more information about the robots.txt standard, see:
  • http://www.robotstxt.org/robotstxt.html
  • CSS, JS, Images
  • Directories
  • Files
  • RH, 06.30.21: these are files that bad bots are likely requesting
  • Paths (clean URLs)
  • RH, 06.30.21: these are files that bad bots are likely requesting
  • Paths (no clean URLs)
  • RH, 07.01.21: Views has URL parameters from exposed filters (Archives and Publications Search views); https://www.drupal.org/node/345620
  • Disallow all URL variables except for page
  • RH, 04.30.24: crawling of Archives search URLs can kill the sites, especially with multiple facets. It might be how bots discover Archives items, so allow single-facet browse and search URLs but block anything deeper (/resources/search/domain/anthropology is allowed, but /resources/search/domain/anthropology/contributor/maranz-david-e is not). If bots find Archives items by crawling the site, this should still allow all items to be found via browse URLs and search URLs. If issues continue, we can consider blocking everything. Note that /*/resources/search is for multilingual URLs
  • Disallow: /resources/browse/*
  • Disallow: /resources/search/*
  • Block bots
  • RDH, 08.19.19: I really don't want to block Applebot, but for now, I am. It is crawling us too much
  • RDH, 05.13.20: I really don't want to block bing, but for now, I am. It is also already in htaccess rules
  • RDH, 06.30.21: Very temporary to get some relief.
  • User-Agent: Googlebot
  • Disallow: /