scholar.archive.org
robots.txt

Robots Exclusion Standard data for scholar.archive.org

Resource Scan

Scan Details

Site Domain scholar.archive.org
Base Domain archive.org
Scan Status Ok
Last Scan 2025-03-03T10:41:39+00:00
Next Scan 2025-04-02T10:41:39+00:00

Last Scan

Scanned 2025-03-03T10:41:39+00:00
URL https://scholar.archive.org/robots.txt
Domain IPs 207.241.225.8, 207.241.232.8
Response IP 207.241.232.8
Found Yes
Hash 497d356a937fcdeff680e3f9d9a7cdeae67e23633a4624957c21aecb9206556c
SimHash be275950c7d5

Groups

User agents: semrushbot, yandexbot, bingbot, googlebot, semanticscholarbot, yacybot, petalbot, yeti, riddler
Rule: Disallow /search

User agent: *
Rule: Disallow /search

User agent: *
Rule: Allow /
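
The groups above are standard Robots Exclusion Protocol rules, which a polite crawler would check before fetching any page. A minimal sketch with Python's stdlib `urllib.robotparser`, using a condensed inline copy of the rules so it runs offline (the agent name `mybot` is illustrative; use `rfp.set_url(...)` and `rfp.read()` to check the live file instead):

```python
from urllib.robotparser import RobotFileParser

# Condensed reproduction of the groups listed above: named search bots
# and the default (*) group are both barred from /search.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /search

User-agent: *
Disallow: /search
"""

rfp = RobotFileParser()
rfp.parse(ROBOTS_TXT.splitlines())

# Search result pages are off-limits for every agent.
print(rfp.can_fetch("mybot", "https://scholar.archive.org/search?q=x"))

# Paths not matched by any Disallow rule remain crawlable by default.
print(rfp.can_fetch("mybot", "https://scholar.archive.org/work/abc123"))
```

Note that `urllib.robotparser` matches `Disallow: /search` as a simple path prefix, so query strings on `/search` are also blocked.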

Other Records

Sitemap: https://scholar.archive.org/sitemap.xml
Sitemap: https://scholar.archive.org/sitemap-index-works.xml
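
The second record is a sitemap index, which points at child sitemaps rather than pages directly. A sketch of reading one with the stdlib XML parser, using an inline sample fragment in the standard sitemaps.org schema (the child filenames here are illustrative, not taken from the live index):

```python
import xml.etree.ElementTree as ET

# Minimal sitemap-index fragment; the real index lives at the
# sitemap-index-works.xml URL listed above.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://scholar.archive.org/sitemap-works-0.xml</loc></sitemap>
  <sitemap><loc>https://scholar.archive.org/sitemap-works-1.xml</loc></sitemap>
</sitemapindex>
"""

# All sitemap elements live in the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_xml):
    """Return the <loc> URL of every child sitemap in an index file."""
    root = ET.fromstring(index_xml)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

print(child_sitemaps(SAMPLE_INDEX))
```

Each child URL can then be fetched and parsed the same way for its `<url><loc>` entries.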

Comments

  • Hello friends!
  • If you are considering large or automated crawling, you may want to look at our catalog API (https://api.fatcat.wiki) or bulk database snapshots instead.
  • large-scale bots should not index search pages
  • crawling search result pages is expensive, so we do specify a long crawl delay for those (for bots other than the above broad search bots)
  • UPDATE: actually, just block all robots from search page, we are overwhelmed as of 2022-10-31
  • Allow: /search
  • Crawl-delay: 5
  • by default, can crawl anything on this domain. HTTP 429 ("backoff") status codes are used for rate-limiting instead of any crawl delay specified here. Up to a handful of concurrent requests should be fine.
  • same info as sitemap-index-works.xml plus following citation_pdf_url
  • Sitemap: https://scholar.archive.org/sitemap-index-access.xml
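
The comment about HTTP 429 implies a client-side backoff loop rather than a fixed crawl delay. A minimal sketch of such a loop (the function name, retry counts, and delays are illustrative assumptions, not part of the site's guidance):

```python
import time
import urllib.request
import urllib.error

def fetch_with_backoff(url, max_retries=5, base_delay=1.0,
                       opener=urllib.request.urlopen):
    """Fetch `url`, retrying on HTTP 429 responses.

    Honors a Retry-After header (in seconds) when the server sends one;
    otherwise waits base_delay * 2**attempt between attempts.
    Any non-429 HTTP error is re-raised immediately.
    """
    for attempt in range(max_retries):
        try:
            return opener(url)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            retry_after = err.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
            time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {max_retries} retries")
```

Combined with a small cap on concurrent requests, this matches the site's stated preference for 429-driven rate limiting over a robots.txt crawl delay.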