Robots Exclusion Standard data for citeseerx.ist.psu.edu

Resource Scan

Scan Details

Site Domain citeseerx.ist.psu.edu
Base Domain psu.edu
Scan Status Failed
Failure Stage Fetching resource.
Failure Reason Couldn't establish SSL connection.
Last Scan 2025-05-29T16:50:03+00:00
Next Scan 2025-06-28T16:50:03+00:00

Last Successful Scan

Scanned 2025-04-07T14:57:46+00:00
URL https://citeseerx.ist.psu.edu/robots.txt
Domain IPs 130.203.136.161, 130.203.136.162, 130.203.136.163
Response IP 130.203.136.163
Found Yes
Hash 36b7e9f14ffd55b2324f657c61c928d65664ec09fa77551aa7cd80b0c522109c
SimHash 061e9d508645

Groups

baiduspider

Rule Path
Disallow /

baiduspider

Rule Path
Disallow /

baiduspider+

Rule Path
Disallow /

googlebot

Rule Path
Disallow

petalbot

Rule Path
Disallow /

bingbot

Rule Path
Disallow /
Disallow /doc_view/pid*
Disallow /pdf*

*

Rule Path
Disallow /doc_view/pid*
Disallow /pdf*

Other Records

Field Value
crawl-delay 10
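
The paths `/doc_view/pid*` and `/pdf*` in the groups above use the `*` wildcard. The following is a minimal sketch of how such a pattern could be evaluated against a URL path, assuming the common convention (also described in RFC 9309) that `*` matches any run of characters and a trailing `$` anchors the end. The function names and the sample paths are hypothetical and are not taken from the scanned site.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the start of `path`.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


def is_disallowed(patterns, path: str) -> bool:
    """True if any non-empty Disallow pattern matches `path`.

    An empty Disallow value (as in the googlebot group) restricts nothing.
    """
    return any(p and rule_matches(p, path) for p in patterns)


# Disallow rules reported for the '*' group.
DEFAULT_GROUP = ["/doc_view/pid*", "/pdf*"]

# Hypothetical paths, for illustration only.
for sample in ["/doc_view/pid/abc123", "/pdf/example", "/search"]:
    print(sample, is_disallowed(DEFAULT_GROUP, sample))
```

Because a trailing `*` already matches everything after the prefix, these two rules behave like plain prefix rules on `/doc_view/pid` and `/pdf`.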

msnbot

Rule Path
Disallow /
Disallow /doc_view/pid*
Disallow /pdf*

Other Records

Field Value
crawl-delay 40

Other Records

Field Value
sitemap https://citeseerx.ist.psu.edu/sitemap_index.xml
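
To see how a crawler might interpret the groups listed above, the sketch below feeds an abridged, hand-reconstructed robots.txt to Python's standard `urllib.robotparser`. The reconstruction is an assumption: the original file's exact layout, casing, and ordering are not shown in this report, and duplicate groups are omitted. As far as I know, the standard-library parser compares paths as plain prefixes, so the `*` patterns are not exercised here; only the site root is queried.

```python
from urllib.robotparser import RobotFileParser

# Abridged reconstruction of the scanned groups (an assumption, not the real file).
ROBOTS_TXT = """\
User-agent: baiduspider
Disallow: /

User-agent: googlebot
Disallow:

User-agent: petalbot
Disallow: /

User-agent: bingbot
Disallow: /

User-agent: *
Disallow: /doc_view/pid*
Disallow: /pdf*
Crawl-delay: 10

User-agent: msnbot
Disallow: /
Crawl-delay: 40

Sitemap: https://citeseerx.ist.psu.edu/sitemap_index.xml
"""

ROOT_URL = "https://citeseerx.ist.psu.edu/"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def summarize(agent: str) -> dict:
    """Report whether `agent` may fetch the site root and its crawl delay."""
    return {
        "agent": agent,
        "can_fetch_root": parser.can_fetch(agent, ROOT_URL),
        "crawl_delay": parser.crawl_delay(agent),
    }


# "examplebot" is a hypothetical agent that falls through to the '*' group.
for agent in ["baiduspider", "googlebot", "bingbot", "msnbot", "examplebot"]:
    print(summarize(agent))
print(parser.site_maps())
```

An empty `Disallow` value, as in the googlebot group, places no restriction on that agent, while `Disallow /` blocks the whole site for it.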

Comments

  • blocked for extensive crawls without respecting crawl-delay
  • added msnbot with more delay - was hitting hard with different ip's
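
The comments above concern crawlers that ignored the crawl-delay values. As an illustration only, here is a minimal sketch of a client-side gate that spaces requests by a fixed delay, assuming crawl-delay is read as a minimum number of seconds between consecutive requests (a common interpretation; the directive is not part of RFC 9309). The class name `CrawlDelayGate` and its parameters are hypothetical.

```python
import time


class CrawlDelayGate:
    """Space consecutive requests by at least `delay_seconds`."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay - (now - self._last))
            if pause:
                self._sleep(pause)
        # Record when this request is allowed to proceed.
        self._last = now + pause
        return pause


# Usage sketch: call gate.wait() immediately before each request.
gate = CrawlDelayGate(10)  # the '*' group reports crawl-delay 10
first_pause = gate.wait()  # the first call never sleeps
```

A single gate like this only limits one client process; it does not coordinate requests sent from several machines or addresses.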