opencorporates.com
robots.txt

Robots Exclusion Standard data for opencorporates.com

Resource Scan

Scan Details

Site Domain opencorporates.com
Base Domain opencorporates.com
Scan Status Ok
Last Scan 2024-10-04T22:11:30+00:00
Next Scan 2024-10-11T22:11:30+00:00

Last Scan

Scanned 2024-10-04T22:11:30+00:00
URL https://opencorporates.com/robots.txt
Domain IPs 209.126.35.14
Response IP 209.126.35.14
Found Yes
Hash ad18359ba750d55a416ebb75a3284b5a714cb49553560bf10e544ba3a957d992
SimHash a4cdb9ad6442
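
A content hash like the one above is typically used to tell whether a robots.txt file changed between scans. The report does not say which algorithm produced it; the 64 hex characters are consistent with SHA-256, so the sketch below assumes SHA-256 over the raw response body. The function names and the sample body are hypothetical, and the digest here is not expected to reproduce the value in the report.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return the hex SHA-256 digest of a robots.txt body (assumed algorithm)."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_hash: str, body: bytes) -> bool:
    """Compare a freshly fetched body against the hash stored from the last scan."""
    return content_fingerprint(body) != previous_hash


# Hypothetical sample body, not the real file contents.
sample = b"User-agent: *\nDisallow: /search\n"
digest = content_fingerprint(sample)
```

An exact hash only signals that something changed, even a single byte. A SimHash is a similarity hash, so it is better suited to judging how much changed.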

Groups

rogerbot

Rule Path
Disallow

gptbot

Rule Path
Disallow /

*

Rule Path
Disallow /assets
Disallow /data
Disallow /events
Disallow /filings
Disallow /networks
Disallow /officers
Disallow /placeholders
Disallow /search
Disallow /statements
Disallow /users
Disallow /*?page=
Disallow /*%26page%3D
Disallow /*/network.json
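
To illustrate how the rules for the `*` group might be evaluated, here is a minimal sketch of a matcher that treats `*` as "any sequence of characters" and a trailing `$` as an end anchor, in the style of RFC 9309. It is a simplification: it compares patterns literally without normalizing percent-encoding, and it skips longest-match precedence because this group contains only Disallow rules. The function names and the `/companies/...` test paths are hypothetical.

```python
import re

# Disallow patterns copied from the "*" group in the report.
STAR_GROUP_DISALLOW = [
    "/assets", "/data", "/events", "/filings", "/networks", "/officers",
    "/placeholders", "/search", "/statements", "/users",
    "/*?page=", "/*%26page%3D", "/*/network.json",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" wherever "*" appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, patterns=STAR_GROUP_DISALLOW) -> bool:
    """Return True if any pattern matches from the start of the path."""
    return any(pattern_to_regex(p).match(path_and_query) for p in patterns)
```

For example, `is_disallowed("/companies/gb?page=2")` is true under the `/*?page=` rule, while `is_disallowed("/companies/gb/00000001")` is false. Note that Python's `urllib.robotparser` does plain prefix matching and would not interpret the wildcard rules this way.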

Other Records

Field Value
sitemap https://opencorporates.com/sitemap.xml.gz

Comments

  • See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
  • To ban all spiders from the entire site uncomment the next two lines:
  • User-Agent: *
  • Disallow: /
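
The effect of uncommenting those two lines can be checked with Python's standard `urllib.robotparser`. The sketch below is illustrative only: it assumes the lines appear in the file prefixed with `#`, and the robots.txt bodies, the `parse_rules` helper, and the `examplebot` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# With the ban commented out, only the /search rule applies.
COMMENTED = """\
# User-Agent: *
# Disallow: /
User-agent: *
Disallow: /search
"""

# With the two lines uncommented, every path is blocked for every agent.
UNCOMMENTED = """\
User-Agent: *
Disallow: /
"""

commented = parse_rules(COMMENTED)
uncommented = parse_rules(UNCOMMENTED)
```

Here `commented.can_fetch("examplebot", "https://opencorporates.com/")` allows the request, while the same call on `uncommented` refuses it.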