gwern.net
robots.txt

Robots Exclusion Standard data for gwern.net

Resource Scan

Scan Details

Site Domain gwern.net
Base Domain gwern.net
Scan Status Ok
Last Scan 2025-11-24T09:11:09+00:00
Next Scan 2025-12-24T09:11:09+00:00

Last Scan

Scanned 2025-11-24T09:11:09+00:00
URL https://gwern.net/robots.txt
Domain IPs 104.26.10.177, 104.26.11.177, 172.67.71.248, 2606:4700:20::681a:ab1, 2606:4700:20::681a:bb1, 2606:4700:20::ac43:47f8
Response IP 104.26.11.177
Found Yes
Hash b97e30a11dd5969b162bfbabff3637558cb2c26004376ac439897efb6b34bea4
SimHash 60200a3aced0

Groups

ia_archiver

Rule Path
Disallow /
Allow /modafinil
Allow /dnm-arrest

*

Rule Path
Disallow /fulltext
Disallow /*.md
Disallow /*.md.html
Disallow /static/*.*.html
Disallow /static/nginx/*
Disallow /static/redirect/*
Disallow /metadata/*
Disallow /metadata/annotation/backlink/*
Disallow /metadata/annotation/similar/*
Disallow /metadata/annotation/link-bibliography/*
Disallow /confidential/*
Disallow /private/*
Disallow /secret/*
Disallow /doc/www/*
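
The two rule groups above can be evaluated with longest-match semantics (RFC 9309): the matching pattern with the most characters wins, and on a tie Allow beats Disallow. A minimal sketch in Python, with the rule lists transcribed from the groups above (a hand-rolled matcher, since wildcard patterns like /*.md need RFC 9309-style matching):

```python
import re

def compile_rule(pattern):
    """Translate a robots.txt path pattern into an anchored regex:
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchor_end = pattern.endswith("$")
    if anchor_end:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchor_end else ""))

def is_allowed(rules, path):
    """Evaluate (kind, pattern) rules against a URL path:
    longest matching pattern wins; on a tie, 'allow' wins.
    A path matched by no rule is allowed."""
    best_len, verdict = -1, "allow"
    for kind, pattern in rules:
        if compile_rule(pattern).match(path):
            n = len(pattern)
            if n > best_len or (n == best_len and kind == "allow"):
                best_len, verdict = n, kind
    return verdict == "allow"

# Rule groups transcribed from the scan above (subset of the '*' group).
IA_ARCHIVER = [("disallow", "/"), ("allow", "/modafinil"), ("allow", "/dnm-arrest")]
WILDCARD = [
    ("disallow", "/fulltext"),
    ("disallow", "/*.md"),
    ("disallow", "/metadata/*"),
    ("disallow", "/doc/www/*"),
]

is_allowed(IA_ARCHIVER, "/modafinil")  # True: Allow (10 chars) beats Disallow / (1 char)
is_allowed(WILDCARD, "/gpt-3.md")      # False: /*.md matches
```

This shows why the ia_archiver group works: Disallow / blocks everything, but the longer Allow /modafinil and Allow /dnm-arrest patterns override it for those two pages.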

Other Records

Field Value
sitemap https://gwern.net/sitemap.xml
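
The Sitemap record points crawlers at the site's URL inventory. A minimal sketch of extracting URLs from a sitemap document (the XML below is a hypothetical sample in the standard sitemaps.org schema, not fetched from gwern.net):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# Hypothetical sample; a real crawler would fetch the URL given in the
# Sitemap record (https://gwern.net/sitemap.xml) instead.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://gwern.net/modafinil</loc></url>
  <url><loc>https://gwern.net/dnm-arrest</loc></url>
</urlset>"""

# sitemap_urls(SAMPLE) -> ['https://gwern.net/modafinil', 'https://gwern.net/dnm-arrest']
```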

Comments

  • Hide copies: duplicate content is bad for SEO (and clutters search results), so no Markdown sources, WWW archives, metadata snippets, or link-bibliography compilations.
  • Disallow syntax-highlighted versions of source code as duplicates.
  • Avoid spurious Google hits for filenames cluttering results.
  • Disallow doc/*/index pages because they keep cluttering up Google Scholar:
  • Disallow: /doc/*/index
  • Allow: /doc/rotten.com/*