
Robots Exclusion Standard data for warwick.ac.uk

Resource Scan

Scan Details

Site Domain warwick.ac.uk
Base Domain warwick.ac.uk
Scan Status Ok
Last Scan 2024-09-24T17:25:21+00:00
Next Scan 2024-10-08T17:25:21+00:00

Last Scan

Scanned 2024-09-24T17:25:21+00:00
URL https://warwick.ac.uk/robots.txt
Domain IPs 137.205.28.41
Response IP 137.205.28.41
Found Yes
Hash 796665d34be286e1794d16b969ac7b71e1fe4f0d2cbee5aec0df564415d676f2
SimHash 0c82a003adb6

Groups

gptbot

Rule Path
Disallow /

*

Rule Path
Disallow /training/
Disallow /sitebuilder2/
Allow /sitebuilder2/api/sitebuilder.ics
Allow /sitebuilder2/api/gadgets/
Allow /sitebuilder2/api/rss/
Allow /sitebuilder2/api/sitemap/
Allow /sitebuilder2/api/videoSitemap.xml
Allow /sitebuilder2/file/*

rogerbot

Rule Path
Disallow /services/sport/events/calendar/*?*
Disallow /services/sport/news/*?*
Disallow /services/sport/active/tennis/classes/*?*
Disallow /services/sport/content-hub/feed/*?*
Disallow /services/conferences/content-corner/*?*
Disallow /services/conferences/news/*?*
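
The groups above interact in two ways that are easy to misread: a crawler with its own group (gptbot, rogerbot) uses only that group rather than inheriting the `*` rules, and within a group the longest matching path wins, which is how the Allow lines carve exceptions out of `Disallow /sitebuilder2/`. The following is a minimal sketch of that evaluation in the style of RFC 9309, not the scanner's or any crawler's actual implementation. The names `GROUPS`, `pattern_to_regex`, and `is_allowed` are hypothetical, and exact-token group selection is a simplifying assumption.

```python
import re

# Rules transcribed from the group tables above, keyed by user-agent token.
GROUPS = {
    "gptbot": [("disallow", "/")],
    "*": [
        ("disallow", "/training/"),
        ("disallow", "/sitebuilder2/"),
        ("allow", "/sitebuilder2/api/sitebuilder.ics"),
        ("allow", "/sitebuilder2/api/gadgets/"),
        ("allow", "/sitebuilder2/api/rss/"),
        ("allow", "/sitebuilder2/api/sitemap/"),
        ("allow", "/sitebuilder2/api/videoSitemap.xml"),
        ("allow", "/sitebuilder2/file/*"),
    ],
    "rogerbot": [
        ("disallow", "/services/sport/events/calendar/*?*"),
        ("disallow", "/services/sport/news/*?*"),
        ("disallow", "/services/sport/active/tennis/classes/*?*"),
        ("disallow", "/services/sport/content-hub/feed/*?*"),
        ("disallow", "/services/conferences/content-corner/*?*"),
        ("disallow", "/services/conferences/news/*?*"),
    ],
}


def pattern_to_regex(pattern):
    """Translate a rule path: '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(user_agent, path):
    """Return True if `path` (including any query string) may be fetched by `user_agent`."""
    # Assumption: exact, case-insensitive token match; otherwise fall back to the '*' group.
    rules = GROUPS.get(user_agent.lower(), GROUPS["*"])
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            # The longest pattern wins; on a tie, allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and directive == "allow"):
                best_len, allowed = len(pattern), directive == "allow"
    return allowed
```

For example, `is_allowed("SomeBot", "/sitebuilder2/api/rss/news")` is permitted because the Allow path is longer than `/sitebuilder2/`, while `is_allowed("rogerbot", "/training/architecture")` is also permitted because rogerbot's own group says nothing about `/training/`.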

Other Records

Field Value
sitemap https://warwick.ac.uk/sitebuilder2/api/sitemap/index.xml

Comments

  • robots.txt for https://warwick.ac.uk/
  • Apply to all user agents
  • Don't index the training pages to try and stop people who want to study architecture from applying here because Warwick doesn't offer an Architecture course
  • Explanation: https://twitter.com/matmannion/status/1146342325980975104
  • Disallow indexing of the CMS application itself as no useful content exists there for externals, with exclusions below
  • let google get ical feeds
  • Allow thumbnail images
  • Disallow query string variations of sports calendars/news
  • Disallow query string variations of conferences news