www.gov.uk
robots.txt

Robots Exclusion Standard data for www.gov.uk

Resource Scan

Scan Details

Site Domain www.gov.uk
Base Domain www.gov.uk
Scan Status Ok
Last Scan 2024-06-01T04:31:49+00:00
Next Scan 2024-06-15T04:31:49+00:00

Last Scan

Scanned 2024-06-01T04:31:49+00:00
URL https://www.gov.uk/robots.txt
Domain IPs 151.101.0.144, 151.101.128.144, 151.101.192.144, 151.101.64.144, 2a04:4e42:200::144, 2a04:4e42:400::144, 2a04:4e42:600::144, 2a04:4e42::144
Response IP 199.232.44.144
Found Yes
Hash 047b1ef95795017de52c48a860efb7adebbeebcaa9d55f071a41532ed8a9fa99
SimHash 6e1c9c5d95d3

Groups

*

Rule Path
Disallow /*/print$
Disallow /info/*
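The two rules in the global group use the common wildcard extensions to the original robots.txt syntax: `*` matches any run of characters and a trailing `$` anchors the rule to the end of the path. As a sketch (not part of the scan data), this matching can be expressed by translating a rule into a regular expression:

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt rule path with '*' and trailing '$'
    wildcards into an equivalent compiled regular expression."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape regex metacharacters, then let '*' match any characters.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

# '/*/print$' blocks print views, but only when '/print' ends the path.
print(bool(rule_to_regex("/*/print$").match("/browse/tax/print")))   # True
print(bool(rule_to_regex("/*/print$").match("/browse/tax/printers")))  # False

# '/info/*' blocks everything under /info/ (the "user needs" pages).
print(bool(rule_to_regex("/info/*").match("/info/some-page")))  # True
```

The example paths here are hypothetical; the scan records only the rule patterns themselves.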

ahrefsbot

No rules defined. All paths allowed.

Other Records

Field Value
crawl-delay 10

deepcrawl

Rule Path
Disallow /

ms search 6.0 robot

Rule Path
Disallow /

Other Records

Field Value
sitemap https://www.gov.uk/sitemap.xml

Comments

  • Don't allow indexing of user needs pages
  • https://ahrefs.com/robot/ crawls the site frequently
  • https://www.deepcrawl.com/bot/ makes lots of requests. Ideally we'd slow it down rather than blocking it, but it doesn't mention whether or not it supports crawl-delay.
  • Complaints of 429 'Too many requests' seem to be coming from SharePoint servers (https://social.msdn.microsoft.com/Forums/en-US/3ea268ed-58a6-4166-ab40-d3f4fc55fef4). The robot doesn't recognise its User-Agent string; see the MS support article: https://support.microsoft.com/en-us/help/3019711/the-sharepoint-server-crawler-ignores-directives-in-robots-txt
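The groups recorded above can be reassembled into a robots.txt file and checked with Python's stdlib `urllib.robotparser`. This is a sketch: the original casing of the agent names is an assumption (the scanner normalises them to lowercase), and the stdlib parser implements only the original prefix-matching rules, so it does not interpret the `*`/`$` wildcards in the global group.

```python
from urllib import robotparser

# Reconstructed from the groups in the scan above; agent-name casing
# is an assumption, as the scan report lowercases user-agent names.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*/print$
Disallow: /info/*

User-agent: AhrefsBot
Crawl-delay: 10

User-agent: deepcrawl
Disallow: /

User-agent: MS Search 6.0 Robot
Disallow: /

Sitemap: https://www.gov.uk/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# deepcrawl is blocked from the whole site; AhrefsBot gets a 10s delay.
print(rp.can_fetch("deepcrawl", "https://www.gov.uk/browse/tax"))  # False
print(rp.crawl_delay("AhrefsBot"))  # 10
```

Matching is case-insensitive on the agent name, so the lowercase `deepcrawl` and `ms search 6.0 robot` forms shown in the scan resolve to the same groups.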