www.gov.uk
robots.txt

Robots Exclusion Standard data for www.gov.uk

Resource Scan

Scan Details

Site Domain www.gov.uk
Base Domain www.gov.uk
Scan Status Ok
Last Scan 2024-11-02T10:54:50+00:00
Next Scan 2024-11-16T10:54:50+00:00

Last Scan

Scanned 2024-11-02T10:54:50+00:00
URL https://www.gov.uk/robots.txt
Domain IPs 151.101.0.144, 151.101.128.144, 151.101.192.144, 151.101.64.144, 2a04:4e42:200::144, 2a04:4e42:400::144, 2a04:4e42:600::144, 2a04:4e42::144
Response IP 199.232.44.144
Found Yes
Hash 5a1295a7846646430e2cf3a8f7d40898ed0c3b1cc02e9e1b3f68366c6b4d6574
SimHash 6e1c9c5d95d3

Groups

*

Rule Path
Disallow /*/print$
Disallow /info/*
Disallow /search/all*
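The wildcard group above uses the common `*` (any run of characters) and `$` (end-of-path anchor) extensions to the Robots Exclusion Standard. Python's standard `urllib.robotparser` does not interpret these wildcards, so here is a minimal sketch of how such rule paths can be matched; the helper names are ours, not part of any library.

```python
import re

def rule_to_regex(path: str) -> re.Pattern:
    # Translate a robots.txt rule path into a regex:
    # '*' matches any run of characters; a trailing '$' anchors the end.
    anchored = path.endswith("$")
    core = path[:-1] if anchored else path
    pattern = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

# The Disallow rules from the '*' group above.
DISALLOW = ["/*/print$", "/info/*", "/search/all*"]
RULES = [rule_to_regex(p) for p in DISALLOW]

def allowed(url_path: str) -> bool:
    # A path is allowed if no Disallow rule matches it.
    return not any(r.match(url_path) for r in RULES)

print(allowed("/government/print"))   # False: matches /*/print$
print(allowed("/info/gov-uk"))        # False: matches /info/*
print(allowed("/browse/benefits"))    # True: no rule matches
```

Note that `/*/print$` requires at least one `/`-delimited segment before `print`, so a top-level `/print` would still be allowed under this group.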

ahrefsbot

No rules defined. All paths allowed.

Other Records

Field Value
crawl-delay 10

deepcrawl

Rule Path
Disallow /

ms search 6.0 robot

Rule Path
Disallow /

Other Records

Field Value
sitemap https://www.gov.uk/sitemap.xml
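The non-wildcard records above (the per-agent `crawl-delay`, the blanket `Disallow: /` groups, and the `sitemap` field) can be read directly with Python's standard `urllib.robotparser`. The sketch below reconstructs the relevant fragment of the file inline rather than fetching https://www.gov.uk/robots.txt, so the behaviour is illustrative only.

```python
from urllib.robotparser import RobotFileParser

# A fragment reconstructed from the groups and records listed above.
ROBOTS_TXT = """\
User-agent: ahrefsbot
Crawl-delay: 10

User-agent: deepcrawl
Disallow: /

Sitemap: https://www.gov.uk/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.crawl_delay("ahrefsbot"))                        # 10
print(rp.can_fetch("deepcrawl", "https://www.gov.uk/"))   # False
print(rp.site_maps())   # ['https://www.gov.uk/sitemap.xml']
```

`crawl_delay()` and `site_maps()` are available in Python 3.6+ and 3.8+ respectively; as noted above, this parser ignores the `*`/`$` wildcards used in the main group.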

Comments

  • Don't allow indexing of user needs pages
  • Don't allow indexing of site search
  • https://ahrefs.com/robot/ crawls the site frequently
  • https://www.deepcrawl.com/bot/ makes lots of requests. Ideally we'd slow it down rather than blocking it, but it doesn't mention whether or not it supports crawl-delay.
  • Complaints of 429 'Too Many Requests' seem to be coming from SharePoint servers (https://social.msdn.microsoft.com/Forums/en-US/3ea268ed-58a6-4166-ab40-d3f4fc55fef4). The robot doesn't recognise its User-Agent string; see the MS support article: https://support.microsoft.com/en-us/help/3019711/the-sharepoint-server-crawler-ignores-directives-in-robots-txt