theguardian.com
robots.txt

Robots Exclusion Standard data for theguardian.com

Resource Scan

Scan Details

Site Domain theguardian.com
Base Domain theguardian.com
Scan Status Ok
Last Scan 2024-04-25T09:02:42+00:00
Next Scan 2024-05-02T09:02:42+00:00

Last Scan

Scanned 2024-04-25T09:02:42+00:00
URL https://theguardian.com/robots.txt
Redirect https://www.theguardian.com/robots.txt
Redirect Domain www.theguardian.com
Redirect Base theguardian.com
Domain IPs 151.101.1.111, 151.101.129.111, 151.101.193.111, 151.101.65.111
Redirect IPs 151.101.1.111, 151.101.129.111, 151.101.193.111, 151.101.65.111, 2a04:4e42:200::367, 2a04:4e42:400::367, 2a04:4e42:600::367, 2a04:4e42::367
Response IP 199.232.45.111
Found Yes
Hash 28858e12429620c4a327cd328be53d3c25abe3afc7142b8f69456062645284e1
SimHash cf01552bc7d0
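
The Hash field above is 64 hexadecimal characters long, which is consistent with a SHA-256 digest, although the scan report does not name the algorithm. Assuming SHA-256, a minimal sketch of how a content fingerprint could be used to detect changes between two scans might look like this. The `fingerprint` helper and the sample bodies are hypothetical and are not taken from the scanner.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Return a hex digest of a robots.txt body (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


# Two hypothetical robots.txt bodies standing in for consecutive scans.
previous = fingerprint(b"User-agent: *\nDisallow: /search\n")
current = fingerprint(b"User-agent: *\nDisallow: /search\nDisallow: /embed/\n")

# Any byte-level difference in the file yields a different digest.
changed = previous != current
print(len(current), changed)
```

An exact hash only says that something changed. A similarity hash such as the SimHash field is presumably there to indicate how much changed, but that is not shown here.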

Groups

*

Rule Path
Disallow /sendarticle/
Disallow /Users/
Disallow /users/
Disallow /*/print$
Disallow /email/
Disallow /contactus/
Disallow /share/
Disallow /websearch
Disallow /*?commentpage=
Disallow /whsmiths/
Disallow /external/overture/
Disallow /discussion/report-abuse/*
Disallow /discussion/report-abuse-ajax/*
Disallow /discussion/comment-permalink/*
Disallow /discussion/report-abuse/*
Disallow /discussion/user-report-abuse/*
Disallow /discussion/handlers/*
Disallow /discussion/your-profile
Disallow /discussion/your-comments
Disallow /discussion/edit-profile
Disallow /discussion/search/comments
Disallow /discussion/*
Disallow /search
Disallow /music/artist/*
Disallow /music/album/*
Disallow /books/data/*
Disallow /settings/
Disallow /embed/
Disallow /*styles/js-on.css$
Disallow /sport/olympics/2008/events/*
Disallow /sport/olympics/2008/medals/*
Disallow /f/healthcheck
Disallow /sections
Disallow /top-stories
Disallow /most-read/sport
Disallow /articles
Disallow /global$
Disallow /*/feedarticle/*
Disallow /travel/2013/aug/22/been-there-readers-competition?*
Disallow /preference/*
Disallow /59666047/
Disallow /print/
Disallow /info/tech-feedback
Disallow /production-monitoring/
Disallow *.emailjson
Disallow *.emailtxt
Disallow /headline.txt
Disallow *?*dcr=apps*
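
Several rules in this group use `*` and a trailing `$`. Under the wildcard conventions described in RFC 9309, `*` matches any sequence of characters and `$` anchors the rule to the end of the path, while a rule without `$` matches as a prefix. The sketch below illustrates those semantics with a regular expression. The function names are hypothetical, and this is not how any particular crawler is known to implement matching.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a compiled regex.

    '*' becomes '.*'; a trailing '$' anchors the end of the path.
    Everything else is matched literally, starting at the beginning.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def rule_matches(rule: str, path: str) -> bool:
    """Return True if the rule applies to the given path (and query)."""
    return rule_to_regex(rule).match(path) is not None


# Rules copied from the group above; the paths are made-up examples.
print(rule_matches("/*/print$", "/world/2024/print"))
print(rule_matches("/*/print$", "/world/print/extra"))
print(rule_matches("/global$", "/global-development"))
print(rule_matches("*.emailjson", "/world/article.emailjson"))
```

A full matcher would also need to pick the most specific rule when Allow and Disallow rules both match. That step is omitted here because this group contains only Disallow rules.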

mediapartners-google

Rule Path
Disallow

newsnow

Rule Path
Disallow /

gptbot

Rule Path
Disallow /

ccbot

Rule Path
Disallow /

turnitinbot

Rule Path
Disallow /

petalbot

Rule Path
Disallow /

moodlebot

Rule Path
Disallow /

facebookbot

Rule Path
Disallow /

bytespider

Rule Path
Disallow /

google-extended

Rule Path
Disallow /

https://hada.news

Rule Path
Disallow /

https://www.imediaethics.org

Rule Path
Disallow /

mojeek

Rule Path
Disallow /

jenkersbot

Rule Path
Disallow /

seekr

Rule Path
Disallow /

turnitin

Rule Path
Disallow /

youbot

Rule Path
Disallow /

ia_archiver

Rule Path
Disallow /

archive.org_bot

Rule Path
Disallow /

arquivo-web-crawler

Rule Path
Disallow /

coccocbot-web

Rule Path
Disallow /

seznambot

Rule Path
Disallow /

perplexitybot

Rule Path
Disallow /

yacy

Rule Path
Disallow /

yandex

Rule Path
Disallow /

anthropic-ai

Rule Path
Disallow /

claudebot

Rule Path
Disallow /
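
To show how the per-agent groups above interact with the `*` group, here is a minimal sketch using Python's standard `urllib.robotparser`. The excerpt is a hand-written reconstruction of a few groups from this report, not the live file, and `ExampleBot` is a hypothetical agent name. The standard-library parser uses plain prefix matching and does not interpret `*` or `$` inside paths, so the excerpt keeps only prefix rules.

```python
from urllib.robotparser import RobotFileParser

# Hand-written excerpt based on the groups listed above (an assumption
# about layout; the real file may order or format these differently).
EXCERPT = """\
User-agent: *
Disallow: /search
Disallow: /settings/

User-agent: gptbot
Disallow: /

User-agent: mediapartners-google
Disallow:
"""

parser = RobotFileParser()
parser.parse(EXCERPT.splitlines())

base = "https://www.theguardian.com"

# A named group replaces the '*' group for that agent entirely.
gptbot_allowed = parser.can_fetch("GPTBot", base + "/world")

# An agent with no group of its own falls back to the '*' rules.
generic_allowed = parser.can_fetch("ExampleBot", base + "/world")
generic_search = parser.can_fetch("ExampleBot", base + "/search?q=x")

# An empty Disallow value means nothing is disallowed for that agent.
adsense_search = parser.can_fetch("Mediapartners-Google", base + "/search")

print(gptbot_allowed, generic_allowed, generic_search, adsense_search)
```

This is why `mediapartners-google`, whose only rule is an empty Disallow, is not bound by the long list of paths in the `*` group.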

Other Records

Field Value
sitemap http://www.theguardian.com/sitemaps/news.xml
sitemap http://www.theguardian.com/sitemaps/video.xml
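
Sitemap records are independent of any user-agent group. As a small sketch, assuming Python 3.8 or later, `RobotFileParser.site_maps()` returns them as a list. The surrounding file layout in `RECORDS` is an assumption; only the two sitemap URLs come from the table above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout; the two Sitemap URLs are the ones listed above.
RECORDS = """\
User-agent: *
Disallow: /search

Sitemap: http://www.theguardian.com/sitemaps/news.xml
Sitemap: http://www.theguardian.com/sitemaps/video.xml
"""

parser = RobotFileParser()
parser.parse(RECORDS.splitlines())

# site_maps() returns None when no Sitemap lines are present.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```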

Comments

  • This is the robots.txt file for theguardian.com
  • Guardian content is made available under our terms and conditions of use.
  • Any other uses are not permitted, incl. but not limited to: for large language models (LLMs), machine learning and/or artificial intelligence-related purposes; with any of the aforementioned technologies; and/or for any commercial purposes.
  • Contact licensing@theguardian.com for assistance