halsoliv.expressen.se
robots.txt

Robots Exclusion Standard data for halsoliv.expressen.se

Resource Scan

Scan Details

Site Domain halsoliv.expressen.se
Base Domain expressen.se
Scan Status Ok
Last Scan 2025-08-04T10:15:28+00:00
Next Scan 2025-08-11T10:15:28+00:00

Last Scan

Scanned 2025-08-04T10:15:28+00:00
URL https://halsoliv.expressen.se/robots.txt
Domain IPs 146.75.117.91, 2a04:4e42:8d::347
Response IP 151.101.37.91
Found Yes
Hash 1a8814c8c51ee1d8fe43c2e4c8f39c14c104e99780690ec8310afcc3712093a9
SimHash e81c117ced75
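
The Hash field above is 64 hex digits, which is consistent with a SHA-256 digest of the fetched file; the SimHash field, by contrast, is a locality-sensitive fingerprint used for near-duplicate detection between scans. A minimal Python sketch of reproducing the content hash, assuming the digest is taken over the raw response body:

```python
import hashlib
import urllib.request

# Fetch the same robots.txt the scan recorded and hash the raw bytes.
# If the file has not changed since 2025-08-04, the digest should
# match the Hash field above.
URL = "https://halsoliv.expressen.se/robots.txt"
with urllib.request.urlopen(URL) as resp:
    body = resp.read()

print(hashlib.sha256(body).hexdigest())
```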

Groups

User agent                     Rule      Path
ccbot                          Disallow  /
chatgpt-user                   Disallow  /
gptbot                         Disallow  /
google-extended                Disallow  /
google-cloudvertexbot          Disallow  /
omgilibot                      Disallow  /
omgili                         Disallow  /
facebookbot                    Disallow  /
claudebot                      Disallow  /
diffbot                        Disallow  /
duckassistbot                  Disallow  /
perplexitybot                  Disallow  /
cohere-ai                      Disallow  /
cohere-training-data-crawler   Disallow  /
meta-externalagent             Disallow  /
meta-externalfetcher           Disallow  /
timpibot                       Disallow  /
webzio-extended                Disallow  /
youbot                         Disallow  /
amazonbot                      Disallow  /
bytespider                     Disallow  /
anthropic-ai                   Disallow  /
oai-searchbot                  Disallow  /
velenpublicwebcrawler          Disallow  /
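
Every group above disallows the site root, so each named agent is barred from the entire site, while agents not listed are unaffected (the file has no catch-all * group). How a compliant crawler should interpret these rules can be checked with Python's standard urllib.robotparser; a short sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://halsoliv.expressen.se/robots.txt")
rp.read()

# Agent names are matched case-insensitively, so GPTBot hits the
# "gptbot" group and is blocked site-wide by its Disallow: / rule.
print(rp.can_fetch("GPTBot", "https://halsoliv.expressen.se/"))       # False

# An agent with no matching group and no "*" fallback is allowed.
print(rp.can_fetch("UnlistedBot", "https://halsoliv.expressen.se/"))  # True
```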

Other Records

Field     Value
sitemap   https://halsoliv.expressen.se/sitemap.xml
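
To see what the advertised sitemap covers, it can be fetched and parsed with the standard library. A sketch assuming the file follows the standard sitemaps.org schema; if it is a sitemap index, the printed URLs are child sitemaps rather than pages:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol used by sitemap.xml files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen("https://halsoliv.expressen.se/sitemap.xml") as resp:
    root = ET.fromstring(resp.read())

# <loc> appears under <url> in a plain sitemap and under <sitemap>
# in a sitemap index; the descendant search matches both.
for loc in root.iterfind(".//sm:loc", NS):
    print(loc.text)
```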

Comments

  • ccbot: Common Crawl's robot; the resulting dataset is a primary training corpus for many LLMs.
  • chatgpt-user: ChatGPT robot, used to improve the ChatGPT LLM.
  • gptbot: ChatGPT robot; may be used to improve the ChatGPT LLM.
  • google-extended: Robot used to improve the Bard and Vertex AI LLMs.
  • google-cloudvertexbot: Associated with Google Vertex AI agents.
  • omgilibot: webz.io robot; the resulting dataset can be, and is, purchased to train LLMs.
  • omgili: webz.io robot; the resulting dataset can be, and is, purchased to train LLMs.
  • facebookbot: Crawls public web pages to improve LLMs for Facebook's speech recognition technology.
  • claudebot: Another Anthropic agent, more specifically related to Claude.
  • diffbot: Crawls the web in order for others to train their LLMs.
  • duckassistbot: Uses scraped data on the fly to create answers for DuckAssist.
  • perplexitybot: Used by perplexity.ai; generates text based on scraped material.
  • cohere-ai: Cohere's chatbot.
  • cohere-training-data-crawler: Cohere's crawler for gathering LLM training data.
  • meta-externalagent: Used for use cases such as training AI models or improving products by indexing content directly.
  • meta-externalfetcher: Performs user-initiated fetches of individual links in support of some AI tools.
  • timpibot: Used by Timpi to scrape data for training their LLMs.
  • webzio-extended: Used by Webz.io to indicate that your site should not be included in datasets sold to those using them to train AI models.
  • youbot: Crawler behind You.com's AI search and browser assistant, indexing content for real-time answers.
  • amazonbot: Used to train Amazon services such as Alexa.
  • bytespider: ByteDance's bot; may not respect robots.txt.
  • anthropic-ai: Robot used to improve Anthropic's LLMs.
  • oai-searchbot: OpenAI's search bot.
  • velenpublicwebcrawler: Velen.io/Hunter.io claim to "build business datasets and machine learning models to better understand the web", though the crawler seems to focus on collecting email addresses for spam.