howtosumo.com
robots.txt

Robots Exclusion Standard data for howtosumo.com

Resource Scan

Scan Details

Site Domain howtosumo.com
Base Domain howtosumo.com
Scan Status Failed
Failure Stage Fetching resource.
Failure Reason Couldn't connect to server.
Last Scan 2024-08-28T17:17:10+00:00
Next Scan 2024-11-26T17:17:10+00:00

Last Successful Scan

Scanned 2024-02-01T16:25:32+00:00
URL https://howtosumo.com/robots.txt
Domain IPs 104.21.94.105, 172.67.222.92, 2606:4700:3031::ac43:de5c, 2606:4700:3032::6815:5e69
Response IP 104.21.94.105
Found Yes
Hash 5b0e0b30f9018e6677ffa85bf09284a25ff2fc1a8bd8822ec83c2c738f2b9f15
SimHash 6c5041c9c5f7
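
The Hash above is a 64-character hexadecimal string, which is consistent with a SHA-256 digest of the robots.txt body. Below is a minimal Python sketch of recomputing such a fingerprint and comparing it against the recorded value; treating the digest as SHA-256 over the raw response bytes is our assumption, since the report does not state exactly what is hashed.

import hashlib
import urllib.request

# Value recorded by the last successful scan above. Interpreting it as a
# SHA-256 digest of the raw response body is an assumption.
RECORDED_HASH = "5b0e0b30f9018e6677ffa85bf09284a25ff2fc1a8bd8822ec83c2c738f2b9f15"

with urllib.request.urlopen("https://howtosumo.com/robots.txt") as resp:
    body = resp.read()

digest = hashlib.sha256(body).hexdigest()
print("fetched digest:", digest)
print("matches recorded value:", digest == RECORDED_HASH)

Note that the file may have changed since 2024-02-01, and the most recent scan could not connect to the server at all, so the digests may no longer match.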

Groups

anthropic-ai

Rule Path
Disallow /

archive.org

Rule Path
Disallow /api.php
Disallow /index.php
Disallow /Special%3A

ccbot

Rule Path
Disallow /

doc

Rule Path
Disallow /

download ninja

Rule Path
Disallow /

fetch

Rule Path
Disallow /

gptbot

Rule Path
Disallow /

hmse_robot

Rule Path
Disallow /

httrack

Rule Path
Disallow /

k2spider

Rule Path
Disallow /

larbin

Rule Path
Disallow /

libwww

Rule Path
Disallow /

linko

Rule Path
Disallow /

microsoft.url.control

Rule Path
Disallow /

msiecrawler

Rule Path
Disallow /

npbot

Rule Path
Disallow /

offline explorer

Rule Path
Disallow /

sitecheck.internetseer.com

Rule Path
Disallow /

sitesnagger

Rule Path
Disallow /

teleport

Rule Path
Disallow /

teleportpro

Rule Path
Disallow /

ubicrawler

Rule Path
Disallow /

webcopier

Rule Path
Disallow /

webreaper

Rule Path
Disallow /

webstripper

Rule Path
Disallow /

webzip

Rule Path
Disallow /

wget

Rule Path
Disallow /

xenu

Rule Path
Disallow /

zao

Rule Path
Disallow /

zealbot

Rule Path
Disallow /

zyborg

Rule Path
Disallow /

adsbot-google

Rule Path
Allow /

mediapartners-google

Rule Path
Allow /

googlebot

Rule Path
Allow /Special%3ANewPages
Allow /Special%3ASitemap
Allow /Special%3ACategoryListing
Allow /

*

Rule Path
Allow /Special%3ABlock
Allow /Special%3ABlockList
Allow /Special%3ACategorylisting
Allow /Special%3ACategoryListing
Allow /Special%3ACharity
Allow /Special%3AEmailUser
Allow /Special%3ALSearch
Allow /Special%3ANewPages
Allow /Special%3AQABox
Allow /Special%3ASearchAd
Allow /Special%3ASitemap
Allow /Special%3AThankAuthors
Allow /Special%3AUserLogin
Allow /index.php?*action=credits
Allow /index.php?*MathShowImage
Allow /index.php?*printable
Disallow /index.php
Disallow /*feed%3Drss
Disallow /*action%3Ddelete
Disallow /*action%3Dhistory
Disallow /Special%3A
Disallow /*platform%3D
Disallow /*variant%3D
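
Taken together, the groups above shut out AI-training and site-copying crawlers entirely, give Google's bots blanket access, and let everyone else read articles while keeping them off dynamically generated pages. A minimal sketch of how a few of these rules behave, using Python's standard urllib.robotparser on a small subset of the groups reassembled as robots.txt text (this parser matches paths by literal prefix and does not expand the * wildcards used inside some of the catch-all group's paths):

import urllib.robotparser

# A few of the groups listed above, rebuilt as robots.txt directives.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /index.php
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# AI crawlers are barred from the whole site.
print(parser.can_fetch("GPTBot", "https://howtosumo.com/Main-Page"))     # False
# Other bots fall through to the catch-all group: the dynamic entry point
# is disallowed, while unmatched paths are allowed by default.
print(parser.can_fetch("MyBot/1.0", "https://howtosumo.com/index.php"))  # False
print(parser.can_fetch("MyBot/1.0", "https://howtosumo.com/Main-Page"))  # True

The agent name "MyBot/1.0" and the path "/Main-Page" are placeholders; any agent not named in a group, and any path not matched by a rule, behave the same way.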

Comments

  • robots.txt for https://www.wikihow.com
  • based on wikipedia.org's robots.txt
  • Crawlers that are kind enough to obey, but which we'd rather not have unless they're feeding search engines.
  • Sitemap: https://www.wikihow.com/sitemap_index.xml
  • If your bot supports such a thing using the 'Crawl-delay' or another instruction, please let us know. We can add it to our robots.txt.
  • Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please. Article pages contain our site's real content.
  • Requests many pages per second
  • http://www.nameprotect.com/botinfo.html
  • Some bots are known to be trouble, particularly those designed to copy entire sites. Please obey robots.txt.
  • wget in recursive mode uses too many resources for us. Please read the man page and use it properly; there is a --wait option you can use to set the delay between hits, for instance. Please wait 3 seconds between each request.
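
The last comment asks recursive downloaders such as wget to leave 3 seconds between hits. A minimal Python sketch of the same courtesy in a script-based crawler, combining the live robots.txt rules with a fixed delay (the bot name and article paths below are placeholders):

import time
import urllib.request
import urllib.robotparser

BASE = "https://howtosumo.com"

# Fetch and parse the live robots.txt so the groups above are honoured.
robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

# Placeholder paths; a real crawl would take article URLs from the sitemap.
paths = ["/Main-Page", "/Special:Sitemap"]

for path in paths:
    url = BASE + path
    if not robots.can_fetch("FriendlyLowSpeedBot/1.0", url):
        continue  # skip anything the rules disallow for unnamed bots
    with urllib.request.urlopen(url) as resp:
        page = resp.read()
    # Per the site's request: 3 seconds between requests.
    time.sleep(3)

For wget itself, the --wait option mentioned in the comment covers the same ground (for example, --wait=3 in recursive mode).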