Web Dev/Info Sec Guys -> info regarding your robots.txt and indexing behavior (very short and very important)
I have recently been experimenting with how Google, Bing, Yandex, CensysBot, UptimeBot, etc. handle various inconsistencies in robots.txt.
This will take you 5 minutes to read, and hopefully you will come away knowing how serious a missing robots.txt can be, and how exposed you and your org become when it is not done right:
Take these made-up (not real-world) examples:
---------- (File needs to be world readable)
User-Agent: *
------- This is improper syntax and it doesn't actually tell crawlers to disallow anything
I have also seen a lot of:
User-Agent: *
Disallow: /
Allow: *.php
Allow: /gfx
---- etc
------- This is improper syntax, and most crawlers will not index the site
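A quick way to sanity check both snippets above is Python's stdlib urllib.robotparser. This is only a sketch of how one parser reads them: it applies rules in file order (first match wins) and does not expand wildcards like *.php, so real crawlers may behave differently. The mysite.com URLs and the load helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def load(robots_txt):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Example 1: a User-Agent line with no rules at all.
bare = load("User-Agent: *")

# Example 2: Disallow listed before the Allow lines.
disallow_first = load("""User-Agent: *
Disallow: /
Allow: *.php
Allow: /gfx""")

# Hypothetical URLs, used only for illustration.
bare_allows_admin = bare.can_fetch("*", "https://mysite.com/admin/")
gfx_reachable = disallow_first.can_fetch("*", "https://mysite.com/gfx/logo.png")

# Example 1 disallows nothing, so this parser lets everything through.
print("bare file allows /admin/:", bare_allows_admin)
# In example 2 this parser hits "Disallow: /" first and stops there.
print("disallow-first allows /gfx:", gfx_reachable)
```

With this parser the first file blocks nothing and the second blocks everything, including /gfx.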
Here is the PoC:
Test site for PoC: opensea.io
# I kick off a curl (silent, follow redirects, set UA)
curl -sL --user-agent 'moz' https://opensea.io/robots.txt;echo
User-agent: *
Disallow: /cdn-cgi/
Sitemap: https://opensea.io/sitemap.xml
BTW --- This site is *massive* and they aren't covering much in the sitemap file
How to validate the behavior on Google when a large site has a robots.txt like this: Incognito Google query:
Note: Whenever you're Google Dorking, do it incognito, or your results will be skewed by your past browsing history.
So case in point, look at the number of results: 2.62 million.
This isn't surprising at all, but it is interesting when you're looking at a site of this size. Keep in mind these things about Google results:
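If you would rather script the PoC than eyeball curl output, here is a minimal Python sketch. fetch_robots and summarize are hypothetical helper names, and the sample text is just the output shown above pasted in, so nothing hits the network unless you call fetch_robots yourself.

```python
from urllib.request import Request, urlopen

def fetch_robots(url, user_agent="moz"):
    """Fetch a robots.txt body; urlopen follows redirects, like curl -L."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def summarize(text):
    """Group directive values by lowercased directive name."""
    summary = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        summary.setdefault(key.strip().lower(), []).append(value.strip())
    return summary

# The robots.txt body from the curl output above.
sample = """User-agent: *
Disallow: /cdn-cgi/
Sitemap: https://opensea.io/sitemap.xml"""

summary = summarize(sample)
print("disallowed paths:", summary.get("disallow", []))
print("sitemaps:", summary.get("sitemap", []))
```

To run it live, something like summarize(fetch_robots("https://opensea.io/robots.txt")) should work, assuming the site still answers that user agent.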
* Max of 1000 results (100 x 10 pages), no matter what your query is
* It will try to load the next page(s) if you keep hitting the End key
* If you append &num=100 to the end of the URL, the list will start with 100 results.
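A small sketch for building that kind of query URL in Python. The site:opensea.io query is my assumption (the query itself is not spelled out above), google_query_url is a made-up helper name, and there is no guarantee Google keeps honoring the num parameter.

```python
from urllib.parse import urlencode

def google_query_url(query, num=100):
    """Build a Google search URL with the num parameter appended."""
    return "https://www.google.com/search?" + urlencode({"q": query, "num": num})

# Assumed query: a site: dork against the PoC domain.
url = google_query_url("site:opensea.io")
print(url)
```

Paste the printed URL into an incognito window rather than fetching it from a script.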
Looking at how Yahoo.com handles it:
Proper robots.txt file:
User-Agent: *
Crawl-Delay: 2
Allow: /gfx
Allow: *.php
Allow: /guestbook
Disallow: /
Sitemap: hxxps://mysite.com/sitemap.xml
Nothing less or more is needed. Think of it like a firewall out of the box: if you ever got an enterprise-grade FW like a Cisco ASA, it comes with one rule, DENY ANY / ANY. Think along those lines ;p
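To see the default-deny idea in action, the file above can be fed to Python's stdlib urllib.robotparser. Treat this as a sketch of one parser's view, not proof of how any given search engine behaves: this parser reads rules top to bottom and ignores the *.php wildcard, and the mysite.com paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The "proper" file from above, copied as-is (including the defanged sitemap URL).
PROPER = """User-Agent: *
Crawl-Delay: 2
Allow: /gfx
Allow: *.php
Allow: /guestbook
Disallow: /
Sitemap: hxxps://mysite.com/sitemap.xml"""

parser = RobotFileParser()
parser.parse(PROPER.splitlines())

# Hypothetical paths: two that are allow-listed and one that is not.
checks = {path: parser.can_fetch("*", "https://mysite.com" + path)
          for path in ("/gfx/logo.png", "/guestbook", "/admin/")}
delay = parser.crawl_delay("*")

for path, allowed in checks.items():
    print(path, "->", "allowed" if allowed else "blocked")
print("crawl delay:", delay)
```

The allow-listed paths get through and everything else falls into the final Disallow: /, same as the one-rule firewall.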
dj substance
merry xmas 2o23