Web Dev/Info Sec Guys -> info regarding your robots.txt and indexing behavior (very short and very important)

Dec 24, 2023

I have recently been experimenting with how Google, Bing, Yandex, CensysBot, UptimeBot, etc. handle various inconsistencies in robots.txt.
This will take you 5 minutes to read, and hopefully you will come away knowing how serious a missing or misconfigured robots.txt can be, and how exposed you and your org become when it is not done right:

Take these made-up (not real-world) examples:

---------- (File needs to be world readable)
User-Agent: *

------- This is incomplete: with no Disallow line under it, it doesn't actually tell crawlers to stay out of anything
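
To see why that first example does nothing, here is a minimal sketch using Python's standard-library urllib.robotparser. The path /admin/ and the user agent name examplebot are made up for illustration. Real crawlers may differ in the details, but a group with no rules under it leaves everything allowed.

```python
from urllib.robotparser import RobotFileParser

# The first example above: a user-agent line with no rules under it.
robots_txt = "User-Agent: *\n"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "/admin/" and "examplebot" are hypothetical, picked only for the demo.
admin_allowed = parser.can_fetch("examplebot", "/admin/")
print(admin_allowed)  # True: nothing was disallowed
```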

I have also seen a lot of:
User-Agent: *
Disallow: /
Allow: *.php
Allow: /gfx
---- etc

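------- This is fragile: a crawler that stops at the first matching rule hits Disallow: / and will not index anything, and Allow: *.php is missing the leading / that a path is expected to start with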
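
To show why rule order is a trap in that second example, below is a small sketch of the two matching strategies a crawler might use. The decide function is hypothetical (it is not from any library), it only handles plain prefixes with no wildcards, and the longest-match branch follows my reading of RFC 9309. Treat it as an illustration, not a reference parser.

```python
def decide(rules, path, longest_match=True):
    """Return True if `path` may be crawled under `rules`.

    rules: list of (directive, prefix) pairs in file order,
           e.g. ("disallow", "/"). Plain prefix matching only.
    """
    matches = [(kind, prefix) for kind, prefix in rules
               if path.startswith(prefix)]
    if not matches:
        return True  # nothing matched: crawling is allowed by default
    if not longest_match:
        return matches[0][0] == "allow"  # first matching line wins
    # Longest prefix wins; on a tie in length, prefer allow.
    best = max(matches, key=lambda m: (len(m[1]), m[0] == "allow"))
    return best[0] == "allow"


# The order from the example: Disallow first, Allow after it.
rules = [("disallow", "/"), ("allow", "/gfx")]

print(decide(rules, "/gfx/logo.png"))                       # True
print(decide(rules, "/gfx/logo.png", longest_match=False))  # False
print(decide(rules, "/admin/"))                             # False
```

Same file, two different answers for /gfx depending on the crawler. Putting the Allow lines ahead of Disallow: / gives the same result under both strategies.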

Here is the PoC:

Test site for PoC: opensea.io

# I kick off a curl (silent, follow redirects, set UA)

curl -sL --user-agent 'moz' https://opensea.io/robots.txt;echo

User-agent: *
Disallow: /cdn-cgi/
Sitemap: https://opensea.io/sitemap.xml

BTW, this site is *massive*, and they aren't covering much in the sitemap file.
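
For a quick offline look at what that file actually blocks, here is a rough sketch that feeds the curl output above into Python's urllib.robotparser. The two probe URLs are hypothetical paths I picked for the demo, not real pages I checked on the site.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt body exactly as curl returned it above.
robots_txt = """\
User-agent: *
Disallow: /cdn-cgi/
Sitemap: https://opensea.io/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Both paths below are hypothetical, chosen only to probe the rules.
cdn_ok = parser.can_fetch("moz", "https://opensea.io/cdn-cgi/example")
other_ok = parser.can_fetch("moz", "https://opensea.io/some/other/page")
sitemaps = parser.site_maps()  # needs Python 3.8+

print(cdn_ok)    # False: the only prefix that is blocked
print(other_ok)  # True: everything else is open to crawlers
print(sitemaps)  # ['https://opensea.io/sitemap.xml']
```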

Let's examine how Google indexes and displays a site with this robots.txt configuration.

To validate the behavior on Google when a large site has a robots.txt like this, run the query in an incognito window:

Note: whenever you're Google dorking, do it incognito, or your results will be skewed by your past browsing history.

Case in point: look at the number of results, 2.62 million:

By not paying much attention to the robots.txt config in this case, they may be exposing sensitive data.

This isn't surprising at all, but it is interesting when you're looking at a site of this size. Keep in mind these things about Google results:
* A max of 1,000 results (100 x 10 pages), no matter your query
* It will try to load the next page(s) if you keep hitting the End key
* If you append &num=100 to the end of the URL, it will list 100 results per page

Looking at how Yahoo.com handles it:

Yahoo.com is very similar in terms of the "dork" used and the massive number of results.
It's not the millions of results Google returned, but 360,000 files is still a large attack surface.
In summary: the outcome of not controlling what is indexed can be catastrophic.
Proper robots.txt file:

User-Agent: *
Crawl-Delay: 2
Allow: /gfx
Allow: /*.php
Allow: /guestbook
Disallow: /

Sitemap: hxxps://mysite.com/sitemap.xml

Nothing more or less is needed. Think of it like a firewall out of the box: if you ever got an enterprise-grade FW like a Cisco ASA, it comes with one rule, DENY ANY/ANY. Think along those lines ;p
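
If you want to sanity check a default-deny file like this before you deploy it, a minimal sketch with Python's urllib.robotparser could look like the one below. The probe paths and the examplebot user agent are hypothetical. I also left the .php wildcard line out of the copy, because as far as I know the stdlib parser does not expand wildcards the way the big search engines do.

```python
from urllib.robotparser import RobotFileParser

# Default-deny file from above, minus the wildcard line (assumption:
# the stdlib parser only does plain prefix matching).
robots_txt = """\
User-Agent: *
Crawl-Delay: 2
Allow: /gfx
Allow: /guestbook
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical paths: two that should be open, one that should not.
checks = {path: parser.can_fetch("examplebot", path)
          for path in ("/gfx/logo.png", "/guestbook", "/admin/")}
delay = parser.crawl_delay("examplebot")  # needs Python 3.6+

print(checks)  # only the /gfx and /guestbook paths come back True
print(delay)   # 2
```

Anything you did not explicitly allow comes back False, which is the deny-by-default behavior you want.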

dj substance

merry xmas 2o23




Twenty years professionally as a Network Engineer; more recently I have focused mostly on red teaming, but I am always up for learning and exchanging info.