The ramifications of not properly tweaking your robots.txt for your www

DJ SUBSTANCE
4 min read · Sep 27, 2023

DJ Substance / Elite Trance Music
DJ Substance explains why robots.txt is so critical for your site's security

Let's first define what a robots.txt file does when it sits in the root path of your website:

DJ Substance Citing Google for robots.txt
A robots.txt file is absolutely essential to have, and it needs the right permissions.

In simple terms, your web directory structure should look like this (this is how I do it):

~ = web root of tranceattic.com (my site, used as the example)
- index.php: Getting the proper chmod / permissions on your web content is critical, so keep this in mind. You want to see 3 “r” (read) bits in the listing, one each for owner, group, and everyone else.

DJ Substance / Citing Google Chmod

Proper: -rw-r--r-- 1 substance substance 30029 Sep 27 00:00 index.php
Proper (for directories) chmod 0755 <dir name> (like: drwxr-xr-x)

bash$ ls -altr | grep drw # Best way to view your dir permissions
drwxr-xr-x 2 4096 Sep 25 18:31 error
drwxr-xr-x 188 4096 Sep 26 07:29 guestbook
drwxr-xr-x 2 4096 Sep 26 08:10 css
drwxr-xr-x 2 4096 Sep 26 08:11 js
drwxr-xr-x 2 4096 Sep 26 08:29 gf
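
One way to apply those permissions across a whole web root is sketched below. Treat it as a minimal sketch rather than a hardening recipe: WEBROOT is a hypothetical variable, and a throwaway temp directory with made-up file names stands in for a real site so nothing live gets touched.

```shell
#!/usr/bin/env bash
# Sketch: normalise permissions under a web root.
# WEBROOT is hypothetical; a temp dir with placeholder files stands in for it.
WEBROOT=$(mktemp -d)
mkdir -p "$WEBROOT/css" "$WEBROOT/js"
touch "$WEBROOT/index.php" "$WEBROOT/robots.txt" "$WEBROOT/css/site.css"

# Directories: rwxr-xr-x (0755). Regular files: rw-r--r-- (0644).
find "$WEBROOT" -type d -exec chmod 0755 {} +
find "$WEBROOT" -type f -exec chmod 0644 {} +

# Show only the mode string of each entry for a quick visual check.
ls -ld "$WEBROOT/index.php" "$WEBROOT/css" | cut -c1-10
```

On a real server you would point WEBROOT at your own document root instead of creating a temp directory.
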
           _           _        _        _   
_ __ ___ | |__ ___ | |_ ___ | |___ _| |_
| '__/ _ \| '_ \ / _ \| __/ __|| __\ \/ / __|
| | | (_) | |_) | (_) | |_\__ \| |_ > <| |_
|_| \___/|_.__/ \___/ \__|___(_)__/_/\_\\__|

If you hit a site, always check its robots.txt. If the robots.txt is missing,
or the request just redirects back to the homepage, this is good news for the
hacker enthusiast. It means the owner misconfigured or never configured which
paths are off limits to search engine indexing. If you go to Google and type:

site:<target w/o robots.txt>.com

You should get many, many results. If the site is huge, learn your Google dorks
to find some really good stuff. That's not what this is about, though.
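
The robots.txt check described above can be scripted. This is a minimal sketch under a few assumptions: robots_probe and robots_classify are hypothetical helper names, the domain in the usage comment is a placeholder, and "redirected" only means the final URL no longer ends in /robots.txt.

```shell
#!/usr/bin/env bash
# Sketch: report how a site answers a request for /robots.txt.
# robots_probe and robots_classify are hypothetical helper names.

# Prints "<http status> <final URL after redirects>" for BASE_URL/robots.txt.
robots_probe() {
  curl -s -o /dev/null -L -w '%{http_code} %{url_effective}\n' "$1/robots.txt"
}

# Turns that status and final URL into one of: present, redirected, missing.
robots_classify() {
  case $1 in
    200)
      case $2 in
        */robots.txt) echo present ;;
        *)            echo redirected ;;
      esac ;;
    *) echo missing ;;
  esac
}

# Usage (needs network access; the domain is a placeholder):
#   robots_classify $(robots_probe https://example.com)
```

Keeping the network call and the interpretation in separate functions makes the second one easy to check by hand with made-up status codes.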

As far as I can tell, when I see a robots.txt configured like this, it might as
well not be there:

User-Agent: * <- matches all search crawlers and lets them index everything
Disallow: <- with no / after it, nothing is blocked. The site is wide open
https://extractorsolutions.com/robots.txt
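
A quick way to spot this kind of file is to look for any Disallow line that actually carries a path. The sketch below assumes the file body arrives on stdin; is_wide_open is a hypothetical helper name, and it deliberately ignores per-crawler groups to stay short.

```shell
#!/usr/bin/env bash
# Sketch: flag a robots.txt that blocks nothing.
# is_wide_open is a hypothetical helper. It reads the file body on stdin and
# succeeds when no Disallow line carries a path.
is_wide_open() {
  ! grep -Eiq '^[[:space:]]*disallow:[[:space:]]*[^[:space:]#]'
}

printf 'User-Agent: *\nDisallow:\n' | is_wide_open && echo "wide open"
printf 'User-Agent: *\nDisallow: /\n' | is_wide_open || echo "has restrictions"
```

An empty or comment-only file also counts as wide open under this check, which matches the missing-file case described earlier.
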
Let's take a look at https://walmart.com/robots.txt:
# These are comments anything starting with #
#Disallow select URLs
User-agent: *
Disallow: /0/
Disallow: /55875582/walmart-us/catalog/
# The line above instructs all search crawlers/indexers to stay OFF that path
# so it is not listed in search results

Disallow: /account/
Disallow: /api/
Disallow: /collection/api/logger
# The above line is the kind of line that the infosec guy is looking for

Disallow: /cp/-201
Disallow: /cp/-302
Disallow: /cp/-306
Disallow: /cp/-309
Disallow: /cp/-506
Disallow: /cp/-509
Disallow: /cp/api/logger
Disallow: /cp/api/wpa
Disallow: /cservice/
Disallow: /cservice/ya_index.gsp
Disallow: /electrode/api/logger
Disallow: /email_collect/
Disallow: /giftregistry/
Disallow: /msp
Disallow: /nonConfig/api/wpa
Disallow: /popup_security.jsp
Disallow: /product/electrode/api/logger
Disallow: /product/electrode/api/wpa
Allow: /reviews/product/
Allow: /reviews/seller/
Disallow: /search
Disallow: /search/api/wpa
Disallow: */api/wpa
# As far as I know, the above line keeps crawlers off any path that contains
# /api/wpa, because the leading * matches whatever comes before it

Disallow: */midas/*
Disallow: */wrd.walmart.com/*
Disallow: wrd.walmart.com/
Disallow: /search/search-ng.do
Disallow: /solutions/
Disallow: /store/ajax/detail-navigation
Disallow: /store/ajax/local-store
Disallow: /store/ajax/preferred
Disallow: /store/ajax/search
Disallow: /store/ajax/visited-store
Disallow: /store/category/
Disallow: /store/electrode/api/fetch-coupons
Disallow: /store/electrode/api/logger
Disallow: /store/electrode/api/p13n
Disallow: /store/electrode/api/search
Disallow: /store/electrode/api/stores
Allow: /store/finder
# Notice the random Allow in the middle of the Disallows. Not recommended.
Disallow: /store/popular_in_grade/
Disallow: /storeLocator/
Disallow: /storeLocator/ca_storefinder_results.do
Disallow: /tealeaf
Disallow: /topic/electrode/api/logger
Disallow: /topic/electrode/api/wpa
Disallow: /typeahead/
Disallow: /wmflows/
Disallow: */undefined/*
Disallow: /store/*/search
Disallow: /c/kp/*/page/*
Disallow: /feeds/*
Disallow: /orders

#Crawler specific settings
User-agent: Adsbot-Google

User-agent: Mediapartners-Google

# slow down Yahoo
User-agent: Slurp
Crawl-delay: 5
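
To pull out the kind of lines flagged in the listing above, the Disallow paths can be extracted and filtered. This is a minimal sketch: list_disallowed is a hypothetical helper, SAMPLE holds a few lines copied from the listing so the example runs offline, and the api/logger filter is only one guess at what counts as interesting.

```shell
#!/usr/bin/env bash
# Sketch: pull the Disallow paths out of a robots.txt and surface the ones
# that hint at APIs or loggers. list_disallowed is a hypothetical helper.
list_disallowed() {
  awk 'tolower($1) == "disallow:" && $2 != "" && $2 !~ /^#/ { print $2 }'
}

# A few lines copied from the listing above stand in for the downloaded file.
SAMPLE='User-agent: *
Disallow: /account/
Disallow: /api/
Disallow: /collection/api/logger
Allow: /reviews/product/
Disallow: /search'

printf '%s\n' "$SAMPLE" | list_disallowed | grep -E '/api/|logger'
```

Against a live site, the same function can be fed from curl -s piped into list_disallowed instead of SAMPLE.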

Depending on the size of the site, my recommendation is to make your robots.txt like this. It works like a factory-default firewall policy: implicit deny.

User-Agent: *
Allow: *.php
Allow: /webdemos/
Allow: /guestbook/
Disallow: /index.php?id=*
Disallow: /

# If anyone has a reason not to do this, please comment.
# Also, my website makes calls back to index.php with GET params.
# I don't think those pages get indexed, so this works for me.
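
To see how a crawler might read those rules, here is a minimal sketch of longest-match evaluation in the style of RFC 9309, where the longest matching pattern wins and Allow wins a tie. RULES is a hypothetical "verb pattern" copy of the file above, robots_verdict is a made-up helper name, and real crawlers may differ in the details.

```shell
#!/usr/bin/env bash
# Sketch: longest-match evaluation of one User-agent group, in the style of
# RFC 9309. RULES is a hypothetical "verb pattern" copy of the file above.
RULES='Allow *.php
Allow /webdemos/
Allow /guestbook/
Disallow /index.php?id=*
Disallow /'

# robots_verdict PATH prints "allow" or "disallow". The body runs in a
# subshell, written ( ... ), so its variables do not leak to the caller.
robots_verdict() (
  path=$1 best=-1 verdict=allow
  while read -r verb pattern; do
    [ -n "$pattern" ] || continue
    # Escape ? and [ so the shell treats them literally; * stays a wildcard.
    glob=$(printf '%s' "$pattern" | sed 's/[?[]/\\&/g')
    # A trailing $ anchors the end of the URL; otherwise rules match prefixes.
    case $glob in
      *'$') glob=${glob%?} ;;
      *)    glob="$glob*" ;;
    esac
    case $path in
      $glob)
        # Longest pattern wins; on a tie, Allow wins.
        if [ "${#pattern}" -gt "$best" ] ||
           { [ "${#pattern}" -eq "$best" ] && [ "$verb" = Allow ]; }; then
          best=${#pattern}
          if [ "$verb" = Allow ]; then verdict=allow; else verdict=disallow; fi
        fi ;;
    esac
  done <<EOF
$RULES
EOF
  echo "$verdict"
)

for p in /index.php '/index.php?id=3' /guestbook/entry /secret/notes.txt /; do
  printf '%s -> %s\n' "$p" "$(robots_verdict "$p")"
done
```

Under these assumptions, the bare homepage URL / matches only Disallow: / and comes out blocked. If the homepage should stay listed, an extra Allow: /$ line may be worth considering for crawlers that support the $ anchor.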

Conclusion:

If you find a site with no robots.txt, or a misconfigured file like the one I showed at the beginning, go to a search engine and try: site:<domain.com> You should find the entire website indexed. Nuff said.

substance

DJ SUBSTANCE

Twenty years professionally as a Network Engineer. More recently I have focused mostly on red teaming, but I am always up for learning and exchanging info.