An often overlooked recon step: analyze the target's sitemap.xml

DJ SUBSTANCE
3 min read · Jul 18, 2023

Sites still publish a sitemap.xml (they should anyway, for SEO) alongside robots.txt as a guide, so the crawlers/bots know which pages to crawl and which to skip. Sitemap.xml takes it a step further:

Example

https://www.jiffylube.com/sitemap.xml

DJ Substance displaying sitemap.xml on a random target
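
Not every site keeps its sitemap at /sitemap.xml. Under the sitemaps protocol, robots.txt can point to one or more sitemaps with "Sitemap:" lines, so it may be worth checking there too. Here is a minimal sketch of that lookup. The function name sitemaps_from_robots and the example.com URL are made up for illustration.

```shell
#!/bin/bash
# sitemaps_from_robots: read a robots.txt on stdin and print every URL
# declared with a "Sitemap:" line (the directive name is case-insensitive).
sitemaps_from_robots() {
    tr -d '\r' \
        | grep -i '^[[:space:]]*sitemap:' \
        | sed 's/^[^:]*:[[:space:]]*//'
}

# Example (hypothetical target):
#   curl -s https://example.com/robots.txt | sitemaps_from_robots
```

If this prints nothing, the site may still serve /sitemap.xml directly, as in the example above.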

To make the sitemap more readable, so you know your attack surface better, create the following script. But first, wget a sitemap.xml file:

wget https://jiffylube.com/sitemap.xml
# Verify the sitemap.xml is populated
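
One way to do that verification is to count the <loc> entries in the downloaded file. The sketch below also includes a helper that puts each tag on its own line, which should help when a sitemap is minified onto a single line and a line-based parser cannot read it. Both function names are my own, not part of any tool.

```shell
#!/bin/bash
# count_sitemap_urls FILE
# Print how many <loc> entries the file contains (0 means it is empty,
# an error page, or not a sitemap at all).
count_sitemap_urls() {
    grep -o '<loc>' "$1" | wc -l | tr -d ' '
}

# split_sitemap_tags FILE
# Insert a newline between adjacent tags ("><") so every element sits on
# its own line, then print the result to stdout.
split_sitemap_tags() {
    awk '{ gsub(/></, ">\n<"); print }' "$1"
}

# Example:
#   count_sitemap_urls sitemap.xml
#   split_sitemap_tags sitemap.xml > sitemap-lines.xml
```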

#!/bin/bash

# Call this smscrape.sh ; chmod +x smscrape.sh
# If you have issues, add set -x near the top of the script (turns on bash debugging)
# Post a comment if your sitemap isn't parsing and I'll help fix it


if [[ $# -eq 0 ]]; then
    echo "Please provide the sitemap.xml file as an argument."
    exit 1
fi

sitemap_file="$1"

# Check if the provided file exists
if [[ ! -f "$sitemap_file" ]]; then
    echo "File $sitemap_file does not exist."
    exit 1
fi

# Define colors
light_cyan='\033[1;36m'
cyan='\033[0;36m'
purple='\033[0;35m'
white='\033[1;37m'
gray='\033[0;37m'
blue='\033[0;34m'
pink='\033[0;35m'
hot_pink='\033[1;35m'
nc='\033[0m' # No color

# Use awk to extract path and information, format output, and highlight priority.
# Assumes one tag per line. Fields are reset at each <url> so that values from
# the previous entry do not carry over when a tag is missing.
awk -F'[<>]' -v light_cyan="$light_cyan" -v cyan="$cyan" -v purple="$purple" -v white="$white" -v gray="$gray" -v blue="$blue" -v pink="$pink" -v hot_pink="$hot_pink" -v nc="$nc" \
'/<url>/{loc=""; lastmod=""; changefreq=""; priority=""; color=pink}
/<loc>/{loc=$3; i++}
/<lastmod>/{lastmod=$3}
/<changefreq>/{changefreq=$3}
/<priority>/{priority=$3; color = (priority > 0.56) ? hot_pink : pink}
/<\/url>/{
    printf purple "[%d] " nc cyan "URL: " nc white "%s\n" cyan "%s: " nc gray "%s\n" cyan "%s: " nc purple "%s\n" cyan "%s: " color "%s" nc "\n\n",
        i, loc, "Last Modified", lastmod, "Change Frequency", changefreq, "Priority", priority}' "$sitemap_file"

# eof

Save the script above as smscrape.sh and make it executable with chmod +x smscrape.sh.

Most people don't realize how much info the sitemap can provide. Let's use the jiffylube.com example. To execute the script above:

bash$ ./smscrape.sh sitemap.xml        # only takes 1 arg

Let's take a look at the output. It should be colored in Bash; if not, try TERM="ansi", then type reset:

<snip>

<snip> — take a look at the size of this site
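
The colors are nice on screen, but the escape codes get in the way if you want to save the output or grep through it. A small filter like the one below should remove them. strip_colors is a name I made up, and it only handles the simple color sequences that the script emits.

```shell
#!/bin/bash
# strip_colors: read text on stdin and remove ANSI color sequences
# of the form ESC[...m, printing plain text to stdout.
strip_colors() {
    local esc
    esc=$(printf '\033')              # the literal ESC character
    sed "s/${esc}\[[0-9;]*m//g"
}

# Example:
#   ./smscrape.sh sitemap.xml | strip_colors > sitemap.txt
```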

The point of showing you this, and of recommending this approach for your recon toolbox, is not just the URLs. You should still be looking for dynamic links like &id=23&test=123, since you could run SQLMap against those. In this case, though, I find the Last Modified, Change Frequency, and Priority values interesting, and that is information that won't be found elsewhere.
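
To pull out just the dynamic links, something like the following sketch could work. It assumes each <loc>...</loc> pair sits on a single line, and it decodes &amp; back to & because sitemaps XML-escape the ampersand. The function name list_param_urls is hypothetical.

```shell
#!/bin/bash
# list_param_urls FILE
# Print the unique <loc> URLs that carry a query string (contain "?").
list_param_urls() {
    grep -o '<loc>[^<]*</loc>' "$1" \
        | sed -e 's/<loc>//' -e 's/<\/loc>//' -e 's/&amp;/\&/g' \
        | grep -F '?' \
        | sort -u
}

# Example:
#   list_param_urls sitemap.xml > dynamic-urls.txt
```

The resulting list is a starting point for manual review before you point any tool at it.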

Tip:

Modify the script to sort by Priority. As far as I know the range is 0 to 1, so .5 is valid, with 1 being the highest priority. Combined with the change frequency, this tells you which files to "keep an eye on". Once you find a potential attack target that is modified often, start making nightly copies of the file (use a proxy). Over a week, diff the copies against each other: diff <file1> <file2>, then <file2> <file3>, etc.
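
Here is a rough sketch of both ideas from the tip: a priority-sorted listing, and nightly snapshots that you can diff. All three function names and the SNAP_PROXY variable are my own inventions for illustration. Entries without a <priority> tag are given 0.5, which is the default in the sitemaps protocol.

```shell
#!/bin/bash
# sitemap_by_priority FILE
# Print "priority<TAB>changefreq<TAB>lastmod<TAB>url", highest priority first.
# Missing lastmod/changefreq are shown as "-"; missing priority becomes 0.5.
sitemap_by_priority() {
    awk '{ gsub(/></, ">\n<"); print }' "$1" \
    | awk -F'[<>]' '
        /<url>/        { loc = ""; lastmod = "-"; changefreq = "-"; priority = "0.5" }
        /<loc>/        { loc = $3 }
        /<lastmod>/    { lastmod = $3 }
        /<changefreq>/ { changefreq = $3 }
        /<priority>/   { priority = $3 }
        /<\/url>/      { printf "%s\t%s\t%s\t%s\n", priority, changefreq, lastmod, loc }' \
    | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k4,4
}

# snapshot_url URL DIR
# Save a timestamped copy of URL in DIR. If SNAP_PROXY (hypothetical
# variable) is set, the request goes through that proxy via curl -x.
snapshot_url() {
    mkdir -p "$2" || return 1
    curl -fsS ${SNAP_PROXY:+-x "$SNAP_PROXY"} \
        -o "$2/$(date +%Y%m%d-%H%M%S).snap" "$1"
}

# diff_latest DIR
# Show a unified diff between the two most recent snapshots in DIR.
diff_latest() {
    local previous newest
    previous=$(ls "$1"/*.snap 2>/dev/null | sort | tail -n 2 | head -n 1)
    newest=$(ls "$1"/*.snap 2>/dev/null | sort | tail -n 1)
    if [ -z "$newest" ] || [ "$previous" = "$newest" ]; then
        echo "Need at least two snapshots in $1" >&2
        return 2
    fi
    diff -u "$previous" "$newest"
}

# Example:
#   sitemap_by_priority sitemap.xml | head
#   snapshot_url https://example.com/some/page snapshots/
#   diff_latest snapshots/
```

Running snapshot_url from a nightly cron job and diff_latest the next morning would cover the "nightly copies" part of the tip. The timestamped file names sort in date order, which is what diff_latest relies on.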

If this doesn't jump out as a useful tool, then you should probably read up on the Mozilla HTML5 basics. Hopefully this helps y'all.

Any feedback is always appreciated.

Substance

https://djsubstance.org

DJ SUBSTANCE

Twenty years of professional experience as a network engineer. More recently I have focused mostly on red teaming, but I am always up for learning and exchanging info.