The criticality of getting robots.txt right, and the pitfalls of getting it wrong
According to our best friend ChatGPT, roughly 1.5 billion websites are up, but only a FRACTION are actually worth a damn or in active use (and "active" here includes shady sites, malware sites, all of that). Also, to clear up (no pun intended) one fact that very few people are aware of, these definitions are important to know:
Darknet vs Clearnet: if I asked you this in person, what would you tell me the difference is?
This is *verified*; look it up if you don't believe me:
Darknet: Encrypted networks requiring specific software for access, not indexed by search engines, often [NOT always] associated with anonymity and illicit activities.
Clearnet: The publicly accessible internet, indexed by search engines, without the need for special access tools.
A perfect example of something one would not think of as "dark" could be a company like Oracle's backend servers. We don't have access to them, and they aren't on search indexers, are they? They are "dark", get it? (Strictly speaking, content that is reachable but unindexed is usually called the deep web rather than the darknet, but the idea is the same.) That is my understanding; on to the topic at hand. All information is accurate as of this writing:
Why robots.txt?
The robots.txt file is a text file web admins create to instruct web bots/indexers (typically search-engine robots, but not only; many times it is sites like uptime.com and spyse.com and other enumeration-type crawlers) how and what to crawl, IF the file is respected, which is the bot's choice. Crawl = index, i.e. make a record of the following per page:
- Site URL (the full URI, so be careful you are not exposing session tokens)
- Site <title>: it should be under roughly 147px of rendered width (pixels, not characters); see the snippet below
- Site <meta description>: it should be under roughly 500 to 700px; look it up
- Other metadata like "rich snippets", ratings, threat flags, etc.
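For reference, the two tags in question are just ordinary elements in the page's <head>; the values here are made up:
<head>
  <!-- Shown as the clickable headline in search results -->
  <title>Example MSP | Managed IT Services</title>
  <!-- Shown as the summary text under the headline -->
  <meta name="description" content="Managed IT, backups and security for small businesses.">
</head>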
Because this image is awesome for this article, I am borrowing it from ahrefs.com:
In case you haven't really noticed, the first 2 or 3 results on Google, Bing, etc. usually say "sponsored". They typically redirect you through some BS like ad.doubleclick.net and someone gets paid for your clicks. I rarely find these useful, so IMO avoid clicking promoted/advertised links.
Now that we know what the search indexers (or robots) are doing on our site, one more thing if you're wondering: will they find my website if I don't advertise? The answer is hell yes. Expect it. If you don't want to show up on any type of search listing, you need to do more than just disallow in robots.txt; you should set up .htpasswd and make the site inaccessible without credentials, or move HTTP to a non-standard port (a quick sketch of the .htpasswd route follows).
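A minimal sketch on Apache, assuming mod_auth_basic is enabled and AllowOverride permits AuthConfig; the paths and username are placeholders:
# Create the credential file once, from a shell:
#   htpasswd -c /etc/apache2/.htpasswd someuser
# Then in the site's .htaccess (or the vhost config):
AuthType Basic
AuthName "Private - stay out of the indexes"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user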
Wrapping up the important takeaways:
We know that robots.txt:
1) is case sensitive; it must be "robots.txt", not "rObotS.txt"
2) must be world-readable (chmod 644 works fine)
3) must be in the web root (you should be able to hit site.com/robots.txt; quick check below)
4) isn't mandatory (your website will work without it)
5) is the first place I go when I am pentesting an application/website
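A quick sanity check for points 2 and 3 might look like this, assuming a typical Linux box with the web root at /var/www/html (hostname and path are placeholders):
# Make the file world-readable (point 2)
chmod 644 /var/www/html/robots.txt
# Confirm it is served from the web root (point 3); expect an HTTP 200
curl -I https://example.com/robots.txt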
So just to summarize, according to the experts it "is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users" (cited from Wikipedia). The robots.txt file is placed at the root of the website (e.g., https://example.com/robots.txt) and suggests which areas of the site should not be processed or scanned by web crawlers.
**Remember**: respecting robots.txt is not mandatory, and plenty of bots ignore it. However, with proper permissions and coding, you should be able to control what gets indexed.
There are a bunch of free tools online to check accessibility, for example:
https://seositecheckup.com/tools/robotstxt-test
I just checked https://medium.com/robots.txt
This is not how I do my robots.txt (I will get into mine in a second), but looking at that: if you have no experience with this file or its contents, this is exactly where hackers are going to start looking for resources they should not have access to. Right? Why would they not. With that in mind, sometimes it's better not to announce a sensitive directory and instead do things like the following (quick sketch after these two steps):
- Start the directory you want left unindexed with a dot, e.g. .logs
- Set the permissions so it is not world-readable, etc.
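A rough sketch of those two steps on a typical Linux host; the paths are placeholders, and make sure nothing else links to the old directory name before renaming:
# Rename so the directory is never advertised in robots.txt
mv /var/www/html/logs /var/www/html/.logs
# Drop world read/execute so other local users and processes cannot list it
chmod o-rx /var/www/html/.logs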
Briefly, we go through the file:
User-Agent: * <- this wildcard group should come last if you have multiple User-Agent rules
Disallow: /m/ <- if respected, Google/Bing/etc. will not index anything in /m/
Disallow: /me/ <- if respected, this folder will not be indexed
....
This does not mean that you cannot visit the pages within these folders; in fact, you may find things you had no idea were there or had no business seeing. Let's look at a more complex example of a Disallow. Suppose you have a dynamic index.php file that serves up a table (from a MySQL DB) and lists 100 files, all as <a href= links that download when clicked. That would normally be straightforward in robots.txt:
Allow: /index.php
However, this script has a download counter unique to each file, and it calls itself with a GET request: hxxp://example.com/index.php?id=25 will download file #25, increment the counter in the DB, and update the page.
This poses an issue for indexing. If you just Allow: index.php (I've done it wrong before), the crawler will think that
hxxp://example.com/index.php?id=25 is a page
hxxp://example.com/index.php?id=24 is a separate page
hxxp://example.com/index.php?id=23 is a separate page, and so on
This is not what you want. I recommend using wildcards and a couple of pattern operators to disallow the fact that there are parameters on the URL.
To prevent search engines from indexing URLs that contain query parameters (like the id parameter in your example), while still allowing them to index the page without parameters (e.g., index.php), you use the Disallow directive with a wildcard character. However, it's important to note that the use of wildcards (*) in robots.txt is supported by some but not all search engines.
Major search engines like Google and Bing do support wildcards, which allows for more flexible and precise control over what is allowed or disallowed for crawling.
Here is how you might set up your robots.txt to disallow crawling of index.php with any query parameters, while still allowing index.php itself to be crawled.
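A minimal sketch for crawlers that honor * and $, as Google and Bing do (the trailing $ anchors the end of the URL):
User-Agent: *
# Allow the bare page itself
Allow: /index.php$
# Block any call that carries a query string, e.g. index.php?id=25
Disallow: /index.php?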
Take medium.com’s robots.txt —
Disallow: /search?q$
Disallow: /search?q=
Disallow: /*/search?q=
Disallow: /*/search/*?q=
These are perfect examples.
The breakdown: in my experience watching the different "bots" (good and bad; I would say the split is about 25% good to 75% bad/annoying), my recommendation is to drop any inbound web traffic with no User-Agent set unless you have a reason to allow it. If there is no User-Agent, it is likely just a script or someone connecting manually. All browsers set some type of UA and all legit robots identify themselves, but the header is easily modified by the client, so don't rely on it. It is similar to the "Referer" header (yes, it is spelled wrong, but that is how it was written into the web standards and it stuck; if you are not familiar with the W3C, the World Wide Web Consortium, you should get familiar, I refer to them as the w3 gods).
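A minimal sketch of that drop rule on Apache, assuming mod_rewrite is enabled (Cloudflare and nginx have their own equivalents); this belongs in the vhost or .htaccess, not in robots.txt:
RewriteEngine On
# If the User-Agent header is empty, refuse the request with a 403
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^ - [F]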
What are these terms, Disallow and Allow, referred to as?
Many aspects of web requests/responses and their headers are expressed as "directives" followed by values, very similar to key/value pairs. In the case of robots.txt, the primary directives are Allow and Disallow.
Briefly, on how not to create many indexed copies of a single dynamic page: I have done this wrong before, and it took two years to resync and get it "right" with all the search engines (they seem to communicate to a degree).
Many sites put silly ASCII art and disclaimers at the top of the robots file. I would not recommend this; it is just extra bytes being sent for no reason, everyone knows about robots.txt, and it is no secret what is in there. Just keep it small. I will briefly explain very soon how, in my opinion, we can keep this file minimal while staying as secure and as properly indexed as possible:
Complex dynamic pages and robots.txt:
There is no doubt you will have to experiment to get your site right. If you're running a CMS, look up the vendor recommendations but keep mine in mind: Drupal and WordPress do not have security as their primary concern, they are trying to maximize your hits and visibility, which you don't always need. For instance, do you really need the login page indexed? Perhaps. If the site is the Dept of Defense, I would say not ;p
Option 1: assuming your index.php (or .aspx) is in the directory /search
(Keep in mind that when you see a URL like this, the file being hit is not "search"; it is an index.(whatever language) inside the search folder, i.e. https://www.example.com/search/index.php?q=123 would be the actual full URI.) If you don't believe me, take facebook.com, for example: visit https://www.facebook.com/index.php and you will notice it cleans up the URL so you don't see the filename. So:
hxxps://www[dot]facebook[dot]com/?id=123
is the same as
hxxps://www[dot]facebook[dot]com/index.php?id=123
The Disallow Directive:
I cannot think of any situation where one of these examples would not suit your needs:
Disallow: /search?q$
#1 This directive aims to disallow URLs that exactly match /search?q without any value assigned to q. The $ symbol at the end signifies the end of the URL, meaning nothing comes after q. However, it's important to note that the $ end-of-line anchor is not universally supported by all web crawlers. If a crawler does support it, this rule would block /search?q but not /search?q=something.
Disallow: /search?q=
#2 This directive tells web robots not to crawl any URL that starts with /search?q= followed by any value. It effectively blocks access to any search results page on the site where the search term is provided in the q query parameter. Unlike the previous directive, this one does not end with a $, so it applies to any URL that begins with /search?q=, regardless of what follows the =.
Disallow: /*/search?q=
#3 This directive is a bit more specific in its path but broader in its application than the previous ones. The initial * acts as a wildcard for any directory name, meaning it matches any URL that contains /search?q= preceded by a directory or path segment. For example, it would match /articles/search?q= but not /search?q= directly under the root path. This rule is used to block search pages that are located within a subdirectory rather than at the root level.
Disallow: /*/search/*?q=
#4 This one makes the path more specific while keeping a broad match on the query parameter. It applies to any URL with /search/ located inside some path (as indicated by the initial *) and followed by a further subpath (the second *), where the URL also includes a ?q= parameter. For example, it would block URLs like /indexer/search/results?q=. This directive specifically targets search result pages that are nested within a subdirectory, then within a 'search' directory, with additional path elements after that.
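To make the matching concrete, here is how those four patterns play out against a few sample URLs, assuming a crawler such as Googlebot that supports both * and $ (the URLs are made up):
# /search?q                    -> blocked by #1 (the $ anchors the end of the URL)
# /search?q=shoes              -> blocked by #2 (anything starting with /search?q=)
# /articles/search?q=shoes     -> blocked by #3 (a path segment before /search?q=)
# /docs/search/results?q=shoes -> blocked by #4 (/search/ nested, extra segment, then ?q=)
# /search                      -> not blocked by any of the four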
Conclusion: security should be your focus when you are in charge of making this file (and the sitemap.xml, which works hand in hand with robots.txt).
More or less, when you think about your robots.txt file, keep in mind that:
1) 30 to 40% of your visitors will potentially be automated and not human.
2) For the first month of your new site, set up a WAF rule to drop traffic with no User-Agent, but initially keep the rule in "report only" mode so it just counts hits. After a month or so, evaluate it; if you are seeing a lot of traffic with no UA and can tell by its activity that it is not helpful, block it. This is not typically done (if ever) in robots.txt; use Cloudflare, a reverse proxy, or .htaccess (as sketched earlier) to handle it.
3) Here is an example of a robots.txt I wrote recently:
User-Agent: *
Crawl-Delay: 2
Allow: /*.php
Allow: /gfx
Allow: /blog
Allow: /links
Allow: /msp
Disallow: /wp-admin
Disallow: /xmlrpc.php
# Catch all deny
Disallow: /
Sitemap: https://samplemsp.net/sitemap.xml
Many sites, you will notice, handle Google, Bing, and other bots differently; I have never had a need for this. The way I have found best to evaluate this is to think of it like a firewall. When you first unwrap one and hook it up, all it has is:
deny any/any (implicit deny)
That is where I start with robots.txt: start by denying everything, then add Allows for what you want crawled. You can use www.technicalseo.com for an amazing set of free tools to help with these things.
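A bare-bones starting point in that spirit might look like this; the /public/ path and the sitemap URL are placeholders for whatever you actually want crawled:
User-Agent: *
# Implicit deny: block everything by default
Disallow: /
# Then open up only what should be indexed
Allow: /public/
Sitemap: https://example.com/sitemap.xml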
One last thing: I see so many improperly crafted robots.txt files that it prompted me to write this. Let's look at one final example, a robots.txt from an unnamed site:
The way this would be interpreted is: do not crawl anything in the two very specific paths listed, but everything else on the site is fair game. If you are wondering why I keep mentioning how important it is to get this right, let's look at Tesla.com for a minute. Tesla makes great cars from what I understand, but they did not do this right.
I tried to find any robots.txt on tesla.com; none was found. In fact, I got a 403 Forbidden.
Important: make a point of seeing what your company or your website has exposed on the search engines.
Remember: Google is not the only search engine.
Always check Bing.com, Yandex.ru, and Ecosia.com; Brave Search is decent too.
Anyone remember astalavista.box.sk ;p ?
peace
substance