White Hat Institute

“Defend the Web” write-up (Intro 7: robots.txt data leak vulnerability)

The robots.txt file is used to inform web crawlers and other well-intentioned bots about a website’s layout. It is freely accessible and easy for humans to read. The robots.txt file tells crawlers where to find XML sitemap files, how fast the site may be crawled and, most importantly, which pages and folders not to crawl.

Before crawling a webpage, a good robot first looks for a robots.txt file and, if one exists, usually follows the directives it contains. Bad robots, on the other hand, may ignore it or worse. In fact, some malicious bots and penetration-testing tools deliberately hunt for robots.txt files in order to reach the restricted portions of websites.
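
To make this concrete, below is a minimal sketch, using only the Python standard library (the site_maps() call needs Python 3.8 or newer), of how a cooperative crawler might consult robots.txt before fetching anything; the “*” user agent is just a placeholder.

```python
# A well-behaved crawler checks robots.txt before requesting pages.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://defendtheweb.net/robots.txt")
rp.read()  # download and parse the file

# May "our" crawler fetch this path at all?
print(rp.can_fetch("*", "https://defendtheweb.net/extras/playground/"))

# How fast may the site be crawled, and where are the sitemaps?
print(rp.crawl_delay("*"))  # None if no Crawl-delay directive exists
print(rp.site_maps())       # None if no Sitemap directives exist
```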

The Disallow list in the robots.txt file can serve as a map for a malicious actor, human or robot, looking for private or confidential material on a website. It is the most obvious place to start. As a result, if site administrators believe they are using the robots.txt file to secure their content and keep pages hidden, they are almost certainly achieving the opposite.
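
From the attacker’s side, that same list doubles as a target inventory. The rough reconnaissance sketch below, again using only the Python standard library, pulls a site’s robots.txt and prints every path the owner asked crawlers to avoid. Only run something like this against targets you are authorized to test.

```python
# List the Disallow entries of a robots.txt file as candidate URLs.
import urllib.request

BASE = "https://defendtheweb.net"  # adjust to the target in scope

with urllib.request.urlopen(f"{BASE}/robots.txt") as resp:
    body = resp.read().decode("utf-8", errors="replace")

disallowed = [
    line.split(":", 1)[1].strip()
    for line in body.splitlines()
    if line.lower().startswith("disallow:")
]

for path in disallowed:
    print(f"{BASE}{path}")  # "forbidden" paths worth a closer look
```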

Let’s take a look at the challenge and see how we can exploit this vulnerability.

Append “robots.txt” to the root domain (https://defendtheweb.net/robots.txt) and hit “Enter.”

Defend the Web - Intro 7-1

The robots.txt file might point to resources containing valuable information such as passwords, usernames, and account details.

Defend the Web - Intro 7-2

Let’s look at the “jf94jhg03.txt” file and see what’s inside. Copy the path and append it to the root domain (https://defendtheweb.net/extras/playground/jf94jhg03.txt), then hit “Enter.”

Defend the Web - Intro 7-3

As you can see, we managed to retrieve the user account information. Copy the credentials and use them on the login page to pass the challenge.
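
The retrieval step can also be scripted. The small sketch below simply requests the text file exposed by the challenge’s robots.txt and prints its contents; note that the file name comes from the challenge and may change over time.

```python
# Fetch the file that robots.txt asked crawlers to stay away from.
import urllib.request

url = "https://defendtheweb.net/extras/playground/jf94jhg03.txt"

with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8", errors="replace"))
# Copy the username/password pair from the output into the login form.
```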

The robots exclusion standard will neither remove a URL from a search engine’s index nor prevent a search engine from indexing a URL. Search engines often add a URL to their index even when they have been told not to crawl it, for example when other pages link to it. Crawling and indexing are two separate tasks, and the robots.txt file has no effect on indexing.

For pages that need to stay out of search results but remain publicly accessible, use a noindex directive (delivered through a robots meta tag or an X-Robots-Tag response header) rather than Disallow. This ensures that when a well-behaved crawler comes across a URL that should not be indexed, it will not index it. For material at this level of sensitivity, it is acceptable for a crawler to visit the URL, but not to index its content. For pages that must be private and not publicly accessible, password protection or IP whitelisting are the better solutions.
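
As an illustration of the “crawl allowed, but do not index” approach, here is a minimal Flask sketch; the route and page content are made up for the example, but the X-Robots-Tag: noindex response header is a standard way to keep a reachable page out of search indexes.

```python
# Serve a publicly reachable page that search engines are told not to index.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/reports/quarterly.html")  # hypothetical semi-private page
def quarterly_report():
    resp = make_response("Quarterly report contents here.")
    resp.headers["X-Robots-Tag"] = "noindex, nofollow"
    return resp

if __name__ == "__main__":
    app.run()
```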

Consider using your robots.txt file to set up a honeypot if you want to take your security a step further. In robots.txt, include a Disallow directive that sounds enticing to malicious actors, such as “Disallow: /secure/logins.html.” Then enable IP logging on the disallowed resource, and block any IP address that tries to load “logins.html” from accessing any other part of your website in the future.
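
Below is a rough sketch of that honeypot idea, again in Flask; the paths, the in-memory blocklist, and the logging are placeholder choices, and a real deployment would persist the blocklist and feed it into a firewall or WAF rather than keep it in application memory.

```python
# robots.txt advertises a bait path; whoever requests it gets logged and blocked.
from flask import Flask, request, abort

app = Flask(__name__)
BLOCKED_IPS = set()  # placeholder: use persistent storage in production

@app.route("/robots.txt")
def robots():
    # Bait entry: no legitimate crawler has any reason to visit this path.
    return ("User-agent: *\nDisallow: /secure/logins.html\n",
            200, {"Content-Type": "text/plain"})

@app.before_request
def block_known_offenders():
    if request.remote_addr in BLOCKED_IPS:
        abort(403)

@app.route("/secure/logins.html")
def honeypot():
    BLOCKED_IPS.add(request.remote_addr)  # remember the visitor's IP
    app.logger.warning("Honeypot hit from %s", request.remote_addr)
    abort(404)  # pretend nothing is here

if __name__ == "__main__":
    app.run()
```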