Ever wonder how past web data could lead to huge bounties? We will learn how to turn the Wayback Machine into a gold mining machine! Stick around to learn how to analyze historical data and pull some interesting leads out of it. Our focus here is to unlock the full potential of past data. We will explore key techniques for extracting valuable information, and then apply those insights to identify and exploit potential vulnerabilities.
Watch this video in case you are too lazy to read :)

The winners in the wayback data collection category are gau and waymore. Each tool has its own advantages and disadvantages. If you need speed, use gau, since it is the fastest way to fetch a large number of URLs. On the other hand, if you are looking for coverage, I suggest adding waymore to your workflow.
These two tools work pretty similarly in the sense that they gather data from the same providers. The main difference is the language they are built in: gau is written in Go, while waymore is written in Python, which is considerably slower.
Another difference is that gau offers multiple threads and blacklisting options (to exclude extensions you do not need), while waymore wins on coverage, since it can also download the archived responses themselves, where extra URLs or potential secret leaks can be found. So pick whichever tool you prefer; personally, I use them both.
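To make the comparison concrete, here is a minimal sketch of how I would run each tool against a single domain. The flag names are based on recent releases of gau and waymore and may differ in your version, so double-check each tool's help output; example.com and the output filenames are placeholders:

gau --threads 5 --subs --blacklist png,jpg,gif,css,woff --o gau.txt example.com

waymore -i example.com -mode U -oU waymore.txt

The -mode U option tells waymore to fetch URLs only; switching to response mode also downloads the archived responses mentioned above, which takes much longer but is where the extra findings usually hide.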
So to start the manual inspection, the first thing we want to do is grab some root directories. I will use the gau.txt file, which contains the endpoints collected in this video. You can do this easily with grep; its Perl regex option will help us match only HTTP or HTTPS URLs:
grep -oP '^https?://(?:[^/]*/){2}' gau.txt

The {2} in the regex specifies how deep you want to go. For example, {2} will give me results like target.com/dir, while {3} will give me target.com/dir/subdir.
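As a quick illustration of going one level deeper, the same command with {3} against the same gau.txt file would look like this:

grep -oP '^https?://(?:[^/]*/){3}' gau.txt | sort -u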
Next, I want to sort them to keep only unique items, run them through httpx to match status code 200, and save them to a file called root-dirs.txt:
grep -oP '^https?://(?:[^/]*/){2}' gau.txt | sort -u | httpx -mc 200 | tee root-dirs.txt

Those results can be inspected manually later, or fed to any other tool. For instance, you can run nuclei just to discover which technologies they are running. And you do not have to stop at technology checks; you can also run it to look for specific vulnerabilities.
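For the technology-detection step, a minimal sketch with nuclei could look like the following; the -tags tech filter depends on the public nuclei-templates project, so adjust it to whatever your installed templates use, and the output filename is just my choice:

nuclei -l root-dirs.txt -tags tech -o root-dirs-tech.txt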
The next thing you probably want to do with wayback data is use the gau output to gather parameter names. You can add those names to your wordlist for guessing other parameters while hacking manually and fuzzing endpoints. For this purpose, I recommend the unfurl CLI utility. Its keys mode greps the keys out of the URL query strings, which you should then sort down to unique results:
cat gau.txt | unfurl keys | sort -u

Some of those results should be filtered out as well. I will use grep with a regex and the invert option. For example, I do not need underscores, question marks, or forward and backward slashes, so the final set of CLI commands looks like this:
cat gau.txt | unfurl keys | sort -u | grep -vE '_|/|\?|\\'

Any remaining junk I recommend removing manually. You can use any text editor for that; I like using VS Code. The fewer results you keep, the less load we put on the target server. In the end, you want to leave only the names that could actually resemble real request parameters.
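As a rough sketch of how that wordlist can be reused, assuming the filtered names are saved to params.txt and you pick one endpoint to fuzz with ffuf (the URL and the -fs baseline size are placeholders you would tune per target):

cat gau.txt | unfurl keys | sort -u | grep -vE '_|/|\?|\\' > params.txt

ffuf -u 'https://target.com/search?FUZZ=test' -w params.txt -mc 200 -fs 4242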
Next, we can analyze that gau file a little more. This time we will use grep to filter for specific extensions. You might know that my favorite extension is PDF:
cat gau.txt | grep -E '\.pdf' | sort -u

Additionally, you will want to run those wayback URLs through httpx with this command and save the results to gau-pdfs.txt:
cat gau.txt | grep -E '\.pdf' | sort -u | httpx -mc 200 | tee gau-pdfs.txt

You will also want to do the same for .doc, .docx, .xls, .xlsx files, etc., just to check for sensitive information. Reviewing the contents for sensitive data is something you want to do manually, but having a list of live links is pretty useful for later.
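As a sketch, those extra document formats can be grabbed in one pass; the output filename is just my own choice:

cat gau.txt | grep -iE '\.(pdf|doc|docx|xls|xlsx)(\?|$)' | sort -u | httpx -mc 200 | tee gau-docs.txt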
Other pretty useful extensions are .php, .asp, and .jsp; more rarely you will see .py or .rb, and .do or even .action. Those URLs can also be passed to another tool, for example nuclei. Use the nuclei fuzzer to fuzz for SQL injection and XSS, or maybe you want to use Jaeles (even though it is pretty outdated nowadays). Just append the parameters you have already collected previously.
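A rough sketch of that fuzzing step, assuming a recent nuclei version with DAST (fuzzing) templates available; the -dast flag may differ between releases, so check nuclei -h first, and the filenames are placeholders:

cat gau.txt | grep -E '\.(php|asp|jsp|do|action)' | grep '=' | sort -u > fuzz-targets.txt

nuclei -l fuzz-targets.txt -dast -o fuzz-findings.txt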
Another interesting way to use wayback data is getting login endpoints. I will again use grep with a regex to check for things like "login", "register", "signup", "signin", "sign-up", "sign-in", "dashboard", and "admin", checking whether they are still alive and saving them to a file:
cat gau.txt | grep -Ei 'login|register|signup|signin|sign-up|sign-in|dashboard|admin' | httpx -mc 200 > auth-endpoints.txt

Eventually, you will have a lot of data to work on when testing manually. These endpoints can be used for login bypasses, and if you end up on some old endpoint, you can try SQL injection to log in. The extra mile would be registering an account and exploring the authenticated functionality. Sometimes there are also registration bypasses, for example when you use that company's email domain and there is no email validation.
If you find this information useful, please share this article on your social media; I will greatly appreciate it! I am active on Twitter, check out the content I post there daily! If you are interested in video content, check out my YouTube channel. Also, if you want to reach me personally, you can visit my Discord server. Cheers!