Crawling for Emails in Websites — OSINT Methodology


Jason Jacobs, MSc.

Original Content — Ideogram Generated Image

I’ve taken an interest in the Open-Source Intelligence (OSINT) research space; my first article covered sources for “Tracking Malware and Ransomware Activity,” and I have to say, I owe some further effort to sharing my expertise today.

Scope

I’ve chosen my own local business website as the test target for this exercise, so if you browse the demonstrated target, all of the information and descriptions outlined are directly connected to the author of this article.

Justification

Why, you might ask? To the best of my understanding, there are a few reasons that justify this kind of research:

- Business research
- Marketing research
- Mapping contacts for an organization
- Finding reasonable connections for agreements

🚨 DISCLAIMER: You accept liability and responsibility for the activities involved in your research, so be sure to check the terms of service of the organizations and websites involved, as well as the legal requirements for data handling in the countries where you conduct such research.

Active Footprinting

I chose this method because I want to interact directly with the website in scope: crawl the webpages within the site and use pattern matching to extract the data based on our requirements.

If your local research environment already has Go installed, go ahead and install Katana and Nuclei to follow along:

go install github.com/projectdiscovery/katana/cmd/katana@latest
go install -v github.com/projectdiscovery/nuclei/v3/cmd/nuclei@latest

Katana crawls the website in scope to reveal its internal URL paths, while Nuclei applies a pattern-matching template to those paths to reveal emails.

Using the email-extractor.yaml template for Nuclei, I ran an initial exercise that hit some limitations, so additional methods are demonstrated after this first attempt.

Saving the template to my local Linux environment, I proceeded with the following command:

wget https://raw.githubusercontent.com/projectdiscovery/nuclei-templates/refs/heads/main/http/miscellaneous/email-extractor.yaml

Crawling and Harvesting (Katana + Nuclei)

katana -u site.com | nuclei -t email-extractor.yaml

The template identified two (2) emails when scraping my site, but they did not belong to me, so I redacted them from the attached image. They appear to belong to the developer of a plugin used within our site.

Redacted emails from our site.

Perhaps the site’s anti-bot mechanism is preventing the discovery of the other email attached to the site.

Harvesting Emails with JavaScript

The anti-bot mechanism can be circumvented to some extent by running in-browser JavaScript from the browser’s developer tools.

First Method — Harvesting from the Current Page

I translated the same pattern-matching logic from the email-extractor.yaml Nuclei template into JavaScript, adding recognition for .dev emails as well.
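A minimal sketch of what that console snippet can look like (the regex below is a generic email pattern of my own, not copied verbatim from email-extractor.yaml, so the exact expression may differ):

```javascript
// Generic email regex (my own approximation of the template's extractor);
// the [a-zA-Z]{2,} TLD part also covers .dev addresses.
const EMAIL_RE = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Harvest unique emails from a blob of text.
function harvestEmails(text) {
  return [...new Set(text.match(EMAIL_RE) || [])];
}

// In the browser console, run it against the page currently open:
// harvestEmails(document.body.innerText);
```

Because the snippet reads `document.body.innerText`, it sees the page as rendered in your browser session, which is what lets it sidestep server-side anti-bot checks.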

Copying this to the browser’s developer tools while having the target site opened on the current tab reveals some results.

Results from scraping the current page with JavaScript

The only problem with this method is that it only scrapes the current webpage, ignoring any emails that may be present on other paths of the website in scope.

Second Method — Harvesting from the Crawled Website

This method crawls the URL paths of the website in scope and then harvests emails, capturing addresses that the first method may have missed.
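As a hedged sketch (my own illustration, assuming a same-origin site whose pages can be fetched from the console, not the exact code from the screenshot), the crawl-and-harvest step can be split into a pure extraction helper plus an async crawler that returns a Promise:

```javascript
// Same generic email pattern as before (an approximation, not the
// template's exact regex).
const EMAIL_RE = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Pure helper: pull every email match out of one page's HTML.
function extractEmails(html) {
  return html.match(EMAIL_RE) || [];
}

// Browser-only part: gather same-origin links from the current page,
// fetch each one, and collect the unique emails found. Calling this in
// the console shows a pending Promise until all fetches resolve.
async function crawlAndHarvest() {
  const links = [...document.querySelectorAll('a[href]')]
    .map((a) => a.href)
    .filter((href) => href.startsWith(location.origin));
  const urls = [...new Set([location.href, ...links])];
  const found = new Set();
  for (const url of urls) {
    try {
      const html = await (await fetch(url)).text();
      extractEmails(html).forEach((e) => found.add(e));
    } catch (_) {
      // Skip pages that fail to load.
    }
  }
  return [...found];
}

// In the developer tools console:
// crawlAndHarvest().then(console.log);
```

Note this only follows links one hop from the open page; a deeper crawl would need to recurse into the fetched pages as well.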

Full site crawl + Email Harvest

The console first shows a pending Promise, which, once fulfilled, returns the emails discovered by the specified pattern matching.

Remember to check any data-privacy or data-handling laws and compliance requirements that may apply to this kind of research, to ensure the safety of both yourself and the contacts involved.

Happy Researching!
