Create a Python Web Spider in 60 Seconds or Less



[Image: Photo by Christopher Gower on Unsplash]

A web spider is an automated way to identify links and other resources hosted on a target website. Today this can be used for anything from data mining to mapping an attack surface during offensive security assessments. This post demonstrates how to create a Python web spider and how to customize it to fit your own programming requirements.

Web Spider vs Web Scraper

Before diving in, let's define the difference between a web spider and a web scraper, two commonly confused terms:

A spider, also referred to as a crawler, is a bot-like program that systematically indexes the pages on a site. Think of it as an inventory tool that records all available resources. Spiders are often used by search engines to catalog results, such as the Googlebot.

A scraper, on the other hand, may use similar techniques but actually processes the data on the resulting pages. It searches for specific information to extract and use in various ways. The data found by scrapers can be leveraged to populate databases, monitor activity, collect sensitive information, and more.
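The distinction can be illustrated with a short standard-library example. The sample HTML below is made up for demonstration; the spider records every link it finds, while the scraper pulls out one specific piece of data:

```python
from html.parser import HTMLParser

# Made-up sample page for demonstration purposes only.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="/pricing">Pricing</a>
  <p class="price">$19.99</p>
</body></html>
"""

class LinkSpider(HTMLParser):
    """Spider: index every link on the page, nothing more."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

class PriceScraper(HTMLParser):
    """Scraper: extract one specific piece of data (the price)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Only capture text inside <p class="price"> elements.
        self._in_price = tag == "p" and ("class", "price") in attrs

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

spider = LinkSpider()
spider.feed(SAMPLE_HTML)

scraper = PriceScraper()
scraper.feed(SAMPLE_HTML)
```

Both walk the same HTML, but the spider's output is an inventory of links while the scraper's output is targeted data.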

Prerequisites

To create our spider, you will need Python 3.6+ and a library called Taser. Taser lays the foundation of our spider and can be installed directly from PyPI with the following command:

pip3 install taser

Taser is an acronym that stands for "Testing and Security Resource". It is a purpose-built module containing various classes and functions for security-related tooling. I created the library after spending far too much time writing similar code across multiple projects. Although it started with only HTTP frameworks in mind, it has since grown to support multiple protocols and is applicable outside the security space.

Creating the Spider

Taser's HTTP protocol has a built-in Spider class that can be invoked directly from the Python interpreter. No coding necessary: simply drop into a Python shell, import the Spider class, initialize it with your target site, and you're done. Within seconds you have a categorized list of URLs!
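What does a "categorized list of URLs" look like? Taser's exact grouping isn't reproduced here, but one plausible scheme, sketched below with only the standard library, is to resolve every discovered link against the page it was found on and bucket it as in-scope (same host) or external:

```python
from urllib.parse import urljoin, urlparse

def categorize(base_url, hrefs):
    """Resolve each href against base_url and bucket results by host.

    Illustrative only; this is not Taser's actual categorization logic.
    """
    base_host = urlparse(base_url).netloc
    buckets = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)       # relative -> absolute
        host = urlparse(absolute).netloc
        key = "internal" if host == base_host else "external"
        buckets[key].append(absolute)
    return buckets

result = categorize(
    "https://example.com/index.html",
    ["/about", "contact.html", "https://cdn.example.net/app.js"],
)
```

Relative paths like /about and contact.html resolve to the target host and land in the internal bucket, while the CDN script is flagged as external.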

[Screenshot: importing and running the Spider class from a Python shell]

By default, the spider employs techniques such as user-agent randomization to help avoid CAPTCHA challenges. This can be further enhanced with built-in rotating-proxy support, custom headers, and other variables. These settings can be modified during initialization or by changing the class variables directly.
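The idea behind user-agent randomization is straightforward and can be sketched in plain Python. The agent strings and the build_headers helper below are illustrative, not Taser's actual list or API:

```python
import random

# Illustrative pool of browser user-agent strings (abbreviated).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def build_headers(extra=None):
    """Pick a random User-Agent per request and merge custom headers."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    headers.update(extra or {})
    return headers

headers = build_headers({"Accept-Language": "en-US"})
```

Rotating the User-Agent on every request makes traffic look less like a single automated client, which is why it helps sidestep basic bot detection.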

[Screenshot: Spider initialization options and class variables]

Customizing the Spider

This is great, but how can you make use of this in your own programming? Using inheritance, you can modify almost every aspect of the spider's functionality to better fit your use case. In this section, I will show you how to adjust the print function to work on Windows. The Spider class's outputHandler method is responsible for all output and uses Taser's printx function to display results in color. Unfortunately, this feature relies on ANSI escape characters and therefore only works for Unix users. When run on a Windows host, this method generates illegible results.

[Screenshot: garbled ANSI escape output on a Windows host]

Through inheritance, however, we can change the class method to use Taser's custom logging adapter. The benefit of this approach is automatic OS detection: Unix users continue to see colored results while Windows output is still supported. These changes are demonstrated in the custom MySpider class below:
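The original MySpider code appeared only as a screenshot, so here is a self-contained sketch of the same inheritance pattern. The Spider base class below is a minimal stand-in, since Taser's real import paths, method signatures, and logging-adapter API are not reproduced in the text; with Taser installed you would subclass its Spider class instead:

```python
import platform

ANSI_GREEN, ANSI_RESET = "\033[32m", "\033[0m"

class Spider:
    """Minimal stand-in for Taser's Spider class (hypothetical API)."""

    def outputHandler(self, resource):
        # Base behavior: color every line with hard-coded ANSI escapes,
        # which render as garbage on terminals without ANSI support.
        line = f"{ANSI_GREEN}{resource}{ANSI_RESET}"
        print(line)
        return line

class MySpider(Spider):
    """Override output to detect the OS and skip ANSI codes on Windows."""

    def outputHandler(self, resource):
        if platform.system() == "Windows":
            # Plain text keeps Windows consoles legible.
            print(resource)
            return resource
        # Unix-like systems keep the colored output.
        return super().outputHandler(resource)
```

In practice the override would simply call Taser's logging adapter, which performs this OS detection for you; the sketch shows the effect of that detection explicitly.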

As you can see, when running our new spider script on both operating systems we get clear, legible results:

[Screenshot: clear, legible spider output on both Windows and Unix]

Conclusion

While the use of spiders and scrapers is not illegal, they often violate a site's terms of service. Only use the information provided against systems you own or have explicit permission to test. Thanks for reading! Find out more about me at m8sec.dev and follow for more articles on Python.

Disclaimer: All content is provided for educational purposes only. The author is not responsible for any use of this information. Never test systems you don't own or lack explicit permission to test.
