Monday, October 28, 2024

Programming Scripting Automation: Web Scraper For Image Downloader (Simple ImageScraper V.1.0.0)

Web scraper illustration

Hello everyone! Welcome to my blog. I hope you are all in good health. Today, I’m excited to introduce a new tool: an image downloader that offers flexible XPath configuration. This means you can easily adapt the program to work with different web pages.

While it’s a simple program, it has been incredibly helpful for me when I need to save a large number of images from the web. The automation is highly efficient, as it saves me the time of manually opening and saving each image. I can focus on other tasks while the program continues to crawl and download images.

Let me show you how it works—let's get started!

 

Program Design

Program design

For this simple program, I chose not to use the MVC design pattern. Instead, I kept it as a single file without any external modules to maintain simplicity and save time on the design process.

Here, you can see the storage class, which contains the default configuration and several methods.

Storage class
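Since the screenshot is not reproduced as text, here is a minimal sketch of what a storage class of this kind might look like. The class name `Storage`, its attributes, its default values, and its methods are all hypothetical illustrations, not the program's actual code.

```python
from pathlib import Path


class Storage:
    """Holds default settings and small helpers (all names are hypothetical)."""

    # Assumed defaults; the real program's values may differ.
    DEFAULT_DOWNLOAD_DIR = "downloads"
    DEFAULT_XPATH = "//img/@src"

    def __init__(self, download_dir=None, xpath=None):
        # Fall back to the class-level defaults when nothing is passed in.
        self.download_dir = Path(download_dir or self.DEFAULT_DOWNLOAD_DIR)
        self.xpath = xpath or self.DEFAULT_XPATH

    def ensure_download_dir(self):
        """Create the download directory if needed and return its path."""
        self.download_dir.mkdir(parents=True, exist_ok=True)
        return self.download_dir

    def target_path(self, filename):
        """Return the full path where a downloaded file would be saved."""
        return self.download_dir / filename
```

Keeping the defaults and the path helpers in one place means the rest of a single-file program only has to ask the storage object where things go.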


JSON Standard For XPath Values

The initial idea behind writing the code was simplicity. However, I realized that once I convert the program to a binary, any XPath written directly in the code can no longer be changed to suit different web pages.

XPath values stored in the xpath_config.json file

Using JSON allows for greater flexibility, making it easy to modify the XPath values to suit the HTML structure and elements of various web pages.
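As a rough sketch of the idea, the snippet below reads XPath values from xpath_config.json and falls back to built-in defaults when the file is missing. The key name `image_xpath`, the default expression, and the function `load_xpath_config` are assumptions made for illustration; the real file may use different keys.

```python
import json
from pathlib import Path

# Fallback used when the config file is missing (assumed default).
DEFAULT_CONFIG = {"image_xpath": "//img/@src"}


def load_xpath_config(path="xpath_config.json"):
    """Return XPath settings from a JSON file, falling back to defaults."""
    config = dict(DEFAULT_CONFIG)
    config_file = Path(path)
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as fh:
            # Values in the file override the built-in defaults.
            config.update(json.load(fh))
    return config
```

With this approach, adapting the tool to another site would only require editing the JSON file, for example setting `image_xpath` to `//div[@class='post-body']//img/@src`, without touching the program itself.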
 
 

Using The requests_html Module As The Page Scraper

I could achieve the same result using Beautiful Soup (bs4) together with lxml's etree module, but requests_html is simpler and supports XPath queries directly, without the need for etree.

Implementation of requests_html in the Code
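For readers who cannot see the screenshot, here is a minimal sketch of how requests_html can be used for this kind of scraping. The function names `absolute_urls` and `scrape_image_urls` and the default XPath are hypothetical; only `HTMLSession`, `session.get`, and `response.html.xpath` come from the requests_html library.

```python
from urllib.parse import urljoin


def absolute_urls(page_url, srcs):
    """Resolve relative image paths against the page URL, dropping duplicates."""
    seen = []
    for src in srcs:
        full = urljoin(page_url, src.strip())
        if full not in seen:
            seen.append(full)
    return seen


def scrape_image_urls(page_url, xpath="//img/@src"):
    """Fetch a page with requests_html and return the image URLs matched by xpath."""
    # Imported here so the helper above works without the third-party package.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(page_url)
    # An XPath ending in an attribute (@src) yields plain strings.
    return absolute_urls(page_url, response.html.xpath(xpath))
```

Calling `scrape_image_urls` with a page URL and the XPath taken from the configuration would return a list of absolute image links that are ready to be downloaded.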


Program Interface

As I mentioned earlier, I prefer a text-based interface because it is highly efficient. It allows programmers to focus solely on the code and the functionality of the program without getting distracted, ultimately saving time and effort.

Program interface


Debugging Time

I believe it took me just half a day to create this program, including research and related tasks. It didn't require much time because the program doesn't need validations, an MVC structure, or complex logic.

For example, I want to download all the images from this post: https://myportfolioreview13.blogspot.com/2024/10/linux-os-hardening-your-ssh-server.html

Download process

This is the process for automatically downloading images one by one. The results will be saved in the default location of the program.
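The one-by-one download loop could be sketched roughly as follows, using only the standard library. The function names, the `downloads` folder, and the progress message format are assumptions for illustration and may not match the actual program.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url, index):
    """Derive a file name from the URL path, or fall back to a numbered name."""
    name = PurePosixPath(urlparse(url).path).name
    return name or f"image_{index}"


def fetch_bytes(url):
    """Download one URL and return its raw bytes (standard library only)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_images(urls, dest_dir="downloads", fetch=fetch_bytes):
    """Save each image in turn and return the list of written paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        target = dest / filename_from_url(url, index)
        target.write_bytes(fetch(url))
        print(f"[{index}/{len(urls)}] saved {target}")
        saved.append(target)
    return saved
```

Note that in this sketch two images with the same file name would overwrite each other; a real tool would probably need to rename duplicates.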

The images are all downloaded


The Program Is Converted To Binary For Easy Distribution

By converting it to binary, I can use it on another VM or device without needing to install any additional modules, as all the necessary modules are already bundled with the binary program.

Program in binary format

You can watch the complete demo in the video below:


End...

That's all for now. I'm grateful for this opportunity to share with you. I hope to see you again soon. Good night!