Wednesday, November 6, 2024

Programming Scripting Automation: Scraping My Own Blog II

Web scraper Illustration

Good afternoon, good people. I hope you're all healthy and in good shape. Today I want to share a modified version of the code from Scraping My Own Blog!. I used to use BS4 (Beautiful Soup 4) as the core of the scraper, but now I'm experimenting with requests_html instead. The code is not much different, but I've added a few more adaptations.


Design

The purpose of building this tool is to scrape the post date and time, post title, and post article snippet. I’ve organized the code into a single file containing a storage class, a data scraping class, and an interface class. I enjoy using an object-oriented programming (OOP) approach because it makes the code more structured and organized.

Storage class

Data scraper class

Interface class
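To make the three-class layout easier to picture, here is a minimal sketch of how such a file could be organized. It is not the exact code from the screenshots: the names Post, PostStorage, BlogScraper, ConsoleInterface, and the fetch_rows callable are all hypothetical, and the scraper receives its rows from a plain function so the example runs without any network access.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Post:
    """One scraped blog post (hypothetical field names)."""
    published: str
    title: str
    snippet: str


class PostStorage:
    """Storage class: keeps scraped posts in memory."""

    def __init__(self):
        self._posts = []

    def add(self, post):
        self._posts.append(post)

    def all(self):
        # Return a copy so callers cannot modify the internal list.
        return list(self._posts)

    def to_json(self):
        return json.dumps([asdict(p) for p in self._posts], indent=2)


class BlogScraper:
    """Data scraper class: turns fetched rows into Post objects.

    fetch_rows is an assumed callable returning an iterable of
    (published, title, snippet) tuples. In a real scraper it would
    wrap the HTTP request and the XPath queries.
    """

    def __init__(self, storage, fetch_rows):
        self._storage = storage
        self._fetch_rows = fetch_rows

    def run(self):
        count = 0
        for published, title, snippet in self._fetch_rows():
            self._storage.add(Post(published, title.strip(), snippet.strip()))
            count += 1
        return count


class ConsoleInterface:
    """Interface class: formats the stored posts for display."""

    def __init__(self, storage):
        self._storage = storage

    def render(self):
        lines = []
        for post in self._storage.all():
            lines.append(f"[{post.published}] {post.title}")
            lines.append(f"    {post.snippet}")
        return "\n".join(lines)
```

Wiring the pieces together then looks like `storage = PostStorage()`, `BlogScraper(storage, my_fetch).run()`, and `print(ConsoleInterface(storage).render())`, where `my_fetch` stands for whatever function actually pulls the rows from the blog.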

Since this is a simple scraper, I’m not splitting it into modules following the MVC pattern; keeping everything in one file saves time when building the script.

This adaptation also adds a new feature: the program now supports pages rendered with JavaScript. If a website or blog requires JavaScript to render its content, the scraper waits for the JavaScript to finish loading before reading the page, so it can fetch the data successfully.
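As a rough illustration of the JavaScript support, the sketch below uses the render() method that requests_html provides on the parsed page. The function name fetch_rendered_html and the default wait and timeout values are my own assumptions, not taken from the original script. Note that render() downloads a Chromium build the first time it runs, so the first call can take a while.

```python
def fetch_rendered_html(url, wait_seconds=2.0, timeout=20.0):
    """Fetch a page and return its HTML after JavaScript has run.

    Hypothetical helper: the name and default values are assumptions.
    """
    # Imported here so this file still loads when the third-party
    # requests_html package is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        response.raise_for_status()
        # render() runs the page in headless Chromium; sleep gives the
        # scripts some time to finish after the initial render.
        response.html.render(sleep=wait_seconds, timeout=timeout)
        return response.html.html
    finally:
        session.close()
```

A call such as `html = fetch_rendered_html("https://example.com/")` would return the rendered markup as a string; the URL here is only a placeholder.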

In this new version of the scraper, I’m keeping the XPath expressions in a JSON configuration file. This way, when I need to adapt the scraper to another site or blog, I only have to update the JSON configuration instead of the code.

Reading XPath from json file

XPath code in json file
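Since the screenshots are not reproduced here, the following is a small sketch of what reading XPath expressions from JSON might look like. The key names (post_date, post_title, post_snippet), the XPath values, and the function load_xpath_config are illustrative assumptions; the real configuration may use different names and selectors.

```python
import json

# Keys the scraper expects to find in the configuration (assumed names).
REQUIRED_KEYS = ("post_date", "post_title", "post_snippet")

# Example configuration; the XPath values are placeholders only.
EXAMPLE_CONFIG = """
{
  "post_date": "//h2[@class='date-header']/span/text()",
  "post_title": "//h3[contains(@class, 'post-title')]/a/text()",
  "post_snippet": "//div[contains(@class, 'post-body')]"
}
"""


def load_xpath_config(text):
    """Parse JSON text and return the XPath expressions the scraper needs."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("missing XPath entries: " + ", ".join(missing))
    return {key: config[key] for key in REQUIRED_KEYS}


xpaths = load_xpath_config(EXAMPLE_CONFIG)
```

To read the configuration from disk instead, the text could come from something like `Path("xpath.json").read_text(encoding="utf-8")`, where the file name is again just an example. Checking for missing keys up front means a typo in the JSON fails immediately rather than in the middle of a scrape.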


Testing The Program

For easier review, I’ve prepared a video demonstration to walk you through the process:

Python Data Scraper II


End

Thank you very much for reviewing and visiting my blog. I hope to see you again in the next article!