Figure: Web scraper illustration
Good afternoon, good people. I hope you're healthy and in good shape. Today I want to share a modified version of the Scraping My Own Blog! code. I used to use BS4 as the core of the scraper, but now I'm experimenting with requests_html instead. The code is not much different, but I've added a few adaptations.
Design
The purpose of building this tool is to scrape the post date and time, post title, and post article snippet. I’ve organized the code into a single file containing a storage class, a data scraping class, and an interface class. I enjoy using an object-oriented programming (OOP) approach because it makes the code more structured and organized.
Figure: Storage class
Figure: Data scraper class
Figure: Interface class
Since this is a simple scraper, I’m not using a modular MVC pattern; keeping everything in one file saves time when building the script.
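To make the three-class layout a bit more concrete, here is a minimal sketch of how it could look. It is not the code from the screenshots: the names `Post`, `Storage`, `Scraper`, `Interface`, and the `fetch_rows` callable are all assumptions for illustration, and `fetch_rows` simply stands in for the real network and XPath work so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Post:
    """One scraped blog post (field names are assumptions)."""
    date: str
    title: str
    snippet: str


class Storage:
    """Storage class: keeps scraped posts in memory."""

    def __init__(self) -> None:
        self._posts: List[Post] = []

    def add(self, post: Post) -> None:
        self._posts.append(post)

    def all(self) -> List[Post]:
        # Return a copy so callers cannot mutate the internal list.
        return list(self._posts)


class Scraper:
    """Data scraper class: turns raw rows into Post objects.

    `fetch_rows` is a hypothetical callable that yields
    (date, title, snippet) tuples.
    """

    def __init__(
        self,
        fetch_rows: Callable[[], Iterable[Tuple[str, str, str]]],
        storage: Storage,
    ) -> None:
        self._fetch_rows = fetch_rows
        self._storage = storage

    def run(self) -> int:
        """Store every fetched row and return how many were stored."""
        count = 0
        for date, title, snippet in self._fetch_rows():
            self._storage.add(Post(date.strip(), title.strip(), snippet.strip()))
            count += 1
        return count


class Interface:
    """Interface class: formats the stored posts for display."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def render(self) -> str:
        return "\n".join(
            f"{p.date} | {p.title} | {p.snippet}" for p in self._storage.all()
        )
```

With this layout, the interface only talks to the storage, so the scraping core can be swapped (for example, BS4 for requests_html) without touching how results are displayed.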
This adaptation also adds a new feature: the program now supports JavaScript-rendered pages. If a website or blog requires JavaScript to render its content, the scraper waits for the JavaScript to load before fetching the data.
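As a rough idea of how the JavaScript waiting can work with requests_html, here is a small sketch. The function name `fetch_rendered_html` and its parameters `wait_seconds` and `timeout` are my own assumptions, not the names used in the actual script; the `HTMLSession`, `get`, and `render(sleep=..., timeout=...)` calls are part of requests_html.

```python
def fetch_rendered_html(url, session=None, wait_seconds=2, timeout=20):
    """Fetch a page and let its JavaScript run before returning the HTML.

    `session` is optional so a prepared (or fake) session can be passed in;
    by default a requests_html HTMLSession is created.
    """
    if session is None:
        # Imported lazily so this module loads even without requests_html.
        from requests_html import HTMLSession
        session = HTMLSession()

    response = session.get(url)
    # render() runs the page in a headless Chromium (downloaded on first use).
    # `sleep` pauses after the initial render so late scripts can finish;
    # `timeout` caps how long rendering may take.
    response.html.render(sleep=wait_seconds, timeout=timeout)
    return response.html
```

A call such as `fetch_rendered_html("https://example.com")` returns the rendered page object, which can then be queried with `.xpath(...)`. The right `wait_seconds` value depends on the site, so treat the defaults here as placeholders.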
In this new version of the scraper, I’m keeping the XPath expressions in a JSON configuration file. This way, when I need to adapt the scraper to another site or blog, I only have to update the JSON file instead of the code.
Figure: Reading XPath from the JSON file
Figure: XPath code in the JSON file
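Since the screenshots are not reproduced here, the following is a minimal sketch of the idea rather than the actual code. The key names `date`, `title`, and `snippet` and the helper names `load_xpaths` and `extract_fields` are assumptions chosen to match the three fields the scraper collects. The `html` argument is expected to offer an `xpath(expression)` method, as the page object in requests_html does.

```python
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical key names, one per field the scraper collects.
REQUIRED_KEYS = ("date", "title", "snippet")


def load_xpaths(path: str) -> Dict[str, str]:
    """Read the XPath expressions from a JSON file and check the keys."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing XPath entries: {', '.join(missing)}")
    # Keep only the keys the scraper knows about.
    return {key: config[key] for key in REQUIRED_KEYS}


def extract_fields(html, xpaths: Dict[str, str]) -> Dict[str, List]:
    """Apply each configured XPath to a parsed page.

    Returns a dict mapping each field name to the list of matches.
    """
    return {name: html.xpath(expression) for name, expression in xpaths.items()}
```

A matching JSON file would then be a flat object such as `{"date": "...", "title": "...", "snippet": "..."}`, where each value is the XPath expression for that field on the target site. Checking the keys up front means a typo in the configuration fails early with a clear message instead of silently returning empty results.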
Testing The Program
For easier review, I’ve prepared a video demonstration to walk you through the process:
End
Thank you very much for reviewing and visiting my blog. I hope to see you again in the next article!