Building A Robust Python Web Scraper

Kwabena Asante
2 min read · Dec 22, 2020

Robust Web Scrapers

As 2020 draws to a close, I would like to share my findings from working on a Python web scraper project.

For those new to web scraping, the first question will be: what is web scraping?
Searching online for definitions, I found these:
- https://en.wikipedia.org/wiki/Web_scraping
- https://www.imperva.com/learn/application-security/web-scraping-attack/

Putting these together, I would say web scraping is an automated way of extracting or harvesting content and data from websites.
The programs that do this are sometimes referred to as bots.

My web scraping program was code-named Sourcedi, and the tech stack was Python with Redis as a caching layer.
I chose Python because it offers many libraries that make it easy to build web scrapers with less code.
The main libraries I used were Selenium and BeautifulSoup.
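
The post does not say what Sourcedi stored in Redis, so the following is only a sketch of one common pattern: caching fetched pages so the same URL is not downloaded twice. The names `cache_key`, `get_page`, `fetch`, and the one-hour expiry are all hypothetical. The `cache` argument is assumed to be a redis-py client (it only needs `get` and `setex`).

```python
import hashlib
from typing import Callable

CACHE_TTL_SECONDS = 3600  # hypothetical expiry; not taken from the article


def cache_key(url: str) -> str:
    """Build a fixed-length cache key from a URL."""
    return "page:" + hashlib.sha256(url.encode("utf-8")).hexdigest()


def get_page(url: str, cache, fetch: Callable[[str], str],
             ttl: int = CACHE_TTL_SECONDS) -> str:
    """Return the page for `url`, calling `fetch` only on a cache miss.

    `cache` is assumed to behave like a redis-py client:
    get(name) returns None on a miss, setex(name, time, value) stores
    a value that expires after `time` seconds.
    """
    key = cache_key(url)
    cached = cache.get(key)
    if cached is not None:
        # redis-py returns bytes unless decode_responses=True was set.
        return cached.decode("utf-8") if isinstance(cached, bytes) else cached
    html = fetch(url)
    cache.setex(key, ttl, html)
    return html
```

With a real server this might be called as `get_page(url, redis.Redis(host="localhost", port=6379), fetch)`, where the host, port, and `fetch` function are again placeholders.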

Working on this project taught me a few things:
1. BeautifulSoup won’t always work; you might need to use Selenium.
The reason is that some pages load their content using JavaScript, so the JavaScript has to run before the content appears in the page.
BeautifulSoup only parses the HTML it is given and does not execute JavaScript; Selenium drives a real browser, which does.
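
To make the first point concrete, here is a minimal sketch using a made-up page (the `RAW_HTML` string is an illustration, not a page from the project). The `<div>` is only filled in when a browser runs the script, so BeautifulSoup on its own sees it as empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <div> is empty in the raw HTML and is only
# filled in when a browser executes the script.
RAW_HTML = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById("content").textContent = "Loaded by JavaScript";
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
content = soup.find("div", id="content")

# BeautifulSoup parses the markup but never runs the script,
# so the text of the div is still empty at this point.
static_text = content.get_text(strip=True)
print(repr(static_text))
```

Feeding BeautifulSoup the HTML produced by a browser, after the script has run, is what closes this gap.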

2. Selenium can be run from the command line
I wanted to run the scraper from a Linux terminal on an EC2 instance, which typically has no display. After reading up on Selenium, I discovered that
it can be configured to run the browser in headless mode. Running in headless mode can also improve speed and performance, since no browser window has to be drawn.
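
As a rough sketch of what headless mode looks like in code, the function below starts Chrome without a window and returns the page HTML. It assumes Chrome and a matching driver are available on the machine. The function name `fetch_rendered_html` and every flag other than `--headless` are my own illustrative choices, not settings from the project.

```python
# Chrome flags. "--headless" is the one discussed in the text; the others
# are common choices on servers and are assumptions, not the author's settings.
HEADLESS_ARGS = [
    "--headless",
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--window-size=1920,1080",
]


def fetch_rendered_html(url: str) -> str:
    """Load `url` in headless Chrome and return the current page HTML."""
    # Imported here so this module can be loaded on machines without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment; content that loads
        # later may need an explicit wait, which is not shown here.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

A call such as `html = fetch_rendered_html("https://example.com")` would then work from a plain terminal session with no display attached.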

3. BeautifulSoup did better at extracting well-formatted text
I also wanted a way to extract text from the pages while keeping the output well formatted.
I couldn’t do that with Selenium alone, so I parsed the extracted HTML with BeautifulSoup, which returned well-structured text output.
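
One way to get clean text out of HTML with BeautifulSoup is sketched below. The helper name `html_to_text` and the sample `HTML` are hypothetical, and the exact cleanup rules used in the project are not stated in this post. In practice the input could be the HTML that Selenium returns.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for HTML extracted from a real page.
HTML = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <h1>Heading</h1>
  <p>First   paragraph
     with odd spacing.</p>
  <script>console.log("not content");</script>
  <p>Second paragraph.</p>
</body></html>
"""


def html_to_text(html: str) -> str:
    """Return the readable text of an HTML document, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are not meant for readers.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # stripped_strings skips whitespace-only nodes; the split/join
    # collapses runs of spaces and newlines inside each node.
    lines = (" ".join(chunk.split()) for chunk in soup.stripped_strings)
    return "\n".join(lines)


print(html_to_text(HTML))
```

Dropping `script` and `style` first matters because their contents would otherwise show up in the output as if they were page text.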

In conclusion, these are my findings from working on this project. I am open to corrections and comments. Thanks.

References

https://en.wikipedia.org/wiki/Web_scraping
https://www.imperva.com/learn/application-security/web-scraping-attack/
