Building A Robust Python Web Scraper

Dec 22, 2020

As 2020 draws to a close, I would like to share what I learned while working on a Python web scraper project.

If you are new to web scraping, your first question is probably: what is web scraping?
These are some of the definitions I found while searching online:
- https://en.wikipedia.org/wiki/Web_scraping
- https://www.imperva.com/learn/application-security/web-scraping-attack/

Putting them together, I would say web scraping is an automated way of extracting, or harvesting, content and data from websites.
The programs that do this are sometimes referred to as bots.

My version of a web scraping program was code-named Sourcedi. The tech stack was Python, with Redis as a caching layer.
I chose Python because it offers many libraries that make it easy to build web scrapers with less code.
The main libraries I used were Selenium and BeautifulSoup.
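
The post does not say what Sourcedi actually cached, so the following is only a sketch of one common arrangement: storing fetched HTML in Redis keyed by URL, so the same page is not downloaded twice within a time window. The names `get_page` and `fetch`, the `page:` key prefix and the one-hour expiry are all hypothetical. `client` is assumed to behave like a redis-py `Redis` instance; only its `get` and `setex` methods are used.

```python
from typing import Callable

CACHE_TTL_SECONDS = 3600  # hypothetical expiry; tune to how often pages change


def get_page(client, url: str, fetch: Callable[[str], str]) -> str:
    """Return the HTML for `url`, using `client` as a read-through cache.

    `client` is assumed to behave like redis.Redis: get(key) returns the
    stored value (or None on a miss) and setex(key, seconds, value) stores
    a value with an expiry. `fetch` is any function that downloads a page.
    """
    key = f"page:{url}"  # hypothetical key scheme
    cached = client.get(key)
    if cached is not None:
        # redis-py returns bytes unless decode_responses=True was set.
        return cached.decode("utf-8") if isinstance(cached, bytes) else cached
    html = fetch(url)
    client.setex(key, CACHE_TTL_SECONDS, html)
    return html
```

With a real Redis server this might be called as `get_page(redis.Redis(), url, fetch)`, where `fetch` is whatever function downloads the page.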

Working on this project taught me a few things:
1. BeautifulSoup won’t always work; you might need Selenium.
Some pages load their content with JavaScript, so JavaScript has to run before the content appears on the page.
BeautifulSoup only parses the HTML it is given and does not execute JavaScript. Selenium drives a real browser, which does.
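
To make the difference concrete, here is a small sketch (not the Sourcedi code). `STATIC_HTML` is a made-up page whose text is filled in by a script; `text_seen_by_parser` and `rendered_html` are hypothetical helper names. The first shows that BeautifulSoup sees the element as empty because the script never runs; the second assumes `driver` is a Selenium WebDriver and returns the page after the browser has processed it.

```python
STATIC_HTML = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById("content").textContent = "Loaded by JavaScript";
  </script>
</body></html>
"""


def text_seen_by_parser(html: str) -> str:
    """Return the text of the #content element as BeautifulSoup sees it."""
    # Imported lazily so the Selenium helper below works without bs4 installed.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    # The <script> is never executed, so the <div> is still empty here.
    return soup.find(id="content").get_text(strip=True)


def rendered_html(driver, url: str) -> str:
    """Load `url` in a Selenium-controlled browser and return the DOM as HTML."""
    driver.get(url)
    return driver.page_source
```

Pages that keep loading data after the initial page load may also need an explicit wait before `page_source` is read.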

2. Selenium can be run from the command line
I wanted to run the scraper from a Linux terminal on an EC2 instance, where there is typically no graphical display. After reading up on Selenium, I discovered that
it can be configured to run the browser in headless mode. Running in headless mode can also improve speed and performance, since no browser window has to be drawn.
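
A minimal sketch of a headless setup follows. The post does not say which browser was used, so Chrome (with a matching driver installed on the machine) is an assumption, and `HEADLESS_ARGS` and `build_headless_chrome` are hypothetical names.

```python
HEADLESS_ARGS = [
    "--headless",               # run Chrome without opening a window
    "--no-sandbox",             # commonly needed when Chrome runs as root on a server
    "--disable-dev-shm-usage",  # avoid crashes when /dev/shm is small
]


def build_headless_chrome():
    """Return a Chrome WebDriver configured to run without a display."""
    # Imported lazily so this module can be loaded where Selenium is absent.
    from selenium import webdriver  # pip install selenium
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

A typical use would be to call `driver = build_headless_chrome()`, load pages with `driver.get(url)`, and call `driver.quit()` when done so the browser process does not linger on the server.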

3. BeautifulSoup did better at extracting well-formatted text
I also wanted to extract text from the pages while keeping the output well formatted.
I couldn’t do that with Selenium alone, so I parsed the HTML it retrieved with BeautifulSoup, which returned well-structured text output.
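
One way this parsing step could look is sketched below. `extract_text` is a hypothetical helper, and the choice of a newline separator is an assumption about what "well formatted" meant for this project.

```python
def extract_text(html: str) -> str:
    """Return the visible text of `html`, one piece of text per line."""
    # Imported lazily so this module can be loaded where bs4 is absent.
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are code or styling rather than page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator puts each text node on its own line; strip trims whitespace
    # and skips nodes that are empty after trimming.
    return soup.get_text(separator="\n", strip=True)
```

The HTML passed in could be the `page_source` of a Selenium driver. Because every text node gets its own line, text inside inline tags such as `<b>` is also split onto separate lines, so the separator may need adjusting for a given site.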

In conclusion, these are my findings from this project. I am open to corrections and comments. Thanks.

References

- https://en.wikipedia.org/wiki/Web_scraping
- https://www.imperva.com/learn/application-security/web-scraping-attack/

Written by Kwabena Asante

Software Engineer — email: asantekwabena2013@gmail.com — https://kwabena.dev
