So you need some data, eh? Oftentimes it’s extremely difficult to find open source datasets or public APIs with exactly what you’re looking for. In these situations, the go-to is web scraping. In this blog, let’s dive into what web scraping is and walk through it step by step!

What is it?

Web scraping allows you to gather data from the website of your choice! However, each website has a different HTML structure, so web scrapers are often built to explore one specific website. It’s important to learn the following things about the website of your choice:

  • Structure of the web pages with relevant data

Website Structures

Websites are created using HTML (Hypertext Markup Language), along with CSS (Cascading Style Sheets) and JavaScript. HTML pieces are separated by tags and look like this:
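A minimal sketch of such a page, with made-up content, held in a Python string so the standard library can list its tags. The `TagLister` helper is hypothetical and only here for illustration:

```python
from html.parser import HTMLParser

# Made-up page for illustration: one heading and one paragraph.
sample_html = """<!DOCTYPE html>
<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>"""


class TagLister(HTMLParser):
    """Hypothetical helper: remember every opening tag, in order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


lister = TagLister()
lister.feed(sample_html)
print(lister.tags)  # ['html', 'body', 'h1', 'p']
```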

Here, the first heading is within “h1” and the first paragraph is within “p”. For an actual website, we need to find out which tags contain the information we’re most interested in and tell our scraper to scrape that specific information.

Website Tools

Two of the best tools to help you scrape are the Python modules requests and BeautifulSoup: requests downloads the HTML of a web page, and BeautifulSoup parses that HTML so you can search it.

Scraping Data from Bookstore

We’re interested in scraping data from the following book store — http://books.toscrape.com/. Let’s first look at the main page of the website:
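As a rough sketch of this step, assuming the requests module is installed: the `fetch_html` helper and the 10-second timeout are illustrative choices, not something the site requires.

```python
import requests

MAIN_URL = "http://books.toscrape.com/"


def fetch_html(url, timeout=10):
    """Download a page and return its raw HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on a 4xx/5xx response
    return response.text
```

Calling `print(fetch_html(MAIN_URL)[:500])` dumps the first 500 characters of the raw HTML, which arrives as one dense, hard-to-read string.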

Ewww! Let’s make this cleaner:
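One way to do that, sketched here on a tiny made-up HTML string rather than the real page, is to hand the raw HTML to BeautifulSoup and print its indented form. The parser choice ("html.parser", which ships with Python) is an assumption:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the raw HTML of a page, all on one line.
raw_html = "<html><head><title>Sample</title></head><body><h1>All products</h1></body></html>"

# Parse the string into a searchable tree.
soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns the same markup with one tag per line, indented by depth.
print(soup.prettify())
```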

Finding Book URLs

Let’s now try and grab the individual URL for each specific book — we need this because that’s the only way we’ll be able to access the details for each book.

If you go to the website, right-click on part of the page, and choose “Inspect”, you’ll be able to see the HTML tags and elements that correspond to that part of the webpage.

To access the link for a book, we need to dive into the element with the “product_pod” class using a few BeautifulSoup calls:
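Here is a minimal sketch of that lookup. The HTML below is a trimmed, assumed stand-in for one book entry (the tag nesting and the image path are illustrative), so check the real markup with “Inspect” before relying on it:

```python
from bs4 import BeautifulSoup

# Assumed, trimmed-down markup for a single book on the main page.
sample_html = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/a-light-in-the-attic_1000/index.html">
      <img src="cover.jpg" alt="A Light in the Attic">
    </a>
  </div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first <article> whose class is "product_pod";
# .div.a then walks to the first <div> inside it and the first <a> inside that.
first_book = soup.find("article", class_="product_pod")
first_url = first_book.div.a.get("href")
print(first_url)  # catalogue/a-light-in-the-attic_1000/index.html
```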

Woohoo! We got the first URL! Now let’s try to gather all of the product URLs on the main webpage. Since “find()” only returns the first match, we can do this by using the BeautifulSoup “find_all()” function, which returns every matching element:
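A sketch of that loop, again on made-up markup with two hypothetical books:

```python
from bs4 import BeautifulSoup

# Made-up markup with two book entries; the hrefs are hypothetical.
sample_html = """
<ol class="row">
  <li><article class="product_pod">
    <div class="image_container"><a href="catalogue/book-one_1/index.html">cover</a></div>
  </article></li>
  <li><article class="product_pod">
    <div class="image_container"><a href="catalogue/book-two_2/index.html">cover</a></div>
  </article></li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching <article>, so we can pull one link from each.
relative_urls = [
    article.div.a.get("href")
    for article in soup.find_all("article", class_="product_pod")
]
print(relative_urls)
```

Note that these links are relative, so they are not yet usable on their own.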

Let’s prepend the base URL to each of these relative links in a clean function and voilà, we’re ready to collect all of the book links:
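One possible shape for such a function is sketched below. The name `get_book_urls` is hypothetical, and it leans on `urljoin` from the standard library to attach the base URL instead of gluing strings together by hand. It takes the page’s HTML as an argument so it can be tried on a sample string:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def get_book_urls(page_html, page_url):
    """Return the absolute URL of every book listed in page_html."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [
        # urljoin resolves the relative href against the page it came from.
        urljoin(page_url, article.div.a.get("href"))
        for article in soup.find_all("article", class_="product_pod")
    ]


# Made-up listing with one hypothetical book.
sample_html = """
<article class="product_pod">
  <div class="image_container"><a href="catalogue/book-one_1/index.html">cover</a></div>
</article>
"""
book_urls = get_book_urls(sample_html, "http://books.toscrape.com/")
print(book_urls)  # ['http://books.toscrape.com/catalogue/book-one_1/index.html']
```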

Using this function, we can now collect all of the book URLs and then dive deeper into each URL to collect the specific data for each book.
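As a hedged sketch of that next step, here is what pulling a couple of fields from a single book page could look like. The tag and class names (“h1” for the title, “price_color” for the price) are assumptions about the page layout, and the sample HTML and `parse_book_page` helper are made up, so confirm the real markup with “Inspect” first:

```python
from bs4 import BeautifulSoup


def parse_book_page(page_html):
    """Pull the title and price out of one book page (assumed layout)."""
    soup = BeautifulSoup(page_html, "html.parser")
    title_tag = soup.find("h1")
    price_tag = soup.find("p", class_="price_color")
    return {
        # Fall back to None when a field is missing from the page.
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "price": price_tag.get_text(strip=True) if price_tag else None,
    }


# Made-up book page for illustration.
sample_book_html = """
<div class="product_main">
  <h1>Example Book</h1>
  <p class="price_color">£10.00</p>
</div>
"""
print(parse_book_page(sample_book_html))
```

Returning None for a missing field keeps the loop over many book pages from crashing on the odd page that lacks it.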

Concluding remarks:

  • You won’t always have all of the data that you need.

Data Enthusiast with a background in Engineering.
