So you need some data, eh? It’s often extremely difficult to find open-source data sets or public APIs with exactly what you’re looking for. In these situations, the go-to is web scraping. In this blog, let’s dive into what web scraping is and walk through it step by step!
What is it?
Web scraping allows you to gather data from the website of your choice! However, each website has a different HTML structure, so web scrapers are often built to explore one specific website. It’s important to learn the following things about the website of your choice:
- Structure of the web pages with relevant data
- How to access these web pages
In a simple HTML page, for example, the first heading sits within “h1” tags and the first paragraph within “p” tags. For an actual website, we need to find out which tags contain the information we’re most interested in and tell our scraper to scrape this specific information.
Two of the most popular tools to help you scrape are the Python modules requests and BeautifulSoup: requests downloads the HTML of a web page, and BeautifulSoup parses that HTML so we can search it by tag.
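To make the “h1” and “p” idea concrete, here’s a minimal sketch. The tiny HTML page is made up for illustration (it is not taken from a real site), and it assumes BeautifulSoup is installed as the `bs4` package:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this example
sample_html = """
<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first tag that matches
first_heading = soup.find("h1").get_text()
first_paragraph = soup.find("p").get_text()

print(first_heading)
print(first_paragraph)
```

Once we know which tags hold the data, pulling it out is only a line or two.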
Scraping Data from a Bookstore
We’re interested in scraping data from the following book store — http://books.toscrape.com/. Let’s first look at the main page of the website:
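One way to grab the main page might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are my own choices for illustration, not something the site requires:

```python
import requests


def fetch_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    # Raise an error on 4xx/5xx responses instead of silently parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://books.toscrape.com/")` and printing the result should dump the raw HTML of the main page as one long string.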
Ewww! Raw HTML is hard to read. Let’s make this cleaner with BeautifulSoup:
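Here’s a small sketch of the cleanup step. To keep it runnable offline, it uses a short made-up HTML string as a stand-in for the real page; in practice you would pass in the text returned by requests:

```python
from bs4 import BeautifulSoup

# Made-up one-line HTML standing in for the downloaded page
raw_html = "<html><head><title>Books</title></head><body><h1>All products</h1></body></html>"

soup = BeautifulSoup(raw_html, "html.parser")

# prettify() re-indents the document with one tag per line
pretty = soup.prettify()
print(pretty)
```

The content is the same, but the nesting of the tags is now much easier to follow.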
Finding Book URLs
Let’s now try and grab the individual URL for each specific book — we need this because that’s the only way we’ll be able to access the details for each book.
If you go to the website, right-click on a book, and choose “Inspect”, you’ll be able to see the HTML tags and elements that correspond to that part of the webpage.
To access a book’s link, we need to dive into the element with the “product_pod” class using BeautifulSoup:
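A sketch of that lookup is below. The HTML is a trimmed-down stand-in for one book entry, based on my reading of the page in the inspector, so treat the exact markup as an assumption:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one book entry (the markup is an assumption)
html = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/a-light-in-the-attic_1000/index.html">
      <img src="cover.jpg" alt="A Light in the Attic">
    </a>
  </div>
  <h3>
    <a href="catalogue/a-light-in-the-attic_1000/index.html"
       title="A Light in the Attic">A Light in the ...</a>
  </h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with the underscore) filters by CSS class; "class" is a Python keyword
first_pod = soup.find("article", class_="product_pod")

# The first <a> inside the pod carries the relative link to the book's page
first_url = first_pod.find("a")["href"]
print(first_url)
```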
Woohoo! We got the first URL! Now let’s try to gather all of the product URLs on the main webpage. Since “find()” only returns the first match, we can do this by using the BeautifulSoup “find_all()” function:
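Here’s a minimal sketch of collecting every link at once. Again, the HTML is a shortened stand-in for the main page, with three sample entries instead of the full listing:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the main page (markup is an assumption)
html = """
<article class="product_pod"><h3><a href="catalogue/a-light-in-the-attic_1000/index.html">A Light in the ...</a></h3></article>
<article class="product_pod"><h3><a href="catalogue/tipping-the-velvet_999/index.html">Tipping the Velvet</a></h3></article>
<article class="product_pod"><h3><a href="catalogue/soumission_998/index.html">Soumission</a></h3></article>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of every matching tag, not just the first one
pods = soup.find_all("article", class_="product_pod")
relative_urls = [pod.find("a")["href"] for pod in pods]

for url in relative_urls:
    print(url)
```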
The links we get back are relative, so let’s add the base URL to them in a clean function and voilà, we’re ready to collect all of the book links:
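One possible version of such a function is sketched below. The name `get_book_urls` is hypothetical, and I’m using `urljoin` from the standard library to glue the base URL and the relative link together rather than plain string concatenation:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://books.toscrape.com/"


def get_book_urls(html, base_url=BASE_URL):
    """Return the absolute URL of every book found in the given page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    pods = soup.find_all("article", class_="product_pod")
    # urljoin handles slashes for us when combining base and relative parts
    return [urljoin(base_url, pod.find("a")["href"]) for pod in pods]


# Shortened stand-in for the main page (markup is an assumption)
sample_page = """
<article class="product_pod"><h3><a href="catalogue/a-light-in-the-attic_1000/index.html">A Light in the ...</a></h3></article>
<article class="product_pod"><h3><a href="catalogue/tipping-the-velvet_999/index.html">Tipping the Velvet</a></h3></article>
"""

book_urls = get_book_urls(sample_page)
print(book_urls)
```

Taking the HTML as an argument keeps the function easy to try out on a saved page before pointing it at the live site.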
Using this function, we can now collect all of the book URLs and then dive deeper into each URL to collect the specific data for each book.
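As a rough sketch of what “diving deeper” could look like, the snippet below pulls a title and price out of a single book page. The function name `parse_book` is hypothetical, and the tag and class names (“product_main”, “price_color”) are my assumptions about the book page layout, so check them in the inspector first:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a single book page (markup is an assumption)
book_html = """
<div class="product_main">
  <h1>A Light in the Attic</h1>
  <p class="price_color">£51.77</p>
</div>
"""


def parse_book(html):
    """Extract the title and price from one book page."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("div", class_="product_main")
    return {
        "title": main.find("h1").get_text(strip=True),
        "price": main.find("p", class_="price_color").get_text(strip=True),
    }


book = parse_book(book_html)
print(book)
```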
Takeaways
- You won’t always have all of the data that you need.
- BeautifulSoup and requests are easy-to-use tools for scraping data from websites, so you can collect the data you want yourself.