Let’s Build a Web Scraper with Python & BeautifulSoup4
This post was originally published on my blog — https://thecodingpie.com
Ever wondered how to automate the process of scraping a website, collecting data, and exporting it to a useful format like CSV? If you are doing data science/machine learning then you may have been in this situation several times.
That’s why I wrote this tutorial. In it, you will learn all about Web Scraping by building a Python script that scrapes a movie website, fetches useful information, and finally exports the collected data to a CSV (Comma Separated Values) file.
And the good thing is that you won’t have to do web scraping by hand anymore!
Sounds interesting? Then let’s jump right in.
You can download the finished code here from my Github repo — Web Scraper
What is Web Scraping
Web Scraping is the process of collecting useful/needed information from any website on the internet. Like any other process, there are two ways to do it: one is to manually copy-paste the needed data from the website. And the other way, the way of legends, is to smartly automate it!
I hope you want to be in the second category. But there are some challenges in doing so…
The first challenge is that not all website owners like having their website scraped. So if you are going to scrape a website, please make sure they allow you to do so.
The second challenge is that not all websites are alike. The script you wrote for one website can’t be used on other websites, because their structures are entirely different. You may not even be able to use the same script on the same website after several days, because web developers change their website’s layout all the time to battle web scrapers.
Web Scraping Alternative
If there are so many challenges, is there any alternative? Yes: an API (Application Programming Interface) is the only legal and stable way of getting data from a website.
Most websites provide an API through which you can get the data you want in a sweeter format like JSON or XML. But there’s a catch: you may have to pay money. Of course, there may be a free plan, but in the long run you will have to pay in order to use their precious data.
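For context, here is a tiny sketch of why APIs feel so much nicer: they return structured data you can use directly. The payload and field names below are made up purely for illustration, not taken from any real movie API.

```python
import json

# A made-up JSON payload, similar in spirit to what a movie API
# might return (field names are invented for this illustration)
response_body = '''
{
    "movies": [
        {"title": "The Shawshank Redemption", "rating": 9.3},
        {"title": "The Godfather", "rating": 9.2}
    ]
}
'''

# No HTML parsing needed: the API already gives us structured data
data = json.loads(response_body)
for movie in data["movies"]:
    print(movie["title"], movie["rating"])
```

With the requests library, you would typically get such a payload via requests.get(url).json(). Scraping exists precisely for the sites that don’t offer anything like this.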
That’s where the concept of web scraping comes in handy!
What We are Going to Build
We will learn all about Web Scraping using Python and BeautifulSoup4 by building a real-world project.
I don’t want to give you a headache by teaching you how to scrape an ever-changing dynamic website. So, I built a static movie website, named TopMovies, which contains a list of the top 25 IMDb movies. This is the website we are going to scrape. So before moving forward please examine it first — TopMovies.
See, the TopMovies website has a list of the top 25 IMDb movies. Each movie holds the following details:
- title
- genre
- rating
- length — movie runtime
- year
- budget
- gross
- img
We are going to scrape those details from the TopMovies website. Then, after getting all those details, we will export them to a useful format like CSV, so that you can later import the data into your data science project and make predictions without any worry!
In short, we are going to scrape the following website:
Export the scraped data into a CSV file like this:
And, later, if you want, you can read it as a Pandas DataFrame like below inside Jupyter notebook and you can do all your analysis and predictions easily!:
If you are not a Data Science/Machine Learning person, then don’t worry about this last image, just forget it!
By doing this simple project, you will learn the skills to build any sort of Web Scraper capable of scraping almost any website you wish to scrape. You will also learn how to generate CSV files using Python.
How we are going to do that?
It’s very straightforward:
- First, we will fetch the web page we want using the requests library.
- Then, we will turn that page into a BeautifulSoup object with the help of a suitable parser like lxml. This will make the scraping process a lot easier.
- Then we will scrape all the needed data from that soup object.
- Finally, we will export all the scraped data into a file called top25.csv with the help of the csv module.
That’s it!
Prerequisites
- You should be comfortable with Python 3.
- You should have a decent understanding of HTML and a little bit of CSS.
- You should have Python 3.4 or a higher version installed on your computer. You can read this post to learn how to set up Python 3 on any operating system — https://realpython.com/installing-python/
- You should have venv installed.
- Finally, you will need a modern code editor like Visual Studio Code. You can download Visual Studio Code for your operating system here — https://code.visualstudio.com/download.
With these things set up, now let’s get started.
Initial Setups
- First, create a folder named web_scraper anywhere on your computer.
- Then open it inside Visual Studio Code.
- Now let’s create a new virtual environment using venv and activate it. To do that:
- From within your text editor, Open Terminal > New Terminal.
- Then type:
python3 -m venv venv
This command will create a virtual environment named venv for us.
- To activate it, if you are on windows, type the following:
venv\Scripts\activate.bat
- If you are on Linux/Mac, then type this instead:
source venv/bin/activate
Now you should see something like this:
- Finally, create a new file named scraper.py directly inside the web_scraper folder:
Now you should have a file structure similar to this:
Note: If you are still confused about how to set up a virtual environment, then read this Quick Guide.
That’s it, you have completed the initial setup. Now it’s time for the fun stuff!
Getting the WebPage
Let me ask you a question: what would you do first in order to scrape a website manually, I mean, to copy-paste the data from a website?
First, you would open up the web browser and type the URL, right? Because in order to get data from a web page, you must load it first. And that’s exactly what we are going to do here.
First, we need to load the web page from the website. But we are not going to use the web browser at all. Instead, we are going to use a Python module called requests.
So, type the following command in the terminal and install the requests module:
pip install requests
Then, in the scraper.py
file type:
import requests

# fetch the web page
page = requests.get('https://the-coding-pie.github.io/top_movies/')
- This code will get the whole Response object from the URL https://the-coding-pie.github.io/top_movies/. But we want the web page content itself, right?
In order to get the web page content, you have to access the content attribute on the page variable. If you print(page.content):
then you will see the HTML of the web page like this:
But there’s a problem there. If you take a look at type(page.content):
then you can see it is of the type bytes:
We can’t do much with those raw bytes! The bytes type is useless here unless we convert it into some other, more useful format.
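To see the difference for yourself, here is a small standalone illustration (the HTML snippet is made up) of how the bytes type compares to a plain string:

```python
# A tiny piece of HTML as raw bytes, just like page.content gives us
raw = b'<h3> The Godfather </h3>'
print(type(raw))  # <class 'bytes'>

# .decode() turns the bytes into a normal string...
text = raw.decode('utf-8')
print(type(text))  # <class 'str'>

# ...but even then it is one flat string: there is no easy way to ask
# it for "the text inside the <h3> tag". That's what a parser is for.
```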
What should we do now?
BeautifulSoup for the Rescue!
Beautiful Soup is a Python library for pulling data out of HTML and XML content like the above. BeautifulSoup, with the help of a parser, transforms a complex HTML document into a complex tree of Python objects.
Note: I don’t want to go in-depth about how the BeautifulSoup works in this tutorial. If you are curious to know that, then please use this link — Official Beautiful Soup Docs.
In short, with the help of BeautifulSoup and a parser, we can easily navigate, search, scrape, and modify parsed HTML/XML content like the above (bytes type) by treating everything in it as a Python object!
So, let’s install BeautifulSoup4 and a parser like lxml. Type the following commands in the terminal window:
pip install beautifulsoup4
pip install lxml
lxml is the parser recommended by the BeautifulSoup community. There are also alternatives like html5lib, but we are going to stick with the lxml parser.
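Before wiring BeautifulSoup into our scraper, here is a minimal standalone sketch of what it does. The markup is invented to mirror one movie card, and it uses Python’s built-in html.parser so you can try it even before installing lxml (the BeautifulSoup API is the same either way):

```python
from bs4 import BeautifulSoup

# Made-up markup, shaped roughly like one movie card on TopMovies
html = b'<div><h3> Inception </h3><p class="genre"> Sci-Fi </p></div>'

# html.parser ships with Python; our real script will pass 'lxml' here
mini_soup = BeautifulSoup(html, 'html.parser')

# The bytes have become a tree of Python objects we can query
print(mini_soup.h3.string.strip())                         # Inception
print(mini_soup.find('p', class_='genre').string.strip())  # Sci-Fi
```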
Now type the following code in the scraper.py
file at the very top:
from bs4 import BeautifulSoup
- Here we are importing BeautifulSoup from the bs4 package.
Then below this line — page = requests.get('https://the-coding-pie.github.io/top_movies/'), type the following:
# turn page into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'lxml')
- Here, we are converting our page.content, which is of type bytes, to a BeautifulSoup object.
Let’s Scrape the Page
Now we have that whole web page in our hands (in a useful format). One of the two jobs left is to scrape it. So let’s do that. We need to scrape the following things from the web page:
- titles — all the movie titles
- genres — all the genres
- ratings — all the movie ratings
- lengths — all the movie runtimes
- years — all the years the movies were released
- budgets — all the budgets
- grosses — all the gross information
- img_urls — the src URLs of all the images
So let’s do them one by one.
First, let’s scrape all the titles.
The title we are looking for is inside an HTML element called <h3>. Wait, how do I know that?
It’s simple:
- Open up the URL you want to scrape for inside a browser.
In our case, open this TopMovies website. Then:
- Inspect the data with the help of Developer Tools on your browser. In my case, I am using Chrome, so
- right-click on the element you want to scrape,
- And click on Inspect
- Now a new box will pop up like this:
- And see, I told you that the title we are looking for is inside an HTML element called <h3>:
Now we know where our data is sitting, let’s scrape it. Type this below the last line you typed:
""" first, scraping using find_all() method """
# scrape all the titles
titles = []
for h3 in soup.find_all('h3'):
    titles.append(h3.string.strip())
Here’s the line by line explanation of the above code:
- We are going to store all our titles inside a list called titles, and that’s what we are doing in the first line: creating that titles list.
- Then we use the find_all() method on the soup object we created earlier to find all the h3 elements we need. The find_all() method returns an iterable list, so we loop through all the found h3 elements and…
- In the last line, inside the for loop, we take the h3.string value. Why the string value? Because each h3 as a whole looks like <h3> Title inside </h3>, but we only need the innermost string inside it, right? So we use h3.string. After taking it, we .strip() it to remove all the surrounding whitespace. Then we .append() it to the titles list.
Whoo, there’s a lot going on in there. So please take a moment to understand it. This is the exact step we are going to repeat from here on to scrape all the other data.
The soup object we initially created from the HTML bytes data gives us many built-in methods to easily navigate and scrape the HTML tree. find_all() is just one of them. We will explore a few more as we go down the road.
And the reason we are storing our scraped data in Python lists is that it will be a lot easier to convert those lists into a CSV file later.
Now we have scraped all the titles we need. Let’s move on to scraping all the genres. Type the following code:
# genres
genres = []
for genre in soup.find_all('p', class_='genre'):
    genres.append(genre.string.strip())
- It’s very similar, but this time we are finding all the <p> elements with class_='genre'. class is a reserved keyword in Python, so we can’t use it, and that’s why we give the underscore (_) after class.
The rest is self-explanatory.
Now let’s scrape all the ratings, but using a different method called select():
""" scraping using css_selector eg: select('span.class_name') """
# ratings, selecting all span with class="rating"
ratings = []
for rating in soup.select('span.rating'):
    ratings.append(rating.string.strip())
- The select() method is used to find all the elements matching a CSS-selector-like syntax. Here we are selecting all the span elements with the class rating, like this — span.rating. Then we store them in a Python list named ratings.
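If you want to play with select() in isolation, here is a tiny self-contained sketch. The markup is made up to mirror the TopMovies spans, and it uses the built-in html.parser just for the demo:

```python
from bs4 import BeautifulSoup

# Invented markup mirroring the rating/year spans on TopMovies
html = '''
<span class="rating"> 9.3 </span>
<span class="rating"> 9.2 </span>
<span class="year"> 1994 </span>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# 'span.rating' matches only the spans with class="rating",
# exactly as the same CSS selector would in a stylesheet
demo_ratings = [span.string.strip() for span in demo_soup.select('span.rating')]
print(demo_ratings)  # ['9.3', '9.2']
```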
Now it’s time for a small exercise. Using the select() method, you have to scrape all the lengths (movie runtimes) and years (the year each movie was released). I can give you two hints:
- Each movie length is inside a span with the class length (span.length).
- Each year is inside a span with the class year (span.year).
The code will be very similar to the above piece of code. You just have to change the corresponding parts.
If you did it, then congratulations! Make sure to cross-check your code with the solution below. If you were unable to do it, then no worries, just type in the following solution.
The solution:
# lengths, selecting all span with class="length"
lengths = []
for length in soup.select('span.length'):
    lengths.append(length.string.strip())

# years, selecting all span with class="year"
years = []
for year in soup.select('span.year'):
    years.append(year.string.strip())
- I think no explanation is needed here.
The remaining things to scrape are the budgets, grosses, and img_urls. Here we are going to use the good old find_all() method to do that:
""" scraping by navigating through elements eg: div.span.string """
# budget
budgets = []
for budget in soup.find_all('div', class_='budget'):
# from <div class="budget"></div>, get the span.string
budgets.append(budget.span.string.strip())# gross
grosses = []
for gross in soup.find_all('div', class_='gross'):
grosses.append(gross.span.string.strip())
""" parsing all the "src" attribute's value of <img /> tag """
img_urls = []
for img in soup.find_all('img', class_='poster'):
img_urls.append(img.get('src').strip())
- The one thing to note here is that, in the last few lines, we get the img’s src attribute, because that’s where the image’s URL is located. To access any of the attributes of an element, we can use the .get() method after finding that particular element. Beautiful Soup stores each element’s attributes as a Python dictionary when converting the bytes type to a BeautifulSoup object, and that’s why we use the .get() method: to access the values in that dictionary.
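Here is a small standalone illustration of that attribute dictionary, using a made-up poster tag:

```python
from bs4 import BeautifulSoup

# An invented poster tag, shaped like the ones on TopMovies
html = '<img class="poster" src="images/poster1.jpg" alt="poster" />'
img_tag = BeautifulSoup(html, 'html.parser').find('img', class_='poster')

# All the tag's attributes live in a plain Python dict called .attrs
print(img_tag.attrs)  # e.g. {'class': ['poster'], 'src': 'images/poster1.jpg', ...}

# .get() reads one attribute, returning None (instead of raising
# an error) when the attribute does not exist
print(img_tag.get('src'))      # images/poster1.jpg
print(img_tag.get('data-id'))  # None
```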
And that’s it, we have successfully scraped all the needed data.
Now let’s export those data to a CSV file.
Creating a CSV file
In order to generate CSV files using Python, we need a module named csv. It’s a built-in module, so you don’t have to install it. You just have to import it at the very top of the scraper.py file.
So type this at the very top:
import csv
Now at the very bottom of the file, type the following code:
""" writing data to CSV """# open top25.csv file in "write" mode
with open('top25.csv', 'w') as file:
# create a "writer" object
writer = csv.writer(file, delimiter=',') # use "writer" obj to write
# you should give a "list"
writer.writerow(["title", "genre", "ratings", "length", "year", "budget", "gross", "img_url"]) for i in range(25):
writer.writerow([
titles[i],
genres[i],
ratings[i],
lengths[i],
years[i],
budgets[i],
grosses[i],
img_urls[i]
])
- First, we open the file in 'w' (write) mode. If no file with the given filename exists, it will create one; if such a file already exists, it will be overwritten. Here we are opening/creating a new file named top25.csv. (Tip: the csv docs recommend opening the file with newline='' so no extra blank lines sneak in between rows on Windows.)
- Then we create a csv.writer() object, giving it the file and the comma ',' as the delimiter character.
- Then, using that writer object, we call writerow(). The first row we wrote is for the captions; you can think of them as the table headings.
- Then finally, we loop 25 times, and in each iteration we write one row. Each row is all about a single movie.
That’s it, let’s try to run our script. I hope your terminal (inside your code editor) is already open and your venv is active. Now type this:
python scraper.py
If everything went smoothly, then you should have a new file named top25.csv in the same directory, and it will contain data like this:
If you got any errors, then please make sure that the code you typed up to this point inside the scraper.py file exactly matches the final code below…
Final Code
import requests
from bs4 import BeautifulSoup
import csv

# fetch the web page
page = requests.get('https://the-coding-pie.github.io/top_movies/')

# turn page into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'lxml')

""" first, scraping using find_all() method """

# scrape all the titles
titles = []
for h3 in soup.find_all('h3'):
    titles.append(h3.string.strip())

# genres
genres = []
for genre in soup.find_all('p', class_='genre'):
    genres.append(genre.string.strip())

""" scraping using css_selector eg: select('span.class_name') """

# ratings, selecting all span with class="rating"
ratings = []
for rating in soup.select('span.rating'):
    ratings.append(rating.string.strip())

# lengths, selecting all span with class="length"
lengths = []
for length in soup.select('span.length'):
    lengths.append(length.string.strip())

# years, selecting all span with class="year"
years = []
for year in soup.select('span.year'):
    years.append(year.string.strip())

""" scraping by navigating through elements eg: div.span.string """

# budget
budgets = []
for budget in soup.find_all('div', class_='budget'):
    # from <div class="budget"></div>, get the span.string
    budgets.append(budget.span.string.strip())

# gross
grosses = []
for gross in soup.find_all('div', class_='gross'):
    grosses.append(gross.span.string.strip())

""" parsing all the "src" attribute's value of <img /> tag """
img_urls = []
for img in soup.find_all('img', class_='poster'):
    img_urls.append(img.get('src').strip())

""" writing data to CSV """

# open top25.csv file in "write" mode
# (newline='' prevents blank lines between rows on Windows)
with open('top25.csv', 'w', newline='') as file:
    # create a "writer" object
    writer = csv.writer(file, delimiter=',')

    # use the "writer" obj to write; you should give it a "list"
    writer.writerow(["title", "genre", "ratings", "length", "year", "budget", "gross", "img_url"])

    for i in range(25):
        writer.writerow([
            titles[i],
            genres[i],
            ratings[i],
            lengths[i],
            years[i],
            budgets[i],
            grosses[i],
            img_urls[i]
        ])
Wrapping Up
I hope you enjoyed this tutorial. In some places, I intentionally skipped the explanation, because the code was simple and self-explanatory. That’s why I left it to you to decode it on your own.
True learning takes place when you try things on your own. Simply following a tutorial won’t make you a better programmer. You have to use your own brain.
If you still have any errors, first try to resolve them on your own by googling.
Only if you can’t find a solution should you comment below. You should know how to find and resolve a bug on your own, and that’s a skill every programmer should have!
And that’s it, Thank you ;)