Web Scraping With Beautiful Soup: An Introduction

One of the most popular libraries for web scraping in Python is Beautiful Soup.

Beautiful Soup is a Python library for extracting data from HTML and XML documents. It converts the document into a parse tree that can then be traversed and searched to locate the elements you want to extract.

A few distinguishing features of Beautiful Soup compared to other Python libraries are:

  • Beautiful Soup can parse documents with broken, incomplete, misspelled, or missing tags.
  • Only selected sections of a document can be parsed, saving memory and time.
  • This library is able to handle duplicate and multi-valued attributes.
  • Beautiful Soup detects and handles document encoding automatically. Encoding details can also be passed to the Beautiful Soup constructor explicitly.
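As a quick illustration of the first point, Beautiful Soup will happily build a tree from markup with missing closing tags. This sketch uses the built-in 'html.parser' to keep dependencies minimal; the parser choice for the tutorial itself is discussed further below.

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: </b>, </p>, </body> and </html> are all missing
broken_html = "<html><body><p>Hello <b>world"

soup = BeautifulSoup(broken_html, "html.parser")

# The tree is still usable: open tags are closed at end of input
print(soup.b.get_text())  # -> world
print(soup.p.get_text())  # -> Hello world
```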

You can learn more about Beautiful Soup in its official documentation.

In this tutorial, we will scrape a sandbox website containing a list of countries, scrapethissite.com, and export the data to CSV.

Getting Started

Open up your favorite text editor.

Import all the necessary libraries. You can install any you don’t have yet with pip install [library]

from bs4 import BeautifulSoup as BSoup
import requests
import csv

Let’s define the important variables.

url = "https://www.scrapethissite.com/pages/simple" # the target URL to be scraped
columns = ['name', 'capital', 'population', 'area'] # the exported table headers
dataSet = [] # where we store the scraped data and will be exported as CSV
response = requests.get(url) # fetch the HTML content from the URL

Scraping The Data

Loading the content into a Beautiful Soup object gives us access to its attributes and methods, such as .text, .find(), and .find_all(), which make it easy to extract content from the page. So, that’s exactly what we’ll do.
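To get a feel for these before touching the real page, here is a tiny self-contained example. The snippet and its class names are made up for illustration, and it uses the built-in 'html.parser'; we will pick a parser for the real page in a moment.

```python
from bs4 import BeautifulSoup

# A made-up snippet mimicking the shape of the page we will scrape
html = """
<div class="country"><h3>Andorra</h3><span class="country-capital">Andorra la Vella</span></div>
<div class="country"><h3>Albania</h3><span class="country-capital">Tirana</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("div")   # .find() returns only the first matching tag
print(first.h3.text)       # -> Andorra

# .find_all() returns every matching tag
for block in soup.find_all("div"):
    print(block.find(attrs={"class": "country-capital"}).text)
```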

During this process, we need to choose a parser or a standard HTML format for the object to use.

In this case, we will be using 'html5lib', an HTML parser provided by the html5lib library (install it with pip install html5lib if needed).

The html5lib parser is known for its lenient parsing behavior and its ability to handle poorly formed HTML documents, making it a popular choice for parsing HTML in web scraping applications.

soup = BSoup(response.content, 'html5lib')

Use your browser’s inspect element tool to check where the data we want to scrape are located.

[Screenshot: inspecting the country blocks in the browser’s developer tools]

Now, we will use .find_all() to get all the country blocks (where the data are located), marked by the col-md-4 and country classes.

countries = soup.find_all(attrs={'class' : 'col-md-4 country'})
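If you prefer CSS selectors, soup.select() is an equivalent way to grab the same blocks. This sketch runs against a small made-up stand-in for the real page, so the counts here are illustrative only.

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the structure of the real page
html = """
<div class="col-md-4 country"><h3>Andorra</h3></div>
<div class="col-md-4 country"><h3>Albania</h3></div>
<div class="col-md-4 other"><h3>Not a country</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# .select() takes a CSS selector; .col-md-4.country requires both classes
countries = soup.select("div.col-md-4.country")
print(len(countries))  # -> 2
```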

Loop through each country block to extract their name, capital, population and area.

Don’t forget to append them all into the dataSet.

for country in countries:
    name = country.find('h3').text.strip()
    capital = country.find(attrs={'class' : 'country-capital'}).text.strip()
    population = country.find(attrs={'class' : 'country-population'}).text.strip()
    area = country.find(attrs={'class' : 'country-area'}).text.strip()

    dataSet.append([name, capital, population, area])
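Note that everything scraped from the page arrives as a string. If you plan to do arithmetic on population or area later, convert them first. A minimal sketch, using a made-up sample row in place of the scraped data:

```python
# Rows have the shape [name, capital, population, area], all strings
sample = [
    ["Andorra", "Andorra la Vella", "84000", "468.0"],
]

# Convert the numeric columns so they can be used in calculations
converted = [
    [name, capital, int(population), float(area)]
    for name, capital, population, area in sample
]
print(converted[0])  # -> ['Andorra', 'Andorra la Vella', 84000, 468.0]
```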

Exporting To CSV

Finally, we will export the scraped dataSet to CSV.

def writeto_csv(data, filename, columns):
    with open(filename, 'w', newline='', encoding="UTF-8") as file:
        writer = csv.writer(file)
        writer.writerow(columns)  # write the header row
        writer.writerows(data)    # write all scraped rows at once

writeto_csv(dataSet, 'countries.csv', columns)
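To sanity-check an export, you can read the file back with csv.DictReader. The sketch below writes and re-reads a small made-up sample (to a hypothetical countries_check.csv) so it is self-contained:

```python
import csv

columns = ["name", "capital", "population", "area"]
rows = [["Andorra", "Andorra la Vella", "84000", "468.0"]]

# Write a header row followed by the data rows
with open("countries_check.csv", "w", newline="", encoding="UTF-8") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)

# Read it back: DictReader maps each row to the header names
with open("countries_check.csv", newline="", encoding="UTF-8") as f:
    for record in csv.DictReader(f):
        print(record["name"], record["capital"])  # -> Andorra Andorra la Vella
```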

Full Code

from bs4 import BeautifulSoup as BSoup
import requests
import csv

url = "https://www.scrapethissite.com/pages/simple"
columns = ['name', 'capital', 'population', 'area']
dataSet = []
response = requests.get(url)

soup = BSoup(response.content, 'html5lib')

countries = soup.find_all(attrs={'class' : 'col-md-4 country'})

for country in countries:
    name = country.find('h3').text.strip()
    capital = country.find(attrs={'class' : 'country-capital'}).text.strip()
    population = country.find(attrs={'class' : 'country-population'}).text.strip()
    area = country.find(attrs={'class' : 'country-area'}).text.strip()

    dataSet.append([name, capital, population, area])

def writeto_csv(data, filename, columns):
    with open(filename, 'w', newline='', encoding="UTF-8") as file:
        writer = csv.writer(file)
        writer.writerow(columns)  # write the header row
        writer.writerows(data)    # write all scraped rows at once

writeto_csv(dataSet, 'countries.csv', columns)

Conclusion

That’s how you can scrape a website with Beautiful Soup.

It’s a good starting exercise for Beautiful Soup, which has many more features for you to explore.

Practice is the key to mastering web scraping.

I have other beginner-friendly tutorials scraping the same website with lxml and with PyQuery; they could be a good fit if you want to expand your arsenal of web scraping tools. Check out my tutorials on web scraping with lxml and web scraping with PyQuery.

Happy scraping!
