Beautiful Soup: A Python Library for Web Scraping

Beautiful Soup is a Python library for extracting data from XML and HTML files. It creates a parse tree from the page’s source code, which can be used to extract the data in a hierarchical and readable manner.

Introduction

Beautiful Soup works with a parser of your choice, such as Python’s built-in html.parser, to provide idiomatic ways to navigate, search, and modify the parse tree. For typical scraping tasks, this replaces hand-written string or regular-expression parsing and can save hours of work.

Installation

You can install Beautiful Soup using pip:


pip install beautifulsoup4

Or you can download it from source and install it manually.

Basic Usage

Here is a basic usage example:


from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Lorem ipsum</title></head>
<body>
<p class="title"><b>Stet clita</b></p>

<p class="story">Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
<a href="http://example.com/test1" class="link" id="link1">Link1</a>,
<a href="http://example.com/test2" class="link" id="link2">Link2</a> and
<a href="http://example.com/test2" class="link" id="link3">Link3</a>;
sed diam voluptua.</p>

<p class="story">Lorem ipsum dolor sit amet, consetetur sadipscing elitr, ...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

In this example, we first import the BeautifulSoup class from the bs4 module. We then create an instance of the BeautifulSoup class by passing in our HTML document and specifying that we want to use Python’s built-in HTML parser.
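The same constructor also accepts an open file handle instead of a string. The following is a minimal sketch of that variant; the file name page.html and its contents are made up for illustration and are not part of the example above.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small, hypothetical HTML file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>From a file</title></head><body></body></html>")

    # Pass the open file handle directly; Beautiful Soup reads it for us.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

file_title = file_soup.title.string
print(file_title)
```

Parsing happens inside the with block, so the resulting object can still be used after the file is closed.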

Once you’ve created a Beautiful Soup object, you can navigate the parse tree like so:


print(soup.title)  # <title>Lorem ipsum</title>
print(soup.title.name)  # title
print(soup.title.string)  # Lorem ipsum

In this example, we navigate the parse tree by accessing different parts of it. The title attribute gives us direct access to the <title> tag in our document. We can then use the name and string attributes to get the name of the tag and its contents respectively.
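Beyond name and string, a tag also exposes its attributes and its position in the tree. Here is a short sketch using a small stand-in document (hypothetical, chosen to keep the example self-contained):

```python
from bs4 import BeautifulSoup

# A small stand-in document, not the one used earlier in this article.
html = (
    "<html><head><title>Demo</title></head>"
    '<body><p class="intro" id="first">Hello <b>world</b></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

p = soup.p                             # first <p> tag in the document
p_classes = p["class"]                 # class is multi-valued, so this is a list
p_id = p.get("id")                     # .get() returns None for a missing attribute
bold_parent = soup.b.parent.name       # name of the tag that encloses <b>
title_parent = soup.title.parent.name  # name of the tag that encloses <title>

print(p_classes, p_id, bold_parent, title_parent)
```

Using p.get("id") rather than p["id"] is usually the safer choice when an attribute may be absent, since the dictionary-style lookup raises KeyError in that case.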

Searching the Parse Tree

You can search the parse tree for specific tags using various methods:


print(soup.find_all('a'))  
# Output: [<a class="link" href="http://example.com/test1" id="link1">Link1</a>, <a class="link" href="http://example.com/test2" id="link2">Link2</a>, <a class="link" href="http://example.com/test2" id="link3">Link3</a>]

In this example, we use the find_all method to find all <a> tags in our document. This returns a list of Tag objects, each representing one match.
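A few other common search patterns are sketched below: collecting attribute values, finding a single tag, filtering by id or class, and using a CSS selector. The markup is a hypothetical stand-in, and select relies on the soupsieve package that pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, kept small so the results are easy to follow.
html = """
<p class="story">
<a href="http://example.com/a" class="link" id="link1">A</a>
<a href="http://example.com/b" class="link external" id="link2">B</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = [a["href"] for a in soup.find_all("a")]   # every link target
first = soup.find("a")                            # first match only
by_id = soup.find(id="link2")                     # filter on an attribute
by_class = soup.find_all("a", class_="external")  # class_ avoids the keyword
css = soup.select("p.story > a.link")             # CSS selector syntax
missing = soup.find("table")                      # no match gives None

print(hrefs)
```

Note that find returns None when nothing matches, while find_all returns an empty result, so code that chains attribute access onto find should check for None first.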

Modifying the Parse Tree

You can modify the parse tree by changing tag names and attributes:


tag = soup.a
print(tag)  # <a class="link" href="http://example.com/test1" id="link1">Link1</a>

# Change the tag's name:
tag.name = 'strong'
print(tag)  # <strong class="link" href="http://example.com/test1" id="link1">Link1</strong>

In this example, we first access an <a> tag and print it out. We then change the tag’s name to ‘strong’, effectively changing the tag type while keeping its attributes and contents.
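Attributes and text can be changed in a similar way, because a tag behaves like a dictionary for its attributes. The sketch below uses a single hypothetical link rather than the document from earlier:

```python
from bs4 import BeautifulSoup

# One hypothetical link, used only to illustrate attribute edits.
soup = BeautifulSoup(
    '<a href="http://example.com/old" id="link1">Old text</a>', "html.parser"
)
tag = soup.a

tag["href"] = "http://example.com/new"  # overwrite an existing attribute
tag["title"] = "Example"                # add a new attribute
del tag["id"]                           # remove an attribute
tag.string = "New text"                 # replace the tag's contents

print(tag)
```

These changes are made in place, so they are reflected in soup the next time it is printed.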

Outputting HTML and XML

You can output your modified parse tree as HTML or XML:


print(str(soup))  # Prints the entire document
print(soup.prettify())  # Prints the entire document with pretty formatting

In this example, we use the str function to convert our modified parse tree to a single string of markup, which we then print. The prettify method returns the same markup with each tag on its own line and indented, which is easier to read.
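When only the human-readable text is needed rather than the markup, get_text is the usual tool. A minimal sketch comparing the three forms on a one-line hypothetical fragment:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical fragment to compare the output forms.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

compact = str(soup)        # markup as a single string
pretty = soup.prettify()   # markup spread over indented lines
text = soup.get_text()     # text content only, tags removed

print(compact)
print(pretty)
print(text)
```

Because prettify inserts whitespace, it is best suited to inspection and debugging; str is the safer choice when the output will be written back out as a document.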

Conclusion

Beautiful Soup is a powerful library for web scraping in Python. It provides simple and Pythonic ways to navigate, search, and modify parse trees. With Beautiful Soup, you can easily extract the data you need from HTML and XML documents.
