Beautiful Soup is a Python library for extracting data from XML and HTML files. It creates a parse tree from the page’s source code, which can be used to extract the data in a hierarchical and readable manner.
Table of Contents
- Introduction
- Installation
- Basic Usage
- Navigating the Parse Tree
- Searching the Parse Tree
- Modifying the Parse Tree
- Outputting HTML and XML
- Conclusion
Introduction
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways to navigate, search, and modify the parse tree. It typically saves programmers hours or days of work.
Installation
You can install Beautiful Soup using pip:
pip install beautifulsoup4
Alternatively, you can download the source and install it manually.
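To confirm the installation worked, a quick check like the following can help. It is only a sketch: it relies on the fact that the pip package beautifulsoup4 is imported under the name bs4, and the variable name installed_version is chosen here for illustration.

```python
# Sanity check: the package installed as "beautifulsoup4" is imported as "bs4".
import bs4

# Version string of the installed package.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.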
Basic Usage
Here is a basic usage example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>Lorem ipsum</title></head>
<body>
<p class="title"><b>Stet clita</b></p>
<p class="story">Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
<a href="http://example.com/test1" class="link" id="link1">Link1</a>,
<a href="http://example.com/test2" class="link" id="link2">Link2</a> and
<a href="http://example.com/test2" class="link" id="link3">Link3</a>;
sed diam voluptua.</p>
<p class="story">Lorem ipsum dolor sit amet, consetetur sadipscing elitr, ...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
In this example, we first import the BeautifulSoup class from the bs4 module. We then create an instance of the BeautifulSoup class by passing in our HTML document and specifying that we want to use Python’s built-in HTML parser.
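The constructor accepts either a string or an open file-like object as the markup. The sketch below illustrates both forms on a small made-up snippet; the names snippet, soup_from_string, and soup_from_file are illustrative and do not come from the example above, and io.StringIO stands in for a real file opened with open().

```python
import io

from bs4 import BeautifulSoup

# A small, made-up fragment used only for this illustration.
snippet = '<p class="greeting">Hello, <b>world</b>!</p>'

# Parse directly from a string.
soup_from_string = BeautifulSoup(snippet, "html.parser")

# Parse from a file-like object; io.StringIO stands in for an opened file.
soup_from_file = BeautifulSoup(io.StringIO(snippet), "html.parser")

# Both objects represent the same parse tree.
print(soup_from_string.p.get_text())  # Hello, world!
print(str(soup_from_string) == str(soup_from_file))  # True
```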
Navigating the Parse Tree
Once you’ve created a Beautiful Soup object, you can navigate the parse tree like so:
print(soup.title) # <title>Lorem ipsum</title>
print(soup.title.name) # title
print(soup.title.string) # Lorem ipsum
In this example, we navigate the parse tree by accessing different parts of it. The title attribute gives us direct access to the <title> tag in our document. We can then use the name and string attributes to get the name of the tag and its contents respectively.
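Navigation is not limited to the <title> tag. As a rough sketch, the same dotted access works for other tags, and each tag also exposes its HTML attributes and its parent. The shortened html_doc below is an assumption made to keep the example self-contained; it is not the full document from the Basic Usage section.

```python
from bs4 import BeautifulSoup

# Shortened variant of the tutorial document, for illustration only.
html_doc = """<html><head><title>Lorem ipsum</title></head>
<body><p class="title"><b>Stet clita</b></p>
<p class="story">Intro <a href="http://example.com/test1" class="link" id="link1">Link1</a></p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Dotted access returns the first matching tag in document order.
first_link = soup.a

# HTML attributes are read with dictionary-style access.
print(first_link["href"])      # http://example.com/test1

# Every tag knows its parent.
print(first_link.parent.name)  # p
print(soup.title.parent.name)  # head

# "class" can hold several values, so it comes back as a list.
print(soup.p["class"])         # ['title']

# Dotted access can be chained to descend into nested tags.
print(soup.p.b.string)         # Stet clita
```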
Searching the Parse Tree
You can search the parse tree for specific tags using various methods:
print(soup.find_all('a'))
# Output: [<a class="link" href="http://example.com/test1" id="link1">Link1</a>, <a class="link" href="http://example.com/test2" id="link2">Link2</a>, <a class="link" href="http://example.com/test2" id="link3">Link3</a>]
In this example, we use the find_all method to find all <a> tags in our document. This returns a list of Tag objects, each representing one match.
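Searches can also be narrowed by attribute. The following sketch shows a few common variations on a shortened, assumed version of the document: collecting href values, finding a single tag by id, and filtering by CSS class with the class_ keyword. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened variant of the tutorial document, for illustration only.
html_doc = """
<p class="title"><b>Stet clita</b></p>
<p class="story">Lorem ipsum
<a href="http://example.com/test1" class="link" id="link1">Link1</a>,
<a href="http://example.com/test2" class="link" id="link2">Link2</a></p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the href attribute of every <a> tag.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['http://example.com/test1', 'http://example.com/test2']

# find() returns only the first match, or None if nothing matches.
second_link = soup.find(id="link2")
print(second_link.string)  # Link2
missing = soup.find("table")
print(missing)  # None

# class_ (with a trailing underscore) filters by CSS class.
story_paragraphs = soup.find_all("p", class_="story")
print(len(story_paragraphs))  # 1
```

Because find returns None when there is no match, it is worth checking the result before accessing attributes on it.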
Modifying the Parse Tree
You can modify the parse tree by changing tag names and attributes:
tag = soup.a
print(tag) # <a class="link" href="http://example.com/test1" id="link1">Link1</a>
# Change the tag's name:
tag.name = 'strong'
print(tag) # <strong class="link" href="http://example.com/test1" id="link1">Link1</strong>
In this example, we first access the first <a> tag in the document and print it out. We then change the tag’s name to ‘strong’, which turns it into a <strong> tag while keeping its attributes and contents.
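Attributes and text can be modified in a similar way. The sketch below assumes a single standalone <a> tag rather than the full document; the new attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup

# A single link, taken from the tutorial document, parsed on its own.
soup = BeautifulSoup(
    '<a href="http://example.com/test1" class="link" id="link1">Link1</a>',
    "html.parser",
)
tag = soup.a

# Change an existing attribute with dictionary-style assignment.
tag["href"] = "http://example.com/changed"

# Add a new attribute the same way.
tag["title"] = "An edited link"

# Remove an attribute with del.
del tag["id"]

# Replace the tag's text content.
tag.string = "Changed link"

print(tag)
```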
Outputting HTML and XML
You can output your modified parse tree as HTML or XML:
print(str(soup)) # Prints the entire document
print(soup.prettify()) # Prints the entire document with pretty formatting
In this example, we use the str function to print out our modified parse tree as a string of HTML. We can also use the prettify method to print it out in a more readable format.
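The difference between the two output styles is easiest to see on a tiny document. The following is a minimal sketch using a made-up one-line fragment; the variable names compact and pretty are illustrative.

```python
from bs4 import BeautifulSoup

# A made-up one-line fragment used only for this illustration.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# str() gives the markup back without adding any whitespace.
compact = str(soup)
print(compact)  # <p>Hello <b>world</b></p>

# prettify() puts each tag and string on its own indented line.
pretty = soup.prettify()
print(pretty)

# Both also work on an individual tag instead of the whole document.
bold_markup = str(soup.b)
print(bold_markup)  # <b>world</b>
```

Note that prettify adds whitespace to the markup, so it is better suited to inspecting a document than to producing output that must match the original exactly.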
Conclusion
Beautiful Soup is a powerful library for web scraping in Python. It provides simple and Pythonic ways to navigate, search, and modify parse trees. With Beautiful Soup, you can easily extract the data you need from HTML and XML documents.