Using LangChain HTMLHeaderTextSplitter


The LangChain HTMLHeaderTextSplitter is a text splitter that breaks a complete LangChain document into smaller parts. LangChain Documents are loaded with a LangChain document loader, which makes the full text available at once. In that form, documents are often too long: for many use cases they contain more information than fits into the context window of an LLM. Unsplit documents are also poorly suited to RAG applications and to searching vector databases.

That is why they have to be split. This is where the LangChain text splitters come into play: they break long pieces of text into manageable sections.

How Text Splitters Work

Text splitters function as follows:

  1. Divide the text into small, meaningful chunks, usually sentences.
  2. Combine these small chunks into larger ones until a certain size is reached, as measured by a specific function.
  3. Once the text reaches that size, create a new section with some overlap to maintain context between sections.
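The three steps above can be sketched in plain Python. This is a minimal illustration and not LangChain's implementation: the function name `split_with_overlap`, its parameters, and the naive regex-based sentence split are all assumptions made for the example.

```python
import re


def split_with_overlap(text, chunk_size=80, overlap_sentences=1, length_fn=len):
    """Sentence-based chunking with overlap (illustrative sketch only)."""
    # Step 1: divide the text into small units, here naively into sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        # Step 2: keep adding sentences until the size limit would be exceeded.
        if current and length_fn(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Step 3: start the next chunk with the last sentence(s) as overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in split_with_overlap("One. Two. Three. Four.", chunk_size=12):
    print(chunk)
```

Each chunk after the first repeats the last sentence of the previous one, which is what keeps context between neighbouring sections. The `length_fn` argument stands in for the "specific function" that measures size; a token counter could be passed instead of `len`.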

The individual LangChain text splitters differ in how they divide a document and in how large the resulting chunks are.

HTMLHeaderTextSplitter

The HTMLHeaderTextSplitter is a LangChain splitter that splits text at the element level, following the structure of the HTML document. It attaches metadata to each chunk for every header that applies to it. It can return chunks element by element, or it can combine elements that share the same metadata. This makes it particularly useful for parsing web pages.
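To make the idea concrete, here is a rough sketch of header-aware splitting built only on Python's standard `html.parser` module. It is not how LangChain implements the splitter; the class `HeaderTrackingParser` and the function `split_html_by_headers` are hypothetical names for this illustration. Only the flag name `return_each_element` is borrowed from the real splitter's option of the same name.

```python
from html.parser import HTMLParser


class HeaderTrackingParser(HTMLParser):
    """Collects text fragments and tags each with the headers above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}       # header tag -> text of the header currently in effect
        self.in_header = None  # header tag being read right now, if any
        self.elements = []     # list of (text, metadata) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            # A new header replaces headers of the same or a deeper level.
            for other in list(self.active):
                if other >= tag:
                    del self.active[other]
            self.active[tag] = ""
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if self.in_header:
            # Text inside a header tag becomes metadata, not content.
            joined = self.active[self.in_header] + " " + text
            self.active[self.in_header] = joined.strip()
        else:
            metadata = {self.header_names[t]: v for t, v in sorted(self.active.items())}
            self.elements.append((text, metadata))


def split_html_by_headers(html, headers_to_split_on, return_each_element=False):
    parser = HeaderTrackingParser(headers_to_split_on)
    parser.feed(html)
    parser.close()
    if return_each_element:
        return parser.elements
    # Combine consecutive elements that carry identical metadata.
    merged = []
    for text, metadata in parser.elements:
        if merged and merged[-1][1] == metadata:
            merged[-1] = (merged[-1][0] + " " + text, metadata)
        else:
            merged.append((text, metadata))
    return merged


html = "<h1>A</h1><p>x</p><p>y</p><h2>B</h2><p>z</p>"
headers = [("h1", "Header 1"), ("h2", "Header 2")]
print(split_html_by_headers(html, headers, return_each_element=True))
print(split_html_by_headers(html, headers))
```

With `return_each_element=True` the two paragraphs under "A" come back as separate entries; by default they are merged into one chunk because their metadata is identical.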

Installation

To use the HTMLHeaderTextSplitter, install LangChain Text Splitters using pip:

pip install langchain-text-splitters

Example 1 – From HTML String

In the following example we create a string variable that contains HTML content with h1, h2 and h3 headings. We also assign a metadata key to each heading level: "Header 1", "Header 2" and "Header 3".


from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Hotel Room Information</title>
</head>
<body>
    <h1>Room Details</h1>
    
    <div class="room-info">
        <h2>Deluxe King Room</h2>
        
        <h3>Room Type:</h3>
        <p>King Bed</p>
        
        <h3>Occupancy:</h3>
        <p>2 Guests</p>
        
        <h3>Room Features:</h3>
        <ul>
            <li>Air Conditioning</li>
            <li>Free WiFi</li>
        </ul>
        
        <h3>Rate:</h3>
        <p>$150 per night (excluding taxes and fees)</p>
    </div>
    
</body>
</html>
"""

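# Each tuple maps an HTML header tag to the metadata key used for its text.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# split_text returns a list of Document objects (page_content plus metadata).
html_header_splits = html_splitter.split_text(html_string)
print(html_header_splits)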

Output (formatted):

[
Document(page_content='Deluxe King Room Room Type: Occupancy: Room Features: Rate:', metadata={'Header 1': 'Room Details'}), 
Document(page_content='King Bed', metadata={'Header 1': 'Room Details', 'Header 2': 'Deluxe King Room', 'Header 3': 'Room Type:'}), 
Document(page_content='2 Guests', metadata={'Header 1': 'Room Details', 'Header 2': 'Deluxe King Room', 'Header 3': 'Occupancy:'}), 
Document(page_content='Air Conditioning Free WiFi', metadata={'Header 1': 'Room Details', 'Header 2': 'Deluxe King Room', 'Header 3': 'Room Features:'}), 
Document(page_content='$150 per night (excluding taxes and fees)', metadata={'Header 1': 'Room Details', 'Header 2': 'Deluxe King Room', 'Header 3': 'Rate:'})
 ]
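The header metadata makes it easy to look up a chunk by its heading. The sketch below shows one way to do that. So that it runs without LangChain installed, it uses a hypothetical stand-in class `Chunk` that has the same two attributes as a LangChain Document (`page_content` and `metadata`); the helper `index_by_header` is likewise an assumption of this example and not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def index_by_header(chunks, header_key="Header 3"):
    """Map each header value (without trailing colon) to the text beneath it."""
    return {
        c.metadata[header_key].rstrip(":"): c.page_content
        for c in chunks
        if header_key in c.metadata  # skip chunks that lack this header level
    }


base = {"Header 1": "Room Details", "Header 2": "Deluxe King Room"}
chunks = [
    Chunk("Deluxe King Room Room Type: Occupancy: Room Features: Rate:",
          {"Header 1": "Room Details"}),
    Chunk("King Bed", {**base, "Header 3": "Room Type:"}),
    Chunk("2 Guests", {**base, "Header 3": "Occupancy:"}),
    Chunk("Air Conditioning Free WiFi", {**base, "Header 3": "Room Features:"}),
    Chunk("$150 per night (excluding taxes and fees)", {**base, "Header 3": "Rate:"}),
]

room = index_by_header(chunks)
print(room["Rate"])
```

The same function should work on the real `html_header_splits` list from the example, since it only reads `page_content` and `metadata`.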

Example 2 – Load from URL

In this second example, we’ll load an HTML document from a URL and split it using the HTMLHeaderTextSplitter.


from langchain_text_splitters import HTMLHeaderTextSplitter

url = "https://developers-blog.org"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Fetches the page over HTTP and splits the returned HTML by its headers.
html_header_splits = html_splitter.split_text_from_url(url)

print(html_header_splits)

Conclusion

Since HTML varies greatly and often contains errors, the algorithm may not always assign text to the correct headers. Otherwise, the HTMLHeaderTextSplitter is a good LangChain splitter for preparing HTML documents for later use in vector databases and RAG applications.
