Using LangChain's RecursiveCharacterTextSplitter

Introduction

LangChain is a framework for building applications on top of language models, and it ships with a set of text-splitting utilities. The RecursiveCharacterTextSplitter is one of them: it divides large texts into smaller chunks based on a maximum chunk size and an ordered list of separator characters. This article explains how to use the splitter effectively.

Installation

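The RecursiveCharacterTextSplitter ships in the langchain-text-splitters package. The example in this article also uses langchain-community for its web page loader. You can install both with pip: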


pip install langchain-community langchain-text-splitters

The RecursiveCharacterTextSplitter is a LangChain text splitter that divides large texts into smaller chunks. It works through an ordered list of separators, trying the first one and falling back to the next only when a piece is still too large. This keeps paragraphs, then sentences, then words together for as long as possible. By default, the list of separators is ["\n\n", "\n", " ", ""], but it can be tailored to suit your requirements.
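To make the fallback order concrete, here is a simplified pure-Python sketch of the recursive idea. It is an illustration only, not LangChain's implementation: the function name recursive_split is hypothetical, and the sketch ignores chunk overlap and the library's other options.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split text into chunks of at most chunk_size characters.

    The size limit is only guaranteed when the separator list ends with "".
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; return the oversized piece as-is
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        # This separator does not occur, so fall back to the next one.
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer and will not fit into one chunk.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With a limit of 40 characters, the short first paragraph stays whole, while the long second paragraph falls back to splitting on spaces.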

Parameters

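The main settings to know when working with the RecursiveCharacterTextSplitter are:

  1. Documents: The input documents are not passed to the constructor. After creating the splitter, you pass them as a list to split_documents(), or pass a plain string to split_text().

  2. Chunk size: The chunk_size argument sets the maximum length of each chunk, as measured by length_function. With length_function = len, this is a number of characters. Chunks can be shorter than the limit, because the splitter cuts at separators.

  3. Chunk overlap: The chunk_overlap argument sets how much text, at most, is repeated between consecutive chunks so that context is not lost at chunk boundaries.

  4. Separators: By default, the separators used to split the text into chunks are ["\n\n", "\n", " ", ""]. However, you have the option to override them with the separators argument.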

Example

Here’s an example of how you can use RecursiveCharacterTextSplitter:


from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the web page as a list of Document objects
loader = WebBaseLoader("https://developers-blog.org")
webpage_documents = loader.load()

# Split into chunks of at most 200 characters, with up to 100 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=100,
    length_function=len,
)

texts = text_splitter.split_documents(webpage_documents)
print(len(texts))  # prints the number of chunks
print(texts[0:3])  # prints the first three chunks

In this example, we load a web page using WebBaseLoader and split it into chunks of at most 200 characters. The overlap between consecutive chunks is set to 100 characters, so neighboring chunks can share up to half of their text. The script then prints the total number of chunks, followed by the first three chunks.
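The relationship between chunk size and overlap can be illustrated with plain fixed-width windows. This is a simplified sketch that assumes cuts happen at exact character positions. The real splitter cuts at separators, so its actual overlaps vary and chunk_overlap acts as an upper bound. The helper name sliding_windows is hypothetical.

```python
def sliding_windows(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


# 500 characters of dummy text, using the same settings as the example above
text = "".join(str(i % 10) for i in range(500))
windows = sliding_windows(text, chunk_size=200, chunk_overlap=100)
print(len(windows), sum(len(w) for w in windows))
```

Here a 500-character text produces four windows holding 800 characters in total. With an overlap of half the chunk size, most of the text is stored twice, which is worth keeping in mind when choosing chunk_overlap.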

Conclusion

And there we have it. We’ve looked at how LangChain’s RecursiveCharacterTextSplitter works and seen how it splits large texts into smaller chunks based on the chunk size, chunk overlap, and separators you specify. It is a useful addition to any text processing workflow in LangChain.
