7 Ways to Split Data Using LangChain Text Splitters - Analytics Vidhya

Apr 19, 2025 am 10:11 AM

LangChain Text Splitters: Optimizing LLM Input for Efficiency and Accuracy

Our previous article covered LangChain's document loaders. Once documents are loaded, a new constraint appears: every LLM has a limited context window, measured in tokens. Input that exceeds the limit is truncated, which hurts accuracy, and sending more text than a query needs increases cost. The remedy is to split documents into chunks and send only the relevant ones to the LLM, which is what LangChain's text splitters are for.

Key Concepts:

  1. The Crucial Role of Text Splitters: Understand why efficient text splitting is vital for optimizing LLM applications, balancing context window size and cost.
  2. Diverse Text Splitting Techniques: Explore various methods, including character counts, token counts, recursive splitting, and techniques tailored to HTML, code, and JSON structures.
  3. LangChain Text Splitter Implementation: Learn practical application, including installation, code examples for text splitting, and handling diverse data formats.
  4. Semantic Splitting for Enhanced Relevance: Discover how sentence embeddings and cosine similarity create semantically coherent chunks, maximizing relevance.

Table of Contents:

  • What are Text Splitters?
  • Data Splitting Methods
  • Character Count-Based Splitting
  • Recursive Splitting
  • Token Count-Based Splitting
  • Handling HTML
  • Code-Specific Splitting
  • JSON Data Handling
  • Semantic Chunking

What are Text Splitters?

Text splitters divide large text into smaller, manageable chunks for improved LLM query relevance. They work directly on raw text or LangChain document objects. Multiple methods cater to different content types and use cases.

Data Splitting Methods

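LangChain text splitters are central to processing large documents efficiently. They improve performance and contextual understanding, enable parallel processing, and make data easier to manage. Let's examine several methods:

Prerequisites: Install the package using pip install langchain-text-splitters. The examples below also import from langchain_community, langchain_experimental, and langchain_openai, which are separate packages; UnstructuredPDFLoader additionally needs the unstructured package, and TokenTextSplitter needs tiktoken.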

Character Count-Based Splitting

This method splits text based on character count, using a specified separator.

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import CharacterTextSplitter

# Load data (replace with your PDF path)
loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='single')
data = loader.load()

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0, is_separator_regex=False)
texts = text_splitter.split_documents(data)
len(texts) # Output: Number of chunks

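This example splits the text at newline characters and merges the resulting pieces into chunks of up to 500 characters, with no overlap between chunks. A single line longer than 500 characters is kept as one oversized chunk.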
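To make the mechanism concrete, here is a minimal pure-Python sketch of separator-based splitting: split on the separator, then greedily merge pieces while the merged chunk stays within chunk_size. The function name split_by_chars is hypothetical, and the sketch is a simplification of the idea rather than CharacterTextSplitter's actual implementation (it ignores overlap, for instance).

```python
def split_by_chars(text: str, separator: str = "\n", chunk_size: int = 500) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole, so a chunk
            # can exceed the limit when the separator never occurs inside it.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\nbeta\ngamma\ndelta"
print(split_by_chars(sample, separator="\n", chunk_size=11))
```

With chunk_size=11, "alpha" and "beta" fit together in one chunk, while adding "gamma" would exceed the limit, so a new chunk starts there.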

Recursive Splitting

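This method tries a list of separators in order: any piece still larger than chunk_size is split again with the next separator in the list. With a sentence-boundary separator in the list, it is useful for sentence-level splitting.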

from langchain_text_splitters import RecursiveCharacterTextSplitter

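# NOTE: this call was truncated in the source after the start of the third separator;
# the sentence-boundary regex and the chunk settings below are assumptions.
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", r"(?<=[.?!])\s+"], is_separator_regex=True,
    chunk_size=500, chunk_overlap=0)
texts = recursive_splitter.split_documents(data)
len(texts) # Output: Number of chunks (293 in the source's run)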
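The fall-through from coarse to fine separators can be illustrated with a short pure-Python sketch. The function recursive_split is hypothetical and deliberately simplified: unlike RecursiveCharacterTextSplitter, it does not merge small neighbouring pieces back together and it drops the separators.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split `text` with the first separator; re-split oversized pieces with the rest."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


text = "First paragraph.\n\nSecond line one\nSecond line two is longer"
print(recursive_split(text, ["\n\n", "\n", " "], chunk_size=16))
```

Pieces that already fit are returned untouched, so the paragraph and line structure survives wherever possible; only the overlong last line is broken down to words.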

Token Count-Based Splitting

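LLMs measure input in tokens, so splitting by token count tracks the context window limit more closely than splitting by character count. This example uses the o200k_base encoding from tiktoken (the tiktoken repository on GitHub lists which encoding each model uses).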

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(encoding_name='o200k_base', chunk_size=50, chunk_overlap=0)
texts = text_splitter.split_documents(data)
len(texts) # Output: Number of chunks
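The windowing idea behind token-based splitting can be sketched without any tokenizer installed. In the sketch below, split_by_tokens is a hypothetical helper and whitespace-separated words stand in for real model tokens, which is an assumption made purely for illustration; a real tokenizer such as tiktoken would produce different boundaries.

```python
def split_by_tokens(text: str, chunk_size: int = 50, chunk_overlap: int = 0) -> list[str]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Stand-in tokenizer (assumption): whitespace-separated words, not model tokens.
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


print(split_by_tokens("one two three four five six seven", chunk_size=3, chunk_overlap=1))
```

With an overlap of 1, the last token of each chunk is repeated as the first token of the next, which helps preserve context across chunk boundaries.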

Recursive splitting can also be combined with token counting.

For plain text, recursive splitting with character or token counting is generally preferred.

Handling HTML

For structured data like HTML, splitting should respect the structure. This example splits based on HTML headers.

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on, return_each_element=True)
html_header_splits = html_splitter.split_text_from_url('https://diataxis.fr/')
len(html_header_splits) # Output: Number of chunks

For coarser splits, HTMLSectionSplitter divides a page into sections based on the tags you specify.
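For intuition about header-based splitting, the following sketch groups page text under the nearest preceding h1, h2, or h3 using only the standard library's html.parser. The class HeaderSectionParser and the function split_html_by_headers are hypothetical names, and the output format (a list of header/text dictionaries) is an assumption for illustration, not what HTMLHeaderTextSplitter returns.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect text into sections, starting a new section at each h1/h2/h3."""

    HEADER_TAGS = {"h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict[str, str]] = []
        self._in_header = False
        self._current = {"header": "", "text": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._flush()  # a new header closes the previous section
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.HEADER_TAGS:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        key = "header" if self._in_header else "text"
        self._current[key] = (self._current[key] + " " + text).strip()

    def _flush(self) -> None:
        if self._current["header"] or self._current["text"]:
            self.sections.append(self._current)
        self._current = {"header": "", "text": ""}

    def close(self) -> None:
        super().close()
        self._flush()  # emit the final section


def split_html_by_headers(html: str) -> list[dict[str, str]]:
    parser = HeaderSectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


page = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><p>Install it.</p><p>Run it.</p>"
for section in split_html_by_headers(page):
    print(section)
```

Each section keeps its header alongside its body text, so a retrieved chunk still carries the context of where it sat in the page.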

Code-Specific Splitting

Programming languages have unique structures. This example uses syntax-aware splitting for Python code.

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# ... (Python code example) ...

python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=100, chunk_overlap=0)
python_docs = python_splitter.create_documents([PYTHON_CODE])
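As a rough illustration of syntax-aware splitting, the sketch below cuts Python source before each top-level class or def line, so that a definition is never separated from its body. The function split_python_source and the SAMPLE_CODE string are hypothetical, and the single regex is a simplification: it assumes the splitter prefers class and function boundaries and it ignores chunk_size entirely.

```python
import re


def split_python_source(source: str) -> list[str]:
    """Split source text before each top-level `class` or `def` line."""
    # The lookahead keeps the keyword with the chunk that follows it.
    pieces = re.split(r"\n(?=(?:class|def) )", source)
    return [piece.strip("\n") for piece in pieces if piece.strip()]


SAMPLE_CODE = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

for chunk in split_python_source(SAMPLE_CODE):
    print(repr(chunk))
```

The indented method inside Circle is not a split point because the pattern only matches class or def at the start of a line.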

JSON Data Handling

Nested JSON objects can be split while preserving key relationships.

from langchain_text_splitters import RecursiveJsonSplitter

# ... (JSON data example) ...

splitter = RecursiveJsonSplitter(max_chunk_size=200, min_chunk_size=20)
chunks = splitter.split_text(json_data, convert_lists=True)
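What "preserving key relationships" means can be shown with a small standard-library sketch: each leaf value is placed into the current chunk under its full key path, and a new chunk is started when the size limit would be exceeded. The function split_json and the sample data are hypothetical, lists are treated as plain leaf values, and the sizing rule is an approximation rather than RecursiveJsonSplitter's actual logic.

```python
import json


def split_json(data: dict, max_chunk_size: int = 200) -> list[dict]:
    """Pack leaf values into nested dicts whose JSON text stays near `max_chunk_size`."""
    chunks: list[dict] = [{}]

    def set_path(target: dict, path: list[str], value) -> None:
        # Recreate the parent keys so the value keeps its position in the hierarchy.
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = value

    def walk(node: dict, path: list[str]) -> None:
        for key, value in node.items():
            new_path = path + [key]
            if isinstance(value, dict) and value:
                walk(value, new_path)
                continue
            item_size = len(json.dumps({key: value}))
            current_size = len(json.dumps(chunks[-1]))
            if chunks[-1] and current_size + item_size > max_chunk_size:
                chunks.append({})  # start a new chunk for this value
            set_path(chunks[-1], new_path, value)

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


record = {"user": {"name": "Ada", "email": "ada@example.com"}, "active": True}
for chunk in split_json(record, max_chunk_size=40):
    print(chunk)
```

Note that the email value lands in its own chunk but is still nested under "user", so a reader of that chunk alone can tell what the value belongs to.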

Semantic Chunking

This method uses sentence embeddings and cosine similarity to group semantically related sentences.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings # Requires OpenAI API key

# ... (code using OpenAIEmbeddings and SemanticChunker) ...
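The core idea, comparing neighbouring sentences by cosine similarity and starting a new chunk where similarity drops, can be sketched without an API key. Everything here is an illustrative assumption: embed is a toy word-count stand-in for a real sentence-embedding model, the fixed threshold of 0.2 is arbitrary, and the logic is a simplification that does not reproduce SemanticChunker's actual algorithm.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    # Toy stand-in for a real sentence-embedding model (assumption): word counts.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below `threshold` similarity."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine_similarity(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Python is a programming language.",
    "Python code is easy to read.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The two sentences about cats share vocabulary and stay together, while the switch to Python produces zero similarity and opens a new chunk. With real embeddings the same comparison works on meaning rather than on shared words.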

Conclusion

LangChain offers various text splitting methods, each suited for different data types. Choosing the right method optimizes LLM input, improving accuracy and reducing costs.

