TODAY'S ISSUE
TODAY’S DAILY DOSE OF DATA SCIENCE
5 Chunking Strategies For RAG
Here’s the typical workflow of a RAG application:
Since the additional document(s) can be pretty large, step 1 also involves chunking, wherein a large document is divided into smaller/manageable pieces.
This step is crucial since it ensures the text fits the input size of the embedding model.
Moreover, it enhances the efficiency and accuracy of the retrieval step, which directly impacts the quality of generated responses (we discussed this yesterday).
Here are five chunking strategies for RAG:
Let’s understand them today!
Note: Yesterday, we discussed techniques to build robust NLP systems that rely on pairwise content similarity (RAG is one of them). Read here in case you missed it: Bi-encoders and Cross-encoders for Sentence Pair Similarity Scoring – Part 1.
1) Fixed-size chunking
The most intuitive and straightforward way to generate chunks is by splitting the text into uniform segments based on a pre-defined number of characters, words, or tokens.
Since a direct split can disrupt the semantic flow, it is recommended to maintain some overlap between two consecutive chunks (the blue part above).
This is simple to implement. Also, since all chunks are of equal size, it simplifies batch processing.
But there is a big problem: this usually breaks sentences (or ideas) midway, so important information can end up scattered across chunks.
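To make this concrete, here's a minimal sketch of word-based fixed-size chunking with overlap. The chunk size and overlap values are purely illustrative; in practice, you would typically count the embedding model's tokens rather than words.

```python
# A minimal sketch of fixed-size chunking with overlap.
# Chunks are measured in words here for simplicity; real pipelines usually
# count the embedding model's tokens. Sizes below are illustrative.

def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split `text` into chunks of `chunk_size` words, where consecutive
    chunks share `overlap` words to preserve some semantic continuity."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # the last chunk already reaches the end
            break
    return chunks


document = "..."  # replace with your document text
for chunk in fixed_size_chunks(document):
    print(chunk)
```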
2) Semantic chunking
The idea is simple.
- Segment the document based on meaningful units like sentences, paragraphs, or thematic sections.
- Next, create embeddings for each segment.
- Let’s say I start with the first segment and its embedding.
- If the first segment’s embedding has a high cosine similarity with that of the second segment, both segments form a chunk.
- This continues until cosine similarity drops significantly.
- The moment it does, we start a new chunk and repeat.
Here’s what the output could look like:
Unlike fixed-size chunks, this maintains the natural flow of language and preserves complete ideas.
Since each chunk carries more complete context, it improves retrieval accuracy, which, in turn, produces more coherent and relevant responses from the LLM.
A minor problem is that it depends on a threshold to determine if cosine similarity has dropped significantly, which can vary from document to document.
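Here's a minimal sketch of the idea, assuming the sentence-transformers library for embeddings. The model name, the 0.7 threshold, and the pre-split sentences are illustrative assumptions, not recommendations.

```python
# A minimal sketch of semantic chunking, assuming the sentence-transformers
# library. Consecutive segments are merged while their embeddings stay similar;
# a new chunk starts when cosine similarity drops below the threshold.
import numpy as np
from sentence_transformers import SentenceTransformer


def semantic_chunks(segments: list[str], model, threshold: float = 0.7) -> list[str]:
    # normalize_embeddings=True makes the dot product equal to cosine similarity
    embeddings = model.encode(segments, normalize_embeddings=True)
    chunks, current = [], [segments[0]]
    for i in range(1, len(segments)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity >= threshold:
            current.append(segments[i])        # still on the same topic -> same chunk
        else:
            chunks.append(" ".join(current))   # topic shift -> close the chunk
            current = [segments[i]]
    chunks.append(" ".join(current))
    return chunks


model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
segments = [
    "RAG retrieves relevant context from a document store.",
    "The retrieved context is passed to the LLM along with the query.",
    "Cats are popular pets in many countries.",
]
print(semantic_chunks(segments, model, threshold=0.7))
```

The threshold that defines a "significant drop" is exactly the knob mentioned above; some implementations pick it as a percentile of the observed similarity drops so that it adapts to each document.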
3) Recursive chunking
This is also simple.
First, chunk the document based on inherent separators like paragraphs or sections.
Next, split each chunk into smaller chunks if it exceeds a pre-defined size limit. If the chunk already fits within the limit, no further splitting is done.
Here’s what the output could look like:
As shown above:
- First, we define two chunks (the two paragraphs in purple).
- Next, paragraph 1 is further split into smaller chunks.
Unlike fixed-size chunks, this approach also maintains the natural flow of language and preserves complete ideas.
However, there is some extra overhead in terms of implementation and computational complexity.
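A minimal sketch of the recursive logic is shown below; the separators and the 500-character limit are illustrative choices (libraries such as LangChain ship a similar ready-made recursive splitter).

```python
# A minimal sketch of recursive chunking: split on the coarsest separator first
# (paragraphs), and only recurse to finer separators (sentences, then words)
# for pieces that still exceed the size limit. All limits here are illustrative.

def recursive_chunks(text: str, max_chars: int = 500,
                     separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    if len(text) <= max_chars or not separators:
        return [text]                       # small enough, or nothing left to split on
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if len(piece) <= max_chars:
            chunks.append(piece)            # this piece already fits the limit
        else:
            chunks.extend(recursive_chunks(piece, max_chars, rest))  # recurse finer
    return chunks


document = "First paragraph ...\n\nA second, much longer paragraph ..."
for chunk in recursive_chunks(document, max_chars=500):
    print(chunk)
```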
4) Document structure-based chunking
This is another intuitive approach.
It utilizes the inherent structure of documents, like headings, sections, or paragraphs, to define chunk boundaries.
This way, it maintains structural integrity by aligning with the document’s logical sections.
Here’s what the output could look like:
That said, this approach assumes that the document has a clear structure, which may not be true.
Also, chunks may vary in length, possibly exceeding model token limits. You can try combining it with recursive splitting to break oversized sections further.
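For documents with explicit markup (Markdown, HTML, etc.), this can be as simple as splitting at headings. Here's a minimal sketch for Markdown; the heading regex is the only structural assumption.

```python
# A minimal sketch of structure-based chunking for a Markdown document:
# every heading (#, ##, ...) starts a new chunk, so chunks align with sections.
import re

HEADING = re.compile(r"^#{1,6}\s")  # Markdown headings: '# ', '## ', ...


def markdown_section_chunks(markdown_text: str) -> list[str]:
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if HEADING.match(line) and current:   # a new heading closes the previous section
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks


doc = "# Intro\nSome text.\n\n## Details\nMore text under the second section."
print(markdown_section_chunks(doc))
```

If a section still exceeds the token limit, you can pass it through the recursive splitter from strategy #3.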
5) LLM-based chunking
Since every approach has upsides and downsides, why not use the LLM to create chunks?
The LLM can be prompted to generate semantically isolated and meaningful chunks.
Quite evidently, this method will ensure high semantic accuracy since the LLM can understand context and meaning beyond simple heuristics (used in the above four approaches).
The main drawback is that it is the most computationally demanding of the five techniques discussed here.
Also, since LLMs have a limited context window, long documents may need to be pre-split so that each piece fits before the LLM can chunk it.
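Here's a minimal sketch of the idea, assuming the OpenAI Python SDK. The model name, the prompt, and the "---" delimiter convention are illustrative assumptions, and the input must already fit in the model's context window.

```python
# A minimal sketch of LLM-based chunking, assuming the OpenAI Python SDK
# (pip install openai) and an OPENAI_API_KEY in the environment.
# The model name, prompt, and '---' delimiter are illustrative choices.
from openai import OpenAI

client = OpenAI()


def llm_chunks(text: str, model: str = "gpt-4o-mini") -> list[str]:
    prompt = (
        "Split the following text into self-contained, semantically coherent chunks. "
        "Do not rewrite the text. Separate chunks with a line containing only '---'.\n\n"
        + text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    content = response.choices[0].message.content
    return [chunk.strip() for chunk in content.split("---") if chunk.strip()]


print(llm_chunks("Your long document goes here ..."))
```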
Each technique has its own advantages and trade-offs.
I have observed that semantic chunking works pretty well in many cases, but again, you need to test.
The choice will heavily depend on the nature of your content, the capabilities of the embedding model, computational resources, etc.
We shall be doing a hands-on demo of these strategies pretty soon.
In the meantime, in case you missed it: yesterday we discussed techniques to build robust NLP systems that rely on pairwise content similarity (RAG is one of them).
Read here: Bi-encoders and Cross-encoders for Sentence Pair Similarity Scoring – Part 1.
👉 Over to you: What other chunking strategies do you know?
IN CASE YOU MISSED IT
A crash course on building RAG systems (Parts 1-3)
Over the last few weeks, we have spent plenty of time understanding the key components of real-world NLP systems (like the deep dives on bi-encoders and cross-encoders for context pair similarity scoring).
RAG is another key NLP system that gained massive attention because of a core challenge it addresses around LLMs.
More specifically, if you know how to build a reliable RAG system, you can bypass the challenge and cost of fine-tuning LLMs.
That’s a considerable cost saving for enterprises.
And at the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Can you drive more revenue?
- Can you scale ML training/inference?
- Can you predict trends before they happen?
Thus, the objective of this crash course is to help you implement reliable RAG systems, understand the underlying challenges, and develop expertise in building RAG apps on top of LLMs, which every industry cares about now.
We have published three parts so far:
- Learn the foundations of RAG and build a RAG app from scratch in the first part here.
- Learn how to evaluate your RAG systems in the second part here.
- Learn how to optimize your RAG systems in the third part here.
Of course, if you have never worked with LLMs, that’s okay. We cover everything in a practical and beginner-friendly way.
THAT'S A WRAP
No-Fluff Industry ML resources to succeed in DS/ML roles
At the end of the day, all businesses care about impact. That’s it!
- Can you reduce costs?
- Can you drive more revenue?
- Can you scale ML models?
- Can you predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with these goals.
Here are some of them:
- Learn sophisticated graph architectures and how to train them on graph data in this crash course.
- So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
- Run large models on small devices using Quantization techniques.
- Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
- Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
- Learn how to scale and implement ML model training in this practical guide.
- Learn 5 techniques with implementation to reliably test ML models in production.
- Learn how to build and implement privacy-first ML systems using Federated Learning.
- Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Advertise to 450k+ data professionals
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., around the world.