Cross-post: https://www.linkedin.com/pulse/different-ways-text-chunking-generating-embeddings-arumilli-u8hjc
In a previous post (https://www.alightservices.com/2024/10/03/an-introduction-to-text-chunking-for-purposes-of-vector-embedding/) I talked about the concept of chunking and the reasons for it. This post goes a little deeper and is based on another person's blog post: Five Levels of Chunking Strategies in RAG | Notes from Greg's Video | by Anurag Mishra | Medium and GitHub: RetrievalTutorials/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb at main · FullStackRetrieval-com/RetrievalTutorials · GitHub
For retrieval to work properly, documents need to be chunked, embeddings need to be generated, and the embeddings need to be stored in a vector database. But when we chunk documents, how do we maintain context? What if the information that gives a chunk meaning ends up in a different chunk, leaving the specific details without context?
There are different ways of chunking.
Split by characters such as . ! ? then combine the split sentences until the maximum chunk size is reached. This is fast and easy but does not retain context. Some people overlap some content between adjacent chunks to maintain context, as in the sketch below.
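Here is a minimal sketch of sentence splitting with overlap in plain Python; the function name and parameters are illustrative, not from any particular library:

```python
import re

def chunk_by_sentences(text, max_chars=500, overlap_sentences=1):
    # Split on sentence-ending punctuation (. ! ?) followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, size = [], [], 0
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and size + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward to preserve some context.
            current = current[-overlap_sentences:]
            size = sum(len(s) for s in current)
        current.append(sentence)
        size += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```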
Document-based splitting, for example for HTML: chunk the document by headers. Well-written HTML documents usually keep related content under a header, but there is still the question of chunk size: what if a certain segment of HTML is larger than the embedder's maximum token limit? A fallback split is needed, as in the sketch below.
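A minimal sketch of header-based HTML chunking with a fixed-size fallback; real implementations (such as the splitters in Greg's notebook) use proper HTML parsing rather than a regex, so treat this as illustrative only:

```python
import re

def chunk_html_by_headers(html, max_chars=1000):
    # Split just before each <h1>-<h3> tag so every section keeps its heading.
    sections = re.split(r'(?=<h[1-3][^>]*>)', html)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Fallback: a section larger than the embedder's limit gets
            # split into fixed-size pieces.
            chunks.extend(section[i:i + max_chars]
                          for i in range(0, len(section), max_chars))
    return chunks
```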
The above methods are easy, fast, and low-cost.
The next set of methods is costlier.
Another method is using embeddings to generate chunks, and there are several variations of this approach. Because the similarity of two texts' embeddings indicates whether the texts are related, you can create small chunks based on sentences: keep appending sentences to the current chunk while the similarity stays above some relevancy threshold such as 0.95 and the chunk size allows, then start a new chunk, with some overlap if needed. But this method needs a lot of calls to the embedder and can become costly. A sketch follows.
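A minimal sketch of embedding-based (semantic) chunking; the sentence-transformers model name and the threshold value are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any embedder works here; the model choice is an illustrative assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, threshold=0.8, max_chars=1000):
    if not sentences:
        return []
    embeddings = model.encode(sentences)  # one embedding per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similar = cosine(embeddings[i - 1], embeddings[i]) >= threshold
        fits = sum(len(s) for s in current) + len(sentences[i]) <= max_chars
        if similar and fits:
            current.append(sentences[i])
        else:
            # Similarity dropped or the chunk is full: start a new chunk.
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks
```

Every sentence has to be embedded, which is where the expense comes from.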
Some documents contain information on Topic A, then discuss Topic B, and then continue discussing Topic A. The previously discussed methods don't handle this well.
This can be handled using embedders or LLMs, i.e. chunk the text into smaller pieces and then group each piece with the chunk it is relevant to, even if the pieces are not adjacent. This is very costly because it uses LLMs. Greg Kamradt (https://twitter.com/GregKamradt) has explained these concepts and provided code samples in Python in the GitHub repo linked above; a simplified sketch follows.
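A much-simplified sketch of the LLM-based grouping step, assuming the OpenAI Python client; the model name, prompt, and helper name are illustrative assumptions, not Greg's actual implementation:

```python
from openai import OpenAI  # any chat-completion client works; this one is an assumption

client = OpenAI()

def assign_to_chunk(piece, chunk_summaries):
    # Ask the LLM whether the new piece belongs to an existing topic chunk.
    prompt = (
        "Here are summaries of the existing chunks:\n"
        + "\n".join(f"{i}: {s}" for i, s in enumerate(chunk_summaries))
        + f"\n\nNew piece of text:\n{piece}\n\n"
        "Reply with only the number of the chunk this piece belongs to, "
        "or NEW if it starts a new topic."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

Each piece costs at least one LLM call (plus calls to maintain the chunk summaries), which is why this method is the most expensive of the ones discussed.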
WebVeta does the chunking, embedding, storing, retrieving and calling of LLMs for generating responses, and can be easily embedded in your website with 2-3 lines of HTML! SMEs can focus on their business goals and still provide an advanced internal search engine for people who visit their website looking for more information.
WebVeta is currently running a Kickstarter campaign – https://www.kickstarter.com/projects/kantikalyanarumilli/webveta-power-your-website-with-ai-search
Or contact me for a free trial of WebVeta.
–
Mr. Kanti Kalyan Arumilli
B.Tech, M.B.A
Founder & CEO, Lead Full-Stack .Net developer
ALight Technology And Services Limited
Phone / SMS / WhatsApp on the following 3 numbers:
+91-789-362-6688, +1-480-347-6849, +44-07718-273-964
+44-33-3303-1284 (Preferred number if calling from U.K, No WhatsApp)
kantikalyan@gmail.com, kantikalyan@outlook.com, admin@alightservices.com, kantikalyan.arumilli@alightservices.com, KArumilli2020@student.hult.edu, KantiKArumilli@outlook.com and 3 more rarely used email addresses – hardly once or twice a year.