Cross-post: https://www.linkedin.com/pulse/different-ways-text-chunking-generating-embeddings-arumilli-u8hjc
In a previous post (https://www.alightservices.com/2024/10/03/an-introduction-to-text-chunking-for-purposes-of-vector-embedding/) I talked about the concept of chunking and the reasons for it. This post goes a little deeper and is based on another person's blog post: Five Levels of Chunking Strategies in RAG | Notes from Greg's Video | by Anurag Mishra | Medium and GitHub: RetrievalTutorials/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb at main · FullStackRetrieval-com/RetrievalTutorials · GitHub
For retrieval to work properly, documents need to be chunked, embeddings need to be generated, and the embeddings need to be stored in a vector database. But when we chunk documents, how do we maintain context? What if the information that gives a chunk meaning ends up in a different chunk, leaving the specific details without context?
There are different ways of chunking.
Split by characters such as . ! ? then combine the split sentences until the maximum chunk size is reached. This is fast and easy but does not retain context. Some people overlap some content between adjacent chunks to maintain context, as in the sketch below.
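Here is a minimal sketch of sentence splitting with overlap in plain Python; the function name and parameters are illustrative, not from any particular library:

```python
import re

def chunk_by_sentences(text, max_chars=500, overlap_sentences=1):
    # Split on sentence-ending punctuation (. ! ?) followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, size = [], [], 0
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and size + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward to preserve some context.
            current = current[-overlap_sentences:]
            size = sum(len(s) for s in current)
        current.append(sentence)
        size += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```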
Document-based splitting, for example for HTML: chunk the document by headers. Well-written HTML documents usually keep related content under a header, but there is still the question of chunk size: what if a certain segment of HTML is larger than the embedder's maximum token limit? A fallback split is needed, as in the sketch below.
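A minimal sketch of header-based HTML chunking with a fixed-size fallback; real implementations (such as the splitters in Greg's notebook) use proper HTML parsing rather than a regex, so treat this as illustrative only:

```python
import re

def chunk_html_by_headers(html, max_chars=1000):
    # Split just before each <h1>-<h3> tag so every section keeps its heading.
    sections = re.split(r'(?=<h[1-3][^>]*>)', html)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Fallback: a section larger than the embedder's limit gets
            # split into fixed-size pieces.
            chunks.extend(section[i:i + max_chars]
                          for i in range(0, len(section), max_chars))
    return chunks
```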
The above methods are easy, fast, and low-cost.
The next set of methods is costlier.
Another method is using embeddings to generate chunks, and there are several variations of this approach. Because the similarity of two texts' embeddings indicates whether the texts are related, you can create small chunks based on sentences: keep appending sentences to the current chunk while the similarity stays above some relevancy threshold such as 0.95 and the chunk size allows, then start a new chunk, with some overlap if needed. But this method needs a lot of calls to the embedder and can become costly. A sketch follows.
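A minimal sketch of embedding-based (semantic) chunking; the sentence-transformers model name and the threshold value are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any embedder works here; the model choice is an illustrative assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, threshold=0.8, max_chars=1000):
    if not sentences:
        return []
    embeddings = model.encode(sentences)  # one embedding per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similar = cosine(embeddings[i - 1], embeddings[i]) >= threshold
        fits = sum(len(s) for s in current) + len(sentences[i]) <= max_chars
        if similar and fits:
            current.append(sentences[i])
        else:
            # Similarity dropped or the chunk is full: start a new chunk.
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks
```

Every sentence has to be embedded, which is where the expense comes from.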
Some documents contain information on Topic A, then discuss Topic B, and then continue discussing Topic A. The previously discussed methods don't handle this well.
This can be handled using embedders or LLMs, i.e. chunk the text into smaller pieces and then group each piece with the chunk it is relevant to, even if the pieces are not adjacent. This is very costly because it uses LLMs. Greg Kamradt (https://twitter.com/GregKamradt) has explained these concepts and provided code samples in Python in the GitHub repo linked above; a simplified sketch follows.
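A much-simplified sketch of the LLM-based grouping step, assuming the OpenAI Python client; the model name, prompt, and helper name are illustrative assumptions, not Greg's actual implementation:

```python
from openai import OpenAI  # any chat-completion client works; this one is an assumption

client = OpenAI()

def assign_to_chunk(piece, chunk_summaries):
    # Ask the LLM whether the new piece belongs to an existing topic chunk.
    prompt = (
        "Here are summaries of the existing chunks:\n"
        + "\n".join(f"{i}: {s}" for i, s in enumerate(chunk_summaries))
        + f"\n\nNew piece of text:\n{piece}\n\n"
        "Reply with only the number of the chunk this piece belongs to, "
        "or NEW if it starts a new topic."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

Each piece costs at least one LLM call (plus calls to maintain the chunk summaries), which is why this method is the most expensive of the ones discussed.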
WebVeta does the chunking, embedding, storing, retrieving and calling of LLMs for generating responses, and can be easily embedded in your website with 2-3 lines of HTML! SMEs can focus on their business goals and still provide an advanced internal search engine for people who visit their website looking for more information.
WebVeta is currently running a Kickstarter campaign – https://www.kickstarter.com/projects/kantikalyanarumilli/webveta-power-your-website-with-ai-search
Or contact me for a free trial of WebVeta.
–
Mr. Kanti Kalyan Arumilli
B.Tech, M.B.A
Founder & CEO, Lead Full-Stack .Net developer
ALight Technology And Services Limited
Phone / SMS / WhatsApp on the following 3 numbers:
+91-789-362-6688, +1-480-347-6849, +44-07718-273-964
+44-33-3303-1284 (Preferred number if calling from U.K, No WhatsApp)
kantikalyan@gmail.com, kantikalyan@outlook.com, admin@alightservices.com, kantikalyan.arumilli@alightservices.com, KArumilli2020@student.hult.edu, KantiKArumilli@outlook.com and 3 more rarely used email addresses – hardly once or twice a year.