WebVeta - Advanced, unified, consistent search for your website(s), from content of your website(s), blogs(s). First 50 customers, who sign-up prior to 15/05/2024 get unlimited access to existing features, newer features for at least 1 year. Sign up now! https://webveta.alightservices.com/
Categories
.Net C# Solr

Some performance tips when ingesting documents into Solr

As mentioned in several blog posts earlier, I have been building PodDB on Microsoft.Net platform and Solr. Solr is built on top of Apache Lucene.

Lucene.Net is a very high-performance library for working directly with Apache Lucene, SolrNet is a library for working with Solr. Solr is very customizable, fault-tolerant and has several additional features available out of the box and is built on top of Lucene. Working with SolrNet can be a bit slow because all the API calls are routed via a REST API. The usual overhead of establishing network connection, serializing and deserializing JSON or XML.

Over the past few days, I have been working on a small subset of documents (approximately 275 – 300, the same would be part of the Alpha release) and trying to tweak the settings for optimal search relevance. This required trying various Solr configurations, re-indexing data etc… The very first version of the data ingestion component (does much more pre-processing rather than just ingesting into solr) used to take about approximately 10 minutes. And now the performance has been optimized and the ingestion happens within 15 seconds. i.e over 4000% performance gain and entirely programming related.

The trick used was one of the oldest tricks in the book – batch processing. Instead of one document at a time for writing into a MySQL database and writing into Solr, I rewrote the application to ingest in batches and the application was much faster.

Batching with multi-threading might be even faster.

In other words instead of calling solr.Add() for each document, create the documents, hold them in a list, call solr.AddRange().

Similarly for solr.Commit() and solr.Optimize() batch the calls i.e call those methods once for every 1000 or so documents rather than every document.

Assuming doc is a Solr document that needs to be written. For example:

//NO
solr.Add(doc1);
solr.Add(doc2);
solr.Add(doc3);

//YES
var lst = new List<ENTITY>();
lst.Add(doc1);
lst.Add(doc2);
lst.Add(doc3);

solr.AddRange(lst);

I like to share knowledge, I am hoping this blog post helps someone.

My 2 cents to the world of the blogosphere!

I don’t have any fake aliases, nor any virtual aliases like some of the the psycho spy R&AW traitors of India. NOT associated – “ass”, eass, female “es”, “eka”, “ok”, “okay”, “is”, “erra”, yerra, karan, kamalakar, diwakar, kareem, karan, sowmya, zinnabathuni, bojja srinivas (was a friend and batchmate 1998 – 2002), mukesh golla (was a friend and classmate 1998 – 2002), thota veera, uttam’s, bandhavi’s, bhattaru’s, thota’s, bojja’s, bhattaru’s or Arumilli srinivas or Arumilli uttam (may be they are part of a different Arumilli family – not my family).

Mr. Kanti Kalyan Arumilli

Arumilli Kanti Kalyan, Founder & CEO
Arumilli Kanti Kalyan, Founder & CEO

B.Tech, M.B.A

Facebook

LinkedIn

Threads

Instagram

Youtube

Founder & CEO, Lead Full-Stack .Net developer

ALight Technology And Services Limited

ALight Technologies USA Inc

Youtube

Facebook

LinkedIn

Phone / SMS / WhatsApp on the following 3 numbers:

+91-789-362-6688, +1-480-347-6849, +44-07718-273-964

+44-33-3303-1284 (Preferred number if calling from U.K, No WhatsApp)

kantikalyan@gmail.com, kantikalyan@outlook.com, admin@alightservices.com, kantikalyan.arumilli@alightservices.com, KArumilli2020@student.hult.edu, KantiKArumilli@outlook.com and 3 more rarely used email addresses – hardly once or twice a year.