Some performance tips when ingesting documents into Solr

As mentioned in several earlier blog posts, I have been building PodDB on the Microsoft .NET platform and Solr. Solr is built on top of Apache Lucene.

Lucene.Net is a high-performance .NET port of Apache Lucene that runs in-process, while SolrNet is a client library for working with Solr. Solr is very customizable and fault-tolerant, and offers several additional features out of the box. Working with SolrNet can be a bit slow because every API call is routed through Solr's REST API, which brings the usual overhead of establishing a network connection and serializing and deserializing JSON or XML.

Over the past few days, I have been working on a small subset of documents (approximately 275 to 300; the same set will be part of the Alpha release) and trying to tweak the settings for optimal search relevance. This required trying various Solr configurations, re-indexing data, and so on. The very first version of the data ingestion component (which does much more pre-processing than just ingesting into Solr) used to take approximately 10 minutes. Now that it has been optimized, ingestion completes within 15 seconds, i.e. roughly 40 times faster (600 seconds down to 15), and the gain came entirely from changes in the code.

The trick used was one of the oldest tricks in the book: batch processing. Instead of writing one document at a time into the MySQL database and into Solr, I rewrote the application to ingest in batches, and it became much faster.

Batching with multi-threading might be even faster.

In other words, instead of calling solr.Add() for each document, create the documents, hold them in a list, and call solr.AddRange() once.

Similarly, batch the calls to solr.Commit() and solr.Optimize(), i.e. call those methods once for every 1000 or so documents rather than after every document.

For example, assuming doc1, doc2 and doc3 are Solr documents that need to be written:

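// NO: each call sends its own request to Solr
solr.Add(doc1);
solr.Add(doc2);
solr.Add(doc3);

// YES: collect the documents first, then send them in one request
// (ENTITY stands for your own mapped document type)
var lst = new List<ENTITY>();
lst.Add(doc1);
lst.Add(doc2);
lst.Add(doc3);

solr.AddRange(lst);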
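
The advice about committing once per 1000 or so documents can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual PodDB ingestion code: IBatchIndexer<T>, CountingIndexer<T> and BatchIngest are hypothetical names introduced here so the example runs without a Solr server. The interface mirrors just the AddRange() and Commit() method names that SolrNet exposes, and Enumerable.Chunk assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the small part of the Solr client used here.
// SolrNet's ISolrOperations<T> has methods with the same names.
public interface IBatchIndexer<T>
{
    void AddRange(IEnumerable<T> docs);
    void Commit();
}

// In-memory fake that only counts calls, so the sketch runs without Solr.
public sealed class CountingIndexer<T> : IBatchIndexer<T>
{
    public int AddRangeCalls { get; private set; }
    public int CommitCalls { get; private set; }
    public int DocumentCount { get; private set; }

    public void AddRange(IEnumerable<T> docs)
    {
        AddRangeCalls++;
        DocumentCount += docs.Count();
    }

    public void Commit() => CommitCalls++;
}

public static class BatchIngest
{
    // Sends documents in chunks of batchSize and commits once per chunk,
    // instead of adding and committing one document at a time.
    public static void Ingest<T>(IBatchIndexer<T> indexer, IEnumerable<T> docs, int batchSize = 1000)
    {
        foreach (T[] batch in docs.Chunk(batchSize)) // Enumerable.Chunk: .NET 6+
        {
            indexer.AddRange(batch);
            indexer.Commit();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var indexer = new CountingIndexer<string>();
        var docs = Enumerable.Range(1, 2500).Select(i => $"doc-{i}");

        BatchIngest.Ingest(indexer, docs, batchSize: 1000);

        // 2500 documents in chunks of 1000 -> chunks of 1000, 1000 and 500
        Console.WriteLine($"{indexer.DocumentCount} docs, {indexer.AddRangeCalls} AddRange calls, {indexer.CommitCalls} commits");
        // prints "2500 docs, 3 AddRange calls, 3 commits"
    }
}
```

With a real SolrNet client, the same loop would call solr.AddRange(batch) and solr.Commit() directly. The batch size of 1000 is just the figure mentioned above and is worth tuning against your own document sizes.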

I like to share knowledge, and I hope this blog post helps someone.

My 2 cents to the world of the blogosphere!

–

Mr. Kanti Kalyan Arumilli

Arumilli Kanti Kalyan, Founder & CEO

B.Tech, M.B.A

Founder & CEO, Lead Full-Stack .Net developer

ALight Technology And Services Limited

ALight Technologies USA Inc

