As mentioned in several earlier blog posts, I have been building PodDB on the Microsoft .NET platform and Solr. Solr is built on top of Apache Lucene.
Lucene.Net is a high-performance .NET port of Apache Lucene for working with Lucene indexes directly, while SolrNet is a .NET client library for working with Solr. Solr is very customizable and fault-tolerant, and has several additional features available out of the box. Working with SolrNet can be a bit slow because every API call goes to Solr over HTTP, with the usual overhead of establishing a network connection and serializing and deserializing JSON or XML.
Over the past few days, I have been working on a small subset of documents (approximately 275 to 300, the same set that will be part of the Alpha release) and tweaking the settings for optimal search relevance. This required trying various Solr configurations, re-indexing the data and so on. The very first version of the data ingestion component (which does much more pre-processing than just ingesting into Solr) used to take approximately 10 minutes. Now the ingestion completes within 15 seconds, i.e. from about 600 seconds down to 15, roughly a 40x speedup (about 97.5% less time), and the gain came entirely from programming changes.
The trick used was one of the oldest in the book: batch processing. Instead of writing one document at a time into the MySQL database and into Solr, I rewrote the application to ingest in batches, and it became much faster.
Batching with multi-threading might be even faster.
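This has not been measured for PodDB, so the following is only a minimal sketch of what combining the two ideas might look like. `ParallelBatchIngest`, `maxParallelism` and the `sendBatch` delegate are hypothetical names, not part of SolrNet, and the sketch assumes that whatever `sendBatch` does is safe to call from several threads at once.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelBatchIngest
{
    // Hypothetical helper: sends already-built batches concurrently.
    // Assumption: sendBatch may be called from several threads at the same time.
    // Returns the total number of documents handed to sendBatch.
    public static int Run<T>(
        IEnumerable<IReadOnlyList<T>> batches,
        int maxParallelism,
        Action<IReadOnlyList<T>> sendBatch)
    {
        // Cap the number of batches in flight so the server is not flooded.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };
        var total = 0;

        Parallel.ForEach(batches, options, batch =>
        {
            sendBatch(batch);
            Interlocked.Add(ref total, batch.Count); // thread-safe running total
        });

        return total;
    }
}
```

Whether this actually helps depends on where the time goes: if the server is already the bottleneck, more concurrent requests may not make ingestion any faster.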
In SolrNet terms: instead of calling solr.Add() for each document, create the documents, hold them in a list and call solr.AddRange() once.
Similarly, batch the calls to solr.Commit() and solr.Optimize(), i.e. call those methods once for every 1000 or so documents rather than for every document.
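Assume doc1, doc2 and doc3 are Solr documents that need to be written. For example:
// NO: one call to Solr per document
solr.Add(doc1);
solr.Add(doc2);
solr.Add(doc3);
// YES: collect the documents first, then make one call for the whole list
var lst = new List<ENTITY>();
lst.Add(doc1);
lst.Add(doc2);
lst.Add(doc3);
solr.AddRange(lst);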
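For more than a handful of documents, the same idea can be wrapped in a small helper that sends the documents in fixed-size groups and commits once per group. This is a minimal sketch, not the actual PodDB ingestion code: `BatchIngest`, `batchSize`, `sendBatch` and `commit` are hypothetical names, and the Solr calls are passed in as delegates so that the helper itself has no dependency on SolrNet.

```csharp
using System;
using System.Collections.Generic;

public static class BatchIngest
{
    // Hypothetical helper, not part of SolrNet.
    // Hands documents to sendBatch in groups of at most batchSize and
    // calls commit once after each group. Returns the number of groups sent.
    public static int Run<T>(
        IEnumerable<T> docs,
        int batchSize,
        Action<IReadOnlyList<T>> sendBatch,
        Action commit)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var buffer = new List<T>(batchSize);
        var batches = 0;

        foreach (var doc in docs)
        {
            buffer.Add(doc);
            if (buffer.Count == batchSize)
            {
                sendBatch(buffer.ToArray()); // copy, because the buffer is reused
                commit();
                buffer.Clear();
                batches++;
            }
        }

        // Send whatever is left over after the last full group.
        if (buffer.Count > 0)
        {
            sendBatch(buffer.ToArray());
            commit();
            batches++;
        }

        return batches;
    }
}
```

With the solr object from the example above, a call might look like BatchIngest.Run(docs, 1000, batch => solr.AddRange(batch), () => solr.Commit()); where docs is whatever collection of documents is being ingested. With 2500 documents and a batch size of 1000, that makes three calls to AddRange and three commits instead of 2500 of each.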
I like to share knowledge, and I hope this blog post helps someone.
My 2 cents to the world of the blogosphere!
–
Mr. Kanti Kalyan Arumilli
B.Tech, M.B.A
Founder & CEO, Lead Full-Stack .Net developer
ALight Technology And Services Limited