
Lucene with C#

While developing PodDB, a search engine for podcasts and a product of ALight Technology And Services Limited, I had to decide between Lucene.Net and Solr. For smaller datasets that you know will not grow much, Lucene.Net is not a problem and is in fact very efficient. If you need to scale to larger datasets and want sharding and replication out of the box, I strongly suggest Solr. I do plan to scale PodDB if it gains traction, so I chose Solr.

But for the sake of knowledge sharing, this article shows how to use Lucene.Net for full-text indexing. I will not go over complex scenarios, but at the same time this article is NOT a Hello World for Lucene.Net.

Moreover, Lucene.Net does not seem to be under active development. As of the date of this blog post, September 13th, 2022, there have been no GitHub commits in the past 2 to 3 months. As software developers, technical leads, and architects, we are responsible for making proper choices for the underlying technology stack. Although ALight Technology And Services Limited is not an enterprise yet, I would still like to make decisions that hold up over the long term.

Now let’s dig into some code.

Lucene.Net version 4.8 is in pre-release, so use the pre-release versions of the Lucene.Net NuGet packages.

Because Lucene.Net is in beta and there could be lots of breaking changes, the compatibility version needs to be declared in code.

private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

Now we specify the directory where we want the indexes to be written and some initialization code.

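// Namespaces used: Lucene.Net.Store, Lucene.Net.Analysis.Standard, Lucene.Net.Index, Lucene.Net.Util
// indexDirectory is the path of the folder that will hold the index files
var dir = FSDirectory.Open(indexDirectory);
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
var writer = new IndexWriter(dir, indexConfig);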

Now we use the IndexWriter to write documents. There are primarily 2 types of string fields that are important.

  1. TextField – The string data is analyzed (tokenized) and indexed for full-text search
  2. StringField – The data is indexed as a single exact value and is not analyzed for full-text search; use it for fields such as id that are matched verbatim

Based on these types, decide which data needs full-text search and which does not. Also decide whether each value needs to be stored in Lucene (Field.Store.YES) so that it can be read back from search results.

var doc = new Document
{
    new TextField("Title", "Some Data", Field.Store.YES),
    new TextField("Description", "Description", Field.Store.YES),
    new StringField("Id", id, Field.Store.YES)
};

You can add as many TextField and StringField instances as needed. You can also create separate instances of TextField and StringField and call doc.Add() for each one.

If you want to tune the ranking of the search results provided by Lucene, you can specify the Boost of a TextField. By default the Boost, i.e. the weight given to any field, is 1.0, but you can specify a higher weight for a certain field. For example, if a keyword appears in the title, you might want that match to count for more.
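As a minimal sketch of the boosting described above, the snippet below sets the Boost property on the Title field. The class name BoostExample, the method BuildDocument, and the weight 2.0 are assumptions chosen for illustration, not values from PodDB. Boost is only set on the TextField here because, as far as I know, StringField omits norms and rejects a boost other than 1.0.

```csharp
using Lucene.Net.Documents;

public static class BoostExample
{
    // Hypothetical helper: builds a document whose Title matches weigh more
    // than Description matches at scoring time.
    public static Document BuildDocument(string id, string title, string description)
    {
        var titleField = new TextField("Title", title, Field.Store.YES)
        {
            Boost = 2.0f // assumed weight; the default is 1.0
        };

        return new Document
        {
            titleField,
            new TextField("Description", description, Field.Store.YES),
            new StringField("Id", id, Field.Store.YES)
        };
    }
}
```

The right weight depends on your data, so treat 2.0 as a starting point and compare result rankings before settling on a value.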

Add the doc instance to the writer and call Flush():

writer.AddDocument(doc);
writer.Flush(triggerMerge: false, applyAllDeletes: false);

For speed and efficiency, add documents in batches and call Flush once per batch instead of after every document.
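The batching advice can be sketched roughly as follows. BatchIndexer, IndexAll, and the tuple shape of the input items are hypothetical names made up for this example. The directory type is written out in full as Lucene.Net.Store.Directory to avoid a clash with System.IO.Directory.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class BatchIndexer
{
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    // Hypothetical helper: adds every item to the index, then flushes once.
    // Returns the number of documents added.
    public static int IndexAll(
        Lucene.Net.Store.Directory dir,
        IEnumerable<(string Id, string Title, string Description)> items)
    {
        using var analyzer = new StandardAnalyzer(AppLuceneVersion);
        var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
        using var writer = new IndexWriter(dir, indexConfig);

        var count = 0;
        foreach (var item in items)
        {
            writer.AddDocument(new Document
            {
                new TextField("Title", item.Title, Field.Store.YES),
                new TextField("Description", item.Description, Field.Store.YES),
                new StringField("Id", item.Id, Field.Store.YES)
            });
            count++;
        }

        // One flush for the whole batch instead of one per document.
        writer.Flush(triggerMerge: false, applyAllDeletes: false);
        return count;
    }
}
```

For a quick experiment you could pass a new Lucene.Net.Store.RAMDirectory() instead of an FSDirectory, which keeps the index in memory.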

Assuming you have built your index, let’s now retrieve documents.

using var dir = FSDirectory.Open(indexPath);
var analyzer = new StandardAnalyzer(AppLuceneVersion);

var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
using var writer = new IndexWriter(dir, indexConfig);
using var lreader = writer.GetReader(applyAllDeletes: true);
var searcher = new IndexSearcher(lreader);

var exactQuery = new PhraseQuery();
exactQuery.Add(new Term("Id", id));
var search = searcher.Search(exactQuery, null, 1);
var docs = search.ScoreDocs;

if (docs?.Length == 1)
{
    Document d = searcher.Doc(docs[0].Doc);
    var title = d.Get("Title");
}

The above source code retrieves a document by Id; it is not a full-text search. The first few lines are standard initializers. Then we instantiate a PhraseQuery with a single term on the “Id” field, which matches the exact value indexed by the StringField. If there is a match, we retrieve the Title of the matching document.
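For an exact match on a single StringField value, a TermQuery is a common alternative to a one-term PhraseQuery. The sketch below is a variation on the lookup above, not the code used in PodDB. IdLookupExample and FindTitleById are hypothetical names, and the searcher is assumed to be built the same way as in the snippet above.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class IdLookupExample
{
    // Hypothetical helper: returns the stored Title of the document whose
    // Id field equals the given value, or null when nothing matches.
    public static string? FindTitleById(IndexSearcher searcher, string id)
    {
        var query = new TermQuery(new Term("Id", id));
        TopDocs hits = searcher.Search(query, 1);

        if (hits.ScoreDocs.Length == 0)
        {
            return null;
        }

        return searcher.Doc(hits.ScoreDocs[0].Doc).Get("Title");
    }
}
```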

Now let’s see how we can search based on Title and Description as mentioned above:

using var dir = FSDirectory.Open(indexPath);
var analyzer = new StandardAnalyzer(AppLuceneVersion);

var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
using var writer = new IndexWriter(dir, indexConfig);
using var lreader = writer.GetReader(applyAllDeletes: true);
var searcher = new IndexSearcher(lreader);

string[] fnames = { "Title", "Description" };
var multiFieldQP = new MultiFieldQueryParser(AppLuceneVersion, fnames, analyzer);

Query query = multiFieldQP.Parse("My Search Term");
var search = searcher.Search(query, null, 10);

Console.WriteLine(search.TotalHits);

ScoreDoc[] docs = search.ScoreDocs;
foreach (var scoreDoc in docs)
{
    // Load the stored fields of each matching document
    Document d = searcher.Doc(scoreDoc.Doc);

    var Id = d.Get("Id");
    var Title = d.Get("Title");
    var Description = d.Get("Description");
}

In the above source code, the first few lines are the standard initializers. The fnames variable lists the fields on which the search should happen. We then instantiate a MultiFieldQueryParser to enable searching on multiple fields and build the query by parsing the search term. Advanced boolean queries can also be created in this step. Next the search is performed; we can specify how many documents the result should contain, and in this case we asked for 10. The rest of the code fetches the field values.
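To illustrate the boolean queries mentioned above, here is a small self-contained sketch that indexes three made-up documents in memory and counts the matches for a query string. BooleanSearchExample, CountMatches, and the sample documents are assumptions for illustration only. The query text uses the classic query parser syntax, for example AND, OR, and a leading minus sign to exclude a term.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class BooleanSearchExample
{
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    // Hypothetical helper: builds a document with the same fields as above.
    private static Document Doc(string id, string title, string description) =>
        new Document
        {
            new TextField("Title", title, Field.Store.YES),
            new TextField("Description", description, Field.Store.YES),
            new StringField("Id", id, Field.Store.YES)
        };

    // Indexes three sample documents in memory and returns how many of them
    // match the given query text across Title and Description.
    public static int CountMatches(string queryText)
    {
        using var dir = new RAMDirectory();
        using var analyzer = new StandardAnalyzer(AppLuceneVersion);
        using var writer = new IndexWriter(dir, new IndexWriterConfig(AppLuceneVersion, analyzer));

        writer.AddDocument(Doc("1", "Lucene basics", "Full-text search in C#"));
        writer.AddDocument(Doc("2", "Solr basics", "Search server built on Lucene"));
        writer.AddDocument(Doc("3", "Cooking", "Nothing about search engines"));

        using var reader = writer.GetReader(applyAllDeletes: true);
        var searcher = new IndexSearcher(reader);

        string[] fnames = { "Title", "Description" };
        var parser = new MultiFieldQueryParser(AppLuceneVersion, fnames, analyzer);

        Query query = parser.Parse(queryText);
        return searcher.Search(query, 10).TotalHits;
    }
}
```

Calling CountMatches("lucene AND solr") requires both terms to appear somewhere in Title or Description, while CountMatches("lucene -solr") keeps only documents that mention Lucene and never mention Solr.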

I am hoping this blog article helps someone.


Lucene vs Solr

I played around with Lucene.Net and Solr. Solr is built on top of Lucene.

Lucene.Net is a port of the Lucene library, written in C#, for working with Lucene on the Microsoft .NET stack.

Lucene is a library from the Apache Software Foundation that provides full-text search capabilities. There are a few alternatives, such as Sphinx and the full-text search capabilities built into RDBMSs such as Microsoft SQL Server, MySQL, MariaDB, and PostgreSQL. However, full-text search in an RDBMS is not as efficient as Lucene.

Solr and Elasticsearch are built on top of Lucene. Elasticsearch is more suitable and efficient for time-series data.

Now let’s see more about Solr vs Lucene.

Solr provides additional features such as replication, a web app GUI, collecting and publishing metrics, and fault tolerance. Solr also provides HTTP REST-based APIs for management, adding documents, searching documents, and so on.

Working directly with Lucene gives more fine-grained control.

Because Solr provides REST-based APIs, there is the overhead of establishing an HTTP connection, formatting the requests, and JSON serialization and deserialization at both ends, i.e. the client making the call and the Solr server. Working directly with Lucene avoids this overhead.

If searching happens on the same server that holds the documents, working directly with Lucene might be more efficient, specifically with smaller datasets. If huge datasets and scaling are a concern, Solr might be the proper approach.

If the infrastructure calls for separate search servers that a group of application servers query for data, Solr might be more useful and easier because of its existing support for replication and HTTP APIs.

If performance is of the highest importance and fine-grained control is still needed, a custom-built application can expose the data from the search servers over a more efficient protocol such as gRPC. In that case, replication mechanisms also need to be custom-built.