As part of the development effort for PodDB, a search engine for podcasts and a product of ALight Technology And Services Limited, I have been deciding between Lucene.Net and Solr. I strongly suggest Solr over Lucene.Net if you want to scale. For smaller datasets that you know will not grow much, Lucene.Net shouldn’t be a problem and is, as a matter of fact, very efficient. But if you expect larger datasets and want sharding and replication out of the box, choose Solr. I do plan to scale PodDB if it gains traction, so I chose Solr.
But for the sake of knowledge sharing, in this article I am going to show how to use Lucene.Net for full-text indexing. I will not cover complex scenarios, but at the same time this article is NOT a Hello World for Lucene.Net.
Moreover, Lucene.Net does not seem to be under active development. As of this blog post’s date, September 13th, 2022, there have been no GitHub commits in the past 2 to 3 months. As software developers, technical leads, and architects, we have a responsibility to make proper choices for the underlying technology stack. Although ALight Technology And Services Limited is not an enterprise yet, I would still like to make decisions that hold up over the long term.
Now let’s dig into some code.
Lucene.Net version 4 is still in pre-release, so use the pre-release versions of the Lucene.Net packages.
Because Lucene.Net is in beta and there could be lots of breaking changes, the compatibility version needs to be declared in code.
private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
Now we specify the directory where we want the indexes to be written and some initialization code.
var dir = FSDirectory.Open(indexDirectory);
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
var writer = new IndexWriter(dir, indexConfig);
Now we use the IndexWriter to write documents. Two field types matter most for string data:
- TextField – The string data is analyzed (split into terms) and indexed for full-text search.
- StringField – The data is indexed as a single, un-analyzed value. It is not full-text searchable, but it can be matched exactly like a normal string, which suits fields such as an id.
Based on these types, decide which data needs full-text search, which data does not, and which values need to be stored in Lucene so they can be read back from search results.
var doc = new Document
{
    new TextField("Title", "Some Data", Field.Store.YES),
    new TextField("Description", "Description", Field.Store.YES),
    new StringField("Id", id, Field.Store.YES)
};
You can add as many TextField and StringField instances as needed. You can also create the TextField and StringField instances separately and call doc.Add() for each one.
If you want to tune how Lucene ranks results, you can set the Boost of a TextField. By default the Boost, i.e. the weight given to a field, is 1.0, but you can specify a higher weight for a certain field. For example, if a keyword appears in the title, you might want that document to rank higher.
Add the doc instance to the writer and call Flush():
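As a minimal sketch of the boosting idea, the snippet below sets the Boost property on the Title field before adding it to the document. The class name BoostExample, the method BuildDocument, and the value 2.0f are hypothetical choices made for illustration; pick a weight that suits your own data.

```csharp
using Lucene.Net.Documents;

public static class BoostExample
{
    // Builds a document whose Title field carries twice the default weight.
    // The 2.0f value is an arbitrary illustration, not a recommendation.
    public static Document BuildDocument(string id, string title, string description)
    {
        var titleField = new TextField("Title", title, Field.Store.YES)
        {
            Boost = 2.0f // the default is 1.0f
        };

        return new Document
        {
            titleField,
            new TextField("Description", description, Field.Store.YES),
            new StringField("Id", id, Field.Store.YES)
        };
    }
}
```

The resulting document can be passed to writer.AddDocument() like any other.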
writer.AddDocument(doc);
writer.Flush(triggerMerge: false, applyAllDeletes: false);
For speed and efficiency, add documents in batches and call Flush once per batch, instead of calling Flush for every document.
Assuming you have built your index, let’s now retrieve documents.
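The batching advice could look roughly like the sketch below: every document is added first and Flush is called once at the end. The names BatchIndexer and IndexAll, and the (Id, Title) tuple shape, are assumptions made for this example. The alias LuceneDirectory is only there to avoid a name clash with System.IO.Directory.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Util;
using LuceneDirectory = Lucene.Net.Store.Directory;

public static class BatchIndexer
{
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    // Adds all items to the index and flushes once at the end.
    // Returns the number of documents that were added.
    public static int IndexAll(LuceneDirectory dir, IEnumerable<(string Id, string Title)> items)
    {
        using var analyzer = new StandardAnalyzer(AppLuceneVersion);
        var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
        using var writer = new IndexWriter(dir, indexConfig);

        var added = 0;
        foreach (var (id, title) in items)
        {
            writer.AddDocument(new Document
            {
                new TextField("Title", title, Field.Store.YES),
                new StringField("Id", id, Field.Store.YES)
            });
            added++;
        }

        // One flush for the whole batch instead of one per document.
        writer.Flush(triggerMerge: false, applyAllDeletes: false);
        return added;
    }
}
```

For very large imports you might flush every few thousand documents rather than once at the very end; the right batch size depends on your memory budget.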
using var dir = FSDirectory.Open(indexPath);
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
using var writer = new IndexWriter(dir, indexConfig);
using var lreader = writer.GetReader(applyAllDeletes: true);
var searcher = new IndexSearcher(lreader);
var exactQuery = new PhraseQuery();
exactQuery.Add(new Term("Id", id));
var search = searcher.Search(exactQuery, null, 1);
var docs = search.ScoreDocs;
if (docs?.Length == 1)
{
    Document d = searcher.Doc(docs[0].Doc);
    var title = d.Get("Title");
}
The above source code retrieves a document by its Id; it is not a full-text search. The first few lines are standard initializers. Then we instantiate a PhraseQuery and specify that the search should happen on the “Id” field. If there is a match, we retrieve the Title of the matching document.
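For a lookup on a single exact value, a TermQuery is a common alternative to the PhraseQuery used above. The sketch below is one possible variant, not a replacement for the code in this article: it opens a read-only DirectoryReader instead of an IndexWriter, and the names IdLookup and FindTitleById are hypothetical. It assumes the index already exists in the given directory.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using LuceneDirectory = Lucene.Net.Store.Directory;

public static class IdLookup
{
    // Returns the stored Title of the document whose Id matches exactly,
    // or null when no such document exists.
    public static string? FindTitleById(LuceneDirectory dir, string id)
    {
        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);

        // StringField values are indexed as a single term, so an exact
        // TermQuery on "Id" matches the whole value.
        var query = new TermQuery(new Term("Id", id));
        var hits = searcher.Search(query, 1).ScoreDocs;

        return hits.Length == 1 ? searcher.Doc(hits[0].Doc).Get("Title") : null;
    }
}
```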
Now let’s see how we can search based on Title and Description as mentioned above:
using var dir = FSDirectory.Open(indexPath);
var analyzer = new StandardAnalyzer(AppLuceneVersion);
var indexConfig = new IndexWriterConfig(AppLuceneVersion, analyzer);
using var writer = new IndexWriter(dir, indexConfig);
using var lreader = writer.GetReader(applyAllDeletes: true);
var searcher = new IndexSearcher(lreader);
string[] fnames = { "Title", "Description" };
var multiFieldQP = new MultiFieldQueryParser(AppLuceneVersion, fnames, analyzer);
Query query = multiFieldQP.Parse("My Search Term");
var search = searcher.Search(query, null, 10);
Console.WriteLine(search.TotalHits);
ScoreDoc[] docs = search.ScoreDocs;
foreach (var scoreDoc in docs)
{
    // Load the stored fields of each matching document.
    Document d = searcher.Doc(scoreDoc.Doc);
    var Id = d.Get("Id");
    var Title = d.Get("Title");
    var Description = d.Get("Description");
}
In the above source code, the first few lines are the standard initializers. The fnames variable lists the fields on which the search should happen. We then instantiate a MultiFieldQueryParser to enable searching across multiple fields and build the query from the search term. Advanced boolean queries can also be created in this step. Next, the search is performed; we can specify how many documents the result should contain, and in this case we asked for 10. The rest of the code fetches the field values.
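As one possible illustration of a boolean query, the sketch below combines two TermQuery clauses: one word that must appear in Title and one word that must not appear in Description. The names BooleanQueryExample and Build are hypothetical. The terms are lower-cased by hand because StandardAnalyzer lower-cases text at index time, while a TermQuery is not analyzed.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class BooleanQueryExample
{
    // Matches documents whose Title contains requiredWord and whose
    // Description does not contain excludedWord.
    public static BooleanQuery Build(string requiredWord, string excludedWord)
    {
        var query = new BooleanQuery();
        query.Add(new TermQuery(new Term("Title", requiredWord.ToLowerInvariant())), Occur.MUST);
        query.Add(new TermQuery(new Term("Description", excludedWord.ToLowerInvariant())), Occur.MUST_NOT);
        return query;
    }
}
```

The returned query can be passed to searcher.Search() in place of the parsed query, for example searcher.Search(BooleanQueryExample.Build("podcast", "music"), null, 10).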
I am hoping this blog article helps someone.