I've just changed the search functionality of this blog to use Lucene.NET, with all of the quick, rich searching that it provides. It is a testament to whoever wrote the original blog engine search functionality that the change was as straightforward as it was. Below are the steps I took and an explanation of some of the code changes I made. I'm not going to explain what Lucene is or how it works; there are plenty of resources out there that already do this, so just use your favourite search engine to find them.
In terms of setting up Lucene.NET, all I had to do was download the latest version of the source code from the SVN repository, give it a strong name, build it and add a reference to the Lucene.NET dll in the BlogEngine.Core project. Because I compress some of the data that I add to the index, I also had to download SharpZipLib, which is what Lucene.NET uses for compression, and copy the dll to the bin folder of the BlogEngine.NET web project.
Other than marking BlogEngine.Core.Page as serializable, I only had to change one class: BlogEngine.Core.Search. In the Search class I basically just had to change the implementation of BuildCatalog() and BuildResultSet(). BuildCatalog() now creates an instance of IndexWriter and also a BinaryFormatter, as I need to serialize the items that I add to the index. Once I've built the index I commit and optimize it and then close the IndexWriter, which forces the index to be written to the in-memory directory that I passed to the constructor of IndexWriter. When writing this code I closely followed a CodeProject article by Andrew Smith, which contains more details of what each of the Lucene.NET classes does.
private static void BuildCatalog()
{
    OnIndexBuilding();
    lock (_SyncRoot)
    {
        // The formatter serializes each item so that it can be stored in the index.
        IFormatter formatter = new BinaryFormatter();
        IndexWriter indexWriter = new IndexWriter(_dir, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        foreach (Post post in Post.Posts)
        {
            if (!post.IsVisibleToPublic)
                continue;
            AddItem(post, indexWriter, formatter);
            // Rest of the code remains as it was before except for extra parameters passed to AddItem
            ...
        }
        // Persist the changes, merge the index segments and release the writer.
        indexWriter.Commit();
        indexWriter.Optimize();
        indexWriter.Close();
    }
    OnIndexBuild();
}
Most of the interesting Lucene functionality takes place in AddItem(), which needed to be completely rewritten because we are now using a Lucene.NET index rather than a hashmap. In AddItem() I add fields for the item ID, the item itself (serialized), the title, the item content and the item description. I will explain later in the article why I needed to add the ID. I added the item itself so that I can get it back to display in the search results. It would have been nice to store just the ID and use it to load the item again, but because we are storing different types of items (posts, pages, comments) that would have meant adding a type field as well. If my blog contained lots of posts and comments, this is an optimization that I might need to reconsider. When searching, we do NOT want to search on the serialized item, which is why Field.Index.NO is specified for it. You will also notice that Field.Store.COMPRESS is used so that the serialized data is compressed; this is why I had to download SharpZipLib (see above). The title, content and description fields are then added. These fields are searchable, so Field.Index.ANALYZED is passed into the field constructor. The final thing to note is that for non-comment items I call SetBoost(). This makes comments appear lower down the search results than other items, which is how the original search functionality behaved.
private static void AddItem(IPublishable item, IndexWriter writer, IFormatter formatter)
{
    Document doc = new Document();
    // Index the ID as a single un-analyzed term so that DeleteItem() can match it exactly.
    doc.Add(new Field(GUID_FLD, item.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    using (MemoryStream stream = new MemoryStream())
    {
        formatter.Serialize(stream, item);
        // ToArray() returns only the bytes written; GetBuffer() would also include unused buffer capacity.
        doc.Add(new Field(POST_FLD, Convert.ToBase64String(stream.ToArray()), Field.Store.COMPRESS, Field.Index.NO));
    }
    doc.Add(new Field(TITLE_FLD, item.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    string content = HttpUtility.HtmlDecode(Utils.StripHtml(item.Content));
    if (item is Comment)
    {
        // Make the comment author searchable as part of the content.
        content += HttpUtility.HtmlDecode(Utils.StripHtml(item.Author));
    }
    else
    {
        // Comments shouldn't be as prominent as other items in the search results.
        doc.SetBoost(NON_COMMENT_BOOST);
    }
    doc.Add(new Field(CONTENT_FLD, content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    doc.Add(new Field(DESCR_FLD, item.Description, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.AddDocument(doc);
}
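When a serialized MemoryStream is Base64-encoded for storage, it matters which accessor is used to read the bytes: GetBuffer() returns the stream's whole internal buffer, which may include unused capacity, while ToArray() returns only the bytes that were written. The following is a minimal sketch of the round trip using only the base class library. The StreamEncoding class and its method names are hypothetical and are not part of the blog's code.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class StreamEncoding
{
    // Encodes only the bytes actually written to the stream.
    public static string ToBase64(MemoryStream stream)
    {
        return Convert.ToBase64String(stream.ToArray());
    }

    // Decodes a Base64 string into a readable stream positioned at the start.
    public static MemoryStream FromBase64(string encoded)
    {
        return new MemoryStream(Convert.FromBase64String(encoded));
    }
}
```

For a stream that has had three bytes written to it, ToArray().Length is 3, whereas GetBuffer().Length is the buffer's capacity, which can be larger and so would pad the stored field.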
The BuildResultSet() method is where the searching takes place. To summarise, I use the search term to build a Lucene query for each of the fields I wish to search on (title, content and description). I call SetBoost() on the title and description queries so that items with matching titles, for example, are ranked higher in the search results than items with matching content; this again comes from the original implementation of searching. Once the queries are built, I use an IndexSearcher to search for matching items. This is currently hardcoded to return a maximum of 100 results, which should be plenty for now, but it could easily be turned into a new blog setting. Finally, I loop over the search results and deserialize the item field. The item is then added to the results list depending on whether it is a comment and whether the user has chosen to include comments in the search results.
private static List<Result> BuildResultSet(string searchTerm, bool includeComments)
{
    List<Result> results = new List<Result>();
    // Build one query per searchable field and combine them; a match in any field is enough.
    BooleanQuery query = new BooleanQuery();
    Query titleQuery = new QueryParser(Lucene.Net.Util.Version.LUCENE_CURRENT, TITLE_FLD, _analyzer).Parse(searchTerm);
    Query contentQuery = new QueryParser(Lucene.Net.Util.Version.LUCENE_CURRENT, CONTENT_FLD, _analyzer).Parse(searchTerm);
    Query descrQuery = new QueryParser(Lucene.Net.Util.Version.LUCENE_CURRENT, DESCR_FLD, _analyzer).Parse(searchTerm);
    // Title and description matches rank above plain content matches.
    titleQuery.SetBoost(TITLE_BOOST);
    descrQuery.SetBoost(DESCR_BOOST);
    query.Add(titleQuery, BooleanClause.Occur.SHOULD);
    query.Add(contentQuery, BooleanClause.Occur.SHOULD);
    query.Add(descrQuery, BooleanClause.Occur.SHOULD);
    IFormatter formatter = new BinaryFormatter();
    IndexSearcher searcher = new IndexSearcher(_dir, true);
    // Return at most 100 hits.
    TopDocs docs = searcher.Search(query, new CachingWrapperFilter(new QueryWrapperFilter(query)), 100);
    foreach (ScoreDoc score in docs.scoreDocs)
    {
        IPublishable item = null;
        Document document = searcher.Doc(score.doc);
        // Decode and deserialize the item that AddItem() stored in the index.
        byte[] rawData = Convert.FromBase64String(document.GetField(POST_FLD).StringValue());
        using (MemoryStream stream = new MemoryStream(rawData))
        {
            item = formatter.Deserialize(stream) as IPublishable;
        }
        if (item != null)
        {
            Result result = new Result();
            result.Item = item;
            result.Rank = (int)Math.Round(score.score);
            if (!(item is Comment) || includeComments)
            {
                results.Add(result);
            }
        }
    }
    // Release the searcher once all of the matching documents have been read.
    searcher.Close();
    return results;
}
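The Rank assignment above rounds the raw Lucene score to an int. Lucene scores are floating-point values and can be fractional, so several different scores may round to the same rank; Math.Round(0.42f), for example, is 0. The following is a minimal sketch of one alternative, assuming the rank only needs to be relative to the best hit. The RankScaling class, the ToRank method and the scale of 100 are all hypothetical and are not from the original code.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class RankScaling
{
    // Maps a raw score to an integer rank relative to the highest score,
    // so that the best hit receives a rank equal to 'scale'.
    public static int ToRank(float score, float maxScore, int scale)
    {
        if (maxScore <= 0f)
        {
            return 0; // No meaningful top score to scale against.
        }
        return (int)Math.Round(scale * (double)score / maxScore);
    }
}
```

With a scale of 100, a hit that scores half as much as the best hit gets a rank of 50 rather than being rounded down to 0. In this version of Lucene.NET the top score should be available from the TopDocs object (GetMaxScore()), but treat that as an assumption to check against the library you build.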
Those are pretty much all of the changes that needed to be made to get Lucene.NET integrated into BlogEngine.NET. There were a couple of other changes I chose to make, however. I had to create an overloaded version of AddItem() which takes just a single item parameter and creates the BinaryFormatter and IndexWriter itself; this is called when a new post or other item is added. I also chose to create a delete method and an update method, which weren't implemented originally but were very straightforward to write using Lucene.NET, and which improve the performance of editing or deleting a post. The update method simply calls DeleteItem() and then AddItem(). The DeleteItem() method is the reason that I needed to add the ID field to the index: the ID is used to find the item to delete. The implementation of DeleteItem() is shown below.
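private static void DeleteItem(IPublishable item)
{
    // Open the reader in read-write mode (readOnly = false) so that it can delete documents.
    IndexReader reader = IndexReader.Open(_dir, false);
    // Deletes every document whose ID field contains exactly this term.
    reader.DeleteDocuments(new Term(GUID_FLD, item.Id.ToString()));
    reader.Commit();
    reader.Close();
}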
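The update method described above works as a delete followed by an add because a Lucene index does not enforce unique IDs: adding an item again without deleting it first would leave two documents with the same ID. The following toy sketch shows that pattern with a plain in-memory list standing in for the index. The ToyIndex class and its members are hypothetical, and it is not how the blog's Search class is implemented.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for an index: like Lucene, it does not enforce unique IDs.
public class ToyIndex
{
    private readonly List<KeyValuePair<string, string>> _docs =
        new List<KeyValuePair<string, string>>();

    // Appends a document; an existing document with the same ID is left in place.
    public void Add(string id, string title)
    {
        _docs.Add(new KeyValuePair<string, string>(id, title));
    }

    // Removes every document whose ID matches exactly and returns how many were removed.
    public int Delete(string id)
    {
        return _docs.RemoveAll(d => d.Key == id);
    }

    // Update is a delete followed by an add, so only one copy remains.
    public void Update(string id, string title)
    {
        Delete(id);
        Add(id, title);
    }

    // Counts the documents stored under the given ID.
    public int Count(string id)
    {
        return _docs.FindAll(d => d.Key == id).Count;
    }
}
```

Lucene's IndexWriter also offers UpdateDocument(Term, Document), which performs the same delete-then-add in a single call; whether it fits here depends on how the writer is created and shared.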
To try it out, just browse to the search page. If you are interested in using this code then feel free to download it; just remember to mark the Page class as serializable.