Lucene.net vs. Indexing Services, or “How to conquer your fear of index management”

Long-winded background for this long-winded post can be found here and here. The short version is: I have an app that searches using a full-text search of a document repository consisting of Office docs and PDFs.

The current version uses Microsoft Indexing Service and it is all but obsolete. Which is fine with me for the time being because of the economics of the situation. Namely, the app isn’t big enough to warrant putting the effort into updating it just for the sake of the technology.

Two things happened recently though that made me decide it was time to update. First was Simone Chiaretta’s masterfully-timed tutorial series on getting started with Lucene.NET. The second was the boss discovering the current version doesn’t actually work.

By many orders of magnitude, the boss is the biggest user of this application. And recently, he went about searching for a relatively common term: R&D. He was met with a nicely formatted 500 Server Error page and asked me if I would be so bold as to fix it.

A day and a half later and I simply had no fix. The existing app uses a SQL query to search the Indexing Service which, at the time, I thought was tres clever. But after trying a dozen methods of escaping it, there was no way I could get it to accept an ampersand. Furthermore, I also discovered the page failed when including words like ‘AND’ and ‘OR’. This problem was fixable but required some parsing of the search term and again, the cost/benefit for making this sort of thing bullet-proof just wasn’t there.
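The kind of search-term parsing described above can be sketched in a few lines. This is a hedged illustration in Python rather than .NET, and it targets Lucene-style query syntax, not the Indexing Service SQL dialect. The function names are hypothetical, and the list of special characters is an assumption based on what Lucene's query parser treats as syntax; check it against the version you run.

```python
# Characters treated as syntax by Lucene's classic query parser.
# Assumption: this list matches your Lucene version; older releases
# may not treat every one of these (for example "/") as special.
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')

# Boolean operators are only recognized in uppercase.
OPERATORS = {"AND", "OR", "NOT"}


def escape_term(raw: str) -> str:
    """Backslash-escape every special character in a user-typed term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in raw)


def sanitize_query(raw: str) -> str:
    """Treat the whole input as literal words.

    Syntax characters are escaped, and bare AND/OR/NOT are lowercased
    so the parser sees them as ordinary words instead of operators.
    """
    words = []
    for word in raw.split():
        if word in OPERATORS:
            word = word.lower()
        words.append(escape_term(word))
    return " ".join(words)


if __name__ == "__main__":
    print(sanitize_query("R&D"))
    print(sanitize_query("cats AND dogs"))
```

Escaping only stops the parser from choking on the input. Whether a term like R&D is then actually findable depends on how the analyzer tokenizes it at index time and at query time.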

So when Simone’s tutorial started coming across my RSS feed, it just made sense to re-think the problem.

The main reason I resisted moving away from Indexing Service for so long is thus: I don’t need to manage the index myself. Once configured, the only thing I needed to do to add a document to the index was drop it into a folder.

But therein lay the problem: Because I don’t manage the index, I have no control over it. During my travels, I discovered I was getting false positives for some terms and that it was not returning all documents in some cases.

How did I discover this? I implemented Lucene and compared the results. If I discovered a discrepancy and it was because of how I indexed with Lucene, well, then I tweaked the indexing process. If the discrepancy was with Indexing Service…well, then I said, “&*%$ it! You’re getting replaced!”
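A comparison like that boils down to a set difference over the hits each engine returns for the same term. The sketch below is a hypothetical helper, not code from the app: it assumes each engine's results have already been reduced to a list of document paths, and it lowercases them on the assumption that the repository lives on a case-insensitive Windows file system.

```python
def compare_results(old_hits, new_hits):
    """Return (only_in_old, only_in_new) as sorted lists of document paths.

    old_hits: paths returned by the existing engine for one search term.
    new_hits: paths returned by the replacement engine for the same term.
    """
    # Assumption: paths differing only by case refer to the same file.
    old_set = {path.lower() for path in old_hits}
    new_set = {path.lower() for path in new_hits}
    return sorted(old_set - new_set), sorted(new_set - old_set)


if __name__ == "__main__":
    missing, extra = compare_results(["A.doc", "b.pdf"], ["a.doc", "c.rtf"])
    print("only in old engine:", missing)
    print("only in new engine:", extra)
```

Each entry in either list is a document to open by hand: it is either a false positive from one engine or a document the other engine failed to index.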

Another bit of fortuitousness came in the form of one Brian Donahue. One of my fears going into this was how I would get the text out of the documents in order to index it. I had waking nightmares of Windows API calls and IFilters, dreading having to deal with this. So when, after I outlined my pain on Twitter, Brian responded with, and I’m paraphrasing, “Here, take this code. It will do that for you and it WORKS RIGHT OUT OF THE BOX!”, I don’t mind admitting I developed an unhealthy infatuation with him for a short time after that.

In fact, it was through the process of managing the index myself with Lucene that I discovered one reason some documents weren’t getting indexed. They were RTF documents but they had a .doc extension so it was using the wrong IFilter to capture the text. It was through Brian’s code for extracting text that I figured this out. It threw errors on some documents and I couldn’t figure out why until I tried to Save As… when working with them in Word. Change the extension to its rightful .rtf and the indexing process hummed along. But with the Indexing Service, these documents simply weren’t indexed. No error, no notification. It’s possible a message was posted to the event log but that’s a little too passive even for me.
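Mislabeled files like these can be caught before indexing by looking at the first few bytes instead of trusting the extension. The following is a minimal sketch in Python, not Brian's code: the function names and the extension table are hypothetical, while the byte signatures themselves are the standard ones (RTF files start with `{\rtf`, legacy Office binaries with the OLE2 compound-file header, the newer Office formats with a ZIP header, and PDFs with `%PDF-`).

```python
from pathlib import Path

RTF_MAGIC = b"{\\rtf"                            # RTF files start with {\rtf
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")   # legacy .doc/.xls/.ppt
ZIP_MAGIC = b"PK\x03\x04"                        # .docx/.xlsx/.pptx are ZIPs
PDF_MAGIC = b"%PDF-"

# Hypothetical table: the container format each extension should have.
EXPECTED = {".doc": "ole2", ".xls": "ole2", ".ppt": "ole2",
            ".docx": "zip", ".xlsx": "zip", ".pptx": "zip",
            ".rtf": "rtf", ".pdf": "pdf"}


def sniff_format(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head.startswith(RTF_MAGIC):
        return "rtf"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def mismatched_extension(path) -> bool:
    """True when the file's content does not match what its extension claims."""
    p = Path(path)
    with p.open("rb") as f:
        head = f.read(8)
    expected = EXPECTED.get(p.suffix.lower())
    return expected is not None and sniff_format(head) != expected
```

Running `mismatched_extension` over the repository before indexing would flag an RTF file saved as .doc up front, instead of leaving it to fail quietly.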

I hope to have some more technical details but I want to wait until Simone is further along so I don’t duplicate. I’ll very likely piggy-back off a couple of his posts but to summarize: Lucene.NET rocks and should be used any time you have a button labeled “Search”, “Find”, “Locate”, or “Git it!” Fear not the index-management process because like WebForms, the problem isn’t hard enough that it needs to be abstracted.

Kyle the Co-located

  • John Zastrow

     Never mind.

  • John Zastrow

     Did the material from Brian Donahue ever get posted?

  • Herc

    I am in the same boat. 130 customers (sites) all relying on Index Server, buried in my app. Having Googled endlessly it looks like Lucene.Net is my only hope.

    Thanks for the post – I’m hoping this means it is all do-able.

    But more than that, I want to do auto-tagging and semantic type stuff, suggested search terms and all those 2010-type things, rather than 2000-type things.

    Any advice on any of this is warmly welcomed.

  • http://codebetter.com/members/kylebaley/default.aspx Kyle Baley

    Dean and Steve

    Will post it this week. Just had to get the go-ahead from Brian that it was publishable.

  • http://www.mindtouch.com Steve Bjorg

    We use Lucene.Net in our product as well. It’s an amazing OSS project. Highly recommended.

    Our approach to IFilters has been to use an external app written in C++ to do the heavy lifting. This keeps our .NET app from being affected by any errors in the COM layer. The code is available under GPL here:
    http://viewvc.mindtouch.com/public/dekiwiki/trunk/src/tools/mindtouch.deki.filter/

    However, I’d still be curious how it would be done purely in .NET. Is the code shared somewhere?

  • http://www.infovark.com Dean Thrasher

    We’ve used Lucene.NET in our product, and are very pleased with its performance and abilities. (Though I wish the project team would release an official, updated version.)

    I’m curious about the code used to avoid WinAPI and IFilters. We could certainly use that on our current project!

  • http://codebetter.com/members/kylebaley/default.aspx Kyle Baley

    I considered SharePoint *very* briefly as well. It would require too radical an architecture shift. If I were re-building the app from the ground up, I’d probably look into it more.

  • http://twitter.com/pseale Peter

    As someone who has customized the current SharePoint search engine, from what I’ve seen of Lucene.NET they match up about evenly. Lucene looks like it gives you much more granular control over everything, from how you crawl items to how you store the index to how you rank items in your index. On the SP side, SharePoint gives you a mostly-works scalable default infrastructure, especially important if you’re scaling to many servers or your search index takes several hours to build–something Lucene.NET doesn’t provide (not that I see).

    I’d blog the differences and similarities but I’m still too ignorant on the Lucene side. Summary is, it looks like you won’t miss any features moving from SharePoint to Lucene.NET.

  • http://kevin-berridge.blotspot.com Kevin Berridge

    Thanks for this post and the links. I was indexing documents the same way you were, and was in the same boat of wanting to find a better way but not being sure it was worth it. Now I’ll probably jump in and get it updated.

  • http://http.//www.lybecker.com/Blog/ Anders Lybecker

    I use both Lucene.Net and SQL Server FullText Search – both have their advantages, but you have greater control and more features with Lucene.Net.

    Let’s see what the future brings for Microsoft Enterprise Search with their purchase of the FAST search engine (http://www.microsoft.com/enterprisesearch/en/us/fast-customer.aspx).

    :-)
    Anders Lybecker

  • http://www.jondavis.net/techblog/ Jon Davis

    Agreed, Lucene.NET *rocks*, and it’s quite unfortunate that Microsoft hasn’t paid much attention to it from what I can see.