Search functionality: heavy on the research, light on the development

The word “and” has always bugged me. I hate starting sentences with it but sometimes can’t help myself. Whenever I have a list of three or more items in a sentence, I can never tell whether there should be a comma before the “and” separating the last two items. And Plus it causes me no end of grief in search interfaces.

The impetus behind my most recent foray into Lucene.NET was one query phrase in particular:

research and development

Specifically, I want to be able to find documents that contain this phrase. Not both the word “research” AND the word “development” but the phrase “research and development”. And Also, preferably it would return documents that contained “research & development” or, if you *really* want to impress someone, “r&d”.

In the spirit of my search term, I’ve been doing some research and development to try to figure this little query out. More of the latter at first but it’s been increasingly obvious that to do a decent search interface, you also need plenty of the former. To that end, this will be a typically epic free-form meandering of my process with my usual caveat: If any of this is useful to you, that’s not my fault. I won’t go into much detail here because: a) Simone Chiaretta will almost certainly cover it shortly (if he hasn’t already), and b) there is plenty of documenta—…actually, that’s not true. Oh well, I’m still not covering the inner workings of parsers and analyzers.

QueryParser.Parse

By all accounts, QueryParser is the class to use when dealing with user-entered input. You can use a fairly easy-to-learn syntax and let Lucene handle the heavy lifting of whether to search for an entire phrase or individual words. It also includes a way of parsing AND, OR, or NOT.

This has the most appeal to me for obvious reasons so it was the one I settled on first. Then came “research and development”. Searching for the phrase with quotes around it came back with false positives, i.e. documents that contained either research or development. So I halted the development and started some research.

StandardAnalyzer

This led to much reading about Analyzers. (And So I’ll echo many others’ sentiments by recommending the book, Lucene in Action, which has been a great resource.) I started out indexing and searching with the StandardAnalyzer. But this has a couple of side effects. For one, when indexing, it strips out common stop words, like the, a, and an. And As well as and.

On the search side, it will also do some parsing of the query phrase when you use it with QueryParser.Parse. In short, when you search for “research and development” (with quotes) using a StandardAnalyzer, the query is parsed to the following:

contents:"research development"

I.e. the “and” is taken out of the search phrase altogether. Not quite what I had in mind so a new track was laid.
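If it helps to see the effect spelled out, here is a toy model in Python of what a stop-word-removing analyzer does to that phrase. To be clear, this is not Lucene.NET code and not how the StandardAnalyzer is actually implemented. The function name and the abbreviated stop list are my own assumptions, purely for illustration.

```python
import re

# Abbreviated stop list for illustration only (an assumption);
# the real analyzer ships with a longer one.
STOP_WORDS = {"a", "an", "and", "or", "the"}


def analyze_with_stop_words(text):
    """Lowercase the text, split it into word tokens, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# The "and" never makes it into the token list.
print(analyze_with_stop_words("research and development"))
# prints ['research', 'development']
```

Since the same filtering happens to both the indexed text and the quoted query, the phrase being searched for is effectively "research development".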

SimpleAnalyzer

The SimpleAnalyzer indexes everything. Every word (and every position of every word if you tell it to). Obviously, the size of your index will grow considerably. In my testing ground, it quintupled in size from 19 MB to 100 MB based on 1600-odd Word and PDF documents.

On the search side, if you use a SimpleAnalyzer with the QueryParser, it does correctly identify the phrase “research and development” when you include it in quotes. So all appears happy and good…

…except that it doesn’t handle “r&d” (with or without the quotes) very well. The query is reduced to:

contents:"r d"

I.e. Find all documents with the letter r and the letter d as individual letters in them. Which, truth be told, isn’t such a bad thing on the surface. It means we’ll catch not only documents containing “r&d” but also those containing “R & D”. But by the same token (pun intended), it will also match documents containing “R. Buford D. Justice”.
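A rough way to picture why: the tokenizing here amounts to "lowercase everything and keep only the runs of letters". The Python below is a toy stand-in for that behaviour, not the SimpleAnalyzer itself. The function name is made up, and restricting it to ASCII letters is a simplifying assumption on my part.

```python
import re


def letter_tokens(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


for sample in ["r&d", "R & D", "R. Buford D. Justice"]:
    print(sample, "->", letter_tokens(sample))
```

All three samples end up with a standalone r token and a standalone d token, which is how the good sheriff sneaks into the results.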

PhraseQuery

Another option I looked at was the PhraseQuery. If you use this, it will always search for the exact phrase. None o’ this “research development” or “r d” nonsuch.

But here, the analyzers come into play as well. If I search for “research and development”, that means the word and needs to be indexed. Which implies a SimpleAnalyzer during indexing. If I search for “r&d”, the SimpleAnalyzer won’t work because it breaks up words separated by an ampersand.
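The dependency on indexing is easier to see with a toy exact-phrase matcher. This is a sketch in plain Python over lists of tokens, not Lucene's PhraseQuery. The function name and the sample token lists are hypothetical.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens appears as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


phrase = ["research", "and", "development"]

# What the index might hold if "and" was kept at indexing time...
indexed_with_and = ["our", "research", "and", "development", "team"]
# ...versus an index where stop words were thrown away.
indexed_without_and = ["our", "research", "development", "team"]

print(contains_phrase(indexed_with_and, phrase))     # prints True
print(contains_phrase(indexed_without_and, phrase))  # prints False
```

An exact phrase can only match on tokens that were actually indexed, which is why the choice of analyzer at indexing time matters just as much as the query.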

From here…

That brings everyone up to speed to where I am now. I’ve posed the question on StackOverflow (my first!) and at the moment, the only answer to it suggests I write my own analyzer, one that acts like the StandardAnalyzer but doesn’t throw out the word and. That sounds reasonable to me, at least until someone searches for “research or development”.
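For what it’s worth, the idea boils down to "same analysis, shorter stop list". Here is a toy Python sketch of that, with the usual caveats: it is not a Lucene.NET analyzer, the helper names are invented, and the stop list is an abbreviated assumption.

```python
import re

# Abbreviated stop list for illustration only (an assumption).
DEFAULT_STOP_WORDS = {"a", "an", "and", "or", "the"}


def make_analyzer(stop_words):
    """Build an analyze(text) function that drops the given stop words."""
    def analyze(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [t for t in tokens if t not in stop_words]
    return analyze


# Same behaviour as before, except "and" is no longer a stop word.
keep_and = make_analyzer(DEFAULT_STOP_WORDS - {"and"})

print(keep_and("research and development"))
# prints ['research', 'and', 'development']
print(keep_and("research or development"))
# prints ['research', 'development']
```

The second line of output is the catch: every word rescued from the stop list fixes one phrase and leaves the next one still broken.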

Another option I’m considering is to tell the indexer to index specific phrases like “research and development” or “oil and gas” or other common ones used in the domain. Not sure I like the long-term maintenance of either option but search is a journey, not a destination, I suppose.
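One way to picture the phrase idea: before tokenizing, swap every known spelling of a domain phrase for a single made-up token, and do the same to the query. The Python below is only a sketch of that approach, not something Lucene.NET provides in this form. The alias table, the underscore tokens and the function names are all hypothetical.

```python
import re

# Hypothetical domain phrases, each mapped to one canonical token.
PHRASE_ALIASES = {
    "research and development": "research_and_development",
    "research & development": "research_and_development",
    "r&d": "research_and_development",
    "r & d": "research_and_development",
    "oil and gas": "oil_and_gas",
}


def canonicalize(text):
    """Replace known phrases with their canonical token before tokenizing."""
    result = text.lower()
    # Longest aliases first so a long phrase is never cut up by a shorter one.
    for alias in sorted(PHRASE_ALIASES, key=len, reverse=True):
        # Only match when the alias is not glued to other letters or digits.
        pattern = r"(?<![a-z0-9])" + re.escape(alias) + r"(?![a-z0-9])"
        result = re.sub(pattern, PHRASE_ALIASES[alias], result)
    return result


def tokens(text):
    """Canonicalize, then split into tokens (underscores are kept)."""
    return re.findall(r"[a-z0-9_]+", canonicalize(text))


print(tokens("Our R&D budget"))
print(tokens("Research & Development in oil and gas"))
print(tokens("R. Buford D. Justice"))
```

The first two samples each produce the canonical token while the third does not. The price is exactly the maintenance worry mentioned above: somebody has to keep that alias table current.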

There’s a fundamental argument buried in here somewhere. Lucene gives you so much control over your indexing/searching that if you’re one of those Type A’s that can’t stand when something is just “good enough”, you can very easily drive yourself up the wall trying to optimize things. It really does require you to put some thought into how users will use your search. As much as Microsoft Indexing Services allowed me to throw up a search interface haphazardly, I believe you do yourself an injustice by not considering the ins and outs of the process.

 

By the way, a couple people asked about the code I used to extract text from Word and PDF docs. Lovingly provided by the venerable (and I hope I used the right word there) Brian Donahue, the relevant classes are attached in their entirety. The only thing different about this code snippet compared to others I’ve been sent in the past is that this one worked out of the box with absolutely no help from me. Seriously, I can’t even tell you what the internal method names are, that’s how little I looked at it. Call Parser.Parse(filename) and watch the magic fly.

Filter.cs
Parser.cs

Kyle the Found-ational

This entry was posted in Lucene.NET.
  • http://www.infovark.com Dean Thrasher

    Hey Kyle, thanks for posting the code and your experiences with Lucene. We’ve been using Lucene.NET on our project for some time. It’s worked well for us — once we extract the text we need.

  • http://codebetter.com/members/kylebaley/default.aspx Kyle Baley

    Nik: First lesson of providing basic English lessons: No matter how hard you try, you *always* sound patronizing doing it.

    That’s exactly why I don’t like starting sentences with “and”. But this ain’t a dissertation. I’m willing to give up a little “correctness” for the sake of conversational style. Same reason I don’t mind ending sentences with a preposition occasionally even though I die a little inside for every sentence I do it in.

    Yes, I could have skipped both the And and the Also in that sentence. But then it wouldn’t have been as amusing to me.

  • http://anothercodingcatastrophe.co.uk Nik Radford

    Basic English lesson: You NEVER start sentences with “and”. Also, the last two elements in a comma separated list will always be “and” or “or” depending on the context of the list. Example: I have red, blue, green and yellow marbles.

    In the first paragraph, you have a section that looks like this “word “development” but the phrase “research and development”. And Also, preferably it” (with the and struck through) but both the words “And” and “Also” are redundant here. You could have started the sentence merely with “Preferably”.

    Don’t mean to be patronizing or anything, just pointing out that you hate starting sentences with “and” because sentences should never start with “and”.

  • http://twitter.com/BlackTigerX Eber Irigoyen

    interesting… so Lucene doesn’t “just work”…

  • Ryan Roberts

    Are false positives a show stopper for your application?

    I have found it much easier to control ranking than it is to create complex analysers. If you reject results below a threshold you can control a lot of false positives.

    Have you tried the quick and dirty way of handling your and problem by using a proximity search, “research and development”~2? That will match anywhere the 2 terms appear close together, and the standard analyser won’t break things.

  • Duncan Godwin

    Have you had a look at Solr (http://lucene.apache.org/solr/features.html)? It provides a good deal of features added on top of Lucene such as more like this, synonym support and facets. It makes Lucene into more of a full-fledged search engine.

    There is a good library available for .NET:
    http://code.google.com/p/solrnet/