Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.
I’m currently testing a ranking algorithm for an IR system I have built over the last few years. The answer set is based on a simple matching (SM) search strategy.
Do not mistake a simple matching strategy for a simplistic or basic search approach; it can evolve into a very complex one.
Unlike classic boolean searches (e.g., AND, OR, XOR), SM is suitable for constructing answer sets and subsets based on coordination levels, that is, on how many of the query terms each document matches. Add a supporting scoring function (TF-IDF derivatives, RSJ-PM, BM25, etc.) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.
With proper fine-tuning, end users are presented with answer sets in which AND results accumulate at the top of the search results. As users move down the list, they are presented with OR results, so the search experience is perceived as if the system were expanding the answer set by switching query modes.
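To make the idea concrete, here is a minimal sketch of ranking by coordination level. It is not my actual implementation: the function name `rank_by_coordination`, the toy documents, and the raw term-frequency sum used as the scoring function are all assumptions for illustration (a TF-IDF or BM25 score could be dropped in instead).

```python
from collections import Counter

def rank_by_coordination(query, docs):
    """Rank docs by coordination level, then by a stand-in score.

    Coordination level = number of distinct query terms a document matches.
    The score here is a plain term-frequency sum (an assumption; any
    scoring function could replace it).
    """
    terms = set(query.lower().split())
    ranked = []
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        matched = [t for t in terms if tf[t] > 0]
        level = len(matched)
        if level == 0:
            continue  # document matches no query term, so it is not in the answer set
        score = sum(tf[t] for t in matched)
        ranked.append((doc_id, level, score))
    # Highest coordination level first, then highest score, then doc id
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked

# Hypothetical toy collection
docs = {
    "d1": "brazilian wax salon prices",
    "d2": "brazilian coffee beans",
    "d3": "car wax and polish wax",
    "d4": "weather report",
}
results = rank_by_coordination("brazilian wax", docs)
```

In this toy run, the document matching both terms (the AND result) lands first, and the documents matching only one term (the OR results) follow, ordered by the stand-in score.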
I’ve also added a query reduction mechanism for discovering related searches. Brazilian Wax, nice!
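I won't go into the details of my mechanism here, but one simple way query reduction could work is to drop terms from the original query and treat each shorter query as a candidate related search. The sketch below assumes exactly that; the function name `reduced_queries` and the example query are hypothetical.

```python
from itertools import combinations

def reduced_queries(query):
    """Return every shorter query obtained by dropping one or more terms.

    Longer sub-queries come first; term order within each sub-query
    follows the original query.
    """
    terms = query.lower().split()
    out = []
    for size in range(len(terms) - 1, 0, -1):
        for combo in combinations(terms, size):
            out.append(" ".join(combo))
    return out

related = reduced_queries("cheap brazilian wax")
```

Each candidate could then be run through the same matching and ranking step, keeping only those that return a non-empty answer set.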
In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.
The next step is to check whether, with a large corpus and a thesaurus, results compare favorably with those from search engines that claim to use semantics.
So far, mine is cost-effective and does not require any extra libraries.
PS: I forgot to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!