A Simple Search Strategy to Beat Them All

Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.

I’m currently testing a ranking algorithm for an IR system built over the last few years. The answer set is based on a simple matching (SM) search strategy.

Do not mistake a simple matching strategy for a simplistic or basic search approach, as it can evolve into a most complex one.

Unlike classic Boolean searches (i.e., AND, OR, XOR), SM is suitable for constructing answer sets and subsets based on coordination levels, that is, on how many query terms each document matches. Add a supporting scoring function (TF-IDF derivatives, RSJ-PM, BM25, etc.) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.

Proper fine-tuning allows presenting end users with answer sets wherein AND results are accumulated at the top of the search results. As users move down the search results, they are presented with OR results, and the search experience is perceived as if the system expands the answer set by switching query modes.
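For readers who want to see the idea in code, here is a minimal sketch of ranking by coordination level. It is not my actual system: the function names, the toy documents, and the summed term-frequency score (a stand-in for TF-IDF, BM25, and the like) are all assumptions made for illustration.

```python
from collections import Counter


def coordination_level(query_terms, doc_terms):
    """Number of distinct query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))


def tf_score(query_terms, doc_terms):
    """Placeholder score (assumption): summed raw frequency of the query terms."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))


def rank(query, docs):
    """Sort matching docs by coordination level first, then by score."""
    q = query.lower().split()
    results = []
    for doc_id, text in docs.items():
        terms = text.lower().split()
        level = coordination_level(q, terms)
        if level > 0:  # documents matching no query term are left out
            results.append((doc_id, level, tf_score(q, terms)))
    # Highest level first, then highest score, then doc id for stable ties.
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return results


# Hypothetical toy collection.
docs = {
    "d1": "simple matching search strategy",
    "d2": "simple simple simple simple recipes",
    "d3": "matching search results with a search engine",
    "d4": "vector space model",
}

ranked = rank("simple matching search", docs)
for doc_id, level, score in ranked:
    print(doc_id, level, score)
```

The document matching all three terms (the AND-like result) comes first, the two-term match follows, and the one-term match comes last even though its raw score is the highest. That is the sense in which the list appears to slide from AND toward OR as the user scrolls down.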

I’ve also added a query reduction mechanism for discovering related searches. Brazilian Wax, nice!
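One simple way to picture query reduction is to drop terms from the original query and treat each shorter query as a candidate related search. The sketch below assumes exactly that. The function name and the min_terms parameter are hypothetical, and the real mechanism may work quite differently.

```python
from itertools import combinations


def reduced_queries(query, min_terms=1):
    """List every shorter query obtained by dropping one or more terms.

    Assumption: "reduction" means removing terms while keeping their order.
    """
    # Deduplicate terms while preserving their original order.
    terms = list(dict.fromkeys(query.lower().split()))
    out = []
    # Go from the longest reductions down to the shortest ones.
    for size in range(len(terms) - 1, min_terms - 1, -1):
        for combo in combinations(terms, size):
            out.append(" ".join(combo))
    return out


related = reduced_queries("simple matching search")
print(related)
```

The number of candidates grows exponentially with query length, so a real system would need to prune them, for example by keeping only reductions that return a non-empty answer set.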

In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.

The next step is to check whether, with a large corpus and a thesaurus, results compare favorably with results from search engines that claim to use semantics.

So far, mine is cost-effective and does not require extra libraries.

PS: I forgot to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!

6 Comments

  1. Thank you for stopping by.

    SM itself is not a new thing and is well described by Keith van Rijsbergen in Chapter 5 of his old book (http://www.dcs.gla.ac.uk/Keith/pdf/Chapter5.pdf).

    It can evolve into a more complex thing when combined with other ideas, as mentioned above. Some of these ideas are in the open. Others are not and should be kept locked.

    The test being conducted is to check whether search results coming from a blind, brute-force approach are perceived by end users as coming from a semantic oracle. If so, then it is possible to fool end users; i.e., to make them buy into a technology that does not exist.

    I suspect that many who claim to have a “semantic” technology are doing just that: fooling users, perhaps through other means.

  2. Thank you for the reference!
    As for fooling users: I know of cases where it is certainly true. A search engine, however, goes far beyond the relevance system itself. It is often more important to figure out all the ranking factors (e.g., spam or not spam) that are included in the ranking formula.
