A Simple Search Strategy to beat them all

By egarcia. December 1, 2009. Filed under: Data Mining, Machine Learning, Programming, Queries, Vector Space Models

Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.

I’m currently testing a ranking algorithm for an IR system built over the last few years. The answer set is based on a simple matching (SM) search strategy.

Do not mistake a simple matching strategy for a simplistic or basic search approach: it can evolve into a very complex one.

Unlike classic Boolean searches (e.g., AND, OR, XOR), SM is suitable for constructing answer sets and subsets based on coordination levels, that is, on how many query terms each document matches. Add a supporting scoring function (tf-idf derivatives, RSJ-PM, BM25, etc.) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.

With proper fine-tuning, end users are presented with answer sets in which AND results accumulate at the top of the search results. As users move down the list, they are presented with OR results, so the search experience is perceived as if the system were expanding the answer set by switching query modes.
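To make the idea concrete, here is a minimal sketch of ranking by coordination level with a supporting score. It is not the algorithm being tested in this post: the function names (`rank_by_coordination`, `query_mode`), the whitespace tokenizer, and the particular tf-idf variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real system would do more).
    return text.lower().split()


def query_mode(level, n_query_terms):
    # A document matching every query term behaves like an AND result;
    # anything matching fewer terms behaves like an OR result.
    return "AND" if level == n_query_terms else "OR"


def rank_by_coordination(query, docs):
    """Return (doc_id, level, score, mode) tuples, best first.

    level = number of distinct query terms found in the document
            (the coordination level).
    score = a simple tf-idf style sum over matched terms, used only
            to order documents inside the same level.
    """
    q_terms = set(tokenize(query))
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    results = []
    for doc_id, tokens in doc_tokens.items():
        tf = Counter(tokens)
        matched = [t for t in q_terms if t in tf]
        level = len(matched)
        if level == 0:
            continue  # not part of the answer set
        score = sum(tf[t] * math.log(1 + n_docs / df[t]) for t in matched)
        results.append((doc_id, level, score, query_mode(level, len(q_terms))))

    # Highest coordination level first, then highest score, then doc_id.
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return results


docs = {
    "d1": "cheap flights to madrid",
    "d2": "cheap hotels in madrid madrid",
    "d3": "flights to paris",
    "d4": "weather in london",
}
for doc_id, level, score, mode in rank_by_coordination("cheap flights madrid", docs):
    print(doc_id, level, mode, round(score, 3))
```

Sorting on the coordination level before the score is what pushes the AND-like results to the top and lets the OR-like results follow, without ever running two separate Boolean queries.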

I’ve also added a query reduction mechanism for discovering related searches. Brazilian Wax, nice!
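The post does not describe how the reduction works, so the following is only one plausible sketch: drop terms from the query, keep the shorter queries that still match at least one document on all of their terms, and offer those as related searches. The name `reduced_queries` and its `min_terms` parameter are hypothetical.

```python
from itertools import combinations


def reduced_queries(query, docs, min_terms=1):
    """Return (sub_query, hit_count) pairs for shorter versions of the query.

    Longer sub-queries come first; a sub-query is kept only if at least
    one document contains all of its terms.
    """
    # Deduplicate query terms while preserving their order.
    terms = list(dict.fromkeys(query.lower().split()))
    doc_sets = [set(text.lower().split()) for text in docs]

    suggestions = []
    for size in range(len(terms) - 1, min_terms - 1, -1):
        for combo in combinations(terms, size):
            hits = sum(1 for s in doc_sets if s.issuperset(combo))
            if hits:
                suggestions.append((" ".join(combo), hits))
    return suggestions


docs = [
    "cheap flights to madrid",
    "cheap hotels in madrid",
    "flights to paris",
]
for sub_query, hits in reduced_queries("cheap flights madrid", docs):
    print(sub_query, hits)
```

Enumerating every subset grows exponentially with query length, so a sketch like this is only reasonable for the short queries typical of end users.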

In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.

The next step is to check whether, with a large corpus and a thesaurus, results compare favorably with those from search engines that claim to use semantics.

So far, mine is cost-effective and does not require extra libraries.

PS: I forgot to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!

6 Comments

itman1975 says:

December 1, 2009 at 6:13 pm

Hi, excellent post. Could you elaborate a little on the strategy you use?

E. Garcia says:

December 1, 2009 at 9:20 pm

Thank you for stopping by.

SM itself is not new and is well described by Keith van Rijsbergen in Chapter 5 of his classic book, Information Retrieval (http://www.dcs.gla.ac.uk/Keith/pdf/Chapter5.pdf).

It can evolve into something more complex when combined with other ideas, as mentioned above. Some of these ideas are in the open. Others are not and should be kept locked.

The test being conducted is to check whether search results coming from a blind, brute-force approach are perceived by end users as coming from a semantic oracle. If so, then it is possible to fool end users, i.e., to make them buy into a technology that does not exist.

I suspect that many who claim to have a “semantic” technology are doing just that: fooling users, perhaps through other means.

itman1975 says:

December 1, 2009 at 9:47 pm

Thank you for the reference!
As for fooling users: I know of cases where that is certainly true. A search engine, however, goes far beyond the relevance system itself. It is often more important to figure out all the other ranking factors (e.g., spam vs. not spam) that are included in the ranking formula.

E. Garcia says:

December 1, 2009 at 11:27 pm

Indeed. Been there, done that.

danfidalgo says:

April 12, 2010 at 9:28 am

I am also currently developing an IR system, one that uses an ontology as its knowledge base. Check it out 😉

http://whatisprymas.wordpress.com/

cheers

E. Garcia says:

April 13, 2010 at 8:05 am

Hi, danfidalgo:

Thank you for stopping by.

Sounds like a lot of fun. Happy to hear of your efforts. Good luck 🙂

