We are currently testing a new experimental engine. The experiment uses OR as the default query mode and scores terms by IDF only. IDF is derived directly from the inverted index, which is itself computed at query time. We are also trying entropy scores in place of IDF.
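As a rough illustration of reading IDF straight off an inverted index, here is a minimal Python sketch. The toy documents, the function names, whitespace tokenization, and the specific IDF variant log(N / df) are all assumptions made for the example, not a description of the engine itself.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)  # a set: repeated terms count once
    return index

def idf_from_index(index, n_docs):
    """IDF per term from posting-list length (assumed form: log(N / df))."""
    return {term: math.log(n_docs / len(postings))
            for term, postings in index.items()}

# Hypothetical toy collection.
docs = {
    "d1": "cheap flights to paris",
    "d2": "cheap hotels in paris",
    "d3": "cheap cheap cheap car rental",
}
index = build_inverted_index(docs)
idf = idf_from_index(index, len(docs))
```

In this sketch a term that appears in every document ("cheap") gets an IDF of zero, while rarer terms get higher values, and the number of repetitions inside a single document has no effect.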

With large collections, the inverted index is written to a text file and read at query time.
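One possible way to persist the index as plain text is one line per term, with the term and its posting list separated by a tab. The file layout, the function names, and the assumption that terms and document ids contain no whitespace are ours, chosen only to make the sketch concrete.

```python
import os
import tempfile

def write_index(index, path):
    """Write one line per term: term<TAB>doc ids separated by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(index):
            f.write(term + "\t" + " ".join(sorted(index[term])) + "\n")

def read_index(path):
    """Rebuild the term -> set-of-doc-ids mapping from the text file."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, _, postings = line.rstrip("\n").partition("\t")
            index[term] = set(postings.split())
    return index

# Hypothetical toy index, round-tripped through a temporary file.
index = {"cheap": {"d1", "d2", "d3"}, "paris": {"d1", "d2"}}
path = os.path.join(tempfile.mkdtemp(), "index.txt")
write_index(index, path)
restored = read_index(path)
```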

Since local information (e.g., term frequency) is ignored, repeating a keyword within a document does not raise its score, so keyword spam is not an issue.

Instead of a Vector Space Model, we use a cumulative sum of IDF scores, so it is not necessary to compute cosine similarities (*).
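A minimal sketch of this kind of scoring, assuming OR mode means "any document containing at least one query term is a candidate" and assuming IDF = log(N / df). The function name, the toy index, and the tie-breaking by document id are hypothetical choices for the example.

```python
import math
from collections import defaultdict

def score_or_mode(query, index, n_docs):
    """Rank documents by the sum of IDF over the query terms they contain."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term)
        if not postings:
            continue  # term absent from the collection
        idf = math.log(n_docs / len(postings))
        for doc_id in postings:
            scores[doc_id] += idf  # no term frequency, no vector norms
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy index over a 4-document collection.
index = {
    "paris": {"d1", "d2"},
    "flights": {"d1", "d3"},
    "hotels": {"d2"},
}
ranked = score_or_mode("paris flights", index, n_docs=4)
```

In this toy case the document that contains both query terms accumulates the largest sum and ranks first, while documents matching a single term fall below it and documents matching none are never scored.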

So far, the experiment shows that multi-term queries produce two extreme clusters:

1. The top N ranked documents behave almost as if they had been queried in AND mode, and as if they obeyed the Cluster Hypothesis.

2. The M documents ranked at the bottom behave as if they had been queried either in EXACT mode or with a single-term query. (**)

Between these extremes we have some noisy results.  

If anyone has tried this before, we would love to hear about it. Contact us by email.

 

PS.

(*) In this way we don’t need to make independence assumptions.

(**) With a few changes, M now behaves as if it had been queried with single-term queries or with few query terms, which is what we expected. The N set is still the most interesting. The middle cases are now quite noisy.
