In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.
However…
Here I would like to reference research on both sides of the Scope Hypothesis.
In the abstract of “Revisiting the relationship between document length and relevance” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf), Losada, D.E., Azzopardi, L. and Baillie, M. (2008) state:
“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”
Really…?
However, in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (http://www.sigir2011.org/papershow.asp?PID=104), Zhou, Huang, and He (2011) state:
“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”
My take…
I haven’t reviewed BM25L vs. BM25F yet. Still, the question of the Scope Hypothesis is intriguing. From what I can tell (and this is solely my opinion), if an author writes more about a topic or several topics in a given document, he or she is more likely to use more instances of index terms. A cluster of the top index term density values (IDs) spread over said document should give some insight into its scope. We have developed a tool that computes these clusters. We are now testing whether that translates into improved relevance.
Assuming that Web IR systems out there (e.g., search engines) use these algorithms or derivatives of them: What would be the implications for content writers trying to understand algos based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.
Quite interesting, I would love to read a followup.
Hi, itman1975:
Again, thank you for stopping by.
Very interesting, indeed.
There is no doubt that document length impacts term frequency. The question is whether document length impacts relevance. Two lines to explore are:
1. Whether the results from all these studies are an artifact of the normalization scheme used, as there are many term frequency normalization schemes to choose from. There are even Dirichlet normalizations applied to BM25 models (http://ir.dcs.gla.ac.uk/smooth/f219-he.pdf) and document pivot normalization.
2. Clusters of index term densities spread over the document length. Without a clear dominant cluster of index terms, documents are more likely to lack scope or relevance. Document length alone is not enough.
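To make the first point concrete, here is a minimal sketch of two commonly cited textbook forms of length-normalized term frequency: the BM25 term-frequency component and a pivoted length normalization. The function names and the default values for k1, b, and s are assumptions chosen for illustration; they are not taken from any of the papers linked above, and the Dirichlet variant is not shown.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25-style term-frequency component (textbook form).

    b controls how strongly document length is penalized:
    b=0 ignores length, b=1 applies full length normalization.
    """
    if tf <= 0:
        return 0.0
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def pivoted_tf(tf, doc_len, avg_doc_len, s=0.2):
    """Pivoted length normalization (textbook form).

    s is the slope: s=0 ignores length, larger s penalizes long documents more.
    """
    if tf <= 0:
        return 0.0
    length_norm = 1.0 - s + s * doc_len / avg_doc_len
    return (1.0 + math.log(1.0 + math.log(tf))) / length_norm


if __name__ == "__main__":
    # Same raw term frequency, different document lengths.
    for doc_len in (100, 400):
        print(doc_len, bm25_tf(3, doc_len, 100), pivoted_tf(3, doc_len, 100))
```

With the same raw term frequency, both schemes give a lower weight to the longer document, but by different amounts. That difference is one way the choice of normalization could influence what a study concludes about length and relevance.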
1. Yes, I agree. This may be a side effect of the non-linearity of term normalization.
2. Is it true in all cases? E.g. an article on skin cancer may not have any clusters of the word cancer. Does it mean that the document is not relevant with respect to ‘cancer’?
It all depends on how the clusters are defined. But sure, there would be exceptions to rules for almost any algorithm out there.
I might not be able to follow up on this for a few days as I’m going in for surgery.
In the meantime, here are some great links to add to the last tutorial. Somehow I missed including these gems.
http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=549257EC3BC5B8BD1D0195241A931602?doi=10.1.1.9.5255&rep=rep1&type=pdf
http://research.microsoft.com/en-us/people/tyliu/letor-tutorial-sigir08.pdf
http://research.microsoft.com/en-us/people/tyliu/learning_to_rank_tutorial_-_www_-_2008.pdf
No problem and good luck with the surgery!
In regard to the latter 2 links, I think that learning to rank is a separate topic that deserves a separate tutorial.
In regard to the clustering idea: I guess this would be a natural extension of the “divergence from randomness” idea. It would be nice to see it evaluated experimentally, especially for longer documents, which are not homogeneous. Or, perhaps, one would do better to represent long documents as a set of smaller units, which are retrieved independently.
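As a rough illustration of the "smaller units" suggestion, the sketch below splits a document into fixed-size word windows that could each be indexed and scored on their own. The function name and the window size are hypothetical; a real system might split on sentences or paragraphs, or use overlapping windows.

```python
def split_into_passages(text, window=100):
    """Split text into consecutive, non-overlapping passages of `window` words."""
    words = text.split()
    return [
        " ".join(words[start:start + window])
        for start in range(0, len(words), window)
    ]


if __name__ == "__main__":
    for passage in split_into_passages("a b c d e", window=2):
        print(passage)
```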
Hi, there:
I’m back from surgery and recovering at home, finally.
The clustering concept I mentioned is not about a “divergence from randomness” idea.
The clusters I mentioned are constructed by normalizing frequencies within the 0 to 1 interval and then grouping these according to intervals of different sizes. This normalization is not ‘a la’ BM25, but it could be done in that way. I have not explored that yet.
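A minimal sketch of that kind of grouping follows. It assumes frequencies are normalized by dividing by the highest term count in the document, and it uses made-up interval edges. Both choices, along with the function names, are assumptions for illustration only and do not describe the actual tool.

```python
from collections import Counter


def normalize_frequencies(tokens):
    """Scale raw term counts into the (0, 1] interval by dividing by the max count."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


def group_by_intervals(normalized, edges=(0.0, 0.25, 0.5, 1.0)):
    """Group terms into half-open intervals (lo, hi] defined by `edges`.

    The intervals may have different sizes, as in the default edges.
    """
    intervals = list(zip(edges, edges[1:]))
    groups = {interval: [] for interval in intervals}
    for term, value in sorted(normalized.items()):
        for lo, hi in intervals:
            if lo < value <= hi:
                groups[(lo, hi)].append(term)
                break
    return groups


if __name__ == "__main__":
    tokens = "cancer skin cancer cancer skin sun cancer".split()
    print(group_by_intervals(normalize_frequencies(tokens)))
```

In this toy input, "cancer" lands alone in the top interval, which is the kind of dominant group the discussion above refers to.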
Even acknowledging exceptions to the rules, it turns out that documents with no clusters or just one cluster are more likely to have no scope, while those about a given topic tend to exhibit clusters. Since longer documents are more likely to be about multiple topics, these in turn are more likely to have well-defined clusters.
However, document length by itself is not enough to assess either the scope or the relevance of documents. These things are influenced by too many variables.
One variable I’m examining is word positioning of content words in a sentence, in a group of sentences, in a paragraph, and in a group of paragraphs. It turns out that a surprise and lesson for SEOs is at the end of this line of thought. I’ll expand on this in a separate article.
The idea of writing tutorials on Learning to Rank algorithms is a good one!
Glad to see you back. Looking forward to the development of this topic.
Hi, there.
Well, due to constraints, I decided to do a post rather than a full article. See here:
http://irthoughts.wordpress.com/2012/02/13/hey-seos-on-information-gain-keyword-wallop-and-relevance/