IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: August 2011

The Scope Hypothesis in IR: Who is Right?

13 Saturday Aug 2011

Posted by egarcia in Data Mining, IR Tools, IR Tutorials, Queries

In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.

However…

Here I would like to reference research on both sides of the Scope Hypothesis.

In the abstract of “Revisiting the relationship between document length and relevance” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf), Losada, D.E., Azzopardi, L. and Baillie, M. (2008) state:

“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”

Really…?

However, in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (http://www.sigir2011.org/papershow.asp?PID=104), Zhou, Huang, and He (2011) state:

“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”
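To make the trade-off mentioned in this abstract concrete, here is a minimal sketch of how the textbook BM25 term-frequency factor handles document length through its b parameter. The function name and the parameter values (k1 = 1.2, b = 0.75) are illustrative assumptions, not taken from either paper. Setting b = 1 normalizes fully by length (long documents are treated as merely verbose), while b = 0 switches normalization off (length is left alone, in the spirit of the Scope Hypothesis).

```python
import math


def bm25_tf_factor(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency factor of textbook BM25 (the IDF part is omitted).

    b interpolates between no length normalization (b = 0) and
    full length normalization (b = 1). Parameter values are illustrative.
    """
    norm = (1.0 - b) + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# b = 1: a document twice as long with twice the occurrences scores the same.
full_norm_short = bm25_tf_factor(tf=1, doc_len=100, avg_doc_len=100, b=1.0)
full_norm_long = bm25_tf_factor(tf=2, doc_len=200, avg_doc_len=100, b=1.0)

# b = 0: document length is ignored; only the raw count matters.
no_norm_short = bm25_tf_factor(tf=2, doc_len=100, avg_doc_len=100, b=0.0)
no_norm_long = bm25_tf_factor(tf=2, doc_len=200, avg_doc_len=100, b=0.0)

# Default b = 0.75: for the same count, the longer document is penalized.
default_short = bm25_tf_factor(tf=2, doc_len=100, avg_doc_len=100)
default_long = bm25_tf_factor(tf=2, doc_len=200, avg_doc_len=100)

print(full_norm_short, full_norm_long)
print(no_norm_short, no_norm_long)
print(default_short, default_long)
```

Intermediate values of b are the compromise between the two hypotheses; where the best value lies is an empirical question, which is what both papers argue about.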

My take…

I haven’t reviewed BM25L vs. BM25F yet. Still, the question of the Scope Hypothesis is intriguing. From what I can tell (and this is solely my opinion), if an author writes more about a topic or several topics in a given document, he will more likely use more instances of index terms. A cluster of the top index term density values (IDs) spread over said document should give some insight into its scope. We have developed a tool that computes these clusters. We are now testing whether that translates into improved relevance.

Assuming that Web IR systems out there (e.g., search engines) use these algorithms or derivatives of them: What would be the implications for content writers trying to understand algos based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.

BM25 and BM25F: Implications to SEO and Web Design

04 Thursday Aug 2011

Posted by egarcia in IR Tutorials

Yesterday we published two great tutorials on the BM25 and BM25F algorithms.

The take-home points from the theory behind these algorithms:

1. A term (e.g., a keyword) has more information gain when it occurs for the very first time.

2. A term most likely carries more weight in a title field than in other fields.

3. The weight of a term and its occurrence frequency are not linearly related.

4. A linear combination of field scores that destroys term dependencies is contraindicated (see BM25F).
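Points 1 and 3 can be illustrated with a small sketch of the saturating term-frequency factor used in textbook BM25. Length normalization is switched off here (b = 0) to isolate the effect, and k1 = 1.2 is just a commonly quoted illustrative value; the function name is hypothetical and not from the tutorials.

```python
def bm25_saturation(tf, k1=1.2):
    """Saturating term-frequency factor of BM25 with b = 0 (no length normalization)."""
    return tf * (k1 + 1.0) / (tf + k1)


K1 = 1.2
factors = [bm25_saturation(tf, K1) for tf in range(0, 6)]

# Gain contributed by each additional occurrence of the term.
gains = [factors[i] - factors[i - 1] for i in range(1, len(factors))]

for tf, gain in enumerate(gains, start=1):
    print(f"occurrence {tf}: factor={factors[tf]:.3f} gain={gain:.3f}")
```

Under these assumptions the first occurrence contributes the largest gain, every later occurrence contributes less, and the factor never exceeds k1 + 1, so doubling the number of occurrences does not double the weight.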

Most SEOs are well aware of 1 and 2.

Because a term has more information gain during its first occurrences, a document about specific terms should mention them at the beginning, particularly in the title tag. For testing purposes, and since end users assume that a large headline is the actual title of a document (which it is not), we like to repeat the title tag content in an h1 header placed prominently at the beginning of the copy. Keywords from the title are then repeated early in the document body. In this way, one can write for both end users and search engines. If a search engine uses some form of the above algorithms (we don’t know whether any do), that base is covered, too. You don’t have to adopt this strategy unless you want to. It is just our way of conducting tests, but it is a flexible approach.

New Tutorials: Okapi BM25F and BM25

03 Wednesday Aug 2011

Posted by egarcia in IR Tutorials, Queries

We have a new tutorial on Okapi Simple BM25 with Extension to Multiple Fields.

http://www.miislita.com/information-retrieval-tutorial/okapi-simple-bm25f-tutorial.pdf

Unlike BM25, this model (known as Simple BM25F) incorporates the structure of documents into the scoring process.
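As a rough illustration of what incorporating structure can mean, the sketch below contrasts two ways of using fields. It assumes a simplified form in which field weights are applied to term counts before saturation, with length normalization and IDF left out; the field names, weights, and function names are hypothetical and are not taken from the tutorial.

```python
def saturate(tf, k1=1.2):
    """BM25-style saturation of a (pseudo) term frequency; bounded above by 1."""
    return tf / (tf + k1)


def weighted_tf_then_saturate(tf_by_field, weights, k1=1.2):
    """BM25F-style: merge weighted field counts first, then saturate once."""
    pseudo_tf = sum(weights[field] * tf for field, tf in tf_by_field.items())
    return saturate(pseudo_tf, k1)


def saturate_then_combine(tf_by_field, weights, k1=1.2):
    """Linear combination of per-field scores: saturate each field, then add."""
    return sum(weights[field] * saturate(tf, k1) for field, tf in tf_by_field.items())


weights = {"title": 3.0, "body": 1.0}   # hypothetical field weights
term_counts = {"title": 5, "body": 5}   # one query term, counted per field

combined_first = weighted_tf_then_saturate(term_counts, weights)
scores_added = saturate_then_combine(term_counts, weights)

print(f"weighted counts, then saturation: {combined_first:.3f}")
print(f"per-field scores, then sum:       {scores_added:.3f}")
```

With these made-up numbers, the first approach keeps a single term bounded by the saturation limit no matter how many fields repeat it, while the summed per-field scores grow past that limit. This is one way to read the warning against linearly combining field scores.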

 

In addition, we’ve uploaded a new, improved, and expanded version of the Okapi Best Match 25 tutorial.

http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf

 

Have a great IR day!
