• About IR Thoughts

IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

IR Thoughts

Category Archives: Queries

The Scope Hypothesis in IR: Who is Right?

13 Saturday Aug 2011

Posted by egarcia in Data Mining, IR Tools, IR Tutorials, Queries

≈ 8 Comments

In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.

However…

Here I would like to reference research at both sides of the Scope Hypothesis.

In the abstract of ”Revisiting the relationship between document length and relevance” (
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf
), Losada, D.E., Azzopardi, L. and Baillie, M. (2008) state:

“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”

Really….?

However in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (
http://www.sigir2011.org/papershow.asp?PID=104
), Zhou, Huang, and He (2011) state:

“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”

My take…

I haven’t reviewed BM25L vs. BM25F, yet. Still the question on the Scope Hypothesis is intriguing. For what I can tell (and this is my sole opinion), if an author writes more about a topic or several topics in a given document, more likely he will be using more instances of index terms. A cluster of the top index term density values (IDs) spreaded over said document should give some insight about its scope. We have developed a tool that computes these clusters. We are testing now whether that would translate into an improved relevance.

Assuming that Web IR systems out there (e.g,, search engines) use these algorithms or derivatives of these: What would be the implications for content writers trying to understand algos based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.

New Tutorials: Okapi BM25F and BM25

03 Wednesday Aug 2011

Posted by egarcia in IR Tutorials, Queries

≈ 5 Comments

We have a new tutorial on Okapi Simple BM25 with Extension to Multiple Fields.


http://www.miislita.com/information-retrieval-tutorial/okapi-simple-bm25f-tutorial.pdf

Unlike the BM25, this model (known as Simple BM25F) incorporates the structure of documents into the scoring process.

 

In addition, we’ve uploaded a new, improved, and expanded version of the Okapi Best Match 25 tutorial.


http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf

 

Have a great IR day!

A Tutorial on Okapi BM25

01 Friday Jul 2011

Posted by egarcia in IR Tutorials, Queries

≈ 4 Comments

We have uploaded a new tutorial: Okapi BM25. See
http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf

This is a tutorial on the classic Okapi Best Match 25.

Enjoy it.

 

Minerazzi Web Crawler: Now Individually Crawling Form Fields

27 Monday Jun 2011

Posted by egarcia in Data Mining, Programming, Queries

≈ Leave a Comment

These are additional updates made over the weekend to our crawler at
http://www.minerazzi.com/labs/crawlinker.php

Changes made include the ability to detect information contained inside selection menus, textareas, and input fields. These changes simplify the examination of name/value pairs used by forms. Very useful when a document contains multiple forms, when query mechanisms of target databases must be identified, or when one need to assess whether a database is susceptible to query injections or script infections. In the latter, a security component is more than obvious.

We also moved both the Markup and Robots Text File Reports to the Source Code section. This is a new section that is now listed as the last reports of the application.

Federated Searches and The Search Paradox

14 Tuesday Jun 2011

Posted by egarcia in Data Mining, Queries

≈ 2 Comments

The problem with federated searches (aka, parallel, broadcast, or meta searches) is that it is implemented under the assumption that the more it is  searched at one time, the better for the user. This is not necessarily the case. Easy search results does not equate to right results. Submitting a query on a given topic to dissimilar databases with dissimilar content or about dissimilar topics often produces off-topic or irrelevant results. Indeed, easy searching not necessarily translates into a relevance experience.

And there is also the question of how to rank the results. Two strategies are frequently used: (a) data appending and (b) data fusion.

Some of the early forms of federated search engines for the Web used to do (a) and were soon called meta search engines. These search tools simply returned a long list of results by appending the top N ranked results from each databases in a tandem fashion. Obviously this strategy failed to recognized the top M relevant results from this huge list and soon was phased out in favor of (b).

In (b), arithmetic or weighted averages from the top N results from the different databases are computed. The problem with this approach is that is very subjective. In order to compute an arithmetic or weighted relevance score, who decides how much weight should be assigned to a given ranked result from a given database? No matter which weighting criteria are used, at the end it is still a subjective score and one that not necessarily improves the end-user search experience. Just the opposite.

This leads to what I call “The Search Paradox”: Information gateways as information roadblocks.

To learn more about this, visit this old link:
http://www.accessmylibrary.com/article-1G1-182034526/federated-search-101-alexis.html

On trusting search engine stemming and matching results

12 Thursday Aug 2010

Posted by egarcia in Data Mining, Queries

≈ Leave a Comment

If you believe or care about it, then following resources might interest you.

On stemming and counting search results


http://jis.sagepub.com/content/early/2009/05/28/1363459309336801

Our results indicate that Google uses a document-based algorithm for stemming. It evaluates each document separately and makes a decision to index or not for the conflated forms of the words it has. It indexes documents only for word forms that are semantically strongly correlated. While it indexes documents for singulars and plurals frequently, it rarely indexes documents for word forms with the postfixes of -able or -tively.


http://jis.sagepub.com/content/35/4/469.short

This study investigates the accuracy of search engine hit counts for search queries. We investigate the accuracy of hit counts for Google, Yahoo and Microsoft Live Search, and the accuracy of single and multiple term queries. In addition, we investigate the consistency of hit count estimates for 15 days. The results show that all three provide estimates for the number of matching documents and the estimation patterns of their counting algorithms differ greatly. The accuracy of hit counts for multiple word queries has not been studied before. The results of our study show that the number of words in queries affects the accuracy of estimations significantly. The percentages of accurate hit count estimations are reduced almost by half when going from single word to two word query tests in all three search engines. With the increase in the number of query words, the error in estimation increases and the number of accurate estimations decreases.


http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf

 

This article reviews the rewards and limitations of either acquiring Web content and processing it nto a static corpus or else accessing it directly as a dynamic corpus, a distinction captured in or / as corpus the process it surveys typical applications of such data to both academic analysis and real-world.

 


http://gplsi.dlsi.ua.es/congresos/qwe10/fitxers/QWE10_Funahashi.pdf

Abstract. In this paper, we investigate the trustworthiness of search engines’hit counts, numbers returned as search result counts. Since many studies adoptsearch engines’ hit counts to estimate the popularity of input queries, thereliability of hit counts is indispensable for archiving trustworthy studies.However, hit counts are unreliable because they change, when a user clicks the“Search” button more than once or clicks the “Next” button on the searchresults page, or when a user queries the same term on separate days. In thispaper, we analyze the characteristics of hit count transition by gathering varioustypes of hit counts over two months by using 10,000 queries. The results of ourstudy show that the hit counts with the largest search offset just before searchengines adjust their hit counts are the most reliable. Moreover, hit counts arethe most reliable when they are consistent over approximately a week.

For those that still believe in counting search results, despite the evidence from the above articles:


http://www2006.org/programme/item.php?id=3047

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine’s index using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding “weight”, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine’s index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
 

A Simple Search Strategy to beat them all

01 Tuesday Dec 2009

Posted by egarcia in Data Mining, Machine Learning, Programming, Queries, Vector Space Models

≈ 6 Comments

Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.

I’m currently testing a ranking algorithm for an IR system built over the last years. The answer set is based on a simple matching (SM) search strategy.

Mistake not a simple matching strategy  for a simple or basic search approach as it can evolve into  the most complex one.

Unlike classic boolean searches (i.e., AND, OR, XOR),  SM is suitable for constructing answer sets and subsets based on coordination levels. Add a supporting scoring function (tf-IDF derivatives, RSJ-PM, BM25, etc) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.

Proper fine tuning allows presenting end-users with answer sets wherein AND results are accumulated at the top of the search results. As users move down the search results, they are presented with OR results and the search experience is perceived as if the system expands the answer set by switching query modes.

I’ve also added a query reduction mechanism for discoverying related searches. Brazilian Wax, nice!

In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.

Next step is to check if with a large corpus and a thesaurus, results compare favorably with results from search engines that claim to use semantics.

So far, my one is cost effective and does not require of extra libraries.

PS: I forget to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!

SIDIM XXIV Conference

05 Thursday Mar 2009

Posted by egarcia in Conferences, Data Mining, Homeland Security, Queries, Vector Space Models

≈ 4 Comments

I am presenting at The Seminario Interuniversitario de Investigación en Ciencias Matemáticas (Interuniversity Seminar on Mathematical Sciences Research, SIDIM).

This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (
http://sidim2009.uprr.pr/
)

This year SIDIM will be held at University of Puerto Rico, Rio Piedras in March 6-7, 2009. The SIDIM program and book of abstracts  is available at
http://sidim.uprh.edu/libroSIDIM2009.pdf

I will be presenting new research work on IDF and a new model for the conditional specificity of terms. If you have followed previous posts on the topic of inverse document frequency, now you will understand why I have dissected the topic several times. Thank you all for your private comments and feedback on the topic.

My abstract follows:

Scaled Inverse Document Frequency: A Model for the Evaluation of the Conditional Specificity of Query Terms in Search Engine Collections

Edel Garcia, Internet Business Development Center, Interamerican University of Puerto Rico, Metropolitan Campus

Inverse document frequency (IDF) is a measure of the specificity of query terms over a collection of D number of documents that has been successfully incorporated into numerous vector space information retrieval models. Since these models assume term independence, the specificity of a given term, present in different queries, is assumed to be unique and independent from other query terms. To the best of our knowledge, there are no known models that condition the specificity of terms to the presence of other terms in a query.

This paper proposes a new measure called scaled inverse document frequency (SIDF) which evaluates the conditional specificity of query terms over a subset S of D and without making any assumption about term independence. S can be estimated from search results, OR searches, or computed from inverted index data. We have evaluated SIDF values from commercial search engines by submitting queries relevant to the financial investment domain. Results compare favorably across search engines and queries. Our approach has practical applications for `real-world’ scenarios like in Web Mining, Homeland Security, and keyword-driven marketing research scenarios. SIDF can be incorporated into a variety of information retrieval models as a global weight scoring system.

Keywords: inverse document frequency, conditional term specificity, web mining, search engines

When IDF is not enough

25 Wednesday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ Leave a Comment

I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as measure of the discriminating power of a term is not enough. Gray writes:

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that by placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word often used by Lewis Carroll.

Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *

This is where IDF as a term specificity measure breaksdown. This problem is intimate linked to The Original Sin of IR models: The Term Independence Assumption.

* I have modified a bit this assertive question to make the point more clear.

References


http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Hill, I.D., “Wouldn’t it be nice if we could write computer programs in ordinary English – or would it?” The Computer Bulletin, June 1972, pp. 306-312.

Neumann, Peter G., “Only his Only Grammarian Can Only Say What Only He Means,” ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.

Glottochronology: Part IV

20 Friday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ Leave a Comment

Please read Part I, Part II, and Part III before reading this post.

I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur’s book Discrete Dynamical Systems (Oxford, 1990).

1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?

2. Consider the model of glottochronology. Assume a language is given today.

(a) How long will it take for 1/4 of the words to change?

(b) How long will it take for 10 per cent of the words to change?

3. Suppose that person A knows 60 per cent of a list of 1000 words, person B knows 70 per cent of that list, and person C knows 30 per cent of that list.

(a) How many words do you expect all three people know?

(b) What per cent of the words is known by A and B but not by C?

Hints:

Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle  to A, B, and C.

← Older posts
June 2013
M T W T F S S
« May    
 12
3456789
10111213141516
17181920212223
24252627282930

Favorite Sites

  • Mi Islita

Pages

  • About IR Thoughts

Categories

  • AIRWeb Course
  • Conferences
  • Data Mining
  • Dynamics
  • Fractal Geometry
  • Graduate Courses
  • Hacking
  • Homeland Security
  • Human-Computer Interaction
  • Image Compression
  • Internet Engineering
  • IR Quizzes
  • IR Tools
  • IR Tutorials
  • Latent Semantic Indexing
  • Legacy Posts
  • Machine Learning
  • Marketing Research
  • Miscellaneous
  • News
  • Newsletters
  • Programming
  • Quack Science
  • Queries
  • Scripts
  • Search Engines Architecture Course
  • SEO Myths
  • Software
  • Spam
  • Statistics and Mathematics
  • Theses
  • Vector Space Models
  • Web Mining Course

Recent Posts

  • “Powered by” in Spanish
  • Some nice features added to the Image Crawler
  • The Images Crawler
  • A nice service for my locals
  • An update to the Web Crawler
  • New similarity measures
  • The Web Crawler is Back!
  • Tracking Users: An Email Crawler on Steroids
  • The Email Crawler: A Tool for Gathering Emails
  • The Binary Distance Calculator – a tool for comparing binary sets
  • Fractalettes: A Fractal Design Strategy to Color Mining and Learning through Discovery
  • AZZOO and WAZZOO: New Similarity Measures for the 21st Century
  • The Binary Similarity Calculator
  • From Harlem Shake to Link Shake: The Qualified Links Shake
  • Web Vulnerabilities and Search Engines

Archives

  • May 2013
  • April 2013
  • March 2013
  • February 2013
  • January 2013
  • December 2012
  • November 2012
  • October 2012
  • September 2012
  • August 2012
  • July 2012
  • June 2012
  • May 2012
  • April 2012
  • March 2012
  • February 2012
  • January 2012
  • December 2011
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • May 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • October 2010
  • September 2010
  • August 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007

AIRWeb Course Conferences Data Mining Fractal Geometry Graduate Courses Hacking Homeland Security Human-Computer Interaction Internet Engineering IR Quizzes IR Tools IR Tutorials Latent Semantic Indexing Legacy Posts Machine Learning Marketing Research Miscellaneous Newsletters Programming Quack Science Queries Scripts Search Engines Architecture Course SEO Myths Software Spam Statistics and Mathematics Theses Vector Space Models Web Mining Course

Blog at WordPress.com. Theme: Chateau by Ignacio Ricci.