Learning about Grover’s Algorithm: Quantum Database Search.
- Grover, L. K. (1996). A fast quantum mechanical algorithm for database search. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC), 212-219.
- Grover, L. K. (1997). Quantum mechanics helps in searching for a needle in a haystack. Physical Review Letters, 79(2), 325-328.
- Lavor, C., Manssur, L. R. U., and Portugal, R. (2008). Grover's Algorithm: Quantum Database Search.
- Wikipedia. Grover's algorithm.
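For readers who want to see the idea before diving into the papers, here is a minimal simulation of Grover's search over a toy 8-item database, written with plain NumPy rather than a quantum SDK. The marked index and sizes are arbitrary; this is a sketch of the algebra, not production quantum code.

```python
# Minimal simulation of Grover's search over N = 8 items using plain
# linear algebra. The "oracle" flips the sign of the marked item's
# amplitude; the "diffusion" step inverts all amplitudes about their
# mean. After ~(pi/4)*sqrt(N) iterations the marked item dominates.
import numpy as np

N = 8            # database size (3 "qubits"); toy value
marked = 5       # index of the item we are searching for; arbitrary

state = np.full(N, 1.0 / np.sqrt(N))   # uniform superposition

iterations = int(np.floor(np.pi / 4 * np.sqrt(N)))  # 2 iterations for N = 8
for _ in range(iterations):
    state[marked] *= -1.0              # oracle: phase-flip the marked item
    state = 2 * state.mean() - state   # diffusion: inversion about the mean

probs = state ** 2
print(f"P(marked item {marked}) = {probs[marked]:.3f}")  # ~0.945 for N = 8
```

Note that a classical search of an unsorted 8-item list needs on the order of N lookups, while the simulation above converges in about sqrt(N) iterations, which is the whole point of Grover's result.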
Who said that IR and LSI cannot be fun?
Detecting Cyberbullying: Query Terms and Techniques
As part of the development of Minerazzi, we have published an article explaining two of our search modes: XOR and XNOR. Additional articles explaining other modes will soon follow.
We believe that IR and SEO practitioners will find these search modes particularly useful.
The beauty of XOR and XNOR searches is that they allow users to run complex co-occurrence searches in a straightforward manner. This is important because Latent Semantic Indexing (LSI) information is rooted in term-term co-occurrence relationships.
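To make the idea concrete, here is a minimal sketch of XOR and XNOR searches over set-based postings. The toy index and document universe below are hypothetical and are not Minerazzi's actual implementation; they only illustrate the set logic involved.

```python
# Toy illustration of XOR and XNOR search modes over an inverted index.
# Postings are sets of document IDs; the index is hypothetical.
index = {
    "latent":   {1, 2, 5},
    "semantic": {2, 3, 5},
}

universe = {1, 2, 3, 4, 5}          # all indexed documents

a, b = index["latent"], index["semantic"]

xor_hits  = a ^ b                    # docs with exactly one of the terms
both      = a & b                    # docs where the terms co-occur
neither   = universe - (a | b)       # docs with neither term
xnor_hits = both | neither           # XNOR: both terms or neither

print("XOR :", sorted(xor_hits))     # [1, 3]
print("XNOR:", sorted(xnor_hits))    # [2, 4, 5]
```

The co-occurrence set (the docs holding both terms) is precisely the kind of evidence that term-term, LSI-style analyses feed on.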
In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.
Here I would like to reference research on both sides of the Scope Hypothesis.
In the abstract of “Revisiting the relationship between document length and relevance” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf), Losada, D.E., Azzopardi, L. and Baillie, M. (2008) state:
“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”
However, in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (http://www.sigir2011.org/papershow.asp?PID=104), Zhou, Huang, and He (2011) state:
“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”
I haven’t reviewed BM25L against BM25F yet. Still, the question raised about the Scope Hypothesis is intriguing. From what I can tell (and this is solely my opinion), the more an author writes about one or several topics in a given document, the more instances of index terms he or she is likely to use. A cluster of the top index term density values (IDs) spread over said document should give some insight into its scope. We have developed a tool that computes these clusters, and we are now testing whether that translates into improved relevance.
Assuming that the Web IR systems out there (e.g., search engines) use these algorithms or derivatives of them: what would be the implications for content writers trying to understand algorithms based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.
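For those who want to experiment, here is a minimal sketch of the classic BM25 term weight. Note how the length-normalization parameter b interpolates between the two hypotheses: b = 1 fully penalizes length (Verbosity), while b = 0 ignores it (Scope). The constants and toy figures below are illustrative assumptions, not values from our tutorials, and the IDF shown is a common smoothed variant.

```python
import math

def bm25_weight(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Classic Okapi BM25 weight for one term in one document.

    tf: term frequency in the document; df: document frequency;
    N: collection size; dl: document length; avgdl: average length.
    b = 1 fully normalizes by length (Verbosity hypothesis);
    b = 0 applies no length penalty (Scope hypothesis).
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF variant
    norm = 1 - b + b * (dl / avgdl)                    # length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# A long document (3x the average length) with proportionally more
# term occurrences: b decides whether the extra length is penalized.
for b in (0.0, 0.75, 1.0):
    w = bm25_weight(tf=9, df=100, N=100_000, dl=900, avgdl=300, b=b)
    print(f"b = {b:<4} -> weight = {w:.3f}")
```

Running this shows the same long document scoring highest when b = 0 and lowest when b = 1, which is the Verbosity/Scope trade-off in one line of algebra.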
We have a new tutorial on Okapi Simple BM25 with Extension to Multiple Fields.
Unlike BM25, this model (known as Simple BM25F) incorporates the structure of documents into the scoring process.
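As a quick illustration of the field idea, here is a sketch of the weighted pseudo-frequency at the heart of Simple BM25F: per-field term frequencies are boosted by field weights and combined before the usual BM25 saturation is applied. The field weights shown are arbitrary assumptions, not the tutorial's values.

```python
def simple_bm25f_tf(field_tfs, field_weights):
    """Combine per-field term frequencies into one weighted pseudo-tf,
    in the spirit of Simple BM25F: tf' = sum over fields of w_f * tf_f."""
    return sum(field_weights[f] * tf for f, tf in field_tfs.items())

# Illustrative field weights: a title hit counts far more than a body hit.
weights = {"title": 5.0, "body": 1.0}
tfs = {"title": 1, "body": 3}

pseudo_tf = simple_bm25f_tf(tfs, weights)            # 5*1 + 1*3 = 8.0
k1 = 1.2
saturated = pseudo_tf * (k1 + 1) / (pseudo_tf + k1)  # BM25-style saturation
print(pseudo_tf, round(saturated, 3))
```

The design point is that saturation happens once, on the combined pseudo-frequency, rather than per field, so a term repeated across fields cannot rack up unbounded credit.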
In addition, we’ve uploaded a new, improved, and expanded version of the Okapi Best Match 25 tutorial.
Have a great IR day!
We have uploaded a new tutorial: Okapi BM25. See http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf
This is a tutorial on the classic Okapi Best Match 25.
These are additional updates made over the weekend to our crawler at http://www.minerazzi.com/labs/crawlinker.php
Changes made include the ability to detect information contained inside selection menus, text areas, and input fields. These changes simplify the examination of name/value pairs used by forms. This is very useful when a document contains multiple forms, when the query mechanisms of target databases must be identified, or when one needs to assess whether a database is susceptible to query injections or script infections. In the latter cases, the security implications are obvious.
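As an illustration of the kind of extraction involved, here is a minimal sketch, using only Python's standard library, of how form name/value pairs might be pulled from a page. The sample markup is hypothetical, and our crawler's actual logic differs.

```python
# Sketch: collect name/value pairs from <input>, <select>/<option>,
# and <textarea> elements using the standard-library HTML parser.
from html.parser import HTMLParser

class FormFieldParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = []          # collected (tag, name, value) triples
        self._select = None       # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            self.fields.append(("input", a.get("name"), a.get("value")))
        elif tag == "textarea":
            self.fields.append(("textarea", a.get("name"), None))
        elif tag == "select":
            self._select = a.get("name")
        elif tag == "option" and self._select is not None:
            self.fields.append(("select", self._select, a.get("value")))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None

html = """
<form action="/search.php" method="get">
  <input type="text" name="q" value="">
  <select name="mode"><option value="xor">XOR</option>
  <option value="xnor">XNOR</option></select>
  <textarea name="notes"></textarea>
</form>
"""
p = FormFieldParser()
p.feed(html)
for field in p.fields:
    print(field)   # e.g. ('input', 'q', ''), ('select', 'mode', 'xor'), ...
```

Once the name/value pairs are on hand, spotting a form's query mechanism, or probing it for injection exposure, becomes a matter of inspecting a small table instead of raw markup.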
We also moved both the Markup and Robots Text File Reports to the Source Code section, a new section that now appears last among the application's reports.
The problem with federated searches (aka parallel, broadcast, or meta searches) is that they are implemented under the assumption that the more sources are searched at one time, the better for the user. This is not necessarily the case. Easy search results do not equate to the right results. Submitting a query on a given topic to dissimilar databases, with dissimilar content or about dissimilar topics, often produces off-topic or irrelevant results. Indeed, easy searching does not necessarily translate into a relevant experience.
And there is also the question of how to rank the results. Two strategies are frequently used: (a) data appending and (b) data fusion.
Some of the early federated search engines for the Web used to do (a) and soon came to be called meta search engines. These search tools simply returned a long list of results by appending the top N ranked results from each database in a tandem fashion. Obviously, this strategy failed to recognize the top M relevant results within this huge list and was soon phased out in favor of (b).
In (b), arithmetic or weighted averages of the top N results from the different databases are computed. The problem with this approach is that it is very subjective. To compute an arithmetic or weighted relevance score, who decides how much weight should be assigned to a given ranked result from a given database? No matter which weighting criteria are used, in the end it is still a subjective score, and one that does not necessarily improve the end-user search experience. Often it is just the opposite.
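A minimal sketch of a weighted, CombSUM-style fusion makes the subjectivity easy to see. The databases, scores, and weights below are arbitrary assumptions; the point is precisely that someone has to pick them.

```python
# Sketch of score-based data fusion: a weighted CombSUM-style merge.
from collections import defaultdict

# Top-N results per database: doc id -> normalized relevance score.
results = {
    "db_A": {"doc1": 0.90, "doc2": 0.60, "doc3": 0.30},
    "db_B": {"doc2": 0.80, "doc4": 0.70},
}

# Who decides these weights? That choice is the subjective step.
db_weights = {"db_A": 0.7, "db_B": 0.3}

fused = defaultdict(float)
for db, ranked in results.items():
    for doc, score in ranked.items():
        fused[doc] += db_weights[db] * score   # weighted CombSUM

for doc, score in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 3))
```

Shift the weights and the final ranking reshuffles, with no principled way to say which ordering better serves the user.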
This leads to what I call “The Search Paradox”: Information gateways as information roadblocks.
To learn more about this, visit this old link: http://www.accessmylibrary.com/article-1G1-182034526/federated-search-101-alexis.html
If you believe in, or care about, such things, then the following resources might interest you.
On stemming and counting search results
Our results indicate that Google uses a document-based algorithm for stemming. It evaluates each document separately and makes a decision to index or not for the conflated forms of the words it has. It indexes documents only for word forms that are semantically strongly correlated. While it indexes documents for singulars and plurals frequently, it rarely indexes documents for word forms with the postfixes of -able or -tively.
This study investigates the accuracy of search engine hit counts for search queries. We investigate the accuracy of hit counts for Google, Yahoo and Microsoft Live Search, and the accuracy of single and multiple term queries. In addition, we investigate the consistency of hit count estimates for 15 days. The results show that all three provide estimates for the number of matching documents and the estimation patterns of their counting algorithms differ greatly. The accuracy of hit counts for multiple word queries has not been studied before. The results of our study show that the number of words in queries affects the accuracy of estimations significantly. The percentages of accurate hit count estimations are reduced almost by half when going from single word to two word query tests in all three search engines. With the increase in the number of query words, the error in estimation increases and the number of accurate estimations decreases.
This article reviews the rewards and limitations of either acquiring Web content and processing it into a static corpus or else accessing it directly as a dynamic corpus, a distinction captured in the phrases “Web for corpus” and “Web as corpus”. It then surveys typical applications of such data to both academic analysis and real-world tasks.
In this paper, we investigate the trustworthiness of search engines’ hit counts, numbers returned as search result counts. Since many studies adopt search engines’ hit counts to estimate the popularity of input queries, the reliability of hit counts is indispensable for archiving trustworthy studies. However, hit counts are unreliable because they change when a user clicks the “Search” button more than once or clicks the “Next” button on the search results page, or when a user queries the same term on separate days. In this paper, we analyze the characteristics of hit count transition by gathering various types of hit counts over two months by using 10,000 queries. The results of our study show that the hit counts with the largest search offset just before search engines adjust their hit counts are the most reliable. Moreover, hit counts are the most reliable when they are consistent over approximately a week.
For those who still believe in counting search results, despite the evidence from the articles above:
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine’s index using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding “weight”, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine’s index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
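To see the reweighting idea from that abstract at work, here is a toy illustration of importance sampling correcting a length-biased sample. The stand-in sampler, lengths, and estimate below are fabricated for illustration; they are not the paper's lexicon-based or random-walk methods.

```python
# Toy importance sampling: each sampled document carries weight 1/p(doc),
# where p is the (biased) probability the sampler selected it; averaging
# with these weights approximates a uniform-sample estimate.
import random

random.seed(42)
docs = list(range(1000))
lengths = [1 + (d % 10) for d in docs]        # toy "document lengths"
total = sum(lengths)
p = [ln / total for ln in lengths]            # length-biased selection prob.

sample = random.choices(docs, weights=p, k=5000)   # biased sample

# Estimate the fraction of "long" docs (length >= 8) two ways.
truth = sum(ln >= 8 for ln in lengths) / len(docs)
naive = sum(lengths[d] >= 8 for d in sample) / len(sample)
w = [1.0 / p[d] for d in sample]              # importance weights
weighted = sum(wi for d, wi in zip(sample, w) if lengths[d] >= 8) / sum(w)

print(f"truth={truth:.3f} naive(biased)={naive:.3f} reweighted={weighted:.3f}")
```

The naive estimate overshoots because long documents are oversampled, while the reweighted estimate lands near the truth, which is the same correction that makes near-uniform index sampling possible.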