If you believe in or care about counting search results, the following resources might interest you.

On stemming and counting search results

http://jis.sagepub.com/content/early/2009/05/28/1363459309336801

Our results indicate that Google uses a document-based algorithm for stemming. It evaluates each document separately and makes a decision to index or not for the conflated forms of the words it has. It indexes documents only for word forms that are semantically strongly correlated. While it indexes documents for singulars and plurals frequently, it rarely indexes documents for word forms with the postfixes of -able or -tively.
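
To make "conflated forms" concrete, here is a toy sketch using a classical Porter stemmer (not the document-based algorithm the paper attributes to Google), assuming NLTK is installed. It simply shows which word-form pairs a stemmer conflates to the same stem:

```python
# A minimal sketch of word-form conflation, assuming NLTK is installed
# (pip install nltk). This only illustrates what "conflated forms" means;
# it is not the document-based algorithm the paper attributes to Google.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Pairs drawn from the abstract's categories: singular/plural
# versus -able and -tively derivations.
pairs = [
    ("cat", "cats"),
    ("compare", "comparable"),
    ("relate", "relatively"),
]

for base, derived in pairs:
    s1, s2 = stemmer.stem(base), stemmer.stem(derived)
    verdict = "conflated" if s1 == s2 else "not conflated"
    print(f"{base} -> {s1}, {derived} -> {s2}: {verdict}")
```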

http://jis.sagepub.com/content/35/4/469.short

This study investigates the accuracy of search engine hit counts for search queries. We examine the accuracy of hit counts for Google, Yahoo and Microsoft Live Search, for both single and multiple term queries. In addition, we investigate the consistency of hit count estimates over 15 days. The results show that all three engines provide estimates of the number of matching documents, and that the estimation patterns of their counting algorithms differ greatly. The accuracy of hit counts for multiple word queries has not been studied before. Our results show that the number of words in a query significantly affects the accuracy of the estimates: the percentage of accurate hit count estimates is almost halved when going from single word to two word queries in all three search engines, and as the number of query words increases further, the estimation error grows and the number of accurate estimates decreases.
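
To see the kind of measurement involved, here is a rough sketch of how such an accuracy test could be run. The `search()` client below is a hypothetical placeholder (its name, signature and return shape are my assumptions, not any engine's real API); the idea is to compare the reported hit count against the number of results you can actually retrieve, which only works for queries small enough to enumerate:

```python
# A sketch of one way to measure hit-count accuracy. `search()` is a
# hypothetical stand-in for a real search API client; its signature and
# return shape are assumptions. Enumeration only works for queries with
# few enough results to page through completely.

def search(query: str, offset: int = 0, page_size: int = 10):
    """Hypothetical client: returns (estimated_total, results_on_page)."""
    raise NotImplementedError("wire this up to a real search API")

def actual_result_count(query: str, page_size: int = 10,
                        max_pages: int = 100) -> int:
    """Page through the results until the engine runs out of them."""
    total = 0
    for page in range(max_pages):
        _, results = search(query, offset=page * page_size,
                            page_size=page_size)
        total += len(results)
        if len(results) < page_size:  # last page reached
            break
    return total

def relative_hit_count_error(query: str) -> float:
    """Relative error of the reported estimate vs. enumerated results."""
    estimate, _ = search(query)
    actual = actual_result_count(query)
    return abs(estimate - actual) / max(actual, 1)
```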

http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf

This article reviews the rewards and limitations of either acquiring Web content and processing it into a static corpus or else accessing it directly as a dynamic corpus, a distinction captured in the phrase Web for / as Corpus. In the process, it surveys typical applications of such data to both academic analysis and real-world solutions.

http://gplsi.dlsi.ua.es/congresos/qwe10/fitxers/QWE10_Funahashi.pdf

In this paper, we investigate the trustworthiness of search engines' hit counts, the numbers returned as search result counts. Since many studies adopt search engines' hit counts to estimate the popularity of input queries, the reliability of hit counts is indispensable for achieving trustworthy studies. However, hit counts are unreliable because they change when a user clicks the "Search" button more than once or clicks the "Next" button on the search results page, or when a user queries the same term on separate days. In this paper, we analyze the characteristics of hit count transition by gathering various types of hit counts over two months using 10,000 queries. The results of our study show that the hit counts with the largest search offset, taken just before search engines adjust their hit counts, are the most reliable. Moreover, hit counts are the most reliable when they are consistent over approximately a week.
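
The paper's second criterion, consistency over roughly a week, is easy to operationalize. A minimal sketch (the 5% tolerance and the one-observation-per-day layout are my assumptions, not parameters from the paper):

```python
# A sketch of the week-long consistency heuristic described above.
# The 5% tolerance and the daily-observation layout are assumptions,
# not values taken from the paper.
from statistics import median

def is_reliable(daily_hit_counts: list[int], window: int = 7,
                tolerance: float = 0.05) -> bool:
    """True if the last `window` daily observations stay within
    `tolerance` of their median."""
    if len(daily_hit_counts) < window:
        return False
    recent = daily_hit_counts[-window:]
    mid = median(recent)
    return all(abs(count - mid) <= tolerance * mid for count in recent)

# A stable week of observations vs. a fluctuating one:
print(is_reliable([10200, 10150, 10180, 10210, 10190, 10170, 10200]))  # True
print(is_reliable([10200, 54000, 9800, 51000, 10100, 49000, 10050]))   # False
```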

For those who still believe in counting search results despite the evidence in the articles above:

http://www2006.org/programme/item.php?id=3047

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine’s index using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding “weight”, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine’s index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
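
Of the three Monte Carlo methods mentioned, rejection sampling is the simplest to illustrate. A minimal sketch, assuming each sampled document comes paired with its selection probability, the "weight" the abstract describes:

```python
# A minimal sketch of the rejection-sampling step: given documents drawn
# with known (biased) selection probabilities, accept each one with
# probability inversely proportional to that probability, so that the
# accepted documents are near-uniform. The (doc_id, probability) input
# format is my assumption about how the samplers' output is represented.
import random

def near_uniform_subsample(samples: list[tuple[str, float]]) -> list[str]:
    """samples: (document_id, selection_probability) pairs."""
    p_min = min(p for _, p in samples)
    accepted = []
    for doc_id, p in samples:
        # Accept with probability p_min / p: over-represented documents
        # are thinned aggressively, rarely selected ones are mostly kept.
        if random.random() < p_min / p:
            accepted.append(doc_id)
    return accepted
```

Importance sampling would instead keep every document and weight downstream statistics by 1/p, while Metropolis-Hastings replaces the independent accept/reject step with a random walk over documents.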