For years, marketers have called PageRank-based models into question via “sandbox” theories and similar tales.
This older paper may help address the question of whether such models are biased or unbiased: Paradoxical Effects in PageRank Incremental Computations.
What interests me most about the paper is how the authors put Kendall’s τ to use. Written by Paolo Boldi, Massimo Santini, and Sebastiano Vigna, the abstract states:
“Abstract. Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using Kendall’s τ. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.”
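Kendall’s τ measures how well two rankings agree: the number of concordant pairs minus discordant pairs, divided by the total number of pairs, giving +1 for identical orders and −1 for reversed ones. A minimal, tie-free sketch (my own illustration; the page names are invented, not taken from the paper):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / total pairs,
    for two rankings of the same items with no ties."""
    assert set(rank_a) == set(rank_b)
    # Position of each item under each ordering.
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Same sign means both orderings rank x and y the same way round.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Ordering from the full graph vs. an ordering from a partial crawl
# in which pages B and C come out swapped:
full    = ["A", "B", "C", "D", "E"]
partial = ["A", "C", "B", "D", "E"]
print(kendall_tau(full, partial))  # 0.8: one discordant pair out of ten
```

The paper applies this to node orders induced by PageRank scores; `scipy.stats.kendalltau` computes the same statistic (with tie corrections) for larger inputs.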
The authors also state:
“The most classical visit strategies are the following:
• Depth-first order: the crawler chooses the next page as the last that was added to the frontier; in other words, the visit proceeds in a LIFO fashion.
• Random order: the crawler chooses randomly the next page from the frontier.
• Breadth-first order: the crawler chooses the next page as the first that was added to the frontier; in other words, the visit proceeds in a FIFO fashion.
• Omniscient order (or quality-first order): the crawler uses a queue prioritised by PageRank values [Cho et al. 98]; in other words, it chooses to visit the page with highest quality among the ones in the frontier. This visit is meaningless unless a previous PageRank computation of the entire graph has been performed before the visit, but it is useful for comparisons. A variant of this strategy may also be adopted if we have already performed a crawl and thus we have the (old) PageRank values of (at least some of the) pages.”
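The four strategies differ only in how the next page is drawn from the frontier. A minimal sketch of that loop (my own illustration, not code from the paper; the toy graph and function name are invented):

```python
import random
from collections import deque

def crawl(graph, seed, strategy, scores=None, rng=None):
    """Visit `graph` (dict: node -> list of out-links) starting at `seed`.
    strategy: 'dfs' (LIFO), 'bfs' (FIFO), 'random', or 'omniscient'
    (the last needs `scores` precomputed on the full graph)."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    rng = rng or random.Random(0)
    while frontier:
        if strategy == 'dfs':
            page = frontier.pop()        # last added (LIFO)
        elif strategy == 'bfs':
            page = frontier.popleft()    # first added (FIFO)
        elif strategy == 'random':
            i = rng.randrange(len(frontier))
            frontier.rotate(-i)          # bring a random element to the front
            page = frontier.popleft()
        else:  # 'omniscient': highest-quality page currently in the frontier
            page = max(frontier, key=scores.__getitem__)
            frontier.remove(page)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:         # enqueue each page at most once
                seen.add(link)
                frontier.append(link)
    return order

g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl(g, "a", "bfs"))  # ['a', 'b', 'c', 'd']
print(crawl(g, "a", "dfs"))  # ['a', 'c', 'd', 'b']
```

Note that the omniscient order requires `scores` computed on the whole graph in advance, which is exactly why the authors call it meaningless as a practical strategy but useful as a baseline.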
“Both common sense and experiments (see, in particular, [Boldi et al. 05]) suggest that the visits listed above accumulate PageRank in an increasingly quicker way. This is to be expected, as the omniscient visit will point immediately to pages of high quality. The fact that the breadth-first visit yields high-quality pages was noted in [Najork and Wiener 01].”
“There is, however, a different and also quite relevant problem that has been previously overlooked in the literature: if we assume that the crawler has no previous knowledge of the web region it has to crawl, it is natural that it will try to detect page quality during the crawl itself, by computing PageRank on the region it has just seen. We would like to know whether a crawler doing so will obtain reasonable results or not.”
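That worry can be made concrete with a toy example: a plain power-iteration PageRank (my own sketch, not the paper’s code; the five-page graph and all names are invented) computed on a full graph and on the subgraph a crawler has seen so far. Pages d and e, not yet crawled, are the only reason b outranks c, so the partial view cannot tell b and c apart:

```python
def pagerank(graph, d=0.85, iters=100):
    """Power-iteration PageRank over dict: node -> list of out-links.
    Mass from dangling nodes is redistributed uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v in nodes:
            for u in graph[v]:
                new[u] += d * rank[v] / len(graph[v])
        rank = new
    return rank

# Toy web: pages d and e (not yet crawled) both link to b.
full = {"a": ["b", "c"], "b": ["a"], "c": ["a"],
        "d": ["b"], "e": ["b"]}
# What the crawler has seen after visiting only a, b, c.
seen = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

pr_full = pagerank(full)
pr_seen = pagerank(seen)
print(pr_full["b"] > pr_full["c"])              # True: d and e boost b
print(abs(pr_seen["b"] - pr_seen["c"]) < 1e-9)  # True: tied in the partial view
```

In the visited subgraph b and c are structurally identical and get equal scores, while in the complete graph b strictly outranks c; comparing the two node orders with Kendall’s τ is exactly the measurement the paper carries out at scale.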