On 11-16, Stephane Labert sent me a copy of an article that attempts to correlate Google’s PageRank with the rank of a document in that search engine’s result pages (SERPs).

Although Labert apparently worked hard on the piece, and deserves proper credit for that, I found the article disappointing on three grounds: the sampling, the chosen regression model, and the statistical analysis employed.

I sent Labert a few tips and things to look at, since my perception was that the article was not ready for prime time. My intention was to spare Labert unnecessary “harm”.

I was too late. Apparently, by the time I received it, the piece had already been sent to many well-known SEOs and webmasters. These included some IRW readers, among them expert cloaker Ralph Tegtmeier, aka fantomaster.

On 11-17, Tegtmeier blogged about it. He and other SEOs promptly called the article’s statistical analysis into question. I am not going to go over their reactions, since I pretty much agree with their critiques. Besides, the main issues argued by Labert and these SEOs are not new at all and have been revisited many times. For those interested, reactions to Labert’s article can be read at the following links:

http://sphinn.com/story/14452#wholecomment18087

http://www.timnash.co.uk/11/2007/lies-damn-lies-and-pagerank-statistics/

Rather than echoing their comments, I prefer to discuss the experimental aspects of Labert’s article:

**Firstly, the sampling:**

There is no full disclosure of how the data was collected. To be honest, this undermines the article’s credibility. Which queries were used? How many terms were used per query: 2, 3, 4…? Which query modes were used: AND (FINDALL), ANY (OR), EXACT, constraining modes…? This is important since many variables, including the query, can influence SERPs. None of this was disclosed in the article.

As mentioned, many variables affect ranking results, and some interact with one another. Ignoring these interactions, isolating one variable, and plotting it against an X-axis does not provide an accurate picture.

**Secondly, the regression model:**

Why was the data fitted to a linear model when it actually tends to be nonlinear? Why were apparent outliers included in the least-squares analysis? Which error analysis with respect to the slope was used to justify the inclusion or rejection of these apparent outliers? None of this was explained or reported.
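To make the kind of error analysis I am asking for concrete, here is a minimal sketch in Python. It is not Labert’s method. It fits an ordinary least-squares line and returns the standard error of the slope together with the residuals. The function name `linear_fit_with_slope_error` is my own, hypothetical choice, and any data passed to it is up to the reader.

```python
import math

def linear_fit_with_slope_error(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, slope_se, residuals), where slope_se is
    the standard error of the slope based on n - 2 degrees of freedom.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residuals: observed y minus the y predicted by the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    # Residual standard deviation, then the standard error of the slope.
    s_res = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    slope_se = s_res / math.sqrt(sxx)
    return slope, intercept, slope_se, residuals
```

Two things can be read off the output. A slope whose standard error is comparable to the slope itself is not well determined. Residuals that change sign in a systematic pattern along the X-axis, rather than scattering randomly, hint that a straight line is the wrong model. A point with a residual far larger than the rest is a candidate outlier that deserves an explicit accept or reject decision.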

**Thirdly, variable dependencies:**

All graphs show a fitted regression line with a very small slope. This suggests that changes along the X-axis (Rank) produce only small changes along the Y-axis (PageRank), indicating that the variables are almost independent of one another, despite the correlation coefficient reportedly being close to 1.

Indeed, a correlation coefficient close to 1 is not enough. To investigate whether two variables depend on one another, or whether there is a significant correlation between them, we need to do more than just look at a bunch of correlation coefficients. As a matter of fact, an almost flat straight line, nearly orthogonal to the Y-axis, actually suggests orthogonality and variable independence.

To assess whether the correlation found is significant, one could conduct a two-tailed t-test with n – 2 degrees of freedom on the correlation coefficient at a defined confidence level. One would state the null hypothesis that there is no correlation between X and Y, and then compare the experimental t-value against tabulated values from t-test tables. If the absolute value of t-experimental is greater than t-table, the null hypothesis is rejected; that is, we conclude in such a case that a significant correlation does exist. This test was not reported, either.
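The test just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the r and n values in the usage lines are made up for the example and are not Labert’s data, and the critical value 2.306 is the standard two-tailed table entry for 8 degrees of freedom at the 95% confidence level. The statistic used is t = r·sqrt(n – 2) / sqrt(1 – r²).

```python
import math

def correlation_t_statistic(r, n):
    """Experimental t-value for a correlation coefficient r from n pairs.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
    """
    if n < 3:
        raise ValueError("need at least 3 data pairs")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def is_significant(r, n, t_table):
    """Two-tailed decision: reject 'no correlation' if |t| exceeds t_table.

    t_table must be looked up by the caller for n - 2 degrees of freedom
    at the chosen confidence level.
    """
    return abs(correlation_t_statistic(r, n)) > t_table

# Hypothetical example: n = 10 pairs, so 8 degrees of freedom.
# Tabulated two-tailed critical value at the 95% confidence level.
T_TABLE_DF8_95 = 2.306

print(is_significant(0.80, 10, T_TABLE_DF8_95))  # t is about 3.77
print(is_significant(0.50, 10, T_TABLE_DF8_95))  # t is about 1.63
```

With these made-up numbers, r = 0.80 passes the test while r = 0.50 does not, even though both come from the same sample size. The sample size matters as much as r itself, which is why reporting a correlation coefficient without n and without the test says very little.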

Labert claims to have conducted more detailed research to support the article’s claims. I look forward to reading that.


E. Garcia said: A tutorial on the correct way of computing and analyzing correlation coefficients is available at http://www.miislita.com/information-retrieval-tutorial/a-tutorial-on-correlation-coefficients.pdf

In addition, a response to SEOmoz’s “rebuttal” and alleged “knowledge” of statistics is now available at https://irthoughts.wordpress.com/2010/07/12/on-seomoz-knowledge-about-statistics/.

Dr. E. Garcia