Beware of SEO statistical studies as their findings can easily evolve into urban legends and myths when taken at face value. Why? Keep reading.

From time to time, SEOs publish statistical studies that apply incorrect reasoning to statistical analysis and PageRank.

For instance, back in 2007, an SEO published a study trying to find correlations between the rank of a page in the search results and its PageRank. We wrote two rebuttals to that study:

http://irthoughts.wordpress.com/2007/11/20/a-pagerank-rank-correlation/
http://irthoughts.wordpress.com/2008/08/29/how-not-to-use-correlation-coefficients/

Evidently, we cannot read too much into correlation coefficients, particularly without providing a t-test analysis. In fact, x-y paired variables can be almost orthogonal (perpendicular) suggesting independence and still have high correlation coefficients, which was actually the case in the mentioned “study”.
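To make the t-test remark concrete, here is a minimal sketch of the usual significance test for a Pearson coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom. The sample sizes and the value r = 0.8 are made up for illustration; the critical values are the two-tailed 5% entries from a standard t table.

```python
import math

def t_statistic(r: float, n: int) -> float:
    """t statistic for H0 'no linear correlation', with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Two-tailed 5% critical values from a standard t table (df = n - 2).
T_CRIT_DF3 = 3.182    # n = 5
T_CRIT_DF28 = 2.048   # n = 30

# Hypothetical case: the same coefficient, r = 0.8, from two sample sizes.
t_small = t_statistic(0.8, 5)    # about 2.31, below 3.182: not significant
t_large = t_statistic(0.8, 30)   # about 7.06, above 2.048: significant

print(f"n=5:  t={t_small:.2f}  significant={t_small > T_CRIT_DF3}")
print(f"n=30: t={t_large:.2f}  significant={t_large > T_CRIT_DF28}")
```

The same coefficient is significant in one case and not in the other, which is why a bare r value without a test says little.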

Data can also be non-linear and still have a high correlation coefficient, as many books on statistics show. Furthermore, a low or zero correlation coefficient does not necessarily mean that the data is uncorrelated. This is illustrated in the following figure, taken from “Statistics for Analytical Chemistry” (J.C. Miller and J. N. Miller, 1984).

[Figure: nonlinear correlation coefficients]

Note that a coefficient (Pearson in this case) can be high or low while the data is still nonlinear AND well correlated. These two results just show that the data does not fit a linear model. Simply looking at a coefficient and inferring no correlation because it is too low is misleading (bottom figure). Similarly, if the coefficient is high we cannot just neglect non-linearity (top figure) and assume linearity. Please enlarge the figure and read the text above it to understand why.

In addition, to find a pattern in the data, other tests like PCA or SPCA are necessary. See the tutorial at
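The same two situations can be reproduced with made-up numbers. This sketch does not use the data from the book figure: it takes the purely nonlinear relation y = x² and computes Pearson's r over two hypothetical x ranges.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Case 1: x = 1..10. The curve is clearly not a line, yet r is high.
x_pos = list(range(1, 11))
r_high = pearson(x_pos, [v * v for v in x_pos])

# Case 2: x = -5..5. y is fully determined by x, yet r is zero.
x_sym = list(range(-5, 6))
r_zero = pearson(x_sym, [v * v for v in x_sym])

print(f"y = x^2, x in 1..10:  r = {r_high:.4f}")
print(f"y = x^2, x in -5..5:  r = {r_zero:.4f}")
```

In both cases y depends exactly on x; only the fit to a straight line changes.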

http://www.miislita.com/information-retrieval-tutorial/pca-spca-tutorial.pdf

The selection of one coefficient over the other can be made with or without any a priori statistical knowledge about the data. However, while prior knowledge about the nature of the data can help, it can also hurt statistical reasoning: a tester can easily be biased toward finding what he/she wants to find, overlooking other types of pattern-finding analyses and reasoning or, worse, arbitrarily dropping outliers to fit the results to a preconceived data model.

Now that I’m on the subject of choosing one correlation coefficient over the other: in a recent post, another SEO (Randfish) published an article, “The Science of Ranking Correlations” (http://www.seomoz.org/blog/the-science-of-ranking-correlations), in which he states:

“Why did we use Spearman rather than Pearson correlation?

Pearson’s correlation is only good at measuring linear correlation, and many of the values we are looking at are not.  If something is well exponentially correlated (like link counts generally are), we don’t want to score them unfairly lower.”

Almost immediately, another poster (whiteweb_b) contradicted that assertion and wrote:

“Rand your (or Ben’s) reasoning for using Spearman correlation instead of Pearson is wrong. The difference between two correlations is not that one describes linear and the other exponential correlation, it is that they differ in the type of variables that they use.”

The poster is right on that one, especially when no t-test analysis or additional pattern/trend finder techniques were provided. So, which coefficient to use, when, and why: Pearson or Spearman?  That’s a fair question. We covered this in the following post:

http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/
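As a rough illustration of the variable-type point (not the treatment in the post linked above), Spearman's coefficient can be computed as Pearson's coefficient applied to the ranks of the data. The sketch below assumes no tied values and uses made-up data, y = exp(x).

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1 (smallest value gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(x, y):
    """Spearman coefficient as Pearson on ranks (valid when there are no ties)."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: a strictly increasing, exponential relation.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

r_pearson = pearson(x, y)     # well below 1: the relation is not linear
r_spearman = spearman(x, y)   # 1: the orderings of x and y agree exactly

print(f"Pearson  = {r_pearson:.3f}")
print(f"Spearman = {r_spearman:.3f}")
```

Spearman reaches 1 for any strictly increasing relation, exponential or otherwise, because it only sees the ordering. It works on ordinal (ranked) data and says nothing about the functional form.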

Something that few realize is the connection between Cosine Similarity and the Pearson and Spearman coefficients. This was also discussed in the following post:

http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/
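One way to see that connection, sketched here rather than taken from the linked post: Pearson's coefficient equals the cosine similarity of the two vectors after each has its mean subtracted. The vectors below are made up so that the raw vectors are orthogonal (cosine 0) while the Pearson coefficient is 1.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def centered(v):
    """Subtract the mean from every component."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def pearson(x, y):
    """Pearson coefficient as the cosine of the mean-centered vectors."""
    return cosine(centered(x), centered(y))

# Hypothetical vectors: y is x shifted by a constant chosen so that x . y = 0.
x = [1.0, 2.0, 3.0]
y = [v - 7.0 / 3.0 for v in x]

cos_xy = cosine(x, y)   # about 0: the raw vectors are orthogonal
r_xy = pearson(x, y)    # about 1: y is an exact linear function of x

print(f"cosine  = {cos_xy:.6f}")
print(f"Pearson = {r_xy:.6f}")
```

Since Spearman is Pearson on ranks, it is in turn the cosine of the mean-centered rank vectors.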

To sum up, beware of SEO “Science” and their statistical “studies”.

I hope this helps.

PS: I updated this post to add the figure above and a few additional comments.
