I know I mentioned this before, but I keep getting the same question. How to compute Spearman correlation coefficient (rho) with Excel for a data set consisting of a relatively small number of paired values (n)?

Unlike for a Pearson’s correlation coefficient (r), Excel does not come with a built-in function for Spearman’s. You need to manipulate Excel.

Once you have the paired data in a rank format, there are several ways to proceed. Here are at least two.

1. compute the rank difference for each data point and square these. Next, compute the sum of squared differences (SSD) and program into Excel the formula rho = 1 – (6*(SSD)/(n*(n*n -1)))

2. You can also simply construct a scatterplot of the ranked data and fit the data to a linear regression curve. I do this in Windows 7 by adding focus to the scatterplot and selecting Layout 9 from the Design Tab of Excel. This gives you a visual overlay of how the ranks fluctuate around a straight line regression curve. It also gives you a Coefficient of Determination R^{2}, the square root of which is Pearson’s r, but because the data is ranked, this Pearson’s r is also Spearman’s rho. So the assertion that nonlinear data cannot be treated with Pearson’s correlation is not entirely correct. An example is given below.

The data was taken from http://www.mnstate.edu/wasson/ed602calccorr.htm

Person | Judge A Rank | Judge B Rank | D | D^{2} |

Alan | 1 | 2 | -1 | 1 |

Beth | 2 | 1 | 1 | 1 |

Carl | 3 | 4 | -1 | 1 |

Don | 4 | 3 | 1 | 1 |

Edgar | 5 | 6 | -1 | 1 |

Frances | 6 | 7 | -1 | 1 |

Gertrude | 7 | 5 | 2 | 4 |

Sum | 10 | |||

Spearman | 0.8214 |

Using the second method, Excel gives the regression equation

y = 0.8214x + 0.7143; with R^{2} = 0.6747

Hence, the square root of 0.6747.. is 0.8214..

Note that Spearman’s rho can be calculated from either the slope or the Coefficient of Determination.

For a large number of data points, you are better off by writing an Excel macro. For a really huge data set, you need an industry strength software solution.

Note from the above results that Spearman’s rho (equivalent to Pearson’s r for ranked data) is about 0.82. The Coefficient of Determination indicates that about 67% of the variations can be explained by the regression model, thus about 33% of the variations cannot be explained by the regression model. The smaller the correlation coefficients the more likely the data points will be scattered on the graph. Without considering scatterplots, t-test significance analysis, and slope analyses it is easy to misinterpret correlation coefficients.

On a side note, trying to compare small correlation coefficients (Spearman’s or Pearson’s), frequently leads to flawed or useless conclusions. I don’t understand why some search engine marketers are still insisting in doing that (http://www.seomoz.org/blog/google-vs-bing-correlation-analysis-of-ranking-elements ). It would be interesting to see how their scatterplots look like. That’s how many SEO myths start: with flawed methodologies/reasonings.

ijustwritecode

said:I did the math for the SEOmost post.

I am pretty sure I did the math correctly, but if you want to be less general in your criticism I would be happy to respond.

For instance, what would doing the t-test tell you that isn’t clear from the stderr values we published on the graphs? What are you wanting t-tests on? I assume you want us to show we can reject the null hypothesis that variables are uncorrelated, but just from the stderr you can see we will reject most of them with overwhelming confidence, except for those that are shown on the graph as extremely close to the axis (and the fact we can’t reject those is also clear from looking the graph with the stderr values on them).

More specifically, what are you wanting to know from looking at a scatter plot of the data? I can certainly make a few. Keep in mind, they will generally have over 10,000 points on each. In what specific ways are you worried that the coefficients we published are misleading without the scatter plots? Same question about R^2 values, and same offer to compute them if they really would be that interesting to you. R^2 values will be quite small, just like the correlation coefficients are. Seeing as the r^2 values will vary with the coefficients, does having them both really add information?

Cheers,

Ben

E. Garcia

said:Evidently, you don’t know how to calculate the standard error of a correlation coefficient, otherwise you won’t ask why we need to compute r^2 and t-values.

The procedure is described at the following links in very simple terms:

http://www2.hawaii.edu/~donaldp/bio280/doc/BioStatistics%20Topic%2023.pdf

http://www.uoregon.edu/~stevensj/MRA/correlat.pdf

The first link also shows that correlation coefficients are compared with pooled standard errors of these.

The general formula for a t-test is: t = (parameter)/(standard error of parameter). For a correlation coefficient the standard error (SE) is computed as follows.

We first compute the coefficient of determination r^2. This gives the fraction of variation that can be explained by the model. Next we compute the amount of variation that cannot be explained by the model as 1 – r^2. This is then normalized by the degree of freedom used. For a two-tail this is n – 2. Finally we square root the result. The standard error of a correlation coefficient is then computed as

SE = sqrt[(1 – r^2)/(n – 2)]

Once we compute SE what we do with it? We calculate a t-value as

t = r/SE

This is done to test whether the correlation coefficient is significant. If the correlation coefficient are not significant any comparison between these is useless.

t-calculate is then compared against a t-table at a given confidence level. The null hypothesis in this case is that there is no correlation between x and y. If t-calculated is greater than t-table, the null hypothesis is rejected and concluded that a significant correlation does exists.

Note that the standard error of the mean and the standard error of a correlation coefficient are two different things. Moreover, the standard deviation of the mean is not used to calculate the standard error of a correlation coefficient or to compare correlation coefficients.

In the original article, you and Rand talk about standard errors in convoluted ways. Rand place a Wikipedia link to Standard Error of the mean and then you and few posters talk about standard errors of a correlation coefficient. And then go on to combine these concepts. For instance you wrote:

“Consider the correlation of PR to Google.com. The correlation coefficient is greater than 0.18, standard error is less than 0.0056, so the null hypothesis being right would be an event of more than 32.143(=.18/0.0056) standard deviations.”

You divided a correlation coefficient by a standard error and called that ratio standard deviations. You cannot do that and it invalidates your treatments and reasonings. First r/SE where SE is the standard error of the correlation coefficient is a t-value, not a standard deviation. If you used SE as the standard error of the mean, that’s even worse.

One more thing and this is to Rand. Rand, r = 0 does not imply variables are random. It only means they are not linearly correlated; though they might be nonlinearly correlated and not necessarily random. Stop misleading the public.

I am the one that is now demanding a public retraction from you both for putting out that kind of crap “science” at SEOMoz.

Pingback: On SEO Quackery « IR Thoughts

E. Garcia

said:One more thing. The reason we need to use pooling when comparing correlation coefficients is that these are not additive. Therefore, it is completely incorrect to compute an average correlation coefficient. Similarly it is a nonsense try to use such average to compute a standard deviation of correlation coefficients, as you have confesed at this sphinn post:

http://sphinn.com/story/151848/

Rand and Ben: Clearly, your methodology is quack science.

Why this cannot be done? It is well explained here: http://www.statsoft.com/textbook/basic-statistics/

I think a tutorial on correlation coefficent analysis need to be published to dispel so much nonsense going around. Again, an r/SE ratio is not a standard deviation, but a t-value.

E. Garcia

said:Actually, let me scratch the “a tutorial needs to be published”. I am now putting together two tutorials:

1. A Tutorial on Standard Errors

2. A Tutorial on Correlation Coefficients

These will be out probably this week. After reading these tutorials, SEOs will understand why SEOMoz “statistical treatment” has no basis on any theoretical or experimental grounds.

Will they ever publish a public retraction for putting out, promoting, and trying to save face on their quack “science”? Time will tell.

E. Garcia

said:The tutorial on standard errors is available now.

http://irthoughts.wordpress.com/2010/06/25/a-tutorial-on-standard-errors/

Enjoy it.

E. Garcia

said:A full response to SEOmoz “rebuttal” and alleged “knowledge” on statistics is available now at http://irthoughts.wordpress.com/2010/07/12/on-seomoz-knowledge-about-statistics/.

In addition, a tutorial on the correct way of computing and analyzing correlation coefficients is available at http://www.miislita.com/information-retrieval-tutorial/a-tutorial-on-correlation-coefficients.pdf

I hope this helps.

egarcia

said:I see many recent page views to this post, so it might be useful to link an important reference which deals with the Non-Additivity of Correlation Coefficients. http://irthoughts.wordpress.com/2011/01/07/on-the-non-additivity-of-correlation-coefficients/

A tutorial on this subject is available at http://www.miislita.com . It shows why SEOmoz treatment is flawed as they assume these are additive. Anyway, Ben and Fishkin conceeded they treatment was flawed. So this is a little bit like beating a dead horse or a Little Disfunctional Amoeba (LDA).

egarcia

said:Here is another reason as to why seoMOZ LDA studies were flawed..

Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid.

http://irthoughts.wordpress.com/2012/10/02/a-quantile-quantile-q-q-plot-tutorial-in-excel/

With very small correlation coefficients, the scatter graphs will be far from normality!