I know I mentioned this before, but I keep getting the same question. How to compute Spearman correlation coefficient (rho) with Excel for a data set consisting of a relatively small number of paired values (n)?

Unlike for a Pearson’s correlation coefficient (r), Excel does not come with a built-in function for Spearman’s. You need to manipulate Excel.

Once you have the paired data in a rank format, there are several ways to proceed. Here are at least two.

1. compute the rank difference for each data point and square these. Next, compute the sum of squared differences (SSD) and program into Excel the formula rho = 1 – (6*(SSD)/(n*(n*n -1)))

2. You can also simply construct a scatterplot of the ranked data and fit the data to a linear regression curve. I do this in Windows 7 by adding focus to the scatterplot and selecting Layout 9 from the Design Tab of Excel. This gives you a visual overlay of how the ranks fluctuate around a straight line regression curve. It also gives you a Coefficient of Determination R2, the square root of which is Pearson’s r, but because the data is ranked, this Pearson’s r is also Spearman’s rho. So the assertion that nonlinear data cannot be treated with Pearson’s correlation is not entirely correct. An example is given below.

The data was taken from http://www.mnstate.edu/wasson/ed602calccorr.htm

 Person Judge A Rank Judge B Rank D D2 Alan 1 2 -1 1 Beth 2 1 1 1 Carl 3 4 -1 1 Don 4 3 1 1 Edgar 5 6 -1 1 Frances 6 7 -1 1 Gertrude 7 5 2 4 Sum 10 Spearman 0.8214 Using the second method, Excel gives the regression equation
y = 0.8214x + 0.7143; with R2 = 0.6747

Hence, the square root of 0.6747.. is 0.8214..

Note that Spearman’s rho can be calculated from either the slope or the Coefficient of Determination.

For a large number of data points, you are better off by writing an Excel macro. For a really huge data set, you need an industry strength software solution.

Note from the above results that Spearman’s rho (equivalent to Pearson’s r for ranked data) is about 0.82. The Coefficient of Determination indicates that about 67% of the variations can be explained by the regression model, thus about 33% of the variations cannot be explained by the regression model. The smaller the correlation coefficients the more likely the data points will be scattered on the graph. Without considering scatterplots, t-test significance analysis, and slope analyses it is easy to misinterpret correlation coefficients.

On a side note, trying to compare small correlation coefficients (Spearman’s or Pearson’s), frequently leads to flawed or useless conclusions. I don’t understand why some search engine marketers are still insisting in doing that (http://www.seomoz.org/blog/google-vs-bing-correlation-analysis-of-ranking-elements ). It would be interesting to see how their scatterplots look like. That’s how many SEO myths start: with flawed methodologies/reasonings.

Advertisements