Spearman and Pearson Correlation Coefficients

New Relevant Tool: https://irthoughts.wordpress.com/2018/09/14/regression-correlation-calculator-updates-and-improvements/

I’ve been asked to explain the difference between the Spearman (S) and Pearson (P) correlation coefficients. It is a good question, as both are frequently used in data mining studies.

I hope this helps.

S is equivalent to P computed on the variables after these have been transformed into rank orders. In that case, we can determine S from the coefficient of determination (D) of a linear regression on the ranks: the magnitude of S is the square root of D, and its sign is the sign of the regression slope. For instance, if D = 0.49, S = 0.7 when the slope is positive and S = -0.7 when it is negative. BTW, D = 0.49 means that 49% of the variation in the ranks can be explained by the regression model, while 51% cannot. Thus, to compute S with EXCEL, simply rank-order the variables (giving tied values their average rank), apply linear regression on a scatter plot of the ranks, take the square root of the coefficient of determination, and inspect the slope to assign the sign.
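The rank-then-regress recipe can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (average_ranks, pearson, spearman, spearman_from_regression) and the sample data are made up for this example, and tied values are given their average rank.

```python
import math

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def _sums(x, y):
    """Return the centered sums Sxy, Sxx, Syy used by both methods."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy

def pearson(x, y):
    sxy, sxx, syy = _sums(x, y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's S as Pearson's P computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_from_regression(x, y):
    """Same value via regression on ranks: sqrt(D) with the slope's sign."""
    sxy, sxx, syy = _sums(average_ranks(x), average_ranks(y))
    slope = sxy / sxx
    d = sxy ** 2 / (sxx * syy)  # coefficient of determination
    return math.copysign(math.sqrt(d), slope)

# Made-up sample data for illustration.
x = [10, 20, 30, 40, 50]
y = [2, 1, 4, 3, 5]
print(round(spearman(x, y), 4))                  # approximately 0.8
print(round(spearman_from_regression(x, y), 4))  # same value
```

Both routes agree because, for a simple linear regression, the coefficient of determination is the square of the correlation coefficient; the slope only supplies the sign.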

Any change in the original variables that does not affect their rank order will leave S unchanged, although it may change P. For a given set of variables, if S is noticeably larger than P in magnitude, we might conclude that the variables are consistently (monotonically) related, but not in a linear fashion. However, if S and P are very similar and different from zero, there is an indication of a linear relationship.
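A small experiment makes the rank-order point concrete. In this sketch the data are invented, the helper names are hypothetical, and the simple ranking function assumes there are no tied values. The relationship y = x^3 is monotonic but not linear, and stretching the largest y value changes P while leaving S untouched.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Simple 1..n ranking; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]             # monotonic, but not linear
y_stretched = y[:-1] + [y[-1] * 10]  # largest value inflated, same rank order

p, s = pearson(x, y), spearman(x, y)
p2, s2 = pearson(x, y_stretched), spearman(x, y_stretched)

print("original : P =", round(p, 3), " S =", round(s, 3))
print("stretched: P =", round(p2, 3), " S =", round(s2, 3))
```

Here S stays at 1 in both cases because the ordering never changes, while P drops below 1 and falls further once the largest value is inflated.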

Pros/Cons of S

It is less sensitive to bias due to outliers and does not require the data to be metrically scaled or normally distributed; it only requires variables that can be rank-ordered, so it can be applied to ordinal variables. It measures how consistently the variables move together (a monotonic relationship) rather than how linear the relationship is. Ties must be factored into the computations, and calculations by hand are tedious.

Pros/Cons of P

It is easy to compute. Its usual significance tests assume normality in both variables. It is sensitive to outliers and measures only linear association.

Important Notes on Correlation

A correlation coefficient varies from +1 to -1. If it is zero, there is no linear (for P) or monotonic (for S) association between the variables, although they may still be related in some other way. If it is positive, the variables are positively correlated: one tends to increase when the other increases. If it is negative, they are negatively correlated: one tends to increase when the other decreases, and vice versa.

Correlation is not causality. It is just a measure of association between variables that addresses whether these covary. It is not necessary to prejudge these as dependent or independent before estimating correlation.

To determine whether the variables covary in a significant fashion, we can apply a t-test to the correlation coefficient with n - 2 degrees of freedom, where n is the number of paired observations, at a chosen confidence level, usually 95%.
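As a rough sketch of that test, the standard statistic is t = r * sqrt((n - 2) / (1 - r^2)), compared against a critical value from a t table. The values below are assumptions chosen for illustration: r = 0.7 and n = 10 are made up, the function name is hypothetical, and 2.306 is the tabulated two-tailed critical value for 8 degrees of freedom at the 95% level.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for a correlation coefficient r from n paired observations."""
    if n < 3:
        raise ValueError("need at least 3 observations (n - 2 degrees of freedom)")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt((n - 2) / (1 - r * r))

# Illustrative values only.
r, n = 0.7, 10
t_critical = 2.306  # t table: two-tailed, 95% level, n - 2 = 8 degrees of freedom

t = correlation_t_statistic(r, n)
is_significant = abs(t) > t_critical

print("t =", round(t, 3), "significant at 95%:", is_significant)
```

With fewer observations the same r may fail the test, since both the statistic and the critical value depend on n - 2.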

PS.

In a more recent post (https://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/), I explained the connection between Pearson and Spearman coefficients, cosine similarities, and dot products, and a particular case wherein all these are equivalent.

