This post shows the connection of Pearson's and Spearman's Correlation Coefficients with Cosine Similarity and Dot Products. As mentioned in previous posts:

Pearson’s Coefficient is equivalent to Spearman’s Coefficient for raw data that has been transformed into ranks.

Similarity is not Distance.

The ‘similarity distance’ expression is an oxymoron that should be avoided.

Arbitrary transformations between Distance and Similarity must be avoided.

It is important to know how to define Distance.

Transforming Pearson’s Coefficients into Cosine Similarities

A Pearson's Correlation Coefficient is equivalent to the cosine of the angle between paired data represented as vectors, provided that the raw data is first transformed into z-scores:

z(xi) = (xi – mean_x)/s_x
z(yi) = (yi – mean_y)/s_y

where the mean is subtracted and the difference is normalized with respect to the standard deviation. Thus, we end up with mean-centered, deviation-normalized data. Z-scored data is often called 'centered data'.
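The transformation above can be sketched in a few lines of NumPy; the sample values here are only illustrative:

```python
import numpy as np

# Hypothetical sample data; any numeric series works.
x = np.array([2.0, 4.0, 6.0, 20.0])

# z-score: subtract the mean, divide by the standard deviation.
# ddof=1 gives the sample standard deviation; ddof=0 (population)
# would also work, since the scale factor cancels in a correlation.
z = (x - x.mean()) / x.std(ddof=1)

print(z.mean())  # ~0: the z-scored data is mean-centered
```

Note that the z-scored series always has mean 0 and standard deviation 1, which is exactly what makes the cosine-of-the-angle interpretation work.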

An Example

To convince yourself, compute Pearson's Correlation Coefficient for the following data; it is 0.8837…


xi yi
2 3
4 2
6 5
20 7

If you center the data and calculate its cosine similarity, the result is the same.

If the centered data is converted into unit column vectors, the dot product of the vectors is also 0.8837…, since for unit vectors the cosine of the angle equals the dot product.

Now try this.

Rank the raw data. Next, center it, and then convert it into unit vectors. Convince yourself that in this case

Spearman’s = Pearson’s = Cosine Angle = Dot Product = 0.8000…
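The whole chain of equalities can be checked with SciPy; `rankdata` handles the ranking step:

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

x = np.array([2.0, 4.0, 6.0, 20.0])
y = np.array([3.0, 2.0, 5.0, 7.0])

# Spearman's coefficient computed directly...
rho = spearmanr(x, y).statistic

# ...equals the dot product of the centered, unit-length rank vectors.
rx, ry = rankdata(x), rankdata(y)
ux = (rx - rx.mean()) / np.linalg.norm(rx - rx.mean())
uy = (ry - ry.mean()) / np.linalg.norm(ry - ry.mean())

print(round(rho, 4), round(ux @ uy, 4))  # both 0.8
```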

Since Similarity is not Distance, none of these quantities is a Distance.
