I received yesterday the following email from a reader (name removed). I am reproducing it since the discussion might be of value to others with similar questions.
Hello Dr. Garcia,
I came across your tutorial “PCA and SPCA Tutorial” while trying to learn about PCA, and I decided to write to you with a question. I would be glad if you could answer it.
My feature matrix contains 30 feature vectors, and I want to reduce this dimension while retaining 95% of the explained variance. The ranges of some feature vectors differ greatly. For example, one feature vector’s range is about [0, 10], while others are around [-10^10, 10^10]. So when I directly subtract the mean and calculate the covariance matrix, one of the eigenvalues suppresses the others.
Is it a proper approach to first scale the data (z = (x - mean)/std_dev), then subtract the mean of the scaled version and calculate the covariance matrix?
When I try this procedure the eigenvalues seem to be correct, but I cannot be sure whether this is the right way.
Do you think this is correct? If not, what is the correct way using the covariance matrix rather than the correlation matrix?
Thanks in advance.
My answer follows.
The purpose of standardizing data (transforming it to z-scores, i.e., centering each feature and scaling it by its standard deviation) is to remove undesirable fluctuations in scale. This is particularly useful when there is a common source of error, e.g., as in a time series. Assuming this is your case, you are doing the right thing.
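As a minimal NumPy sketch of the procedure you describe (the data here is hypothetical, two features on very different scales): the covariance matrix of the z-scored data is exactly the correlation matrix of the raw data, so its eigenvalues are on a comparable scale and can be used to pick the number of components explaining 95% of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, one feature in [0, 10], one in [-1e10, 1e10].
X = np.column_stack([
    rng.uniform(0.0, 10.0, 100),
    rng.uniform(-1e10, 1e10, 100),
])

# Standardize each feature: z = (x - mean) / std (sample std, ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of the z-scored data is the correlation
# matrix of the raw data ...
cov_z = np.cov(Z, rowvar=False)
corr_x = np.corrcoef(X, rowvar=False)

# ... so the eigenvalues are comparable, and we can count how many
# components are needed to reach 95% explained variance.
eigvals = np.sort(np.linalg.eigvalsh(cov_z))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95) + 1)
print(k)
```

This is equivalent to running PCA on the correlation matrix of the original data, which answers the covariance-vs-correlation part of the question directly.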
An advantage of centering the data is that it is part of the PCA solution to minimizing the sum of squared errors (SSE): the best-fitting affine subspace always passes through the mean, so subtracting the mean reduces the problem to finding the best linear subspace.
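This SSE property can be checked numerically. A small sketch (the data is hypothetical): center the data, take the leading principal direction from the SVD, and verify that the reconstruction error of the resulting rank-1 affine fit equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 200 samples, 3 features.
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))

mu = X.mean(axis=0)
Xc = X - mu  # centering: the optimal affine subspace passes through the mean

# SVD of the centered data gives the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rank-1 reconstruction: project onto the first principal direction,
# then shift back by the mean.
v = Vt[0]
X_hat = mu + np.outer(Xc @ v, v)

# The SSE of this affine fit equals the sum of the discarded
# squared singular values -- the minimum over all rank-1 affine fits.
sse = ((X - X_hat) ** 2).sum()
print(np.isclose(sse, (s[1:] ** 2).sum()))  # True
```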
Centering the data has other advantages. It makes the cosine of the angle between two centered variables equal to Pearson’s correlation coefficient, so similarity information can be explored. Also, once in z-score form, the data can be checked to see whether it follows a normal distribution.
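The cosine/correlation identity is easy to demonstrate. A short sketch with two hypothetical variables:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two hypothetical variables with a partial linear relationship.
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Center both variables.
xc = x - x.mean()
yc = y - y.mean()

# Cosine of the angle between the centered vectors ...
cos_angle = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
# ... equals Pearson's correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(cos_angle, r))  # True
```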
For additional information, check these links:
I hope this helps.