I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is equivalent to applying Singular Value Decomposition (SVD) to the covariance matrix of a data set. Do not mistake PCA for Latent Semantic Indexing (LSI), as these are different animals.
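To make that equivalence concrete, here is a minimal NumPy sketch. The function name `pca_via_covariance_svd` and the synthetic data are my own illustrative assumptions, not part of the upcoming tutorial or the lab. It mean-centers the data, builds the covariance matrix, and applies SVD to it. Because a covariance matrix is symmetric and positive semi-definite, the singular values returned are its eigenvalues, i.e. the variance along each principal component.

```python
import numpy as np

def pca_via_covariance_svd(X):
    """PCA of X (n_samples x n_features) via SVD of its covariance matrix.

    Returns (components, variances): components has one principal
    direction per row, ordered by decreasing variance.
    Hypothetical helper, for illustration only.
    """
    Xc = X - X.mean(axis=0)              # mean-center each feature
    cov = np.cov(Xc, rowvar=False)       # (n_features x n_features) covariance
    U, S, _ = np.linalg.svd(cov)         # singular values come back sorted, largest first
    return U.T, S

# Synthetic data (an assumption): three features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

components, variances = pca_via_covariance_svd(X)
explained_ratio = variances / variances.sum()   # share of variance per component
scores = (X - X.mean(axis=0)) @ components.T    # data expressed in PC coordinates
```

The same variances can be obtained from the SVD of the mean-centered data matrix itself: square its singular values and divide by `n_samples - 1`. That route is often preferred numerically, since it avoids forming the covariance matrix.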

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select points along the dominant eigenvectors as initial centroids and apply K-Means.
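One way to sketch this idea in NumPy is shown below. Everything specific here is an assumption on my part: the function name `pca_seeded_kmeans`, the choice to use only the first principal component, and the rule of placing the k starting centroids at evenly spaced quantiles of the data's projection onto that component. The K-Means loop is plain Lloyd iteration.

```python
import numpy as np

def pca_seeded_kmeans(X, k, n_iter=100):
    """K-Means whose initial centroids lie on the dominant eigenvector.

    Hypothetical helper: the quantile-based seeding rule is an
    illustrative assumption, not a standard named algorithm.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    U, _, _ = np.linalg.svd(np.cov(Xc, rowvar=False))
    pc1 = U[:, 0]                           # dominant eigenvector (unit length)
    scores = Xc @ pc1                       # projection of each point onto pc1

    # Seed positions: midpoints of k equal-probability bins along pc1.
    q = (np.arange(k) + 0.5) / k
    positions = np.quantile(scores, q)
    centroids = mean + positions[:, None] * pc1[None, :]

    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep old centroid if a cluster is empty).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Synthetic data (an assumption): two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 2)),
])
centroids, labels = pca_seeded_kmeans(X, k=2)
```

Seeding along the direction of greatest variance tends to start the centroids spread across the data rather than bunched together, which is the exploratory benefit described above. It does not guarantee a good clustering when the cluster structure lies along other directions.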

Depending on the clusters being inspected, this simplifies a lot of things. It is not a silver bullet, just a handy tool in the researcher’s toolbox, but it can be combined with or used to support other exploratory analyses.

PCA, SVD, Kendall’s Test, Vector Space Theory, and a few other statistical algorithms have been applied for years in Chemometrics, Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels et al.

Here is a list of some clustering and classification algorithms used by chemists all over the world:

Pattern Recognition Methods


Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kohonen networks
Rule induction
Probabilistic methods (e.g., Autoclass).


Discriminant analysis
Neural Networks (Back propagation)
K-Nearest Neighbour (KNN)
Probabilistic methods (e.g., Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back in the limelight. There is nothing new under the Sun, so to speak.

I’m glad I also have a background in Chemistry.

Sometimes it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find SEO tales about how SVD and LSI algorithms work soooo hilarious.
