Some readers have asked me to clarify the difference between SVD and PCA, since the two share a lot of overlapping heritage. The distinction was addressed at a TREC9 presentation. For those interested in a mathematical explanation, or in ongoing research that uses these techniques, the following might help.

In January of 2006, thanks to a National Science Foundation (NSF) grant through the Institute for Pure and Applied Mathematics (IPAM) at UCLA, I was invited to attend their Document Space Workshop. The event provided a forum for worldwide experts in pure and applied mathematics and in information retrieval to present their views on what constitutes a document, a query, and a document space. Most of the speakers were leading experts who have written books in their respective fields.

One of the speakers was Professor Michael Trosset, from the College of William and Mary. His talk, “Trading Spaces: Measures of Document Proximity and Methods for Embedding Them”, made clear that similarity and dissimilarity in terms of Euclidean spaces and Euclidean distances are not enough to represent documents and queries. The general agreement appears to be that queries are linear combinations of terms. However, he mentioned that to define documents one first needs to define the document space.
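As a toy illustration of “queries are linear combinations of terms” (this sketch is entirely my own, not from the talk): in a vector space model each term defines one axis, and a query is just a weighted sum along those axes.

```python
import numpy as np

# Hypothetical three-term vocabulary; each term is one axis of the space.
vocab = ["svd", "pca", "subspace"]
axes = {t: np.eye(len(vocab))[i] for i, t in enumerate(vocab)}

# A query mentioning "svd" once and "pca" twice is a linear combination
# of the term vectors, weighted here by raw term frequency.
query = 1.0 * axes["svd"] + 2.0 * axes["pca"]
print(query)   # [1. 2. 0.]
```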

Professor Trosset then reviewed binary and quantitative Vector Space Models (VSM), Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). He described the difference between SVD and PCA as follows:

LSI finds the best linear subspace, while PCA finds the best affine linear subspace. To find the best affine linear subspace, first translate the feature vectors (yi) so that their centroid lies at the origin, then find the best linear subspace.
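Here is a minimal numpy sketch of that distinction (the toy matrix and variable names are my own, not from the talk): LSI takes a truncated SVD of the raw term-document matrix, while PCA first translates the document vectors so their centroid sits at the origin and then takes the same truncated SVD.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 1., 2.]])
k = 2  # number of dimensions to keep

# LSI: truncated SVD of the raw matrix (best linear subspace).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
lsi_docs = (np.diag(s[:k]) @ Vt[:k, :]).T     # documents in the k-dim LSI space

# PCA: translate the document vectors so their centroid is at the origin,
# then take the SVD (best affine linear subspace).
centroid = A.mean(axis=1, keepdims=True)
Uc, sc, Vct = np.linalg.svd(A - centroid, full_matrices=False)
pca_docs = (np.diag(sc[:k]) @ Vct[:k, :]).T   # documents in the k-dim PCA space

print(lsi_docs)
print(pca_docs)
```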

I raised my hand and asked Trosset why we need to do this transformation. He explained that this is done to convert cosine similarities into Pearson product-moment correlation coefficients.
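One standard way to see the connection (my own toy check, not from the workshop): once two vectors have each been centered by their own mean, the cosine of the angle between them is exactly their Pearson product-moment correlation coefficient.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy feature vectors (e.g., term counts for two documents).
x = np.array([2., 0., 1., 3.])
y = np.array([1., 1., 0., 2.])

# Center each vector by subtracting its own mean, then take the cosine ...
cos_centered = cosine(x - x.mean(), y - y.mean())

# ... and it matches the Pearson product-moment correlation coefficient.
pearson = np.corrcoef(x, y)[0, 1]

print(cos_centered, pearson)   # the two numbers agree
```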

There you go!

I have written a comprehensive Summary of the Document Space Workshop for my good friend Mike Grehan. I also have written a short summary of the workshop for SEW Forums, which includes some great moments. To access it, simply go to forums.searchenginewatch.com and do a forum search for the IPAM workshop.

If you are an IR scientist, these summaries are a goldmine. They include presentations and descriptions of cutting-edge research conducted at Yahoo!, Google, the Naval Surface Warfare Center, and top universities like Princeton, Yale, Johns Hopkins, and other fine schools.

Some presentations included applied research using LSI. For instance, Prof. Michael Berry (U of Tennessee) presented on “Text Mining Approaches for Email Surveillance”. Prof. Berry, who, along with Susan Dumais and Tom Landauer, is considered one of the pioneers of LSI, described how they applied LSI to the now-famous Enron Corpus to conduct email forensics. Amazing research.

David Marchette from the Naval Surface Warfare Center presented on “How Document Space is like an Elephant?”. He briefly discussed the so-called Dimensionality Reduction Curse of LSI and SVD (singular value decomposition).
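For readers who have not run into this issue: as I understood the point, truncated SVD forces you to choose how many dimensions k to keep, and the right k is rarely obvious. A rough numpy sketch (mine, not from the talk) of how the reconstruction error of a toy matrix shrinks as k grows:

```python
import numpy as np

# Toy term-document matrix of random counts (terms x documents).
A = np.random.default_rng(0).poisson(1.0, size=(50, 20)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: keep only the k largest singular values/vectors.
for k in (1, 2, 5, 10, 20):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.linalg.norm(A - A_k) / np.linalg.norm(A)   # relative Frobenius error
    print(f"k={k:2d}  relative error={err:.3f}")
```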

From the search engine side, Michael W. Mahoney from Yahoo! presented on “Data-driven Dictionary Definition for Diverse Document Domains.” Mahoney reviewed SVD and introduced new advances in SVD and matrix selection. He mentioned that term-document data, recommendation system data, individual-gene data, and temporal image data can all be represented as data matrices. He focused on applying SVD and data matrix analysis to dictionary mapping and described a low-rank matrix decomposition expressed in terms of a small number of actual rows and actual columns of the original matrix.
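A decomposition built from actual rows and columns is usually written A ≈ CUR, where C holds sampled columns of A and R holds sampled rows. Here is a simplified sketch (my own; Mahoney's work uses more refined, leverage-score-based sampling probabilities, while this toy version samples by squared norms):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(40, 30)).astype(float)   # toy term-document matrix

c = r = 10   # number of actual columns and rows to keep

# Sample columns/rows with probability proportional to their squared norms.
col_p = (A ** 2).sum(axis=0) / (A ** 2).sum()
row_p = (A ** 2).sum(axis=1) / (A ** 2).sum()
cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)

C = A[:, cols]               # actual columns of A
R = A[rows, :]               # actual rows of A
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

A_cur = C @ U @ R            # low-rank approximation built from real rows/columns
print(np.linalg.norm(A - A_cur) / np.linalg.norm(A))   # relative error
```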

To access the presentation files (html, pdf, and ppt), visit the IPAM site at ipam.ucla.edu and search for the Document Space Workshop. If you cannot find these, contact me and I will point you to alternate sources. As a last resort, I might be able to provide you with some zipped copies.

This is a legacy post, originally published on 7/20/2006.
