As we mentioned in IR Watch – The Newsletter (got a free subscription?), although LSI (LSA) itself is not first-order co-occurrence (see Prof. Tom Landauer's Introduction to Latent Semantic Analysis), a 2005 thesis by Regis Newo suggests that higher-order co-occurrence may be at the heart of LSI and may be what makes the technique work. The abstract of the thesis, Understanding LSI via the Truncated Term-Term Matrix, states:

“In this thesis, we study the relation between Latent Semantic Indexing (LSI) and the co-occurrence of terms in collections. LSI is a method for automatic indexing and retrieval, which is based on the vector space model and which represents the documents and computes the relevance scores in a reduced, topic-related space. For our study, we view LSI as a document expansion method, i.e. for a pair of terms, the occurrence of one of them in a document increases or decreases the importance of the other term for the document, depending on the respective entry in the expansion matrix. We study the relation between the expansion matrix and the order of co-occurrence of the pairs of terms in collections. We find out that the entries of the expansion matrix are influenced by the degree of co-occurrence of the pairs of terms. We then show that the retrieval performance of LSI for the optimal choice of parameters can be obtained when the expansion matrix used is a simple linear combination of the first and the second order co-occurrences.”
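To make the "document expansion" view concrete, here is a minimal numpy sketch. The toy term-document matrix, the rank k, and the choice of U_k U_k^T as the term-term expansion matrix are our illustrative assumptions for this post, not definitions taken from the thesis itself:

```python
# Sketch of LSI as document expansion (assumptions: toy counts, k chosen by hand).
import numpy as np

# Toy term-document matrix A: 5 terms x 4 documents, raw term counts.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
], dtype=float)

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]

# One common form of the truncated term-term ("expansion") matrix: T = U_k U_k^T.
# Applied to a document's term vector, its off-diagonal entries increase or
# decrease the importance of one term given the occurrence of another --
# exactly the behavior the abstract describes.
T = Uk @ Uk.T

# Expanding document 0 (first column of A):
d0_expanded = T @ A[:, 0]
print(np.round(T, 3))
print(np.round(d0_expanded, 3))
```

Since the columns of U_k are orthonormal, T is a projection onto the k-dimensional latent term space; ranking expanded documents against a query in the original space reproduces the usual LSI scores.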

Fascinating defense.

Findings: low-order co-occurrence (first and second order) is what makes LSI work. However, Newo's thesis defines the entries of the term-document matrix A using a simple term-count weighting scheme (term weights as raw word counts). It remains to be seen whether the result holds for more complete weighting models that incorporate global and entropy weights.
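For readers who want to see what first- and second-order co-occurrence look like on a term-document matrix, here is a hedged numpy sketch. The binary definitions and the mixing weights alpha and beta are our illustrative choices, not necessarily the exact formulation in the thesis:

```python
# First- vs second-order term co-occurrence (illustrative definitions only).
import numpy as np

# Same toy term-document count matrix as above: 5 terms x 4 documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
], dtype=float)

B = (A > 0).astype(float)   # binary term-document incidence

# First order: terms i and j appear together in at least one document.
C1 = B @ B.T
np.fill_diagonal(C1, 0)

# Second order: terms i and j each co-occur with a common intermediate term.
G = (C1 > 0).astype(float)
C2 = G @ G
np.fill_diagonal(C2, 0)

# Thesis finding, paraphrased: for a good choice of parameters the expansion
# matrix behaves like a simple linear combination of the two orders.
# (alpha and beta are illustrative names, not values from the thesis.)
alpha, beta = 1.0, 0.5
approx = alpha * C1 + beta * C2
print(np.round(approx, 2))
```

Note how terms 0 and 4 never share a document (C1 entry is zero) yet are linked at second order through term 3, which is the kind of transitive relation LSI is often credited with capturing.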

In a future post we will elaborate on this thesis.

This is a legacy post originally published on 9/6/2006