How to Populate a Matrix for SVD
SVD has been applied to different scenarios like IR, Economy, Computational Chemistry, BioComputation, and other scenarios. In all these cases, one must pay attention to how one populates the initial matrix to be “SVDied”.
The folks at igvita.com have a great discussion on an SVD Recommendation System in Ruby, complete with a simple example and nice code lines. I stopped by to chime in some thoughts:
Quote starts here.
“The poster that goes by the name of Genghis has raised two good questions:
1. how to discriminate initial vector constituents.
2. why cosine angles and not distances are used.
I hope the following helps:
Addressing this from the IR side, early LSI papers describe SVD wherein initial term-doc matrix entries (aij) were defined as local weights (Lij)
aij = Lij = fij
where f is the frequency (occurrence) of term i in doc j.
This assumes that aij values are independent of one another and not limited by an upper bound.
Authors of same early LSI papers soon recognized that performance is improved by incorporating global (Gi) and normalization (Nj) weights:
aij = Lij Gi Nj
wherein Gi is the weight of term i across all docs. Gi can be defined in terms of inverse frequencies or entropies. Nj is a normalization weight for doc j, usually set to 1.
Lij, on the other hand can be constrained to a 0 to 1 interval using augmented logarithms or normalized augmented scales. Using raw counts for Lij implies that local weights are defined using a scale without an upper bound.
Lij, Gi, and Nj have been described using dozen of schemes. Since Lij, Gi, and Nj can be defined to run within the 0 to 1 scale, aij adopt values between 0 and 1.
There are several advantages of defining global weights as entropies than as mere inverse frequencies, but that is out of the scope of this post. Many -including Berry, Dumais, and others- have used entropy for global weights.
As given in the example above, initial entries were given without considering global and normalization weights. These are also given without an upper bound.
The problem with using local weights is that the relative importance of a given object (or feature) against all other objects of the collection are taken out of the picture.
At first, not incorporating global weights difficults discerning, eg. between something like Obj(5,5,0,0,0,4), Obj(1,1,0,0,0,1), Obj(5,5,0,0,0,5) or even something like Obj(x, x, 0, 0, x) -where x can be any number that sweets your heart-, since
(a) we are not considering the relative importance of vector constituents (entries defining a vector) across the entire collection at hand.
(b) vector constituents don’t have an upper bound.
Regarding Genghis second question: why cosine is used instead of distances, two good reasons (there are many) are:
(a) a cosine angle when used as a similarity measure actually reduces to a correlation coefficient.
(b) If one still want to map similarity to distances this can be done. Once similarities are known these can be converted into distances. However, while a distance can be converted into a similarity measure, the reverse is not that obvious because of the triangular inequality which must be satisfied by a distance metric.
BTW, when we explain SVD to students we often use aij = Lij, just to illustrate the basics of SVD and simplify classroom presentation. Then students are introduced to advanced term weight schemes for populating a term-doc matrix to be “SVDied”.
At least in IR, the framework
aij = Lij Gi Nj
is the one used with real SVD applications.
I hope this helps to clear things up.”
End of the Quote.
Note I used the conditional statement “At least in IR…” since there might be specific cases, like with SVD in computational chemistry, wherein one is forced to settle for a simpler representation of weights.
At least in computational chemistry one uses vector space models, SVD, similarity measures, and other concepts from ancilliary fields, but accommodating these to the underlying chemical significance under consideration. This means that a chemist must look at the chemical principles involved and not just at the numerical analysis. In this sense, all these ancilliary techniques from IR acquire some sort of useful meaning and “life”.
June 17, 2007 at 1:11 pm
Thanks, xshirase, for your comments.
I’ll be happy to know more about your method for implementing c-indices.
Regarding your questions, co-occurrence estimates from c-index calculations are conducted from answer sets and does not require the construction and decomposition of a term-document matrix. In contrast, estimating SVD-based co-occurrence requires access to a collection of documents and the construction and decomposition of a term-document matrix.
In SVD, all entries of the term-document matrix influence one another. Thus, any change in a single term affects the co-occurrence phenomenon across all terms. Depending on the vocabulary usage across the collection, some terms can be impacted more than others.
The computational effort between the two approaches is quite different in both time and operational costs.
Last but not least, unlike with SVD-based co-occurrence, co-occurrence based on c-indices does not require access to the entire collection, but to answer sets of this; thus, it is a client side approach to the problem, meaning that end users can use it for data mining answer sets from systems.
Note that to implement SVD one needs to actually interact with the IR architecture at hand and have access to the collection. It is not a client-side thing like c-indices and results do depend on the number of optimum dimensions retained.
You might want to read a paper that soon will be out on the problem between co-occurrence and co-retrieval (terms easily mistaken for one another).
June 17, 2007 at 2:06 pm
Thanks for this so fast answer!
I use a basic method for c-indices based on your on-topic tutorial,
First , I’ve built a table and filled it with keywords related with my topics of interests (intelligence, computer science, future, etc…).
I’ve developped next a method to retrieve the number of results of a google search query in findall mode. (the hardest part was bypassing the limitations of google for those who do not use their api, a problem I’ve solved by proxying queries to a list of websites that have a searchbar… dirty but it works)
At this point, I get the “google number” for word1, word2, word1+word2 . Let’s call them n1, n2, n12 like in your article. The next line is the computation of the indice based on the formula you provide : c12 = n12/(n1+n2-n12) (I added a factor 1000 for easier reading).
Again with your formula, I decided it could be fun to compute the distance between terms : d12= | log10(c12) | where the absolute enables me to compare easily distances by putting words on an axis with word1 for origin.
c12 and d are stored in my database with the pair concerned and I just have to do a simple sql sorted query to get the closest terms associated with one word.
I know it’s quite basic, and I’m reading a lot to improve this, but I’m strongly motivated by the results I get (far more coherent than with the fancy methods I used before reading your paper ).
My problem now is for the system to be able to generate queries that may provide relevant results, and then analyze these results and select those which fit the most. As I do it for fun and don’t plan any public release except maybe for my teachers, time and ressource cost does not really matter.
Again, thanks for all your work.
June 17, 2007 at 5:42 pm
You are thinking right. That’s a reasonable implementation. I’ve found that even when c-indices are based on search result counts claimed by a search engine and that not all counts equal to co-occurrence (some noise is inherent in the counts because of a co-retrieval phenomenon), there is an underlying confidence level one is tapping on when computing these ratios.
Thus, c-index ratios are just estimates of the probability (or likelihood) of terms co-occurring, and this can shed some information about the IR implementation and architecture at hand (the target system).
It is like if there is a hidden semantic discourse answer sets are “telling” or “shouting” to the end user. This is similar to what is known in IR as the Cluster Hypothesis.
Our current research addresses cases wherein c-indices can be estimates as both co-occurrence and co-retrieval. You might want to stay tuned on this since we have new information on the whole concept of c-indices and further refinements in the area of keyword-brand associations.
February 13, 2008 at 10:19 am
[...] How Many Dimensions to Keep? In How to Populate a Matrix for SVD I referred readers to Igvita’s great blog posts on SVD. A recent visit to the blog shows it [...]