In How to Populate a Matrix for SVD I referred readers to Igvita’s great blog posts on SVD. A recent visit to the blog shows it is still very much alive and equally interesting. The issues being discussed are not really new, though.
When we lecture on SVD, an issue that sooner or later arises is how many dimensions k to keep. A recent visitor to the aforementioned blog finally raised the same question.
Can you pls give me a clue as to how we decide how many dimensions to project our data onto when using SVD?
How many dimensions to keep is the so-called Rank k Approximation problem, which often leads to the dreaded dimensionality reduction curse, in which performance can be compromised.
In the Latest SEO Incoherences (LSI) post we mentioned that this issue was addressed many times by Dr. Susan Dumais throughout her first papers and talks on LSI. In that post we referred readers to Dumais’s talk Transcription of the Application presentation by Susan Dumais, Bellcore (now at Microsoft). That talk is now a classic in the history of LSI.
In those days Dumais’s approach was simply “by seat of the pants“:
Let me end, as my time is running out, with some of the statistical issues that we have encountered and that I hope you have some hints about. The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it doesn’t work. You know when you have too few dimensions. We would like some better methods for doing this, things like the scree test don’t seem to correspond very well to behavioral data that we have.
Later, during the Q&A session, participants revisited this issue. Let us reproduce the participants’ exchange with Dumais:
PARTICIPANT: Thank you, Susan. Questions from the floor?
PARTICIPANT: I’m a little nervous that if someone was browsing the Web and we hoped to put some of this material in the Web, that we’re in trouble. We’re talking about seat of the pants and underwear models, that people are going to get the wrong context for why we’re here. But that is part of the big problem that Susan is talking about.
PARTICIPANT: I thought I would just mention an entirely different approach to this problem, with Joe (word lost) at EDS. What we’re doing is –
PARTICIPANT: Can you get to a mike?
PARTICIPANT: We are using a Poisson model for the word counts. Then we’re interested in finding maximum likelihood estimates for the clustering, and we found various combinations of simulated annealing and Markov chain Monte Carlo to work very well with finding these things.
One of the nice things in a model based approach is that you get natural measures of association rather than just SVD types of things, although it could be slower.
PARTICIPANT: I think one thing we will try to ask everyone after the conference is to send us electronically two or three references of relevant work that we can disseminate in this way, because we do hope to learn about new approaches and new methods and related work. So keep that in mind as the discussion progresses. We will send out E-mail requesting those in electronic form.
PARTICIPANT: (Comments off mike.)
DR. DUMAIS: It is first of all not clear that the 300 or 400 dimensions we have used for the TREC databases is optimal. We find that performance is still increasing up to 400 dimensions; it may well increase beyond that.
In fact, I should mention that if you plot performance as a function of number of dimensions, what you get is an inverted U function that is heavily skewed. That is, performance increases dramatically as you go from 20 or 30 up to several hundred dimensions, and then it tails off gradually through the level of performance that you see with raw key word matching, which is the full dimensional solution.
We don’t know that we have reached the peak. In problems where we know what the optimal number of dimensions is, we have found that the peak is not so sharp.
Twenty years later (the first LSI papers appeared in 1988, not in 1990 as some SEOs have incorrectly claimed), many research advances on SVD in relation to LSI have been published. Old IR ideas regarding LSI have been dropped and new ones adopted. That is what research is all about.
Still, how many dimensions to keep remains an open issue and a “by seat of the pants” one. All kinds of guidelines have been tried, but in the end we need to test and retest the system under examination.
I have even tested my own guideline: keep the top k singular values that amount to more than X percent of the trace of the S matrix; where
S is the matrix of singular values.
X is a threshold value, usually 80-85%
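The guideline above is easy to sketch in code. The following is a minimal illustration (not a definitive implementation): given the singular values on the diagonal of S, it returns the smallest k whose cumulative sum reaches the threshold X. The function name and the toy values are mine, not from any particular system.

```python
import numpy as np

def choose_k(singular_values, threshold=0.80):
    """Smallest k whose top singular values account for at least
    `threshold` of the trace of S (the sum of singular values).
    `threshold` is the X in the guideline, usually 0.80-0.85."""
    s = np.asarray(singular_values, dtype=float)
    cumulative = np.cumsum(s) / s.sum()   # running fraction of the trace
    return int(np.searchsorted(cumulative, threshold) + 1)

# Toy example: the top 2 of these 4 singular values cover
# (10 + 6) / 20 = 80% of the trace, so k = 2 at X = 0.80.
print(choose_k([10.0, 6.0, 3.0, 1.0], threshold=0.80))  # -> 2
```

Singular values come out of the SVD already sorted in decreasing order, which is what makes the cumulative-sum shortcut valid.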
But, again, some would ask: why 80%? Why not 90%, 70%, 60%, etc?
While the above guideline works for many systems, I have stumbled on some systems for which the above threshold is not good. So I always come back to “find X experimentally or by seat of the pants”.
We could treat this as an optimization problem and use Nelder-Mead multivariate sequential simplex optimization, but I haven’t tried this yet. I’m not sure if this is the way to go either, but it might be worth testing.
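To make the Nelder-Mead idea concrete, here is a hedged sketch. The `performance` function below is a synthetic stand-in that I made up for illustration: it mimics the skewed inverted-U curve Dumais describes, peaking at an assumed X of 0.85. In a real system it would rebuild the rank-k approximation at that X and measure retrieval performance, which is expensive.

```python
from scipy.optimize import minimize

def performance(X):
    # Hypothetical proxy for "evaluate the system at threshold X".
    # A real objective would run retrieval tests; this toy quadratic
    # simply peaks at X = 0.85 (an assumption for the demo).
    return -(X - 0.85) ** 2

# Nelder-Mead needs no gradients, which suits a noisy, black-box,
# evaluate-the-whole-system objective. We minimize the negative
# performance starting from X = 0.5.
result = minimize(lambda x: -performance(x[0]), x0=[0.5],
                  method="Nelder-Mead", options={"xatol": 1e-4})
best_X = result.x[0]
print(round(best_X, 2))  # close to the assumed optimum of 0.85
```

The appeal of a derivative-free method here is that each evaluation of the objective is a full test of the system, so anything requiring gradients would be impractical.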
Another idea is to iteratively update-test-update-test the matrix using any of the current SVD updating methods for several X values. I need to set aside some time for this one to see what comes out.
I’m also open to suggestions.
For those interested, a 1.0 MB download of Dumais’s 1995 presentation is available. If you have problems downloading it, let me know. I can send you a zip file.
April Kontostathis, of Ursinus College, proposes an interesting approach to aspects of this problem in Essential Dimensions of Latent Semantic Indexing (LSI). She illustrates her approach with a model in which term weights are computed using the well-known base-2 log model for local weights combined with the entropy model for global weights.
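For readers unfamiliar with that weighting scheme, here is a short sketch of the standard log-entropy combination (my illustration of the general scheme, not Kontostathis’s code): the local weight is log2(1 + tf) and the global weight is 1 + Σ_j p_ij log2(p_ij) / log2(n), where p_ij = tf_ij / gf_i and n is the number of documents.

```python
import numpy as np

def log_entropy(tf):
    """Log-entropy weights for a terms-by-documents count matrix.
    Global weight is near 1 for terms concentrated in few documents
    and near 0 for terms spread evenly across the collection."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)      # global frequency per term
    p = np.where(tf > 0, tf / gf, 1.0)      # 1.0 placeholder: log2(1) = 0
    entropy = (p * np.log2(p)).sum(axis=1, keepdims=True) / np.log2(n_docs)
    g = 1.0 + entropy                       # global entropy weight
    return g * np.log2(1.0 + tf)            # local weight times global

counts = np.array([[3, 0, 0],   # concentrated in one doc: g = 1
                   [1, 1, 1]])  # evenly spread: g = 0, weight vanishes
weights = log_entropy(counts)
```

In the toy matrix, the first term gets full weight (log2(4) = 2 in its one document) while the evenly spread second term is zeroed out, which is exactly the discriminating behavior the entropy global weight is meant to provide before the SVD is computed.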
More work is still needed along these lines.