One of the reasons I started the SVD and LSI Tutorial series was to debunk so many myths about latent semantic indexing. These myths come mostly from a given sector of the search engine marketing industry. In the 1800s and 1900s, when new drugs and medicines were discovered, an interesting phenomenon took place in the old wild west: unscrupulous marketers started to sell “amazing potions” and ”miracle syrups”. These “snake oil sellers” are nothing new since each decade has its versions.
So far, the decades of 1990 and 2000 have their own in the form of a large sector from the search engine marketing industry. These modern snake oil seller are easy to recognize by the pattern and trail they leave behind.
1. Their pattern: talk about something, explain nothing.
2. Their trail: combine hot scientific terms with marketing lingo.
Their current stop appears to be LSI and SVD, two scientific acronyms that stand for singular value decomposition and latent semantic indexing. In an attempt to sell whatever they sell, one can see now a crew of search engine marketers reciting these terms in their search marketing conferences, forums, blogs, press releases and any other marketing channel one can think of.
They are all over the place: from US to Russia to India to Mexico, claiming that they sell “LSI-based services” or do “LSI-based website optimization”, “LSI-based link building”, have patented an “LSI tool”, “reversed engineered LSI”, yadda, yadda, yadda.
Such “bottles of snake oil” are easy to sell to ignorants, especially when clients still have hard time figuring out what SEO or HTML stand for.
Their pattern -talk about something, explain nothing- is obvious. When not quoting each other hearsays, they paraphrase or copy/paste lines from the LSI literature, speculate about how search engines might implement LSI or even come up with their own theories.
Their trail – combine science with marketing lingo- is obvious, too. They will put online any argument that will help them to sell something. Quite convenient, hugh.
Still, you will not find these unscrupulous marketers any time soon providing how-to stepwise calculations explaining how LSI or SVD works.
On Deceiving Clients
Deception is also part of their formula, when you think thoroughly.
Consider two lists of keyword terms. One list is compiled by a search engineer and constructed from a truncated matrix A of unique terms and documents. A is obtained from a large corpus or document collection. The engineer SVD A, in order to run an LSI algorithm and to obtain a list of similarity-inducing terms. The list is then used as a list of suggested terms.
Now consider another list. This time the list is the result of either consulting a thesaurus, an annotated reference list of concepts, running a keyword suggestion tool, or even the result of in-house brainstorming.
Next, present both lists to a human or oracle and ask him to determine their origins. Which one comes from LSI?
Exactly. That is the point. Both lists would be indistinguishable, for the purpose of answering that question. You can now imagine possible marketing applications for this.
So the argument from a search engine optimization firm (SEO), from US, India or elsewhere that they sell LSI-based keyword research lists or that they can fine-tune a web page using their LSI-based “patented” software must be taken with a grain of salt, albeit that Telcordia, Inc owns the LSI patent.
And how about these claims: “We collect a sample of documents from the Web and SVD these to optimize your document with our LSI patented technology”, “We provide LSI-based link popularity services”, “Our reverse-engineered LSI software” bla…bla…bla.”
“LSI-based” SEO Services… Really?
Let us revisit these claims. First let do some history and go back to
Why going that far? Because her presentation has many issues that still hold relevant and valid today, aside that new issues have been discovered. Dumais, one of the inventors of LSI wrote (emphasis added):
“Let me give you some characteristics of the data sets that we have looked at. Let start with one that we started with about five or six years ago now. It has 1,033 documents, 5800 unique words. A couple of things to notice about it. It has less than one percent non-zero entries. We can compute the hundred largest singular values and vectors in about two minutes on a standard spark work station, using iterative sparse matrix methods. About five years ago, it took us 48 hours on somewhat slower machines, using standard dense methods.”
“The computational complexity however increases as the data size increases. We currently can handle databases of about 100,000 documents on standard work stations, albeit with reasonable amounts of core memory. For larger applications — and in fact, these are the ones of most interest, we have had to resort to sub-sampling. Our limits involve memory IO as well as CPU to a lesser extent.”
“This trek collection is a collection that is made available by NIST and ARPA to explore information retrieval with large data sets. This is the kind of collection we would like to be able to handle directly, rather than by sampling. It represents about three gigabytes of ASCII text. As you can see, there are 750,000 documents and 500,000 terms.”
“I should mention also that the number of terms here are the number of terms that occur in more than one document. We have omitted from the SVD terms that occur in exactly one document. You can about double this number if you include those terms.”
“The other thing is that the CPU times are for extracting the hundred largest singular values and vectors. For many of these problems, we have found that we need to increase this to compute 300 or 400 dimensions.”
“This increases the computational time by about a factor of ten. So to compute the SVD of 70,000 by 87,000 document matrix takes us about 24 hours on a standard spark board station. This is not unreasonable, because this is largely a one-shot process.”
“Let me end, as my time is running out, with some of the statistical issues that we have encountered and that I hope you have some hints about. The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it does not work. You know when you have too few dimensions. We would like some better methods for doing this, things like the scree test do not seem to correspond very well to behavioral data that we have.”
“We would also like to be able to do larger SVDs faster. We are beginning to face issues about how to update SVDs, especially for changing databases. We can compute the SVD-1s for something like your library card catalog. For databases like running APU wire, that change very frequently, we have issues of either recomputing the SVD to maintain orthogonality, or updating it efficiently.”
“We are interested in what is probably not a statistical issue, but one that is of great practical interest to us. That is finding neighbors in high dimensional spaces. Our queries are vectors in this 300 space, for example. Right now, in order to find near neighbors, the nearest documents, we as psychologists try the brute force method. We compute the similarities to all 700,000 documents, sort and rank.”
In the QA session, she answered almost at the end:
“In fact, I should mention that if you plot performance as a function of number of dimensions, what you get is an inverted U function that is heavily skewed. That is, performance increases dramatically as you go from 20 or 30 up to several hundred dimensions, and then it tails off gradually through the level of performance that you see with raw key word matching, which is the full dimensional solution.”
“We do not know that we have reached the peak. In problems where we know what the optimal number of dimensions is, we have found that the peak is not so sharp.”
End of the quotes.
No doubt that many of those 1995 computational issues have been resolved. Back then at Bellcore, (now Telcordia, Inc) Dumais et. al. used the largest 100+ singular values (k=100+). This amounted to a truncated matrix where 100+ dimensions were retained. Selection of k values is still done largely arbitrarily.
Let me underscore this. Setting the top k values leads to different truncated matrices, thus, different results, not to mention that some Joe trying to SVD few docs from the Web and a search engine “SVD-ing” a large corpus or document collection will end up with different results, anyway.
In LSI, a query is considered another document that must be compared against all other documents. Thus, as Grossman and Frieder pointed out in their book, (Information Retrieval, Algorithms and Heuristics, p. 73): with LSI an inverted index is not possible. One would need to combine SVD with ancilliary techniques to work around this (e.g., see p.129).
In addition, elements of the original term-document matrix (A) are not necessarily term frequencies as many SEOs think. Indeed, most current implementations define these elements using a specific term weight scoring model in which local, global and normalization weights are used, not just mere term frequencies. Note that access to a corpus would be required to compute global weights. To top off, in the LSI literature one can even see term-matrix elements being the result of a term weighting scheme in which entropy values must be computed.
Even more, consider a search engine with million of documents receiving zillion of queries everyday. Compare this with few documents and few queries or whatever an SEO “LSI tool” tries to measure. Whaever these tools measure is just a bad caricature of what the search engine scores. Note also -as Dumais mentioned- that the entire thing largely is “a one-shot process”. Imagine the computational power a little mom and pop firm would require to emulate a search engine implementation like Yahoo! or Google.
In addition, note that the truncated matrix and the discovery of the similarity-inducing terms depend on the k singular values retained. As mentioned before this is more or less an arbitrary selection process. In a typical implementation about few dozen to several hundreds of k values might be involved. To mimic the LSI implementation of a target system a search marketer would need to know the exact k values used and which documents were sampled.
Let me quote an interesting portion of Dumais presentation:
“The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it does not work. You know when you have too few dimensions.”
Anyone can take the first n ranked documents from a query, linearize these to construct a matrix of unique m terms and n documents, SVD these and come up with some suggested terms relevant to the initial query. One could also take a list of links and anchor text, construct a term-link matrix, SVD these with a free online SVD calculator, and then extract terms and claim these are “LSI-based” results.
With all, I challenge any search engine marketer that claim to know when they have too few or too many dimensions in their quest of attempting to mimic the LSI implementation of any given search engine at any given time.
How about few other myths from these unethical marketers? Whether the unique terms of a matrix are nouns, verbs, plurals, come from a specific url domain or portion of a document, are related or not by stems (word roots), plays no role when doing the SVD. In SVD, terms are treated as mere unique tokens and strictly speaking almost everything is reduced to linear algebra operations.
At this point I got tired of highlighting more flaws in the claims of these search marketing firms. A sample list of the latest LSI myths is available for your perusal.
Next stop for these snakeoil marketers?
How about PCA (Principal Component Analysis) or LDA (Latent Dirichlet Allocation)?
This is a legacy post originally published in 8/25/2006