If you are a subscriber, the current issue of IRWatch – The Newsletter should arrive in your inbox during the day. If not, let me know.
The piece is a summary of my March 8 and 9, 2007 presentation at the OJOBuscador Congress 2 in Madrid, Spain. My topic was Demystifying LSI for SEOs.
The material presented at the conference has been heavily edited and adapted to the newsletter. It is presented using figures and side-by-side comparisons of myths and facts. This time I wanted to skip the math and lengthy explanations.
I assume readers are familiar with the tutorial series and fast tracks on SVD and LSI:
Regarding LSI SEO Myths, well …
There is No such Thing as “LSI-Friendly” Documents
I often state that there is no such thing as “LSI-Friendly” documents. Why? The main reason is just common sense. In LSI a term-doc matrix is populated with term weights according to a scoring scheme of the form
aij = Lij*Gi*Nj

where

aij = the weight of term i in doc j
Lij = a local weight; this can be given in terms of a scale of frequencies. The scale can be logarithmic, log-normalized, or based on plain occurrences of terms (raw word counts). The latter is a poor performer and most likely not used by any current search engine, though it was used in the early LSI papers that SEOs like to misquote.
Gi = a global weight; it can be given as IDF, probabilistic IDF, entropy values, etc.
Nj = a normalization factor, often set to Nj = 1, but not necessarily, as in document-pivoted normalization.
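As a rough sketch of how such a scheme is computed, here is the popular log-entropy variant (log local weight, entropy global weight, Nj = 1) applied to a small hypothetical term-count matrix. The matrix values are made up for illustration only:

```python
import numpy as np

# Hypothetical 4-term x 3-doc matrix of raw term counts (made-up values).
counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 1],
    [0, 0, 4],
], dtype=float)

n_docs = counts.shape[1]

# Local weight Lij: log-scaled frequency, log(1 + fij).
L = np.log1p(counts)

# Global weight Gi: 1 + sum_j(pij * log pij) / log(n), the entropy weight,
# where pij is the fraction of term i's occurrences falling in doc j.
p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
G = 1.0 + plogp.sum(axis=1) / np.log(n_docs)

# Normalization Nj: set to 1 here (no document normalization).
N = np.ones(n_docs)

A = L * G[:, None] * N[None, :]   # aij = Lij * Gi * Nj
print(A.round(3))
```

Note how a term concentrated in a single document (rows 2 and 4 above) earns the maximum global weight of 1, while a term spread evenly across documents is down-weighted.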
The term-doc matrix is then decomposed and reduced to k dimensions. How many dimensions to keep after SVD must be determined experimentally. The value can be in the hundreds and certainly affects LSI results and performance. End users like SEOs are most unlikely to know how many dimensions a search engine implementing LSI uses when applying SVD to a given term-doc matrix. This is important, since different k values lead to different SVD results (doc ranking results, doc and term clusters, term weights, etc.).
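The dependence on k can be seen in a few lines of code. The sketch below uses a random matrix as a stand-in for a weighted term-doc matrix (the values are arbitrary) and compares rank-k reconstructions for two different k:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an 8-term x 6-doc matrix of term weights (arbitrary values).
A = rng.random((8, 6))

def svd_rank_k(A, k):
    """Reconstruct A keeping only the top-k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A2 = svd_rank_k(A, 2)
A4 = svd_rank_k(A, 4)

# Different k values yield different reconstructed weights, and hence
# different rankings, clusters, and similarity scores.
print(np.abs(A2 - A4).max())
```

Without knowing k, an outsider cannot reproduce, let alone predict, the weights a given LSI system assigns.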
Before decomposition, the term-doc matrix is sparse.
A sparse matrix is any matrix in which many entries are empty, in which case a value of zero is normally assigned to those entries. Term-doc matrices are usually sparse, since not every term occurs in every document. There is nothing mysterious or unique about sparse matrices, despite what some SEOs claim. Some have recently claimed that “sparse matrix algorithms are here to stay”. Evidently these folks don’t know what they are talking about: sparse matrices have been around since the invention of matrix algebra.
Anyway, after decomposition and reconstruction with SVD, a sparse term-doc matrix becomes dense. Looking at the reconstructed matrix reveals that LSI redistributes term weights, because of the way the SVD algorithm works. This means that a single change to any term causes a redistribution of term weights across the entire matrix representing the collection of docs.
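Both effects are easy to demonstrate. The sketch below builds a small hypothetical sparse weight matrix (values invented for illustration), reconstructs it at rank 2, and then perturbs a single term weight in a single document:

```python
import numpy as np

def svd_rank_k(A, k):
    """Reconstruct A keeping only the top-k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical sparse 5-term x 4-doc matrix of term weights (made-up values).
A = np.array([
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 3.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 2.0, 0.0],
])

Ak = svd_rank_k(A, 2)
# The reconstruction fills in formerly-zero entries: the matrix turns dense.
print((np.abs(Ak) > 1e-9).mean())   # fraction of nonzero entries

# Change one term weight in one document...
B = A.copy()
B[0, 0] += 1.0
Bk = svd_rank_k(B, 2)
# ...and weights shift in documents that were never touched.
print(np.abs(Bk - Ak)[:, 1:].max())
```

The second printed value is nonzero: editing one document moved reconstructed weights in every other document of the toy collection.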
There is no way for SEOs sitting behind a query box to predict such a redistribution of weights. First, they would need access to the entire collection of docs. Second, they would need to know how other documents across the collection were changed by their owners. Third, they would need to know how many k dimensions were used. And they would need to know all of this at any given point in time.
Common sense tells us this is most likely an impossible task. And we still haven’t mentioned that docs are constantly added to or purged from the index. Thus, SEOs cannot manipulate LSI with their trickery, or even with an alleged list of imaginary “LSI-valid” terms or phrases, to game the system or create “LSI-optimized” docs.
Such lists of “approved LSI terms” are just fabrications from those trying to sell something or to promote their own image as SEO “experts”. Most of these folks don’t even know how to SVD a simple term-doc matrix, anyway, and just misquote the work of LSI researchers. Others simply repeat, like parrots, what their peers claim at Search Engine Strategies (SES) conferences, forums, or blogs.
Even assuming that an “LSI list of terms” could be created from an SVD output, once identified those terms would be used in new docs, and these would be indexed. Term weights across the SVD matrix would then need to be recomputed, and you would end up with a chicken-and-egg problem. To overcome this problem, folding-in methods have been tried with small, controlled collections. With huge collections in the billions of docs, like Google’s or Yahoo’s, this won’t help because of degradation in the reduced space, which compromises orthogonality.
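For readers curious what folding-in looks like, here is a minimal sketch, again using an arbitrary stand-in matrix. A new document’s term-weight vector d is projected into the existing k-dimensional space as d^T · Uk · inv(Sk), without recomputing the SVD. Since Uk and Sk are never updated, repeated folding-in lets the representation drift, which is exactly the degradation mentioned above:

```python
import numpy as np

def truncated_svd(A, k):
    """Return the top-k singular factors of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Stand-in for an existing 6-term x 4-doc weight matrix (arbitrary values).
A = np.random.default_rng(1).random((6, 4))
U, s, Vt = truncated_svd(A, 2)

# Folding-in: coordinates of a new doc in the reduced space,
#   d_hat = d^T * Uk * inv(Sk)
d_new = np.array([1.0, 0.0, 2.0, 0.0, 1.0, 0.0])
d_hat = d_new @ U / s

# The basis (U, s) is NOT recomputed, so every folded-in doc is forced
# into the old space; as docs accumulate, orthogonality degrades.
print(d_hat)
```

A sanity check: folding in a document that is already in the collection reproduces that document’s existing coordinates (its column of Vt), since the projection is exact for docs the SVD has already seen.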
As for marketing firms deceiving the public and prospective clients by selling or promoting “LSI-based” services, let’s hope that one day these unethical marketers are slapped with a class-action consumer fraud lawsuit. If you believe you have been deceived by such firms, contact the FTC and the BBB. If you need some coaching or advice, let me know.
What many SEOs are realizing about LSI
For those who think I’m the only one bashing SEO LSI myths, check the following links for a second opinion. Some of these are industry leaders whom I helped learn the truth about LSI myths and these snake-oil sellers. They are now helping others within the search marketing industry.
Lies, Lies, and LSI by Mike Grehan
Personalization Through Tracking Triplets of Users, Queries, and Web Pages
InfoSearch Media & ContentLogic – Purveyors of Falsehoods
5 Myths about SEO
The History of Latent Semantic Indexing
Web content and LSI mega-rant. Part Two…