In “A Complete Glossary of Essential SEO Jargon”, an SEOMOZ poster defines LSI as follows:
“LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document. SEOs refer to these same groups of words as “Long Tail”. The majority of searches consist of three or more words strung together. See also “long tail”. The significance is that it might be almost impossible to rank well for “mortgage”, but fairly easy to rank for “second mortgage to finance monster truck team””
I have been asked to comment about this.
To put the post in perspective, a jargon glossary is like a collection of expressions used within a specific group of individuals with similar interests. Normally jargon is not intended for outsiders.
Overall, the post is nice coffee-table reading. The title states that this is a complete glossary of essential SEO jargon. However, it is debatable whether the glossary is complete and whether some of its entries are indeed essential to SEOs.
Within SEO circles, jargon connected to search engine technology often comes with two elements:
(a) oversimplification
(b) misinformation
To the poster’s credit, not all entries of the glossary suffer from (a), (b), or both; some are actually informative. Like some of the comments they generate, some are entertaining.
Unfortunately, the LSI entry comes with both (a) and (b). The last time I revisited the post, the LSI entry had been ignored by commenters. I could have posted these comments there and added content to their blog, but I decided at the last minute to add content to this blog instead.
Now let’s comment on the substantive part.
Firstly, two different concepts are almost conflated by the poster: LSI and the so-called “long tail”. The former is based on SVD, and the latter is an expression that describes a distribution. Research on long tail-shaped distributions is found in Mandelbrot’s early work from the 1950s and 1960s, and even before Mandelbrot. Page 84 of James Gleick’s best-seller, Chaos (1987), also mentions a long tail distribution Mandelbrot came across.
Secondly, LSI is not exactly document indexing, as some may loosely infer from reading the LSI entry and as many SEOs have claimed in the past. LSI is applied to already indexed documents from which terms have been extracted and already scored with a particular term weight model. Thus, before LSI is applied, terms and docs have been identified and indexed. Using LSI to cluster terms and documents and then reclassifying them is a different thing. Sometimes this is called reindexing and loosely referred to as “indexing” by a few folks.
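To make the order of operations concrete, here is a minimal Python sketch of the step that comes before any LSI: building a weighted term-document matrix from documents that are assumed to be tokenized and indexed already. The toy collection, the function name build_term_doc_matrix, and the weight used (raw term count times log(N/df)) are illustrative assumptions only, not the weighting model of any particular search engine.

```python
import math
from collections import Counter

# Hypothetical toy collection, assumed to be tokenized and indexed already.
docs = {
    "d1": ["second", "mortgage", "rates"],
    "d2": ["mortgage", "rates", "loan"],
    "d3": ["monster", "truck", "team"],
}

def build_term_doc_matrix(docs):
    """Return (terms, doc_ids, matrix); rows are terms, columns are documents."""
    doc_ids = sorted(docs)
    terms = sorted({t for tokens in docs.values() for t in tokens})
    n_docs = len(doc_ids)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    # Raw term counts per document.
    counts = {d: Counter(docs[d]) for d in doc_ids}
    # Assumed weight model: term count * log(N / df).
    matrix = [
        [counts[d][t] * math.log(n_docs / df[t]) for d in doc_ids]
        for t in terms
    ]
    return terms, doc_ids, matrix

terms, doc_ids, A = build_term_doc_matrix(docs)
```

Only once a matrix like A exists can it be handed to SVD, which is the sense in which LSI operates on documents that have already been indexed and weighted.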
The initial statement of the LSI entry is simply sloppy, hearsay, and made out of thin air: “LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document”.
The other problem with this statement is the disservice it does to the casual reader, who might believe such a notion of LSI and repeat it across the Web. Besides, LSI is not essential to SEOs.
September 3, 2007 at 4:40 pm |
It’s amazing how backward SEOs have LSI.
In this Dan Thies blog post, Long Tail Myth & Reality, one poster trying to hang on to the keyword density myth even claims the following, but provides no evidence:
“LSI and long tail are KIND of related so I thought I’d mention it.”
According to whom? This is simply nonsense. The two are completely unrelated. I still don’t understand where these marketers learn such crap.
September 3, 2007 at 8:04 pm |
The only relationship LSI has to the long tail is that it can filter the long tail of noisy singular values embedded within the data matrix of weighted document vectors. This has NOTHING to do with the long tail of keyword densities.
September 3, 2007 at 8:19 pm |
Good to see you stopping by, David.
Indeed, and LSI can do this regardless of whether the distribution is long-tailed. So, in the end, such filtering can hardly be called a relationship; it is an application or use of the SVD algorithm.
September 6, 2007 at 10:22 am |
My stand is that anything that can be related to anything else is good for the common person if it can get results. Just because “Keyword Density” is not how search engines rank sites doesn’t mean it’s not a useful measurement. You can use DNA analysis to find European girls. Or you can look for blondes. My method is not accurate, but there is a correlation that can be measured. And I posted a URL to the data, so why lie about that?
September 6, 2007 at 12:30 pm |
Sure, I can try to find a relation between eating spam and traveling to the Moon and vice versa. And? A relation is not correlation and correlation is not causality.
Unfortunately, throwing around terms like “correlation” and “anything that can be related to anything else” is, I’m afraid, not enough and proves nothing. That’s why we have Statistical Correlation Methods of Analysis courses.
If you have a statistical correlation analysis to prove your point, and enough reproducible experiments, I will be happy to read that.
By contrast, if you are happy with whatever you do with keyword density, calling whatever you do “correlation”, or hanging around the KD myth, good for you.
February 18, 2008 at 4:44 pm |
Would you care to formulate a simplified definition of LSI that would be suitable for the average webmaster?
I actually wrote that post, and I apologize for my ignorance, but you are correct that the community made no mention of this particular term – despite having around 15,000 or so subscribers at the time. This may indicate a general lack of understanding.
I actually started writing that article as a learning aid for myself, and admittedly the definitions were largely an aggregation of common knowledge that I gathered from the SEO blogging community. Clearly, common knowledge is not always accurate.
February 22, 2008 at 12:24 am |
When it comes to information retrieval and search engines, most SEO discussion sites are contaminated cesspools of opinions and ideas made out of thin air, with quick followers doing, well, just that, following others. Your post and the comments that it triggered at SEOMoz are just an example of this. I’m afraid that SEOMoz is not any different from other vanity post repositories. It is a shame.
Regarding LSI:
Today we know more about this than 20 years ago. The first LSI papers were published in 1988. Sticking to old definitions, as many SEOs and even IRs are doing, does no good. It is like trying to use the original PageRank ideas 10 years later (today). I’m sure you would agree with me that a lot of things have changed since then.
When LSI first saw the light, it was speculated that the technique captures latent semantic information and even the way humans process semantics. First, today we know this is not true. LSI captures first- and higher-order co-occurrence information, not latent semantic information. In fact, LSI does not require that the term clusters it finds be synonyms or related terms.
Second, LSI is not an indexing method as many believe (perhaps because of the “indexing” token in its name). Documents are already indexed and a term-document matrix is constructed before any LSI is applied to decompose this matrix. These two points are the ones SEOs do not comprehend.
In LSI, a term-document matrix is decomposed into a term-by-dimension matrix, a singular values matrix, and a document-by-dimension matrix. The idea is to reconstruct a latent term-term structure and a latent document-document structure by choosing how many singular values to keep.
These structures capture co-occurrence, not exactly semantics. The process keeps the most significant dimensions and ignores the least significant ones.
Whatever you have read about LSI from SEOs is, unfortunately, probably crap meant to sell something. That something can be a product, a service, or their own image as “experts”.
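For readers who prefer to see the decomposition rather than read about it, here is a minimal sketch with NumPy. The 5x4 matrix A is a made-up term-document matrix (rows are terms, columns are documents), and the function name truncated_svd and the choice k=2 are assumptions for illustration only.

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

def truncated_svd(A, k):
    """Decompose A and keep only the k largest singular values."""
    # U: term-by-dimension, s: singular values, Vt: dimension-by-document.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

U_k, s_k, Vt_k = truncated_svd(A, k=2)

# Rank-k approximation of the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Term-term and document-document structures in the reduced space.
term_term = U_k @ np.diag(s_k ** 2) @ U_k.T   # equals A_k @ A_k.T
doc_doc = Vt_k.T @ np.diag(s_k ** 2) @ Vt_k   # equals A_k.T @ A_k
```

Raising or lowering k changes how many dimensions survive, and with them the term-term and document-document values. Nothing in the computation checks whether the terms that end up close together are synonyms; it only works with the weights in the matrix.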