SEOs and Their Persistent LSI Misconceptions

I just came across this article

http://seo-and-google.blogspot.com/2007/07/5-tips-to-effective-seo-keyword.html

by Valerie DiCarlo, and I honestly don’t know where these marketers learn all these misconceptions regarding LSI. Perhaps she has been misled as well by the usual suspects and is just trying to make some honest comments she truly believes. Unfortunately, most of hers are incorrect. I’m commenting on her lines, one by one.

“Latent Semantic Indexing (LSI) is a vital element in Search Engine Optimization (SEO) for better keyword rankings in search results.”

No, it is not.

That’s what many SEO firms claiming to sell “LSI-based” services wish prospective clients to believe: repeat a false statement often enough and many will take it as true. Even the SEO companies making such statements don’t really know what LSI is or how it works. On top of that, there is no such thing as an “LSI-friendly” or “LSI-optimized” document, nor can LSI be manipulated to improve rankings. I’m still waiting for an SEO to prove such LSI claims. The challenge has been issued in:

https://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

Many SEOs are now recanting the misleading LSI “explanations” they once told and sold to others, thanks to that challenge or “invitation”.

“LSI is based on the relationship, the “clustering” or positioning, the variations of terms and the iterations of your keyword phrases.”

Simply incorrect. This is not how LSI works and any result one gets from such strategies, if any, has little to do with LSI.

“Expertly knowing LSI and how it can be most useful and beneficial for your SEO and the importance it has with the algorithm updates to search engines like Google, MSN and Yahoo which will benefit your keyword research for best practice SEO.”

Expertly knowing LSI will prevent one from making the above statements.

“Those doing keyword research over the years have always known to use synonyms and “long tail” keyword terms which is a simpler “explanation” to LSI. “

In the early and outdated LSI papers that SEOs often misquote, the role of synonyms was stretched. Today we know that what is at the heart of LSI is a high-order co-occurrence phenomenon taking place across a collection of documents, driving a redistribution of weights across connectivity paths. This phenomenon can be explained in terms of graph theory.

This phenomenon can be present regardless of whether the terms involved are synonyms, and regardless of whether end users resort to a particular writing style. The synonym fallacy can be traced back to the early LSI papers. To learn how and why this synonym fallacy started, read
SVD and LSI Tutorial 5: LSI Keyword Research and Co-Occurrence Theory.

LSI is not word co-citation either. Word co-citation, or first-order co-occurrence, is observed when terms co-occur in the same documents. This type of co-occurrence, and even high-order co-occurrence (in-transit co-occurrence), does not guarantee contextuality, as terms can occur in passages that discuss different topics within a given document.
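The distinction between first-order and in-transit (second-order) co-occurrence can be sketched in a few lines of Python. The terms and documents below are invented for illustration only:

```python
# Toy collection: which documents each term appears in (made-up data).
docs_of = {
    "monkey":  {"d1", "d2"},
    "ape":     {"d2", "d3"},
    "primate": {"d3"},
}

def first_order(t1, t2):
    """Word co-citation: the terms share at least one document."""
    return bool(docs_of[t1] & docs_of[t2])

def second_order(t1, t2):
    """The terms never share a document, but a bridge term co-occurs with both."""
    if first_order(t1, t2):
        return False
    return any(first_order(t1, b) and first_order(b, t2)
               for b in docs_of if b not in (t1, t2))

print(first_order("monkey", "ape"))       # True: both appear in d2
print(first_order("monkey", "primate"))   # False: no shared document
print(second_order("monkey", "primate"))  # True: "ape" bridges them
```

As the post argues, neither kind of co-occurrence by itself guarantees that the terms share a topic.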

Web documents, especially large ones, are prone to be multitopic, and thus terms can co-occur within different topics. Even web documents with a central article come with news stories, newsfeeds, blog feeds, ads, sales pitches, etc. These might undergo updates, links can be changed at will, and so forth. Thus, the mere fact that terms co-occur does not ensure topification.

In such cases term weights based on co-occurrence can be misleading. The key to similarity scores that incorporate co-occurrence is not that terms happen to be synonyms or to co-occur, directly or in transit, but that they co-occur, directly or in transit, among similar neighboring terms and within similar topics. This means that not all connectivity paths in an LSI term-term matrix are equally important, and as such the SVD-truncated, dense term-term matrix must be artificially sparsified to extract useful topics.
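As a rough sketch of that sparsification step, one can zero out the weak entries of a dense term-term matrix below some cutoff. The matrix values and cutoff here are invented, not a prescription:

```python
import numpy as np

# Hypothetical dense term-term similarity matrix, as one might obtain
# from an SVD-truncated reconstruction. All values are made up.
T = np.array([
    [1.00, 0.92, 0.08, 0.15],
    [0.92, 1.00, 0.05, 0.20],
    [0.08, 0.05, 1.00, 0.88],
    [0.15, 0.20, 0.88, 1.00],
])

# Artificially sparsify: keep only the strong connectivity paths.
cutoff = 0.5
T_sparse = np.where(np.abs(T) >= cutoff, T, 0.0)

print(T_sparse)  # two clear 2x2 topic blocks survive the cutoff
```

Dropping the weak paths leaves the block structure, i.e. the topics, visible.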

Unfortunately, some misunderstandings regarding synonyms and co-occurrence in relation to LSI are currently being perpetuated at this SEOMOZ blog. Back to the claims of the above writer. She continues:

“The real bottom line is that Latent Semantic Indexing is currently a MUST in keyword research and SEO.”

Simply another misleading statement, since the last part is false (“…Latent Semantic Indexing is currently a MUST in…SEO”). First, SEOs would need to know what LSI is and how to compute LSI scores.

To sum up, this is just an example of how all these SEOs and marketers claim to know about a subject they actually know little about. It is a consistently distorted and almost sinister way of promoting products and services across the websphere and blogosphere. One can spot these folks from a distance.

9 Comments

  1. Excellent post. I think it was I who first opened the pandora’s box of LSI at SEOMoz. I’d like to correct my faux pas. If I read your tutorial correctly, when I said “synonyms”, I should have said “related terms, although not necessarily synonyms”. Is that correct? LSI can aid you in finding documents about primates, even if those documents do not contain the term “primate”, if they do contain terms “monkey”, “ape”, “chimpanzee” and the corpus contains higher than expected co-occurrence of these terms with “primate”. Is that correct?

  2. Excellent post. I think it was I who first opened the pandora’s box of LSI at SEOMoz.

    Well, LSI has been discussed at SEOmoz for a few years now, one way or the other.

    I’d like to correct my faux pas. If I read your tutorial correctly, when I said “synonyms”, I should have said “related terms, although not necessarily synonyms”. Is that correct?

    Correct.

    LSI can aid you in finding documents about primates, even if those documents do not contain the term “primate”, if they do contain terms “monkey”, “ape”, “chimpanzee” and the corpus contains higher than expected co-occurrence of these terms with “primate”. Is that correct?

    This is a compound statement. The first part where it reads “LSI can aid you in finding…” can be correct. The second part, concerning “higher than expected co-occurrence” is not necessarily correct.

    First, according to Tom Landauer’s group and quote:

    “Typically well over 99% of word-pairs whose similarity is induced never appear together in a paragraph”.

    Source: LSATutorial.pdf

    While LSI (LSA) itself is not co-occurrence, the co-occurrence phenomena are important in LSI studies.

    Indeed, more than a “higher than expected co-occurrence”, the key here is a co-occurrence phenomenon across the entire collection and how specific co-occurrence paths influence each other. One could achieve results similar to those you describe with terms lying along the proper co-occurrence paths, yet not necessarily having higher-than-expected co-occurrence. This is better understood with standard graph theory applied to the Uk matrix extracted from SVDing a term-doc matrix. More on this below.

    About The Synonym Fallacy…

    Early LSI papers (circa 1988) used a primitive term-weight scoring framework wherein the initial term-doc matrix was populated with just raw frequencies (occurrences), neglecting global weights and entropy. The term-term clustering power, and hence the term-discovery strength, of LSI was initially thought to be a reflection of the nature of the terms.

    However, a lot has changed since then. Quoting those papers, some dating back 10, 15, or 20 years, is like trying to use early Google papers to explain PageRank in 2007. I’m sure you would agree with me that things have moved on.

    In those early LSI papers the documents examined with SVD were abstracts, office memos, or just plain titles. These were also of similar lengths and formats, and free from commercial noise (like spam content). None of these “docs” were actually Web pages. In addition, the docs examined were rich in synonyms and related terms, and about similar topics. Obviously the clusters obtained consisted of such kinds of terms. Nothing unexpected or surprising about those results.

    SVD and LSI Tutorial #5 discusses that after SVDing a term-doc matrix one ends up with three new matrices, Uk, Sk, and Vk, where k is the number of singular values, hence dimensions, retained.

    The term-term clustering power of LSI can be inspected using the Uk matrix. Rows of Uk hold term-vector coordinates. Thus, term comparisons, term discovery, and clustering can be conducted by measuring the cosine of the angle between these term vectors when projected into the k-reduced space.

    Once this is done, terms can be clustered by similarity values. This is how terms can be “discovered”. In the example given in the tutorial and taken from Grossman and Frieder’s 2004 book (Information Retrieval: Algorithms and Heuristics) this was illustrated with the figure at http://www.miislita.com/information-retrieval-tutorial/u-matrix.gif

    Obviously for more than three dimensions a visual representation is not possible.
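    For more than three dimensions the comparison is done numerically. A minimal sketch with NumPy, using an invented 4-term by 3-doc matrix of raw frequencies (not the tutorial's data):

```python
import numpy as np

# Hypothetical 4-term x 3-doc matrix of raw occurrences (made-up numbers).
A = np.array([
    [2, 0, 1],   # "monkey"
    [1, 0, 2],   # "ape"
    [0, 3, 0],   # "tax"
    [0, 2, 1],   # "audit"
], dtype=float)

# SVD, then keep k singular values: A is approximated by Uk Sk Vk^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]   # rows = term coordinates in the k-reduced space

def cos(i, j):
    """Cosine between two term vectors (rows of Uk)."""
    a, b = Uk[i], Uk[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cos(0, 1), 3))   # high: "monkey" and "ape" lie on shared paths
print(round(cos(0, 2), 3))   # low: "monkey" and "tax" do not
```

    Note that the two terms that cluster here do so because of their co-occurrence pattern across the collection, not because they are synonyms.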

    Note from the figure that terms end up forming clusters, but none of them are related terms or synonyms. This basic example, even with the initial term-doc matrix populated with raw frequencies, underscores a major misconception drawn from the early LSI papers, which is what I call The Synonym Fallacy: that terms end up clustered in LSI because they happen to maintain a synonymy or relatedness relationship.

    This is the same fallacy SEOs are still perpetuating when “explaining” LSI through a blog or SEO book. Some perpetuate this fallacy by advising others to stuff documents with synonyms and related terms. The extrapolation SEOs are making here, or the message they are trying to send, is that doing so will make documents “LSI-friendly”, and that such “LSI-treated” documents will be ranked higher by a mythical search engine implementing LSI in the organic SERPs. It is laughable now to see some of the very same SEOs recanting their “LSI explanations” or recoiling into desperation, realizing they didn’t know what LSI was after years of claiming they did and advising others.

    When it comes to term clustering and discovery, the nature of the terms is not necessarily the determining factor. In the previous figure, terms are neither synonyms nor related terms. A graph of nodes and edges, with nodes being terms and edges representing n-degree co-occurrence relationships, sheds some light on the term-clustering power of LSI.
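    Such a graph view can be sketched directly. In this toy graph (all terms invented) an edge stands for an observed first-order co-occurrence, and terms cluster when a connectivity path of any order links them:

```python
from collections import deque

# Toy co-occurrence graph: nodes are terms, an edge means the pair
# co-occurs somewhere in the collection (made-up data).
edges = {
    "monkey":  {"ape"},
    "ape":     {"monkey", "primate"},
    "primate": {"ape"},
    "tax":     {"audit"},
    "audit":   {"tax"},
}

def connected(t1, t2):
    """Breadth-first search: is there any co-occurrence path between two terms?"""
    seen, queue = {t1}, deque([t1])
    while queue:
        node = queue.popleft()
        if node == t2:
            return True
        for nbr in edges.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return False

print(connected("monkey", "primate"))  # True: linked via "ape" (in transit)
print(connected("monkey", "tax"))      # False: different cluster, no path
```

    The point of the sketch is that cluster membership follows the connectivity paths, regardless of whether the linked terms are synonyms.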

  3. Often LSI is mistaken for keyword co-occurrence. In Google it is possible to query and get them to show you part of the index. However, Google did name the way they derived the co-occurrence from a document as LSI, and this is probably part of the reason for the confusion.

  4. Hi, Michael. Thanks for stopping by.

    Often LSI is mistaken for keyword co-occurrence.

    I agree.

    However, Google did name the way they derived the co-occurrence from a document as LSI…

    I’m too busy right now to check that claim. Please feel free to drop me an email with a valid research reference from Google. There is probably a wording issue to sort out here.

    Google has referred in the past to the alt attribute of the IMG element as an “alt tag”, when indeed there is no such thing as a W3C-valid HTML alt tag, so I would not be surprised, since many SEOs repeat that and talk about “alt tags”. Either way, point me to a valid reference.

    However, Google did name the way they derived the co-occurrence from a document as LSI…

    Without a research reference from Google stating that, I’m afraid this might qualify as your opinion or as more hearsay.
