As I mentioned in a ClickZ column written by Mike Grehan, The Myths and Maths of SEOs, a blogonomy is the dissemination of false knowledge through electronic forums, especially through blogs. Today I want to comment on two LSI blogonomies promoted by several SEO firms.

(a) On CIRCA

In 2003 Google bought Applied Semantics and its AdSense service and rebranded it under its own name. That service and other Applied Semantics (AS) services were based on their CIRCA technology, protected by several patents. This is a technology based in part on ontologies, which is described in this white paper and in these patents:

Meaning-based advertising and document relevance determination
Meaning-based information organization and retrieval

The CIRCA white paper and patents are often misquoted in SEO blogs that discuss Latent Semantic Indexing (LSI). Some well-known SEO bloggers have even mistaken these for LSI. The fact is that these technologies are not LSI; LSI is a separate technology on which Telcordia has held a patent since 1989. CIRCA is based on ontologies, not on singular value decomposition (SVD). LSI is based on SVD, not on ontologies.

If Google and Yahoo later incorporated LSI into their paid lists, this was after the fact.

(b) Paid and free LSI tools

This blogonomy has to do with two specific SEO firms that promote an “LSI tool”. One is free and the other is paid. These “snakeoil” marketers claim they use LSI to extract information from documents. I have seen both in action and have not found any evidence that they use SVD to decompose and approximate the initial term-document matrix, so I cannot say that these tools use LSI. Even if they did, a tool that outputs 2- and 3-dimensional graphs for all test cases is simply using the first 2 or 3 singular values only, i.e., one per dimension. This provides poor results. In most valid LSI tools hundreds of singular values, and hence dimensions, are used to approximate the input matrix. In fact, it is well known that the number of dimensions to be used is a critical performance parameter that must be determined experimentally. When it comes to the number of dimensions to use, a “one size fits all” approach is not recommended.
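For readers who want to see what "decompose and approximate the term-document matrix" means in practice, here is a minimal sketch in Python with NumPy. The matrix `A` is a made-up toy example and the helper names `truncated_svd` and `relative_error` are mine, not taken from any tool discussed here. It only shows the mechanical step of keeping the first k singular values. Note that the reconstruction error it prints is not retrieval performance, which still has to be measured experimentally.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def relative_error(A, k):
    """Frobenius-norm error of the rank-k approximation, relative to the norm of A."""
    return np.linalg.norm(A - truncated_svd(A, k)) / np.linalg.norm(A)

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# Real collections are far larger and usually weighted; this is an assumption
# made only to keep the example small.
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(50, 30)).astype(float)

# Compare a "2D/3D graph" truncation against larger numbers of dimensions.
errors = {k: relative_error(A, k) for k in (2, 3, 10, 30)}
for k, err in errors.items():
    print(f"k={k:2d}  relative reconstruction error={err:.3f}")
```

With only 2 or 3 singular values most of the matrix is simply discarded, which is what a tool limited to 2- and 3-dimensional plots is doing.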

To illustrate, for a few thousand training documents, a plot of performance vs. number of singular values produces an inverted U-shaped curve with maximum performance around 100-150 singular values (dimensions). These are the optimum conditions. Note that for more than 3 dimensions a visual representation is not possible.

Reducing the number of test documents or singular values to just a few is actually one of the poorest scenarios on this curve. The other extreme is using too many dimensions. So optimum conditions must be determined experimentally, and they differ according to the universe in question. Another point is that a tool that pulls terms that are not present in the test universe (the initial term-document matrix) is simply appending these from an external source and therefore is faking the results.
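The last point is easy to check yourself. The sketch below, again in Python, compares the terms a tool reports against the vocabulary of the documents you fed it. The function name `terms_outside_vocabulary` and the sample word lists are hypothetical, and the comparison assumes simple lowercase matching with no stemming.

```python
def terms_outside_vocabulary(vocabulary, reported_terms):
    """Return reported terms that do not appear in the term-document matrix vocabulary."""
    vocab = {t.lower() for t in vocabulary}
    return sorted({t.lower() for t in reported_terms} - vocab)

# Hypothetical example: terms found in the test documents vs. terms a tool returned.
vocabulary = ["gold", "silver", "truck", "shipment"]
reported = ["gold", "truck", "jewelry"]

outside = terms_outside_vocabulary(vocabulary, reported)
print(outside)
```

Any term that shows up in this list cannot have come from an SVD of your own term-document matrix, since LSI only works with the terms in that matrix.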

Want to know if you have been taken in by these unscrupulous marketers? Read Mike Grehan's column Lies, Lies and LSI.

My take on all this? So far I have not seen any valid LSI tool from any current SEO firm. Do not let these marketing firms scam you.

This is a legacy post originally published on 10/13/2006.