Mike Grehan finished his great ClickZ column of June 11, 2007, SEO Is Dead. Long Live, er, the Other SEO, as follows:

“I’ve run out of space again. I’ll come back to the stupidity of the latent semantic indexing issue in my next column.”

Mike was probably referring to LSI claims by SEOs. He made good on that promise with his July 9, 2007 column Optimizing for Universal Search. In that column he extends the following invitation “to get some feedback”:

“Finally, I’m researching a column on latent semantic indexing (LSI). I’d really like to get some feedback from anyone who’s bought into an SEO package with LSI optimization or reference to LSI built into it. I’d like to know what you got (or are getting) for your money.”

“Or perhaps you just know some SEO firm Web sites suggesting they can optimize around LSI. Along with my dear friend and colleague Edel Garcia, I’d like to expose some of the nonsense written about LSI. And hopefully, for the last time, dispel some of the many myths that abound.”

Without getting personal, let’s call things as they are.

Mike’s call for feedback is not a naive invitation. If you read between the lines, especially in the second part of the “invitation”, this is more than a call for feedback: it is a well-deserved challenge to all those who bought “LSI services” and to all those firms suggesting that they can optimize around LSI, that they can actually provide LSI services, or that they have a valid LSI technology.

I second that. I would also “really like to get some feedback” from these buyers and sellers.

From an objective perspective, Mike’s “invitation” is both important and valid for two reasons:

1. Important because it comes from a well-respected insider of the search marketing industry, Mike Grehan, who is working for a company owned by Bruce Clay, an equally respected marketer.

2. Valid because when buyers and sellers claim that they can use LSI to improve their rankings, that they have a technology to make documents “LSI-friendly”, or that they can manipulate LSI to influence web page rankings, it is fair to ask for evidence to support such claims.

Mike’s “invitation to get some feedback” is open to all those alleged buyers and sellers of LSI services. Some of the latter are well known in the search marketing scene.

For instance, while at Michael Marshall’s FortuneInteractive, Andy Beal claimed they used LSI or provided LSI as part of their services. As of today he still believes that.

Mike Marshall in LSI: What it is, how it works and what it means has described LSI as follows: “Because LSI correlates surprisingly well with how we as humans might classify a document collection, writing content that performs well under LSI analysis is not like writing contrived, robotic styled verbiage for a machine. It involves giving proper attention both to persuasive, well written copy and to semantics.”

This was just an opinion initially stretched by some IRs, but unfortunately it is not supported by the scientific evidence. Two researchers with whom I have exchanged email regarding LSI, Regis Newo and April Kontostathis, have demonstrated through their excellent scientific work that what is at the heart of LSI, and what makes it work, is the formation of high-order co-occurrence paths across the collection under inspection. This co-occurrence phenomenon does not necessarily have to correlate with how we as humans might classify a document collection, and it certainly does not require synonyms, related terms, or a particular writing style in order to be present and hidden (latent) in a term-document matrix. It is this phenomenon that makes LSI work.
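To make the idea of a high-order co-occurrence path concrete, here is a minimal sketch using NumPy. The three-term, two-document collection is invented for illustration and is not taken from the work of Newo or Kontostathis. Terms a and c never share a document, yet they are linked through term b, and a truncated SVD surfaces that link.

```python
import numpy as np

# Hypothetical toy collection: doc 0 contains terms a and b, doc 1 contains b and c.
# Terms a and c never appear in the same document.
terms = ["a", "b", "c"]
A = np.array([
    [1.0, 0.0],  # a
    [1.0, 1.0],  # b
    [0.0, 1.0],  # c
])

# First-order co-occurrence: for a binary matrix, entry (i, j) is the number
# of documents that contain both term i and term j.
first_order = A @ A.T

# Squaring the co-occurrence matrix accumulates paths of length two,
# such as a -> b -> c, that pass through an intermediate term.
second_order = first_order @ first_order

# Rank-1 truncated SVD, the core operation of LSI.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
latent = A_k @ A_k.T  # term-term associations in the reduced space

i, j = terms.index("a"), terms.index("c")
print("first-order a~c:", first_order[i, j])
print("second-order a~c:", second_order[i, j])
print("after truncation a~c:", round(float(latent[i, j]), 3))
```

The direct association between a and c is zero in the original matrix and becomes positive after truncation. No synonyms or special writing style are involved, only the pattern of shared terms across the collection.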

In addition, the idea that LSI improves the precision, recall, or other performance measure of an IR system must be taken with a grain of salt. According to Grossman and Frieder (Information Retrieval: Algorithms and Heuristics, 2nd Ed, p. 73, Springer), improvement of precision and recall (or precision and recall curves) in LSI depends on the number of dimensions retained during SVD (e.g. around 100 for 1033 docs from a MED collection has been reported). Overall, performance vs. number of dimensions retained tends to describe an inverted U-shaped curve.

Further work using MED, TIPSTER and other collections shows just a slight improvement of LSI over standard vector space techniques. And that was using controlled collections. With commercial search engines, whose collections are full of noise, vested interests, and documents of variable length/format, and which number in the order of a few million or billion documents, the jury is still out as to whether either technique is any better.

Like Beal and Marshall, Aaron Wall has claimed in the past to know what LSI is. He has even written articles advising others on how they could make the content of documents LSI appealing for search engines implementing the technique. Wall has even promoted the idea that a tool developed by Quintura provides “LSI-like” results. The latter gave that firm some exposure in the seosphere. Soon after, many were induced to believe such claims and the idea that Quintura’s tool is indeed an LSI technology.

In a recent blog post by Wall, it became clear to David Petar Novakovic, a graduate student conducting research in LSI, to Mike Duz, who wrote The LSI Myth column, and to many others that Mr. Wall did not know what LSI is after all.

I’m not interested in using my time to turn this post into a personal attack or rant. The above are simply the facts, easy to double check on the Web. I simply cannot decorate the obvious by being a hypocrite or pretending to be politically correct. However, since Wall chose to attack me rather than provide any scientific evidence to prove his case, here is my rebuttal to Wall and his cheerleaders, with a follow up here.

I am turning the page on him.

Certainly, in the interest of fairness, we need to give these companies an opportunity to prove their cases. Let’s just hope they don’t resort to marketing lingo, “point-of-sale” arguments, or superficial explanations of semantic distances and synonym stuffing lines.

Let’s find out the truth.

Do they use SVD? If so, how? How do they construct the initial term-document matrix and why? How many documents and terms do they use? Which scoring system do they use to populate the initial matrix and why? Knowing that for more than 3 dimensions a visual representation of an LSI reduced space is not possible, how do they determine the optimum number of dimensions to keep? How many dimensions do they keep and why?

How do they know if they have achieved an optimal reduced SVD matrix that works better for a particular search engine or enterprise information retrieval system? Other than the opinions of others, or copying/pasting or misquoting lines from IR articles, what first-hand scientific evidence can these marketers provide to support the assertion that “LSI correlates surprisingly well with how we as humans might classify a document collection”?

How do they implement folding-in or other SVD updating techniques without significantly degrading orthogonality, and at what scale? What is the precision-recall profile of their implementations? How do they deal with sparsification and density issues in an SVD matrix? How do they address high-order co-occurrence patterns present in a term-term LSI matrix?
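For readers unfamiliar with the folding-in question, the following is a minimal sketch, assuming the textbook projection d_hat = d^T U_k S_k^{-1} and an invented 5-term by 4-document count matrix. It only illustrates why orthogonality is a concern: the folded-in document is projected onto the existing space without recomputing the SVD, so the enlarged document matrix no longer has orthonormal columns.

```python
import numpy as np

# Hypothetical 5-term x 4-document matrix of raw term counts.
A = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# The SVD yields orthonormal columns for the document matrix: V_k^T V_k = I.
before = float(np.abs(V_k.T @ V_k - np.eye(k)).max())

# Fold in a new document without recomputing the SVD.
d_new = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # hypothetical new document
d_hat = (d_new @ U_k) / s_k                   # d^T U_k S_k^{-1}

# Appending the folded-in row breaks the orthonormality of the columns.
V_aug = np.vstack([V_k, d_hat])
after = float(np.abs(V_aug.T @ V_aug - np.eye(k)).max())

print("deviation from orthonormality before folding-in:", before)
print("deviation from orthonormality after folding-in:", after)
```

Each folded-in document adds to the deviation, which is why anyone claiming an LSI implementation should be able to say how often they recompute the SVD and at what scale.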

These are a few of the questions that an LSI researcher would ask. I have more questions, all designed to test LSI arguments and to spot fake or poorly designed LSI implementations.

So without getting personal, let’s ask all these marketing firms if they really do know how to carry out the SVD calculations involved in computing LSI scores, or if they can show that they have a valid LSI technology for improving the search engine rankings of web pages.

All this is fair to ask, in order to separate the real gems from the dime-a-dozen. Why? Because many insiders and outsiders who have read the above claims in connection with LSI have been under the impression that all these marketing firms do know what LSI is and how it works, or that they can provide “LSI services”.

Let’s concede for a moment that these firms do have a real LSI technology. Even so, there is no such thing as “LSI-friendly” documents, and end users cannot manipulate LSI weights either. The reason can be found in the way the SVD algorithm works. In LSI one starts with a sparse matrix and ends with a dense matrix of redistributed term weights.

Any little change in a term(s) in any given document(s) used in a truncated SVD matrix provokes a redistribution of term weights across the entire collection of documents represented by that matrix. There is no way for end users to predict how that redistribution will affect other documents of the collection.
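A minimal sketch of this redistribution, using NumPy and an invented 5-term by 4-document matrix of raw counts (the matrix, the choice k = 2, and the edit are all assumptions made for illustration):

```python
import numpy as np

def truncated(A, k):
    """Return the rank-k approximation of A obtained from its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical 5-term x 4-document matrix of raw term counts (mostly zeros).
A = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
k = 2
before = truncated(A, k)

# Edit a single document: add one occurrence of term 0 to document 3.
A_edited = A.copy()
A_edited[0, 3] += 1.0
after = truncated(A_edited, k)

# Count entries that are non-zero beyond floating-point noise.
tol = 1e-9
sparse_nonzeros = int(np.count_nonzero(np.abs(A) > tol))
dense_nonzeros = int(np.count_nonzero(np.abs(before) > tol))

# Largest weight change among the documents that were NOT edited (columns 0..2).
shift_elsewhere = float(np.abs(after[:, :3] - before[:, :3]).max())

print("non-zero weights before / after truncation:", sparse_nonzeros, dense_nonzeros)
print("largest weight change in unedited documents:", round(shift_elsewhere, 3))
```

The truncated matrix has more non-zero weights than the original, and editing one document shifts the weights of documents that were never touched. Someone who controls only their own page has no way to anticipate that shift.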

In addition, end users don’t have access to the collection undergoing SVD. They don’t have control over other documents of that collection either, and don’t know when or how others will make changes to their own documents. Thus, it is impossible for end users to predict the final redistribution of weights caused by an SVD algorithm or to anticipate the subsequent ranking of documents at any given point in time. More on this here.

Even in the eventuality that users can apply LSI to a collection, there is a problem. This has to do with common sense.

Consider a search marketing firm applying LSI to a collection of D documents and retaining K dimensions to truncate a term-document matrix consisting of T terms. The firm uses a specific term weight scoring system to populate entries of the matrix. Any results will be valid for that and only that number of dimensions retained, for that and only that T-D matrix, and for whatever small number of documents and terms they have used. Let’s assume this output is used to optimize a document around a pool of terms from this matrix.

If that document is submitted to an LSI-based search engine, the engine can use its own experimental LSI conditions to include the document in its collection. Some if not all of those conditions (K, T, and D, as well as the term weight scoring system used to populate the matrix) will be different from those used by the marketing firm.
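The mismatch between the two systems can be sketched as follows. This is an illustration under stated assumptions, not a description of any real firm or engine: the count matrix is invented, the "firm" is assumed to use raw counts with K = 2, and the "engine" is assumed to use a simple IDF weighting with K = 3.

```python
import numpy as np

def doc_similarities(A, k, target=0):
    """Cosine similarity of every document to `target` in a k-dimensional LSI space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row of coordinates per document
    norms = np.linalg.norm(docs, axis=1)
    return (docs @ docs[target]) / (norms * norms[target])

# Hypothetical 5-term x 4-document matrix of raw term counts.
counts = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# System 1 (the marketing firm, hypothetically): raw counts, K = 2.
firm = doc_similarities(counts, k=2)

# System 2 (the search engine, hypothetically): IDF-weighted counts, K = 3.
n_docs = counts.shape[1]
df = np.count_nonzero(counts, axis=1)   # number of documents containing each term
idf = np.log(n_docs / df)
engine = doc_similarities(counts * idf[:, None], k=3)

print("similarity to doc 0, firm settings:  ", np.round(firm, 3))
print("similarity to doc 0, engine settings:", np.round(engine, 3))
```

Even on the same four documents, the two sets of similarities disagree once the weighting and the number of retained dimensions differ. With a different T and D as well, the firm’s numbers say nothing about what the engine computes.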

In addition, the search engine can update the original matrix at any given point in time. Any comparison between the above two systems will be an exercise in futility. And still we haven’t included high-order co-occurrence paths distributed across the matrix subject to SVD, or the fact that search engines do not publish their current term weight scoring formulae, for obvious reasons that spammers know well.

Mike’s invitation “to get some feedback” is not a naive one, but a very relevant call these days. Why? Because the misrepresentation of products and services can open the door to a consumer fraud class action lawsuit, or at the very least is something that the Federal Trade Commission might want to hear about.

Now, thanks to Grehan, the ball is in the court of these buyers and sellers.

I have my own preliminary 10-point QA/QC test for Spotting Fake, Poorly Written, or Invalid LSI Implementations, but I prefer to read when, where, and how Mike will, as he states, “expose some of the nonsense written about LSI. And hopefully, for the last time, dispel some of the many myths that abound.”