Probably one of the first official papers on LSI that is still available online. Save it before it no longer is.
The year was 1988. What were you doing back then?
As part of the development of Minerazzi, we have published an article explaining two of our search modes: XOR and XNOR. Additional articles explaining other modes will soon follow.
We believe that IR and SEO practitioners will find these search modes particularly useful.
The beauty of XOR and XNOR searches is that they allow users to run complex co-occurrence searches in a straightforward manner. This is important because Latent Semantic Indexing information is related to term-term co-occurrence relationships.
For years many SEOs fooled their own peers with the assertion that LSI was something new that Google implemented. Some have even claimed that LSI was a proprietary Google algorithm. I have spent so many years debunking all this crap and a few other urban legends from unscrupulous SEOs.
On this Thanksgiving Day I am thankful that all these myths have been debunked to no end: LSI-rank correlations, LDA-rank correlations, KD-rank correlations, additiveness of correlation coefficients, blah, blah, blah… I am also thankful that along came this: http://infolab.stanford.edu/~sergey/349/
Known by Google from the outset.
A cost-effective implementation in a large-scale and dynamic environment like the Web?
If you search this blog (IRThoughts) for LSI or visit its Latent Semantic Indexing category, you will find many posts wherein SEO LSI myths are debunked. Prior to this WordPress blog I maintained a personal blog wherein SEO myths regarding LSI were also debunked.
Over the years, many realized they were taken by the usual agents of misinformation, at least when it comes to “SEO LSI” and “LSI-Friendly” documents.
Recently, I found traffic coming from a blog discussion about a video (http://www.stomperblog.com/warning-advanced-seo-technique-does-not-work/) wherein LSI in relation to Google is debunked.
The video also discusses one flavor of LSI; i.e., one wherein the weights are tf-idf weights. This flavor does not incorporate relevance or entropy information, as other LSI variants do.
The video does a good job of debunking LSI myths. However, it contains at least one factually incorrect argument about how the SVD algorithm works.
The video gives an example implying that SVD works by reducing a large set of words to a few words, such that, for example, thousands of words are reduced to, let us say, 300 words. This is incorrect, and it is certainly not a trivial flaw.
SVD does not work by reducing a vocabulary, but by reducing dimensions, and there are as many dimensions as singular values. This is why it is called a dimensionality-reduction algorithm and not a vocabulary-reduction algorithm. I should stress that an LSI space is not like a term space, wherein each term is a dimension such that there is a 1:1 correspondence.
In LSI, the SVD algorithm is used to reduce the dimensions of a matrix; that is, the number of singular values retained from the matrix.
For instance, in our SVD and LSI Tutorial series we present an LSI example consisting of many words and few initial dimensions, such that for the initial matrix

#words >> #initial dimensions

More specifically, we used 11 words and 3 dimensions.
After truncation, we ended up with 11 words and 2 dimensions.
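The shapes are easy to check with a few lines of numpy. This is a minimal sketch using a made-up 11×3 term-document matrix (not the actual matrix from the tutorial): after a rank-2 truncation, all 11 term rows survive; only the number of retained dimensions (singular values) drops from 3 to 2.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((11, 3))  # 11 terms (rows) x 3 documents (columns)

# Full SVD: A = U @ diag(s) @ Vt; there are min(11, 3) = 3 singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep the 2 largest singular values (rank-2 truncation)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k  # truncated (rank-reduced) matrix

# The vocabulary is untouched: still 11 term rows and 3 document columns.
# Only the rank (number of retained dimensions) went from 3 to 2.
print(A_k.shape)                   # (11, 3)
print(np.linalg.matrix_rank(A_k))  # 2
```

No words disappear; the truncation simply forces the matrix to live in a 2-dimensional subspace.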
Other than this, the video is fun to watch, but it ends up being an introductory promotion for another SEO proposal.
After reviewing the video several times, I unfortunately found another incorrect argument.
When objecting that Google might not use LSI, the video argues that LSI has to return the same results when word variants such as plurals and tenses are used. This might be the case if stemming is heavily used in an LSI implementation, but stemming is not a requirement for implementing LSI at all.
When stemming is not implemented, the SVD reduction will certainly return different results, since word variants enter the original term-document matrix as different tokens before undergoing decomposition.
The video also misses where the power of LSI comes from: higher-order co-occurrence connectivity paths hidden (latent) in the original matrix. Terms do not have to be synonyms, related terms, or even derivative forms for these hidden paths to be observed in LSI.
Terms need not be related either to end up clustered by LSI. It is the hidden co-occurrence patterns that are behind the clustering. For example, in the SVD and LSI tutorial mentioned above, we intentionally used stopwords and zero synonyms/related terms, and these ended up in their corresponding clusters without being semantically related. This simple example shows that in LSI the SVD algorithm produces an output based on crunching numbers, not on making sense of meaning or intelligence, and it contradicts the generalized opinion that LSI works at the level of meaning.
I have to conclude that while the video is intended to debunk LSI SEO myths (a noble effort), it uses incorrect arguments and hearsay lines from around the Web. Debunking hearsay with more hearsay: what a shame.
There is a kind of buzz about Probabilistic Latent Semantic Indexing, hence this post.
From VSM to LSI
Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In this term space, it is not possible to assign a position to terms simply because they are the dimensions of the space. Coordinate values assigned to document and query vectors are given by term weights computed using a particular weighting scheme.
VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymity and (b) polysemy.
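The limitation is easy to demonstrate with a toy term-matching score. This sketch is illustrative only (made-up strings, raw term counts as weights): a query and a document that mean the same thing but share no terms score zero, while a shared ambiguous term produces a nonzero score between unrelated texts.

```python
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two bag-of-words term vectors (raw counts)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# (a) Synonymy: same meaning, no shared terms -> similarity is 0
print(cosine("car insurance", "automobile coverage"))  # 0.0

# (b) Polysemy: a shared ambiguous term inflates similarity
print(cosine("java programming", "java coffee beans"))  # > 0
```

Plain term matching sees neither the synonymy in (a) nor the polysemy in (b); this is exactly the dependence that LSI and its successors try to capture.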
In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan’s 1965 SVD algorithm to “documents” exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).
LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The “documents” were not HTML Web documents (there were no Web documents back then), but abstracts and memos from specific knowledge domains (HCI, scientific, medical). As expected, these contained synonyms and related terms used in those domains. Thus, clusters of these were obtained.
It was immediately claimed that LSI could be used to model aspects of basic linguistics (like synonymy and polysemy) and how the human mind associates words to concepts and concepts to meaning.
Moving twenty years forward, SEOs misread such outdated research and the synonym-stuffing myth was born.
There is now a crew of SEOs claiming that they can design “LSI-friendly” documents by making them rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing LSI myths out of thin air in order to better market whatever they sell or promote (often their own image as “experts”). The same goes for those who claim “PLSI-SEO” strategies.
Research findings suggest that what makes LSI work is first- and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for the how and why of the redistribution of term weights in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why “LSI-friendly” documents are plain SEO snakeoil. Again, the same goes for those who claim “PLSI-SEO” strategies. Keep reading.
In 1998 LSI was put into question. Given a generative model of text: why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to data?
In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.
Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected “reduced description” associated with the document.
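In symbols, the aspect model factors the joint word-document probability through a latent topic variable z (standard textbook notation, not a quote from Hofmann's paper):

```latex
P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)
```

Here P(z|d) are the per-document mixing proportions over topics and P(w|z) are the per-topic word distributions.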
But there is a problem.
By 2003 Hofmann’s PLSI model was put into question, this time by David Blei, Andrew Ng and Michael Jordan, who that year proposed the Latent Dirichlet Allocation (LDA) model. As noted by Blei et al. (and I quote), PLSI “is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers.”
Blei and co-workers then stated that this leads to two problems:
1. the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting
2. it is not clear how to assign probability to a document outside of the training set.
Thus, it is not true that PLSI is the preferred model to work with in IR, as some have claimed. In addition, the model has non-trivial theoretical flaws and limitations.
In Salton’s term vector model, as in the LSI and PLSI models, word order does not matter. Documents are simply treated as a “bag of words”. However, common sense dictates that this is not a valid assumption, since word semantics is sensitive to word ordering. This explains why searches in Google for college junior and junior college produce very different results.
To underscore the importance of word ordering, consider this: applying a similarity measure like the Jaccard coefficient, computed from a term-term matrix, to the above two queries produces identical results, but again the computed similarity scores are disconnected from word semantics.
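A quick sketch of that point. The helper below computes a set-based Jaccard coefficient over query terms (a simplification of the term-term matrix computation, for illustration only): once word order is discarded, the two queries are indistinguishable.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over the sets of terms in two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Same term sets, different meanings: the bag-of-words view cannot tell them apart.
print(jaccard("college junior", "junior college"))  # 1.0
```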
Blei and co-workers have argued that if we want to consider exchangeable representations (ordering) for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.
In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.
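The generative story can be summarized as follows (standard LDA notation with Dirichlet parameter α and per-topic word distributions β; this is the textbook presentation, not a quote from Blei et al.): for each document draw a topic mixture θ, then for each word position draw a topic and a word.

```latex
\theta \sim \mathrm{Dirichlet}(\alpha), \qquad
z_n \mid \theta \sim \mathrm{Multinomial}(\theta), \qquad
w_n \mid z_n \sim \mathrm{Multinomial}(\beta_{z_n})
```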
I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere “bags of words”. The former is IR’s Original Sin and the latter is its copycat.
The image above gives me a flashback to research work I conducted in the late ’80s on sequential simplex optimization methods.
Here is another SEO resource (http://www.billhartzer.com/pages/latent-symantec-indexing-lsi-is-the-key-to-great-search-engine-rankings/) that, a la Aaron Wall, is still promoting LSI SEO nonsense in connection with ranking high in search engines. As if these marketers really knew what LSI is or how it works. Otherwise, they would never publish such crap.
There is no such thing as “LSA/LSI sites”, nor can SEOs manipulate LSI to influence ranking results. It is this type of snakeoil marketing that is a black eye on the face of the SEO industry.
We are happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at the Indian Institute of Technology in Madras, India is using our SVD LSI tutorial as lecture material for his course: CS625, Memory Based Reasoning in AI. http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf
Another investigator, this time from the cancer research field, congratulated us on the LSI tutorials. Jaime Fernández Vera from Structural Biology and Biocomputing, Centro Nacional de Investigaciones Oncológicas, Madrid, Spain, wrote (contact info removed):
Dear Dr. García:
Thank you very much for making your magnificent practical guides available to the community, especially the LSI guide, which is the one I followed.
Jaime Fernández Vera
Structural Biology and Biocomputing
Centro Nacional de Investigaciones Oncológicas
Our LSI/SVD tutorials are also listed at http://www-timc.imag.fr/Benoit.Lemaire/lsa.html, a huge repository of LSI research resources.
For additional IR resources quoting our tutorials, check the following link at http://www.miislita.com.
Talking about LSI…
Spammers disguised as ethical SEOs who promote LSI crap are now hiding. There is less talk in the blogosphere about “SEO LSI” and “LSI-friendly SEO Optimization” myths. As we always say, these crooks are a black eye to the ethical sector of the search marketing industry.
Their signature seems to be the promotion of crap tools and services like Keyword Density tools, Markov Chain generators (if you believe that crap), TFIDF rarity calculators, “semantic page strength” estimators, lookup lists based on “LSI operators”, etc. What will be their next effort at misleading the public? Latent Dirichlet Allocation (LDA) tools?
However, in an effort to save face, the usual suspects are still performing verbal gymnastics. They are desperate. It is clear that our efforts at exposing these crooked marketers through IR knowledge are working.
Many are learning why they should stay away from the incorrect knowledge promoted by marketers who occasionally use IR jargon to pretend they know what they are talking about. They often attempt this IR-like talk to promote their image as “experts” before naive or ignorant followers. We still cannot tell who is dumber: the snakeoil sellers or their groupies. They even game each other.
When we expose SEO myths from their competitors they praise us, as long as the debunking works for them; but when their own myths are exposed they get angry at us. Ha, ha.
Posts somehow related to this post
I’m happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at the Indian Institute of Technology Madras, India, is using my LSI and Term Vector tutorials for his graduate courses:
It is great to see that more and more IR researchers and graduate students are realizing how certain SEOs have induced the public and their clients into error; that is, by selling their snakeoil in the form of “LSI optimization” and keyword density services. The most recent scam comes in the form of “Markov chain” services. As if they really knew about matrix algebra and Markov chain processes. Same old tricks…
It is not surprising to hear colleagues referring to these SEOs as vulgar crooks and scammers.
Here is a video of my presentation, Demystifying LSI, at the OJOBuscador Congress 2.0, Madrid, Spain, 2007. One year later, nothing has changed. Many of the same crook SEOs exposed during the congress are still deceiving the public about what LSI is.
Unfortunately, the quality of the video and the lighting are not good enough to see the PDF slides, and the presentation is in Spanish. Since the attendees were not scientists, I talked very slowly for over an hour.
Want to get bored for the next hour? View the video.
Thanks to N. Valenzuela Alonso, Director of SEO and Search Engine Marketing of Media Bit, S.L. for the link (www.ithinksearch.com/2008/03/31/video-lsi-de-edel-garcia-desmitificando-lsi/).
Here is also the presentation of Carlos Castillo (Chato), from Yahoo! Research Spain:
Adversarial IR with Web Spam, parts 1 and 2
I spent great time talking with Carlos, a former grad student of Ricardo Baeza-Yates.
Baeza-Yates, Andrei Broder, Gerard Salton, Keith van Rijsbergen and a few others have helped to shape what is today known as Information Retrieval research.
Talking about Andrei Broder (one of the main researchers behind the once-mighty AltaVista), here is also a great interview, thanks to the ojobuscador site:
At the last Search Engines Architecture lecture we discussed LSI and Terrier. Great questions were raised. Some of these follow:
Q: How many dimensions to keep?
A: This is done by trial and error. I have a research project on the topic. None of the current ways of addressing this problem convince me.
Q: How do we compute a truncated version of the initial matrix, A?
A: After SVDing A, truncate U, S, and V by retaining the first k columns of U and V (rows of V transpose) and the first k diagonal elements of S. Multiply these as discussed in class to get A truncated.
Q: To compute the query vector in the reduced space, do we need to compute A truncated for each query?
A: No. The new coordinates of this vector are defined as

q' = q^T U_k S_k^-1
This means that A truncated can be called from the cache. See the fast track tutorial over at the Mi Islita.com site.
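Numerically, folding a query into the reduced space is just two small matrix products. Here is a minimal numpy sketch (the matrix and query are made up; only the folding step matters):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((11, 3))  # term-document matrix: 11 terms x 3 documents

# Truncated SVD: keep the first k columns of U and first k singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], np.diag(s[:k])

# A raw query vector in term space (1 where the query contains the term)
q = np.zeros(11)
q[[0, 4]] = 1.0

# Fold the query into the k-dimensional space: q' = q^T U_k S_k^-1
q_reduced = q @ U_k @ np.linalg.inv(S_k)

print(q_reduced.shape)  # (2,) -- a k-dimensional vector, ready for cosine scoring
```

Note that A truncated is never rebuilt here; only the cached U_k and S_k factors are needed, which is the point of the answer above.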
Q: Do I need to compute A truncated each time a new document is added or previous are modified?
A: For small matrices the answer is YES. However, for huge matrices we can resort to updating/folding-in techniques. Some of these add document vectors without recomputing the previous matrix. There is a point at which this can compromise orthogonality, though.
Q: How do I use Desktop Terrier?
A: Follow the instructions provided in the updated version of Lab Report 2.