Finally SEOs are getting the LSI Myth!

If you search this blog (IRThoughts) for LSI or visit its Latent Semantic Indexing category, you will find many posts wherein SEO LSI myths are debunked. Prior to this WordPress blog, I maintained a personal blog wherein SEO myths regarding LSI were also debunked.

Over the years, many realized they were taken in by the usual agents of misinformation, at least when it comes to “SEO LSI” and “LSI-Friendly” documents.

Recently, I found traffic coming from a blog discussion about a video (http://www.stomperblog.com/warning-advanced-seo-technique-does-not-work/) wherein LSI in relation to Google is debunked.

The video also discusses one flavor of LSI; i.e., one wherein the weights are tf-idf weights. Unlike other LSI variants, this flavor does not incorporate relevance or entropy information.

The video does a good job of debunking LSI myths. However, it contains at least one factually incorrect argument in relation to how the SVD algorithm works.

The video gives an example implying that SVD works by reducing a large set of words to a few words, such that, for example, thousands of words are reduced to, say, 300 words. This is incorrect, and it is certainly not a trivial flaw.

SVD does not work by reducing a vocabulary, but by reducing dimensions, and there are as many dimensions as singular values. This is why it is called a dimensionality-reduction algorithm and not a vocabulary-reduction algorithm. I should stress that an LSI space is not like a term space, wherein each term is a dimension such that there is a 1:1 correspondence.

In LSI, the SVD algorithm is used to reduce the dimensions of a matrix; that is, the number of singular values of the matrix.

For instance in our SVD and LSI Tutorial series at

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html

we present an LSI problem example consisting of many words and few initial dimensions such that for the initial matrix

#words >> # initial dimensions

More specifically, we used 11 words and 3 dimensions.

After truncation, we ended up with 11 words and 2 dimensions.
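The distinction can be sketched with a toy matrix (the numbers below are made up for illustration and are not the tutorial's actual data): 11 terms, 3 documents, truncated to rank k = 2. Note that the vocabulary survives intact; only the rank (dimensionality) drops.

```python
import numpy as np

# Hypothetical 11-term x 3-document matrix (values are illustrative only)
A = np.array([
    [1., 0., 0.],
    [1., 1., 0.],
    [0., 1., 0.],
    [0., 1., 1.],
    [0., 0., 1.],
    [1., 0., 1.],
    [1., 1., 1.],
    [2., 0., 1.],
    [0., 2., 1.],
    [1., 2., 0.],
    [0., 0., 2.],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep 2 of the 3 singular values (dimensions)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(A_k.shape)                   # (11, 3) -- still 11 terms: vocabulary untouched
print(np.linalg.matrix_rank(A_k))  # 2 -- only the dimensionality was reduced
```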

Other than this, the video is fun to watch, but it ends up as an introductory promotion for another SEO proposal.

PS.

After reviewing the video several times, I unfortunately found that it contains another incorrect argument.

When arguing that Google might not use LSI, the video claims that LSI has to return the same results when word variants, such as plurals and tenses, are used. This might be the case if stemming is heavily used in an LSI implementation, but stemming is not a requirement for implementing LSI at all.

When stemming is not implemented, the SVD reduction will certainly return different results, since word variants are entered into the original term-document matrix as different tokens before undergoing decomposition.
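A toy sketch of this point (the two-document corpus and the crude suffix-stripping “stemmer” below are made-up assumptions, not any real implementation):

```python
import numpy as np

docs = ["the dog barks", "the dogs bark"]

def term_doc_matrix(docs, stem=None):
    """Build a raw-count term-document matrix, optionally stemming tokens."""
    stem = stem or (lambda t: t)
    tokens = [[stem(t) for t in d.split()] for d in docs]
    vocab = sorted({t for toks in tokens for t in toks})
    A = np.array([[toks.count(term) for toks in tokens] for term in vocab], float)
    return vocab, A

# Crude toy "stemmer" (illustrative assumption): strip a trailing 's'
naive_stem = lambda t: t.rstrip("s") if len(t) > 3 else t

vocab_raw, A_raw = term_doc_matrix(docs)               # no stemming
vocab_stem, A_stem = term_doc_matrix(docs, naive_stem)  # with stemming

print(vocab_raw)   # 'dog' and 'dogs' are distinct rows (tokens)
print(vocab_stem)  # both collapse to 'dog'
# Different input matrices -> different singular values -> different LSI output
print(np.linalg.svd(A_raw, compute_uv=False))
print(np.linalg.svd(A_stem, compute_uv=False))
```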

The video also misses where the power of LSI comes from: the higher-order co-occurrence connectivity paths hidden (latent) in the original matrix. That terms be synonyms, related terms, or derivative forms of one another is not a requirement for observing these hidden paths in LSI.

Terms need not be related terms, either, to end up clustered by LSI. It is the hidden co-occurrence patterns that are behind the clustering. For example, in our SVD and LSI tutorial above, we intentionally used stopwords and zero synonyms/related terms, and these ended up in their corresponding clusters without being semantically related. This simple example shows that in LSI the SVD algorithm produces an output by crunching numbers, not by making sense of meaning or intelligence, and it contradicts the generalized opinion that LSI works at the level of meaning.
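A minimal sketch of such a hidden path, on a made-up matrix (the terms and counts are illustrative assumptions, not from the tutorial): “car” and “auto” never appear in the same document, but both co-occur with the bridge term “engine”. After truncation, LSI assigns them a high similarity purely from this second-order path, while the unconnected term “banana” stays apart.

```python
import numpy as np

terms = ["car", "auto", "engine", "banana"]
#              d1   d2   d3
A = np.array([
    [1.0, 0.0, 0.0],   # car   : only in d1
    [0.0, 1.0, 0.0],   # auto  : only in d2
    [1.0, 1.0, 0.0],   # engine: in d1 and d2 (the bridge term)
    [0.0, 0.0, 2.0],   # banana: only in d3, no connecting path
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
T = U[:, :k] * s[:k]   # term vectors in the truncated LSI space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In the ORIGINAL matrix, car and auto never co-occur: their rows are orthogonal.
print(cos(A[0], A[1]))   # 0.0
# In the truncated space they are pulled together via the "engine" path.
print(cos(T[0], T[1]))   # ~1.0
print(cos(T[0], T[3]))   # ~0.0 -- banana shares no path with car
```

Nothing semantic happens here; the similarity emerges from number crunching over the co-occurrence structure alone.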

I have to conclude that while the video is intended to debunk LSI SEO myths (a noble effort), it uses incorrect arguments and hearsay lines from around the Web. Debunking hearsay with more hearsay: what a shame.

 

12 Comments

  1. Thanks for your fine review and feedback.

    To explain the essentials of LSI and describe how it should appear in search results in just 13 minutes and 39 seconds did require some “short cuts” as you have pointed out.

    But our goal is not the creation of scientists, but instead the development of successful business people, and to accomplish that we need to provide primarily two things: (1) practical skills that do work and (2) enough theory to protect our students from untruths. This most recent video provides some of the latter, while the next one provides the former.

    But theoretical inaccuracies aside, I stand by the conclusion that, in regard to the importance of LSI in the ranking algorithm, the existence of significant differences in search results for singular and plural forms is determinative of the question.

    Plurality is a difference in cardinality, not concept, so by the very nature of what LSI does — “by producing a set of concepts related to the documents and terms” and “comparing the documents in the concept space” — the results for plural forms *have* to be nearly, if not completely, identical to the results for singular forms.

    If this is not true, then I contend their concept of “concept” must not be very conceptual after all! 🙂

    Thanks again.

  2. Thank you for stopping by. My reply follows.

    First, your effort of putting out a video to debunk SEO myths in relation to LSI is, as mentioned before, a noble one and should be pursued by others. I see we have some points we can agree on, so I will highlight those throughout this reply, when necessary.

    To explain the essentials of LSI and describe how it should appear in search results in just 13 minutes and 39 seconds did require some “short cuts” as you have pointed out.

    Indeed, it is not possible. Students who took my graduate course on Search Engine Architectures came from different backgrounds (computer science, engineering, information security, programming, and one or two from marketing/web development). I spent a few lectures and computer lab sessions teaching them the theory behind LSI and how to do SVD runs on small cases. Getting their feet wet with SVD and LSI is not something to absorb in a few minutes from a video.

    But our goal is not the creation of scientists, but instead the development of successful business people, and to accomplish that we need to provide primarily two things: (1) practical skills that do work and (2) enough theory to protect our students from untruths.

    I cannot agree more with (1), and your goal should not be the creation of scientists. Often those creatures are “created” in real universities and colleges. No offense intended to the organization you belong to.

    With respect to (2), it would be difficult to protect “students from untruths” by resorting to theoretical inaccuracies, particularly flawed arguments.

    Before debunking something, we need to know at least what exactly we are trying to debunk. A critical-thinking student will challenge any teacher on this one, regardless of the teacher’s tenure, stature, or reputation. In fact, I encourage my students to do that, so they are not blind followers.

    …I stand by the conclusion that, in regard to the importance of LSI in the ranking algorithm, the existence of significant differences in search results for singular and plural forms is determinative of the question.

    Of course the form of words matters one way or the other, as do many other things. Many variables affect ranking results.

    Plurality is a difference in cardinality, not concept, so by the very nature of what LSI does — “by producing a set of concepts related to the documents and terms” and “comparing the documents in the concept space” — the results for plural forms *have* to be nearly, if not completely, identical to the results for singular forms.

    What is being disputed is that implementing or not implementing stemming makes a difference and will affect the outcome of the SVD algorithm used in LSI, simply because tokens (regardless of their cardinality or functionality) are reduced to mere numbers.

    If this is not true, then I contend their concept of “concept” must not be very conceptual after all! 🙂

    That is precisely the point. LSI is a misnomer and does not derive semantic information (meaning, concepts, etc.) as claimed in the early literature.

    Most of those claims are 10 or 20 years old. Why keep citing outdated research?

    Those papers made reference to a hidden (latent) structure between words, masked by “noisy” words, which was referred to as the latent semantic structure embedded in a corpus. SVD brings back this structure. “Semantics” in those papers refers to that hidden structure.

    What those early LSI manuscripts failed to report is that such a structure is the result of higher-order co-occurrence paths. These are easy to visualize by means of directed graphs; I call these Latent Graphs. To sum up, LSI does not derive semantic information (meanings, concepts, etc.).

    To quote Kontostathis’s excellent work:

    “LSI is a dimensionality reduction approach for modeling documents. It was originally thought to bring out the ‘latent semantics’ within a corpus of documents. However, LSI does not derive ‘semantic’ information per se. It does capture higher order term co-occurrence information [19], and we prefer to state that LSI captures ‘term relationship’ information, rather than ‘latent semantic’ information.”
    (http://csdl2.computer.org/comp/proceedings/hicss/2007/2755/00/27550073c.pdf).

    One more thing. Word semantics (meaning, concepts, etc.) can be affected by word order (the exchangeability of terms), but there is a problem: in most LSI implementations, word order is not accounted for. How, then, can one talk about concepts within the context of LSI while ignoring the exchangeability of both terms and documents?
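    A tiny illustration of this bag-of-words limitation (the sentences below are made-up examples): two statements with opposite meanings produce the very same column of a term-document matrix, so LSI cannot tell them apart.

```python
import numpy as np

def bow(text, vocab):
    """Bag-of-words vector: raw term counts, word order discarded."""
    toks = text.split()
    return np.array([toks.count(t) for t in vocab], float)

vocab = ["dog", "bites", "man"]
v1 = bow("dog bites man", vocab)
v2 = bow("man bites dog", vocab)

print(np.array_equal(v1, v2))  # True -- word order is lost before SVD ever runs
```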

    This is why new models, like LDA (Latent Dirichlet Allocation), have been proposed. A discussion on Vector Space, LSI, PLSI, and LDA is available at https://irthoughts.wordpress.com/2009/04/03/vector-space-probabilistic-lsi-and-lda/

  3. I’d like to take the third way and disagree with both yourself and Leslie Rohde, though not entirely. I have my own SEO company, and although we don’t make particular claims about LSI, we have used LSI on all our projects in the last 18 months. Although the use of LSI has been in differing amounts, particularly as we’ve developed a stronger strategy to gain better results, what we have seen is that our strategy has worked on every single project. I don’t claim that our implementation is perfect, but it does work. As I say, I don’t totally disagree with what you’re saying, but I would disagree with the conclusion that LSI is a myth.

    At this juncture I’d like to point out that I’m familiar with both your work in IR and Stompernet’s work in the internet marketing arena, both of which offer great insight and I’d highly recommend to anyone.

    It’s easy to talk about the theories of LSI and IR in general; some SEOs don’t even seem to understand the basic concept that Google is in essence an IR system, and the seosphere is a lot better off for your work. On the other side of the fence, all good SEOs understand that any implementation of IR by Google is limited by the mechanics of the internet itself. Therefore, when IR people like your good self talk about Google’s implementation of IR without, it seems to me, an understanding of what those limitations are, you don’t have a proper picture of the full story.

    I’m sure you will both disagree with my conclusion. However, talking about the theory of LSI and what Google is actually doing is one thing; having actually applied what we think Google is in part doing in the physical world, seen positive results every time, and explaining that is something entirely different.

  4. About LSI

    To use/apply LSI on a database collection of documents of size D, you MUST:

    1. Have access to every single document of the collection to be analyzed (something quite hard to do at the scale of the Web).

    2. Construct a term-document matrix A and populate it with term weights w precomputed according to a particular term-weighting framework. Said framework can be tf*idf-based or one of its many variants, or it can be based on entropy or relevance information.

    3. Decide how many dimensions k (singular values) to keep. The optimum number of dimensions is obtained by trial and error and is valid only for the collection under inspection.

    4. Apply the Singular Value Decomposition (SVD) algorithm in order to decompose the original matrix.

    5. From the resultant left and right eigenvector matrices, apply clustering techniques to identify clusters of both documents and terms.

    6. If the goal is to rank documents, sort these using query-document vector cosine similarity values in the LSI space.
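    The steps above (from step 2 on) can be sketched on a toy collection; the four short documents and the query below are made-up illustrations, and a real system would use large sparse matrices and a tuned weighting framework:

```python
import numpy as np

docs = ["gold silver truck",
        "shipment of gold damaged",
        "delivery of silver arrived",
        "shipment of gold arrived"]
vocab = sorted({t for d in docs for t in d.split()})

# Step 2: term-document matrix A with tf*idf weights
tf = np.array([[d.split().count(t) for d in docs] for t in vocab], float)
df = (tf > 0).sum(axis=1)
idf = np.log(len(docs) / df)
A = tf * idf[:, None]

# Steps 3-4: choose k (by trial and error in practice) and decompose with SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
D = Vt[:k, :].T * s[:k]   # document vectors in the k-dim LSI space
T = U[:, :k] * s[:k]      # term vectors (step 5 would cluster these rows)

# Step 6: fold a query into the space and rank documents by cosine similarity
query = "gold silver"
q = np.array([query.split().count(t) for t in vocab], float) * idf
q_k = q @ U[:, :k] / s[:k]

sims = D @ q_k / (np.linalg.norm(D, axis=1) * np.linalg.norm(q_k))
ranking = np.argsort(-sims)
print(ranking)  # document indices sorted by similarity to the query
```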

    Any LSI outcome is valid only for the collection of size D used, the original matrix A used, the term-weight scoring framework w used, and the number of dimensions k used when applying the SVD algorithm.

    If these initial conditions change over time, the outcome will also change. In addition, any change in the values of the cells of the original matrix A (even a single change) will provoke a redistribution of weights in the reconstructed matrix obtained from the SVD algorithm. The outcome of said redistribution cannot be predicted, as all weights in the truncated matrix are interrelated. Said weights determine the creation and strength of the co-occurrence connectivity paths observed. And we still don’t know when or how many times a particular document of the collection has been altered by its owners at any given point in time.

    Because of all of the above, any attempt at designing “LSI-Friendly” documents is plain snake oil. Additional information is provided at:

    https://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

    About the video

    With regard to Stompernet’s video, the only thing I share is the author’s position that LSI cannot be used/manipulated by SEOs to optimize web pages. However, the arguments used are plain wrong, and it would be hard for others to learn any valid knowledge from them.

    The problem with the video is that, as mentioned above, before trying to debunk something we need to know and understand what exactly we are trying to debunk or refute. It appears to me now that Rohde does not really understand what LSI is or how it works, and neither do many of the posters following him.

    Some of the arguments used in his video regarding what LSI is or how it works are false and unnecessarily add to the confusion many SEOs have regarding what is/is not LSI. For example: that SVD works by reducing a vocabulary, or that LSI works at the level of semantics, meaning, etc. The latter were ideas stretched in the early LSI literature, dating back 20 years or so. Today we know exactly what the LSI algorithm does and why.

    I cannot prejudge at this point whether Stompernet put out a confusing message to make some noise. If that was the intention, it is just Marketing 101. It is not the first time some of their “faculty” have come up with nonsense or misquotes from IR papers, as have many other SEOs looking to put out a product or a service at the expense of trying to debunk the competition’s hearsay. Debunking incorrect ideas with incorrect ideas is fraudulent teaching, simply because two wrongs don’t make a right.

    Adding to the confusion is the fact that many SEOs, in an effort to market whatever they sell, call something LSI that is not. This is a standard propaganda practice and works as follows:

    Let’s say we have two concepts, C1 and C2. C1 is a legit, proven concept. C2 is bogus or made out of thin air. To promote C2, rename it as C1. Any argument against C2 is then diverted by invoking the underlying true facts of C1. Then, profit from the phony ideas and the easy-to-impress followers and naive “students”. For these parasites of the truth, this is a sinister way of doing marketing, and it is snake oil at its best.

  5. My main point was that the theory is fine, but what about the practice? As I stated previously in my post, we’ve tried to understand the theory of LSI. Then we’ve tried to understand how Google may possibly apply what we’ve learned, then practically applied and tested it on dozens of projects, and we’ve had positive results every time. We actually carefully and deliberately isolate what we do with any potential LSI algorithm to ensure Google is altering rankings based on what we are testing. I could understand that we may fluke a portion of our results, so are you therefore suggesting we fluke them every time? If not, how would you explain our results?

    Let me clarify a previous point I was alluding to: technically, when I’m talking about LSI, Google does not actually use LSI, at least not what you theoretically describe to be LSI. Moreover, because of the limitations of its operating environment, this would be impossible. What Google uses is some of your theory fitted into the environmental limitations, providing a best-fit version of LSI. So ultimately, when you say Google doesn’t implement LSI at all because of your theory of how LSI works, your theory works within my experience, because I’m saying that’s not how Google does it. The same with the Stompernet video: they’re saying Google doesn’t apply LSI because it couldn’t possibly do so, and I’m agreeing with them that it doesn’t do that. But what I’m saying, particularly in relation to the Stompernet video, is that in my experience Google is doing something radically different anyway, so the theory of what you are both saying backs up my experience.

  6. Whether you might be getting good results or not is totally meaningless to the discussion when it comes to explaining what LSI is or how it works.

    The problem with your stance is that you keep making references to the theory of LSI according to ‘Google, me, you and others’ when, in fact, generally speaking there is only one theory and implementation of LSI, and that is the one described in the information retrieval literature and in the Bellcore (now Telcordia) patent. You might find some variants of implementing LSI in the IR literature or at the USPTO (United States Patent and Trademark Office), but all are based on applying the SVD algorithm as described in my previous reply.

    You claimed to use LSI. Well, to implement LSI you MUST use the SVD algorithm to reconstruct an initial term-document matrix as a reduced representation of the original matrix. There is no way around this. If you don’t do that, then you ARE NOT using LSI. Period. If you don’t grasp this simple concept, then I’m afraid chances are you don’t know what LSI is after all.

    I don’t know what SEO strategies you are using to get good results, and good for you, but whatever they are, and whatever you might want to call them, they are not LSI. Many SEOs use particular techniques for related terms and synonyms in their optimization strategies and dare to call that “latent semantic indexing”.

    Like ThemeZoom and others, you can call these SEO strategies whatever you want for marketing purposes, self-promotion, or to extract money from naive clients, but they aren’t LSI. At the end of the day, miscalling a valid concept (C1) a phony one (C2), or vice versa, is pure SEO propaganda.

    You don’t have to buy my words. For a second opinion, read from any IR colleague a tutorial on LSI, but stay away from misleading SEO “LSI tutorials and videos”.

  7. That may well be true, but you’d have to blame Google for that! What we implement is some form of latent indexed semantics; our implementation of what I call LSI is based on what Google has said about the subject. So in honesty, yes, I may be wrong, but I’m just parroting Google.

    Though technically and theoretically it may well be true that this is not LSI, as I say, your fight is therefore with them. I only understand IR up to a point and I wouldn’t claim to be an IR expert; in fact, from my point of view, I don’t really care about IR. My main concern is with SEO and getting rankings for our clients.

    But the title of your post has nothing to do with LSI in the IR world, and to be frank, unless you are actually a practising expert in the seosphere, as you call it, should you be commenting at all? I find your comment highly misleading as an SEO expert, and my question to you is: should you really be talking at all about how some form of LSI is applied by Google? So your title, “Finally SEOs are getting the LSI Myth!”, is factually totally incorrect from a seosphere point of view, as is the Stomper video, which I have to say is by far the most inaccurate video they’ve ever brought out.

  8. There is no need to blame Google for SEO hearsay and propaganda. SEOs are the ones who have claimed Google uses LSI, not Google. Applying LSI to the Web would be hard, if not impossible. As said before, you don’t have to take my word for it; just check here:

    http://www.seo-blog.com/latent-semantic-index-lsi-myth.php

    The fact that you claimed to use LSI without knowing the IR facts behind it reinforces my perception that you don’t know what LSI is or how it works at all.

    Probably you are forming an opinion of what you think LSI is, or simply repeating what you have heard from other SEOs. You are now blaming others for misleading you, or, as you said, ‘parroting’. It is easy to blame others when repeating urban legends. That’s the easy way out.

    When it comes to SEO claims about LSI, there is one SEO side that claims (like you and ThemeZoom) to use LSI without really understanding what LSI is. And there is another side that uses incorrect arguments to debunk LSI (like Stompernet). So indeed, both sides are getting the LSI Myth. The title of this post is therefore more than appropriate. You can nitpick all you want on this to save face, and that’s understandable. I don’t expect less from marketers.

    This blog is about debunking SEO/IR myths through information retrieval knowledge. Some of these are then used as case studies to be tested and dissected in my IR graduate courses. Check here:
    https://irthoughts.wordpress.com/2009/04/02/airweb-course-announcement/
