Archive for the ‘Latent Semantic Indexing’ Category

IR Videos in Spanish

June 22, 2009

I normally do not put online my lecture notes (ppt, pdf, videos). However, there are two public conferences that event organizers taped. Both last over 1 hour and are in Spanish, but with slides in English. Here are the links. The quality of the videos is so-so.

Since the videos were made available few months later after the events, these are not properly dated. I have included below the actual date of the events. If you don’t know Spanish, you are out of luck.

1. Understanding Search Engines (Entendiendo a los Buscadores), University of Puerto Rico, Bayamon, 4-23-2008

http://video.google.com/videoplay?docid=-653964730907023811

This one last for about two hours. The audience consisted of grad students and researchers. Unfortunately, the video has an audio-visual mismatch of about one slide. If you can coupe with this, I hope you like it.

2. Demystifying LSI (Desmitificando LSI)- OJOBuscador Congress, Madrid, Spain, 3-09-2007.

http://www.ojotube.com/videos/congreso-ojobuscador-2007-ponencia-desmitificando-lsi-de-dr-e-garcia/

This one last for over one hour. Since it was for a non-scientific audience  (most Spanish SEOs)  I tried to talk very slow.

What is a Similarity Matrix?

June 16, 2009

Soon or later CS students, in particularly those in IR, will need to deal with similarity matrices.

In simple terms, any matrix M that exhibits the following five characteristics is a similarity matrix.

Squaredness = M must have the same number of rows and columns.
Non-Negativity = all elements of M must be real, non-negative numbers.
Boundedness = all elements of M must adopt values between 0 and 1.
Reflexivity = all diagonal elements of M (i.e. from left to bottom) must be filled with 1.
Symmetry = all ij elements must be identical to all ji elements.

A matrix that fails to exhibit any of these characteristics is not a similarity matrix.

Accordingly, some matrices found in the literature on LSI and whose elements have been referred to as similarities are not so since the corresponding matrix does not conform to the above definition.

Note. This information will help those that took the IR Quiz on Matrices to realize how well they did.

When Noise is a Good Thing.

May 22, 2009

Today, a reader (name removed to protect confidentiality) asked me:

My name is **** ****. I working as a junior research fellow in a project in India. I red the SVD techniques from the web page http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html#right-eigenvectors. I found it is quite satisfactory for me. Now I can understand how SVD works. But I have a query as follows.

query:

As mentioned in this tutorial that we have arrange these eigen-values in descending order. Cold you please tell me if I put these values in ascending order or arbitrary what will be wrong with the SVD.

Looking forward your early kind response.

Thanking you.

With best regards.

*******

My answer follows.

It depends on what you are trying to address.

SVD is used to identify singular values interpreted as dimensions. When used as a dimensionality reduction technique, the largest N singular values are normally retained and thus retaining the smaller singular values is meaningless.  The largest singular values capture most of the information of the original data set and is therefore a noise minimization approach.

If the retention criterion used is reversed (smaller singular values are retained) this implies retaining the more noisy dimensions such that the reconstructed matrix will be a matrix of the hidden (latent) data noise. This is a noise maximization approach.

If the retention criterion is based on a random selection, the resultant reconstructed matrix might be one representing a data structure with randomized noise.

These scenarios depend on the original data under examination. 

In Image Compression, these approaches have been already explored. If the goal is a stability study and not just SVD dimensionality reduction, “the ratio between the highest singular value and the lowest singular value of the Jacobian matrix quantifies the spread of the Jacobian’s singular values, which in practice, reflects the extent of the solution’s instability with respect to small changes in the observation”  (Horesh’s Thesis )

Having said all that, we should not render noise in a data set as something that must be discarded at all cost.

This is intimate linked with the so-called Inverse Problem. Incorporating noise and a priori SVD information can provide the complete information in a linear sense. Qianqian Fang has a beautiful PPT presentation “Look Closer to Inverse Problem” on the subject. If you want to visualize the MATRIX Problem, this presentation is for you.

I’m thinking in putting together a tutorial on the Singular Value Expansion algorithm (SVE), if I ever find the time.

I hope this helps.

Finally SEOs are getting the LSI Myth!

April 9, 2009

If you search this blog (IRThoughts) for LSI or visit its Latent Semantic Indexing category you will find many posts wherein SEO LSI Myths are debunked. Prior to this wordpress blog I used to maintain a personal blog wherein SEO myths regarding LSI were also debunked.

Over the years, many realized they were taken by the usual agents of misinformation, at least when it comes to “SEO LSI” and “LSI-Friendly” documents.

Recently, I found traffic coming from a blog discussion about a video (http://www.stomperblog.com/warning-advanced-seo-technique-does-not-work/) wherein LSI in relation with Google is debunked.

The video also discusses one flavor of LSI; i.e. one wherein weights are tf-IDF weights. This flavor does not incorporate relevance information or entropy information, like other LSI variants.

The video does a good job at debunking LSI Myths. However, it has at least a factually incorrect argument in relation to how the SVD algorithm works.

The video gives an example implying that SVD works by reducing a large set of words to a few words, such that, for example thousand of words are reduced to, let say 300 words.  This is incorrect and certainly is not a trivial flaw.

SVD does not work by reducing a vocabulary, but by reducing dimensions, and there are as many dimensions as singular values. This is why is called a dimensionality-reduction and not a vocabulary-reduction algorithm.  I should stress that an LSI Space is not like a Term Space wherein each term is a dimension such that there is a 1:1 correspondence.

In LSI, the SVD algorithm is used to reduce the dimensions of a matrix; the number of singular values of the matrix.

For instance in our SVD and LSI Tutorial series at

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html

we present an LSI problem example consisting of many words and few initial dimensions such that for the initial matrix

#words >> # initial dimensions

more specific, we used 11 words and 3 dimensions

After truncation, we ended up with 11 words and 2 dimensions.

Other than this, the video is fun to watch, but ended up as an introductory promotion for another SEO proposal.

 PS.

After reviewing several times the video, unfortunately I found the video has another incorrect argumentation.

When objecting to that Google might not use LSI, an argument is made in the sense that LSI has to return same results when word variants are used like plurals and tenses. This might be the case if stemming is heavily used in an LSI implementation, but the use of stemming is not a requirement for implementing LSI at all.

When stemming is not implemented, for sure the SVD reduction will return different results since these will be entered in the original term-doc matrix to be undergo decomposition as different tokens.

The video also misses what the power of LSI comes from: higher order co-occurrence connectivity path hidden (latent) in the original matrix. Whether terms have to be synonyms, related terms, or even of non-derivative forms is not a requirement for observing these hidden paths in LSI.

Terms no need to be related terms either to end up clustered with LSI. It is the hidden co-occurrence patterns what is behind the clustering. For example, in our SVD and LSI tutorial above, we intentionally used stopwords and zero synonyms/related terms and these ended-up in their corresponding clusters, without being necessarily semantically related. This simple example shows that in LSI the SVD algorithm produces an output based on crushing numbers, not on making sense out of meaning or intelligence, and contradicts the generalized opinion that LSI works at the level of meaning. 

I have to conclude that while the video is intended to debunk LSI SEO myths (a noble effort), it uses incorrect arguments and hearsays lines from around the Web. Debunking hearsay with more hearsay: What a shame.

 

Vector Space, Probabilistic LSI, and LDA

April 3, 2009

 lda
source: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

There is a kind of buzz about Probabilistic Latent Semantics Indexing, so this post goes.

From VSM to LSI

Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In this term space, it is not possible to assign a position to terms simply because these are the dimensions of the space. Coordinate  values assigned to document and query vectors are given by terms weights computed using a particular weighting scheme.

VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymity and (b) polysemy.

In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan’s 1965 SVD algorithm to “documents” exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).

LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The “documents” were not HTML Web documents (there were no Web documents back then), but just abstracts and memos from specific knowledge domains (HCI, scientific, med). As expected these consisted of synonyms and related terms used in these domains. Thus, clusters of these were obtained.

It was immediately claimed that LSI could be used to model aspects of basic linguistic -like synonymy and polysemy- and how the human mind associates words to concepts and concepts to meaning.

Moving twenty years forward, SEOs misread such outdated research and the synonym-stuffing myth was born.

There is now a crew of SEOs claiming that they can design documents “LSI-friendly” by making these rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing out of thin air LSI Myths in order to market better whatever they sell or promote (often their own image as “experts”). Same goes for those that claim “PLSI-SEO” strategies.

Research findings suggest that what makes LSI works is first and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for how and why of the redistribution of term weights in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why “LSI-friendly” documents is plain SEO Snakeoil. Again, the same goes for those that claim “PLSI-SEO” strategies. Keep reading.

Enters Probabilistic Latent Semantic Indexing (PLSI) model

In 1998 LSI was put into question. Given a generative model of text: why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to data?

In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.

Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected “reduced description” associated with the document.

But there is a problem.

Enters Latent Dirichlet Allocation Model (LDA)

By 2003 Hofman’s PLSI model was put into question, this time by David Blei, Andrew Ng and Michael Jordan, who proposed that year the Latent Dirichlet Allocation Model (LDA). As noted by Blei, et al. (and quote) PLSI “is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. “

Blei and co-workers then stated that this leads to two problems:

1. the number of parameter in the model grows linearly with the size of the corpus, which leads to serious problems with over fitting

2. it is not clear how to assign probability to a document outside of the training set.

Thus, it is not true that PLSI is the preferred model to work with in IR, as some have claimed. In addition, the model has non-trivial theoretical flaws and limitations.

In Salton Term Vector Model as in the LSI and PLSI models word order does not matter. Documents are simply considered a “bag of words”. However, common sense dictates that this is not a valid assumption since word semantics is sensitive to word ordering. This explains why searches in Google for college junior or junior college produce far different results.

To underscore the importance of word ordering consider this: applying a similarity measure like a Jaccard Coefficient computed from a term-term matrix to the above two queries produces identical results, but again the computed similarity scores are disconnected from word semantics.

Blei and co-workers have argued that if we want to consider exchangeable representations (ordering) for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.

In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.

I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere “bags of words”. The former is IR’s Original Sin and the later is its copycat.

The image above gives me a flash back on research work I conducted in the late ’80s on sequential simplex optimization methods.

Time Series Semantics

February 4, 2009

The title of this post might be a bit confusing, but we couldn’t find a better choice of words. The point to be made is that definitions and associations of terms can be affected as events evolve in time.

Consider the key term [man].

Providing a meaning or perception for [man] during a good or a bad Economy is a good example.

[man] means something different, depending if you ask to an employed or unemployed [man].

This can be illustrated by reading Why losing a job can hurt men more.

Although it might be a quite depressing article (especially for those with a pink slip in their foreheads), note the key words/phrases that define its topic and overall semantics. Incidentally, the article starts with

“Thomas Schuler is a man.”

and almost at the ends says:

“A man is what he does.”

The key words/phrases of the article can be taken as a semantics state in time attached to the [man] key term

More LSI Snakeoil

December 10, 2008

Here is another SEO resource (http://www.billhartzer.com/pages/latent-symantec-indexing-lsi-is-the-key-to-great-search-engine-rankings/) that a la Aaron Wall is still promoting LSI SEO non sense in connection with ranking high in search engines. Like if these marketers really know what is LSI or how it works. Otherwise, they will never publish such crap.

There is no such thing as “LSA/LSI sites” nor SEOs can manipulate LSI to influence ranking results. It is this type of snakeoil marketing what is a black eye in the face of the SEO industry.

Why Defining Distance is Important

October 23, 2008

In previous posts we mentioned the difference between distance (dissimilarity) and similarity. Both can be used to describe proximity; i.e., how alike objects are. A distance is a metric function, while similarity is a relative judgment of proximity. A generalization of distance is the Minkowski Distance. The Euclidean Distance cannot be greater than the Manhattan or City Block Distance –named in this way because taxicab drivers in Manhattan can only go from one point to another by driving around rectangular city blocks.

We could extend on this topic and provide math arguments, explaining the Minkowski Distance or what is a metric space wherein a distance ‘live’. We could also add insult to injury and explain why many SEOs don’t have a clue when mistaking distance and similarity, or of the non-sense of talking about ’similarity distance’ and ’semantic distance’ in a hyperdimensional space (say ‘Hi’ to SEOs that sell LSI Snake Oil), blah, blah, blah.

Instead, this time I want to provide a real case scenario (taken from Chapter 2 of my upcoming ebook, Keyword Clustering Analysis with Excel) which would help readers and students understand the importance of properly defining distance.

In 2005, the New York Times reported that it was brought to the Court of Appeals’ attention a case wherein a man named James Robbins was accused of selling drugs within 1,000 feet of a school. He was arrested in March 2002 on the corner of Eighth Avenue and 40th Street in Manhattan and charged with selling drugs to an undercover police officer. The nearest school, Holy Cross, is on 43rd Street between Eighth and Ninth Avenues (http://www.nytimes.com/2005/11/23/nyregion/23drugs.html?_r=1&oref=slogin).

Defendant lawyers argued the distance should be measured as a pedestrian would walk city blocks; i.e using the Manhattan or City Block Distance. They claimed the school was more than 1,000 feet away by walking from the site of the arrest.

Law enforcement officials calculated the Euclidean Distance, measuring the distance up Eighth Avenue (764 feet) as one side of a right triangle, and the distance to the church along 43rd Street (490 feet) as another, to find that the length was about 908 feet.

The Court of Appeals upheld his conviction and determined that the Legislature’s intention effectively extended the boundaries of school grounds outward in order to encompass all public areas within a 1000-foot radius of the school (http://www.courts.state.ny.us/ctapps/decisions/nov05/162opn05.pdf). Read reactions to the ruling at http://volokh.com/posts/1132938765.shtml. It is a hilarious discussion.

Having Fun with Oxymorons

October 20, 2008

Over the weekend some asked me to expand on oxymorons, so this post goes.

As mentioned in our previous post, an oxymoron is a combination of contradictory terms. All antonyms are contradictory terms, but not all contradictory terms are antonyms. For instance ‘alone together’ and ‘big baby’ are oxymorons, but only the former consists of antonyms. Thus, it is possible to extract these two types of clusters from a list of oxymorons. We can also extract clusters of oxymorons with a common term or theme.

We must not mistake oxymorons for misnomers. A misnomer is an incorrect designation. Not all oxymorons are misnomers. For example, ‘binary independence’ is a misnomer, but not an oxymoron. All this is explained in my upcoming ebook Keyword Clustering Analysis with Excel.

As mentioned in my previous post, ’similarity distance’ is an oxymoron since distance is dissimilarity. The expression is also a misnomer.

Unfortunately, some IR authors and many SEOs have used the ’similarity distance’ expression. The problem here seems to be a combination of poor selection of words and lack of knowledge about basic IR cluster analysis concepts, particularly of LSI.

In Cluster Analysis, objects are grouped into clusters using Proximity, a criterion of how ‘close’ or alike objects are. Proximity can be defined as Distance (dissimilarity) or Similarity. Clustering by distance is a minimization problem whereas by similarity is a maximization problem.

There are more definitions for similarity than for distance. Which type of proximity and definition to use depends of the type of attributes and scale of attributes of the data.

In a very high dimensional space the notion of distance or similarity is useless, if not meaningless. Thus, in a high dimensional space, talking about a ’semantic distance’ is a waste. We can try to do dimensionality reduction with LSI, but we end up computing cosine similarity, not distance.

The current state of the art is that LSI is also a misnomer as (a) its clustering power is the result of high-order word co-occurrence not semantics and (b) it is not exactly a document indexing method (before applying LSI, documents must be already indexed).

With regard to similarity and distance, it is tempting to think that one is just the numerical complement of the other or that we can blindly transform one from the other. These are short sighted views. Fortunately few IRs believe this. Unfortunately, we cannot say the same about search marketers and SEOs

Want to know more about the difference between these two concepts? Study the topic or read my ebook. Dont’t be fooled by self-proclaimed SEO “experts” or any “seobook”.

SEOs and their Semantic Distance Myths

October 15, 2008

As many of our readers know, one of the goals of this blog is to debunk SEO myths through IR knowledge. At times, we also try to clarify incorrect statements made in the IR field as well. It is not surprising from time to time find IR papers with clear gross errors and misnomers. That’s why we believe the name of this blog, IR Thoughts, is on target.

Today’s SEO myth to debunk is the so-called Semantic Distance Myth in connection with LSI. But, first some definitions. The following material is taken from chapter 1 of my upcoming ebook Keyword Clustering Analysis with Excel.

Dissimilarity characterizes how different objects from a cluster are.

Similarity indicates how similar the objects are.

In IR, it is customary to use the term ‘distance’ to mean dissimilarity. This is because unlike similarity, dissimilarity is a distance metric. Thus, similarity and distance (dissimilarity) are opposite terms. We can convert similarities into distances and viceversa, but not without first understanding the model at hand * (http://www.miislita.com/searchito/binary-similarity-calculator.html).

Beware of Oxymorons and Misnomers

The expression ‘similarity distance’ is an oxymoron or a combination of contradictory terms (like a ‘small giant’, ‘approximately equal’, etc). Unfortunately, the expression has been used in the IR literature (http://www.google.com/search?q=%22similarity+distance%22). Avoid it.

Some search marketers when trying to explain Latent Semantic Indexing (LSI) have used expressions like the ‘semantic distance between words’ when in fact what is being discussed is word similarity or how words relate to each other or to a topic. Used in this context, their discourses are oxymoronic if not dumb, let alone the fact that LSI does not measure any ‘semantic distance’.

Some IR authors have also used the expression ‘distance between words’ in reference to the number of words between any two words. The expression is loosely accepted to describe word spacing/distribution.

Perhaps loosely excluding the last one, all these expressions are misnomers (incorrect designations) since do not conform to the definition of a distance metric. Let us address this point.

Distance Defined

A function f is called a distance if it exhibits reflexivity, symmetry, and triangular inequality. To grasp these concepts, visualize three points (a, b, and c) describing a triangle.

Reflexivity means that the distance from a point to itself is zero; e.g., f(a, a) = f(b, b) = f(c, c) = 0.

Symmetry refers to the fact that the distance between any two points, measured from either one, is the same; e.g., f(a, b) = f(b, a).

Triangular Inequality requires that the distance between any two points, measured from either point, must be equal or less than the distance between these measured through a third point; e.g., f(a, b) + f(b, c) => f(a, c).

If these conditions are not met, the function measure in question is not a distance.

Finally, note that distances cannot be negative and are not upperly bounded, unless their scales have been normalized.

* You might also want to check: http://irthoughts.wordpress.com/2008/01/02/simcalc-binary-similarity-calculator/ 

or

http://www.cs.ualberta.ca/~lindek/papers/sim.pdf

PS.

Note to spammers and SEOs:  Embedding oxymorons and misnomers in documents, particularly in links, could be used as a search engine persuasion trick.

Some might argue whether the expression ‘loosely excluding’ might also qualify as a near oxymoron. For a list of oxymorons or near oxymorons, check these links:

http://www.ethanwiner.com/oxymoron.html

http://cs.bilgi.edu.tr/~hobbittr/critical/carlin_oxymorons.html

http://www.atlantamortgagegroup.com/oxymoronlist.htm

 

IR and SEO Misnomers

August 14, 2008

Dictionary defense definition lovers, here is one for you: misnomer.

According to Dictionary.com, a misnomer is:

1. a misapplied or inappropriate name or designation.
2. an error in naming a person or thing.

i.e., an inappropriate designation.

Once a while, misnomers are found in IR. Here are at least two:

1. Binary Independence Model
2. Latent Semantic Indexing (LSI)

Here is a quick overview

1. Binary Independence IR Model

Back in 1991 Cooper wrote “Some Inconsistencies and Misnomers in Probabilitistic Information Retrieval”

His article’s abstract states:

“The probabilistic theory of information retrieval involves
the construction of mathematical models based
on statistical assumptions of various sorts. One of the
hazards inherent in this kind of theory construction
is that the assumptions laid down may be inconsistent
with the data to which they are applied. Another
hazard is that the stated assumptions may not
be the real assumptions on which the derived modelling
equations or resulting experiments are actually
based. Both kinds of error have been made repeatedly
in research on probabilistic information retrieval.
One consequence of these lapses is that the statistical
character of certain probabilistic IR models, including
the so-called ‘binary independence’ model, has been
seriously misapprehended.”

In that article Cooper wrote:

“Since one can derive all possible assertions from an inconsistent
theory, such a theory must be meaningless –
entirely lacking in significance or predictive power. It
makes no sense that good experimental results could
come out of an inconsistent theory.”

“It is tempting to explain this conundrum by suggesting
that the inconsistencies in question were only minor
ones. However, there is no such thing as a theory
that is just ‘a little bit’ inconsistent. A theory cannot
be just a little bit inconsistent, any more than the
scientist proposing it can be just a little bit pregnant.
A logical inconsistency, if its implications are followed
out, destroys a theory utterly. It is a disaster, and is
totally unacceptable. If rationality is to be preserved,
inconsistencies simply cannot be tolerated.”

2. Latent Semantic Indexing

Contrary to SEO myths, LSI is not an “indexing” model, but an SVD matrix decomposition technique. Despite of what has been written in the early literature on LSI, today we know that LSI does not assesses semantics, but gets its power from high order co-occurrence relationships. Furthermore, one can grasp latent (i.e., hidden) structures present in a system by using diverse techniques other than LSI. In the early papers on LSI, structured collections of journal articles and abstracts about specific topics -medical, science, and computers-  were used. These were rich in synonyms. Evidently, the latent semantic structures identified these by forming clusters of synonyns and related terms.

Soon after, SEOs that misread those papers developed a synonymity myth around LSI. The fact is that in LSI we can arrive to the so-called “LSI latent clusters” regardless if terms are synonyms or not. This myth is easy to debunk. Once the clusters have been obtained, arbitrarily replace one term from the term-doc matrix that is known to belong to a cluster by a new arbitrary term not present in the collection, keeping the weights intact. Run the SVD algorithm again. You should arrive at the same latent structure, but the new term might not be semantically related at all to terms in the cluster since the algorithm only understands and processes numbers, not semantics. Thus, it processes a la garbage in-garbage out. This is a great topic for another ‘SEO Myth Debunking’ IRW article.

For those SEOs, marketers, and spammers that claim to know what is LSI, like Aaron Wall and the likes, read: http://irthoughts.wordpress.com/2007/05/01/irwatch-may-issue-demystifying-lsi/

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (ACM, SIGIR ’91) (pp 57-61). Chicago, Illinois: ACM.

Mike Grehan
Lies, Lies, and LSI by Mike Grehan
http://www.clickz.com/showPage.html?page=3623571

Bill Slawski
Personalization Through Tracking Triplets of Users, Queries, and Web Pages
http://www.seobythesea.com/?p=535

Rand Fishkin
InfoSearch Media & ContentLogic – Purveyors of Falsehoods
http://www.seomoz.org/blog/infosearch-media-contentlogic-purveyors-of-falsehoods

Lee Odden
5 Myths about SEO
http://www.toprankblog.com/2006/12/5-myths-about-seo/

Marios Alexandrou
The History of Latent Semantic Indexing
http://www.searchgrit.com/history-of-latent-semantic-indexing/

Carson
Web content and LSI mega-rant. Part Two…
http://contentdonebetter.com/2007/03/30/web-content-and-lsi-mega-rant-part-two/

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/  

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

Claps and Slaps, the LSI Way

August 4, 2008

Claps

We are happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at the Indian Institute of Technology in Madras, India is using our SVD LSI tutorial as lecture material for his course: CS625, Memory Based Reasoning in AI. http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf 

Another investigator, this time from the cancer research field, congratulated us for the LSI tutorials. Jaime Fernandez Vera from Structural Biology and Biocomputing, Centro Nacional de Investigaciones Oncologicas, Madrid, Spain wrote (contact info removed):

Estimado Dr. García:

Muchas gracias por poner a disposición de la Comunidad sus magníficas guías prácticas y, en especial, la de LSI que es la que he seguido.

Un abrazo,

Jaime Fernández Vera

Biología Estructural y Biocomputación Structural Biology and Biocomputing
Centro Nacional de Investigaciones Oncológicas

Our LSI/SVD tutorials are also listed in http://www-timc.imag.fr/Benoit.Lemaire/lsa.html huge repository of LSI research resources.

For additional IR resources quoting our tutorials, check the following link at http://www.miislita.com.

http://www.miislita.com/searchito/educational-links.html.

Slaps

Talking about LSI…

Spammers disguised as ethical SEOs and that promote LSI crap are now hidding. There is less talking on the blogosphere on “SEO LSI” and “LSI-friendly SEO Optimization” myths. As we always say, these crooks are a black eye to the ethical sector of the search marketing industry.

Their signature seems to be the promotion of crap tools and services like Keyword Density tools, Markov Chain generators (if you believe that crap), TFIDF rarity calculators, “semantic page strength” estimators, lookup lists based on “LSI operators”, etc. What will be their next effort at misleading the public? Latent Dirichlet Allocation (LDA) tools?

However, in an effort to save face, the usual suspects are still making gymnastic wording. They are desperate. It is clear that our efforts at exposing these crook marketers through IR knowledge are working.

Many are learning why they should stay away from the incorrect knowledge promoted by marketers that ocassionally use IR jargon to pretend they know what they are talking about. They often do these IR-like talking attempts to promote their image as “experts” before either naive or ignorant followers. We still cannot assess the dumbers, if the snakeoil sellers or their groupies. They even game each others.

When we expose SEO myths from their competitors they praise us as long as the debunkig works for them, but when their own myths are exposed they get angry at us. Ha, Ha.

 

Posts somehow related with this post

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/ 

http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

Claps and Slaps

July 14, 2008

Claps

Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ) , who conducts research in LSI and few other great areas at the intersection of IR, NLP, and AI wrote me to mention that he is almost finishing his grad thesis. Thanks, David for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.

Slaps

There is something funny about SEOs that sell snake oil ( http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, markov chains, LSI, and keyword density are few examples. Ha, Ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.

The following links provide additional information about AIRWeb and EMNLP

AIRWeb
http://irthoughts.wordpress.com/2008/04/29/for-seo-spammers-airweb-2008-presentations/  

EMNLP
http://conferences.inf.ed.ac.uk/emnlp08/

SEOs and their IDF Myths: Part 2

July 3, 2008

This post is a continuation of a previous one on the topic of SEO non sense in relation with inverse document frequency (IDF). IDF has a long standing presence in IR since its introduction back in 1972 by the late Karen Sparck Jones. Since then the model has been thoroughly researched and incorporated into IR models (Salton’s tf-idf, RSJ, and BM25 models).

SEO myths and misconceptions in connection with IDF is featured in the current issue of IRW newsletter.

It appears that repeating SEO crap across the blogosphere makes some to become experts on the subject. These pseudo teachers should have researched the topic before making dumb claims or repeating their peers’s hearsay or come up with definitions made out of thin air. Nothing new coming from SEOs. Such practices are almost their trademarks.

For instance Aaron Wall, has incorrectly defined IDF as follows:

“Inverse Document Frequency is a term used to help determine the position of a term in a vector space model.”

As usual other seos have repeated like parrots such misinformation. It is not the first time. It reminds me of Wall’s claims about LSI, a topic he wrote extensively on until his ignorance about the topic was exposed in several blogs that discuss IR. He was not alone. Andy Beal, Mike Marshall, and few other vocal SEOs have claimed to know about LSI or have used LSI in SEO work. Really?

But this post is not about LSI, but IDF which is a topic equally misunderstood by SEOs. So, let us debunk their claims. As stated in IRW-2008-06:

Salton et al. proposed the vector space model in 1975 in the paper A Vector Space Model for Automatic Indexing (15). In that paper several schemes for scoring term weights were proposed. One of these consisted in combining term frequency (tf) with IDF. Over the years, a family of tf-IDF models has been proposed. Obviously, these are predated by the IDF model of 1972.

In Salton’s vector space model documents are represented as vectors. A query is represented just as another document. Vectors are projected in a vector space, whose dimensions are terms. The units of those dimensions are weights. Coordinates associated to a point (or a vector) in that space are computed according to a scoring model. Terms cannot have positions in this vector space because they are the dimensions of the space. It is that simple.

Despite the fact that IDF has been around since 1972 and tf-IDF since 1975, some search marketers like those that repeat Andy Edmonds’s claims are saying that IDF or tf-IDF is a “new” buzzword in the IR field. WOW! IDF and tf*IDF is a “new” buzzword in IR circles. Really?

Others have claimed that it is not possible to evaluate the IDF of a phrase. Even some that plan to teach IR have claimed that calling log(N/n) “inverse document frequency” is an “insult to students”. Before making a fool of themselves they should read Robertson and Sparck Jones legacy papers on the topic.

Sorry to sound harsh, but I wonder what kind of crap all these pseudo teachers are lecturing while sitting in the dark of their  empty classrooms and forums.

Did search engines use IDF? Yes, absolutely.

Do all search engines use IDF? No, absolutely.

Do I think X search engine currently uses IDF? I cannot speculate what X is doing simply because I don’t work at X.

Do I use IDF? Yes, in my experimental search engine students are building/researching.

What are the drawbacks of IDF? Several. Its stability as N gets larger is an ongoing research topic.

Before commenting on IDF, SEOs please don’t lose credibility and do your homework. Start here.

New Research on the topic: http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Search Engines Architecture Week 5

April 11, 2008

Week 5 Agenda

Lecture Session

Building a Parser
Parsing Techniques and Implementations

Lab Session

Building a Query Normalizer
Building a Document Parser
Building a URL Normalizer

Comments

A reminder that hard and digital copies are required for all lab reports to get full credit. I need to evaluate whether the programs actually work as intended.

Report deadlines: To be announced to accommodate to class needs and previous lab needs.

Demystifying LSI Video

April 7, 2008

Here is a video of my presentation, Demystifying LSI, at the OJOBuscador Congress 2.0, Madrid, Spain, 2007. One year later, nothing has changed. Many of the same crook SEOs exposed during the congress are still deceiving the public about what is LSI.

Unfortunately, the quality of the video and lights are not good enough to see the pdf slides, plus the presentation is in Spanish. Since attendees were not scientists, I talked very slow for over an hour.

Want to get bored for the next hour? View the video.

Thanks to N. Valenzuela Alonso, Director of SEO and Search Engine Marketing of Media Bit, S.L. for the link (www.ithinksearch.com/2008/03/31/video-lsi-de-edel-garcia-desmitificando-lsi/).

Here is also the presentation of Carlos Castillo (Chato), from Yahoo! Research Spain:

Adversarial IR with Web Spam, parts 1 and 2 
(http://www.ojobuscador.com/2007/06/14/ir-con-adversario-y-webspam-videopost/).

I spent great time talking with Carlos, a former grad student of Ricardo Baeza-Yates.

Baeza-Yates, Andrei Broder, Gerald Salton, and Keith van Rijsbergen and few others have helped to shape what is today known as Information Retrieval Research

Talking about Andrei Broder (one of the main researchers behind the old mighty Altavista), here is also a great interview, thanks to ojobuscador site: 
http://www.ojobuscador.com/2006/05/20/entrevista-a-andrei-broder/

 

Adressing Some LSI Questions

April 2, 2008

At the last Search Engines Architecture lecture we discussed LSI and Terrier. Great questions were raised. Some of these follows:

Q: How many dimensions to keep?
A: This is done by trial and error. I have a research project on the topic. None of the current ways of addressing this problem convince me.

Q: How do we compute a truncated version of the initial matrix, A?
A: After SVDing A, truncate U, S, and V by retaining the first k columns of U and V (rows of V transpose) and the first k diagonal elements of S. Multiply these as discussed in class to get A truncated.

Q: To compute the query vector in the reduced space, do we need to compute A truncated for each query?
A: No. The new coordinates of this vectors are defined as
q = qTUkSk-1
This means that A can be called from the cache. See the fast track tutorial
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf
over at Mi Islita.com site.

Q: Do I need to compute A truncated each time a new document is added or previous are modified?
A: For small matrices the answer is YES. However, for huge matrices we can resource to updating/appending techniques. Some of these add doc vectors without recomputing the previous matrix. There is a point wherein this can compromise orthogonality, though.

Q: How do I use Desktop Terrier?
A: Follow the instructions provided in the updated version of Lab Report 2.

Search Engines Architecture Week 2

March 14, 2008

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
First-Breadth and Deep-First Web Crawlers
The Terrier Desktop Searches Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of ATA and the trace of this matrix. Do the same for AkTAk. Compare results and draw some conclusions. What important property is confirmed?

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize VkT and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from AkTAk. Explain your observations.

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286,
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith Lindsay (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations
Boldi, P., S. Massimo, V. Sebastiano Vigna (2007)
PDF

How Many SEO Myths In One Sentence?

March 12, 2008

When I thought I have read enough SEO myths here is another SEO “expert” combining many of these in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

LSI: How Many Dimensions to Keep?

February 13, 2008

In How to Populate a Matrix for SVD I referred readers to Igvita’s great blog posts on SVD. A recent visit to the blog shows it is still very much alive and equally interesting. The issues been discussed are not really new, though.

When we lecture on SVD an issue that soon or later arises is how many dimensions k to keep. A recent visitor of the aforementioned blog finally raised the same question.

Can you pls give me a clue as to how we decide how many dimensions to project our data onto when using SVD?

How many dimenisions to keep is the so-called Rank k Approximation that often leads to the dreaded dimensionality reduction curse in which performance can be compromised.

In the Latest SEO Incoherences (LSI) post we mentioned that this issue was already addressed by Dr. Susan Dumais, many times, and througout her first papers and talks on LSI. In that post we referred readers to Dumais’s talk Transcription of the Application presentation by Susan Dumais, Bellcore (now at Microsoft). That talk is now a classic in the history of LSI.

In those days Dumais approach was simply “by seat of the pants“:

Let me end, as my time is running out, with some of the statistical issues that we have encountered and that I hope you have some hints about. The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it doesn’t work. You know when you have too few dimensions. We would like some better methods for doing this, things like the scree test don’t seem to correspond very well to behavioral data that we have.

Later during the QA session participants revisited this issue. Let us reproduce participants-Dumais QA:

PARTICIPANT: Thank you, Susan. Questions from the floor?

PARTICIPANT: I’m a little nervous that if someone was browsing the Web and we hoped to put some of this material in the Web, that we’re in trouble. We’re talking about seat of the pants and underwear models, that people are going to get the wrong context for why we’re here. But that is part of the big problem that Susan is talking about.

PARTICIPANT: I thought I would just mention an entirely different approach to this problem, with Joe (word lost) at EDS. What we’re doing is –

PARTICIPANT: Can you get to a mike?

PARTICIPANT: We are using a poisson model for the word counts. Then we’re interested in finding maximum likelihood estimates for the clustering, and we found various combinations of simulated annealing and markup chain Monte Carlo to work very well with funding these things.

One of the nice things in a model based approach is that you get natural measures of association rather than just SVD types of things, although it could be slower.

PARTICIPANT: I think one thing we will try to ask everyone after the conference is to send us electronically two or three references of relevant work that we can disseminate in this way, because we do hope to learn about new approaches and new methods and related work. So keep that in mind as the discussion progresses. We will send out E-mail requesting those in electronic form.

PARTICIPANT: (Comments off mike.)

DR. DUMAIS: It is first of all not clear that the 300 or 400 dimensions we have used for the trek databases is optimal. We find that performance is still increasing up to 400 dimensions; it may well increase beyond that.

In fact, I should mention that if you plot performance as a function of number of dimensions, what you get is an inverted U function that is heavily skewed. That is, performance increases dramatically as you go from 20 or 30 up to several hundred dimensions, and then it tails off gradually through the level of performance that you see with raw key word matching, which is the full dimensional solution.

We don’t know that we have reached the peak. In problems where we know what the optimal number of dimensions is, we have found that the peak is not so sharp.

Twenty years later (first LSI papers saw the light in 1988, not in 1990 as some SEOs have incorrectly claimed) a lot of research advances on SVD in relation to LSI have been published. Old IR ideas regarding LSI have been dropped and new ones have been adopted. That is what research is all about.

Still, the issue of how many dimensions to keep is still an open issue and a “by seat of the pants” one. All kind of things and guidelines have been tried. But at the end we need to test and retest the system under examination.

I even have tested my own guideline: keep the top k singular values that amount to more than X percent of the trace of the S matrix; where

S is the matrix of singular values.
X is a threshold value, usually 80-85%

But, again, some would ask: why 80%? Why not 90%, 70%, 60%, etc?

While the above guideline works for many systems, I have trepidated on some systems in which the above threshold is not good. So I always come to “find X experimentally or by seat of the pants”.

We could inspect this as an optimization problem and use Nelder-Mead Multivariative Sequential Simplex Optimization, but I haven’t tried this yet. I’m not sure if this is the way to go either, but might be worth to test.

Another idea is to iteratively update-test-update-test the matrix using any of the current SVD updating methods for several X values. I need to spare some time on this one to see what comes out.

 I’m also open to suggestions.

For those interested, a 1.0 Mb download of Dumais’s 1995 presentation is available. If you have problems downloading it, let me know. I can send you a zip file.

April Kontostathis, from Ursinus College, in Essential Dimensions of Latent Semantic Indexing (LSI), proposes an interesting approach to address aspects of this problem. She illustrates her approach with a model wherein term weights are computed using a well known base-2 LOG model for local weights combined with the ENTROPY model for global weights.

More work is still needed along these lines.

Association & Scalar Clusters Tutorial – Part 1

January 22, 2008

I am writing a tutorial series on Cluster Analysis. It is my pleasure to announce that the
Association and Scalar Clusters Tutorial – Part 1: Back Mapping Term Clusters to Documents was uploaded few days ago.

Online publication was announced in advanced to subscribers of the IR Watch – The Newsletter, so they already have an edge over regular readers and visitors of Mi Islita

Abstract follows:

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A “reaction” equation approach is used to break down the classification problem to a sequence of steps. From the initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified. A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters. Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation, and document classification are discussed.

During last night lecture (Web Mining Course), I applied the back mapping technique to scalar clusters generated from LSI. The technique provides additional information and reasons as to how and why documents score as observed after implementing SVD. A clear connection with Fuzzy Set Theory was made.

Students taking the Web Mining Course will find this tutorial quite handy.

Web Mining Week 8

January 21, 2008

Week 8 Agenda

 Take-Home Work 3 and Web Mining Course FAQs
LSI and Scalar Cluster Analysis: An EXCEL Spreadsheet Approach (PPT presentation)
LSI and Fuzzy Sets = Fuzzy LSI
Introduction to Intelligence Searches (PPT presentation)
Bonus: My IPAM Lost Pictures at the 2006 Document Indexing Workshop

Required Reading Material

http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf 
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Finding Topic-Specific Posts

January 18, 2008

Global Term Weights based on Entropies

January 16, 2008

A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.

Global weights based on entropies are frequently combined with local and normalization weights into overall weights.  These are then used to populate a term-doc matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.

The following set of equations define the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:

Global Entropy Weights

Evidently,

G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.

Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms that appear fewer times in a small number of documents, while lowering the weights of terms that are frequently used across the collection.

Note that the convention is to default p log p values when a condition is met; e.g., p log p = 0 if p = 0 or 1.

Web Mining Week 7

January 14, 2008

Week 7 Agenda

Review of Association and Scalar Clusters
Review of Vector Space Models
LSI & SVD: Demystifying LSI SEO Myths (OJOBuscador Congress, Madrid; PDF Presentation)
LSI & Keyword Research (PDF Presentation)
SVD Noise Filtering: Principal Component Analysis (PCA)

Required Reading Material

Tutorial Series
This is part one of a five-part tutorial series:
http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

Fast Tracks
These are quick tutorials, with to-the-point calculations:
http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Blog Posts
These are IR blog posts designed to fight back against misinformation promoted by unethical SEOs and Spammers:
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/page/1/?s=lsi
http://irthoughts.wordpress.com/page/2/?s=lsi

Blog Category
This is a blog category pointing to a collage of posts that demystify SEO non sense about LSI. Some are about topics that overlap with LSI:
http://irthoughts.wordpress.com/category/latent-semantic-indexing/   

From Keyword Density to William Tutte’s Legacy

December 20, 2007

From Keyword Density to Keyword Distribution

Finally we have the Christmas Break from graduate school.

In my last Web Mining Course lecture before the Christmas Break, I tried to explain to students the importance of incorporating word spacing in information retrieval algorithms and in document relevance assessments. I explained why ideas like SEOs’s keyword density (KD), the traditional local term weight model known as FREQ (Term Count) and used in early papers on Vector Space and LSI models, and the likes are poor estimators of document relevance.

Among other theoretical reasons, it was discussed that a term mentioned X times not necessarily is X times more important than other terms. In addition, KD and the term count model cannot attenuate frequencies. We then discussed several frequency attenuation models (keyword spam filters) that also work as term weight scoring models. These can dampen down the effect of abnormal repetition of terms, raise a spam flag, and do not require of any reference to KD “tales”.

We also discussed several scenarios in which one could use word distributions and co-occurrence to analyze textual information –far better than with the aforementioned “crapstimators”. For instance, word spacing can be used in encryption/steganographic algorithms to uncover hidden messages, profiling writing styles/people, imputate authorship of text, assess plagiarism, fraud, etc.

I’m happy that not all SEOs are buying into the keyword density of non-sense and similar “crapstimates”, as I can see from these SEOmoz posts.

From Keyword Distribution to William Tutte’s Legacy

This morning I came across a nice biography of one of those venerable giants: the late William Tutte. Beautifully written by Dan Younger, the biography is a tribute to Tutte’s greatness. Interesting to point out in relation to word spacing theory is this portion of Young’s writing (emphasis added):

“Tutte’s great contribution was to uncover, from samples of the messages alone, the structure of the machines which generated these codes. This came about as follows. In August 1941, a German operator sent a Fish-enciphered teleprinter message of some 4000 letters from Athens to Berlin. For some reason, the message was not received properly and so it was resent. Against all guidelines, it was sent with the same setting. It was identical in content, but it differed slightly, in word spacing and punctuation. John Tiltman of Bletchley was able to use this blunder to find both the message and the obscuring string that was added to make up the enciphered message. But that seemed to be all that could be found, when Tutte was presented with the case in October.”

“Tutte began by observing the machine generated obscuring string carefully. Splitting it up into various lengths, he noticed signs of periodicity. For the first of the five teleprinter tape positions, the regularity he supposed arose from a wheel of 41 sprockets. And then at the last position, one of 23 sprockets. Over the next months, Tutte and colleagues worked out the complete internal structure, that it had twelve wheels, two for each of the five teleprinter positions, and two with an executive function. They determined the number of sprockets on each wheel, and how the advancement of the wheels was interrelated. They had completely recreated the machine without ever having seen one. Tony Sale, who first described this work in a 1997 article in New Scientist, characterized it as the “greatest intellectual feat of the whole war.”

“Knowing the structure of the enciphering machine is a necessity for code-breaking, but it is only the first step. Tutte then put himself to creating an algorithm to find from the enciphered messages the initial settings of the machine wheels. The algorithm that he created, the “Statistical Method”, looked for certain types of resonances, but it had to consider far too many possibilities to be carried out by hand. So it was that, in 1943, the electronic computer COLOSSUS was designed and built by the British Post Office. It was to run the algorithms that Tutte; and his collaborators Max Newman and Ralph Tester; developed, that COLOSSUS was created. This man-machine combination was used to break Fish codes on a regular basis throughout the remainder of the War”.

I hope you understand now the title of this post.

 In today’s Web the enciphering machines are search engines, but the underlying principles driving the Search Engines War are the same.

Emphasized words should make sense to students of the Web Mining course.

Perpetuating LSI Misconceptions

December 11, 2007

Mr. Nick Yorchak from Fusionbox and an alleged SEO “expert” has written this Sitepronews.com article about LSI, which perpetuates myths and wrong statements about LSI, similar to those claimed by Mr. Aaron Wall at this SearchEngineJournal article, and by Valerie DiCarlo in this unfortunate article.

Mike Duz has written a quick rebuttal to Yorchak.

Wishful Thinking: Let us hope that in 2008 SEOs learn how SVD works so they stop spreading misinformation about what is LSI.

To learn about SEO misconceptions regarding LSI, check my tutorial series on the topic, starting with

Tutorial 1: Understanding SVD and LSI

Fortunately, more and more SEOs, like Andy Beal here (MarketingPilgrim.com) and Melissa Fach here (SEOAware.com), are realizing what is not LSI.

BTW, here is an “invitation” issued by Mike Grehan and me back in July, 2007: A Call to SEOs Claiming to Sell LSI.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

LSI, According to an SEOMOZ Glossary

September 3, 2007

In A Complete Glossary of Essential SEO Jargon an SEOMOZ poster defines LSI as follows:

“LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document. SEOs refer to these same groups of words as “Long Tail”. The majority of searches consist of three or more words strung together. See also “long tail”. The significance is that it might be almost impossible to rank well for “mortgage”, but fairly easy to rank for “second mortgage to finance monster truck team”

I have been asked to comment about this.

To put the post in perspective, a jargon glossary is like a collection of expressions used within a specific group of individuals with similar interests. Normally jargon is not intended for outsiders.

Overall, the post is a nice coffee table reading. The title states this is a complete glossary of essential SEO jargon. However, it can be argued whether the glossary is complete or if some entries of the glossary are indeed essential to SEOs.

Within SEO circles, jargon connected to search engine technology often comes with two elements:

(a) oversimplification

(b) misinformation

To the poster’s credit, not all entries of the glossary have (a), (b), or both, but are actually informative. Like some of the comments these generate, some are entertaining.

Unfortunately the LSI entry comes with both, (a) and (b). Last time I revisited the post the LSI entry was ignored by commenters. I could have posted these comments there and add content to their blog, but I decided at the last minute to add content to this blog, instead.

Now let’s comment on the sustantive part.

Firstly, two different concepts are almost concatenated by the poster: LSI and the so-called “long tail”. The former is based on SVD, and the later is an expression that describes a distribution. Research on long tail-shaped distributions are found in Mandelbrot’s early work from the 50’s and 60’s, and even before Mandelbrot. Page 84 of James Gleick’s best-seller, Chaos (1987) also mentions a long tail distribution Mandelbrot came across.

Secondly, LSI is not exactly document indexing as some may loosely imply by reading the LSI entry and as many SEOs have claimed in the past. LSI is applied to already indexed documents from which terms have been extracted and already scored with a particular term weight model. Thus before applying LSI, terms and docs are identified and indexed. Now using LSI to cluster terms and documents and then reclassifying these is a different thing. Sometimes this is called reindexing and loosely referred to as “indexing” by few folks.

The initial statement of the LSI entry is simply sloppy, a hearsay, and made out of thin air: “LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document”.

The other problem with this statement is the informational service it provides to the casual reader, who might believe and repeat such notion of LSI across the Web. Besides, LSI is not essential to SEOs.

A Call to Expose SEO Liars

August 29, 2007

Since A Call to SEOs claiming to Sell LSI many are finally realizing they were taken/gamed by crook SEOs selling snake oil in the form of spurious LSI arguments. It is now time to issue a call to expose all these sinisters marketers that are giving a black eye to the search marketing industry. So, you are welcome to join great guys like Mike Duz, David Petar, and Mike Grehan and expose these people.

If you prefer, do like Dan Thies and blog about their myths. In Lies, Damn Lies, Thies has exposed another old SEO myth: keyword density. Here are additional reasons against this myth many marketers are still hanging around:

2007/05/09 Keyword Density Myth – The Devil’s Advocate

2007/05/07 Keyword Density (KD): Revisiting an SEO Myth

On the Evolution of SEO Myths

The evolution of KD myths and KD tools within SEO circles is anecdotal. It is quite similar to the evolution of LSI-based SEO myths promoted by almost the same marketers. There is a clear pattern of deception:

Repeat a hearsay many times, spin it, play with words, convince cheerleaders to repeat like parrots your hearsays and then repeat everything again until many cheerleaders, peers, and “experts” repeat your nonsense in blogs and seo books. Invent formulas out of thin air and tools that support these, etc, etc, etc.

If you prefer, misquote or copy/dump IR papers and patents in your blog to give the impression you know about information retrieval. Then, stretch these IR papers or patents to your heart needs or to whatever you are trying to sale or promote. That can be your own image or other crooks services.

Two wings of the same bird

That’s how the KD and LSI SEO myths have survived all these years. These are two wings of the same bird. Unfortunately the very same marketers go to fancy search marketing conferences, blogs, forums and few other channels to spread the same misinformation or to induce others into error. No wonder Mike Grehan has called these ‘hot air’.

Take for instance, those marketers that have preached about LSI for years or selling “LSI-like services” without even having a clue on how SVD actually works. They either do so to build an image as “experts” or to intentionally deceive their peers and clients, because of vested interests.

When caught with the pants off they often have two choices:

1. recanting.
2. recoiling.

The few raise the royal “we” and “honest” flag and then resource to throwing dirt rather than prove their case regarding their LSI claims. As far as I’m concern they can throw dirt or scream like babies all they want. They deserve their head to be hammered away any day of the week.

These are the very same folks that give a black eye to the damn search marketing industry, by deceiving the public and prospective clients while posing as honest business guys. No wonder so many IR folks perceive SEOs just as vulgar spammers.

As I always say to peer IRs and graduate students, not all SEOs are deceivers. Some are indeed ethical and quite honest. However, the bad apples are easy to spot.

More likely the more vocal “SEO experts” are the less they know about information retrieval and search engines. To be on the safe side, stay away from those that peer marketers call “SEO experts”. As we say in Spanish: ‘Ante la duda, saluda’.

Many of these have been exposed many times and in different places. Here are some references for your perusal:

SVD and LSI Tutorial 1: Understanding SVD and LSI

SEOs and their LSI Misconceptions

LSI Blog Posts and SEOs

When SEOs are caught in Lies