Archive for the ‘Latent Semantic Indexing’ Category

Search Engines Architecture Week 5

April 11, 2008

Week 5 Agenda

Lecture Session

Building a Parser
Parsing Techniques and Implementations

Lab Session

Building a Query Normalizer
Building a Document Parser
Building a URL Normalizer

Comments

A reminder that hard and digital copies are required for all lab reports to get full credit. I need to evaluate whether the programs actually work as intended.

Report deadlines: To be announced, to accommodate class needs and pending work from previous labs.

Demystifying LSI Video

April 7, 2008

Here is a video of my presentation, Demystifying LSI, at the OJOBuscador Congress 2.0, Madrid, Spain, 2007. One year later, nothing has changed. Many of the same crook SEOs exposed during the congress are still deceiving the public about what LSI is.

Unfortunately, the video quality and lighting are not good enough to see the PDF slides, and the presentation is in Spanish. Since the attendees were not scientists, I spoke very slowly for over an hour.

Want to get bored for the next hour? View the video.

Thanks to N. Valenzuela Alonso, Director of SEO and Search Engine Marketing of Media Bit, S.L. for the link (www.ithinksearch.com/2008/03/31/video-lsi-de-edel-garcia-desmitificando-lsi/).

Here is also the presentation of Carlos Castillo (Chato), from Yahoo! Research Spain:

Adversarial IR with Web Spam, parts 1 and 2 
(http://www.ojobuscador.com/2007/06/14/ir-con-adversario-y-webspam-videopost/).

I had a great time talking with Carlos, a former grad student of Ricardo Baeza-Yates.

Baeza-Yates, Andrei Broder, Gerard Salton, Keith van Rijsbergen, and a few others have helped to shape what is known today as Information Retrieval research.

Speaking of Andrei Broder (one of the main researchers behind the mighty old AltaVista), here is also a great interview, courtesy of the OJOBuscador site:
http://www.ojobuscador.com/2006/05/20/entrevista-a-andrei-broder/

 

Addressing Some LSI Questions

April 2, 2008

At the last Search Engines Architecture lecture we discussed LSI and Terrier. Great questions were raised. Some of these follow:

Q: How many dimensions to keep?
A: This is done by trial and error. I have a research project on the topic. None of the current ways of addressing this problem convince me.

Q: How do we compute a truncated version of the initial matrix, A?
A: After SVDing A, truncate U, S, and V by retaining the first k columns of U and V (that is, the first k rows of V transpose) and the first k diagonal elements of S. Multiply these as discussed in class, Ak = Uk Sk Vk^T, to get the truncated A.
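A minimal NumPy sketch of this truncation follows. The 5-term by 3-document matrix and the helper name truncated_svd are made up for illustration; they are not taken from the lab material.

```python
import numpy as np

def truncated_svd(A, k):
    """Return Uk, sk, Vkt: the first k columns of U, the first k
    singular values, and the first k rows of V transpose."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

k = 2
Uk, sk, Vkt = truncated_svd(A, k)

# Ak = Uk Sk Vk^T, the rank-k approximation of A.
Ak = Uk @ np.diag(sk) @ Vkt
```

Ak has the same shape as A but rank k. The Frobenius norm of A - Ak equals the square root of the sum of the squared discarded singular values.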

Q: To compute the query vector in the reduced space, do we need to compute A truncated for each query?
A: No. The new coordinates of this vector are defined as
qk = q^T Uk Sk^-1
This means that Uk and Sk, computed once from A, can be called from the cache. See the fast track tutorial
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf
over at Mi Islita.com site.
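As a rough sketch of the query mapping, assuming a made-up 5-term by 3-document matrix and a hypothetical helper fold_in_query, the query is projected with the cached Uk and singular values, then compared against the document coordinates (the rows of Vk) by cosine similarity:

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # computed once, then cached
Uk, sk = U[:, :k], s[:k]
docs_k = Vt[:k, :].T                               # one row per document

def fold_in_query(q, Uk, sk):
    """q^T Uk Sk^-1; dividing by sk is the same as multiplying by Sk^-1."""
    return (q @ Uk) / sk

q = np.array([1.0, 0.0, 0.0, 0.0, 1.0])            # hypothetical query vector
qk = fold_in_query(q, Uk, sk)

# Cosine similarity between the query and each document in the reduced space.
scores = (docs_k @ qk) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
ranking = np.argsort(-scores)                       # best match first
```

One sanity check on the formula: folding in a column of A returns that document's own coordinates in Vk.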

Q: Do I need to compute A truncated each time a new document is added or existing ones are modified?
A: For small matrices the answer is YES. However, for huge matrices we can resort to updating/appending techniques. Some of these add doc vectors without recomputing the previous matrix. Beyond a certain point this can compromise orthogonality, though.
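One such appending technique is usually called folding-in. Naming it here is an assumption, since the answer above does not say which method is meant. In the sketch below, with made-up data and a hypothetical helper, the new document is projected with the existing Uk and Sk and appended as a new row of Vk. The appended matrix no longer has orthonormal columns.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # Vk: one row per document

def fold_in_document(d, Uk, sk):
    """Project a new document vector with the existing Uk and Sk (no new SVD)."""
    return (d @ Uk) / sk

d_new = np.array([0.0, 1.0, 0.0, 1.0, 1.0])  # hypothetical new document
Vk_appended = np.vstack([Vk, fold_in_document(d_new, Uk, sk)])

# Distance of Vk^T Vk from the identity, before and after appending.
drift_before = np.linalg.norm(Vk.T @ Vk - np.eye(k))
drift_after = np.linalg.norm(Vk_appended.T @ Vk_appended - np.eye(k))
```

drift_before is zero up to rounding, while drift_after is not. Each folded-in vector adds to the drift, which is why a full recomputation is eventually needed.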

Q: How do I use Desktop Terrier?
A: Follow the instructions provided in the updated version of Lab Report 2.

Search Engines Architecture Week 2

March 14, 2008

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
Breadth-First and Depth-First Web Crawlers
The Terrier Desktop Search Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of A^T A and the trace of this matrix. Do the same for Ak^T Ak. Compare results and draw some conclusions. What important property is confirmed?
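For those who want to check their spreadsheet numbers, here is a small NumPy sketch of the comparison. The matrix is a made-up example, not the one from Lab 1, and trace_and_eigsum is a hypothetical helper.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def trace_and_eigsum(M):
    """Return the trace of M^T M and the sum of its eigenvalues."""
    G = M.T @ M                      # symmetric, so eigvalsh applies
    return np.trace(G), np.linalg.eigvalsh(G).sum()

trace_full, eigsum_full = trace_and_eigsum(A)
trace_k, eigsum_k = trace_and_eigsum(Ak)
```

The two numbers in each pair should agree up to rounding. Both can also be compared with the sum of the squared singular values that were kept.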

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize Vk^T and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from Ak^T Ak. Explain your observations.
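A possible starting point for the first half of this task, on made-up data: each column of Vk^T (one document) is scaled to unit length, and the product of the normalized matrix with itself gives a document-document cosine similarity matrix. Reading "similarity matrix" as a cosine matrix is an assumption, and the cluster extraction step is not shown.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vkt = Vt[:k, :]                                   # k x documents

# Scale each column (document) to unit Euclidean length.
Vkt_unit = Vkt / np.linalg.norm(Vkt, axis=0, keepdims=True)

# Document-document cosine similarities in the reduced space.
sim = Vkt_unit.T @ Vkt_unit
```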

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286.
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Lindsay Smith (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations; Boldi, P., Santini, M., and Vigna, S. (2007).
PDF

How Many SEO Myths In One Sentence?

March 12, 2008

Just when I thought I had read enough SEO myths, here is another SEO “expert” combining several of them in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

LSI: How Many Dimensions to Keep?

February 13, 2008

In How to Populate a Matrix for SVD I referred readers to Igvita’s great blog posts on SVD. A recent visit to the blog shows it is still very much alive and equally interesting. The issues being discussed are not really new, though.

When we lecture on SVD, an issue that sooner or later arises is how many dimensions k to keep. A recent visitor to the aforementioned blog finally raised the same question.

Can you pls give me a clue as to how we decide how many dimensions to project our data onto when using SVD?

How many dimensions to keep is the so-called rank-k approximation problem, which often leads to the dreaded dimensionality reduction curse, in which performance can be compromised.

In the Latest SEO Incoherences (LSI) post we mentioned that this issue was already addressed by Dr. Susan Dumais, many times, throughout her first papers and talks on LSI. In that post we referred readers to Dumais’s talk Transcription of the Application presentation by Susan Dumais, Bellcore (now at Microsoft). That talk is now a classic in the history of LSI.

In those days Dumais’s approach was simply “by seat of the pants”:

Let me end, as my time is running out, with some of the statistical issues that we have encountered and that I hope you have some hints about. The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it doesn’t work. You know when you have too few dimensions. We would like some better methods for doing this, things like the scree test don’t seem to correspond very well to behavioral data that we have.

Later, during the Q&A session, participants revisited this issue. Let us reproduce the exchange between the participants and Dumais:

PARTICIPANT: Thank you, Susan. Questions from the floor?

PARTICIPANT: I’m a little nervous that if someone was browsing the Web and we hoped to put some of this material in the Web, that we’re in trouble. We’re talking about seat of the pants and underwear models, that people are going to get the wrong context for why we’re here. But that is part of the big problem that Susan is talking about.

PARTICIPANT: I thought I would just mention an entirely different approach to this problem, with Joe (word lost) at EDS. What we’re doing is –

PARTICIPANT: Can you get to a mike?

PARTICIPANT: We are using a poisson model for the word counts. Then we’re interested in finding maximum likelihood estimates for the clustering, and we found various combinations of simulated annealing and markup chain Monte Carlo to work very well with funding these things.

One of the nice things in a model based approach is that you get natural measures of association rather than just SVD types of things, although it could be slower.

PARTICIPANT: I think one thing we will try to ask everyone after the conference is to send us electronically two or three references of relevant work that we can disseminate in this way, because we do hope to learn about new approaches and new methods and related work. So keep that in mind as the discussion progresses. We will send out E-mail requesting those in electronic form.

PARTICIPANT: (Comments off mike.)

DR. DUMAIS: It is first of all not clear that the 300 or 400 dimensions we have used for the trek databases is optimal. We find that performance is still increasing up to 400 dimensions; it may well increase beyond that.

In fact, I should mention that if you plot performance as a function of number of dimensions, what you get is an inverted U function that is heavily skewed. That is, performance increases dramatically as you go from 20 or 30 up to several hundred dimensions, and then it tails off gradually through the level of performance that you see with raw key word matching, which is the full dimensional solution.

We don’t know that we have reached the peak. In problems where we know what the optimal number of dimensions is, we have found that the peak is not so sharp.

Twenty years later (the first LSI papers saw the light in 1988, not in 1990 as some SEOs have incorrectly claimed), many research advances on SVD in relation to LSI have been published. Old IR ideas regarding LSI have been dropped and new ones have been adopted. That is what research is all about.

Still, how many dimensions to keep remains an open issue and a “by seat of the pants” one. All kinds of guidelines have been tried, but in the end we need to test and retest the system under examination.

I have even tested my own guideline: keep the top k singular values that amount to more than X percent of the trace of the S matrix, where

S is the matrix of singular values.
X is a threshold value, usually 80-85%
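The guideline can be sketched in a few lines of plain Python. The function name choose_k and the sample singular values are made up for illustration. The rule returns the smallest k whose cumulative share of the trace of S exceeds X.

```python
def choose_k(singular_values, threshold=0.80):
    """Smallest k such that the top k singular values amount to more than
    `threshold` of the trace of S (the sum of all singular values).

    Assumes the values are sorted in decreasing order, as an SVD returns them.
    """
    total = sum(singular_values)
    running = 0.0
    for k, value in enumerate(singular_values, start=1):
        running += value
        if running / total > threshold:
            return k
    return len(singular_values)

# Hypothetical singular values; cumulative shares are 0.4, 0.7, 0.9, 1.0.
example = [4.0, 3.0, 2.0, 1.0]
k_80 = choose_k(example, threshold=0.80)
k_50 = choose_k(example, threshold=0.50)
```

The threshold is still a free parameter, so this only automates the bookkeeping, not the choice of X.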

But, again, some would ask: why 80%? Why not 90%, 70%, 60%, etc?

While the above guideline works for many systems, I have stumbled on some systems for which the above threshold is not good. So I always come back to “find X experimentally or by seat of the pants”.

We could treat this as an optimization problem and use Nelder-Mead multivariate sequential simplex optimization, but I haven’t tried this yet. I’m not sure if this is the way to go either, but it might be worth testing.

Another idea is to iteratively update-test-update-test the matrix using any of the current SVD updating methods for several X values. I need to set aside some time for this one to see what comes out.

 I’m also open to suggestions.

For those interested, a 1.0 Mb download of Dumais’s 1995 presentation is available. If you have problems downloading it, let me know. I can send you a zip file.

April Kontostathis, from Ursinus College, in Essential Dimensions of Latent Semantic Indexing (LSI), proposes an interesting approach to address aspects of this problem. She illustrates her approach with a model wherein term weights are computed using a well known base-2 LOG model for local weights combined with the ENTROPY model for global weights.

More work is still needed along these lines.

Association & Scalar Clusters Tutorial - Part 1

January 22, 2008

I am writing a tutorial series on Cluster Analysis. It is my pleasure to announce that the Association and Scalar Clusters Tutorial - Part 1: Back Mapping Term Clusters to Documents was uploaded a few days ago.

Online publication was announced in advance to subscribers of the IR Watch - The Newsletter, so they already have an edge over regular readers and visitors of Mi Islita.

Abstract follows:

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A “reaction” equation approach is used to break down the classification problem to a sequence of steps. From the initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified. A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters. Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation, and document classification are discussed.

During last night’s lecture (Web Mining Course), I applied the back mapping technique to scalar clusters generated from LSI. The technique provides additional information and reasons as to how and why documents score as observed after implementing SVD. A clear connection with Fuzzy Set Theory was made.

Students taking the Web Mining Course will find this tutorial quite handy.

Web Mining Week 8

January 21, 2008

Week 8 Agenda

 Take-Home Work 3 and Web Mining Course FAQs
LSI and Scalar Cluster Analysis: An EXCEL Spreadsheet Approach (PPT presentation)
LSI and Fuzzy Sets = Fuzzy LSI
Introduction to Intelligence Searches (PPT presentation)
Bonus: My IPAM Lost Pictures at the 2006 Document Indexing Workshop

Required Reading Material

http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf 
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Finding Topic-Specific Posts

January 18, 2008

Global Term Weights based on Entropies

January 16, 2008

A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.

Global weights based on entropies are frequently combined with local and normalization weights into overall weights.  These are then used to populate a term-doc matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.

The following set of equations define the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:

Global Entropy Weights

Evidently,

G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.

Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms concentrated in a small number of documents, while lowering the weights of terms that are spread evenly across the collection.

Note that the convention is to set p log p to 0 at the boundary cases; i.e., p log p = 0 if p = 0 (by convention) or p = 1 (since log 1 = 0).
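The two extreme cases can be checked with a short sketch. It assumes the commonly used form of the entropy weight, G = 1 + (sum over documents of p log p) / log N, where p is the term's frequency in a document divided by its total frequency in the collection. The original equations are not reproduced in this post, so treat the formula as an assumption that happens to agree with the two cases stated above. The function name is made up.

```python
import math

def entropy_global_weight(freqs):
    """Global entropy weight of one term, given its frequency in each of
    the N documents of the collection.

    Assumed form: G = 1 + sum_j(p_j * log p_j) / log N, with p_j = f_j / sum(f).
    """
    n_docs = len(freqs)
    total = sum(freqs)
    acc = 0.0
    for f in freqs:
        if f > 0:                      # convention: p log p = 0 when p = 0
            p = f / total
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n_docs)

g_equal = entropy_global_weight([2, 2, 2])    # term equally mentioned in all 3 docs
g_single = entropy_global_weight([5, 0, 0])   # term present in just one doc
g_mixed = entropy_global_weight([3, 1, 0])    # anything in between
```

The base of the logarithm does not matter here, since it cancels in the ratio.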

Web Mining Week 7

January 14, 2008

Week 7 Agenda

Review of Association and Scalar Clusters
Review of Vector Space Models
LSI & SVD: Demystifying LSI SEO Myths (OJOBuscador Congress, Madrid; PDF Presentation)
LSI & Keyword Research (PDF Presentation)
SVD Noise Filtering: Principal Component Analysis (PCA)

Required Reading Material

Tutorial Series
This is part one of a five-part tutorial series:
http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

Fast Tracks
These are quick tutorials, with to-the-point calculations:
http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Blog Posts
These are IR blog posts designed to fight back against misinformation promoted by unethical SEOs and Spammers:
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/page/1/?s=lsi
http://irthoughts.wordpress.com/page/2/?s=lsi

Blog Category
This is a blog category pointing to a collage of posts that demystify SEO nonsense about LSI. Some are about topics that overlap with LSI:
http://irthoughts.wordpress.com/category/latent-semantic-indexing/   

From Keyword Density to William Tutte’s Legacy

December 20, 2007

From Keyword Density to Keyword Distribution

Finally we have the Christmas Break from graduate school.

In my last Web Mining Course lecture before the Christmas Break, I tried to explain to students the importance of incorporating word spacing in information retrieval algorithms and in document relevance assessments. I explained why ideas like SEOs’ keyword density (KD), the traditional local term weight model known as FREQ (Term Count) and used in early papers on Vector Space and LSI models, and the like are poor estimators of document relevance.

Among other theoretical reasons, it was discussed that a term mentioned X times is not necessarily X times more important than other terms. In addition, KD and the term count model cannot attenuate frequencies. We then discussed several frequency attenuation models (keyword spam filters) that also work as term weight scoring models. These can dampen the effect of abnormal repetition of terms, raise a spam flag, and do not require any reference to KD “tales”.

We also discussed several scenarios in which one could use word distributions and co-occurrence to analyze textual information, far better than with the aforementioned “crapstimators”. For instance, word spacing can be used in encryption/steganographic algorithms to uncover hidden messages, profile writing styles/people, attribute authorship of text, assess plagiarism, fraud, etc.

I’m happy that not all SEOs are buying into the keyword density nonsense and similar “crapstimates”, as I can see from these SEOmoz posts.

From Keyword Distribution to William Tutte’s Legacy

This morning I came across a nice biography of one of those venerable giants: the late William Tutte. Beautifully written by Dan Younger, the biography is a tribute to Tutte’s greatness. Interesting to point out in relation to word spacing theory is this portion of Younger’s writing (emphasis added):

“Tutte’s great contribution was to uncover, from samples of the messages alone, the structure of the machines which generated these codes. This came about as follows. In August 1941, a German operator sent a Fish-enciphered teleprinter message of some 4000 letters from Athens to Berlin. For some reason, the message was not received properly and so it was resent. Against all guidelines, it was sent with the same setting. It was identical in content, but it differed slightly, in word spacing and punctuation. John Tiltman of Bletchley was able to use this blunder to find both the message and the obscuring string that was added to make up the enciphered message. But that seemed to be all that could be found, when Tutte was presented with the case in October.”

“Tutte began by observing the machine generated obscuring string carefully. Splitting it up into various lengths, he noticed signs of periodicity. For the first of the five teleprinter tape positions, the regularity he supposed arose from a wheel of 41 sprockets. And then at the last position, one of 23 sprockets. Over the next months, Tutte and colleagues worked out the complete internal structure, that it had twelve wheels, two for each of the five teleprinter positions, and two with an executive function. They determined the number of sprockets on each wheel, and how the advancement of the wheels was interrelated. They had completely recreated the machine without ever having seen one. Tony Sale, who first described this work in a 1997 article in New Scientist, characterized it as the “greatest intellectual feat of the whole war.”

“Knowing the structure of the enciphering machine is a necessity for code-breaking, but it is only the first step. Tutte then put himself to creating an algorithm to find from the enciphered messages the initial settings of the machine wheels. The algorithm that he created, the “Statistical Method”, looked for certain types of resonances, but it had to consider far too many possibilities to be carried out by hand. So it was that, in 1943, the electronic computer COLOSSUS was designed and built by the British Post Office. It was to run the algorithms that Tutte; and his collaborators Max Newman and Ralph Tester; developed, that COLOSSUS was created. This man-machine combination was used to break Fish codes on a regular basis throughout the remainder of the War”.

I hope you understand now the title of this post.

 In today’s Web the enciphering machines are search engines, but the underlying principles driving the Search Engines War are the same.

Emphasized words should make sense to students of the Web Mining course.

Perpetuating LSI Misconceptions

December 11, 2007

Mr. Nick Yorchak from Fusionbox, an alleged SEO “expert”, has written this Sitepronews.com article about LSI, which perpetuates myths and wrong statements about LSI, similar to those claimed by Mr. Aaron Wall in this SearchEngineJournal article, and by Valerie DiCarlo in this unfortunate article.

Mike Duz has written a quick rebuttal to Yorchak.

Wishful Thinking: Let us hope that in 2008 SEOs learn how SVD works so they stop spreading misinformation about what LSI is.

To learn about SEO misconceptions regarding LSI, check my tutorial series on the topic, starting with

Tutorial 1: Understanding SVD and LSI

Fortunately, more and more SEOs, like Andy Beal here (MarketingPilgrim.com) and Melissa Fach here (SEOAware.com), are realizing what LSI is not.

BTW, here is an “invitation” issued by Mike Grehan and me back in July, 2007: A Call to SEOs Claiming to Sell LSI.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

A few months ago I reviewed a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let A^T be the transpose of A. An unnormalized co-weight matrix, Cu, is then defined as

Cu = A*A^T

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless of how many times the term occurs in the document. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.
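The relationship can be illustrated with a small NumPy sketch. The frequencies and the global and normalization weights are made-up numbers. The Jaccard form used for normalization, c_ij / (c_ii + c_jj - c_ij), is an assumption, since the post does not spell it out.

```python
import numpy as np

def co_weight(A):
    """Unnormalized co-weight matrix Cu = A * A^T (terms x terms)."""
    return A @ A.T

def jaccard_normalize(Cu):
    """Assumed Jaccard form: c_ij / (c_ii + c_jj - c_ij).
    Requires every term to occur at least once (nonzero diagonal)."""
    d = np.diag(Cu)
    return Cu / (d[:, None] + d[None, :] - Cu)

# Hypothetical raw occurrences fij (rows = terms, columns = documents).
F = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
])
G = np.array([0.5, 1.0, 0.2])   # hypothetical global weights Gi
N = np.array([1.0, 0.8, 0.9])   # hypothetical normalization weights Nj

# General case: aij = Lij*Gi*Nj, here with Lij = fij.
A = F * G[:, None] * N[None, :]
Cu_weight = co_weight(A)

# Particular case: Gi = 1, Nj = 1, Lij = fij gives a co-occurrence matrix.
Cu_cooc = co_weight(F)
Cn_cooc = jaccard_normalize(Cu_cooc)

# Binary case: presence (1) or absence (0) of term i in document j.
Cu_binary = co_weight((F > 0).astype(float))
```

In the binary case each off-diagonal entry counts the documents in which two terms appear together, which is the only case where "co-occurrence" is a literal description.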

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fij. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advise graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid this loose usage.

LSI, According to an SEOMOZ Glossary

September 3, 2007

In A Complete Glossary of Essential SEO Jargon an SEOMOZ poster defines LSI as follows:

“LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document. SEOs refer to these same groups of words as “Long Tail”. The majority of searches consist of three or more words strung together. See also “long tail”. The significance is that it might be almost impossible to rank well for “mortgage”, but fairly easy to rank for “second mortgage to finance monster truck team”

I have been asked to comment about this.

To put the post in perspective, a jargon glossary is like a collection of expressions used within a specific group of individuals with similar interests. Normally jargon is not intended for outsiders.

Overall, the post is nice coffee table reading. The title states this is a complete glossary of essential SEO jargon. However, it can be argued whether the glossary is complete or whether some of its entries are indeed essential to SEOs.

Within SEO circles, jargon connected to search engine technology often comes with two elements:

(a) oversimplification

(b) misinformation

To the poster’s credit, not all entries of the glossary suffer from (a), (b), or both; some are actually informative. Like some of the comments they generate, some are entertaining.

Unfortunately the LSI entry comes with both (a) and (b). Last time I revisited the post the LSI entry was ignored by commenters. I could have posted these comments there and added content to their blog, but I decided at the last minute to add content to this blog, instead.

Now let’s comment on the substantive part.

Firstly, two different concepts are almost conflated by the poster: LSI and the so-called “long tail”. The former is based on SVD, and the latter is an expression that describes a distribution. Research on long tail-shaped distributions is found in Mandelbrot’s early work from the 50’s and 60’s, and even before Mandelbrot. Page 84 of James Gleick’s best-seller, Chaos (1987), also mentions a long tail distribution Mandelbrot came across.

Secondly, LSI is not exactly document indexing, as some may loosely infer from reading the LSI entry and as many SEOs have claimed in the past. LSI is applied to already indexed documents from which terms have been extracted and already scored with a particular term weight model. Thus before applying LSI, terms and docs are identified and indexed. Now using LSI to cluster terms and documents and then reclassifying these is a different thing. Sometimes this is called reindexing and loosely referred to as “indexing” by a few folks.

The initial statement of the LSI entry is simply sloppy, hearsay, and made out of thin air: “LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document”.

The other problem with this statement is the disservice it does to the casual reader, who might believe and repeat such a notion of LSI across the Web. Besides, LSI is not essential to SEOs.

A Call to Expose SEO Liars

August 29, 2007

Since A Call to SEOs claiming to Sell LSI many are finally realizing they were taken/gamed by crook SEOs selling snake oil in the form of spurious LSI arguments. It is now time to issue a call to expose all these sinister marketers that are giving a black eye to the search marketing industry. So, you are welcome to join great guys like Mike Duz, David Petar, and Mike Grehan and expose these people.

If you prefer, do like Dan Thies and blog about their myths. In Lies, Damn Lies, Thies has exposed another old SEO myth: keyword density. Here are additional arguments against this myth, which many marketers are still clinging to:

2007/05/09 Keyword Density Myth - The Devil’s Advocate

2007/05/07 Keyword Density (KD): Revisiting an SEO Myth

On the Evolution of SEO Myths

The evolution of KD myths and KD tools within SEO circles is anecdotal. It is quite similar to the evolution of LSI-based SEO myths promoted by almost the same marketers. There is a clear pattern of deception:

Repeat a hearsay many times, spin it, play with words, convince cheerleaders to repeat your hearsay like parrots, and then repeat everything again until many cheerleaders, peers, and “experts” repeat your nonsense in blogs and SEO books. Invent formulas out of thin air and tools that support these, etc, etc, etc.

If you prefer, misquote or copy/dump IR papers and patents in your blog to give the impression you know about information retrieval. Then, stretch these IR papers or patents to your heart’s content or to fit whatever you are trying to sell or promote. That can be your own image or other crooks’ services.

Two wings of the same bird

That’s how the KD and LSI SEO myths have survived all these years. These are two wings of the same bird. Unfortunately the very same marketers go to fancy search marketing conferences, blogs, forums and few other channels to spread the same misinformation or to induce others into error. No wonder Mike Grehan has called these ‘hot air’.

Take, for instance, those marketers who have preached about LSI for years or sold “LSI-like services” without even having a clue about how SVD actually works. They either do so to build an image as “experts” or to intentionally deceive their peers and clients, because of vested interests.

When caught with their pants down they often have two choices:

1. recanting.
2. recoiling.

A few raise the royal “we” and the “honest” flag and then resort to throwing dirt rather than proving their case regarding their LSI claims. As far as I’m concerned they can throw dirt or scream like babies all they want. They deserve their head to be hammered away any day of the week.

These are the very same folks that give a black eye to the damn search marketing industry, by deceiving the public and prospective clients while posing as honest business guys. No wonder so many IR folks perceive SEOs just as vulgar spammers.

As I always say to peer IRs and graduate students, not all SEOs are deceivers. Some are indeed ethical and quite honest. However, the bad apples are easy to spot.

Most likely, the more vocal “SEO experts” are, the less they know about information retrieval and search engines. To be on the safe side, stay away from those that peer marketers call “SEO experts”. As we say in Spanish: ‘Ante la duda, saluda’.

Many of these have been exposed many times and in different places. Here are some references for your perusal:

SVD and LSI Tutorial 1: Understanding SVD and LSI

SEOs and their LSI Misconceptions

LSI Blog Posts and SEOs

When SEOs are caught in Lies

Search Smart with WCC’s ELISE

August 9, 2007

The list of subscribers to IRWatch is growing at a fast pace. One of our recent subscribers is a developer at WCC, makers of ELISE Smart Search & Match. This seems to be quite an interesting technology. I highly recommend that readers visit their site http://wcc-group.com/

(more…)

THESUS: Semantic Subsets Based on WWW Links

August 2, 2007

I came across a 2003 paper on mining WWW text links, in which researchers introduced THESUS. Really interesting piece.

The abstract of THESUS: ORGANIZING WEB DOCUMENT COLLECTIONS BASED ON LINK SEMANTICS follows:

(more…)

SEOs and Still Their LSI Misconceptions

July 19, 2007

I just came across this article

http://seo-and-google.blogspot.com/2007/07/5-tips-to-effective-seo-keyword.html

by Valerie DiCarlo, and I honestly don’t know where these marketers learn all these misconceptions regarding LSI. Perhaps she has been misled as well by the usual suspects and is just trying to make some honest comments she truly believes. Unfortunately, most of her claims are incorrect. I’m commenting on her lines, one by one.

(more…)

What is a Similarity Thesaurus?

July 16, 2007

In my previous post I explained to a reader the difference between inverse term frequency (ITF) and inverse document frequency (IDF), but did not provide practical applications. This post is to explain what ITF is good for. Like IDF, ITF is a global weight measure; i.e., Gi = ITF. Combined with a local weight measure (Lij), it can be used to compute an overall weight. Local weights can be defined in many different ways. Here is one definition:

(more…)

Random Notes

July 10, 2007

I’m putting the final touches on this month’s issue of IRW, which is running late for reasons all subscribers know by now. It should be out tomorrow.

It is amazing how many are still perpetuating so many misconceptions about “LSI tools”. Here is another example, forwarded to me by Melissa Fach, one of several SEOs who are discovering how many “LSI-based” SEO lies are out there thanks to the usual suspects:

http://courtneytuttle.com/2007/07/05/taking-seo-to-the-next-level-lsi/

(more…)

A Call to SEOs Claiming to Sell LSI

July 9, 2007

Mike Grehan finished his great ClickZ column of June 11, 2007, SEO Is Dead. Long Live, er, the Other SEO, as follows:

“I’ve run out of space again. I’ll come back to the stupidity of the latent semantic indexing issue in my next column.”

(more…)

Research Channel, LRA, Microsoft, and more

June 26, 2007

I forgot to mention that I’m attending ICANN this week, so most posts will be legacy posts, straight from the conference.

The ResearchChannel is a research consortium dedicated to serving as an online channel for the dissemination of cutting-edge technologies. If you want to learn the real stuff under the hood of search engines, just do it through the ResearchChannel. Want to learn the difference between LSA (LSI) and LRA (Latent Relational Analysis)?

(more…)

Closeness, Proximity, Similarity, and Distance

June 20, 2007

Often the distinction among the terms given in the title of this post is not clear in the literature.

Closeness is a generic notion that can be expressed in terms of proximity, similarity or distance.

(more…)

Is SEO Dead?

June 12, 2007

With the catchy title SEO Is Dead. Long Live, er, the Other SEO, my friend Mike Grehan once again has a great ClickZ column wherein he comments on Google’s and ASK’s new approaches to satisfying users’ information needs.

He ends the article as follows:

(more…)

Subsumptions vs Synonyms - Conceptual Indexing Revisited

June 11, 2007

Back in 1997, William Woods, Principal Scientist and Distinguished Engineer at Sun Microsystems Labs, wrote Conceptual Indexing: A Better Way to Organize Knowledge. Although the notion of conceptual indexing turned out to be a complex thing, his paper is still relevant these days, when many SEOs make incorrect claims about how search engines use Latent Semantic Indexing (LSI) and others are paying attention to synonymy and phrase processing patents. This post is based in part on Woods’s manuscript.

(more…)

LSI Blog Posts and SEOs

June 6, 2007

I’m still trying to understand why so many SEOs have LSI backward and why others insist on promoting or explaining something that is not LSI as LSI. Some even repeat fallacies they have heard across the Web or from contaminated pools of knowledge like Wikipedia.

To top it off, I have emails from SEOs who are mad about being misled by other SEO “experts” regarding claims about what LSI is or how it works.

(more…)

When SEOs are Caught in Lies

June 2, 2007

Mr. Aaron Wall is quoting me over at his blog.

Funny how SEOs that are caught in lies brush things off.

Mr. Wall: When people mistake concept A (e.g., LSI) for concept B (e.g., whatever) and spend a few years promoting the former for the latter across the Web, not only is it clear they don’t know what they are talking about, but they also spread fallacies and lead others into error. That is a fact.

(more…)

LSA: A Goldmine for Educators and Curriculum Developers

May 17, 2007

Marco Kalz, M.A., of the Educational Technology Expertise Centre at the Open University of the Netherlands, informed me months ago that the university was organizing the 1st European Workshop on LSA in Technology-Enhanced Learning. Marco is part of the Scientific Committee responsible for organizing the event and a co-author of the workshop proceedings.

It is my pleasure to inform our readers that the event was a complete success. I will ask Marco for additional inside information, which we may include in the next issue of the IRW Newsletter.

(more…)

Zoom in this Theme: The LSI Myth

May 11, 2007

A few days ago, Michael Duz had a great post, The LSI Myth, wherein he describes the nonsense promoted by snake-oil SEO marketers. He has a list of common taglines used by these people:

(more…)