Archive for the ‘SEO Myths’ Category

Random notes prior to 4th July weekend

July 3, 2009

As the 4th of July weekend approaches, here are some notes before hitting to planet oblivious.

1. Yesterday we had an interesting business entrepreneur meeting with the CIO of the Government of Puerto Rico at El Palacio Rojo, Fortaleza.

2. IRW should be out by Monday. Main article: Data Mining Texting.

3. Only monkeys still believe in KD Myths. Ha, Ha.

On Term Repetition and Local Models

May 27, 2009

I’m putting together a piece on several local term weight models. It should be ready in few weeks.

It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.

The article touches on some aspects of the problem of trusting models that lack of attenuation. Here is one snippet on the subject:

<last nail in KD coffin  style=”intensity:100%;”>

“It should be stressed that term repetition not necessarily satisfies users’ queries nor is evidence of:

 Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.

Aboutness (A); e.g., that the document is x times more about the term.

Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.

Relevance (R);i..e., that a document repeating a term x times is x times more relevant.

Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks of attenuation.”

</last nail in KD coffin>

Why IDF is Expressed Using Logs

April 15, 2009

Recently a known SEO (name reserved) inquired me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.

I am partially reproducing/editing my responses, so it might help other SEOs with similar questions.

Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.

1) Why is a log function used for calculating IDF?
3) Would it be accurate to describe IDF as “the ratio of documents in a collection to documents in that collection with a given term”? I’m guessing your answer would be, IDF is the [LOG of " the ratio of documents in a collection to documents in that collection with a given term"]? Which brings us back to question, I guess? hehe

These are recurrent questions students asked me before. The reason for using logs is due to two assumptions frequently made in most IR models; i.e.

I. that scoring functions are additive.
II. that terms are independent.

While in some models II might not be present, both (I and II) play well with logs since these also are additive.

These functions and why the use of logs is explained in the recent RSJ-PM Tutorial http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf

Document Frequency (DF) is defined as d/D, where d is number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).

But since often D > d the log of d/D, that is log(d/D) gives a negative value. To get rid off the negative sign, we simply invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared. Now log(D/d) is conveniently called Inverse Document Frequency.

Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it must follows that for any two events, A and B

p(AB) = p(A)p(B).

Taking logs we can write

log[p(AB)] = log[p(A)]+ log[p(B)]

It is easy to show that for two terms

log(d12/D) = log(d1/D) + log(d2/D)

Inverting and using the definition of IDF we end up with

IDF12 = IDF1 + IDF2

validating assumption I; that IDF as a scoring function is additive.

That is the IDF of a two term query is the sum of individual IDF values. However, this is only valid if terms are independent from one another. If terms are not independent we would have two possibilities; i.e.,

p(AB) > p(A) + p(B)

or

p(AB) < p(A) + p(B)

and we cannot say that the IDF of a two term query (e.g, a phrase) is the sum of individual IDF values. Assuming the contrary as many SEOs think in order to promote some dumb keyword research tools is plain snakeoil.

2) What do you mean by ‘discriminatory power’ in the phrase “IDF is a measure of the discriminatory power of a term in a
collection.”

This is legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms too frequently used are not good discriminators between documents. If a a term is used in too many documents its use to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.

The RSJ-PM Tutorial mentioned above was written to kill for good some misconceptions regarding IDF. In it we explain why IDF is considered by Robertson and Jones a particular RSJ weight in the absence of relevance information.

In a nutshell, IDF is a collection wide estimate and as such the information on whether documents containing the terms being queried are relevant to these is unknown. Similarly, the information on whether documents not containing the query terms are relevant or not is unknown and often remains unscrambled when we just look at the d/D and d/(D – d) collection-wide ratios. All we can say is that relevant documents might have a higher probability of containing query terms in comparison with other documents from the collection as a whole. But we could make such assertion without resourcing to IDF as well.

In the case of Web documents, often these are about multiple topics. Many documents aggregate content from dissimilar sources (news headlines, rss, blogs, etc) and said document content might change in time. The mere mention of a term (regardless of repetition) is not a proof of its relevancy or of its importance with respect to the topics discussed in a document.

Thus, the idea that we can assess if terms are relevant to a document by simply comparing IDF values is missing the whole point and defeats the purpose for which the RSJ-PM model and many of its variants (e.g., BM25) were developed.

I hope this helps to clear up some SEO misconceptions on the topic.

SEOs and Their IDF Myths: Part 3

March 20, 2009

In SEOs and their IDF Myths, we covered how many are mistaking the measure of term specificity known as Inverse Document Frequency (IDF).

In SEOs and their IDF Myths: Part 2, we exposed some of these folks.

In Understanding TFIDF, we wrote a rebuttal.

We are still seeing so many bloggers mistaking IDF for something that is not. We have to conclude these pseudo-teachers either are just trying to sell something or they don’t really understand what term specificity stands for. They should know that IDF is a small pixel section within the bigger picture of the Robertson-Sparck Jones Probabilistic Model for information retrieval.

Thus, we are writing a tutorial on RSJ-PM to kill for good their intentionally misleading efforts. Hopefully, the tutorial will be ready before the month ends. It will be a great way of putting to rest all the false information flying around from the usual agents of misinformation (mostly SEOs). CS students interested in knowing about the pros and cons of probability models in IR will find it useful.

SSN Myths

December 8, 2008

From time to time we hear of some urban legends and myths in connection with social security numbers (SSNs).

One myth has it that SSNs label citizens based on their race or origins. Another myth is that a number can be decoded to spell out names. Let’s debunk these myths.

Regarding the first  myth, according to the SS Administration site (http://www.ssa.gov/history/ssnmyth.html):

“Apparently due to the fact that the middle digits of the SSN are referred to as the “group number,” some people have misconstrued this to mean that the “group number” refers to racial groupings. So a myth goes around from time-to-time that encoded in a person’s SSN is a key to their race. This simply is not true.”

“As should be clear from the explanation of the SSN numbering scheme, the “group number” refers only to the numerical groups 01-99. For filing purposes, the “area numbers” are broken down into these numerical subgroups. So, for example, for area numbers starting with 527 there would be 99 subgroups, one for every number starting with 527-01, and one for every number starting with 527-02, and so on. This was done back in 1936 because in that era there were no computers and all the records were stored in filing cabinets. The early program administrators needed some way to organize the filing cabinets into sub-groups, to make them more manageable, and this is the scheme they came up with.”

“So the “group number” has nothing whatever to do with race.”

Still, some folks like this Google user heard that the fifth digit of a SSN is odd for whites, but even for african-americans and minorities. Not true.

Regarding the second myth. Some have claimed that flipping a SSN might closely spell or encode a name, word, message, etc.

For instance in Feb of 2008, Google won the Dylan Stephen Jayne v. Google Founders lawsuit. Jayne claimed that his social security number upside down spelled ‘Google’. He was seeking a $5 billion compensation.

The United States Court of Appeals for the Third Circuit on appeal from the United States District Court for the Middle District of Pennsylvania(PDF) dismissed the case and resolved in favor of Google that:

“As explained by the District Court, Google and its founders are not state actors, and Jayne’s allegation concerning his coded social security number does not constitute a violation of the Constitution or federal law. We also agree that any amendment of the complaint would be futile.”

I don’t know about you, but to me and based on pure speculations and font-family, flipping upside down ‘Google’ resembles 216009. 

But, there is a problem: this sequence can appear anywhere in a candidate SSN (beginning, end, etc).

True that we can narrow down possible sequences since according to the SSN site the middle two digits cannot be ‘00′ in order to be a valid SSN. With all, three missing numbers are needed to complete a 9-digit sequence. Can you guess how to obtain these?

Still, this guessing exercise does not amount to a case. When it comes to guessing/gaming, you have the right to guess/game all you want to guess/game.

Now for those that believe in things like Numerology, Kabbalah, etc, 216009 can be reduced to 18 and then to 9. Upside down 9 resembles a 6 or a G, which is the first letter in Game, Gaming, Google, and God.

I have placed this post in the SEO Myth category just because the underlying nature of the above myths resembles the dumb nature of many of the myths promoted by SEOs.

Similarity, Pearson, and Spearman Coefficients

October 29, 2008

This post shows the connection between Pearson and Spearman’s Correlation Coefficients with Cosine Similarity and Dot Products. As mentioned in previous posts:

Pearson’s Coefficient is equivalent to Spearman’s Coefficient for raw data that has been transformed into ranks.
http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/

Similarity is not Distance.
http://irthoughts.wordpress.com/2008/10/15/seos-and-their-semantic-distance-myths/

The ’similarity distance’ expression is an oxymoron that should be avoided.
http://irthoughts.wordpress.com/2008/10/20/having-fun-with-oxymorons/

Arbitrary transformations between Distance and Similarity must be avoided.
http://irthoughts.wordpress.com/2007/09/17/binary-similarity-calculator/

It is important to know how to define Distance
http://irthoughts.wordpress.com/2008/10/23/why-defining-distance-is-important/

Transforming Pearson’s Coefficients into Cosine Similarities

A Pearson’s Correlation Coefficient is equivalent to the cosine of the angle of a paired data represented as vectors, provided that the raw data is first transformed into z-scores:

z(xi) = (xi – mean_x)/s_x
z(yi) = (yi – mean_y)/s_y

where the mean is deducted and the difference normalized respect to the standard deviation. Thus, we end up with a mean-centered, deviation-normalized data. A z-score data is often called a ‘centered data’.

An Example

To convince yourself, Pearson’s Correlation Coefficient for the following data is 0.8837…


 

xi yi
2 3
4 2
6 5
20 7

If you center the data and calculate its cosine similarity, it should be the same.

If the centered data is converted into column unit vectors, the dot product of the vectors is also 0.8837… since for unit vectors cosine angle = dot product

Now try this.

Rank the raw data. Next, center it, and then convert into unit vectors. Convince yourself that in this case

Spearman’s = Pearson’s = Cosine Angle = Dot Product = 0.8000…

Since Similarity is not Distance, these are not Distances.

Why Defining Distance is Important

October 23, 2008

In previous posts we mentioned the difference between distance (dissimilarity) and similarity. Both can be used to describe proximity; i.e., how alike objects are. A distance is a metric function, while similarity is a relative judgment of proximity. A generalization of distance is the Minkowski Distance. The Euclidean Distance cannot be greater than the Manhattan or City Block Distance –named in this way because taxicab drivers in Manhattan can only go from one point to another by driving around rectangular city blocks.

We could extend on this topic and provide math arguments, explaining the Minkowski Distance or what is a metric space wherein a distance ‘live’. We could also add insult to injury and explain why many SEOs don’t have a clue when mistaking distance and similarity, or of the non-sense of talking about ’similarity distance’ and ’semantic distance’ in a hyperdimensional space (say ‘Hi’ to SEOs that sell LSI Snake Oil), blah, blah, blah.

Instead, this time I want to provide a real case scenario (taken from Chapter 2 of my upcoming ebook, Keyword Clustering Analysis with Excel) which would help readers and students understand the importance of properly defining distance.

In 2005, the New York Times reported that it was brought to the Court of Appeals’ attention a case wherein a man named James Robbins was accused of selling drugs within 1,000 feet of a school. He was arrested in March 2002 on the corner of Eighth Avenue and 40th Street in Manhattan and charged with selling drugs to an undercover police officer. The nearest school, Holy Cross, is on 43rd Street between Eighth and Ninth Avenues (http://www.nytimes.com/2005/11/23/nyregion/23drugs.html?_r=1&oref=slogin).

Defendant lawyers argued the distance should be measured as a pedestrian would walk city blocks; i.e using the Manhattan or City Block Distance. They claimed the school was more than 1,000 feet away by walking from the site of the arrest.

Law enforcement officials calculated the Euclidean Distance, measuring the distance up Eighth Avenue (764 feet) as one side of a right triangle, and the distance to the church along 43rd Street (490 feet) as another, to find that the length was about 908 feet.

The Court of Appeals upheld his conviction and determined that the Legislature’s intention effectively extended the boundaries of school grounds outward in order to encompass all public areas within a 1000-foot radius of the school (http://www.courts.state.ny.us/ctapps/decisions/nov05/162opn05.pdf). Read reactions to the ruling at http://volokh.com/posts/1132938765.shtml. It is a hilarious discussion.

Having Fun with Oxymorons

October 20, 2008

Over the weekend some asked me to expand on oxymorons, so this post goes.

As mentioned in our previous post, an oxymoron is a combination of contradictory terms. All antonyms are contradictory terms, but not all contradictory terms are antonyms. For instance ‘alone together’ and ‘big baby’ are oxymorons, but only the former consists of antonyms. Thus, it is possible to extract these two types of clusters from a list of oxymorons. We can also extract clusters of oxymorons with a common term or theme.

We must not mistake oxymorons for misnomers. A misnomer is an incorrect designation. Not all oxymorons are misnomers. For example, ‘binary independence’ is a misnomer, but not an oxymoron. All this is explained in my upcoming ebook Keyword Clustering Analysis with Excel.

As mentioned in my previous post, ’similarity distance’ is an oxymoron since distance is dissimilarity. The expression is also a misnomer.

Unfortunately, some IR authors and many SEOs have used the ’similarity distance’ expression. The problem here seems to be a combination of poor selection of words and lack of knowledge about basic IR cluster analysis concepts, particularly of LSI.

In Cluster Analysis, objects are grouped into clusters using Proximity, a criterion of how ‘close’ or alike objects are. Proximity can be defined as Distance (dissimilarity) or Similarity. Clustering by distance is a minimization problem whereas by similarity is a maximization problem.

There are more definitions for similarity than for distance. Which type of proximity and definition to use depends of the type of attributes and scale of attributes of the data.

In a very high dimensional space the notion of distance or similarity is useless, if not meaningless. Thus, in a high dimensional space, talking about a ’semantic distance’ is a waste. We can try to do dimensionality reduction with LSI, but we end up computing cosine similarity, not distance.

The current state of the art is that LSI is also a misnomer as (a) its clustering power is the result of high-order word co-occurrence not semantics and (b) it is not exactly a document indexing method (before applying LSI, documents must be already indexed).

With regard to similarity and distance, it is tempting to think that one is just the numerical complement of the other or that we can blindly transform one from the other. These are short sighted views. Fortunately few IRs believe this. Unfortunately, we cannot say the same about search marketers and SEOs

Want to know more about the difference between these two concepts? Study the topic or read my ebook. Dont’t be fooled by self-proclaimed SEO “experts” or any “seobook”.

SEOs and their Semantic Distance Myths

October 15, 2008

As many of our readers know, one of the goals of this blog is to debunk SEO myths through IR knowledge. At times, we also try to clarify incorrect statements made in the IR field as well. It is not surprising from time to time find IR papers with clear gross errors and misnomers. That’s why we believe the name of this blog, IR Thoughts, is on target.

Today’s SEO myth to debunk is the so-called Semantic Distance Myth in connection with LSI. But, first some definitions. The following material is taken from chapter 1 of my upcoming ebook Keyword Clustering Analysis with Excel.

Dissimilarity characterizes how different objects from a cluster are.

Similarity indicates how similar the objects are.

In IR, it is customary to use the term ‘distance’ to mean dissimilarity. This is because unlike similarity, dissimilarity is a distance metric. Thus, similarity and distance (dissimilarity) are opposite terms. We can convert similarities into distances and viceversa, but not without first understanding the model at hand * (http://www.miislita.com/searchito/binary-similarity-calculator.html).

Beware of Oxymorons and Misnomers

The expression ‘similarity distance’ is an oxymoron or a combination of contradictory terms (like a ‘small giant’, ‘approximately equal’, etc). Unfortunately, the expression has been used in the IR literature (http://www.google.com/search?q=%22similarity+distance%22). Avoid it.

Some search marketers when trying to explain Latent Semantic Indexing (LSI) have used expressions like the ‘semantic distance between words’ when in fact what is being discussed is word similarity or how words relate to each other or to a topic. Used in this context, their discourses are oxymoronic if not dumb, let alone the fact that LSI does not measure any ‘semantic distance’.

Some IR authors have also used the expression ‘distance between words’ in reference to the number of words between any two words. The expression is loosely accepted to describe word spacing/distribution.

Perhaps loosely excluding the last one, all these expressions are misnomers (incorrect designations) since do not conform to the definition of a distance metric. Let us address this point.

Distance Defined

A function f is called a distance if it exhibits reflexivity, symmetry, and triangular inequality. To grasp these concepts, visualize three points (a, b, and c) describing a triangle.

Reflexivity means that the distance from a point to itself is zero; e.g., f(a, a) = f(b, b) = f(c, c) = 0.

Symmetry refers to the fact that the distance between any two points, measured from either one, is the same; e.g., f(a, b) = f(b, a).

Triangular Inequality requires that the distance between any two points, measured from either point, must be equal or less than the distance between these measured through a third point; e.g., f(a, b) + f(b, c) => f(a, c).

If these conditions are not met, the function measure in question is not a distance.

Finally, note that distances cannot be negative and are not upperly bounded, unless their scales have been normalized.

* You might also want to check: http://irthoughts.wordpress.com/2008/01/02/simcalc-binary-similarity-calculator/ 

or

http://www.cs.ualberta.ca/~lindek/papers/sim.pdf

PS.

Note to spammers and SEOs:  Embedding oxymorons and misnomers in documents, particularly in links, could be used as a search engine persuasion trick.

Some might argue whether the expression ‘loosely excluding’ might also qualify as a near oxymoron. For a list of oxymorons or near oxymorons, check these links:

http://www.ethanwiner.com/oxymoron.html

http://cs.bilgi.edu.tr/~hobbittr/critical/carlin_oxymorons.html

http://www.atlantamortgagegroup.com/oxymoronlist.htm

 

Direct-to-Consumer Drug Advertising: A Waste?

September 2, 2008

According to these news articles.

http://www.washingtonpost.com/wp-dyn/content/article/2008/08/29/AR2008082902646.html

http://afp.google.com/article/ALeqM5iQtDzzee581Ixc7sEbUjWFvcMwjg

direct-to-consumer advertising (DTC ads) of drugs through TV, Internet, and other media channels is an about $5B big waste of money. DTC ads, the study finds out, is based on scant data that has a small effect on the sales of drugs.

Prescription drugs are not like selling aspirin, cereal, or popcorn.

Stephen Soumerai, head of the research group that worked on the study thinks DTC advertising has failed because the process of buying prescription drugs is not like buying over-the-counter medication.

According to Soumerai, a person has to see an ad, get motivated, contact their doctor, show up for an appointment, communicate both the condition and the drug, convince the doctor to prescribe it, and then actually fill the prescription, which is also likely to carry an out-of-pocket cost.

It is certainly not a waste for modern snakeoil sellers (marketers, spammers, and scammers) that eat from the $5B pie. Same pattern as usual: theories made out of thin air and scant data.

I will not be surprise if they soon send a paid researcher or someone with vested interests to write a rebuttal.

How Not to Use Correlation Coefficients

August 29, 2008

This is a continuation of yesterday post.

In this post http://irthoughts.wordpress.com/2007/11/20/a-pagerank-rank-correlation/ we discuss a study that claims a PageRank-Rank correlation. The study was criticized by many SEOs on the grounds that it was flawed.

They were right, but none of them provided a statistical analysis assessment, which prompted back then our post.

The study seems to perpetuate SEO myths regarding ranking results.

If you look at the study, the reported slopes were very small, indicating that variables were almost orthogonal (independent), despite the correlation coefficients.

Second, no t-test confidence analysis on the correlation coefficents or slopes were provided.

Third, no transparency in the data sampling process was provided.

This is an example of how not to do stat analysis.

IR and SEO Misnomers

August 14, 2008

Dictionary defense definition lovers, here is one for you: misnomer.

According to Dictionary.com, a misnomer is:

1. a misapplied or inappropriate name or designation.
2. an error in naming a person or thing.

i.e., an inappropriate designation.

Once a while, misnomers are found in IR. Here are at least two:

1. Binary Independence Model
2. Latent Semantic Indexing (LSI)

Here is a quick overview

1. Binary Independence IR Model

Back in 1991 Cooper wrote “Some Inconsistencies and Misnomers in Probabilitistic Information Retrieval”

His article’s abstract states:

“The probabilistic theory of information retrieval involves
the construction of mathematical models based
on statistical assumptions of various sorts. One of the
hazards inherent in this kind of theory construction
is that the assumptions laid down may be inconsistent
with the data to which they are applied. Another
hazard is that the stated assumptions may not
be the real assumptions on which the derived modelling
equations or resulting experiments are actually
based. Both kinds of error have been made repeatedly
in research on probabilistic information retrieval.
One consequence of these lapses is that the statistical
character of certain probabilistic IR models, including
the so-called ‘binary independence’ model, has been
seriously misapprehended.”

In that article Cooper wrote:

“Since one can derive all possible assertions from an inconsistent
theory, such a theory must be meaningless –
entirely lacking in significance or predictive power. It
makes no sense that good experimental results could
come out of an inconsistent theory.”

“It is tempting to explain this conundrum by suggesting
that the inconsistencies in question were only minor
ones. However, there is no such thing as a theory
that is just ‘a little bit’ inconsistent. A theory cannot
be just a little bit inconsistent, any more than the
scientist proposing it can be just a little bit pregnant.
A logical inconsistency, if its implications are followed
out, destroys a theory utterly. It is a disaster, and is
totally unacceptable. If rationality is to be preserved,
inconsistencies simply cannot be tolerated.”

2. Latent Semantic Indexing

Contrary to SEO myths, LSI is not an “indexing” model, but an SVD matrix decomposition technique. Despite of what has been written in the early literature on LSI, today we know that LSI does not assesses semantics, but gets its power from high order co-occurrence relationships. Furthermore, one can grasp latent (i.e., hidden) structures present in a system by using diverse techniques other than LSI. In the early papers on LSI, structured collections of journal articles and abstracts about specific topics -medical, science, and computers-  were used. These were rich in synonyms. Evidently, the latent semantic structures identified these by forming clusters of synonyns and related terms.

Soon after, SEOs that misread those papers developed a synonymity myth around LSI. The fact is that in LSI we can arrive to the so-called “LSI latent clusters” regardless if terms are synonyms or not. This myth is easy to debunk. Once the clusters have been obtained, arbitrarily replace one term from the term-doc matrix that is known to belong to a cluster by a new arbitrary term not present in the collection, keeping the weights intact. Run the SVD algorithm again. You should arrive at the same latent structure, but the new term might not be semantically related at all to terms in the cluster since the algorithm only understands and processes numbers, not semantics. Thus, it processes a la garbage in-garbage out. This is a great topic for another ‘SEO Myth Debunking’ IRW article.

For those SEOs, marketers, and spammers that claim to know what is LSI, like Aaron Wall and the likes, read: http://irthoughts.wordpress.com/2007/05/01/irwatch-may-issue-demystifying-lsi/

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (ACM, SIGIR ’91) (pp 57-61). Chicago, Illinois: ACM.

Mike Grehan
Lies, Lies, and LSI by Mike Grehan
http://www.clickz.com/showPage.html?page=3623571

Bill Slawski
Personalization Through Tracking Triplets of Users, Queries, and Web Pages
http://www.seobythesea.com/?p=535

Rand Fishkin
InfoSearch Media & ContentLogic – Purveyors of Falsehoods
http://www.seomoz.org/blog/infosearch-media-contentlogic-purveyors-of-falsehoods

Lee Odden
5 Myths about SEO
http://www.toprankblog.com/2006/12/5-myths-about-seo/

Marios Alexandrou
The History of Latent Semantic Indexing
http://www.searchgrit.com/history-of-latent-semantic-indexing/

Carson
Web content and LSI mega-rant. Part Two…
http://contentdonebetter.com/2007/03/30/web-content-and-lsi-mega-rant-part-two/

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/  

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

Claps and Slaps, the LSI Way

August 4, 2008

Claps

We are happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at the Indian Institute of Technology in Madras, India is using our SVD LSI tutorial as lecture material for his course: CS625, Memory Based Reasoning in AI. http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf 

Another investigator, this time from the cancer research field, congratulated us for the LSI tutorials. Jaime Fernandez Vera from Structural Biology and Biocomputing, Centro Nacional de Investigaciones Oncologicas, Madrid, Spain wrote (contact info removed):

Estimado Dr. García:

Muchas gracias por poner a disposición de la Comunidad sus magníficas guías prácticas y, en especial, la de LSI que es la que he seguido.

Un abrazo,

Jaime Fernández Vera

Biología Estructural y Biocomputación Structural Biology and Biocomputing
Centro Nacional de Investigaciones Oncológicas

Our LSI/SVD tutorials are also listed in http://www-timc.imag.fr/Benoit.Lemaire/lsa.html huge repository of LSI research resources.

For additional IR resources quoting our tutorials, check the following link at http://www.miislita.com.

http://www.miislita.com/searchito/educational-links.html.

Slaps

Talking about LSI…

Spammers disguised as ethical SEOs and that promote LSI crap are now hidding. There is less talking on the blogosphere on “SEO LSI” and “LSI-friendly SEO Optimization” myths. As we always say, these crooks are a black eye to the ethical sector of the search marketing industry.

Their signature seems to be the promotion of crap tools and services like Keyword Density tools, Markov Chain generators (if you believe that crap), TFIDF rarity calculators, “semantic page strength” estimators, lookup lists based on “LSI operators”, etc. What will be their next effort at misleading the public? Latent Dirichlet Allocation (LDA) tools?

However, in an effort to save face, the usual suspects are still making gymnastic wording. They are desperate. It is clear that our efforts at exposing these crook marketers through IR knowledge are working.

Many are learning why they should stay away from the incorrect knowledge promoted by marketers that ocassionally use IR jargon to pretend they know what they are talking about. They often do these IR-like talking attempts to promote their image as “experts” before either naive or ignorant followers. We still cannot assess the dumbers, if the snakeoil sellers or their groupies. They even game each others.

When we expose SEO myths from their competitors they praise us as long as the debunkig works for them, but when their own myths are exposed they get angry at us. Ha, Ha.

 

Posts somehow related with this post

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/ 

http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

SEOs and their Exhaustivity Search Myths

July 21, 2008

Some SEOs, in an effort to sell something, gain credibility, or save face, will come up with all sort of theories made out of thin air about search engines. When not citing themselves, they cite each other hearsays, often through their link farms. When caught with the pants down, they will lie or edit qualifiers in their posts. Can you guess who, according to Mike Duz, wrote this?

“Some of those well in the know attribute this to latent semantic indexing, which Google has been using for a while, but recently increased its weighting”. (From the Internet Archive)”

According to Duz, this guy later changed his categorical assertion into this:

“Even if they are not using LSI, Google has likely been using other word relationship technologies for a while, but recently increased its weighting”.

Note that in this case changing the qualifier (“had” to “if”) also changes the categorically asserted facts, which is not a minor thing since flies against Credibility. Thanks, Mike.

Answer: (Aaron Wall) 
http://www.internetbusiness.co.uk/09072008/different-google-algos-for-different-keywords/

What a saving face effort!

Instances of such kind of edits are not new across the Web.

We roast these folks simply because they sell search engine snake oil and lies often to promote themselves, their peers, or some kind of crap tool or service. We do this through IR knowledge. One of our goals is to warn the ethical sector of the search marketing industry about such pseudo experts.

We will hammer their myths any day of the year, which takes us to another persistent myth about how search engines work: the search exhaustivity myth.

SEOs have this idea that when a user submits a query, the system does an exhaustive search through the entire document collection or index to compute term weights and rank documents according to a particular similarity measure. Evidently these folks do not know how an inverted index works. One of the reasons (there are many) for using inverted indexes is to avoid searching through all the documents listed, present in a collection. “Jumping” and “intersecting” posting lists is one of the reasons why search engines return results so fast.

BTW, when we understand how positional inverted indexes work, the benefits of document linearization, a topic we have written on before, become clear.

How an inverted index works is a good topic for IR Watch – The Newsletter.

Claps and Slaps

July 14, 2008

Claps

Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ) , who conducts research in LSI and few other great areas at the intersection of IR, NLP, and AI wrote me to mention that he is almost finishing his grad thesis. Thanks, David for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.

Slaps

There is something funny about SEOs that sell snake oil ( http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, markov chains, LSI, and keyword density are few examples. Ha, Ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.

The following links provide additional information about AIRWeb and EMNLP

AIRWeb
http://irthoughts.wordpress.com/2008/04/29/for-seo-spammers-airweb-2008-presentations/  

EMNLP
http://conferences.inf.ed.ac.uk/emnlp08/

Understanding TFIDF

July 7, 2008

What TFIDF Is

In IR, a function of the form

aij = f(Lij,Gi,Nj) = Lij Gi Nj

defines term weights, where aij is the score assigned to term i in document j. Lij is the local weight of term i in document j. Gi is a collection-wide weight. Nj is a normalization factor over document j, often set to 1. Other values and models for N are possible.

There are many definitions for Lj and Gi. One of these consists in setting

Lij = tij, where tfij is the frequency of term i over document j
Gi = IDFi, where IDFi = log(N/ni) is the inverse document frequency of term i over a collection of N documents, where ni is the number of documents containing term i.
Nj = 1

Terms weights then reduces to evaluating

aij = tfij*IDFi = tfij*log(N/ni)

This is the so-called classic TFIDF term weight scheme, one of several examined by Salton et al. in 1975.

TFIDF is frequently used to construct a term vector space model. In this model, terms define the dimensions of a vector space called “the term space”. Dimension units are aij scores. Vectors representing documents and queries are embedded in this space and often converted into unit vectors to simplify further matrix analyses.

Terms do not posses positions in the term vector space simply because they are the dimensions of the space. Since dimensions are assumed to be orthogonal, term independence is assumed. Documents and queries are then treated as “bags of words” and represented as vectors in this space.

What TFIDF Scores Are Not

The TFIDF product, aij = tfij*IDFi, combines two different event spaces: the event space of terms over TF and the event space of documents over IDF. This product can hardly be defined as a term importance score or term relevance judgment for several reasons. Still this is what many informational electronic outlets like Wikipedia and search marketing blogs have claimed.

Firstly, let’s consider the contribution of tf to aij. For a given IDF value a plot of aij vs tfij presumes proportionality. However, a term i repeated x times in document j is not necessarily x times more pertinent or relevant. Term repetition is not precisely a good indicator of term importance.

Secondly, let’s consider the contribution of IDF to aij. IDF weights do not incorporate relevance information. According to Robertson (1), “we can regard IDF as a simple version of the RSJ weight, applicable when we have no relevance information.”

IDF is a measure of the discriminative power of a term over a collection of documents. Sparck Jones called this “term specificity”. Numerically, it is a log estimate of the inverse probability that a random document ni from a collection of N documents would contain term i. Both IDF and TFIDF have never been a buzzword in IR since the ‘70s.

On Term Importance

IDF as the TFIDF product, aij = tfij*IDFi, does not estimates term importance either. The importance of a term, a string, a passage, a message, etc is linked to many things like its meaning (semantics) and amount of information carried  (entropy). A TFIDF product does not evaluate either one.

So far we have considered two event spaces, but we could include other event spaces that add up to other variables that affect term importance; for example, the query space.

For the purpose of keyword-driven services on the Web, other considerations can be added to the “term importance” label. One of these is search volume. Terms with a high search volume or with a good conversion ratio are frequently deemed as “important” to keyword-driven services, regardless if these are quite discriminatory or “rare”.

Average users rarely search for very rare terms simply because they are rare. The odds are that very rare terms will exhibit low search volume. Thus, a highly discriminatory term not necessarily is important in this case. Thus, the query is a third and different event space to consider before deeming a term as “important”.

It is true that in the TFIDF model, we can take the cosine angle between any two vectors as a similarity measure and rank whatever the vectors represent. However, ranking by similarity in such a way can hardly incorporate semantics or information entropy, despite what some have written about the topic.

A partial solution to this consists in replacing IDF in the aij expression by an entropy expression. Such models do exist, but are computationally expensive and certainly prohibitively expensive for large scale web search engines.

On IDF and Information Entropy

Over the years, several authors including Robertson himself have tried to link IDF to entropy. According to Robertson (1), “We might be tempted to link this interpretation to the IDF formula, by regarding IDF as measuring the amount of information carried by the term. Indeed, something along these lines has been done several times, including by the present author (Robertson, 1974). Some problems with this view of IDF are discussed below.”

We are not going to go over those problems here, but the reader can go to the original source:

Robertson (1) contended:

“The essence of the problem is that it is hard to identify a single event space and probability measure within which all the relevant random variables can be defined. Without such a unifying event space, any process that involves comparing and/or mixing different measures (even a process as simple as adding IDFs for different terms) may well be invalid in terms of a Shannon-like formalism.”

Robertson then expanded on efforts along those lines (1):

“But there is a more serious problem. When we search using weighted terms, we typically take the query terms and assign weights to them, but ignore all the other terms in the vocabulary. It is hard to see how such a practice could make sense in a Shannon-like model: every term in the document must be assumed to carry information as well as any other. That is, the presence of a term ti would carry −log P(ti) amount of information irrespective of whether or not it is in the query. There is nothing to link the amount of information to the specific query. So we would have no justification for leaving it out of the calculation. Nevertheless, the similarity of the usual IDF formulation to a component of entropy has stimulated other researchers to try to make connections, sometimes somewhat differently from the link suggested above.”

In the rest of the paper Robertson mentioned by name authors that proposed such efforts.

Reference

1. Robertson, S.; Understanding Inverse Document Frequency: On theoretical arguments for IDF.
Journal of Documentation,Volume 60, Number 5, 2004 pp 503–520.

Posts somehow related with this post

RSJ-PM: Probabilistic Model Tutorial

Why IDF is Expressed using Logs?

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/ 

http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

SEOs and their IDF Myths: Part 2

July 3, 2008

This post is a continuation of a previous one on the topic of SEO non sense in relation with inverse document frequency (IDF). IDF has a long standing presence in IR since its introduction back in 1972 by the late Karen Sparck Jones. Since then the model has been thoroughly researched and incorporated into IR models (Salton’s tf-idf, RSJ, and BM25 models).

SEO myths and misconceptions in connection with IDF is featured in the current issue of IRW newsletter.

It appears that repeating SEO crap across the blogosphere makes some to become experts on the subject. These pseudo teachers should have researched the topic before making dumb claims or repeating their peers’s hearsay or come up with definitions made out of thin air. Nothing new coming from SEOs. Such practices are almost their trademarks.

For instance Aaron Wall, has incorrectly defined IDF as follows:

“Inverse Document Frequency is a term used to help determine the position of a term in a vector space model.”

As usual other seos have repeated like parrots such misinformation. It is not the first time. It reminds me of Wall’s claims about LSI, a topic he wrote extensively on until his ignorance about the topic was exposed in several blogs that discuss IR. He was not alone. Andy Beal, Mike Marshall, and few other vocal SEOs have claimed to know about LSI or have used LSI in SEO work. Really?

But this post is not about LSI, but IDF which is a topic equally misunderstood by SEOs. So, let us debunk their claims. As stated in IRW-2008-06:

Salton et al. proposed the vector space model in 1975 in the paper A Vector Space Model for Automatic Indexing (15). In that paper several schemes for scoring term weights were proposed. One of these consisted in combining term frequency (tf) with IDF. Over the years, a family of tf-IDF models has been proposed. Obviously, these are predated by the IDF model of 1972.

In Salton’s vector space model documents are represented as vectors. A query is represented just as another document. Vectors are projected in a vector space, whose dimensions are terms. The units of those dimensions are weights. Coordinates associated to a point (or a vector) in that space are computed according to a scoring model. Terms cannot have positions in this vector space because they are the dimensions of the space. It is that simple.

Despite the fact that IDF has been around since 1972 and tf-IDF since 1975, some search marketers like those that repeat Andy Edmonds’s claims are saying that IDF or tf-IDF is a “new” buzzword in the IR field. WOW! IDF and tf*IDF is a “new” buzzword in IR circles. Really?

Others have claimed that it is not possible to evaluate the IDF of a phrase. Even some that plan to teach IR have claimed that calling log(N/n) “inverse document frequency” is an “insult to students”. Before making a fool of themselves they should read Robertson and Sparck Jones legacy papers on the topic.

Sorry to sound harsh, but I wonder what kind of crap all these pseudo teachers are lecturing while sitting in the dark of their  empty classrooms and forums.

Did search engines use IDF? Yes, absolutely.

Do all search engines use IDF? No, absolutely.

Do I think X search engine currently uses IDF? I cannot speculate what X is doing simply because I don’t work at X.

Do I use IDF? Yes, in my experimental search engine students are building/researching.

What are the drawbacks of IDF? Several. Its stability as N gets larger is an ongoing research topic.

Before commenting on IDF, SEOs please don’t lose credibility and do your homework. Start here.

New Research on the topic: http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Sneak Preview of IRWatch: Understanding IDF

June 23, 2008

idf

“IDF is simply neither a pure heuristic, nor the
theoretical mystery many have made it out to be.
We have a pretty good idea why it works as well
as it does.” –Stephen E. Robertson

Here is a sneak preview of IR Watch for the month of June, 2008. It should be in subscribers inbox during the day or at the latest tomorrow. 

It is discussed within the context of co-occurrence theory and term independence/dependence assumptions. Issues and misconceptions related with this measure are addressed. Initially we made plans for including current ongoing work we are conducting on specificity measures, but we have chosen not to since is not the appropriate forum. 

IRW-2008-06: Understanding Inverse Document Frequency (IDF)

In this issue:

Introduction
Robertson-Sparck Jones Early Work on IDF
What IDF Is Not
What IDF Really Is
On Terms Independence
On Terms Dependence
Few Examples
Estimating the IDF of a Phrase
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

SEOs and their IDF Myths

June 17, 2008

Now that the semester is over we can take on other projects. After a little break from the blog, it is good to be back. We are putting the final touches to this month issue of IR Watch – The Newsletter.  During the break dozen of new subscribers signed.

The piece takes on several IDF myths and misconceptions promoted by SEOs and on what IDF is/is not. Here is an excerpt:

One recurrent misconception found across online media channels (search marketing blogs, forums, etc) is the assertion that IDF can be used to assess how important or relevant a term might be to the content of a document. This claim has no basis.

 

It should be stressed that as a measure of term specificity over N, IDF is not a local, but a global measure. IDF evaluates the discriminating power of a term within a collection of documents. A term ti might be relevant or important to the content of a document. However, if this document is part of a collection wherein all documents repeat ti, the term loses its discriminating power since N = ni and IDFi = log(N/ni) = 0.

Somehow, these marketers are mistaking IDF for the RSJ model or who knows what to possibly, as is often the case, promote themselves or whatever they sell.

SEOs Scams: LSI, KW, and Markov Chains

June 3, 2008

I’m happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at Indian Institute of Technology Madras, India is using my LSI and Term Vector tutorials for his graduate courses:

http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf

http://aidb.cs.iitm.ernet.in/cs625/10.VectorSpace-model.pdf

It is great to see that more and more IRs and graduate students are realizing how certain SEOs have induced the public and their clients into error; that is, by selling their snakeoil in the form of “LSI optimization” and keyword density services. The most recent scam comes in the form of “markov chain” services. Like if they really know about matrix algebra and markov chain processes. Same old tricks…

It is not surprising to hear colleagues referring to these SEOs as vulgar crooks and scammers.

Vector Space Models and Search Engines

April 21, 2008

This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/

That said, today’s post is in reaction to the article at
http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

The title of that article sends the message that search engines work by using tf*IDF. In addition, IDF itself is mistaken. It is not clear from the article how vector space models work or are used by search engines. The author then seem to agree with few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.

Blockquoted passages and my comments are next.

Blockquoted Passage

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (“the reluctance to stop talking about a certain subject” …)would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document with.

Comments

At grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. More likely no current top search engine uses plain tf, IDF or tf*IDF to rank docs. How IDF works?

Let N be number of docs in a collection and n be number of docs containing a given term. The probability of randomly choosing a doc containing a given term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as

IDF = log(1/p) = log(N/n)

Logs at a given base (often 2 or 10) are used.

Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.

Frequently used terms have a small discriminatory power, regardless if they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF) regardless if they are relevant to the topic of the document mentioning these. Term relevancy and the discriminatory power of a term not always run “side by side”. Some times they do, though.

The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.

For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” probably would be relevant to the doc and be discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the “job” term is still relevant and meaningful to the document, but more likely will lose its discriminatory power within the collection. And we haven’t considering yet how relevant “job” or the documents containing the term is to end-users looking for jobs.

IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.

For instance, we can use IDF probabilistic (IDFP) by considering the odds (p/(1 – p)) instead of just p. Inverting and taking logs,

IDFP = ((N – n)/n)

If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.

Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html  

There are other ways of defining goblal weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.

Blockquoted Passage

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.

Comments

Like global weights, there are dozen of ways we can define local weights, L. In the original vector space model, L = f was used, where f is the frequency (occurrence) of a term in a doc.

This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation to use?

The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that atttenuate frequencies are better choices.

L = f/fmax
L = 1 + log(f)
L = log(1 + f)

etc, etc, etc.

Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.

There are many ways of defining L and G, not to mention variants of document normalization weights N. These then give a product weight:

W = L*G*N

A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of large and short documents.

As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.

Blockquoted Passage

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

Queries as Vectors

Every word in a query can also be shown as a vector.

By looking at documents that are “near” our query we can rank (sort) documents in our result set.

Comments

In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.

However, in LSI every word of a doc can be represented as a vector, but this is not what the referred article discusses.

Blockquoted Passage (from one commenter)

It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.

Comments

Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.

There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to conform an addressable “book” of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. We can put in the posting lists different doc features, like f values, word positions, word spacing, in-title, etc.

The index can be computed and already be in cache before any query. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. For each term, IDFs are already precomputed in advanced. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over entire collection).

From matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note these can be a predetermined number of docs from the index. Thus, if million docs are matched by the posting list(s) only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.

For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resource to precaching strategies, so a similar analysis can also be done offline, too, (e.g. from a pool of frequently queried terms –the so-called suggestion lists). We can also use prebuilded thesaurus to find similar docs, impacting precision and recall. Processes can be called by geolocation to please a specifc region or regional directory, etc.

For rather small collections, the term-doc matrix can be from all terms in the little collection that is stored on disk or small term-doc matrices can be constructed in advanced from each little posting list. All these, done off-line and before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, results postprocessed, ranked, and presented to the end user under fractions of a second.

To boldy suggest that vector models are not used by top search engines for ranking docs is plain non sense. Still, link weight only or vector similarity scores only are not enough. These scores can be combined with link weights or with other analytics, to get a final score.

Combining those scores simply does not simplifies computation, but adds another complexity layer while not doing it can leave out meaningful docs and queries.

Note

Since SEOS love to quote each other and I love to quote IRs, let’s have a happy medium and quote both:

http://www.webpronews.com/topnews/2001/09/05/google-interview-by-fredrick-marckini

One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html  which incidentally mentions tf and IDF.

Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explains in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustive searching the collection.

Blockquoted Passage

Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.

Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

Comments
Why agree to SEO hearsay? See previous comments

Also depending on how was an index constructed, a query no need to “travel” an entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhausting searches. That’s why it is known as an “addressable book” or “addressable tree”.

Furthermore, literature on vector models for phrases can be searched on the Web. IDF for phrases are certainly computable. To snoop at the subject or about inverted index strategies, index segmentation, index merging, etc. read http://lucene.sourceforge.net/talks/inktomi/  or just do a search for “phrase idf”.

Conclusion

To sum up, even if no current commercial search engine uses plain tf*IDF this does not mean they don’t use vector space models for classification, retrieval, and ranking.

Vector space models often are present in different flavors and within different levels of an IR architecture. Vector Theory itself is an ancilliary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.

How Search Engines Do Not Work

April 17, 2008

If you are taking the Search Engines Architecture grad course, by now you should have learned what are the main components of a search engine and how to build a web crawler and a parser. You should know how to build an inverted index, how to use this to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work are either misconception, myths, or just untrue folklore. Eg., While some have an incorrect idea on how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply non sense.

To illustrate visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first one is about an SEO discussing “how search engines work” and use the Vector Space Model. The second is about the State of Washington suing a marketing company for misselling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

Demystifying LSI Video

April 7, 2008

Here is a video of my presentation, Demystifying LSI, at the OJOBuscador Congress 2.0, Madrid, Spain, 2007. One year later, nothing has changed. Many of the same crook SEOs exposed during the congress are still deceiving the public about what is LSI.

Unfortunately, the quality of the video and lights are not good enough to see the pdf slides, plus the presentation is in Spanish. Since attendees were not scientists, I talked very slow for over an hour.

Want to get bored for the next hour? View the video.

Thanks to N. Valenzuela Alonso, Director of SEO and Search Engine Marketing of Media Bit, S.L. for the link (www.ithinksearch.com/2008/03/31/video-lsi-de-edel-garcia-desmitificando-lsi/).

Here is also the presentation of Carlos Castillo (Chato), from Yahoo! Research Spain:

Adversarial IR with Web Spam, parts 1 and 2 
(http://www.ojobuscador.com/2007/06/14/ir-con-adversario-y-webspam-videopost/).

I spent great time talking with Carlos, a former grad student of Ricardo Baeza-Yates.

Baeza-Yates, Andrei Broder, Gerald Salton, and Keith van Rijsbergen and few others have helped to shape what is today known as Information Retrieval Research

Talking about Andrei Broder (one of the main researchers behind the old mighty Altavista), here is also a great interview, thanks to ojobuscador site: 
http://www.ojobuscador.com/2006/05/20/entrevista-a-andrei-broder/

 

How Many SEO Myths In One Sentence?

March 12, 2008

When I thought I have read enough SEO myths here is another SEO “expert” combining many of these in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that is not useful.

Two issues that strikes me as these suggest a lack of understanding of how search engines work accomodate to the following questions:

1. Could KD be used by search engines or users to check for spam keyword?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for spam keyword?

Word repetition determined by search engines as spam keyword should be of more concern than what web page creators or a KD tool tags as spam keyword. After all search engines and not designers of web pages are the one that assign a rank to the documents. This goes with the user-machine relevance perception mismatch and the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users are a zero to the left, as they are the ones that pay the bills. And even if they don’t, why rank high a page just to see users going to some place else after visiting it because is not suitable for human consumption? So, rather than using a KD tool, just write as natural and useful to your prospective clients and readers as you can.

Regarding the use of KD tools for checking for spam, this allegation reminds me of certain seo books, marketers, and community forums that insist in such non sense, just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked almost on a rutinary basis these and similar SEO myths. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening down the effect of keyword repetition. Two for the price of one!

This is more cost effective at neutralizing keyword repetition than computing and comparing against a whole new and extra ratio, KD. Best of all, it does not require of the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scales of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.

To be sure students understood, I included the following three questions in the Final Exam section that consisted of multiple choices. (The problem-solving section of the test is even more interesting, but is too long to include it here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times this term should be repeated in d2 to triplicate its d1 weight? Assume Log 10 base.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is non sense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, I’ll be teaching this Spring the graduate course Search Engines Architecture.

I will blog the syllabus tomorrow, but is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own authorship.

On and on, SEOs still have no clue about what a search engine can or cannot do.

Keyword Density, SEOs, and the Deception War

February 7, 2008

I’m happy that at this Sphinnessed post: http://www.searchenginepeople.com/blog/how-search-really-works-the-keyword-density-myth.html , several SEOs are finally waking up and getting the Keyword Density Myth.

Great to see that they are realizing what IR grad students already know (http://irthoughts.wordpress.com/2008/01/25/the-power-of-document-linearization/ ):

That KD is not even a cost-effective tool for detecting spam, as search engines can use local term weight models that in addition to scoring terms can attenuate word frequencies. More on this here http://irthoughts.wordpress.com/2007/05/09/keyword-density-the-devils-advocate/  and here http://irthoughts.wordpress.com/2007/05/07/keyword-density-kd-revisiting-an-seo-myth/   

Such models effectively minimize spurious effects/advantages derived from keyword repetition. Some of these are LOGA, LOGN, ATF1, and SQRT. One could even use the idea of global ENTROPY and propose a local ENTROPY model to neutralize any attempt to misrepeating terms. All these have been discussed in my Web Mining course.

Based on the aforementioned links, it is clear that the Search Engine War never ends, especially when in addition to their spam tactics, marketers are proposing soooo many theories made out of thin air. What is worse is that from time to time they induce their peers and cheerleaders to buy into their Latest SEO Incoherences (“LSI”).

The recent round of nonsense surprisingly comes from alleged “SEO experts”. From claims about sculping PageRank (http://sphinn.com/story/26410 ) to the usual LSI non-sense (http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) to Keyword Density (http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/ ), the Deception War never ends.

I’m glad at AIRWEB we fight these folks (http://airweb.cse.lehigh.edu/2008/cfp.html ). Ironically, what Google’s Cutts calls “proper use” of nofollow link attributes and “sculping” is just another way of saying end user’s human manipulation attempts. Great pr exercises.

July 8, 2009 Update:

It is funny how search engines keep entertained SEOs so that SEO “experts” end up as useful fools giving bad advice to their peers and cheerleaders.

After somehow inducing SEOs to buying into PageRank sculping, it turns out they changed the rules of the game. Check this Google’s Matt Cutts post and this reply from Danny Sullivan;

“Matt, as you know, I was kind of annoyed when you suggested sculpting to a room full of SEOs back in 2007. We’d been told over the years to do things for humans, not to overly worry about having to do stuff for search engines — and suddenly, here you were suggesting that SEOs could flow PageRank to their most “important” pages. I’d figured Google had long since been smart enough to decide for itself what percentage of a page’s PageRank spend to assign to a particular link. That assumption didn’t just come out of the blue — it came from things Google had hinted at over the years. So being told to start overtly flowing around the PageRank? It seemed counter-productive.”

He, He. Who is effectively gaming whom or wining the Deception War?

 

 

The Power of Document Linearization

January 25, 2008

In http://www.miislita.com/fractals/keyword-density-optimization.html  I explained to the SEO community the concept of document linearization as part of document GAP analysis. Marketers learned what IR graduate students already know: that document linearization (i.e., markup removal) is just one component of document indexing.

Keyword distribution, word distances, phrase matching, etc. are obtained from the text stream that results from linearization, not from the apparent position of text that is rendered by a browser and visually inspected by average end users. Document linearization debunks the common SEO Keyword Density Myth. One thing is the apparent distribution of words as perceived when end users visually scan a document and another thing is the actual word distribution as parsed by a search engine. The futility of computing KD values is quite obvious.

Here is a report of another recent SEO that discovered the power of document linearization:

http://seo-gw.blogspot.com/2008/01/fractal-semantics-linearization.html

The testimonial is worth to read.

The post http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/  is also relevant these days.

Search for posts on keyword density: http://irthoughts.wordpress.com/?s=keyword+density

Microsoft’s Black Cloud on Yahoo! & SEO Tag Clouds

January 23, 2008

From time to time rumors spread of the black cloud of Microsoft over Yahoo!; i.e., of Microsoft buying Yahoo!. This time things are less cloudy, especially now that Yahoo! is about to cut jobs.

Early this year, Jeremy Zawodny from Yahoo!, wrote:

“Sure, there would be cultural problems, integration challenges, and many people who’d likely walk. But at the end of the day, Microsoft would end up with a much larger set of online services, a better advertising network, and people who know how to build, brand, and market web stuff that people actually use.”

Talking about clouds:

A student asked me about some SEOs claiming that text tag clouds are a kind of LSI technology.

Pure non sense coming from many SEOs, as usual.

These clouds are easy to construct. No LSI is needed:

1. Sort terms from a document or lookup list by frequencies.
2. Normalize frequencies to run between the 0,1 interval.
3. Use normalized frequencies as parameters to be passed as font sizes.

For pizzaz, store terms into array to be sorted or randomized and or use some CSS.

We can do the same with hit counts assigned to blog categories, links, etc. No special technology is needed.

Finding Topic-Specific Posts

January 18, 2008

Web Mining Week 7

January 14, 2008

Week 7 Agenda

Review of Association and Scalar Clusters
Review of Vector Space Models
LSI & SVD: Demystifying LSI SEO Myths (OJOBuscador Congress, Madrid; PDF Presentation)
LSI & Keyword Research (PDF Presentation)
SVD Noise Filtering: Principal Component Analysis (PCA)

Required Reading Material

Tutorial Series
This is part one of a five-part tutorial series:
http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

Fast Tracks
These are quick tutorials, with to-the-point calculations:
http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Blog Posts
These are IR blog posts designed to fight back against misinformation promoted by unethical SEOs and Spammers:
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/page/1/?s=lsi
http://irthoughts.wordpress.com/page/2/?s=lsi

Blog Category
This is a blog category pointing to a collage of posts that demystify SEO non sense about LSI. Some are about topics that overlap with LSI:
http://irthoughts.wordpress.com/category/latent-semantic-indexing/