This post is a continuation of a previous one on SEO nonsense about inverse document frequency (IDF). IDF has had a long-standing presence in IR since its introduction in 1972 by the late Karen Sparck Jones. Since then the model has been thoroughly researched and incorporated into IR models (Salton’s tf-idf, RSJ, and BM25 models).

SEO myths and misconceptions in connection with IDF are featured in the current issue of the IRW newsletter.

It appears that repeating SEO crap across the blogosphere turns some people into experts on the subject. These pseudo teachers should have researched the topic before making dumb claims, repeating their peers’ hearsay, or coming up with definitions made out of thin air. Nothing new coming from SEOs. Such practices are almost their trademarks.

For instance, Aaron Wall has incorrectly defined IDF as follows:

“Inverse Document Frequency is a term used to help determine the position of a term in a vector space model.”

As usual, other SEOs have repeated such misinformation like parrots. It is not the first time. It reminds me of Wall’s claims about LSI, a topic he wrote extensively on until his ignorance about the topic was exposed in several blogs that discuss IR. He was not alone. Andy Beal, Mike Marshall, and a few other vocal SEOs have claimed to know about LSI or to have used LSI in SEO work. Really?

But this post is not about LSI. It is about IDF, a topic equally misunderstood by SEOs. So, let us debunk their claims. As stated in IRW-2008-06:

Salton et al. proposed the vector space model in 1975 in the paper A Vector Space Model for Automatic Indexing (15). In that paper several schemes for scoring term weights were proposed. One of these consisted in combining term frequency (tf) with IDF. Over the years, a family of tf-IDF models has been proposed. Obviously, these are predated by the IDF model of 1972.

In Salton’s vector space model documents are represented as vectors. A query is represented just as another document. Vectors are projected in a vector space, whose dimensions are terms. The units of those dimensions are weights. Coordinates associated with a point (or a vector) in that space are computed according to a scoring model. Terms cannot have positions in this vector space because they are the dimensions of the space. It is that simple.
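To make the point concrete, here is a minimal sketch in Python. It is not taken from Salton's paper or from IRW. The toy documents, the helper names (to_vector, cosine), and the use of raw term counts as weights are all assumptions made for illustration. Any scoring model could supply the weights instead. What the sketch shows is that the terms are the axes, while the documents and the query are the points.

```python
import math
from collections import Counter

# Toy collection (assumed for illustration only).
docs = {
    "d1": "karen sparck jones introduced idf",
    "d2": "salton proposed the vector space model",
    "d3": "the vector space model uses term weights",
}
query = "vector space model"

# The dimensions of the space are the terms themselves.
vocab = sorted({term for text in docs.values() for term in text.split()})

def to_vector(text):
    """Place a document (or a query) in term space.

    Assumption: the weight along each term axis is the raw term count.
    """
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# The query is represented just as another document.
q_vec = to_vector(query)
scores = {name: cosine(to_vector(text), q_vec) for name, text in docs.items()}
```

In this sketch every document and the query get one coordinate per term. A term never gets coordinates of its own, which is why asking for "the position of a term" in this space makes no sense.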

Despite the fact that IDF has been around since 1972 and tf-IDF since 1975, some search marketers, like those who repeat Andy Edmonds’s claims, are saying that IDF or tf-IDF is a “new” buzzword in the IR field. WOW! IDF and tf*IDF are “new” buzzwords in IR circles. Really?

Others have claimed that it is not possible to evaluate the IDF of a phrase. Even some who plan to teach IR have claimed that calling log(N/n) “inverse document frequency” is an “insult to students”. Before making fools of themselves they should read Robertson’s and Sparck Jones’s legacy papers on the topic.
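For readers who want to see log(N/n) at work, here is a minimal sketch in Python. N is the number of documents in the collection and n is the number of documents that contain the term. For a phrase, the sketch assumes one simple convention: treat the phrase as a unit and count the documents where it occurs as a contiguous token sequence. The toy collection, the function names, and the choice of natural logarithm are assumptions of mine. This is one way to do it, not a claim about what any search engine does.

```python
import math

# Toy collection (assumed for illustration only).
docs = [
    "inverse document frequency was introduced in 1972",
    "the vector space model was proposed in 1975",
    "inverse document frequency is combined with term frequency",
    "search engines may or may not use term weights",
]

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    k = len(phrase_tokens)
    return any(tokens[i:i + k] == phrase_tokens
               for i in range(len(tokens) - k + 1))

def idf(phrase, collection):
    """IDF = log(N/n) for a single term or a multi-word phrase.

    N: number of documents in the collection.
    n: number of documents containing the term or phrase at least once.
    Returns None when n == 0, since log(N/0) is undefined.
    Assumption: natural logarithm; the base only rescales the values.
    """
    phrase_tokens = phrase.lower().split()
    N = len(collection)
    n = sum(contains_phrase(doc.lower().split(), phrase_tokens)
            for doc in collection)
    if n == 0:
        return None
    return math.log(N / n)

idf_term = idf("frequency", docs)                      # n = 2, log(4/2)
idf_phrase = idf("term frequency", docs)               # n = 1, log(4/1)
idf_long = idf("inverse document frequency", docs)     # n = 2, log(4/2)
```

Note that n counts documents, not occurrences: the third document mentions "frequency" twice but contributes 1 to n. Under this convention the rarer phrase "term frequency" gets a higher IDF than the single term "frequency".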

Sorry to sound harsh, but I wonder what kind of crap all these pseudo teachers are lecturing while sitting in the dark of their empty classrooms and forums.

Did search engines use IDF? Yes, absolutely.

Do all search engines use IDF? No, absolutely not.

Do I think X search engine currently uses IDF? I cannot speculate what X is doing simply because I don’t work at X.

Do I use IDF? Yes, in the experimental search engine my students are building and researching.

What are the drawbacks of IDF? Several. Its stability as N gets larger is an ongoing research topic.

SEOs, before commenting on IDF, please do your homework so you don’t lose credibility. Start here.

New Research on the topic: