Now that the semester is over, we can take on other projects. After a little break from the blog, it is good to be back. We are putting the final touches on this month's issue of IR Watch – The Newsletter. During the break, dozens of new subscribers signed up.
The piece takes on several IDF myths and misconceptions promoted by SEOs and explains what IDF is and is not. Here is an excerpt:
One recurrent misconception found across online media channels (search marketing blogs, forums, etc.) is the assertion that IDF can be used to assess how important or relevant a term might be to the content of a document. This claim has no basis.
It should be stressed that, as a measure of term specificity over a collection of N documents, IDF is not a local measure but a global one. IDF evaluates the discriminating power of a term within a collection of documents. A term t_i might be relevant or important to the content of a document. However, if this document is part of a collection wherein all documents contain t_i, the term loses its discriminating power, since the number of documents containing it, n_i, equals N, and so IDF_i = log(N/n_i) = log(1) = 0.
Somehow, these marketers are mistaking IDF for the RSJ model, or for who knows what else, possibly, as is often the case, to promote themselves or whatever they sell.
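The point made in the excerpt, that IDF is a property of a term across a collection and drops to zero when every document contains the term, can be checked with a minimal sketch. Everything in it is an assumption made for illustration: the toy collection, the `idf` helper name, the representation of documents as sets of tokens, and the use of the natural logarithm (the excerpt does not specify a base).

```python
import math

def idf(term, collection):
    """Return log(N / n_i) for a term over a collection.

    N is the number of documents in the collection and n_i is the
    number of documents that contain the term at least once.
    The natural logarithm is an assumption; the base only rescales the value.
    """
    n_docs = len(collection)
    n_containing = sum(1 for doc in collection if term in doc)
    if n_containing == 0:
        # log(N / 0) is undefined; the plain formula does not cover this case
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / n_containing)

# Hypothetical toy collection: each document is a set of its tokens.
collection = [
    {"seo", "ranking", "idf"},
    {"seo", "links", "anchor"},
    {"seo", "idf", "myths"},
    {"seo", "newsletter"},
]

for term in ("seo", "idf", "newsletter"):
    print(term, idf(term, collection))
```

In this toy collection "seo" appears in every document, so N = n_i and its IDF is zero, even though the term is arguably central to each document. The value says something about the collection as a whole and nothing about how relevant the term is to any single document.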
June 18, 2008 at 12:52 pm |
An example of such nonsense is given in http://alwaysbetesting.com/abtest/index.cfm/2008/5/24/Term-Frequency-Inverse-Document-Frequency-TDIDF-Exploring-TheRarestWordscom
July 3, 2008 at 9:24 am |
[...] and their IDF Myths: Part 2 This post is a continuation of a previous one on the topic of SEO non sense in relation with inverse document frequency (IDF). IDF has a long [...]
March 20, 2009 at 11:59 am |
[...] and Their IDF Myths: Part 3 By E. Garcia In SEOs and their IDF Myths, we covered how many are mistaking the measure of term specificity known as Inverse Document [...]
August 26, 2009 at 3:22 am |
I recently came across the TF-IDF concept, and this is how I understand it in simple English.
– TF is a measure of how relevant a document is to a term. The assumption here is that more relevant documents will repeat the term often.
– IDF measures how important (or how specialized a subject) the term itself is. The assumption here is that terms that occur too frequently among documents represent less specialized subjects.
I would appreciate it if someone could verify my understanding.
August 26, 2009 at 2:45 pm |
Hi optimmysql:
Thank you for stopping by.
That was the perception back in the early days of IR, and it is still often repeated by authors unaware of current research. Our understanding has advanced a lot since then. You might want to check:
http://irthoughts.wordpress.com/2008/07/07/understanding-tfidf/
A term repeated x times is not necessarily x times more relevant to a document, nor does it mean that the document is x times more pertinent to the term in question.
IDF is a crude measure of the discriminatory power of a term and is used in the absence of relevance information.
Cheers