Recently a known SEO (name reserved) inquired me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.
I am partially reproducing/editing my responses, so it might help other SEOs with similar questions.
Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.
1) Why is a log function used for calculating IDF?
3) Would it be accurate to describe IDF as “the ratio of documents in a collection to documents in that collection with a given term”? I’m guessing your answer would be, IDF is the [LOG of ” the ratio of documents in a collection to documents in that collection with a given term”]? Which brings us back to question, I guess? hehe
These are recurrent questions students asked me before. The reason for using logs is due to two assumptions frequently made in most IR models; i.e.
I. that scoring functions are additive.
II. that terms are independent.
While in some models II might not be present, both (I and II) play well with logs since these also are additive.
These functions and why the use of logs is explained in the recent RSJ-PM Tutorial http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf
Document Frequency (DF) is defined as d/D, where d is number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).
But since often D > d the log of d/D, that is log(d/D) gives a negative value. To get rid off the negative sign, we simply invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared. Now log(D/d) is conveniently called Inverse Document Frequency.
Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it must follows that for any two events, A and B
p(AB) = p(A)p(B).
Taking logs we can write
log[p(AB)] = log[p(A)]+ log[p(B)]
It is easy to show that for two terms
log(d12/D) = log(d1/D) + log(d2/D)
Inverting and using the definition of IDF we end up with
IDF12 = IDF1 + IDF2
validating assumption I; that IDF as a scoring function is additive.
That is the IDF of a two term query is the sum of individual IDF values. However, this is only valid if terms are independent from one another. If terms are not independent we would have two possibilities; i.e.,
p(AB) > p(A) + p(B)
p(AB) < p(A) + p(B)
and we cannot say that the IDF of a two term query (e.g, a phrase) is the sum of individual IDF values. Assuming the contrary as many SEOs think in order to promote some dumb keyword research tools is plain snakeoil.
2) What do you mean by ‘discriminatory power’ in the phrase “IDF is a measure of the discriminatory power of a term in a
This is legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms too frequently used are not good discriminators between documents. If a a term is used in too many documents its use to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.
The RSJ-PM Tutorial mentioned above was written to kill for good some misconceptions regarding IDF. In it we explain why IDF is considered by Robertson and Jones a particular RSJ weight in the absence of relevance information.
In a nutshell, IDF is a collection wide estimate and as such the information on whether documents containing the terms being queried are relevant to these is unknown. Similarly, the information on whether documents not containing the query terms are relevant or not is unknown and often remains unscrambled when we just look at the d/D and d/(D – d) collection-wide ratios. All we can say is that relevant documents might have a higher probability of containing query terms in comparison with other documents from the collection as a whole. But we could make such assertion without resourcing to IDF as well.
In the case of Web documents, often these are about multiple topics. Many documents aggregate content from dissimilar sources (news headlines, rss, blogs, etc) and said document content might change in time. The mere mention of a term (regardless of repetition) is not a proof of its relevancy or of its importance with respect to the topics discussed in a document.
Thus, the idea that we can assess if terms are relevant to a document by simply comparing IDF values is missing the whole point and defeats the purpose for which the RSJ-PM model and many of its variants (e.g., BM25) were developed.
I hope this helps to clear up some SEO misconceptions on the topic.