Recently a well-known SEO (name withheld) asked me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.

I am partially reproducing and editing my responses here, in the hope that they help other SEOs with similar questions.

Questions 1 and 3 are related so I will answer both first.

1) Why is a log function used for calculating IDF?
3) Would it be accurate to describe IDF as “the ratio of documents in a collection to documents in that collection with a given term”? I’m guessing your answer would be that IDF is the [LOG of “the ratio of documents in a collection to documents in that collection with a given term”]? Which brings us back to question 1, I guess? hehe

The reason for using logs can be traced back to two assumptions made in IR:

I. that scoring functions are additive.
II. that terms are independent.

While in some models assumption II might not be present, both I and II play well with logs, since logs are also additive.

These scoring functions, and why logs are used, are explained in the recent RSJ-PM Tutorial: http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf

Document Frequency (DF) is defined as d/D. This is a probability value, where d is the number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).

For D > d, log(d/D) gives a negative value. To get rid of the negative sign, we simply invert the ratio inside the log expression. Essentially, we are compressing the scale of values so that very large and very small quantities can be compared. log(D/d) is then conveniently called the Inverse Document Frequency.
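To make the definition concrete, here is a minimal sketch in Python (the function name idf and the figures are mine, for illustration only; real systems often use smoothed variants):

import math

def idf(d, D):
    # d = number of documents containing the term
    # D = total number of documents in the collection
    # IDF = log(D/d); the base of the log only changes the scale
    return math.log(D / d)

# In a collection of 1,000,000 documents, a term found in 100 documents
# scores much higher than one found in 500,000 documents.
print(idf(100, 1000000))     # about 9.21
print(idf(500000, 1000000))  # about 0.69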

Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it must follow that, for any two events A and B,

p(AB) = p(A)p(B).

Taking logs we can write

log[p(AB)] = log[p(A)]+ log[p(B)]

It is easy to show that, for two independent terms, d12/D = (d1/D)(d2/D), where d12 is the number of documents containing both terms. Taking logs,

log(d12/D) = log(d1/D) + log(d2/D)

Inverting and using the definition of IDF we end up with

IDF12 = IDF1 + IDF2

validating assumption I that IDF, as a scoring function, is additive.
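As a quick numeric sketch (the collection figures below are invented for illustration), we can verify this additivity in Python when the independence assumption holds:

import math

D = 1000000     # collection size (assumed)
d1 = 10000      # documents containing term 1 (assumed)
d2 = 5000       # documents containing term 2 (assumed)

# Under independence, p(AB) = p(A)p(B), so d12 = d1*d2/D
d12 = d1 * d2 / D

idf1 = math.log(D / d1)
idf2 = math.log(D / d2)
idf12 = math.log(D / d12)

print(idf1 + idf2)  # about 9.90
print(idf12)        # same value, about 9.90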

That is, the IDF of a two-term query is the sum of the individual IDF values. However, this is only valid if the terms are independent of one another. If the terms are not independent, there are two possibilities:

p(AB) > p(A)p(B)

or

p(AB) < p(A)p(B)

and we cannot say that the IDF of a two-term query (e.g., a phrase) is the sum of the individual IDF values. Assuming the contrary, as some SEOs do to promote their keyword tools, is plain snake oil.
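As a hedged illustration with made-up figures, consider a phrase whose terms co-occur far more often than independence would predict:

import math

D = 1000000
d_new = 200000    # documents containing "new" (made-up figure)
d_york = 50000    # documents containing "york" (made-up figure)
d_both = 40000    # documents containing both terms (made-up figure)

# Independence would predict d_new*d_york/D = 10,000 co-occurrences;
# the observed 40,000 means the terms are dependent.
sum_of_idfs = math.log(D / d_new) + math.log(D / d_york)   # about 4.61
idf_of_pair = math.log(D / d_both)                         # about 3.22, not the sum

print(sum_of_idfs, idf_of_pair)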

2) What do you mean by ‘discriminatory power’ in the phrase “IDF is a measure of the discriminatory power of a term in a collection”?

This is a legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (also known as term specificity) implies that terms used too frequently are not good discriminators between documents. If a term appears in too many documents, it is of little use for discriminating between them. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.
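As an illustration (figures invented): in a collection of 1,000,000 documents, a term appearing in 900,000 of them has IDF = log(1,000,000/900,000) ≈ 0.11, while a term appearing in only 100 of them has IDF = log(1,000,000/100) ≈ 9.21. The rare term separates the collection far better than the common one.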

The RSJ-PM Tutorial was written to kill, once and for all, some misconceptions regarding IDF. In it we explain why Robertson and Sparck Jones consider IDF a particular RSJ weight in the absence of relevance information.

In a nutshell, IDF is a collection-wide estimate, and as such, whether the documents containing the query terms are actually relevant to the query is unknown. Similarly, whether the documents without the query terms are relevant or not is unknown, and often remains unanswered when we just look at the d/D and d/(D – d) collection-wide ratios. All we can say is that relevant documents might have a higher probability of containing query terms than other documents from the collection as a whole. But we could make this assertion without resorting to IDF.
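As a small sketch (figures invented for illustration), note how close these two collection-wide ratios become once a term is rare relative to the collection; by themselves, neither says anything about relevance:

import math

D = 1000000
d = 1000    # documents containing the query term (assumed)

idf = math.log(D / d)            # log(D/d), the classic IDF form
ratio = math.log((D - d) / d)    # log of the inverted d/(D - d) ratio

print(idf)    # about 6.908
print(ratio)  # about 6.907 -- nearly identical when d << D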

In the case of Web documents, these are often about multiple topics. Many documents aggregate content from dissimilar sources (news headlines, RSS feeds, blogs, etc.), and their content might change over time. The mere mention of a term (regardless of repetition) is not proof of its relevance or of its importance with respect to the topics discussed in a document.

Thus, the idea that we can assess whether terms are relevant to a document by simply comparing IDF values misses the whole point and defeats the purpose for which the RSJ-PM model and many of its variants were developed.