What TFIDF Is
In IR, a function of the form
aij = f(Lij,Gi,Nj) = Lij Gi Nj
defines term weights, where aij is the score assigned to term i in document j. Lij is the local weight of term i in document j. Gi is a collection-wide weight. Nj is a normalization factor over document j, often set to 1. Other values and models for N are possible.
There are many definitions for Lj and Gi. One of these consists in setting
Lij = tij, where tfij is the frequency of term i over document j
Gi = IDFi, where IDFi = log(N/ni) is the inverse document frequency of term i over a collection of N documents, where ni is the number of documents containing term i.
Nj = 1
Terms weights then reduces to evaluating
aij = tfij*IDFi = tfij*log(N/ni)
This is the so-called classic TFIDF term weight scheme, one of several examined by Salton et al. in 1975.
TFIDF is frequently used to construct a term vector space model. In this model, terms define the dimensions of a vector space called “the term space”. Dimension units are aij scores. Vectors representing documents and queries are embedded in this space and often converted into unit vectors to simplify further matrix analyses.
Terms do not posses positions in the term vector space simply because they are the dimensions of the space. Since dimensions are assumed to be orthogonal, term independence is assumed. Documents and queries are then treated as “bags of words” and represented as vectors in this space.
What TFIDF Scores Are Not
The TFIDF product, aij = tfij*IDFi, combines two different event spaces: the event space of terms over TF and the event space of documents over IDF. This product can hardly be defined as a term importance score or term relevance judgment for several reasons. Still this is what many informational electronic outlets like Wikipedia and search marketing blogs have claimed.
Firstly, let’s consider the contribution of tf to aij. For a given IDF value a plot of aij vs tfij presumes proportionality. However, a term i repeated x times in document j is not necessarily x times more pertinent or relevant. Term repetition is not precisely a good indicator of term importance.
Secondly, let’s consider the contribution of IDF to aij. IDF weights do not incorporate relevance information. According to Robertson (1), “we can regard IDF as a simple version of the RSJ weight, applicable when we have no relevance information.”
IDF is a measure of the discriminative power of a term over a collection of documents. Sparck Jones called this “term specificity”. Numerically, it is a log estimate of the inverse probability that a random document ni from a collection of N documents would contain term i. Both IDF and TFIDF have never been a buzzword in IR since the ‘70s.
On Term Importance
IDF as the TFIDF product, aij = tfij*IDFi, does not estimates term importance either. The importance of a term, a string, a passage, a message, etc is linked to many things like its meaning (semantics) and amount of information carried (entropy). A TFIDF product does not evaluate either one.
So far we have considered two event spaces, but we could include other event spaces that add up to other variables that affect term importance; for example, the query space.
For the purpose of keyword-driven services on the Web, other considerations can be added to the “term importance” label. One of these is search volume. Terms with a high search volume or with a good conversion ratio are frequently deemed as “important” to keyword-driven services, regardless if these are quite discriminatory or “rare”.
Average users rarely search for very rare terms simply because they are rare. The odds are that very rare terms will exhibit low search volume. Thus, a highly discriminatory term not necessarily is important in this case. Thus, the query is a third and different event space to consider before deeming a term as “important”.
It is true that in the TFIDF model, we can take the cosine angle between any two vectors as a similarity measure and rank whatever the vectors represent. However, ranking by similarity in such a way can hardly incorporate semantics or information entropy, despite what some have written about the topic.
A partial solution to this consists in replacing IDF in the aij expression by an entropy expression. Such models do exist, but are computationally expensive and certainly prohibitively expensive for large scale web search engines.
On IDF and Information Entropy
Over the years, several authors including Robertson himself have tried to link IDF to entropy. According to Robertson (1), “We might be tempted to link this interpretation to the IDF formula, by regarding IDF as measuring the amount of information carried by the term. Indeed, something along these lines has been done several times, including by the present author (Robertson, 1974). Some problems with this view of IDF are discussed below.”
We are not going to go over those problems here, but the reader can go to the original source:
Robertson (1) contended:
“The essence of the problem is that it is hard to identify a single event space and probability measure within which all the relevant random variables can be defined. Without such a unifying event space, any process that involves comparing and/or mixing different measures (even a process as simple as adding IDFs for different terms) may well be invalid in terms of a Shannon-like formalism.”
Robertson then expanded on efforts along those lines (1):
“But there is a more serious problem. When we search using weighted terms, we typically take the query terms and assign weights to them, but ignore all the other terms in the vocabulary. It is hard to see how such a practice could make sense in a Shannon-like model: every term in the document must be assumed to carry information as well as any other. That is, the presence of a term ti would carry −log P(ti) amount of information irrespective of whether or not it is in the query. There is nothing to link the amount of information to the specific query. So we would have no justification for leaving it out of the calculation. Nevertheless, the similarity of the usual IDF formulation to a component of entropy has stimulated other researchers to try to make connections, sometimes somewhat differently from the link suggested above.”
In the rest of the paper Robertson mentioned by name authors that proposed such efforts.
1. Robertson, S.; Understanding Inverse Document Frequency: On theoretical arguments for IDF.
Journal of Documentation,Volume 60, Number 5, 2004 pp 503–520.
Posts somehow related with this post
RSJ-PM: Probabilistic Model Tutorial
Why IDF is Expressed using Logs?