Good question from a reader:

Hi, I have heard the expression inverse document frequency, but recently I came across a paper mentioning inverse term frequency. Can you clarify these?

Sure.

Consider a set, D, consisting of n documents:

D ={d1, d2…dn}

and a set, T, consisting of m unique terms extracted from D:

T = {t1, t2…tm}

Thus,

  1. the probability of randomly selecting any document from D is P(d) = 1/n. The probability that this document mentions a term i from T is P(di) = di/n (thus, 1 – P(di) = 1 – di/n is the probability that the term is not mentioned).
  2. the probability of randomly selecting any term from T is P(t) = 1/m. The probability that this term is present in a document j from D is P(tj) = tj/m (thus, 1 – P(tj) = 1 – tj/m is the probability that is not present)

Inverting these quantities and taking logs we obtain the following weight measures:

log(n/di), known as Inverse Document Frequency (IDF).
log(m/tj), known as Inverse Term Frequency (ITF).

The former is a global weight measure and the later is a local weight measure.

Both can be rewritten in function of the odds as follows:

log((n – di)/di), known as IDF-Probabilistic (IDFP)
log((m – tj)/tj), known as ITF-Probabilistic (ITFP)

These adopt a value of 0 for n = di and m = tj.

Negative IDFP weights are computed when di > n/2.
Negative ITFP weights are computed when tj > m/2.

Negative weights are notorious for introducing retrieval complexities in IR and some search systems that use IDFP and ITFP. A workaround for dealing with these “negative terms” consists in rewriting these as having 0 weight values.

Thus, terms which occur in more than 50% of a collection of documents are reduced to stopterms. Documents with more than 50% of the index terms receive a similar treatment. The system might elect to ignore these terms or documents altogether during scoring and ranking.

Additional information is given in Implementation and Application of Term Weights in a MySQL Environment and at the MySQL site.

PS. (1) I’ve refined some lines of this post to make it clearer. (2) Next week I will explain another expression known as RITF or Redundant Inverse Term Frequency.

Advertisements