Thomas Richard Lynam has extensively researched a variant of ITF called Redundant ITF (RITF). His 2002 master's thesis, “Exploitation of Redundant Inverse Term Frequency”, is a must-read for anyone interested in the topic. The thesis is available as PDF and PostScript.

The justification for using RITF is as follows.

According to Lynam, a corpus C can be thought of as a sequence of documents. Each document is in turn a sequence of terms, so C itself can be thought of as a sequence of terms. A passage is a substring of C and therefore also a sequence of terms. The probability that a term drawn at random from the corpus is term i is

p(i) = fi/|C|

where fi is the frequency of term i in the corpus and |C| is the length of the corpus in terms.

Inverting this and taking the logarithm gives the following measure of the importance, or rareness, of a term:

s(i) = log(|C|/fi)
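
As a quick illustration, here is a minimal sketch of this score on a toy corpus. The function name `rareness`, the toy sentence, and the use of the natural logarithm are my own assumptions; the formula above does not fix a log base.

```python
import math
from collections import Counter

def rareness(corpus_terms):
    """Return s(i) = log(|C| / f_i) for every term in the corpus."""
    total = len(corpus_terms)      # |C|: corpus length in terms
    freqs = Counter(corpus_terms)  # f_i: occurrences of each term
    return {term: math.log(total / f) for term, f in freqs.items()}

# Hypothetical toy corpus: 8 terms, "the" occurs 3 times, "cat" once.
corpus = "the cat sat on the mat the end".split()
scores = rareness(corpus)
```

With this input, the frequent term "the" gets a lower score than a term that occurs once, such as "cat".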

Terms with a lower probability therefore get a higher score, so rare terms are treated as more important than common ones.

To reflect the chance of finding a term in more than one passage, this rareness score is multiplied by the number of passages that mention the term (ci):

RITFi = ci*log(|C|/fi)

where

RITFi = RITF of term i
C = corpus, with |C| its length in terms
fi = frequency of term i over C
ci = number of passages containing term i
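
Putting the definitions together, the following is a rough sketch of the RITF calculation, not Lynam's implementation. The function name `ritf`, the toy corpus and passages, and the natural logarithm are assumptions made for illustration. Each passage counts once toward ci, however many times the term appears in it.

```python
import math
from collections import Counter

def ritf(corpus_terms, passages):
    """Compute RITF_i = c_i * log(|C| / f_i) for terms found in the passages.

    corpus_terms: the whole corpus C as one list of terms
    passages: list of term lists, each a passage taken from C
    """
    total = len(corpus_terms)      # |C|
    freqs = Counter(corpus_terms)  # f_i
    # c_i: number of passages containing term i (each passage counted once)
    passage_counts = Counter()
    for passage in passages:
        passage_counts.update(set(passage))
    return {
        term: c * math.log(total / freqs[term])
        for term, c in passage_counts.items()
        if freqs[term] > 0  # skip terms that never occur in the corpus
    }

# Hypothetical toy data: a 19-term corpus and two passages drawn from it.
corpus = ("the capital of france is paris "
          "the river in paris is the seine "
          "the tower is in the city").split()
passages = [
    "the capital of france is paris".split(),
    "the river in paris is the seine".split(),
]
scores = ritf(corpus, passages)
best = max(scores, key=scores.get)
```

In this toy run, "paris" and "the" both appear in two passages, but "paris" is much rarer in the corpus, so it comes out on top.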

To sum up, an RITF value estimates the likelihood of a term being a correct answer to an information need.

I can think of many information-need scenarios in which RITF can be applied: questionnaires, student tests, customer feedback forms, book summaries, term papers, abstracts, office memos, slogans, text creatives, news stories, queries (searches), search engine abstracts, snippet optimization (SOP), duplicate content detection, and so forth.

Chapters 5 and 6 of Lynam’s thesis explain the many possibilities for RITF-based strategies.
