In my previous post I explained to a reader the difference between inverse term frequency (ITF) and inverse document frequency (IDF), but did not give any practical applications. This post explains what ITF is good for.

Like IDF, ITF is a global weight measure; i.e., Gi = ITF. Combined with a local weight measure (Lij), it can be used to compute an overall weight.

Local weights can be defined in many different ways. Here is one definition:

Let Lij = 0.5 + 0.5(fij/fijmax) be the local weight of term i in doc j, where

fij = frequency of term i in doc j

fijmax = max frequency of term i over all j documents

Then

wij = Lij*Gi

wij = (0.5 + 0.5(fij/fijmax))*ITF

Please note that in this case, fijmax is not defined as the max term frequency within a single document, but is computed over all j documents of the collection.

Thus, an overall weight, wij, can be computed for each term of the index of terms.
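The weighting scheme above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the raw frequency counts and the global weights g[term] (the ITF values, computed elsewhere) are hypothetical placeholder values.

```python
# Term-by-document raw frequency counts: freq[term][j] = fij
freq = {
    "apple":  [3, 0, 1],
    "banana": [1, 2, 0],
}
# Hypothetical precomputed global weights, Gi = ITF (from the previous post)
g = {"apple": 1.2, "banana": 0.8}

def local_weight(f_ij, f_max):
    """Lij = 0.5 + 0.5 * (fij / fijmax), with fijmax taken over all j docs."""
    return 0.5 + 0.5 * (f_ij / f_max)

# wij = Lij * Gi, one weight per (term, doc) pair
weights = {}
for term, counts in freq.items():
    f_max = max(counts)  # max frequency of term i over all j documents
    weights[term] = [local_weight(f, f_max) * g[term] for f in counts]
```

For "apple", fijmax = 3, so the weight in the first document is (0.5 + 0.5 * 3/3) * 1.2 = 1.2, while a document where the term is absent still gets the 0.5 floor times the global weight.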

Now the interesting part.

Terms can be represented as vectors in a concept space with docs as indexing elements. Thus, terms can be compared by computing cosine similarity values between vectors. This allows us to cluster terms by similarity values. More important, it allows us to construct a reference lookup list we call a similarity thesaurus.
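As a rough sketch of the idea, each term becomes a vector of its wij weights across documents, and pairwise cosine values rank terms by similarity. The vectors below are made-up numbers chosen only to illustrate the lookup.

```python
import math

# Each term is a vector over documents (its wij weights); values are illustrative.
vectors = {
    "car":   [0.9, 0.1, 0.8],
    "auto":  [0.8, 0.0, 0.7],
    "fruit": [0.0, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Similarity thesaurus: for each term, every other term ranked by cosine.
thesaurus = {
    t: sorted(((cosine(v, vectors[u]), u) for u in vectors if u != t),
              reverse=True)
    for t, v in vectors.items()
}
```

Here "car" and "auto" end up near each other because they carry similar weights in the same documents, even though nothing about co-occurrence within a document was used.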

Such a similarity thesaurus can be used to expand a query. Some authors refer to such query reformulations as “query expansion through a global similarity thesaurus”. Note that these thesauri are not based on a term-term co-occurrence matrix. Indeed, such thesauri are not built using co-occurrence information at all.
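One hedged sketch of such an expansion step, assuming a precomputed thesaurus (the neighbour lists and similarity scores below are hypothetical): each query term pulls in its top-ranked neighbours above a cutoff.

```python
# Hypothetical precomputed similarity thesaurus:
# term -> neighbours ranked by cosine similarity
thesaurus = {
    "car":  [("auto", 0.97), ("vehicle", 0.91), ("fruit", 0.12)],
    "fast": [("quick", 0.95), ("rapid", 0.88)],
}

def expand_query(terms, k=1, threshold=0.5):
    """Append up to k similar terms per query term, above a similarity cutoff."""
    expanded = list(terms)
    for t in terms:
        for neighbour, sim in thesaurus.get(t, [])[:k]:
            if sim >= threshold and neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

# expand_query(["car", "fast"]) -> ["car", "fast", "auto", "quick"]
```

How many neighbours to add (k) and where to set the cutoff are tuning choices; adding too many low-similarity terms drifts the query off topic.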

This tells us that not all thesauri are constructed in the same way. Some are built using statistical information (e.g. a statistical thesaurus); that is, based on clustering techniques derived from frequency data. These can also be built using co-occurrence information. Still others may be based on a hierarchical concept structure or derived from subsumption relationships.

Some can be defined according to a particular clustering technique; for example, based on association, scalar, or metric clusters, or even on dimensionality reduction techniques like LSI. And we still haven’t mentioned variants based on combinations of heuristics.

SEOs, keyword researchers, and end users in general: before drawing conclusions from a lookup list or thesaurus, it is important to know what is behind the suggested terms. You don’t want someone to sell you one thing as another (e.g. “LSI”).
