A graduate student asked about distance metrics and metric clusters. Unfortunately, many IR textbooks either ignore the subject or provide abstract explanations, hard to grasp by first-year students.
This post might help graduate students and casual readers to grasp the concept.
Terms that appear closer together in a document are assumed more related than those that appear far apart in the same document. To capture this notion of relatedness, metric clusters have been proposed.
The distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:
d(A, B) = | p(A) – p(B) |
This difference is taken for the number of words between the terms. Such distance can be computed before or after filtration–the removal of stop words.
A metric weight, mw, is therefore defined as the inverse of this distance; i.e.
mw = 1/d(A, B)
This is one way of estimating the dispersion between any two words and the distribution of these in a document.
For instance, if A appears once and B appears three times in a document, calculating the metric weight of A reduces to computing the following quantities:
mw(A) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A) = 1/| p(A1) – p(B1) | + 1/| p(A1) – p(B2) | + 1/| p(A1) – p(B3) |
The subscripts in this summation refer to the different instances of the terms. Thus, mw considers the frequency of occurrence, position, and distance between the terms.
When a word appears several times in a document, its metric weight is a combination of the weight associated to each instance. For instance, if A appears twice and B appears three times in a document
mw(A1) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A2) = 1/d(A2, B1) + 1/d(A2, B2) + 1/d(A2, B3)
mw(A) = mw(A1) + mw(A2)
This actually is a double summation.
Some have argued to average these.
What applies to words also applies to stems. One just needs to identify sets of words with common stems.
Once metric weights are computed, a term-term co-weight matrix is constructed and clusters are identified by applying standard clustering techniques. These are the so-called metric clusters.