Yesterday I provided a basic explanation of metric weights and metric clusters. This time I want to expand on this topic and provide some how-to calculations.
On Counting Word Distances
As mentioned, the distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:
d(A, B) = | p(A) – p(B) |
Some IR authors use a different definition, though. For instance, Baeza-Yates and Ribeiro-Neto, in Modern Information Retrieval, page 126, define the distance between any two words as the number of words between them. The two definitions differ by exactly 1 on the distance scale.
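For example, the following position (P) vs. word (W) table depicts a 16-word text stream that mentions A twice and B three times (x stands for any other word):
P:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
W:  A  x  x  B  x  x  A  x  x  x  x  B  x  x  x  B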
Adopting the initial definition, d(A1, B1) = 4 – 1 = 3; adopting Baeza-Yates and Ribeiro-Neto’s definition, d(A1, B1) = 2.
The former definition is frequently used since it is easier to automate than the latter and ensures non-zero distance values. To convert a distance count to the latter definition, subtract 1.
Some authors, like Koberstein and Ng (http://faculty.cs.byu.edu/~dennis/papers/mtcluster.ps), use Baeza-Yates and Ribeiro-Neto’s definition, but then add 1 to the counts to ensure that word distances between terms are always non-zero. This is the same as computing distances based on the former definition.
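The two ways of counting can be put side by side in a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and positions are assumed to be 1-based word offsets of two different words.

```python
def distance_positional(p_a: int, p_b: int) -> int:
    """Absolute difference of the two word positions (the first definition)."""
    return abs(p_a - p_b)


def distance_words_between(p_a: int, p_b: int) -> int:
    """Number of words strictly between the two positions.

    Adjacent words give 0 here, which is why some authors add 1 back.
    """
    return abs(p_a - p_b) - 1


# A1 at position 1 and B1 at position 4, as in d(A1, B1) = 4 - 1 = 3.
print(distance_positional(1, 4))          # 3
print(distance_words_between(1, 4))       # 2
print(distance_words_between(1, 4) + 1)   # 3, same as the first definition
```

Whichever function you pick, use the same one for every pair, since the resulting weights are only meaningful relative to each other.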
Regardless of how you compute word distances, be consistent. Understand that conversion of distances to metric weights produces relative, comparative data.
In the rest of this post, I use Baeza-Yates and Ribeiro-Neto’s definition of word distances, without adding 1. You can rework the calculations by adding 1 if you wish, but you should be able to arrive at the same general observations.
It can be demonstrated that for any word pair, A and B, there is a metric weight between the terms such that
mw(A, B) = mw(B, A) = mw(A) = mw(B)
That is, metric weights are symmetric. This is to be expected; metric weights are derived from distances and, by definition, a distance metric is symmetric.
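One way to see why this holds is sketched here, assuming (as the worked example that follows suggests) that the weight of an occurrence is the sum of reciprocal distances to every occurrence of the other word. The indices i, j and the occurrence counts m, n are notation introduced just for this sketch.

```latex
\begin{align*}
\mathrm{mw}(A) &= \sum_{i=1}^{m} \mathrm{mw}(A_i)
               = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{d(A_i, B_j)} \\
               &= \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{1}{d(B_j, A_i)}
               = \sum_{j=1}^{n} \mathrm{mw}(B_j)
               = \mathrm{mw}(B)
\end{align*}
```

The middle step only swaps the order of a finite double sum and uses d(A_i, B_j) = d(B_j, A_i), so both totals run over exactly the same set of occurrence pairs.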
Let us consider the previous example.
With respect to A:
mw(A1) = ½ + 1/10 + 1/14 = 0.671
mw(A2) = ½ + ¼ + 1/8 = 0.875
mw(A) = mw(A1) + mw(A2) = 1.546
With respect to B:
mw(B1) = ½ + ½ = 1
mw(B2) = ¼ + 1/10 = 0.35
mw(B3) = 1/8 + 1/14 = 0.196
mw(B) = mw(B1) + mw(B2) + mw(B3) = 1.546
Hence, the metric weight of the A, B pair is
mw(A, B) = mw(B, A) = mw(A) = mw(B) = 1.546
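The arithmetic above can be checked with a short script. This is a minimal sketch, not a reference implementation: the function name is hypothetical, and the positions (A at 1 and 7, B at 4, 12 and 16) are an assumption inferred from the distances that appear in the sums above.

```python
def metric_weight(positions_x, positions_y):
    """Sum 1/d over every (x, y) occurrence pair, with d = words in between."""
    total = 0.0
    for px in positions_x:
        for py in positions_y:
            d = abs(px - py) - 1  # number of words strictly between the two
            if d <= 0:
                # Adjacent words give d = 0 under this definition.
                raise ValueError("distance must be positive; consider adding 1")
            total += 1.0 / d
    return total


# Assumed positions, inferred from the distances 2, 10, 14 and 2, 4, 8.
pos_a = [1, 7]
pos_b = [4, 12, 16]

print(round(metric_weight(pos_a, pos_b), 3))  # 1.546, computed with respect to A
print(round(metric_weight(pos_b, pos_a), 3))  # 1.546, computed with respect to B
```

Both calls visit the same six occurrence pairs, only in a different order, which is why the two totals agree.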
In practice, mw(A, B) is computed with respect to either A or B. In IR textbooks, this quantity is frequently termed an unnormalized correlation factor and used to populate an unnormalized correlation matrix, Cu.
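As a rough illustration of how such a matrix might be populated, here is a small sketch that fills a symmetric table of pair weights from a dictionary of term positions. Everything in it is an assumption made for illustration: the function names, the nested-dictionary layout, the 0.0 placeholder on the diagonal, and the positions, which are inferred from the example above.

```python
from itertools import combinations


def pair_weight(pos_x, pos_y):
    """Sum of 1/d over all occurrence pairs, d = words in between.

    Raises ZeroDivisionError if two occurrences are adjacent (d = 0).
    """
    return sum(1.0 / (abs(px - py) - 1) for px in pos_x for py in pos_y)


def unnormalized_matrix(term_positions):
    """Build Cu as a nested dict: cu[t][u] holds the pair weight of t and u."""
    terms = sorted(term_positions)
    # Diagonal entries stay at 0.0 as a placeholder (an assumption of this sketch).
    cu = {t: {u: 0.0 for u in terms} for t in terms}
    for t, u in combinations(terms, 2):
        w = pair_weight(term_positions[t], term_positions[u])
        cu[t][u] = w
        cu[u][t] = w  # symmetric, so each pair is computed only once
    return cu


cu = unnormalized_matrix({"A": [1, 7], "B": [4, 12, 16]})
print(round(cu["A"]["B"], 3))  # 1.546
print(cu["A"]["B"] == cu["B"]["A"])  # True
```

Because the weights are symmetric, only one triangle of the matrix needs to be computed; the other is a copy.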