The distinction between the terms given in the title of this post is often not clear in the literature.
Closeness is a generic notion that can be expressed in terms of proximity, similarity or distance.
Closeness as Proximities
Consider three points A, B, and C in a two-dimensional space. A is closer to B than to C, but B is closer to C than to A.
So, the notion of closeness in terms of proximity indicates that the nearest neighbor of A is B, but the nearest neighbor of B is C. We can express closeness between A and B as
clos(A, B) not equal to clos(B, A)
Thus, unlike the notion of distance, closeness in terms of proximity is not symmetric.
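The three-point example can be checked with a minimal sketch. The coordinates below are hypothetical, chosen only so that A is closer to B than to C while B is closer to C than to A; the helper name nearest_neighbor is likewise an assumption for illustration.

```python
import math

# Hypothetical coordinates (not from the original post).
points = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (4.0, 1.0)}

def nearest_neighbor(name, pts):
    """Return the name of the point closest to `name` (Euclidean distance)."""
    others = (other for other in pts if other != name)
    return min(others, key=lambda other: math.dist(pts[name], pts[other]))

# Map each point to its nearest neighbor.
nn = {name: nearest_neighbor(name, points) for name in points}

for name, neighbor in nn.items():
    print(f"nearest neighbor of {name} is {neighbor}")
```

With these coordinates B is the nearest neighbor of A, yet A is not the nearest neighbor of B, which is the asymmetry described above. The underlying distances are still symmetric; only the "nearest neighbor of" relation is not.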
Closeness as Similarities
Closeness in terms of similarity is a different story.
When we say that two objects are similar we imply a notion of closeness distinct from proximity.
In IR, similarity is often described in terms of cosine angles, though this is not the only way of describing similarity. However, the cosine of the angle between any two data points represented as vectors is a symmetric measure.
In the example,
sim(A, B) = sim(B, A)
sim(B, C) = sim(C, B)
Thus, unlike closeness in terms of proximity, closeness in terms of cosine similarity is symmetric. I hope you can see the difference between proximity and similarity “walking down the street…, far apart”.
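As a quick check of the symmetry claim, here is a small sketch of cosine similarity for two-dimensional vectors. The vectors are hypothetical and the function name cosine_similarity is an assumption; any nonzero vectors would do.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for the data points A, B, and C.
A, B, C = (1.0, 2.0), (3.0, 1.0), (4.0, 1.0)

sim_ab, sim_ba = cosine_similarity(A, B), cosine_similarity(B, A)
sim_bc, sim_cb = cosine_similarity(B, C), cosine_similarity(C, B)

print(sim_ab, sim_ba)
print(sim_bc, sim_cb)
```

Swapping the arguments only swaps the factors of the dot product and of the norm product, so the value cannot change.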
Closeness as Distances
How about the notion of closeness between points in terms of their distances?
Well, distance is a metric, but similarity is not. Similarity is a measure. We can convert distances into similarities, but the reverse is not so obvious because of the triangle inequality that must be satisfied by a distance metric.
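A rough sketch of both directions follows. The conversion 1 / (1 + d) is only one common choice and is an assumption here, not something the post prescribes. The reverse attempt, taking 1 minus the cosine similarity as a "distance", is shown with three hypothetical vectors for which the triangle inequality fails.

```python
import math

def distance_to_similarity(d):
    """Assumed conversion: d = 0 maps to 1, large d approaches 0."""
    return 1.0 / (1.0 + d)

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def naive_distance(u, v):
    """Naive reverse conversion: 1 - cosine similarity (not a true metric)."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical vectors: v sits halfway (45 degrees) between u and w.
u, v, w = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

direct = naive_distance(u, w)                          # going straight from u to w
detour = naive_distance(u, v) + naive_distance(v, w)   # going through v

# A metric requires direct <= detour; here that check fails.
triangle_holds = direct <= detour
print(direct, detour, triangle_holds)
```

Going through v comes out "shorter" than going directly from u to w, so this naive conversion does not yield a distance metric.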
When we express closeness in terms of distances, it is clear that d(A, B) = d(B, A); that is, the distance between any two points is the same in either direction.
Note that distances can assume any nonnegative value, while similarity is often restricted to values between 0 and 1, though cosine values can be negative. In fact, negative values (not necessarily based on cosine values) are often found in LSI calculations. These have been referred to as “antisimilarities”.
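To make the difference in ranges concrete, the sketch below compares the two quantities for a pair of hypothetical vectors pointing in opposite directions. It illustrates only a negative cosine value, not an LSI calculation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors pointing in opposite directions.
p, q = (1.0, 0.0), (-1.0, 0.0)

cos_pq = cosine_similarity(p, q)   # negative: the vectors oppose each other
dist_pq = math.dist(p, q)          # Euclidean distance, never negative

print(cos_pq, dist_pq)
```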
All this tells us that a representation of closeness that incorporates both similarity and proximity of neighboring objects in LSI-based IR requires further study. Note that similarity induced by neighborhoods is still an open question in LSI studies.
Making these distinctions in the areas of keyword co-occurrence, semantic connectivity, or even duplicated content is important.
* So, next time you hear SEOs talking about how “close”, similar or alike any two keywords or documents are, be sure they know what they are talking about.
This is a legacy post, originally published on 2/26/2007.
* PS. Chances are they imply “aboutness”, which is closer to on-topic analysis than anything else.