The problem with the many definitions of similarity measures is that each is defined for a particular knowledge or model domain, or tied to a specific problem or application. In addition, some rest on assumptions that are not clearly stated. Transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity, addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989; Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.
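
To make the point about the Dice and cosine coefficients concrete, here is a minimal Python sketch of both, assuming the objects are already given as equal-length numerical feature vectors. The function names and the sample vectors are illustrative and do not come from Lin's paper, and the Dice form shown is the common vector generalization rather than the only one in use.

```python
import math


def cosine(x, y):
    """Cosine coefficient: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Undefined (division by zero) if either vector is all zeros.
    return dot / (norm_x * norm_y)


def dice(x, y):
    """Dice coefficient for vectors: 2 * dot product over the sum of squared norms."""
    dot = sum(a * b for a, b in zip(x, y))
    # Undefined (division by zero) if both vectors are all zeros.
    return 2 * dot / (sum(a * a for a in x) + sum(b * b for b in y))


# Hypothetical binary feature vectors for two documents.
doc_a = [1, 1, 0]
doc_b = [1, 0, 1]
print(cosine(doc_a, doc_b), dice(doc_a, doc_b))
```

Both functions take vectors as input and nothing else. Objects that have no natural vector form, such as nodes in a concept network, cannot be passed to them without first choosing a vector encoding, which is exactly the limitation Lin describes.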

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.
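
For reference, the measure that follows from those assumptions is usually stated as a ratio of two information quantities. The form below is given as recalled from the paper's Similarity Theorem, so the notation should be checked against the original. Here common(A, B) is a proposition stating what A and B have in common, and description(A, B) is a proposition that fully describes both A and B.

```latex
\mathrm{sim}(A, B) = \frac{\log P(\mathrm{common}(A, B))}{\log P(\mathrm{description}(A, B))}
```

Under this reading, similarity is 1 when the commonality of A and B already describes them completely, and it falls toward 0 as the two objects share less information.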
