As many of our readers know, one of the goals of this blog is to debunk SEO myths through IR knowledge. At times, we also try to clarify incorrect statements made in the IR field. It is not surprising to find, from time to time, IR papers with gross errors and misnomers. That’s why we believe the name of this blog, IR Thoughts, is on target.
Today’s SEO myth to debunk is the so-called Semantic Distance Myth in connection with LSI. But first, some definitions. The following material is taken from chapter 1 of my upcoming ebook Keyword Clustering Analysis with Excel.
Dissimilarity characterizes how different objects from a cluster are.
Similarity indicates how similar the objects are.
In IR, it is customary to use the term ‘distance’ to mean dissimilarity. This is because, unlike similarity, dissimilarity is a distance metric. Thus, similarity and distance (dissimilarity) are opposite terms. We can convert similarities into distances and vice versa, but not without first understanding the model at hand * (http://www.miislita.com/searchito/binary-similarity-calculator.html).
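To make the conversion concrete, here is a minimal Python sketch. It assumes a Jaccard model over sets of terms, which is my choice for illustration and not necessarily the model used by the calculator linked above. The function names and the two sample term sets are hypothetical. Under this model the similarity s lies between 0 and 1, so the complement d = 1 - s gives a dissimilarity; under other models the same rule need not produce a proper distance, which is why the model must be understood first.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Shared terms divided by all distinct terms (assumed model)."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a: set, b: set) -> float:
    """Dissimilarity taken as the complement of the similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical term sets, for illustration only.
doc1 = {"latent", "semantic", "indexing"}
doc2 = {"latent", "semantic", "analysis"}

print(jaccard_similarity(doc1, doc2))  # 2 shared / 4 distinct = 0.5
print(jaccard_distance(doc1, doc2))    # 1 - 0.5 = 0.5
```

Note that the two numbers move in opposite directions: identical sets give a similarity of 1 and a distance of 0.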
Beware of Oxymorons and Misnomers
The expression ‘similarity distance’ is an oxymoron, that is, a combination of contradictory terms (like a ‘small giant’, ‘approximately equal’, etc.). Unfortunately, the expression has been used in the IR literature (http://www.google.com/search?q=%22similarity+distance%22). Avoid it.
Some search marketers, when trying to explain Latent Semantic Indexing (LSI), have used expressions like the ‘semantic distance between words’ when in fact what is being discussed is word similarity, or how words relate to each other or to a topic. Used in this context, their discourses are oxymoronic if not dumb, not to mention that LSI does not measure any ‘semantic distance’.
Some IR authors have also used the expression ‘distance between words’ in reference to the number of words between any two words. The expression is loosely accepted to describe word spacing/distribution.
Perhaps loosely excluding the last one, all these expressions are misnomers (incorrect designations) since they do not conform to the definition of a distance metric. Let us address this point.
A function f is called a distance if it exhibits reflexivity, symmetry, and triangular inequality. To grasp these concepts, visualize three points (a, b, and c) describing a triangle.
Reflexivity means that the distance from a point to itself is zero; e.g., f(a, a) = f(b, b) = f(c, c) = 0.
Symmetry refers to the fact that the distance between any two points, measured from either one, is the same; e.g., f(a, b) = f(b, a).
Triangular Inequality requires that the distance between any two points, measured from either point, must be equal to or less than the distance between them measured through a third point; e.g., f(a, b) + f(b, c) >= f(a, c).
If these conditions are not met, the function or measure in question is not a distance.
Finally, note that distances cannot be negative and are not bounded above, unless their scales have been normalized.
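The conditions above can be tested numerically on a handful of sample points. The Python sketch below is an illustration only: the helper name passes_distance_checks, the tolerance, and the two candidate functions are my own assumptions. Keep in mind that passing on a sample does not prove a function is a distance; a single failure, however, is enough to show that it is not.

```python
from itertools import product


def passes_distance_checks(f, points, tol=1e-9):
    """Return False as soon as f breaks one of the conditions on the sample."""
    for a in points:
        if abs(f(a, a)) > tol:               # reflexivity: f(a, a) = 0
            return False
    for a, b in product(points, repeat=2):
        if f(a, b) < -tol:                   # distances cannot be negative
            return False
        if abs(f(a, b) - f(b, a)) > tol:     # symmetry: f(a, b) = f(b, a)
            return False
    for a, b, c in product(points, repeat=3):
        if f(a, b) + f(b, c) < f(a, c) - tol:  # triangular inequality
            return False
    return True


def absolute_difference(x, y):
    return abs(x - y)


def squared_difference(x, y):
    return (x - y) ** 2


points = [0.0, 1.0, 2.0]  # hypothetical sample points on a line

print(passes_distance_checks(absolute_difference, points))  # True
print(passes_distance_checks(squared_difference, points))   # False: 1 + 1 < 4
```

The squared difference is reflexive and symmetric, yet going from 0 to 2 through 1 gives 1 + 1 = 2, which is less than the direct value of 4. It fails the triangular inequality and so is not a distance, even though it is often loosely called one.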
* You might also want to check: http://irthoughts.wordpress.com/2008/01/02/simcalc-binary-similarity-calculator/
Note to spammers and SEOs: Embedding oxymorons and misnomers in documents, particularly in links, could be used as a search engine persuasion trick.
Some might argue whether the expression ‘loosely excluding’ might also qualify as a near oxymoron. For a list of oxymorons or near oxymorons, check these links: