Over the weekend some people asked me to expand on oxymorons, so here goes.

As mentioned in our previous post, an oxymoron is a combination of contradictory terms. All antonyms are contradictory terms, but not all contradictory terms are antonyms. For instance ‘alone together’ and ‘big baby’ are oxymorons, but only the former consists of antonyms. Thus, it is possible to extract these two types of clusters from a list of oxymorons. We can also extract clusters of oxymorons with a common term or theme.

We must not mistake oxymorons for misnomers. A misnomer is an incorrect designation. Not all oxymorons are misnomers, and not all misnomers are oxymorons. For example, ‘binary independence’ is a misnomer, but not an oxymoron. All this is explained in my upcoming ebook Keyword Clustering Analysis with Excel.

As mentioned in my previous post, ‘similarity distance’ is an oxymoron since distance is dissimilarity. The expression is also a misnomer.

Unfortunately, some IR authors and many SEOs have used the ‘similarity distance’ expression. The problem here seems to be a combination of poor word choice and a lack of knowledge about basic IR cluster analysis concepts, particularly LSI.

In Cluster Analysis, objects are grouped into clusters using Proximity, a criterion of how ‘close’ or alike objects are. Proximity can be defined as Distance (dissimilarity) or Similarity. Clustering by distance is a minimization problem, whereas clustering by similarity is a maximization problem.
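To make the minimization versus maximization point concrete, here is a minimal sketch. It assumes Euclidean distance and cosine similarity as the two proximity definitions (my choice for illustration, there are many others), and the function names and toy vectors are hypothetical.

```python
import math

def euclidean_distance(u, v):
    # Dissimilarity: smaller means closer.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Similarity: larger means more alike. Assumes non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_by_distance(query, candidates):
    # Distance is a minimization criterion: pick the smallest value.
    return min(candidates, key=lambda name: euclidean_distance(query, candidates[name]))

def closest_by_similarity(query, candidates):
    # Similarity is a maximization criterion: pick the largest value.
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

# Hypothetical toy data: two candidate objects and one query object.
candidates = {"a": [2.0, 2.0], "b": [1.0, 0.0]}
query = [1.0, 1.0]

print(closest_by_distance(query, candidates))
print(closest_by_similarity(query, candidates))
```

With these toy vectors the two criteria do not even agree on which object is closest: ‘b’ wins by distance, ‘a’ wins by similarity.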

There are more definitions for similarity than for distance. Which type of proximity and which definition to use depends on the type and scale of the attributes in the data.

In a very high-dimensional space the notion of distance or similarity is useless, if not meaningless. Thus, in a high-dimensional space, talking about a ‘semantic distance’ is a waste of time. We can try to do dimensionality reduction with LSI, but we end up computing cosine similarity, not distance.
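One common way to see why distances lose meaning in high dimensions is to watch how the gap between the nearest and the farthest point shrinks relative to the nearest one. The sketch below is only an illustration under stated assumptions: the points are drawn uniformly at random (not real document vectors), and relative_contrast is a hypothetical helper name.

```python
import math
import random

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relative_contrast(dim, n_points=200, seed=0):
    # (farthest - nearest) / nearest, measured from one random query point.
    # A small value means all points look roughly equally far away.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(euclidean_distance(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the printed contrast drops as the number of dimensions grows, so ‘nearest’ and ‘farthest’ become harder and harder to tell apart.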

The current understanding is that LSI is also a misnomer, as (a) its clustering power is the result of high-order word co-occurrence, not semantics, and (b) it is not exactly a document indexing method (before applying LSI, documents must already be indexed).

With regard to similarity and distance, it is tempting to think that one is just the numerical complement of the other or that we can blindly transform one into the other. These are short-sighted views. Fortunately, few IR researchers believe this. Unfortunately, we cannot say the same about search marketers and SEOs.
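Here is one possible illustration of why the ‘numerical complement’ idea is risky. It is a sketch, not the full argument: it assumes cosine similarity, uses three hypothetical toy vectors, and checks only the triangle inequality, which any true distance must satisfy. The helper names are mine.

```python
import math

def cosine_similarity(u, v):
    # Assumes non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def complement_distance(u, v):
    # The naive transformation: "distance" = 1 - similarity.
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    # The angle between the vectors, in radians.
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    s = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(s)

def satisfies_triangle(dist, u, v, w, tol=1e-9):
    # A true distance must satisfy d(u, w) <= d(u, v) + d(v, w).
    return dist(u, w) <= dist(u, v) + dist(v, w) + tol

# Hypothetical toy vectors: v sits halfway between u and w.
u, v, w = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

print(satisfies_triangle(complement_distance, u, v, w))
print(satisfies_triangle(angular_distance, u, v, w))
```

For these three vectors the naive complement fails the check while the angle passes it. The same similarity values produce different ‘distances’ depending on the transformation chosen, and not every transformation yields a proper distance.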

Want to know more about the difference between these two concepts? Study the topic or read my ebook. Don’t be fooled by self-proclaimed SEO “experts” or any “seobook”.