Back in 1997, William Woods, Principal Scientist and Distinguished Engineer at Sun Microsystems Labs, wrote Conceptual Indexing: A Better Way to Organize Knowledge. Although conceptual indexing turned out to be a complex notion, his paper is still relevant today, when many SEOs make incorrect claims about how search engines use Latent Semantic Indexing (LSI) and others are paying attention to synonymy and phrase-processing patents. This post is based in part on Woods’s manuscript.

Beware of “Synonyms”

Many marketers try to improve a query or the relevance of documents by using synonyms taken from preorganized lookup lists or a synonym thesaurus. Others suggest deploying LSI-clustered lists of terms labeled as “synonyms” or classified as “similar” or “related” terms. The claim is that this improves semantics or affects a so-called “semantic distance.” Yet when asked how to measure such a distance, they cannot provide a sound response.

These SEO claims are not necessarily good ideas or valid arguments. At the same time, I’m not suggesting that one should not use synonyms or related terms to improve copy. You just need to do this on a case-by-case basis, and only if the terms flow naturally and are on topic.

To begin, one needs to know how such lists or thesauri were compiled. This is very important since some of these are based on term frequency counts instead of word order or contextuality.

In the case of LSI-clustered terms, these do not necessarily have or preserve a synonymy relationship. Such clusters are the result of a co-occurrence phenomenon taking place across a collection of documents. This does not necessarily grant contextuality.

It should be pointed out that web documents are notorious for covering different topics, and terms can appear anywhere in a document, including in dissimilar passages. Long documents are prone to this. Thus, co-occurrence of any order (first, second, or higher) across documents does not necessarily ensure that terms are on topic, nor does it capture contextuality. A large portion of my research work deals with the extraction of topics from clusters, so I feel I am qualified to comment on this.
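As a rough illustration of the co-occurrence point, the Python sketch below counts how often two terms appear together at two granularities: whole documents and individual passages. The toy document, the blank-line passage splitting, and the function names are all assumptions made for illustration; this is not how any particular search engine or LSI implementation works.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately naive tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))

def cooccurrence(units, term_a, term_b):
    """Count the text units (documents or passages) containing both terms."""
    return sum(1 for u in units if term_a in tokens(u) and term_b in tokens(u))

# Hypothetical long document that drifts across two unrelated topics.
document = (
    "The jaguar is a large cat native to the Americas. It hunts at night.\n\n"
    "Our garage services every engine type. Book an engine check today."
)

documents = [document]
passages = document.split("\n\n")  # assumption: a blank line marks a passage boundary

doc_level = cooccurrence(documents, "jaguar", "engine")
passage_level = cooccurrence(passages, "jaguar", "engine")
print(doc_level, passage_level)  # 1 0
```

At the document level the two terms co-occur once; at the passage level they never do. A count taken across whole documents would therefore pair them even though they are not about the same topic.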

Back to Woods’s paper. 

As mentioned by Woods, there are very few synonyms in English or in any language for that matter. Thus, the arbitrary use of a synonym lookup list or a standard synonym thesaurus can actually degrade retrieval performance. To top it off, the use of terms incorrectly classified as synonyms or related terms can unnecessarily force a tradeoff between precision and recall when the differences between “synonyms” are not clear, or can give the impression that such differences do not matter.

Consider the synonym set S consisting of the following six keywords (k1…k6):

S = {automobile, car, truck, bus, taxi, motor vehicle}

k6 = “motor vehicle” is more general, broader in scope, and in this case summarizes the others. Thus, we say that k6 subsumes the other keywords.

For example, if you were to query k6 = “motor vehicle” then you would probably expect to pick up hits for the following target terms k2 = “car,” k3 = “truck,” and k4 = “bus” since k6 subsumes these. Thus, a query for k6 retrieves all kinds of motor vehicles in the set.

In contrast, if you queried k1 = “automobile” then you would probably not want to get the k3 = “truck” and k4 = “bus” target terms, but perhaps k2 = “car” and k5 = “taxi”.

Note that I use the qualifiers “probably” and “perhaps” since I don’t know your real intentions, what exactly is on your mind, or your real information needs. Thus, the above is a subjective statement.

The important point is this: don’t drink hearsay and drive your “SEO” vehicle, because subsumption is not a two-way street.
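The precision cost of treating S as a flat synonym list can be sketched in a few lines of Python. The snippet expands a query the way a naive lookup-based expansion might and shows what a search for “automobile” then matches. The mini-collection, the document ids, and the helper names are hypothetical and exist only for this example.

```python
# The set S from the text, treated (incorrectly) as interchangeable synonyms.
S = {"automobile", "car", "truck", "bus", "taxi", "motor vehicle"}

# Hypothetical mini-collection: document id -> the single term it is about.
DOCS = {"d1": "car", "d2": "taxi", "d3": "truck", "d4": "bus"}

def expand_with_synonyms(query):
    """Naive expansion: any member of S is replaced by all of S."""
    return set(S) if query in S else {query}

def search(terms):
    """Return ids of documents whose term is among the search terms."""
    return sorted(doc for doc, term in DOCS.items() if term in terms)

hits = search(expand_with_synonyms("automobile"))
print(hits)  # ['d1', 'd2', 'd3', 'd4']
```

Under the reading given above, someone asking for “automobile” probably did not want d3 (truck) and d4 (bus), so the flat expansion buys recall at the expense of precision.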

Understanding Subsumption

Conceptual subsumption is a notion of generality, wherein a more general term is said to subsume a more specific term. Think of this as a data structure of the form

broader terms > narrower terms > specific terms

Such a relationship of generality often avoids the problems caused by synonymy. Thus, a conceptual thesaurus based on hierarchical relationships is preferred over one based on frequency counts.

We can apply subsumption to improve both documents and queries. However, don’t go overboard: this must be done on a case-by-case basis (collections, queries, documents tested).
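One minimal way to picture such a structure in Python is a mapping from each term to its immediately broader term. The particular hierarchy below (for example, placing “taxi” and “car” under “automobile”) is an assumption made for illustration, not something taken from Woods’s paper, and the names `BROADER`, `chain`, and `subsumes` are hypothetical.

```python
# Hypothetical hierarchy: each term points to its immediately broader term.
BROADER = {
    "automobile": "motor vehicle",
    "truck": "motor vehicle",
    "bus": "motor vehicle",
    "car": "automobile",   # assumption: car sits under automobile
    "taxi": "automobile",  # assumption: taxi sits under automobile
}

def chain(term):
    """List the term followed by its broader terms, most specific first."""
    path = [term]
    while path[-1] in BROADER:
        path.append(BROADER[path[-1]])
    return path

def subsumes(general, specific):
    """A term subsumes itself and everything below it in the hierarchy."""
    return general in chain(specific)

print(" > ".join(reversed(chain("taxi"))))
# motor vehicle > automobile > taxi
print(subsumes("motor vehicle", "truck"), subsumes("truck", "motor vehicle"))
# True False
```

The second line of output shows the asymmetry: the broader term subsumes the narrower one, but not the other way around.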

In 2004 we described all this in the seminal paper On-Topic Analysis – Online Discovery of On-Topic Terms. Note that the key elements of on-topic analysis are well established in the literature.

In the case of query reformulation the goal is this:

Use terms that are at least as specific as the query, keeping in mind that a term subsumes itself and subsumes any true synonyms that it may have. As Woods states:

“Thus, subsumption is more general than synonymy (i.e., subsumption subsumes synonymy). True synonymy is equivalent to mutual subsumption. If a retrieval system is designed to retrieve all items that are subsumed by a request, then the information seeker has a way of controlling the level of generality of the search by choosing the level of generality of the query terms, thus avoiding a major source of precision/recall trade-off.”
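To make the quoted idea concrete, here is a small Python sketch of a retrieval function that returns everything subsumed by the request, with true synonymy modelled as mutual subsumption. The links, the documents, and the decision to treat “automobile” and “car” as true synonyms are assumptions for this example only; Woods’s system is considerably richer than this.

```python
# Hypothetical direct-subsumption links: term -> terms it directly subsumes.
# "automobile" and "car" subsume each other, modelling them as true synonyms
# (an assumption made only for this example).
SUBSUMES = {
    "motor vehicle": {"automobile", "truck", "bus"},
    "automobile": {"car", "taxi"},
    "car": {"automobile"},
}

def subsumed_by(term):
    """All terms subsumed by `term`, including the term itself."""
    seen, stack = {term}, [term]
    while stack:
        for child in SUBSUMES.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def are_synonyms(a, b):
    """True synonymy as mutual subsumption."""
    return b in subsumed_by(a) and a in subsumed_by(b)

def retrieve(query, docs):
    """Ids of documents indexed under any term the query subsumes."""
    wanted = subsumed_by(query)
    return sorted(d for d, term in docs.items() if term in wanted)

DOCS = {"d1": "car", "d2": "taxi", "d3": "truck", "d4": "bus"}  # hypothetical
print(retrieve("motor vehicle", DOCS))  # ['d1', 'd2', 'd3', 'd4']
print(retrieve("automobile", DOCS))     # ['d1', 'd2']
print(retrieve("truck", DOCS))          # ['d3']
```

The searcher controls how much comes back simply by choosing a more general or more specific query term, which is the control over generality that the quote describes.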

Once one understands the differences and limitations of synonymy and subsumption, one can move on and try to understand algorithms that deal with conceptual indexing.

I recommend that graduate students, SEOs, and the public read Woods’s paper. These days many talk about patents, papers, and algorithms on phrase extraction and indexing, not knowing that his 1997 manuscript already discusses many of the key elements found in such patents and papers. They are at least a decade late to the dance.

Read the 1997 paper. It is still a goldmine today. Here is an excerpt, this time regarding the much-hyped “semantic distance”:

“However, no single definition of semantic distance has seemed to have any particular claim to being more correct than another. Indeed, it appears that whether two concepts are similar or not depends on the purposes for which they are being compared. For example, a hammer is close to a hatchet for the purposes of pounding a nail, but not for the purposes of chopping down a tree. What might be a good measure of semantic similarity for one user’s purposes might be ill-suited to the purposes of another.”

Now, where have I heard the hammer-hatchet analogy before? Hmm…!