I came across a 2003 paper on mining WWW text links, in which researchers introduced THESUS. Really interesting piece.

The abstract of THESUS: ORGANIZING WEB DOCUMENT COLLECTIONS BASED ON LINK SEMANTICS follows:

“Abstract. The requirements for effective search and management
of the WWW are stronger than ever. Currently Web
documents are classified based on their content not taking into
account the fact that these documents are connected to each
other by links. We claim that a page’s classification is enriched
by the detection of its incoming links’ semantics. This
would enable effective browsing and enhance the validity of
search results in the WWW context. Another aspect that is
underaddressed and strictly related to the tasks of browsing
and searching is the similarity of documents at the semantic
level. The above observations lead us to the adoption of
a hierarchy of concepts (ontology) and a thesaurus to exploit
links and provide a better characterization of Web documents.
The enhancement of document characterization makes operations
such as clustering and labeling very interesting. To this
end, we devised a system called THESUS. The system deals
with an initial sets of Web documents, extracts keywords from
all pages’ incoming links, and converts them to semantics
by mapping them to a domain’s ontology. Then a clustering
algorithm is applied to discover groups of Web documents.
The effectiveness of the clustering process is based on the
use of a novel similarity measure between documents characterized
by sets of terms. Web documents are organized into
thematic subsets based on their semantics. The subsets are
then labeled, thereby enabling easier management (browsing,
searching, querying) of the Web. In this article, we detail the
process of this system and give an experimental analysis of
its results.”

Overall, this is an interesting piece of research. However, I found at least one minor point that needs clarification. It is where they state:

“The traditional cosine measure from the Information Retrieval literature (see [SM83]) has the same behavior as the Jaccard coefficient. As a matter of fact, it can be viewed as a direct application of the Jaccard coefficient.”

I find it hard to believe peer reviewers didn’t ask for a clarification. Before CS grad students and others take that statement at face value, they should think about this:

First, a cosine similarity measure reduces to a product-moment (Pearson) correlation coefficient when the vectors are mean-centered. This is not the case with Jaccard’s Coefficient.
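A minimal sketch of this first point, assuming two made-up numeric vectors (x and y below are illustrative only and do not come from the THESUS paper). It computes the correlation coefficient from its textbook definition and compares it with the cosine of the mean-centered vectors:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(u):
    # Subtract the mean from every component.
    mean = sum(u) / len(u)
    return [a - mean for a in u]

def pearson(u, v):
    # Product-moment correlation computed from covariance and deviations.
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    dev_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    dev_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (dev_u * dev_v)

# Hypothetical term-weight vectors, chosen only for illustration.
x = [1.0, 0.0, 2.0, 3.0]
y = [0.0, 1.0, 1.0, 4.0]

print("cosine of raw vectors:     ", cosine(x, y))
print("cosine of centered vectors:", cosine(center(x), center(y)))
print("Pearson correlation:       ", pearson(x, y))
```

The last two printed values agree up to floating-point rounding, while the cosine of the raw vectors is in general a different number.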

Second, Jaccard’s Coefficient runs from 0 to 1, while a cosine similarity can take values between -1 and +1. An example of the latter is found in cosine measures computed in LSI, which was not yet around back in the days of reference [SM83].
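A small sketch of the difference in range, under the assumption that Jaccard is computed on sets of terms and cosine on numeric vectors whose components may be negative (as coordinates in a reduced LSI space can be). The sets and vectors below are made up for illustration:

```python
import math

def jaccard(a, b):
    # Jaccard's Coefficient: size of the intersection over size of the union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)

# Hypothetical term sets: disjoint, overlapping, identical.
print(jaccard({"web", "link"}, {"cluster"}))
print(jaccard({"web", "link"}, {"link", "cluster"}))
print(jaccard({"web", "link"}, {"web", "link"}))

# Hypothetical vectors with negative components, pointing in opposite directions.
u = [3.0, -4.0]
v = [-3.0, 4.0]
print(cosine(u, v))
```

With only non-negative components, as with raw term weights, the cosine stays between 0 and 1. The negative end of the range only shows up once components can be negative.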

Third, Jaccard’s Coefficients can be stored in vectors, and from these vectors cosine similarity measures can be computed. This is the case, for example, with a term-term correlation matrix populated with Jaccard’s Coefficients. Such matrices are well known in cluster analysis.
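To make this third point concrete, here is a toy sketch. The terms and document ids are invented for illustration, and the layout (each term mapped to the set of documents containing it) is just one assumed way to build such a matrix. Each row of the matrix is a vector of Jaccard’s Coefficients, and cosine similarity is then computed between rows:

```python
import math

def jaccard(a, b):
    # Jaccard's Coefficient between two non-empty sets of document ids.
    return len(a & b) / len(a | b)

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)

# Hypothetical postings: term -> ids of the documents that contain it.
postings = {
    "web": {1, 2, 3},
    "link": {2, 3},
    "cluster": {3, 4},
}
terms = list(postings)

# Term-term matrix populated with Jaccard's Coefficients.
matrix = [[jaccard(postings[s], postings[t]) for t in terms] for s in terms]

for term, row in zip(terms, matrix):
    print(term, [round(value, 3) for value in row])

# Cosine similarity between the "web" and "link" rows of that matrix.
web_row, link_row = matrix[0], matrix[1]
print("Jaccard(web, link):        ", round(matrix[0][1], 3))
print("cosine(web row, link row): ", round(cosine(web_row, link_row), 3))
```

The two numbers printed at the end differ: the first is a single Jaccard entry of the matrix, the second is a cosine computed over whole rows of Jaccard entries.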
