Title: Altering Document Term Vectors for Classification – Ontologies as Expectations of Co-occurrence

Authors: Meenakshi Nagarajan, Amit Sheth (LSDIS Lab, Dept. of Computer Science, University of Georgia, Athens, GA, USA); Marcos Aguilera, Kimberly Keeton, Arif Merchant, Mustafa Uysal (HP Labs, Palo Alto, CA, USA)

This new study, presented at WWW2007 in Banff, Canada, confirms the importance of co-occurrence, this time in relation to ontologies.

The abstract states:

“Document Classification, the process of classifying documents into pre-defined categories is one of the most popular tasks aimed at grouping and retrieving similar documents. Like many Information Retrieval (IR) tasks, classification techniques rely on using content independent metadata e.g. author, creation date etc. or content dependent metadata i.e. words in the document. One of the most common challenges is that classifiers tend to be inherently limited by the information that is present in the documents. While past research has made use of external dictionaries and topic hierarchies to augment the information that classifiers work with, there is still considerable room for improvement.”

“This work is an investigative effort towards exploring the use of external semantic metadata available in Ontologies in addition to the metadata central to documents, for the task of supervised document classification. The aim is to go beyond “word co-occurrence / synonym / hierarchical” similarity based classification systems to a one where the semantic relatedness between terms explicated in Ontologies is exploited. We study the effectiveness of our approach on categories in the national security domain trained using documents from the Homeland Security Digital Library. Preliminary evaluations and results indicate that the technique generally improves precision and recall of classifications in some cases and illustrates when the use of Ontologies for such a task does not add significant value.”

In explaining how ontologies are used to manipulate term vectors, the authors offer an interesting example involving Abu Sayyaf and Iraq, which makes good use of their proposed model.

The article includes a walk-through example.

I found it interesting that in their Case 1, the term vector model does not change the total weight of any term Ti. They justify this with the statement:

“Not changing the weight of Ti is in line with our intuition that the weights of terms are affected only by terms that are in the document.”

I am happy to see my Cosine Similarity and Term Weights Tutorial referenced by the authors.
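For readers who have not seen that tutorial, here is a minimal sketch of cosine similarity over term-weight vectors, plus a toy alteration step. To be clear, this is not the authors' formula: the `boost_related` helper, the `factor` value, the related-term pair, and all weights are assumptions made up for illustration. The only idea borrowed from the paper is the quoted intuition that a term's weight should be affected only by terms that are actually in the document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def boost_related(vector, related_pairs, factor=2.0):
    """Hypothetical alteration: for each (source, target) pair, scale the
    target's weight only when BOTH terms already occur in the document.
    No new terms are added and the source term's weight is left as it was."""
    boosted = dict(vector)
    for source, target in related_pairs:
        if source in vector and target in vector:
            boosted[target] = vector[target] * factor
    return boosted


# Made-up document and category vectors (weights are illustrative only).
doc = {"abu sayyaf": 1.0, "philippines": 1.0, "bombing": 1.0}
category = {"abu sayyaf": 1.0, "philippines": 2.0}

# Made-up ontology relation: "abu sayyaf" is related to "philippines".
related = [("abu sayyaf", "philippines")]

before = cosine_similarity(doc, category)
altered = boost_related(doc, related)
after = cosine_similarity(altered, category)
print(f"before: {before:.4f}, after: {after:.4f}")
```

In this toy setup the altered document vector ends up closer to the category vector, because the boosted term is one the category also weights heavily. Whether such a boost helps in practice depends on the ontology and the categories, which is the question the paper evaluates.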

This is a legacy post originally published on 2/7/2007.