I am finishing the 2001 MIT master's thesis of Cameron Alexander Marlow:
A Language-Based Approach to Categorical Analysis
where he proposes combining Synchronic Imprints (SI) with latent semantic indexing (LSI). Great thesis. Essentially, SI incorporates a spring model in which the distance between two terms is inversely proportional to how often they occur together.
“This research presents a new computational model of electronic text, called a synchronic imprint, which uses structural information to contextualize the meaning of words. Every concept in the body of a text is described by its relationships with other concepts in the same text, allowing classification systems to distinguish between alternative meanings of the same word. This representation is applied to both the standard problem of text classification and also to the task of enabling people to better identify large bodies of text. The latter is achieved through the development of a visualization tool named flux that models synchronic imprints as a spring network.”
Here is how the thesis introduces SI:
2.2 Synchronic imprints
“A different model for representing text, introduced in this thesis is the synchronic imprint. The impetus of the last example let to the realization that structure of language is important in defining meaning; instead of merely using words, a new representation was built with the goal of capturing characteristics related to the arrangement of words, in addition to the words themselves. This structure is explicitly the intrasentence relation of two nouns. In the previous sentence, “structure,” “relation” and “nouns” are all nouns related to each other by the coherent meaning of that sentence. In a synchronic imprint, that sentence would be represented by three structural links: “structure-relation,” “structure-noun” and “relation-noun.” Each of these symbolizes a semantic link between the words, which could also be seen as a contextual feature for interpreting the word.”
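To make the idea concrete, here is a minimal sketch of how such links could be built. It assumes the nouns of each sentence have already been extracted and stemmed upstream (the thesis example writes "noun" rather than "nouns"); the function name `imprint_links` and the choice to store each pair in alphabetical order are my own conventions for illustration, not Marlow's.

```python
from collections import Counter
from itertools import combinations


def imprint_links(sentences):
    """Count intrasentence noun-noun links.

    `sentences` is a list of noun lists, one list per sentence.
    Each unordered pair of distinct nouns in a sentence yields one link.
    """
    links = Counter()
    for nouns in sentences:
        # Deduplicate and sort so "a-b" and "b-a" map to the same key.
        for a, b in combinations(sorted(set(nouns)), 2):
            links[f"{a}-{b}"] += 1
    return links


# The first list mirrors the sentence quoted above; the second is made up.
sentences = [
    ["structure", "relation", "noun"],
    ["structure", "language", "meaning"],
]
links = imprint_links(sentences)
for link, count in sorted(links.items()):
    print(link, count)
```

Each key can then be used as a feature in place of, or alongside, the individual words.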
So how could one use SI with LSI? Read the thesis.
SI + LSI
“Considering the strengths of latent semantic indexing in addressing synonymy, and the ability of synchronic imprints in tackling polysemy, an interesting experiment would result from the synergy of these two techniques. As noted in chapter 3, LSI intensifies the effect of polysemy by sometimes combining two dimensions, one of which has two meanings, the other of which is only related to one of those meanings. In this case, one of the senses of the first dimension is completely lost, and the overall meaning of the new dimension conflated.”
Indeed, LSI handles synonymy well but has problems with polysemy. According to Marlow, since SI addresses polysemy, one could combine LSI with SI to improve retrieval. Marlow states:
“In order to alleviate this problem in LSI, the text could first be represented as a synchronic imprint, thus expanding each of the polysems into its unique imprint terms. Many of the synchronic imprints features might be combined simply by the LSI technique, but in the case of polysems, the distinct meanings would remain intact after recombination.”
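A rough sketch of that pipeline, under heavy assumptions: the four toy documents below are invented, each is already reduced to imprint features, and LSI is approximated by a plain truncated SVD from NumPy with k = 2. It is not the experimental setup of the thesis, only an illustration of why the two senses of "bank" need not collapse once they are expanded into distinct imprint terms.

```python
import numpy as np

# Hypothetical toy corpus, already expressed as imprint features.
docs = [
    ["bank-river", "bank-water", "river-water"],  # river sense
    ["bank-river", "river-water"],                # river sense
    ["bank-money", "bank-loan", "loan-money"],    # finance sense
    ["bank-money", "bank-loan"],                  # finance sense
]

features = sorted({f for d in docs for f in d})
row = {f: i for i, f in enumerate(features)}

# Feature-by-document count matrix.
X = np.zeros((len(features), len(docs)))
for j, d in enumerate(docs):
    for f in d:
        X[row[f], j] += 1.0

# Truncated SVD: keep the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print("same sense:     ", round(cosine(doc_vecs[0], doc_vecs[1]), 3))
print("different sense:", round(cosine(doc_vecs[0], doc_vecs[2]), 3))
```

With plain word features, all four documents would share the term "bank"; with imprint features the river and finance documents share nothing, so the reduced space keeps them apart.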
An interesting thesis, and one that can be improved.
For instance, instead of using unnormalized co-occurrences as he did, one could use normalized co-occurrences (c-indices) as described in the On-Topic Analysis.
Another method consists of constructing an association cluster matrix, obtaining Jaccard coefficients from it, and incorporating these into the spring model. The main difference between the two approaches is that with c-indices one essentially uses a binary approach.
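A minimal sketch of the second method, assuming the usual association-cluster formulation: the unnormalized correlation of two terms is the sum over documents of the product of their frequencies, c(u,v), and the normalized, Jaccard-style coefficient is c(u,v) / (c(u,u) + c(v,v) - c(u,v)). The frequencies below are made up, and `association_jaccard` is a hypothetical helper name.

```python
from itertools import combinations


def association_jaccard(freqs):
    """Normalized association coefficients for every pair of terms.

    `freqs` maps each term to its list of per-document frequencies
    (all lists must have the same length).
    """
    def c(u, v):
        # Unnormalized association: sum of frequency products over documents.
        return sum(fu * fv for fu, fv in zip(freqs[u], freqs[v]))

    coeffs = {}
    for u, v in combinations(sorted(freqs), 2):
        cuv = c(u, v)
        denom = c(u, u) + c(v, v) - cuv
        coeffs[(u, v)] = cuv / denom if denom else 0.0
    return coeffs


# Hypothetical frequencies of three terms across three documents.
freqs = {
    "bank": [2, 1, 0],
    "river": [1, 0, 0],
    "loan": [0, 2, 1],
}
coeffs = association_jaccard(freqs)
for pair, value in coeffs.items():
    print(pair, value)
```

The coefficients fall between 0 and 1, so they could serve as spring strengths in place of raw co-occurrence counts; how exactly to map them to spring lengths is left open here.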
This is a legacy post originally published on 8/5/2006