I’m putting together a piece on several local term weight models. It should be ready in a few weeks.
It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.
The article touches on some aspects of the problem of trusting models that lack attenuation. Here is one snippet on the subject:
<last nail in KD coffin style=”intensity:100%;”>
“It should be stressed that term repetition does not necessarily satisfy users’ queries, nor is it evidence of:
Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.
Aboutness (A); e.g., that the document is x times more about the term.
Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.
Relevance (R); i.e., that a document repeating a term x times is x times more relevant.
Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks attenuation.”
</last nail in KD coffin>
What does it have to do with VSM? Do you use them to calculate these metrics?
I’m not sure what you are asking. If you are asking whether I use KD, the answer is no.
I was asking whether you use this in a vector space model, and what you use it for.
It would be hard to address PAIR with VSMs for at least two reasons:
(a) most VSMs assume term independence, and
(b) most are bag-of-words models.
The point to be made is that local models that take term repetition (raw frequencies) as the local weights incorporated into Vector Space Models are susceptible to manipulation, since repeating a term inflates its weight, and thus are not reliable. The article touches on some of these issues.
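To make the contrast concrete, here is a minimal sketch comparing a raw-frequency local weight with an attenuated one. The function names and the 1 + ln(f) form are assumptions picked for illustration (a commonly used sublinear scaling); they are not necessarily among the models derived in the article.

```python
import math

def raw_weight(freq: int) -> float:
    """Local weight equal to the raw term frequency (no attenuation)."""
    return float(freq)

def log_weight(freq: int) -> float:
    """Log-attenuated local weight: 1 + ln(f) for f > 0, otherwise 0.

    Hypothetical choice of attenuation, used here only as an example.
    """
    return 1.0 + math.log(freq) if freq > 0 else 0.0

def weight_table(freqs):
    """Return (frequency, raw weight, attenuated weight) rows."""
    return [(f, raw_weight(f), log_weight(f)) for f in freqs]

if __name__ == "__main__":
    for f, raw, att in weight_table([1, 2, 10, 100]):
        print(f"f={f:>3}  raw={raw:6.1f}  log-attenuated={att:5.2f}")
```

With the raw weight, repeating a term 100 times multiplies its weight by 100. With this particular attenuation the same repetition multiplies it by less than 6, so keyword repetition buys far less.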
I really appreciate you spending your time answering my doubts. Now I completely get it. Thanks.
You’re welcome.
Examples of implementations that take plain raw frequencies as local term weights without incorporating attenuation transformations (and thus are susceptible to keyword repetition) include some latent semantic indexing (LSI) and Vector Space models, as well as the readability metric known as keyword density (KD).
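As a rough illustration of that susceptibility, the sketch below computes KD under the assumption that it is defined as the number of occurrences of a term divided by the total number of words. The function name, the whitespace tokenization, and the sample strings are all made up for the example.

```python
def keyword_density(text: str, term: str) -> float:
    """Fraction of whitespace-separated tokens equal to `term` (case-insensitive).

    Assumes KD = occurrences of term / total words; other definitions exist.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)

# Hypothetical sample text, then the same text with one term repeated.
original = "cheap flights to madrid and hotel deals"
stuffed = original + " flights" * 7

if __name__ == "__main__":
    print(f"original: {keyword_density(original, 'flights'):.3f}")
    print(f"stuffed:  {keyword_density(stuffed, 'flights'):.3f}")
```

Nothing about the stuffed text is more pertinent or relevant than the original, yet its KD for the repeated term is four times higher, because the measure responds directly to raw repetition.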