On June 7, I’ll be speaking before Fundación de Investigación (http://www.fundaciondeinvestigacion.com/), the research arm of the University of PR’s School of Medicine.
I’ll be presenting on Data Mining in Clinical Trials and how we can use the Self-Weighting Model for modeling the extracted data. The audience will be medical doctors, medical technologists, and chemists.
As we can see, data mining is not just about crunching numbers for search engines and search marketing. It is all about using the same models, but for solving different problems.
Hi Dr. E. Garcia,
I do not know if you can see this, but I hope you can. I have a question regarding the vector space model. I have been doing research on it, and in my framework I want to split the long index vector into smaller ones. I am not sure this is proper, since the sum of the cosine values from the sub-vectors is apparently different from the cosine calculated from the original long vector. Or, say I care about only a few keywords that reside in the same small vector: I evaluate only that vector and rank the documents by its cosine value. However, the resulting order may differ from the order given by the cosine values computed from the long vector consisting of all the extracted keywords.
I have searched on Google, but found no relevant information. Could you share some thoughts on this issue?
Many thanks!
Hi, Sun:
Thank you for stopping by.
Normally we don’t need to do a comprehensive search and match from the index.
There are several different divide-and-conquer techniques that we can use.
One thing that we can do instead is to identify all posting lists from the index that match at least one term of a query. Let’s call this a subindex with a minimum coordination number of n = 1; i.e., subindex(n=1). Assume that the query consists of m terms such that m > n.
Next we have three different ways to proceed. We may construct a term-doc matrix from:
1. the subindex (OR match)
2. all identified posting lists from the subindex that match all search terms (AND match)
3. all identified posting lists from the subindex that match some of the search terms.
Next, we rank docs according to query-doc cosine similarities.
We can stratify this technique using coordination numbers such that
subindex(n = 1)
subindex(n = 2)
:
:
subindex(n = m)
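The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the function names (build_index, subindex, rank), the toy documents, and the use of raw term counts as weights are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a posting list {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def subindex(index, query_terms, n=1):
    """Return ids of docs matching at least n distinct query terms
    (n is the minimum coordination number)."""
    matches = Counter()
    for term in set(query_terms):
        for doc_id in index.get(term, {}):
            matches[doc_id] += 1
    return {doc_id for doc_id, count in matches.items() if count >= n}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(docs, index, query, n=1):
    """Rank only the docs in subindex(n) by query-doc cosine similarity."""
    query_terms = query.lower().split()
    q_vec = Counter(query_terms)
    scored = []
    for doc_id in subindex(index, query_terms, n):
        d_vec = Counter(docs[doc_id].lower().split())
        scored.append((cosine(q_vec, d_vec), doc_id))
    return sorted(scored, reverse=True)

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "data mining in clinical trials",
    "d2": "vector space model for search",
    "d3": "cosine similarity in the vector space model",
}
index = build_index(docs)
query = "vector space cosine"

or_match = rank(docs, index, query, n=1)   # subindex(n = 1), OR match
and_match = rank(docs, index, query, n=3)  # subindex(n = m), AND match
```

Here n = 1 gives the OR match and n = m (the number of distinct query terms) gives the AND match. The values in between give the partial matches.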
Hi Dr. E. Garcia,
Thank you for the reply.
I think this is an inverted index plus cosine measure, which certainly can resolve my problem. Unfortunately, in our search context the preferred way is: 1. split the long vector; 2. compute the cosine measure on each small vector; 3. sum the cosine values from the corresponding small vectors. So my concern is whether this is an appropriate similarity measure, that is, whether it has physical meaning and makes sense. In my opinion, this divide-and-conquer method is a kind of variation of the cosine measure.
In fact, the cosine measure yields a different rank order of documents for index vectors that have different numbers of keywords. It depends on how many words you extract from the document set. So to some extent, even if our final similarity score is computed by summing scores from many vectors, as in our case, we can still do similarity evaluation.
Could you tell me your email address? Mine is whsun.xd@gmail.com
Hi, there:
I need additional information or a working example. However, you mentioned the summation of cosines several times. Unfortunately, that won’t work, as cosines are not additive at all. So your suggested procedure is not valid.
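A small numeric check makes the point concrete. The sketch below uses made-up four-term vectors split into two halves. The vectors, the split size, and the helper names are assumptions for illustration, not anything taken from Sun's framework. It also shows what does add up across sub-vectors: the dot products and the squared norms, from which the full cosine can be rebuilt.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def split(vec, size):
    """Cut a vector into consecutive sub-vectors of the given size."""
    return [vec[i:i + size] for i in range(0, len(vec), size)]

def summed_sub_cosines(q, d, size):
    """The procedure under discussion: sum the cosines of matching sub-vectors."""
    return sum(cosine(qs, ds) for qs, ds in zip(split(q, size), split(d, size)))

def cosine_from_parts(q, d, size):
    """Accumulate partial dot products and squared norms, then divide once."""
    dot = q_sq = d_sq = 0.0
    for qs, ds in zip(split(q, size), split(d, size)):
        dot += sum(a * b for a, b in zip(qs, ds))
        q_sq += sum(a * a for a in qs)
        d_sq += sum(b * b for b in ds)
    return dot / math.sqrt(q_sq * d_sq) if q_sq and d_sq else 0.0

# Hypothetical term weights, chosen only to expose the rank reversal.
query = [1, 1, 1, 1]
doc_a = [2, 2, 2, 1]
doc_b = [9, 9, 1, 1]

full_a, full_b = cosine(query, doc_a), cosine(query, doc_b)
sum_a = summed_sub_cosines(query, doc_a, 2)
sum_b = summed_sub_cosines(query, doc_b, 2)

print(f"full cosine:     doc_a={full_a:.3f}  doc_b={full_b:.3f}")
print(f"summed cosines:  doc_a={sum_a:.3f}  doc_b={sum_b:.3f}")
```

With these vectors the full cosine ranks doc_a above doc_b, while the summed sub-vector cosines rank doc_b above doc_a. The summed score also exceeds 1, so it is no longer a cosine. The reason is that each sub-vector cosine is normalized by its own local lengths, which throws away how much each sub-vector contributes to the whole. If the long vector has to be split, one option is to keep the partial dot products and partial squared norms per sub-vector and divide only at the end, as cosine_from_parts does.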