Professor Stephen Boyd has a nice summary of Crimes Against Matrices. Are you guilty of one?
Back to blogging. I’ve been very busy putting together a paper on a weighting model and answering feedback received from colleagues on it.
So this might explain why the January IRW newsletter is delayed. It should arrive subscribers inboxes during the day. The February issue will be out in about one week. These are back to back issues on Statistical Analysis of N-Grams.
Part 1:N-Grams & Contingency Tables
Part 2: N-Grams & Association Measures
On other matters, a PhD student published few years ago an excellent application of the Vector Space Model applied to Protein Analysis. You can revisit the post at http://irthoughts.wordpress.com/2008/11/12/vector-space-model-and-protein-retrieval/ .
If others have other applications of VSM in other disciplines, let me know. I’m interested in multidisciplinary stuff.
Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.
I’m currently testing a ranking algorithm for an IR system built over the last years. The answer set is based on a simple matching (SM) search strategy.
Mistake not a simple matching strategy for a simple or basic search approach as it can evolve into the most complex one.
Unlike classic boolean searches (i.e., AND, OR, XOR), SM is suitable for constructing answer sets and subsets based on coordination levels. Add a supporting scoring function (tf-IDF derivatives, RSJ-PM, BM25, etc) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.
Proper fine tuning allows presenting end-users with answer sets wherein AND results are accumulated at the top of the search results. As users move down the search results, they are presented with OR results and the search experience is perceived as if the system expands the answer set by switching query modes.
I’ve also added a query reduction mechanism for discoverying related searches. Brazilian Wax, nice!
In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.
Next step is to check if with a large corpus and a thesaurus, results compare favorably with results from search engines that claim to use semantics.
So far, my one is cost effective and does not require of extra libraries.
PS: I forget to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!
1. The current issue of IR Watch will be out over the weekend–a bit delayed due to getting ready for school, preparing lessons and research projects. If things go as expected, my academic schedule will be a bit busy between teaching and research at two different universities.
2. I’m researching for a manuscript that deals with affine transformations applied to several IR problems. It expands on Vector Space Theory and allows one to think out of the “term-document” box. Great stuff.
4. Talking about semantically motivated, sentiment analysis, spam, etc… Funny how some folks in the SEO world like to damage the reputation of others without presenting any evidence. This time the trolls took on Kim Krause Berge ( http://cre8pc.com/archives/1489 ). I always admire Kim’s work, consider her an usability icon, and had the privilege of meeting her back in 2005. I was surprised to see these folks having a field day at her expense at Rand’s site. Kim, I feel your pain. However, more than one SEO forum/blog had lose credibility by allowing these folks, most of which think they can be socially “ranked” by attacking whoever is at the “top”. The fact is that most trolls are paper tigers that go hidding at the first Cease & Desist or defamation lawsuit.
It is surprising how even serious information retrieval researchers and journals quote papers that were never written!
This is the thesis of David Dubin’s 2004 great article
The Most Influential Paper Gerard Salton Never Wrote
“In giving credit to Salton for the vector model, a number of authors cite an overview paper titled “A Vector Space Model for Information Retrieval,” which some show as published in the JASIS in 1975 and others as published in the Communications of the Association for Computing Machinery (CACM) in 1975. In fact, no such article was ever published, and citations to it usually represent a confusion of two 1975 articles (Salton, Wong, & Yang, 1975; Salton, Yang, & Yu, 1975), neither of which were overviews of the VSM as it is generally understood (see section 5 below). Some of Salton’s own colleagues have been guilty of this mistake: both Cardie et al. and Singhal cite the CACM version, for example (Singhal, 2001; Cardie, Ng, Pierce, & Buckley, 2000). The paper is even cited in a few of the very last articles on which Salton is listed as a coauthor (Singhal, Salton, Mitra, & Buckley, 1996; Singhal & Salton, 1995). These papers were published close to or shortly after the time of his death, and so the errors cannot be blamed on Salton (remembered by his colleagues as a very careful and meticulous writer).”
Somehow far too many IRs misquote Salton’s 1975 paper titled “A vector space model for automatic indexing“. This causes digital libraries to create a spurious record attached to many cross-referenced articles.
I searched Google for “a vector space model for information retrieval” + salton and indeed there are many reputed publications and researchers citing a paper that was never published! What a shame.
That says a lot about researchers, editors, and reviewers that were lazy enough to never bother about the accuracy of the references.
I’m putting together a piece on several local term weight models. It should be ready in few weeks.
It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.
The article touches on some aspects of the problem of trusting models that lack of attenuation. Here is one snippet on the subject:
<last nail in KD coffin style=”intensity:100%;”>
“It should be stressed that term repetition not necessarily satisfies users’ queries nor is evidence of:
Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.
Aboutness (A); e.g., that the document is x times more about the term.
Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.
Relevance (R);i..e., that a document repeating a term x times is x times more relevant.
Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks of attenuation.”
</last nail in KD coffin>
There is a kind of buzz about Probabilistic Latent Semantics Indexing, so this post goes.
From VSM to LSI
Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In this term space, it is not possible to assign a position to terms simply because these are the dimensions of the space. Coordinate values assigned to document and query vectors are given by terms weights computed using a particular weighting scheme.
VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymity and (b) polysemy.
In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan’s 1965 SVD algorithm to “documents” exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).
LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The “documents” were not HTML Web documents (there were no Web documents back then), but just abstracts and memos from specific knowledge domains (HCI, scientific, med). As expected these consisted of synonyms and related terms used in these domains. Thus, clusters of these were obtained.
It was immediately claimed that LSI could be used to model aspects of basic linguistic -like synonymy and polysemy- and how the human mind associates words to concepts and concepts to meaning.
Moving twenty years forward, SEOs misread such outdated research and the synonym-stuffing myth was born.
There is now a crew of SEOs claiming that they can design documents “LSI-friendly” by making these rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing out of thin air LSI Myths in order to market better whatever they sell or promote (often their own image as “experts”). Same goes for those that claim “PLSI-SEO” strategies.
Research findings suggest that what makes LSI works is first and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for how and why of the redistribution of term weights in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why “LSI-friendly” documents is plain SEO Snakeoil. Again, the same goes for those that claim “PLSI-SEO” strategies. Keep reading.
In 1998 LSI was put into question. Given a generative model of text: why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to data?
In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.
Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected “reduced description” associated with the document.
But there is a problem.
By 2003 Hofman’s PLSI model was put into question, this time by David Blei, Andrew Ng and Michael Jordan, who proposed that year the Latent Dirichlet Allocation Model (LDA). As noted by Blei, et al. (and quote) PLSI “is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. “
Blei and co-workers then stated that this leads to two problems:
1. the number of parameter in the model grows linearly with the size of the corpus, which leads to serious problems with over fitting
2. it is not clear how to assign probability to a document outside of the training set.
Thus, it is not true that PLSI is the preferred model to work with in IR, as some have claimed. In addition, the model has non-trivial theoretical flaws and limitations.
In Salton Term Vector Model as in the LSI and PLSI models word order does not matter. Documents are simply considered a “bag of words”. However, common sense dictates that this is not a valid assumption since word semantics is sensitive to word ordering. This explains why searches in Google for college junior or junior college produce far different results.
To underscore the importance of word ordering consider this: applying a similarity measure like a Jaccard Coefficient computed from a term-term matrix to the above two queries produces identical results, but again the computed similarity scores are disconnected from word semantics.
Blei and co-workers have argued that if we want to consider exchangeable representations (ordering) for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.
In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.
I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere “bags of words”. The former is IR’s Original Sin and the later is its copycat.
The image above gives me a flash back on research work I conducted in the late ’80s on sequential simplex optimization methods.
In SEOs and their IDF Myths, we covered how many are mistaking the measure of term specificity known as Inverse Document Frequency (IDF).
In SEOs and their IDF Myths: Part 2, we exposed some of these folks.
In Understanding TFIDF, we wrote a rebuttal.
We are still seeing so many bloggers mistaking IDF for something that is not. We have to conclude these pseudo-teachers either are just trying to sell something or they don’t really understand what term specificity stands for. They should know that IDF is a small pixel section within the bigger picture of the Robertson-Sparck Jones Probabilistic Model for information retrieval.
Thus, we are writing a tutorial on RSJ-PM to kill for good their intentionally misleading efforts. Hopefully, the tutorial will be ready before the month ends. It will be a great way of putting to rest all the false information flying around from the usual agents of misinformation (mostly SEOs). CS students interested in knowing about the pros and cons of probability models in IR will find it useful.
I’m reading with great interest the paper
Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, Ning Xiong, and presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management, p 305-314, COMADEM 2007, Faro, Portugal,
I’m happy to read they referenced our Tutorial on Cosine Similarity Measures. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.
Their abstract follows:
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.
I received yesterday the following email from a reader (name removed). I am reproducing it since the discussion might be of value to others with similar questions.
Hello Dr. Garcia,
I’ve seen your tutorial “PCA and SPCA Tutorial” while I was trying to
find out something about PCA. So I decided to ask this to you. I’ll be
happy if you answer.
My feature matrix contains 30 feature vectors and I want to reduce
this dimension using 95% of variance explained. Ranges of some feature vectors have great differences. For example, one feature vector’s range is about [0,10], while some others is [-10^10,10^10]. So when I directly subtract the mean and calculate covariance matrix, one of the eigenvalues suppresses the others.
Is it a proper way to scale data (z=(x-mean)/std_dev) firstly and then subtract mean of the scaled version and calculate covariance matrix?
When I try this procedure eigenvalues seem to be correct but I cannot be sure if this is a correct way or not.
Do you think that this is correct? If not, what is the correct way
using covariance matrix rather than correlation matrix?
Thanks in advance.
My answer follows.
The purpose of centering data (transforming data to z-scores) is to remove undesirable fluctuations. This is particular useful when there is a common source of error; e.g. as in a time series. Assuming this is your case, then you are doing the right thing.
An advantage of data centering is that it is part of the PCA solution of minimizing the sum of squared errors (SSE). Overall, the goal is to find the best affine linear subspace.
Centering the data has other advantages. It allows us to make cosine angles equal to Pearson’s Correlation Coeffficients so that similarity information can be explored. Also, once in a z-score form, the data can be checked to see whether it follows a normal distribution.
For additional information, check these links:
I hope this helps.