Archive for the ‘Vector Space Models’ Category

Random Notes Before School Starts

August 3, 2009

1. The current issue of IR Watch will be out over the weekend–a bit delayed due to getting ready for school, preparing lessons and research projects. If things go as expected, my academic schedule will be a bit busy between teaching and research at two different universities.

2. I’m researching for a manuscript that deals with affine transformations applied to several IR problems. It expands on Vector Space Theory and allows one to think out of the “term-document” box. Great stuff.

3. Here is a great grad project in ppt format: Semantically Motivated Information Retrieval. I thank its author for referencing my  SVD Fast Track Tutorial.

4. Talking about semantically motivated, sentiment analysis, spam, etc… Funny how some folks in the SEO world like to damage the reputation of others without presenting any evidence. This time the trolls took on Kim Krause Berge ( http://cre8pc.com/archives/1489 ). I always admire Kim’s work, consider her an usability icon, and had the privilege of meeting her back in 2005. I was surprised to see these folks having a field day at her expense at Rand’s site. Kim, I feel your pain. However, more than one SEO forum/blog had lose credibility by allowing these folks, most of which think they can be socially “ranked” by attacking whoever is at the “top”. The fact is that most trolls are paper tigers that go hidding at the first Cease & Desist or defamation lawsuit.

The Most Influential Paper that Gerard Salton Never Wrote

July 15, 2009

It is surprising how even serious information retrieval researchers and journals quote papers that were never written!

This is the thesis of David Dubin’s 2004 great article
The Most Influential Paper Gerard Salton Never Wrote

Dubin wrote:

“In giving credit to Salton for the vector model, a number of authors cite an overview paper titled “A Vector Space Model for Information Retrieval,” which some show as published in the JASIS in 1975 and others as published in the Communications of the Association for Computing Machinery (CACM) in 1975. In fact, no such article was ever published, and citations to it usually represent a confusion of two 1975 articles (Salton, Wong, & Yang, 1975; Salton, Yang, & Yu, 1975), neither of which were overviews of the VSM as it is generally understood (see section 5 below). Some of Salton’s own colleagues have been guilty of this mistake: both Cardie et al. and Singhal cite the CACM version, for example (Singhal, 2001; Cardie, Ng, Pierce, & Buckley, 2000). The paper is even cited in a few of the very last articles on which Salton is listed as a coauthor (Singhal, Salton, Mitra, & Buckley, 1996; Singhal & Salton, 1995). These papers were published close to or shortly after the time of his death, and so the errors cannot be blamed on Salton (remembered by his colleagues as a very careful and meticulous writer).”

Somehow far too many IRs misquote Salton’s 1975 paper titled “A vector space model for automatic indexing“. This causes digital libraries to create a spurious record attached to many cross-referenced articles.

I searched Google for “a vector space model for information retrieval” + salton and indeed there are many reputed publications and researchers citing a paper that was never published! What a shame.

That says a lot about researchers, editors, and reviewers that were lazy enough to never bother about the accuracy of the references.

On Term Repetition and Local Models

May 27, 2009

I’m putting together a piece on several local term weight models. It should be ready in few weeks.

It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.

The article touches on some aspects of the problem of trusting models that lack of attenuation. Here is one snippet on the subject:

<last nail in KD coffin  style=”intensity:100%;”>

“It should be stressed that term repetition not necessarily satisfies users’ queries nor is evidence of:

 Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.

Aboutness (A); e.g., that the document is x times more about the term.

Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.

Relevance (R);i..e., that a document repeating a term x times is x times more relevant.

Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks of attenuation.”

</last nail in KD coffin>

Vector Space, Probabilistic LSI, and LDA

April 3, 2009

 lda
source: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

There is a kind of buzz about Probabilistic Latent Semantics Indexing, so this post goes.

From VSM to LSI

Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In this term space, it is not possible to assign a position to terms simply because these are the dimensions of the space. Coordinate  values assigned to document and query vectors are given by terms weights computed using a particular weighting scheme.

VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymity and (b) polysemy.

In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan’s 1965 SVD algorithm to “documents” exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).

LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The “documents” were not HTML Web documents (there were no Web documents back then), but just abstracts and memos from specific knowledge domains (HCI, scientific, med). As expected these consisted of synonyms and related terms used in these domains. Thus, clusters of these were obtained.

It was immediately claimed that LSI could be used to model aspects of basic linguistic -like synonymy and polysemy- and how the human mind associates words to concepts and concepts to meaning.

Moving twenty years forward, SEOs misread such outdated research and the synonym-stuffing myth was born.

There is now a crew of SEOs claiming that they can design documents “LSI-friendly” by making these rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing out of thin air LSI Myths in order to market better whatever they sell or promote (often their own image as “experts”). Same goes for those that claim “PLSI-SEO” strategies.

Research findings suggest that what makes LSI works is first and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for how and why of the redistribution of term weights in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why “LSI-friendly” documents is plain SEO Snakeoil. Again, the same goes for those that claim “PLSI-SEO” strategies. Keep reading.

Enters Probabilistic Latent Semantic Indexing (PLSI) model

In 1998 LSI was put into question. Given a generative model of text: why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to data?

In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.

Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected “reduced description” associated with the document.

But there is a problem.

Enters Latent Dirichlet Allocation Model (LDA)

By 2003 Hofman’s PLSI model was put into question, this time by David Blei, Andrew Ng and Michael Jordan, who proposed that year the Latent Dirichlet Allocation Model (LDA). As noted by Blei, et al. (and quote) PLSI “is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. “

Blei and co-workers then stated that this leads to two problems:

1. the number of parameter in the model grows linearly with the size of the corpus, which leads to serious problems with over fitting

2. it is not clear how to assign probability to a document outside of the training set.

Thus, it is not true that PLSI is the preferred model to work with in IR, as some have claimed. In addition, the model has non-trivial theoretical flaws and limitations.

In Salton Term Vector Model as in the LSI and PLSI models word order does not matter. Documents are simply considered a “bag of words”. However, common sense dictates that this is not a valid assumption since word semantics is sensitive to word ordering. This explains why searches in Google for college junior or junior college produce far different results.

To underscore the importance of word ordering consider this: applying a similarity measure like a Jaccard Coefficient computed from a term-term matrix to the above two queries produces identical results, but again the computed similarity scores are disconnected from word semantics.

Blei and co-workers have argued that if we want to consider exchangeable representations (ordering) for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.

In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.

I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere “bags of words”. The former is IR’s Original Sin and the later is its copycat.

The image above gives me a flash back on research work I conducted in the late ’80s on sequential simplex optimization methods.

SEOs and Their IDF Myths: Part 3

March 20, 2009

In SEOs and their IDF Myths, we covered how many are mistaking the measure of term specificity known as Inverse Document Frequency (IDF).

In SEOs and their IDF Myths: Part 2, we exposed some of these folks.

In Understanding TFIDF, we wrote a rebuttal.

We are still seeing so many bloggers mistaking IDF for something that is not. We have to conclude these pseudo-teachers either are just trying to sell something or they don’t really understand what term specificity stands for. They should know that IDF is a small pixel section within the bigger picture of the Robertson-Sparck Jones Probabilistic Model for information retrieval.

Thus, we are writing a tutorial on RSJ-PM to kill for good their intentionally misleading efforts. Hopefully, the tutorial will be ready before the month ends. It will be a great way of putting to rest all the false information flying around from the usual agents of misinformation (mostly SEOs). CS students interested in knowing about the pros and cons of probability models in IR will find it useful.

A CBR Sharing Search Engine System

March 17, 2009

I’m reading with great interest the paper
Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System
, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, Ning Xiong, and presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management, p 305-314, COMADEM 2007, Faro, Portugal,

I’m happy to read they referenced our Tutorial on Cosine Similarity Measures. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.

Their abstract follows:

ABSTRACT
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.

Centering Data in PCA

March 13, 2009

I received yesterday the following email from a reader (name removed). I am reproducing it since the discussion might be of value to others with similar questions.

Hello Dr. Garcia,

I’ve seen your tutorial “PCA and SPCA Tutorial” while I was trying to
find out something about PCA. So I decided to ask this to you. I’ll be
happy if you answer.

My feature matrix contains 30 feature vectors and I want to reduce
this dimension using 95% of variance explained. Ranges of some feature vectors have great differences. For example, one feature vector’s range is about [0,10], while some others is [-10^10,10^10]. So when I directly subtract the mean and calculate covariance matrix, one of the eigenvalues suppresses the others.

Is it a proper way to scale data (z=(x-mean)/std_dev) firstly and then subtract mean of the scaled version and calculate covariance matrix?

When I try this procedure eigenvalues seem to be correct but I cannot be sure if this is a correct way or not.

Do you think that this is correct? If not, what is the correct way
using covariance matrix rather than correlation matrix?

Thanks in advance.

******

My answer follows.

The purpose of centering data (transforming data to z-scores) is to remove undesirable fluctuations. This is particular useful when there is a common source of error; e.g. as in a time series. Assuming this is your case, then you are doing the right thing.

An advantage of data centering is that it is part of the PCA solution of minimizing the sum of squared errors (SSE). Overall, the goal is to find the best affine linear subspace.

Centering the data has other advantages. It allows us to make cosine angles equal to Pearson’s Correlation Coeffficients so that similarity information can be explored. Also, once in a z-score form, the data can be checked to see whether it follows a normal distribution.

For additional information, check these links:

http://irthoughts.wordpress.com/2007/05/05/on-svd-and-pca-some-applications/

http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/

I hope this helps.

IDF and Vector Space Models

March 11, 2009

I’m back from SIDIM XXIV. It was a great conference in honor of Professor Oscar Moreno, from the Gauss Research Laboratory and NIC.PR (responsible for the .pr Internet domains.).

Dr. Moreno is a legend in the area of pure and applied mathematics. I have the privilege of meeting with him.

The conference plenary speakers were equally three legends:

Elwyn Berlekamp, University of California at Berkeley
Solomon W. Golomb, University of Southern California
Guang Gong, University of Waterloo

The event was a success, although some speakers read straight from their notes. As an interdisciplinary conference on pure and applied mathematics, all kind of topics were covered.

I got the chance to present research work on a new global weight algorithm we are testing called scaled inverse document frequency (SIDF), a variant of the well-known IDF scheme.

For those unfamiliar with IDF and its implementation with ranking algorithms, Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at Indian Institute of Technology Madras has published a very useful tutorial presentation on Vector Space Models.

The tutorial is based on our series of articles on the subject and provides a better understanding of the theory. We could not have done it better.

SIDIM XXIV Conference

March 5, 2009

I am presenting at The Seminario Interuniversitario de Investigación en Ciencias Matemáticas (Interuniversity Seminar on Mathematical Sciences Research, SIDIM).

This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (http://sidim2009.uprr.pr/)

This year SIDIM will be held at University of Puerto Rico, Rio Piedras in March 6-7, 2009. The SIDIM program and book of abstracts  is available at http://sidim.uprh.edu/libroSIDIM2009.pdf

I will be presenting new research work on IDF and a new model for the conditional specificity of terms. If you have followed previous posts on the topic of inverse document frequency, now you will understand why I have dissected the topic several times. Thank you all for your private comments and feedback on the topic.

My abstract follows:

Scaled Inverse Document Frequency: A Model for the Evaluation of the Conditional Specificity of Query Terms in Search Engine Collections

Edel Garcia, Internet Business Development Center, Interamerican University of Puerto Rico, Metropolitan Campus

Inverse document frequency (IDF) is a measure of the specificity of query terms over a collection of D number of documents that has been successfully incorporated into numerous vector space information retrieval models. Since these models assume term independence, the specificity of a given term, present in different queries, is assumed to be unique and independent from other query terms. To the best of our knowledge, there are no known models that condition the specificity of terms to the presence of other terms in a query.

This paper proposes a new measure called scaled inverse document frequency (SIDF) which evaluates the conditional specificity of query terms over a subset S of D and without making any assumption about term independence. S can be estimated from search results, OR searches, or computed from inverted index data. We have evaluated SIDF values from commercial search engines by submitting queries relevant to the financial investment domain. Results compare favorably across search engines and queries. Our approach has practical applications for `real-world’ scenarios like in Web Mining, Homeland Security, and keyword-driven marketing research scenarios. SIDF can be incorporated into a variety of information retrieval models as a global weight scoring system.

Keywords: inverse document frequency, conditional term specificity, web mining, search engines

Vector Normalization with Excel

March 4, 2009

Unit vectors are frequently used in information retrieval and data mining studies because simplify further calculations and analyses.

In the current issue of IR Watch, we show how easy is to convert column vectors into unit vectors with Excel. It is assumed you know how to define spreadsheet arrays in Excel and how to enter formulas in it.

Say we have two vectors in columns A and B each with four elements. To convert these into unit vectors, do this:

1. In cell C1, enter the formula =A1/(SQRT(SUMSQ(A$1:A$4)))

2. Paste content of C1 into cell D1. This creates a modified instance of this formula.

3. Paste content of C1 and  D1 cells into remaining empty cells of these columns by selecting these at once. This also creates modified instances of these formulas.

C and D columns represent the unit vectors.

A figure with a step-by-step example is given in IRW (free subscription)

Below is another example, but with the final results.

A B C D
1 8 0.13 0.36
2 10 0.26 0.45
4 12 0.53 0.53
6 14 0.79 0.62

That was easy!

If you use the first row to label columns, as in this example, be sure to readjust the formulas so these start at cell 2 and run up to cell 5.

If you still have questions on how to do this, email me or subscribe to IRW.

When IDF is not enough

February 25, 2009

I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as measure of the discriminating power of a term is not enough. Gray writes:

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that by placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word often used by Lewis Carroll.

Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *

This is where IDF as a term specificity measure breaksdown. This problem is intimate linked to The Original Sin of IR models: The Term Independence Assumption.

* I have modified a bit this assertive question to make the point more clear.

References

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Hill, I.D., “Wouldn’t it be nice if we could write computer programs in ordinary English – or would it?” The Computer Bulletin, June 1972, pp. 306-312.

Neumann, Peter G., “Only his Only Grammarian Can Only Say What Only He Means,” ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.

Glottochronology: Part IV

February 20, 2009

Please read Part I, Part II, and Part III before reading this post.

I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur’s book Discrete Dynamical Systems (Oxford, 1990).

1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?

2. Consider the model of glottochronology. Assume a language is given today.

(a) How long will it take for 1/4 of the words to change?

(b) How long will it take for 10 per cent of the words to change?

3. Suppose that person A knows 60 per cent of a list of 1000 words, person B knows 70 per cent of that list, and person C knows 30 per cent of that list.

(a) How many words do you expect all three people know?

(b) What per cent of the words is known by A and B but not by C?

Hints:

Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle  to A, B, and C.

Glottochronology: Part III

February 19, 2009

Please read Part I and Part II before proceeding with this post.

Applications to cultures

Sandefur provides the following example:

Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, A and B. At time 0, they have the same language, so that for a given list of L words, A(0) = B(0) = 1 is the per cent of the words they both know (and have in common).

If we contact each of these cultures k years later, culture A will know A(k) = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k) per cent of the original list. Thus the per cent of the words that both cultures know is, by the multipliation principle,

Q = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k)

What the glottochronologist does now is to construct a list of words. From that list of words, the two cultures are studied and it is determined what per cent Q of this list of words is known by both cultures. Thus in the equation for Q above, Q is known, but k, the number of years since the two cultures separated, is unknown. Solving for k gives

k = 500lnQ/ln 0.805

To understand the significance of this ratio we need to look at some examples.

Examples

Sandefur provides several examples.

Suppose that the natives of two islands have similar language. From a list of 300 words, 180 words are understood by both groups, that is Q = 180/300 = 0.6. Then

k = 500ln 0.6/ln 0.805 = 1177.5

We then conclude that the natives of these two islands came from a common ancestry, approximately 1200 years ago.

Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation n for pairs of tribes in each geographical region. It can be argued that the region with the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its geographical location, but the largest time separation will always be the two that remained in the original area.

Drawbacks and Pitfalls of Glottochronology

The model presumes independence assumptions (see discussion on multiplication principle); that is, event cooccurance by chance.  But we know that

If p(A1A2) = p(A1)p(A2) event cooccurance is by chance.
If p(A1A2) > p(A1)p(A2) event cooccurance is more than by chance.
If p(A1A2) < p(A1)p(A2) event cooccurance is less than by chance.

One way terms deviate from independence is through their semantics (meaning). If the meaning of words change in time, how do we know if all words from a word list change by the same amount?

As noted by Sandefur

…how do you determine if a word is the same for two culture? If the spelling of a word or the pronunciation of a word changes ’slightly’, we will still count it as being on the list. If the meaning of a word changes ’sigificantly’, we will delete it from the list. Thus, there is some subjectivity in determining Q which could drastically change the results. Also some words are more likely to change than the others. But in the multiplication principle, we tacitly assumed that all words were equally likely to change. This can throw the results off.

The moral of this is that you need to be careful not to make more claims about your model than are justified.

Glottochronology: Part II

February 18, 2009

Please read Glottochronology Part I before reading this post.

Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is

p(A1A2) = p(A1)p(A2)

Some authors like Sandefur call this the multiplication principle.

As Sandefur noted and quote:

Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.

Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.

Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we woud expect A and B to both know PQ per cent of the words.

In Part III we will provide some examples of this principle to the evolution of cultures. 

Later in this series we will explain how the independence assumption affects some of the reasonings  and claims behind language dating models.

In the meantime: How relevant this model is to IR? Well, assume that A and B are not Franks and Sues, but  passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.

Glottochronology: Part I

February 16, 2009

Although not back-to-back during this week I will be posting on glottochronology.

Glottochronology is a combination of greek terms which essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.

Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the percent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,

A(n+1) =rA(n),

where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0)=1. Therefore, at time k, A(k)=r^k(1) = r^k is the percent of the original list of words still in use, as a decimal.

Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millenioum ago, glottochronologists can estimate r^1000. This number r also depends on the particular language. But glottochronologists have found that the number r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that

r^1000 = 0.805

Thus, the per cent of the original list of words that are still in use k years later is

r^k = (0.805)^(0.001 k).

Glottochronology is one of those fields that were popular, but that many now cast doubts about it, due to questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.

Vector Space Model and Protein Retrieval

November 12, 2008

I came across an interesting paper, A Modified Vector Space Model for Protein Retrieval, where in the second page of it they reproduced material from my tutorial article The Classic Vector Space Model: Description, Advantages and Limitations of the Classic Vector Space Model.

Similarity, Pearson, and Spearman Coefficients

October 29, 2008

This post shows the connection between Pearson and Spearman’s Correlation Coefficients with Cosine Similarity and Dot Products. As mentioned in previous posts:

Pearson’s Coefficient is equivalent to Spearman’s Coefficient for raw data that has been transformed into ranks.
http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/

Similarity is not Distance.
http://irthoughts.wordpress.com/2008/10/15/seos-and-their-semantic-distance-myths/

The ’similarity distance’ expression is an oxymoron that should be avoided.
http://irthoughts.wordpress.com/2008/10/20/having-fun-with-oxymorons/

Arbitrary transformations between Distance and Similarity must be avoided.
http://irthoughts.wordpress.com/2007/09/17/binary-similarity-calculator/

It is important to know how to define Distance
http://irthoughts.wordpress.com/2008/10/23/why-defining-distance-is-important/

Transforming Pearson’s Coefficients into Cosine Similarities

A Pearson’s Correlation Coefficient is equivalent to the cosine of the angle of a paired data represented as vectors, provided that the raw data is first transformed into z-scores:

z(xi) = (xi – mean_x)/s_x
z(yi) = (yi – mean_y)/s_y

where the mean is deducted and the difference normalized respect to the standard deviation. Thus, we end up with a mean-centered, deviation-normalized data. A z-score data is often called a ‘centered data’.

An Example

To convince yourself, Pearson’s Correlation Coefficient for the following data is 0.8837…


 

xi yi
2 3
4 2
6 5
20 7

If you center the data and calculate its cosine similarity, it should be the same.

If the centered data is converted into column unit vectors, the dot product of the vectors is also 0.8837… since for unit vectors cosine angle = dot product

Now try this.

Rank the raw data. Next, center it, and then convert into unit vectors. Convince yourself that in this case

Spearman’s = Pearson’s = Cosine Angle = Dot Product = 0.8000…

Since Similarity is not Distance, these are not Distances.

Independence, Disjointness, and IR Flaws

August 11, 2008

Often independent events are mistaken for exclusive (disjoint) events. These are two different animals.

Consider two events, A and B. Let p(A OR B) be their union probability and p(A AND B) their joint probability. The general addition law for probabilities states that for any two events A and B

p(A OR B) = p(A) + p(B) – p(A AND B)

If events are independent

p(A AND B) = p(A)p(B)

Thus,

p(A OR B) = p(A) + p(B) – p(A)p(B)

Whereas if these are exclusive

p(A AND B) = 0

Therefore,

p(A OR B) = p(A) + p(B)

Furthermore,

if p(A AND B) = p(A)p(B) events are independent, occurring by chance.
if p(A AND B) > p(A)p(B) events are positively correlated, occurring more often than by chance.
if p(A AND B) < p(A)p(B) events are negatively correlated, occurring less often than by chance.

Talking in “rice and beans” (Hablando en “arroz con habichelas”):

Exclusive events do not have common outcomes as the occurrence of one excludes the occurrence of the other. By contrast, independent events have common outcomes, but the occurrence of one does not influence the occurrence of the other.

Independence and disjointness are very different things.

In IR, assuming that the IDF of a combination of terms can be taken for the sum of individual term IDF values presumes that terms are independent regardless of the actual data.

Arbitrarily assuming event independence, ignoring the experimental evidence, is one of the main sources of innaccuracies/flaws in many IR models (Cooper, 1991). However, excluding independence altogether is also unreasonable (Sparck-Jones, Walker, and Robertson, 1998).

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (ACM, SIGIR ’91) (pp 57-61). Chicago, Illinois: ACM.

Sparck Jones, K., Walker, S., & Robertson, S. E. (1998). A probabilistic model of information retrieval: development and status. TR 446, September. Computer Laboratory, University of Cambridge.

Understanding TFIDF

July 7, 2008

What TFIDF Is

In IR, a function of the form

aij = f(Lij,Gi,Nj) = Lij Gi Nj

defines term weights, where aij is the score assigned to term i in document j. Lij is the local weight of term i in document j. Gi is a collection-wide weight. Nj is a normalization factor over document j, often set to 1. Other values and models for N are possible.

There are many definitions for Lj and Gi. One of these consists in setting

Lij = tij, where tfij is the frequency of term i over document j
Gi = IDFi, where IDFi = log(N/ni) is the inverse document frequency of term i over a collection of N documents, where ni is the number of documents containing term i.
Nj = 1

Terms weights then reduces to evaluating

aij = tfij*IDFi = tfij*log(N/ni)

This is the so-called classic TFIDF term weight scheme, one of several examined by Salton et al. in 1975.

TFIDF is frequently used to construct a term vector space model. In this model, terms define the dimensions of a vector space called “the term space”. Dimension units are aij scores. Vectors representing documents and queries are embedded in this space and often converted into unit vectors to simplify further matrix analyses.

Terms do not posses positions in the term vector space simply because they are the dimensions of the space. Since dimensions are assumed to be orthogonal, term independence is assumed. Documents and queries are then treated as “bags of words” and represented as vectors in this space.

What TFIDF Scores Are Not

The TFIDF product, aij = tfij*IDFi, combines two different event spaces: the event space of terms over TF and the event space of documents over IDF. This product can hardly be defined as a term importance score or term relevance judgment for several reasons. Still this is what many informational electronic outlets like Wikipedia and search marketing blogs have claimed.

Firstly, let’s consider the contribution of tf to aij. For a given IDF value a plot of aij vs tfij presumes proportionality. However, a term i repeated x times in document j is not necessarily x times more pertinent or relevant. Term repetition is not precisely a good indicator of term importance.

Secondly, let’s consider the contribution of IDF to aij. IDF weights do not incorporate relevance information. According to Robertson (1), “we can regard IDF as a simple version of the RSJ weight, applicable when we have no relevance information.”

IDF is a measure of the discriminative power of a term over a collection of documents. Sparck Jones called this “term specificity”. Numerically, it is a log estimate of the inverse probability that a random document ni from a collection of N documents would contain term i. Both IDF and TFIDF have never been a buzzword in IR since the ‘70s.

On Term Importance

IDF as the TFIDF product, aij = tfij*IDFi, does not estimates term importance either. The importance of a term, a string, a passage, a message, etc is linked to many things like its meaning (semantics) and amount of information carried  (entropy). A TFIDF product does not evaluate either one.

So far we have considered two event spaces, but we could include other event spaces that add up to other variables that affect term importance; for example, the query space.

For the purpose of keyword-driven services on the Web, other considerations can be added to the “term importance” label. One of these is search volume. Terms with a high search volume or with a good conversion ratio are frequently deemed as “important” to keyword-driven services, regardless if these are quite discriminatory or “rare”.

Average users rarely search for very rare terms simply because they are rare. The odds are that very rare terms will exhibit low search volume. Thus, a highly discriminatory term not necessarily is important in this case. Thus, the query is a third and different event space to consider before deeming a term as “important”.

It is true that in the TFIDF model, we can take the cosine angle between any two vectors as a similarity measure and rank whatever the vectors represent. However, ranking by similarity in such a way can hardly incorporate semantics or information entropy, despite what some have written about the topic.

A partial solution to this consists in replacing IDF in the aij expression by an entropy expression. Such models do exist, but are computationally expensive and certainly prohibitively expensive for large scale web search engines.

On IDF and Information Entropy

Over the years, several authors including Robertson himself have tried to link IDF to entropy. According to Robertson (1), “We might be tempted to link this interpretation to the IDF formula, by regarding IDF as measuring the amount of information carried by the term. Indeed, something along these lines has been done several times, including by the present author (Robertson, 1974). Some problems with this view of IDF are discussed below.”

We are not going to go over those problems here, but the reader can go to the original source:

Robertson (1) contended:

“The essence of the problem is that it is hard to identify a single event space and probability measure within which all the relevant random variables can be defined. Without such a unifying event space, any process that involves comparing and/or mixing different measures (even a process as simple as adding IDFs for different terms) may well be invalid in terms of a Shannon-like formalism.”

Robertson then expanded on efforts along those lines (1):

“But there is a more serious problem. When we search using weighted terms, we typically take the query terms and assign weights to them, but ignore all the other terms in the vocabulary. It is hard to see how such a practice could make sense in a Shannon-like model: every term in the document must be assumed to carry information as well as any other. That is, the presence of a term ti would carry −log P(ti) amount of information irrespective of whether or not it is in the query. There is nothing to link the amount of information to the specific query. So we would have no justification for leaving it out of the calculation. Nevertheless, the similarity of the usual IDF formulation to a component of entropy has stimulated other researchers to try to make connections, sometimes somewhat differently from the link suggested above.”

In the rest of the paper Robertson mentioned by name authors that proposed such efforts.

Reference

1. Robertson, S.; Understanding Inverse Document Frequency: On theoretical arguments for IDF.
Journal of Documentation,Volume 60, Number 5, 2004 pp 503–520.

Posts somehow related with this post

RSJ-PM: Probabilistic Model Tutorial

Why IDF is Expressed using Logs?

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/ 

http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

SEOs and their IDF Myths: Part 2

July 3, 2008

This post is a continuation of a previous one on the topic of SEO non sense in relation with inverse document frequency (IDF). IDF has a long standing presence in IR since its introduction back in 1972 by the late Karen Sparck Jones. Since then the model has been thoroughly researched and incorporated into IR models (Salton’s tf-idf, RSJ, and BM25 models).

SEO myths and misconceptions in connection with IDF is featured in the current issue of IRW newsletter.

It appears that repeating SEO crap across the blogosphere makes some to become experts on the subject. These pseudo teachers should have researched the topic before making dumb claims or repeating their peers’s hearsay or come up with definitions made out of thin air. Nothing new coming from SEOs. Such practices are almost their trademarks.

For instance Aaron Wall, has incorrectly defined IDF as follows:

“Inverse Document Frequency is a term used to help determine the position of a term in a vector space model.”

As usual other seos have repeated like parrots such misinformation. It is not the first time. It reminds me of Wall’s claims about LSI, a topic he wrote extensively on until his ignorance about the topic was exposed in several blogs that discuss IR. He was not alone. Andy Beal, Mike Marshall, and few other vocal SEOs have claimed to know about LSI or have used LSI in SEO work. Really?

But this post is not about LSI, but IDF which is a topic equally misunderstood by SEOs. So, let us debunk their claims. As stated in IRW-2008-06:

Salton et al. proposed the vector space model in 1975 in the paper A Vector Space Model for Automatic Indexing (15). In that paper several schemes for scoring term weights were proposed. One of these consisted in combining term frequency (tf) with IDF. Over the years, a family of tf-IDF models has been proposed. Obviously, these are predated by the IDF model of 1972.

In Salton’s vector space model documents are represented as vectors. A query is represented just as another document. Vectors are projected in a vector space, whose dimensions are terms. The units of those dimensions are weights. Coordinates associated to a point (or a vector) in that space are computed according to a scoring model. Terms cannot have positions in this vector space because they are the dimensions of the space. It is that simple.

Despite the fact that IDF has been around since 1972 and tf-IDF since 1975, some search marketers like those that repeat Andy Edmonds’s claims are saying that IDF or tf-IDF is a “new” buzzword in the IR field. WOW! IDF and tf*IDF is a “new” buzzword in IR circles. Really?

Others have claimed that it is not possible to evaluate the IDF of a phrase. Even some that plan to teach IR have claimed that calling log(N/n) “inverse document frequency” is an “insult to students”. Before making a fool of themselves they should read Robertson and Sparck Jones legacy papers on the topic.

Sorry to sound harsh, but I wonder what kind of crap all these pseudo teachers are lecturing while sitting in the dark of their  empty classrooms and forums.

Did search engines use IDF? Yes, absolutely.

Do all search engines use IDF? No, absolutely.

Do I think X search engine currently uses IDF? I cannot speculate what X is doing simply because I don’t work at X.

Do I use IDF? Yes, in my experimental search engine students are building/researching.

What are the drawbacks of IDF? Several. Its stability as N gets larger is an ongoing research topic.

Before commenting on IDF, SEOs please don’t lose credibility and do your homework. Start here.

New Research on the topic: http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

On the Stability of Global Weights in the Presence of Spam

June 30, 2008

In A Comparison of Document, Sentence, and Term Event Spaces, Blake analyses the stability of various global weight models (IDF, ISF, and ITF) with respect to document and journal corpuses. She uses stratified samples collected based on term frequency information. Abstracts and section partitions of full-text scientific articles were studied.

We are conducting studies along similar lines, but using web collections. Such studies are extremely important in a dynamic environment like the web, which is different from abstracts and scientific journal collections. Web collections often consist of documents of general interest or noisy content. Documents can be designed using dubious practices like keyword repetition techniques. Such techniques, commonly known as keyword spam, can introduce biased term frequency data.

It should be underscored that keywords relevant to products and services are also intentionally repeated across the web. In addition, not all document content found in web collections is valid. Some are near duplicates, have been artificially generated, or are the result of vested alliances like link exchange programs. Thus, the stability of IDF, ISF, and ITF with respect to web collections or subsets of these in such a noisy environment is still an open question.

Any pointer or reference research about this topic is greately appreciated.

Sneak Preview of IRWatch: Understanding IDF

June 23, 2008

idf

“IDF is simply neither a pure heuristic, nor the
theoretical mystery many have made it out to be.
We have a pretty good idea why it works as well
as it does.” –Stephen E. Robertson

Here is a sneak preview of IR Watch for the month of June, 2008. It should be in subscribers inbox during the day or at the latest tomorrow. 

It is discussed within the context of co-occurrence theory and term independence/dependence assumptions. Issues and misconceptions related with this measure are addressed. Initially we made plans for including current ongoing work we are conducting on specificity measures, but we have chosen not to since is not the appropriate forum. 

IRW-2008-06: Understanding Inverse Document Frequency (IDF)

In this issue:

Introduction
Robertson-Sparck Jones Early Work on IDF
What IDF Is Not
What IDF Really Is
On Terms Independence
On Terms Dependence
Few Examples
Estimating the IDF of a Phrase
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Vector Space Models and Search Engines

April 21, 2008

This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/

That said, today’s post is in reaction to the article at
http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

The title of that article sends the message that search engines work by using tf*IDF. In addition, IDF itself is mistaken. It is not clear from the article how vector space models work or are used by search engines. The author then seem to agree with few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.

Blockquoted passages and my comments are next.

Blockquoted Passage

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (“the reluctance to stop talking about a certain subject” …)would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document with.

Comments

At grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. More likely no current top search engine uses plain tf, IDF or tf*IDF to rank docs. How IDF works?

Let N be number of docs in a collection and n be number of docs containing a given term. The probability of randomly choosing a doc containing a given term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as

IDF = log(1/p) = log(N/n)

Logs at a given base (often 2 or 10) are used.

Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.

Frequently used terms have a small discriminatory power, regardless if they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF) regardless if they are relevant to the topic of the document mentioning these. Term relevancy and the discriminatory power of a term not always run “side by side”. Some times they do, though.

The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.

For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” probably would be relevant to the doc and be discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the “job” term is still relevant and meaningful to the document, but more likely will lose its discriminatory power within the collection. And we haven’t considering yet how relevant “job” or the documents containing the term is to end-users looking for jobs.

IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.

For instance, we can use IDF probabilistic (IDFP) by considering the odds (p/(1 – p)) instead of just p. Inverting and taking logs,

IDFP = ((N – n)/n)

If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.

Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html  

There are other ways of defining goblal weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.

Blockquoted Passage

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.

Comments

Like global weights, there are dozen of ways we can define local weights, L. In the original vector space model, L = f was used, where f is the frequency (occurrence) of a term in a doc.

This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation to use?

The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that atttenuate frequencies are better choices.

L = f/fmax
L = 1 + log(f)
L = log(1 + f)

etc, etc, etc.

Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.

There are many ways of defining L and G, not to mention variants of document normalization weights N. These then give a product weight:

W = L*G*N

A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of large and short documents.

As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.

Blockquoted Passage

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

Queries as Vectors

Every word in a query can also be shown as a vector.

By looking at documents that are “near” our query we can rank (sort) documents in our result set.

Comments

In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.

However, in LSI every word of a doc can be represented as a vector, but this is not what the referred article discusses.

Blockquoted Passage (from one commenter)

It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.

Comments

Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.

There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to conform an addressable “book” of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. We can put in the posting lists different doc features, like f values, word positions, word spacing, in-title, etc.

The index can be computed and already be in cache before any query. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. For each term, IDFs are already precomputed in advanced. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over entire collection).

From matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note these can be a predetermined number of docs from the index. Thus, if million docs are matched by the posting list(s) only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.

For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resource to precaching strategies, so a similar analysis can also be done offline, too, (e.g. from a pool of frequently queried terms –the so-called suggestion lists). We can also use prebuilded thesaurus to find similar docs, impacting precision and recall. Processes can be called by geolocation to please a specifc region or regional directory, etc.

For rather small collections, the term-doc matrix can be from all terms in the little collection that is stored on disk or small term-doc matrices can be constructed in advanced from each little posting list. All these, done off-line and before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, results postprocessed, ranked, and presented to the end user under fractions of a second.

To boldy suggest that vector models are not used by top search engines for ranking docs is plain non sense. Still, link weight only or vector similarity scores only are not enough. These scores can be combined with link weights or with other analytics, to get a final score.

Combining those scores simply does not simplifies computation, but adds another complexity layer while not doing it can leave out meaningful docs and queries.

Note

Since SEOS love to quote each other and I love to quote IRs, let’s have a happy medium and quote both:

http://www.webpronews.com/topnews/2001/09/05/google-interview-by-fredrick-marckini

One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html  which incidentally mentions tf and IDF.

Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explains in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustive searching the collection.

Blockquoted Passage

Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.

Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

Comments
Why agree to SEO hearsay? See previous comments

Also depending on how was an index constructed, a query no need to “travel” an entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhausting searches. That’s why it is known as an “addressable book” or “addressable tree”.

Furthermore, literature on vector models for phrases can be searched on the Web. IDF for phrases are certainly computable. To snoop at the subject or about inverted index strategies, index segmentation, index merging, etc. read http://lucene.sourceforge.net/talks/inktomi/  or just do a search for “phrase idf”.

Conclusion

To sum up, even if no current commercial search engine uses plain tf*IDF this does not mean they don’t use vector space models for classification, retrieval, and ranking.

Vector space models often are present in different flavors and within different levels of an IR architecture. Vector Theory itself is an ancilliary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that is not useful.

Two issues that strikes me as these suggest a lack of understanding of how search engines work accomodate to the following questions:

1. Could KD be used by search engines or users to check for spam keyword?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for spam keyword?

Word repetition determined by search engines as spam keyword should be of more concern than what web page creators or a KD tool tags as spam keyword. After all search engines and not designers of web pages are the one that assign a rank to the documents. This goes with the user-machine relevance perception mismatch and the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users are a zero to the left, as they are the ones that pay the bills. And even if they don’t, why rank high a page just to see users going to some place else after visiting it because is not suitable for human consumption? So, rather than using a KD tool, just write as natural and useful to your prospective clients and readers as you can.

Regarding the use of KD tools for checking for spam, this allegation reminds me of certain seo books, marketers, and community forums that insist in such non sense, just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked almost on a rutinary basis these and similar SEO myths. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening down the effect of keyword repetition. Two for the price of one!

This is more cost effective at neutralizing keyword repetition than computing and comparing against a whole new and extra ratio, KD. Best of all, it does not require of the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scales of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.

To be sure students understood, I included the following three questions in the Final Exam section that consisted of multiple choices. (The problem-solving section of the test is even more interesting, but is too long to include it here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times this term should be repeated in d2 to triplicate its d1 weight? Assume Log 10 base.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is non sense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, I’ll be teaching this Spring the graduate course Search Engines Architecture.

I will blog the syllabus tomorrow, but is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own authorship.

On and on, SEOs still have no clue about what a search engine can or cannot do.

Global Term Weights based on Entropies

January 16, 2008

A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.

Global weights based on entropies are frequently combined with local and normalization weights into overall weights.  These are then used to populate a term-doc matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.

The following set of equations define the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:

Global Entropy Weights

Evidently,

G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.

Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms that appear fewer times in a small number of documents, while lowering the weights of terms that are frequently used across the collection.

Note that the convention is to default p log p values when a condition is met; e.g., p log p = 0 if p = 0 or 1.

A Novel Perspective of Similarity

September 18, 2007

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Being Referenced Without Being Referenced

September 13, 2007

I have the honor of being “referenced” by Lau Patrick from ETH’s Databases and Information Systems Group in the manuscript Algorithms for Data Base Systems Report to the topic “Abusing Web Search” based on “A web based kernel function for measuring the similarity of short text snippets”.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/LauReport.pdf

Interesting project. I can even recognize some figures and examples of the manuscript taken from my document indexing tutorial,
http://www.miislita.com/information-retrieval-tutorial/indexing.html

In particular, Figure 1 of the tutorial, shown in the manuscript as Figure 2. No link to the tutorial was provided, but to a different document at Mi Islita. That’s fine, since I referenced that my Figure 1 is a modification from the one at  

http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

At least they should have given credit to Ruthven and Lalmas to avoid allegations of student plagiarism.

Memo to Thomas Hofmann (Google), Donald Kossmann, and Peter Widmayer from http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/:

Please! Better than talking about Web search abuse, similarity of text, and duplicated content, let’s talk about lack of originality and properly referenced work.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

THESUS: Semantic Subsets Based on WWW Links

August 2, 2007

I came across a 2003 paper on mining WWW text links, in which researchers introduced THESUS. Really interesting piece.

The abstract of THESUS: ORGANIZING WEB DOCUMENT COLLECTIONS BASED ON LINK SEMANTICS follows:

(more…)

Conceptual Similarity and Conceptual Mapping

July 26, 2007

I came across a relatively old paper authored by researchers at Indiana University and Institute for Human and Machine Cognition: Assessing Conceptual Similarity to Support Concept Mapping

Indeed, it seems like a great concept.

(more…)