• About IR Thoughts

IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

IR Thoughts

Category Archives: Vector Space Models

IDF and Vector Space Models

11 Wednesday Mar 2009

Posted by egarcia in IR Tutorials, Vector Space Models

≈ 1 Comment

I’m back from SIDIM XXIV. It was a great conference in honor of Professor Oscar Moreno, from the Gauss Research Laboratory and NIC.PR (responsible for the .pr Internet domains.).

Dr. Moreno is a legend in the area of pure and applied mathematics. I have the privilege of meeting with him.

The conference plenary speakers were equally three legends:

Elwyn Berlekamp, University of California at Berkeley
Solomon W. Golomb, University of Southern California
Guang Gong, University of Waterloo

The event was a success, although some speakers read straight from their notes. As an interdisciplinary conference on pure and applied mathematics, all kind of topics were covered.

I got the chance to present research work on a new global weight algorithm we are testing called scaled inverse document frequency (SIDF), a variant of the well-known IDF scheme.

For those unfamiliar with IDF and its implementation with ranking algorithms, Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at Indian Institute of Technology Madras has published a very useful tutorial presentation on Vector Space Models.

The tutorial is based on our series of articles on the subject and provides a better understanding of the theory. We could not have done it better.

SIDIM XXIV Conference

05 Thursday Mar 2009

Posted by egarcia in Conferences, Data Mining, Homeland Security, Queries, Vector Space Models

≈ 4 Comments

I am presenting at The Seminario Interuniversitario de Investigación en Ciencias Matemáticas (Interuniversity Seminar on Mathematical Sciences Research, SIDIM).

This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (http://sidim2009.uprr.pr/)

This year SIDIM will be held at University of Puerto Rico, Rio Piedras in March 6-7, 2009. The SIDIM program and book of abstracts  is available at http://sidim.uprh.edu/libroSIDIM2009.pdf

I will be presenting new research work on IDF and a new model for the conditional specificity of terms. If you have followed previous posts on the topic of inverse document frequency, now you will understand why I have dissected the topic several times. Thank you all for your private comments and feedback on the topic.

My abstract follows:

Scaled Inverse Document Frequency: A Model for the Evaluation of the Conditional Specificity of Query Terms in Search Engine Collections

Edel Garcia, Internet Business Development Center, Interamerican University of Puerto Rico, Metropolitan Campus

Inverse document frequency (IDF) is a measure of the specificity of query terms over a collection of D number of documents that has been successfully incorporated into numerous vector space information retrieval models. Since these models assume term independence, the specificity of a given term, present in different queries, is assumed to be unique and independent from other query terms. To the best of our knowledge, there are no known models that condition the specificity of terms to the presence of other terms in a query.

This paper proposes a new measure called scaled inverse document frequency (SIDF) which evaluates the conditional specificity of query terms over a subset S of D and without making any assumption about term independence. S can be estimated from search results, OR searches, or computed from inverted index data. We have evaluated SIDF values from commercial search engines by submitting queries relevant to the financial investment domain. Results compare favorably across search engines and queries. Our approach has practical applications for `real-world’ scenarios like in Web Mining, Homeland Security, and keyword-driven marketing research scenarios. SIDF can be incorporated into a variety of information retrieval models as a global weight scoring system.

Keywords: inverse document frequency, conditional term specificity, web mining, search engines

Vector Normalization with Excel

04 Wednesday Mar 2009

Posted by egarcia in Data Mining, IR Tutorials, Newsletters, Vector Space Models

≈ 1 Comment

Unit vectors are frequently used in information retrieval and data mining studies because simplify further calculations and analyses.

In the current issue of IR Watch, we show how easy is to convert column vectors into unit vectors with Excel. It is assumed you know how to define spreadsheet arrays in Excel and how to enter formulas in it.

Say we have two vectors in columns A and B each with four elements. To convert these into unit vectors, do this:

1. In cell C1, enter the formula =A1/(SQRT(SUMSQ(A$1:A$4)))

2. Paste content of C1 into cell D1. This creates a modified instance of this formula.

3. Paste content of C1 and  D1 cells into remaining empty cells of these columns by selecting these at once. This also creates modified instances of these formulas.

C and D columns represent the unit vectors.

A figure with a step-by-step example is given in IRW (free subscription)

Below is another example, but with the final results.

A B C D
1 8 0.13 0.36
2 10 0.26 0.45
4 12 0.53 0.53
6 14 0.79 0.62

That was easy!

If you use the first row to label columns, as in this example, be sure to readjust the formulas so these start at cell 2 and run up to cell 5.

If you still have questions on how to do this, email me or subscribe to IRW.

When IDF is not enough

25 Wednesday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ Leave a Comment

I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as measure of the discriminating power of a term is not enough. Gray writes:

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that by placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word often used by Lewis Carroll.

Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *

This is where IDF as a term specificity measure breaksdown. This problem is intimate linked to The Original Sin of IR models: The Term Independence Assumption.

* I have modified a bit this assertive question to make the point more clear.

References

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Hill, I.D., “Wouldn’t it be nice if we could write computer programs in ordinary English – or would it?” The Computer Bulletin, June 1972, pp. 306-312.

Neumann, Peter G., “Only his Only Grammarian Can Only Say What Only He Means,” ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.

Glottochronology: Part IV

20 Friday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ Leave a Comment

Please read Part I, Part II, and Part III before reading this post.

I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur’s book Discrete Dynamical Systems (Oxford, 1990).

1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?

2. Consider the model of glottochronology. Assume a language is given today.

(a) How long will it take for 1/4 of the words to change?

(b) How long will it take for 10 per cent of the words to change?

3. Suppose that person A knows 60 per cent of a list of 1000 words, person B knows 70 per cent of that list, and person C knows 30 per cent of that list.

(a) How many words do you expect all three people know?

(b) What per cent of the words is known by A and B but not by C?

Hints:

Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle  to A, B, and C.

Glottochronology: Part III

19 Thursday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ 1 Comment

Please read Part I and Part II before proceeding with this post.

Applications to cultures

Sandefur provides the following example:

Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, A and B. At time 0, they have the same language, so that for a given list of L words, A(0) = B(0) = 1 is the per cent of the words they both know (and have in common).

If we contact each of these cultures k years later, culture A will know A(k) = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k) per cent of the original list. Thus the per cent of the words that both cultures know is, by the multipliation principle,

Q = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k)

What the glottochronologist does now is to construct a list of words. From that list of words, the two cultures are studied and it is determined what per cent Q of this list of words is known by both cultures. Thus in the equation for Q above, Q is known, but k, the number of years since the two cultures separated, is unknown. Solving for k gives

k = 500lnQ/ln 0.805

To understand the significance of this ratio we need to look at some examples.

Examples

Sandefur provides several examples.

Suppose that the natives of two islands have similar language. From a list of 300 words, 180 words are understood by both groups, that is Q = 180/300 = 0.6. Then

k = 500ln 0.6/ln 0.805 = 1177.5

We then conclude that the natives of these two islands came from a common ancestry, approximately 1200 years ago.

Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation n for pairs of tribes in each geographical region. It can be argued that the region with the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its geographical location, but the largest time separation will always be the two that remained in the original area.

Drawbacks and Pitfalls of Glottochronology

The model presumes independence assumptions (see discussion on multiplication principle); that is, event cooccurance by chance.  But we know that

If p(A1A2) = p(A1)p(A2) event cooccurance is by chance.
If p(A1A2) > p(A1)p(A2) event cooccurance is more than by chance.
If p(A1A2) < p(A1)p(A2) event cooccurance is less than by chance.

One way terms deviate from independence is through their semantics (meaning). If the meaning of words change in time, how do we know if all words from a word list change by the same amount?

As noted by Sandefur

…how do you determine if a word is the same for two culture? If the spelling of a word or the pronunciation of a word changes ‘slightly’, we will still count it as being on the list. If the meaning of a word changes ‘sigificantly’, we will delete it from the list. Thus, there is some subjectivity in determining Q which could drastically change the results. Also some words are more likely to change than the others. But in the multiplication principle, we tacitly assumed that all words were equally likely to change. This can throw the results off.

The moral of this is that you need to be careful not to make more claims about your model than are justified.

Glottochronology: Part II

18 Wednesday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ 2 Comments

Please read Glottochronology Part I before reading this post.

Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is

p(A1A2) = p(A1)p(A2)

Some authors like Sandefur call this the multiplication principle.

As Sandefur noted and quote:

Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.

Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.

Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we woud expect A and B to both know PQ per cent of the words.

In Part III we will provide some examples of this principle to the evolution of cultures. 

Later in this series we will explain how the independence assumption affects some of the reasonings  and claims behind language dating models.

In the meantime: How relevant this model is to IR? Well, assume that A and B are not Franks and Sues, but  passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.

Glottochronology: Part I

16 Monday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ 3 Comments

Although not back-to-back during this week I will be posting on glottochronology.

Glottochronology is a combination of greek terms which essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.

Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the percent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,

A(n+1) =rA(n),

where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0)=1. Therefore, at time k, A(k)=r^k(1) = r^k is the percent of the original list of words still in use, as a decimal.

Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millenioum ago, glottochronologists can estimate r^1000. This number r also depends on the particular language. But glottochronologists have found that the number r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that

r^1000 = 0.805

Thus, the per cent of the original list of words that are still in use k years later is

r^k = (0.805)^(0.001 k).

Glottochronology is one of those fields that were popular, but that many now cast doubts about it, due to questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.

Vector Space Model and Protein Retrieval

12 Wednesday Nov 2008

Posted by egarcia in Vector Space Models

≈ 3 Comments

I came across an interesting paper, A Modified Vector Space Model for Protein Retrieval, where in the second page of it they reproduced material from my tutorial article The Classic Vector Space Model: Description, Advantages and Limitations of the Classic Vector Space Model.

Similarity, Pearson, and Spearman Coefficients

29 Wednesday Oct 2008

Posted by egarcia in Data Mining, SEO Myths, Vector Space Models

≈ 4 Comments

This post shows the connection between Pearson and Spearman’s Correlation Coefficients with Cosine Similarity and Dot Products. As mentioned in previous posts:

Pearson’s Coefficient is equivalent to Spearman’s Coefficient for raw data that has been transformed into ranks.
http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/

Similarity is not Distance.
http://irthoughts.wordpress.com/2008/10/15/seos-and-their-semantic-distance-myths/

The ‘similarity distance’ expression is an oxymoron that should be avoided.
http://irthoughts.wordpress.com/2008/10/20/having-fun-with-oxymorons/

Arbitrary transformations between Distance and Similarity must be avoided.
http://irthoughts.wordpress.com/2007/09/17/binary-similarity-calculator/

It is important to know how to define Distance
http://irthoughts.wordpress.com/2008/10/23/why-defining-distance-is-important/

Transforming Pearson’s Coefficients into Cosine Similarities

A Pearson’s Correlation Coefficient is equivalent to the cosine of the angle of a paired data represented as vectors, provided that the raw data is first transformed into z-scores:

z(xi) = (xi – mean_x)/s_x
z(yi) = (yi – mean_y)/s_y

where the mean is deducted and the difference normalized respect to the standard deviation. Thus, we end up with a mean-centered, deviation-normalized data. A z-score data is often called a ‘centered data’.

An Example

To convince yourself, Pearson’s Correlation Coefficient for the following data is 0.8837…


 

xi yi
2 3
4 2
6 5
20 7

If you center the data and calculate its cosine similarity, it should be the same.

If the centered data is converted into column unit vectors, the dot product of the vectors is also 0.8837… since for unit vectors cosine angle = dot product

Now try this.

Rank the raw data. Next, center it, and then convert into unit vectors. Convince yourself that in this case

Spearman’s = Pearson’s = Cosine Angle = Dot Product = 0.8000…

Since Similarity is not Distance, these are not Distances.

← Older posts
Newer posts →
May 2013
M T W T F S S
« Apr    
 12345
6789101112
13141516171819
20212223242526
2728293031  

Favorite Sites

  • Mi Islita

Pages

  • About IR Thoughts

Categories

  • AIRWeb Course
  • Conferences
  • Data Mining
  • Dynamics
  • Fractal Geometry
  • Graduate Courses
  • Hacking
  • Homeland Security
  • Human-Computer Interaction
  • Image Compression
  • Internet Engineering
  • IR Quizzes
  • IR Tools
  • IR Tutorials
  • Latent Semantic Indexing
  • Legacy Posts
  • Machine Learning
  • Marketing Research
  • Miscellaneous
  • News
  • Newsletters
  • Programming
  • Quack Science
  • Queries
  • Scripts
  • Search Engines Architecture Course
  • SEO Myths
  • Software
  • Spam
  • Statistics and Mathematics
  • Theses
  • Vector Space Models
  • Web Mining Course

Recent Posts

  • “Powered by” in Spanish
  • Some nice features added to the Image Crawler
  • The Images Crawler
  • A nice service for my locals
  • An update to the Web Crawler
  • New similarity measures
  • The Web Crawler is Back!
  • Tracking Users: An Email Crawler on Steroids
  • The Email Crawler: A Tool for Gathering Emails
  • The Binary Distance Calculator – a tool for comparing binary sets
  • Fractalettes: A Fractal Design Strategy to Color Mining and Learning through Discovery
  • AZZOO and WAZZOO: New Similarity Measures for the 21st Century
  • The Binary Similarity Calculator
  • From Harlem Shake to Link Shake: The Qualified Links Shake
  • Web Vulnerabilities and Search Engines

Archives

  • May 2013
  • April 2013
  • March 2013
  • February 2013
  • January 2013
  • December 2012
  • November 2012
  • October 2012
  • September 2012
  • August 2012
  • July 2012
  • June 2012
  • May 2012
  • April 2012
  • March 2012
  • February 2012
  • January 2012
  • December 2011
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • May 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • October 2010
  • September 2010
  • August 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007

AIRWeb Course Conferences Data Mining Fractal Geometry Graduate Courses Hacking Homeland Security Human-Computer Interaction Internet Engineering IR Quizzes IR Tools IR Tutorials Latent Semantic Indexing Legacy Posts Machine Learning Marketing Research Miscellaneous Newsletters Programming Quack Science Queries Scripts Search Engines Architecture Course SEO Myths Software Spam Statistics and Mathematics Theses Vector Space Models Web Mining Course

Blog at WordPress.com. Theme: Chateau by Ignacio Ricci.