Archive for the ‘Queries’ Category

SIDIM XXIV Conference

March 5, 2009

I am presenting at The Seminario Interuniversitario de Investigación en Ciencias Matemáticas (Interuniversity Seminar on Mathematical Sciences Research, SIDIM).

This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (http://sidim2009.uprr.pr/)

This year SIDIM will be held at University of Puerto Rico, Rio Piedras in March 6-7, 2009. The SIDIM program and book of abstracts  is available at http://sidim.uprh.edu/libroSIDIM2009.pdf

I will be presenting new research work on IDF and a new model for the conditional specificity of terms. If you have followed previous posts on the topic of inverse document frequency, now you will understand why I have dissected the topic several times. Thank you all for your private comments and feedback on the topic.

My abstract follows:

Scaled Inverse Document Frequency: A Model for the Evaluation of the Conditional Specificity of Query Terms in Search Engine Collections

Edel Garcia, Internet Business Development Center, Interamerican University of Puerto Rico, Metropolitan Campus

Inverse document frequency (IDF) is a measure of the specificity of query terms over a collection of D number of documents that has been successfully incorporated into numerous vector space information retrieval models. Since these models assume term independence, the specificity of a given term, present in different queries, is assumed to be unique and independent from other query terms. To the best of our knowledge, there are no known models that condition the specificity of terms to the presence of other terms in a query.

This paper proposes a new measure called scaled inverse document frequency (SIDF) which evaluates the conditional specificity of query terms over a subset S of D and without making any assumption about term independence. S can be estimated from search results, OR searches, or computed from inverted index data. We have evaluated SIDF values from commercial search engines by submitting queries relevant to the financial investment domain. Results compare favorably across search engines and queries. Our approach has practical applications for `real-world’ scenarios like in Web Mining, Homeland Security, and keyword-driven marketing research scenarios. SIDF can be incorporated into a variety of information retrieval models as a global weight scoring system.

Keywords: inverse document frequency, conditional term specificity, web mining, search engines

When IDF is not enough

February 25, 2009

I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as measure of the discriminating power of a term is not enough. Gray writes:

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that by placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word often used by Lewis Carroll.

Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *

This is where IDF as a term specificity measure breaksdown. This problem is intimate linked to The Original Sin of IR models: The Term Independence Assumption.

* I have modified a bit this assertive question to make the point more clear.

References

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Hill, I.D., “Wouldn’t it be nice if we could write computer programs in ordinary English – or would it?” The Computer Bulletin, June 1972, pp. 306-312.

Neumann, Peter G., “Only his Only Grammarian Can Only Say What Only He Means,” ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.

Glottochronology: Part IV

February 20, 2009

Please read Part I, Part II, and Part III before reading this post.

I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur’s book Discrete Dynamical Systems (Oxford, 1990).

1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?

2. Consider the model of glottochronology. Assume a language is given today.

(a) How long will it take for 1/4 of the words to change?

(b) How long will it take for 10 per cent of the words to change?

3. Suppose that person A knows 60 per cent of a list of 1000 words, person B knows 70 per cent of that list, and person C knows 30 per cent of that list.

(a) How many words do you expect all three people know?

(b) What per cent of the words is known by A and B but not by C?

Hints:

Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle  to A, B, and C.

Glottochronology: Part III

February 19, 2009

Please read Part I and Part II before proceeding with this post.

Applications to cultures

Sandefur provides the following example:

Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, A and B. At time 0, they have the same language, so that for a given list of L words, A(0) = B(0) = 1 is the per cent of the words they both know (and have in common).

If we contact each of these cultures k years later, culture A will know A(k) = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k) per cent of the original list. Thus the per cent of the words that both cultures know is, by the multipliation principle,

Q = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k)

What the glottochronologist does now is to construct a list of words. From that list of words, the two cultures are studied and it is determined what per cent Q of this list of words is known by both cultures. Thus in the equation for Q above, Q is known, but k, the number of years since the two cultures separated, is unknown. Solving for k gives

k = 500lnQ/ln 0.805

To understand the significance of this ratio we need to look at some examples.

Examples

Sandefur provides several examples.

Suppose that the natives of two islands have similar language. From a list of 300 words, 180 words are understood by both groups, that is Q = 180/300 = 0.6. Then

k = 500ln 0.6/ln 0.805 = 1177.5

We then conclude that the natives of these two islands came from a common ancestry, approximately 1200 years ago.

Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation n for pairs of tribes in each geographical region. It can be argued that the region with the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its geographical location, but the largest time separation will always be the two that remained in the original area.

Drawbacks and Pitfalls of Glottochronology

The model presumes independence assumptions (see discussion on multiplication principle); that is, event cooccurance by chance.  But we know that

If p(A1A2) = p(A1)p(A2) event cooccurance is by chance.
If p(A1A2) > p(A1)p(A2) event cooccurance is more than by chance.
If p(A1A2) < p(A1)p(A2) event cooccurance is less than by chance.

One way terms deviate from independence is through their semantics (meaning). If the meaning of words change in time, how do we know if all words from a word list change by the same amount?

As noted by Sandefur

…how do you determine if a word is the same for two culture? If the spelling of a word or the pronunciation of a word changes ’slightly’, we will still count it as being on the list. If the meaning of a word changes ’sigificantly’, we will delete it from the list. Thus, there is some subjectivity in determining Q which could drastically change the results. Also some words are more likely to change than the others. But in the multiplication principle, we tacitly assumed that all words were equally likely to change. This can throw the results off.

The moral of this is that you need to be careful not to make more claims about your model than are justified.

Glottochronology: Part II

February 18, 2009

Please read Glottochronology Part I before reading this post.

Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is

p(A1A2) = p(A1)p(A2)

Some authors like Sandefur call this the multiplication principle.

As Sandefur noted and quote:

Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.

Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.

Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we woud expect A and B to both know PQ per cent of the words.

In Part III we will provide some examples of this principle to the evolution of cultures. 

Later in this series we will explain how the independence assumption affects some of the reasonings  and claims behind language dating models.

In the meantime: How relevant this model is to IR? Well, assume that A and B are not Franks and Sues, but  passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.

Glottochronology: Part I

February 16, 2009

Although not back-to-back during this week I will be posting on glottochronology.

Glottochronology is a combination of greek terms which essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.

Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the percent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,

A(n+1) =rA(n),

where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0)=1. Therefore, at time k, A(k)=r^k(1) = r^k is the percent of the original list of words still in use, as a decimal.

Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millenioum ago, glottochronologists can estimate r^1000. This number r also depends on the particular language. But glottochronologists have found that the number r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that

r^1000 = 0.805

Thus, the per cent of the original list of words that are still in use k years later is

r^k = (0.805)^(0.001 k).

Glottochronology is one of those fields that were popular, but that many now cast doubts about it, due to questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.

Building a Query Reduction Search Engine

February 9, 2009

As part of ongoing research, I’m building a search engine with ondemand query reduction capabilities.  To our knowledge, none of the current commercial search engines provides such features.

Experimental machines that do this require the use of training sets, decision graphs and decision trees. For references on this topic, read

Query Expansion and Query Reduction in Document Retrieval

A Two-Step Approach for Tree-structured XPath Query Reduction

Unfortunately, these type of search engines  are not popular, in part because are not practical at the scale of the Web and because require retraining of both the search engine and users –not to mention that these type of search machines are not precisely user-friendly.

Think about this: In general, average users are lazy searchers. They are also too busy to do neither query expansion or query reduction as we do in IR, nor they are prone to consult lookup lists, thesaurus, query logs, etc to refine their searches while surfing across databases. At any given point in time of the year the mentality of non-IR searchers  is: “Don’t make me think”.

Thus, building a search engine that does ondemand query reductions for the Web (and that users will use without being forced to think) is not that easy.

We would like to hear of others working on similar research as we believe we have found a promising solution, at least partially. Ours is different from the approaches given in the above two references.

Time Series Semantics

February 4, 2009

The title of this post might be a bit confusing, but we couldn’t find a better choice of words. The point to be made is that definitions and associations of terms can be affected as events evolve in time.

Consider the key term [man].

Providing a meaning or perception for [man] during a good or a bad Economy is a good example.

[man] means something different, depending if you ask to an employed or unemployed [man].

This can be illustrated by reading Why losing a job can hurt men more.

Although it might be a quite depressing article (especially for those with a pink slip in their foreheads), note the key words/phrases that define its topic and overall semantics. Incidentally, the article starts with

“Thomas Schuler is a man.”

and almost at the ends says:

“A man is what he does.”

The key words/phrases of the article can be taken as a semantics state in time attached to the [man] key term

When OR Clusters Behave as AND

January 30, 2009

We are currently doing some testing with a new experimental engine. The experiment consists in using OR as the default mode and IDF-only for scoring terms. IDF is precomputed straight from the inverted index which is also computed at query time. We are also trying replacing IDF with Entropy scores.

With large collections, the inverted index is written to a text file and read at query time.

Since local information (e.g., term freq) is ignored, keyword spam is not an issue.

Instead of a Vector Space Model, we use a cummulative sum of scores over IDF scores, such that is not necessary to compute cosine similarities (*).

So far the results of the experiment is that with multi-term queries two extreme clusters are obtained:

1. the top N ranked documents almost behave as being queried in AND mode and as obeying the Cluster Hypothesis.

2. the M ranked documents at the bottom behave as being queried either in EXACT mode or with a single-term query. (**)

Between these extremes we have some noisy results.  

If some have tried this before, we would love to hear about it. Contact us by email.

 

PS.

(*) In this way we don’t need to make independence assumptions.

(**) With few changes, M now behaves as being queried with single-term queries or few query terms, which is what we expected. The N set still is the more interesting. The middle cases are now quite noisy.

IR Quiz: Searching

December 12, 2008

Here is a great quiz for IR students (2 points each).

Explain and give example for the following ways of searching:

1. horizontal

2. vertical

3. local

4. global

5. proximity

6. adjacency

7. smart

8. expanded

9. advanced

10. relevance feedback

11. remote

12. fusion-based

13. steepest ascent

14. steepest descent

15. binary

16. sequential

17. random

18. deterministic

19. recycled

20. agglomerative

21. constrained

22. contextual

23. regexp

24. boolean

25. exact

26. semantic-based

27. ontology-based

28. thesauri-based

29. log-based

30. popularity-based

31. genetic

32. cellular automata

33. markovian

34. difussion-based

Answers to IR Quiz

June 27, 2008

Answers to the IR Quiz are given below:

Term Independence Assumption:

If k1 and k2 are statistically independent they should occur by chance, co-occurring in only

(100)(200)/500 = 40 documents.

Thus, if they occur by chance, the number of documents mentioning the k1 k2 sequence should be unknown, but certainly no greater than 40.

Term Dependence Assumption:

If terms actually co-occur in 70 documents, they are co-occuring more often than by chance (70 > 40). So, terms are statistically dependent and positively correlated. It is a given that the k1 k2 terms sequence is present in 25 out of the 70 documents wherein terms co-occur.

Detailed Results:

Results are given below, rounded off to two decimal places. First/second results respectively are for terms independence/dependence assumptions. You should be able to double check these results.

1. k1 NOT k2: 60, 30

2. k2 NOT k1: 160, 130

3. k1 OR k2 (unconditional OR): 260, 230

4. k1 OR k2 (conditional OR): 220, 160

5. NOT k1: 400, 400

6. NOT k2: 300, 300

7. NOT (k1 AND k2): 460, 430

8. k1 AND k2 NOT (k1 k2): NC, 45

9. EF-Ratio of the k1 k2 terms sequence: NC, 0.36

10. c12-index of the k1 k2 terms sequence: NC, 0.11

11. c12-index of k1 AND k2: 0.15, 0.30

12. IDF of k1: 0.70, 0.70

13. IDF of k2: 0.40, 0.40

14. IDF of k1 AND k2: 1.10, 0.85

15. IDF of k1 k2 terms sequence: NC, 1.30

Additional exercises open to discussion:

Calculate the associated odds, odd ratios, and logits.

 

Sneak Preview of IRWatch: Understanding IDF

June 23, 2008

idf

“IDF is simply neither a pure heuristic, nor the
theoretical mystery many have made it out to be.
We have a pretty good idea why it works as well
as it does.” –Stephen E. Robertson

Here is a sneak preview of IR Watch for the month of June, 2008. It should be in subscribers inbox during the day or at the latest tomorrow. 

It is discussed within the context of co-occurrence theory and term independence/dependence assumptions. Issues and misconceptions related with this measure are addressed. Initially we made plans for including current ongoing work we are conducting on specificity measures, but we have chosen not to since is not the appropriate forum. 

IRW-2008-06: Understanding Inverse Document Frequency (IDF)

In this issue:

Introduction
Robertson-Sparck Jones Early Work on IDF
What IDF Is Not
What IDF Really Is
On Terms Independence
On Terms Dependence
Few Examples
Estimating the IDF of a Phrase
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Search Smart with WCC’s ELISE

August 9, 2007

The list of subscribers to IRWatch is growing at fast pace. One of our recent subscribers is a developer at WCC, makers of ELISE Smart Search & Match. This seems to be a quite interesting technology. I highly recommend readers to visit their site http://wcc-group.com/

(more…)

Row-Pruning Algorithm Tutorial

August 8, 2007

In IRW-2007-8 I introduced a row-pruning algorithm (RPA) and its unconditional version (URPA).

Let C be a collection of documents and let T = {t1, t2… tm} be m unique terms extracted from C. Assume that term i is combined with all other terms such that starting with term i a family of term sequences is obtained. Assume that these term sequences are hidden (latent) in C. The purpose of RPA and URPA is to identify the composition of these term sequences and to find those that occur more frequently in C.

(more…)

Sneak Preview of IR Watch

July 31, 2007

The current issue of IRW-2007-08 should be out within the next few days. The Association Rules Part 2 discuses how association rule mining techniques from market basket can be applied to Web Mining.

(more…)

Association Rule Mining Thesaurus

July 30, 2007

Here is a 2004 paper on Association Thesaurus Construction for Interactive Query Expansion based on Association Rule Mining

The article discusses basic association rule mining concepts like support, confidence, and pruning as we described in Association Rules Part 1 (July issue of IR Watch – The Newsletter). BTW, read Part 2 in the August issue.

(more…)

Revisiting EXACT Search Shortcuts in Google

July 24, 2007

I have discussed AND and EXACT searches many times, but did you know the following?

In addition to enclosing search terms with double quotes (”like this”), in some search engines one can invoke a shortcut to an EXACT search by using certain characters that serve as sequence connectors. These work in the same way double quotes work. The most common is the hyphen; e.g. 

(more…)

The Risks of Thesaurus-based Expansions

July 23, 2007

SEOmoz has a great discussion on why at times search engines don’t return relevant results; that is, why some results perceived by users as being not relevant to their information needs (queries) are ranked high by search engines.

Some bloggers at SEOmoz attribute this in part to precision and recall issues. We have covered these topics in different occasions; so, let revisit some points along those lines.

(more…)

Thesis on Redundant ITF

July 17, 2007

Thomas Richard Lynam, has researched extensively a variant of ITF called Redundant ITF (RITF). His 2002 master thesis, “Exploitation of Redundant Inverse Term Frequency”, is a must-read for anyone interested in the topic. His thesis is available as a PDF and Postscript.

The justification for using RITF is as follows.

(more…)

What is a Similarity Thesaurus?

July 16, 2007

In my previous post I explained to a reader the difference between inverse term frequency (ITF) and inverse document frequency (IDF), but did not provide practical applications. This post is to explain what ITF is good for.Like IDF, ITF is a global weight measure; i.e., Gi = ITF. Combined with a local weight measure (Lij), it can be used to compute an overall weight.Local weights can be defined in many different ways. Here is one definition:

(more…)

Query-Log-based Personalization

July 5, 2007

Data mining query logs? Then these research papers might interest you.

(more…)

Query Relevance Feedback Algorithms

July 4, 2007

Dependence or Independence Day? Ask regular citizens.

Meanwhile, how about some IR papers on the dependence/independence of query relevance feedback?

Here is a list of really interesting papers on the subject from Michael Ortega-Binderberger’s group:

(more…)