• About IR Thoughts

IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

IR Thoughts

Monthly Archives: February 2009

When IDF is not enough

25 Wednesday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ Leave a Comment

I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as measure of the discriminating power of a term is not enough. Gray writes:

According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that by placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word often used by Lewis Carroll.

Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *

This is where IDF as a term specificity measure breaksdown. This problem is intimate linked to The Original Sin of IR models: The Term Independence Assumption.

* I have modified a bit this assertive question to make the point more clear.

References


http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Hill, I.D., “Wouldn’t it be nice if we could write computer programs in ordinary English – or would it?” The Computer Bulletin, June 1972, pp. 306-312.

Neumann, Peter G., “Only his Only Grammarian Can Only Say What Only He Means,” ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.

Glottochronology: Part IV

20 Friday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ Leave a Comment

Please read Part I, Part II, and Part III before reading this post.

I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur’s book Discrete Dynamical Systems (Oxford, 1990).

1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?

2. Consider the model of glottochronology. Assume a language is given today.

(a) How long will it take for 1/4 of the words to change?

(b) How long will it take for 10 per cent of the words to change?

3. Suppose that person A knows 60 per cent of a list of 1000 words, person B knows 70 per cent of that list, and person C knows 30 per cent of that list.

(a) How many words do you expect all three people know?

(b) What per cent of the words is known by A and B but not by C?

Hints:

Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle  to A, B, and C.

Glottochronology: Part III

19 Thursday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ 1 Comment

Please read Part I and Part II before proceeding with this post.

Applications to cultures

Sandefur provides the following example:

Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, A and B. At time 0, they have the same language, so that for a given list of L words, A(0) = B(0) = 1 is the per cent of the words they both know (and have in common).

If we contact each of these cultures k years later, culture A will know A(k) = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k) per cent of the original list. Thus the per cent of the words that both cultures know is, by the multipliation principle,

Q = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k)

What the glottochronologist does now is to construct a list of words. From that list of words, the two cultures are studied and it is determined what per cent Q of this list of words is known by both cultures. Thus in the equation for Q above, Q is known, but k, the number of years since the two cultures separated, is unknown. Solving for k gives

k = 500lnQ/ln 0.805

To understand the significance of this ratio we need to look at some examples.

Examples

Sandefur provides several examples.

Suppose that the natives of two islands have similar language. From a list of 300 words, 180 words are understood by both groups, that is Q = 180/300 = 0.6. Then

k = 500ln 0.6/ln 0.805 = 1177.5

We then conclude that the natives of these two islands came from a common ancestry, approximately 1200 years ago.

Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation n for pairs of tribes in each geographical region. It can be argued that the region with the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its geographical location, but the largest time separation will always be the two that remained in the original area.

Drawbacks and Pitfalls of Glottochronology

The model presumes independence assumptions (see discussion on multiplication principle); that is, event cooccurance by chance.  But we know that

If p(A1A2) = p(A1)p(A2) event cooccurance is by chance.
If p(A1A2) > p(A1)p(A2) event cooccurance is more than by chance.
If p(A1A2) < p(A1)p(A2) event cooccurance is less than by chance.

One way terms deviate from independence is through their semantics (meaning). If the meaning of words change in time, how do we know if all words from a word list change by the same amount?

As noted by Sandefur

…how do you determine if a word is the same for two culture? If the spelling of a word or the pronunciation of a word changes ‘slightly’, we will still count it as being on the list. If the meaning of a word changes ‘sigificantly’, we will delete it from the list. Thus, there is some subjectivity in determining Q which could drastically change the results. Also some words are more likely to change than the others. But in the multiplication principle, we tacitly assumed that all words were equally likely to change. This can throw the results off.

The moral of this is that you need to be careful not to make more claims about your model than are justified.

Glottochronology: Part II

18 Wednesday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ 2 Comments

Please read Glottochronology Part I before reading this post.

Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is

p(A1A2) = p(A1)p(A2)

Some authors like Sandefur call this the multiplication principle.

As Sandefur noted and quote:

Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.

Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.

Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we woud expect A and B to both know PQ per cent of the words.

In Part III we will provide some examples of this principle to the evolution of cultures. 

Later in this series we will explain how the independence assumption affects some of the reasonings  and claims behind language dating models.

In the meantime: How relevant this model is to IR? Well, assume that A and B are not Franks and Sues, but  passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.

AND 2009 Conference

17 Tuesday Feb 2009

Posted by egarcia in Conferences

≈ Leave a Comment

L. Venkata Subramaniam, PhD, Manager – Information Processing and Analytics, IBM India Research Lab
http://lvs004.googlepages.com
sent us email with some great news. They are having the Third Workshop on Analytics for Noisy Unstructured Text Data (AND) on July 23-24, 2009, at Barcelona, Spain and asked us to disseminate the news.

Copy of the email follows.

——————–

Dear Edel,

We are organizing the Third Workshop on Analytics for Noisy Unstructured Text Data on July 23-24, 2009, at Barcelona, Spain.

We know you work in related areas and would be happy to have you submit your research work to this workshop.

Also I request you to add AND 2009 to the blog you are maintaining at:
http://irthoughts.wordpress.com/

I know many IR researchers visit your blog and through the blog we will be able to reach.

AND 2009:
http://and2009workshop.googlepages.com/

This is the third in the series of workshops: AND 2007 at IJCAI 2007:
http://research.ihost.com/and2007/

AND 2008 at SIGIR 2008:

http://and2008workshop.googlepages.com/

Both earlier workshops resulted in ACM proceedings and journal special issues. Here are some details of AND 09:

http://and2009workshop.googlepages.com/

Workshop Name: Third Workshop on Analytics for Noisy Unstructured Text Data (AND 09) in conjunction with ICDAR 09

Submission Date: 20 April 2009
Notification Date: 20 May 2009
Workshop Dates: 23-24 July 2009 Workshop
Location: Barcelona Spain

Regards
Venkat

————————-

So now you know. Start making plans for attending this great workshop. If you are visiting Madrid, swing by to Barcelona for a few days. Don’t miss this unique opportunity.

Glottochronology: Part I

16 Monday Feb 2009

Posted by egarcia in Data Mining, Queries, Vector Space Models

≈ 3 Comments

Although not back-to-back during this week I will be posting on glottochronology.

Glottochronology is a combination of greek terms which essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.

Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the percent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,

A(n+1) =rA(n),

where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0)=1. Therefore, at time k, A(k)=r^k(1) = r^k is the percent of the original list of words still in use, as a decimal.

Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millenioum ago, glottochronologists can estimate r^1000. This number r also depends on the particular language. But glottochronologists have found that the number r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that

r^1000 = 0.805

Thus, the per cent of the original list of words that are still in use k years later is

r^k = (0.805)^(0.001 k).

Glottochronology is one of those fields that were popular, but that many now cast doubts about it, due to questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.

When newspapers stick to spam marketing

12 Thursday Feb 2009

Posted by egarcia in Spam

≈ Leave a Comment

That’s what local newspapers in Puerto Rico are doing: insisting in old spam marketing tactics. Wake up local web masters. It is not 1995. Redirections, use of splash page ads, and keyword spamming  in meta keyword tags, not only does not work with search engines, but annoys users to no end.

This is exactly what some newspapers like the local El Nuevo Dia newspaper are doing.

Today I typed
http://endi.com
and was redirected to a full-size splash page ad. Then I have to opt-out to be redirected to a content page. Thank you for slapping in my face those annoying ads. Why chase away readers?

Adding insult to injury, when I looked at their horrible content page source code (
http://www.elnuevodia.com/noticias
), it is clear those that designed the page are insisting in keyword spamming through meta keyword tags. How many keywords can you count in the following sample?

<meta name=”Keywords” content=”ENDI El Nuevo Dia, periodico, Puerto, Rico, internet, noticias, boricua, Clima, Horoscopo, Coqui, Sapo,Concho,Telefonica,Coqui net,RonNueva York ,Horoscopo,Coqui,Sapo Concho,Telefonica,Coqui net,Ron,Bacardi,Estatus,Sila Maria Calderon,Menudo,Carlos Romero Barcelo,Pedro Rossello,Anibal Acevedo Vila,Daddy Yankee,Tego lderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Marcony,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne,Educacion,Cine,Entretenimiento,Ejercicio,Bienestar,Recetas,Recetarios,Musica,Boricuas,Carlos Arroyo, Islanders,TUTV,Motoristas,Internautas ,Medicos,Historiadores,Venezuela,Santo Domingo,Republica Dominicana,Cuba ,Queens,Manhattan,Bronx,Nueva York ,Espiritualidad,Maestros,Televicentro,Telemundo,Univision,Bellas Artes,Telenovela,Comunidades,Isla,Culebrita,St. Thomas,Isla Nena,Culebra,Isla Mona,Vieques,Filiberto Ojeda,Capitolio,Turismo,Periodico,Periodismo,Veteranos,Marina,Soldados,Senado,Energia Electrica,Plena,Bomba,Reggaeton,Wisin y Yandel,Casinos,Salsa,Construccion,Vacaciones,Jazz,Tito Trinidad,Oscar de la Hoya,Millie Corretjer,Roberto Clemente,San Juan,Miguel Cotto,Becas,Tercera Edad,Museos,Clasificados,Clasificados online,Clasificados en linea,Boletines,Servicios de noticias,Luis A. Ferre,Ferre Rangel,Hepatitis,Dengue,Diabetes,Obituarios,Obesidad,Librerias,Libros,Viejo San Juan,El Morro,Ballaja,Museo de Arte de Ponce,Ponce,Mayaguez,Parque Indigena,Observatorio de Arecibo,Tainos,El Yunque,El tunel Guajataca,Zoologico,RUM,UPR,Sagrado Corazon,Interamericana,Universidades,Colegios,Escuelas,Fajardo,Bahia,Luquillo,Bioluminiscente,La Parguera,Carlos Delgado,Igor Gonzalez,Olga Tanon,Piculin Ortiz,Primera Hora,Zonai,Virtual,Beisbol,Mundial ,Raul Papaleo,Sondeos,Justas,Pavas,Palmas,Munoz Marin,Lyann Puig,Elaine Lopez,Javier Lopez,Remi,Poesia ,Olga Nolla,Rosario Ferrer,Mayra Montero”>

How dumb is that? It is clear these folks have no clue about search engine marketing or how search engines work.  Otherwise, why insist in old 1995 practices proven a waste today? I question whether they have a clue about optimizing news stories and press releases for search engines.

OK, SEO firms, pitch them, but don’t try to sell them snake oil like keyword density, SEO LSI, rare terms crap, synonyms stuffing, etc.

Building a Query Reduction Search Engine

09 Monday Feb 2009

Posted by egarcia in Data Mining, Queries

≈ 2 Comments

As part of ongoing research, I’m building a search engine with ondemand query reduction capabilities.  To our knowledge, none of the current commercial search engines provides such features.

Experimental machines that do this require the use of training sets, decision graphs and decision trees. For references on this topic, read

Query Expansion and Query Reduction in Document Retrieval

A Two-Step Approach for Tree-structured XPath Query Reduction

Unfortunately, these type of search engines  are not popular, in part because are not practical at the scale of the Web and because require retraining of both the search engine and users –not to mention that these type of search machines are not precisely user-friendly.

Think about this: In general, average users are lazy searchers. They are also too busy to do neither query expansion or query reduction as we do in IR, nor they are prone to consult lookup lists, thesaurus, query logs, etc to refine their searches while surfing across databases. At any given point in time of the year the mentality of non-IR searchers  is: “Don’t make me think”.

Thus, building a search engine that does ondemand query reductions for the Web (and that users will use without being forced to think) is not that easy.

We would like to hear of others working on similar research as we believe we have found a promising solution, at least partially. Ours is different from the approaches given in the above two references.

Time Series Semantics

04 Wednesday Feb 2009

Posted by egarcia in Data Mining, Latent Semantic Indexing, Queries

≈ Leave a Comment

The title of this post might be a bit confusing, but we couldn’t find a better choice of words. The point to be made is that definitions and associations of terms can be affected as events evolve in time.

Consider the key term [man].

Providing a meaning or perception for [man] during a good or a bad Economy is a good example.

[man] means something different, depending if you ask to an employed or unemployed [man].

This can be illustrated by reading Why losing a job can hurt men more.

Although it might be a quite depressing article (especially for those with a pink slip in their foreheads), note the key words/phrases that define its topic and overall semantics. Incidentally, the article starts with

“Thomas Schuler is a man.”

and almost at the ends says:

“A man is what he does.”

The key words/phrases of the article can be taken as a semantics state in time attached to the [man] key term

IRW: Data Mining Credit Cards

02 Monday Feb 2009

Posted by egarcia in Data Mining, Hacking, Newsletters

≈ Leave a Comment

data-mining-credit-cards1

The current issue of IR Watch – The Newsletter will be available during the day. It consists of the following sections.

Featuring Article: Data Mining Credit Cards

In this issue of the newsletter we cover Luhn’s Algorithm, also known as the Modulus 10 or Mod-10 Test. This algorithm is used for data mining and validation of credit cards. Credit cards fraud is a topic that never goes away.

QA: Types of Links

What is the difference between in-links, out-links, co-citation, and co-reference?

Historical Notes: The Whirlwind Project

Top CS: State University of New Jersey, Rutgers

Who is Who in IR: Tefko Saracevic

Graduate Theses

Data Mining Blogs

and more.

February 2009
M T W T F S S
« Jan   Mar »
 1
2345678
9101112131415
16171819202122
232425262728  

Favorite Sites

  • Mi Islita

Pages

  • About IR Thoughts

Categories

  • AIRWeb Course
  • Conferences
  • Data Mining
  • Dynamics
  • Fractal Geometry
  • Graduate Courses
  • Hacking
  • Homeland Security
  • Human-Computer Interaction
  • Image Compression
  • Internet Engineering
  • IR Quizzes
  • IR Tools
  • IR Tutorials
  • Latent Semantic Indexing
  • Legacy Posts
  • Machine Learning
  • Marketing Research
  • Miscellaneous
  • News
  • Newsletters
  • Programming
  • Quack Science
  • Queries
  • Scripts
  • Search Engines Architecture Course
  • SEO Myths
  • Software
  • Spam
  • Statistics and Mathematics
  • Theses
  • Vector Space Models
  • Web Mining Course

Recent Posts

  • “Powered by” in Spanish
  • Some nice features added to the Image Crawler
  • The Images Crawler
  • A nice service for my locals
  • An update to the Web Crawler
  • New similarity measures
  • The Web Crawler is Back!
  • Tracking Users: An Email Crawler on Steroids
  • The Email Crawler: A Tool for Gathering Emails
  • The Binary Distance Calculator – a tool for comparing binary sets
  • Fractalettes: A Fractal Design Strategy to Color Mining and Learning through Discovery
  • AZZOO and WAZZOO: New Similarity Measures for the 21st Century
  • The Binary Similarity Calculator
  • From Harlem Shake to Link Shake: The Qualified Links Shake
  • Web Vulnerabilities and Search Engines

Archives

  • May 2013
  • April 2013
  • March 2013
  • February 2013
  • January 2013
  • December 2012
  • November 2012
  • October 2012
  • September 2012
  • August 2012
  • July 2012
  • June 2012
  • May 2012
  • April 2012
  • March 2012
  • February 2012
  • January 2012
  • December 2011
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • May 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • October 2010
  • September 2010
  • August 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007

AIRWeb Course Conferences Data Mining Fractal Geometry Graduate Courses Hacking Homeland Security Human-Computer Interaction Internet Engineering IR Quizzes IR Tools IR Tutorials Latent Semantic Indexing Legacy Posts Machine Learning Marketing Research Miscellaneous Newsletters Programming Quack Science Queries Scripts Search Engines Architecture Course SEO Myths Software Spam Statistics and Mathematics Theses Vector Space Models Web Mining Course

Blog at WordPress.com. Theme: Chateau by Ignacio Ricci.