• About IR Thoughts

IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

IR Thoughts

Monthly Archives: September 2007

Disgression to Numerical Dynamics and Chaos

27 Thursday Sep 2007

Posted by egarcia in Programming, Theses

≈ Leave a Comment

I recently reviewed a thesis project wherein the student used recursion to minimize a function. The student used number of iterations as a stopping criterion. When I revised the manuscript, it gave me a flashback to 1991.

Back then I was inmersed into Numerical Dynamics and Chaos Theory conferences. I wrote a paper wherein one little component consisted in stopping a routine if it reaches 10,000 iterations. The piece landed in the hands of a picky reviewer. It was rejected on the grounds that number of iterations was not a good stopping criterion.

The reviewer was absolutely right. I missed the point back then, though.

Why use 1,000, 10,000, or 1,000,000 iterations? The point is that no matter how many iterations one uses there will always be someone out there asking: Why use n number of iterations? Why not more?

Indeed, number of iterations as a stopping criterion is not recommended, especially if the goal is optimization. Not to mention that this is a subjective approach which requires initializing an extra parameter. The parameter might be sensitive to other variables or initial conditions.

Instead of number of iterations, a better approach consists in using a comparative function F. F can be defined as the absolute error or absolute relative error between a sequence of results.

If F is less than a threshold value, we stop the iterative procedure; otherwise, we continue with it.

Such functions are given below:

F = | (X(n) – X(n+1)) | * 100 < threshold value
F = | (X(n) – X(n+1))/X(n+1) | * 1000 < threshold value

and so forth.

For example if F is less than, let say, 10^-6, we stop the recursions.

In this case, F is based on an objective, statistical criterion: the relative error between two consecutive results. Note that:

1. the analyst states the threshold value that best sweets his/her precision needs.

2. F is independent of any initial condition or parameter.

Of course, for this to work the recursion should be toward what we call in Chaos Theory, an attractive fixed point. Methods for evaluating if this is the case do exist in Numerical Dynamics (e.g., the Fixed Point Theorem).

Fortunately, the student’s problem was not of this kind and he was not dealing with a dynamical system at all.

IPAM Workshop: Cyber-enabled Discovery & Information

26 Wednesday Sep 2007

Posted by egarcia in Conferences

≈ Leave a Comment

Dr. Mark Green, from IPAM at UCLA, sent me  this Call for Workshop:

Dear colleagues,

I am writing to let you know about a 1-day workshop IPAM is holding on October 29 which the NSF has just asked us to organize as part of a coordinated effort by the NSF Mathematics Institutes.  The workshop is intended to aid those interested in writing proposals to understand the initiative better and to help them in finding collaborators.

The NSF is rolling out a major new initiative in late September on “Cyber-enabled Discovery and Innovation.” This will begin as a $50 million dollar program the first year, and will grow over the next 5 years into a $250 million program.

The goal of this workshop is to inform the scientific community about the CDI program, with the aim of eliciting strong proposals involving mathematical scientists. This workshop will be focused on the “knowledge extraction” aspect of the CDI program. For more information about CDI, see

http://www.nsf.gov/news/news_summ.jsp?cntn_id=108366

We plan on having three types of presentations:

–An information session with Q&A with a representative from NSF

–Several panels where each panelist would present a few slides about what they consider to be the interesting and important questions of long-term significance, followed by a discussion with Q&A. Topics we envision at this point are:

(i) Numerical Methods for Fast Knowledge Extraction

(ii) Nonlinear Methods for Dimensional Reduction

(iii) Knowledge Extraction from Images and Problems of Visualization

(iv) Discrete and Graph-based Techniques for Knowledge Extraction and Analysis of Large Networks

–Selected examples of success stories of applying knowledge extraction techniques from the mathematical sciences to large scale problems

For the program webpage and an online application/registration form, see http://www.ipam.ucla.edu/programs/cdi2007/ .

 

Best regards,

Mark

A Novel Perspective of Similarity

18 Tuesday Sep 2007

Posted by egarcia in Data Mining, Machine Learning, Vector Space Models

≈ Leave a Comment

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Binary Similarity Calculator

17 Monday Sep 2007

Posted by egarcia in IR Tools, Programming

≈ 1 Comment

In my previous post I mentioned the Hamming Distance. I was thinking in adding to my Levenshtein Edit Distance calculator the ability to compute this statistic, but I changed my mind.

Rather, why not design a whole new tool that computes, in addition to the Hamming Distance, other coefficients?

So,…

Welcome to the Binary Similary Calculator and Vector Analyzer.

This is a new tool I’ve built over the weekend. It will be uploaded in the next few days.

The tool compares the similarity of a pair of strings. These should consist of binary characters and be of the same length. Since strings are treated as vectors, the tool also works as a vector analyzer; hence, its name.

The following similarity measures are computed from a contingency table of positive and negative matches and mismatches: Sokal-Michener (i.e., Simple Matching), Jaccard, Russell-Rao, Hamann, Sorensen (i.e., Dice), antiDice, Sneath-Sokal, Rodger-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson, and Gower2.

The tool also computes Dot Products and Cosine Coefficients.

Although a distance metric, the Hamming Distance (number of positive and negative mismatches) has been included because of its close relation with some of the aforementioned similarity measures.

Hamming Distance

14 Friday Sep 2007

Posted by egarcia in IR Tools, Programming

≈ Leave a Comment

After uploading the Levenshtein Edit Distance tool, several readers asked if I could incorporate a version to account for the Hamming Distance (HMD). Some even asked if could clarify the difference between the two.

I’m not sure if I can find time for the former, but many scripts are already online for computing Hamming Distances. One only needs to write few lines of code to compare the number of different characters between strings of same length. This is the Hamming Distance.

Here is a nice exercise:

Compute LED and HMD for derivatives of “ELVIS”. Cluster results by LED and HMD values.

Being Referenced Without Being Referenced

13 Thursday Sep 2007

Posted by egarcia in IR Tutorials, Vector Space Models

≈ Leave a Comment

I have the honor of being “referenced” by Lau Patrick from ETH’s Databases and Information Systems Group in the manuscript Algorithms for Data Base Systems Report to the topic “Abusing Web Search” based on “A web based kernel function for measuring the similarity of short text snippets”.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/LauReport.pdf

Interesting project. I can even recognize some figures and examples of the manuscript taken from my document indexing tutorial,
http://www.miislita.com/information-retrieval-tutorial/indexing.html

In particular, Figure 1 of the tutorial, shown in the manuscript as Figure 2. No link to the tutorial was provided, but to a different document at Mi Islita. That’s fine, since I referenced that my Figure 1 is a modification from the one at  

http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

At least they should have given credit to Ruthven and Lalmas to avoid allegations of student plagiarism.

Memo to Thomas Hofmann (Google), Donald Kossmann, and Peter Widmayer from http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/:

Please! Better than talking about Web search abuse, similarity of text, and duplicated content, let’s talk about lack of originality and properly referenced work.

Random Notes and LauraMansfield

12 Wednesday Sep 2007

Posted by egarcia in Data Mining, Homeland Security, Machine Learning, Miscellaneous

≈ Leave a Comment

These are some late random notes. Sorry for the delay.

1. I am putting together a research project for a graduate student. The topic is quite interesting: homeland security. While researching the topic I came across LauraMansfield.com site. Mansfield’s site is a goldmine of information, especially for those interested in co-occurrence and word association research applied to the terrorist knowledge domain.

2. I am reviewing a graduate thesis in which logistic regression is used for data mining medical claims. Quite interesting the thesis topic. The manuscript needs some rework, though.

3. I am reading bits and pieces of an old paper on the non-transitivity nature of Jaccard’s Coefficient and a proposed indirect similarity measure.

European Conference on Digital Libraries (ECDL 2007)

11 Tuesday Sep 2007

Posted by egarcia in Conferences

≈ Leave a Comment

The Computer and Automation Research Institute of the Hungarian Academy of Sciences (MTA SZTAKI) cordially invites you to participate in the 11th ECDL conference to be held in Budapest, Hungary on 16-21 September 2007.

Aims and Scopes:

This unique and well-established series brings together researchers, developers and practitioners working in various disciplines and related areas of digital libraries all over the world. The conference will consist of a three days technical program, preceded by a tutorial day and followed by workshops. The technical program will include refereed paper presentations, plenary sessions, panels and poster sessions. The tutorials will provide in-depth looks at areas of current interest.

ECDL 2007 will be devoted to discussions about hot issues and applications and will primarily provide a forum to reinforce the collaboration of researchers towards the Ubiquitous Digital Libraries.

Topics include, but not limited to:

Digital curation
Theoretical models for digital information management
Personal and personalized digital libraries
Concepts of Digital Libraries and digital content
Collection building, management and integration
System architectures, integration and interoperability
Information organization, search and usage
Multilingual information Access and Multimedia Information Handling
User interfaces for Digital Libraries
User studies and system evaluation
Digital archiving and preservation: methodological, technical and legal issues
Collaboration in DLs
Digital Library applications in e-science, e-learning, e-government, cultural heritage, etc.

For additional information, visit the conference site

Brute Force Algorithm

10 Monday Sep 2007

Posted by egarcia in Machine Learning, Programming

≈ Leave a Comment

The brute force algorithm (BF) consists in evaluating a text stream and determining at all positions between 0 and n-m, whether an occurrence of a pattern starts there or not. After each attempt, it shifts the window pattern by exactly one position to the right. This requires of a constant extra space. Comparisons can be done in any order.

This is quite a brute force approach, hence its name. Two loops are required for its implementation. A C code is given below.

void BF(char *x, int m, char *y, int n)
{
int i, j; 

for(j = 0; j <= n - m; ++j) 

{ 

for (i = 0; i < m && x[i] == y[i + j]; ++i);if (i >= m){output(j);} 

} 

}

This code can be converted straightforward to JavaScript or PHP.
 Modern Information Retrieval, Chapter 8, p. 210, has additional information on this and derivative string matching algorithms. These all differ in the way they check and shift the window.

Metric Clusters 2

07 Friday Sep 2007

Posted by egarcia in Data Mining, Machine Learning

≈ Leave a Comment

Yesterday I provided a basic explanation of metric weights and metric clusters. This time I want to expand on this topic and provide some how-to calculations.

On Counting Word Distances

As mentioned, the distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) – p(B) |

Some IR authors use a different definition, though. For instance, Baeza-Yates and Ribeiro-Neto in Modern Information Retrieval, page 126, defines the distance between any two words as the number of words between these. The difference between these two definitions reduces to 1 count in the distance scale.

For example, the following position (P) vs. word (W) table depicts a 16-word text stream that mentions A twice and B three times:

P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W A1 X X B1 X X A2 X X X X B2 X X X B3

Adopting the initial definition, d(A1, B1) = 4 – 1 = 3; adopting Baeza-Yates and Ribeiro-Neto’s definition, d(A1, B1) = 2.

The former definition is frequently used since is easier to automate than the later and insures non-zero distance values. To accommodate a distance count to the later definition, subtract 1.

Some authors, like Koberstein and Ng (http://faculty.cs.byu.edu/~dennis/papers/mtcluster.ps), use Baeza-Yates and Ribeiro-Neto’s definition, but then add 1 to the counts to insure that word distances between terms are always non-zero. This is the same as computing distances based on the former definition.

Regardless of how you compute word distances, be consistent. Understand that conversion of distances to metric weights produces relative, comparative data.

In the rest of this post, I use Baeza-Yates and Ribeiro-Neto’s definition of word distances, without adding 1. You can rework the calculations by adding 1 if you wish, but you should be able to arrive to the same general observations.

An Example

It can be demonstrated that for any word pair, A and B, there is a metric weight between the terms such that

mw(A, B) = mw(B, A) = mw(A) = mw(B)

That is, metric weights are symmetric. This is to be expected; metric weights are derived from distances and, by definition, a distance metric is symmetric.

Let consider the previous example.

With respect to A:

mw(A1) = ½ + 1/10 + 1/14 = 0.671

mw(A2) = ½ + ¼ + 1/8 = 0.875

mw(A) = mw(A1) + mw(A2) = 1.546

With respect to B:

mw(B1) = ½ + ½ = 1

mw(B2) = ¼ + 1/10 = 0.35

mw(B3) = 1/8 + 1/14 = 0.196

mw(B) = mw(B1) + mw(B2) + mw(B3) = 1.546

Hence, the metric weight of the A, B pair is

mw(A, B) = mw(B, A) = mw(A) = mw(B) = 1.546

In practice, mw(A, B) is computed with respect to either A or B. In IR textbooks, this quantity is frequently termed an unnormalized correlation factor and used to populate an unnormalized correlation matrix, Cu.

← Older posts

♣  

September 2007
M T W T F S S
« Aug   Oct »
 12
3456789
10111213141516
17181920212223
24252627282930

♣ Favorite Sites

  • Mi Islita

♣ Pages

  • About IR Thoughts

♣ Categories

  • AIRWeb Course
  • Conferences
  • Data Mining
  • Fractal Geometry
  • Graduate Courses
  • Hacking
  • Homeland Security
  • Human-Computer Interaction
  • Image Compression
  • Internet Engineering
  • IR Quizzes
  • IR Tools
  • IR Tutorials
  • Latent Semantic Indexing
  • Legacy Posts
  • Machine Learning
  • Marketing Research
  • Miscellaneous
  • Newsletters
  • Programming
  • Quack Science
  • Queries
  • Search Engines Architecture Course
  • SEO Myths
  • Software
  • Spam
  • Statistics and Mathematics
  • Theses
  • Vector Space Models
  • Web Mining Course

♣ Recent Posts

  • Puerto Rico’s Science and Technology Trust Fund: Innovation Island Blast II
  • The L’Hôpital Rule: Deriving the Geometric Mean
  • Understanding the L’Hôpital Rule
  • How to Create Windows Metro Style Apps with JavaScript
  • Electronic Drugs and Hackers
  • Why a Social and Search Presence is Important for You
  • NY SES – 2012: My little briefing
  • Hello, World. I’m SWM.
  • SES NY – See You All There!
  • Which separators to use with title tags?
  • A Study of Puerto Rico Newspaper Home Pages
  • Hey, SEOs: On Information Gain, Keyword Wallop, and Relevance
  • Social Media and Puerto Rico Local Brands
  • When and Why not to take arithmetic averages
  • l’Hopital’s Rule and the 0^0 Power Controversy

♣ Archives

  • May 2012
  • April 2012
  • March 2012
  • February 2012
  • January 2012
  • December 2011
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • May 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • October 2010
  • September 2010
  • August 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007

♣ Category Cloud

AIRWeb Course Conferences Data Mining Fractal Geometry Graduate Courses Hacking Homeland Security Human-Computer Interaction Image Compression Internet Engineering IR Quizzes IR Tools IR Tutorials Latent Semantic Indexing Legacy Posts Machine Learning Marketing Research Miscellaneous Newsletters Programming Quack Science Queries Search Engines Architecture Course SEO Myths Software Spam Statistics and Mathematics Theses Vector Space Models Web Mining Course

Blog at WordPress.com. Theme: Chateau by Ignacio Ricci.