Archive for September, 2007

Disgression to Numerical Dynamics and Chaos

September 27, 2007

I recently reviewed a thesis project wherein the student used recursion to minimize a function. The student used number of iterations as a stopping criterion. When I revised the manuscript, it gave me a flashback to 1991.

Back then I was inmersed into Numerical Dynamics and Chaos Theory conferences. I wrote a paper wherein one little component consisted in stopping a routine if it reaches 10,000 iterations. The piece landed in the hands of a picky reviewer. It was rejected on the grounds that number of iterations was not a good stopping criterion.

The reviewer was absolutely right. I missed the point back then, though.

Why use 1,000, 10,000, or 1,000,000 iterations? The point is that no matter how many iterations one uses there will always be someone out there asking: Why use n number of iterations? Why not more?

Indeed, number of iterations as a stopping criterion is not recommended, especially if the goal is optimization. Not to mention that this is a subjective approach which requires initializing an extra parameter. The parameter might be sensitive to other variables or initial conditions.

Instead of number of iterations, a better approach consists in using a comparative function F. F can be defined as the absolute error or absolute relative error between a sequence of results.

If F is less than a threshold value, we stop the iterative procedure; otherwise, we continue with it.

Such functions are given below:

F = | (X(n) - X(n+1)) | * 100 < threshold value
F = | (X(n) - X(n+1))/X(n+1) | * 1000 < threshold value

and so forth.

For example if F is less than, let say, 10^-6, we stop the recursions.

In this case, F is based on an objective, statistical criterion: the relative error between two consecutive results. Note that:

1. the analyst states the threshold value that best sweets his/her precision needs.

2. F is independent of any initial condition or parameter.

Of course, for this to work the recursion should be toward what we call in Chaos Theory, an attractive fixed point. Methods for evaluating if this is the case do exist in Numerical Dynamics (e.g., the Fixed Point Theorem).

Fortunately, the student’s problem was not of this kind and he was not dealing with a dynamical system at all.

IPAM Workshop: Cyber-enabled Discovery & Information

September 26, 2007

Dr. Mark Green, from IPAM at UCLA, sent me  this Call for Workshop:

Dear colleagues,

I am writing to let you know about a 1-day workshop IPAM is holding on October 29 which the NSF has just asked us to organize as part of a coordinated effort by the NSF Mathematics Institutes.  The workshop is intended to aid those interested in writing proposals to understand the initiative better and to help them in finding collaborators.

The NSF is rolling out a major new initiative in late September on “Cyber-enabled Discovery and Innovation.” This will begin as a $50 million dollar program the first year, and will grow over the next 5 years into a $250 million program.

The goal of this workshop is to inform the scientific community about the CDI program, with the aim of eliciting strong proposals involving mathematical scientists. This workshop will be focused on the “knowledge extraction” aspect of the CDI program. For more information about CDI, see

http://www.nsf.gov/news/news_summ.jsp?cntn_id=108366

We plan on having three types of presentations:

–An information session with Q&A with a representative from NSF

–Several panels where each panelist would present a few slides about what they consider to be the interesting and important questions of long-term significance, followed by a discussion with Q&A. Topics we envision at this point are:

(i) Numerical Methods for Fast Knowledge Extraction

(ii) Nonlinear Methods for Dimensional Reduction

(iii) Knowledge Extraction from Images and Problems of Visualization

(iv) Discrete and Graph-based Techniques for Knowledge Extraction and Analysis of Large Networks

–Selected examples of success stories of applying knowledge extraction techniques from the mathematical sciences to large scale problems

For the program webpage and an online application/registration form, see http://www.ipam.ucla.edu/programs/cdi2007/ .

 

Best regards,

Mark

A Novel Perspective of Similarity

September 18, 2007

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Binary Similarity Calculator

September 17, 2007

In my previous post I mentioned the Hamming Distance. I was thinking in adding to my Levenshtein Edit Distance calculator the ability to compute this statistic, but I changed my mind.

Rather, why not design a whole new tool that computes, in addition to the Hamming Distance, other coefficients?

So,…

Welcome to the Binary Similary Calculator and Vector Analyzer.

This is a new tool I’ve built over the weekend. It will be uploaded in the next few days.

The tool compares the similarity of a pair of strings. These should consist of binary characters and be of the same length. Since strings are treated as vectors, the tool also works as a vector analyzer; hence, its name.

The following similarity measures are computed from a contingency table of positive and negative matches and mismatches: Sokal-Michener (i.e., Simple Matching), Jaccard, Russell-Rao, Hamann, Sorensen (i.e., Dice), antiDice, Sneath-Sokal, Rodger-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson, and Gower2.

The tool also computes Dot Products and Cosine Coefficients.

Although a distance metric, the Hamming Distance (number of positive and negative mismatches) has been included because of its close relation with some of the aforementioned similarity measures.

Hamming Distance

September 14, 2007

After uploading the Levenshtein Edit Distance tool, several readers asked if I could incorporate a version to account for the Hamming Distance (HMD). Some even asked if could clarify the difference between the two.

I’m not sure if I can find time for the former, but many scripts are already online for computing Hamming Distances. One only needs to write few lines of code to compare the number of different characters between strings of same length. This is the Hamming Distance.

Here is a nice exercise:

Compute LED and HMD for derivatives of “ELVIS”. Cluster results by LED and HMD values.

Being Referenced Without Being Referenced

September 13, 2007

I have the honor of being “referenced” by Lau Patrick from ETH’s Databases and Information Systems Group in the manuscript Algorithms for Data Base Systems Report to the topic “Abusing Web Search” based on “A web based kernel function for measuring the similarity of short text snippets”.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/LauReport.pdf

Interesting project. I can even recognize some figures and examples of the manuscript taken from my document indexing tutorial,
http://www.miislita.com/information-retrieval-tutorial/indexing.html

In particular, Figure 1 of the tutorial, shown in the manuscript as Figure 2. No link to the tutorial was provided, but to a different document at Mi Islita. That’s fine, since I referenced that my Figure 1 is a modification from the one at  

http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

At least they should have given credit to Ruthven and Lalmas to avoid allegations of student plagiarism.

Memo to Thomas Hofmann (Google), Donald Kossmann, and Peter Widmayer from http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/:

Please! Better than talking about Web search abuse, similarity of text, and duplicated content, let’s talk about lack of originality and properly referenced work.

Random Notes and LauraMansfield

September 12, 2007

These are some late random notes. Sorry for the delay.

1. I am putting together a research project for a graduate student. The topic is quite interesting: homeland security. While researching the topic I came across LauraMansfield.com site. Mansfield’s site is a goldmine of information, especially for those interested in co-occurrence and word association research applied to the terrorist knowledge domain.

2. I am reviewing a graduate thesis in which logistic regression is used for data mining medical claims. Quite interesting the thesis topic. The manuscript needs some rework, though.

3. I am reading bits and pieces of an old paper on the non-transitivity nature of Jaccard’s Coefficient and a proposed indirect similarity measure.

European Conference on Digital Libraries (ECDL 2007)

September 11, 2007

The Computer and Automation Research Institute of the Hungarian Academy of Sciences (MTA SZTAKI) cordially invites you to participate in the 11th ECDL conference to be held in Budapest, Hungary on 16-21 September 2007.

Aims and Scopes:

This unique and well-established series brings together researchers, developers and practitioners working in various disciplines and related areas of digital libraries all over the world. The conference will consist of a three days technical program, preceded by a tutorial day and followed by workshops. The technical program will include refereed paper presentations, plenary sessions, panels and poster sessions. The tutorials will provide in-depth looks at areas of current interest.

ECDL 2007 will be devoted to discussions about hot issues and applications and will primarily provide a forum to reinforce the collaboration of researchers towards the Ubiquitous Digital Libraries.

Topics include, but not limited to:

Digital curation
Theoretical models for digital information management
Personal and personalized digital libraries
Concepts of Digital Libraries and digital content
Collection building, management and integration
System architectures, integration and interoperability
Information organization, search and usage
Multilingual information Access and Multimedia Information Handling
User interfaces for Digital Libraries
User studies and system evaluation
Digital archiving and preservation: methodological, technical and legal issues
Collaboration in DLs
Digital Library applications in e-science, e-learning, e-government, cultural heritage, etc.

For additional information, visit the conference site

Brute Force Algorithm

September 10, 2007

The brute force algorithm (BF) consists in evaluating a text stream and determining at all positions between 0 and n-m, whether an occurrence of a pattern starts there or not. After each attempt, it shifts the window pattern by exactly one position to the right. This requires of a constant extra space. Comparisons can be done in any order.

This is quite a brute force approach, hence its name. Two loops are required for its implementation. A C code is given below.

void BF(char *x, int m, char *y, int n)
{
int i, j; 

for(j = 0; j <= n - m; ++j) 

{ 

for (i = 0; i < m && x[i] == y[i + j]; ++i);if (i >= m){output(j);} 

} 

}

This code can be converted straightforward to JavaScript or PHP.
 Modern Information Retrieval, Chapter 8, p. 210, has additional information on this and derivative string matching algorithms. These all differ in the way they check and shift the window.

Metric Clusters 2

September 7, 2007

Yesterday I provided a basic explanation of metric weights and metric clusters. This time I want to expand on this topic and provide some how-to calculations.

On Counting Word Distances

As mentioned, the distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) - p(B) |

Some IR authors use a different definition, though. For instance, Baeza-Yates and Ribeiro-Neto in Modern Information Retrieval, page 126, defines the distance between any two words as the number of words between these. The difference between these two definitions reduces to 1 count in the distance scale.

For example, the following position (P) vs. word (W) table depicts a 16-word text stream that mentions A twice and B three times:

P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W A1 X X B1 X X A2 X X X X B2 X X X B3

Adopting the initial definition, d(A1, B1) = 4 - 1 = 3; adopting Baeza-Yates and Ribeiro-Neto’s definition, d(A1, B1) = 2.

The former definition is frequently used since is easier to automate than the later and insures non-zero distance values. To accommodate a distance count to the later definition, subtract 1.

Some authors, like Koberstein and Ng (http://faculty.cs.byu.edu/~dennis/papers/mtcluster.ps), use Baeza-Yates and Ribeiro-Neto’s definition, but then add 1 to the counts to insure that word distances between terms are always non-zero. This is the same as computing distances based on the former definition.

Regardless of how you compute word distances, be consistent. Understand that conversion of distances to metric weights produces relative, comparative data.

In the rest of this post, I use Baeza-Yates and Ribeiro-Neto’s definition of word distances, without adding 1. You can rework the calculations by adding 1 if you wish, but you should be able to arrive to the same general observations.

An Example

It can be demonstrated that for any word pair, A and B, there is a metric weight between the terms such that

mw(A, B) = mw(B, A) = mw(A) = mw(B)

That is, metric weights are symmetric. This is to be expected; metric weights are derived from distances and, by definition, a distance metric is symmetric.

Let consider the previous example.

With respect to A:

mw(A1) = ½ + 1/10 + 1/14 = 0.671

mw(A2) = ½ + ¼ + 1/8 = 0.875

mw(A) = mw(A1) + mw(A2) = 1.546

With respect to B:

mw(B1) = ½ + ½ = 1

mw(B2) = ¼ + 1/10 = 0.35

mw(B3) = 1/8 + 1/14 = 0.196

mw(B) = mw(B1) + mw(B2) + mw(B3) = 1.546

Hence, the metric weight of the A, B pair is

mw(A, B) = mw(B, A) = mw(A) = mw(B) = 1.546

In practice, mw(A, B) is computed with respect to either A or B. In IR textbooks, this quantity is frequently termed an unnormalized correlation factor and used to populate an unnormalized correlation matrix, Cu.

Metric Clusters 1

September 6, 2007

A graduate student asked about distance metrics and metric clusters. Unfortunately, many IR textbooks either ignore the subject or provide abstract explanations, hard to grasp by first-year students.

This post might help graduate students and casual readers to grasp the concept.

Terms that appear closer together in a document are assumed more related than those that appear far apart in the same document. To capture this notion of relatedness, metric clusters have been proposed.

The distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) - p(B) |

This difference is taken for the number of words between the terms. Such distance can be computed before or after filtration–the removal of stop words.

A metric weight, mw, is therefore defined as the inverse of this distance; i.e.

mw = 1/d(A, B)

This is one way of estimating the dispersion between any two words and the distribution of these in a document.

For instance, if A appears once and B appears three times in a document, calculating the metric weight of A reduces to computing the following quantities:

mw(A) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A) = 1/| p(A1) - p(B1) | + 1/| p(A1) - p(B2) | + 1/| p(A1) - p(B3) |

The subscripts in this summation refer to the different instances of the terms. Thus, mw considers the frequency of occurrence, position, and distance between the terms.
 
When a word appears several times in a document, its metric weight is a combination of the weight associated to each instance. For instance, if A appears twice and B appears three times in a document

mw(A1) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A2) = 1/d(A2, B1) + 1/d(A2, B2) + 1/d(A2, B3)
mw(A) = mw(A1) + mw(A2)

This actually is a double summation.

Some have argued to average these.

What applies to words also applies to stems. One just needs to identify sets of words with common stems.

Once metric weights are computed, a term-term co-weight matrix is constructed and clusters are identified by applying standard clustering techniques. These are the so-called metric clusters.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

Constrained Co-Occurrence Searches

September 4, 2007

In the current issue of IRW, “Constrained Co-Occurrence Searches”, we described cc searching in its two variants: proximity searching and adjacency searching. The difference between these two way of searching was explained and illustrated with few examples.

A case was made against the indiscriminate use of the NEAR, proximity, and adjacency searching expressions. A 2005 cc searching algorithm proposed by a research group from the Office of Naval Research (ONR) was also investigated.

In addition, we compared Google’s tilde operator with cc searching. Contrary to SEO opinions, the former is not an LSI operator, but used to conduct a lookup for synonyms; the later allows users to discover on-topic, in-context terms.

In our tests we have found that performance discovery is improved when cc searches are combined with Google’s commands like allintitle: and allinurl: commands, as in

allintitle: “car*insurance”

car rental insurance
car driver insurance
car accident insurance
car motor insurance
car dealers insurance

allinurl: “car*insurance”

…car-teacher-insurance…
…/car-accident/insurance-…
…/car-breakdown-insurance…
…Motor-Car-Import-Insurance…
…car-home-insurance.net…

To expand the text window, add a sequence of asterisks like this:

“car***insurance”

car insurance and home insurance
car rental loss damage insurance
car home and business insurance
car insurance young driver insurance
car life and commercial insurance

This allows users to retrieve documents wherein search terms are separated by at least three terms. To limit the search window to exactly three terms, the ONR algorithm has been suggested. The IRW issue discusses some advantages and limitations of this algorihtm.

Possible applications include SERPs snippet optimization, keyphrase discovery, contextual targeting of terms, and advanced EF-Ratio calculations, amongst other applications. It is clear that Web Mining of answer sets is possible. On-topic analysis is here to stay.

Subscribe to IRW and stay ahead of the curve. Learn about research that normally does not reach mainstream.

LSI, According to an SEOMOZ Glossary

September 3, 2007

In A Complete Glossary of Essential SEO Jargon an SEOMOZ poster defines LSI as follows:

“LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document. SEOs refer to these same groups of words as “Long Tail”. The majority of searches consist of three or more words strung together. See also “long tail”. The significance is that it might be almost impossible to rank well for “mortgage”, but fairly easy to rank for “second mortgage to finance monster truck team”

I have been asked to comment about this.

To put the post in perspective, a jargon glossary is like a collection of expressions used within a specific group of individuals with similar interests. Normally jargon is not intended for outsiders.

Overall, the post is a nice coffee table reading. The title states this is a complete glossary of essential SEO jargon. However, it can be argued whether the glossary is complete or if some entries of the glossary are indeed essential to SEOs.

Within SEO circles, jargon connected to search engine technology often comes with two elements:

(a) oversimplification

(b) misinformation

To the poster’s credit, not all entries of the glossary have (a), (b), or both, but are actually informative. Like some of the comments these generate, some are entertaining.

Unfortunately the LSI entry comes with both, (a) and (b). Last time I revisited the post the LSI entry was ignored by commenters. I could have posted these comments there and add content to their blog, but I decided at the last minute to add content to this blog, instead.

Now let’s comment on the sustantive part.

Firstly, two different concepts are almost concatenated by the poster: LSI and the so-called “long tail”. The former is based on SVD, and the later is an expression that describes a distribution. Research on long tail-shaped distributions are found in Mandelbrot’s early work from the 50’s and 60’s, and even before Mandelbrot. Page 84 of James Gleick’s best-seller, Chaos (1987) also mentions a long tail distribution Mandelbrot came across.

Secondly, LSI is not exactly document indexing as some may loosely imply by reading the LSI entry and as many SEOs have claimed in the past. LSI is applied to already indexed documents from which terms have been extracted and already scored with a particular term weight model. Thus before applying LSI, terms and docs are identified and indexed. Now using LSI to cluster terms and documents and then reclassifying these is a different thing. Sometimes this is called reindexing and loosely referred to as “indexing” by few folks.

The initial statement of the LSI entry is simply sloppy, a hearsay, and made out of thin air: “LSI(Latent Semantic Indexing) This mouthful just means that the search engines index commonly associated groups of words in a document”.

The other problem with this statement is the informational service it provides to the casual reader, who might believe and repeat such notion of LSI across the Web. Besides, LSI is not essential to SEOs.