Archive for the ‘Data Mining’ Category

IRW 2008-04: Principal Component Analysis (PCA)

April 16, 2008

PCA

Visualizing the two principal components of a data set.

 

The current issue of IR Watch - The Newsletter should be in your inbox during the day.

It is on Principal Component Analysis and covers the followings:

Introduction
What is PCA
A Reaction Equation Approach
Computing PCA with SVD
A Practical Example
Applying SVD to the Covariance Matrix
Improving Results with SPCA
Beyond the Covariance Matrix
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Understanding Search Engines

April 3, 2008

On April 23rd I’ll be presenting a seminar lecture at University of Puerto Rico, Bayamon (http://www.uprb.edu).

Topic, time, abstract, level, and requirements follows. 

Topic: Understanding Search Engines

Time: 12 Noon

Abstract: What are search engines? How do they work? What are their main components? How do they analyze document relevancy? What it takes to rank a web page in the top 10 search results of Google, Yahoo, and other search engines? These questions will be addressed in this conference.

Level: Beginners.
Requirements: None.

If you are an uprb student, a faculty, or staff and happen to be an SEO or webmaster, don’t miss this rare opportunity to learn answers to these and similar questions.

I will also use that opportunity to promote the upcoming conference we are co-organizing at Polytechnic University:

Search Engines and Information Security

This will be held in October 3 & 4, 2008. Additional information will be provided soon at PUPR.edu and Mi Islita.com sites.

Important Terrier Upgrade and Call for Participation

March 20, 2008

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 - 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Principal Component Analysis Tutorial and Other Algos

March 18, 2008

I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is the equivalent of applying Singular Value Decomposition (SVD) on the covariance matrix of a data set. Mistake not PCA for Latent Semantic Indexing (LSI) as these are different animals.

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select a point on the dominant eigenvectors as centroid and apply K-Means.

Depending on the clusters being inspected this simplifies a lot of things. It is not a silver bullet, but just a handy tool in the researcher’s toolbox. It can be combined with or used to support other exploratory analyses, though.

PCA, SVD, Kendall’s Test, Vector Space Theory, and few other statistical algorithms have been applied for years in Chemometrics. Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels, et. al.

Here is a list of some clustering algorithms used by chemists all over the World:

Pattern Recognition Methods

Unsupervised

Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kahonen networks
Rule induction
Probabilistic methods (eg Autoclass).

Supervised

Discriminant analysis
Neural Networks (Back propagation)
SIMCA
K-Nearest Neighbour (KNN)
Probabilistic methods (eg Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back and in the limelight. There is nothing new under the Sun, sort of speak.

I’m glad I also have a background in Chemistry.

Some times it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find soooo hilarious SEO tales about how SVD and LSI algorithms work.

Back Mapping for SEOs

March 13, 2008

A reader asked me how he could apply back mapping (BM) to SEO work.

Essentially, any algorithm like BM that works at the level of collections can be scaled down and used at a page level. One just needs to decompose a page into passages (chunks of text or text windows) and treat each passage as a pseudo document.

First, convert a document into a text stream; i.e., linearize, tokenize, and filtrate the document. You should end with a linearized piece of text.

 If it turns out that the linearized document is almost identical to the original document,  you might want to go back and partition the original document every x paragraphs or sentences. If not, partition the text stream so you end with segments or text windows. You need to decide if you want to do this very x number of words or characters.

Next, treat each segment or chunk of text as a “document” and construct a term-”document” matrix “A”.

If you want to narrow this down to the level of links, construct a term-link matrix.

 In general, an algorithm that applies to a true term-document matrix A can be applied to “A”.

Welcome to scaling.

Google-IBM-NSF Partnership in Data-Intensive Computing

February 28, 2008

We received few days ago the following Call for Proposals:

Dear Colleague:

We have some exciting news to share with all of you: NSF is partnering with Google and IBM to explore data-intensive computing. Through NSF’s reach, Google and IBM are providing software and services running on a large cluster to the broad academic community to explore innovative research and education ideas in data-intensive computing. Google and IBM launched the Academic Cluster Computing Initiative [1] last October with instructional programs at six pilot universities, and the NSF will be joining this initiative as the first research-oriented pilot partner. We are calling the NSF program to provide access to these types of resources the Cluster Exploratory (CluE).

Please read more about data-intensive computing, NSF’s partnership with Google and IBM, and the upcoming CISE solicitation for CluE.

Jeannette M. Wing, Assistant Director, Computer and Information Science and Engineering Directorate Dan Atkins, Director, Office of Cyberinfrastructure

Data-Intensive Computing

Data-intensive computing is a computational paradigm in which the sheer volume of data is the dominant performance parameter. Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. For example, Google runs an average of 100,000 MapReduce jobs per day on its clusters, processing over 20 petabytes daily [2]. This scale of computing effectively supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. The architecture is extremely fault-tolerant and exhibits high degrees of reliability and availability.

Data intensive computing raises important research challenges:

O For science
  - What are the fundamental capabilities and limitations of this paradigm?
  - What new programming abstractions (including models, languages, algorithms) does this computational model suggest?

O For technology
  - How can we automatically manage the hardware and software of these systems?
  - How can we reduce their power consumption?
O For society
  - What (new) applications can best exploit this computing paradigm?

Data-intensive computing is at the forefront of ultra-large-scale commercial data processing. A July 2006 New York Times article [3] notes that “Google, Microsoft and Yahoo are spending vast sums of capital to build out their computing capabilities.” Not only is there an increasing need for advances in data-intensive computing systems software and hardware, but also an increasing demand for a trained workforce to operate and use these systems. To date, however, the academic community has had limited access to such systems.

Enter Google and IBM

On October 8, 2007, Google and IBM announced they had teamed to provide six universities access to a large-scale computing cluster together with the software and services to use it effectively [4]. After several months of discussions, the NSF will be joining this initiative and will be partnering with Google and IBM to broaden the reach of this powerful computing resource to foster more innovation than might be possible in the initial pilot.

Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do fundamental, disruptive research in data-intensive computing and to explore powerful new applications. This facility can also serve as a tool for educating the next generation of scientists and engineers. This partnership is an excellent example of an academic-industry-government relationship that is a win-win-win situation for all.

System Description

The Google-IBM cluster contains well over a thousand processors connected to terabytes of memory and hundreds of terabytes of storage with internal networking as well as a substantial external network connection. The system will be configured with open source software to include Linux and Apache Hadoop [5] - a large-scale distributed computing platform inspired by Google’s MapReduce [6] and the Google File System [7]. IBM’s Tivoli [8] software will also be used for management, monitoring and dynamic resource provisioning of the cluster.

The system will provide a powerful resource for large-scale data analysis, mining and visualization in addition to support for Internet-scale computing applications. Tutorial information describing the programming environment of the Google-IBM academic cluster available via the CluE program can be found on the Google Code for Educators website [9]. Much of this material was developed in collaboration with the University of Washington, and all of it is available under permissive licenses such as the Creative Commons Attribution License.

Upcoming Solicitation

CISE is currently developing a program solicitation that will invite researchers to submit proposals requesting allocations of the Google-IBM cluster for any new, innovative use of the system and to probe the possibilities and fundamental limits of this new computing paradigm. The emphasis of the program will be to develop new approaches and applications that are outside the typical high-performance computing applications running on today’s supercomputers.

The challenge to the academic community is three-fold: to use existing tools and to develop new programming abstractions for such a “computer” to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications. This resource will also provide an opportunity to teach students how to build, use and manage data-intensive computing systems systems that are already being used widely in industry, but are largely unavailable to students and faculty today.

Please look for the new solicitation which will be posted on the CISE web site. CISE looks forward to your bold, creative proposals for CluE!

[1] Official Google Blog: http://googleblog.blogspot.com/2007/10/let-thousand-servers-bloom.html

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. of the ACM, 51(1), January 2008, pp. 107-113.

[3] http://www.nytimes.com/2006/06/14/technology/14search.html?_r=1&pagewanted=2&oref=slogin

[4] See http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html or http://www-03.ibm.com/press/us/en/pressrelease/22414.wss for the text of the press release.

[5] http://hadoop.apache.org/

[6] http://labs.google.com/papers/mapreduce.html

[7] http://labs.google.com/papers/gfs.html

[8] http://www.ibm.com/software/tivoli/

[9] http://code.google.com/edu/content/parallel.html

Predating Link Models and PageRank

February 14, 2008

Did you know that before PageRank and current link models, and even before CLEVER and HITS, there were some actually working on hyper search engines and hyper text?

In

The Quest for Correct Information on the Web: Hyper Search Engines

Massimo Marchiori described just that: components and elements now in use by many current link models, including PageRank. It is clear that there is nothing new under the Sun, but is nice to have a flashback and read about search engines that no longer are among us.

Marchiori’s abstract follows:

Finding the right information on the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW contains is growing at an incredible rate. In this paper, we present a novel method for extracting from a Web object its “hyper” informative content, in contrast with current search engines, which only deal with the “textual” informative content. This method is not only valuable per se, but it is shown to be able to considerably increase the precision of current search engines, Moreover, it integrates smoothly with existing search engine technology since it can be implemented on top of every search engine, acting as a post-processor, thus automatically transforming a search engine into its corresponding “hyper” version. We also show how, interestingly, the hyper information can be usefully employed to face the search engine persuasion problem.

Some highlights from the WWW6 paper follows (emphasis added). Even when some statements in it are no longer valid and other are, I prefer readers to dissect these in the light of the current state of the art:

The problem is that visibility says nothing about the informative content of a Web object. The misleading assumption is that if a Web object has high visibility, then this is a sign of importance and consideration, and so de facto its informative content must be more valuable than other Web objects that have fewer links pointing to them….In a nutshell, visibility is likely to be a synonym of popularity, which is something completely different than quality, and thus using it to gain higher score from search engines is a rather poor choice.

What is really missing

As said, what is really missing in the evaluation of the score of a Web object is its hyper part, that is the dynamic information content which is provided by hyperlinks (henceforth, simply links).

We call this kind of information hyper information: this information should be added to the textual information of the Web object, giving its (overall) information to the World Wide Web. We indicate these three kinds of information as HYPERINFO, TEXTINFO and INFORMATION, respectively. So, for every Web object A we have that INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A) (note that in general these information functions depend on a specific query, that is to say they measure the informative content of a Web object with respect to a certain query: in the sequel; we will always consider such a query to be understood).

The presence of links in a Web object clearly augments the informative content with the information contained in the pointed Web objects (although we have to establish to what extent).

Recursively, links present in the pointed Web objects further contribute, and so on. Thus, in principle, the analysis of the informative content of a Web object A should involve all the Web objects that are reachable from it via hyperlinks (i.e., “navigating” in the World Wide Web).

Regarding SEP (now called SEO)

A big problem that search engines have to face is the phenomenon of so-called sep (search engines persuasion). Indeed, search engines have become so important in the advertising market that it has become essential for companies to have their pages listed in top positions of search engines, in order to get a significant Web-based promotion. Starting with the already pioneering work of Rhodes ([5]), this phenomenon is now boosting at such a rate as to have provoked serious problems for search engines, and has revolutioned the Web design companies, which are now specifically asked not only to design good Web sites, but also to make them rank high in search engines. A vast number of new companies was born just to make customer Web pages as visible as possible. More and more companies, like Exploit, Allwilk, Northern Webs, Ryley & Associates, etc., explicitly study ways to rank a page highly in search engines. OpenText arrived even to sell “preferred listings;” i.e., assuring that a particular entry will stay in the top ten for some time, a policy that has provoked some controversies (cf. [9]).

Interesting pictures

Regarding information depth:

Regarding the cost of non determinism:

Regarding back links:

12 years later, no much has changed.

Back Mapping for the Masses

January 31, 2008

In a recent tutorial on association and scalar clusters, http://www.miislita.com/information-retrieval-tutorial/association-scalar-clusters-tutorial-1.pdf, I introduced a back mapping technique wherein once features conforming clusters are extracted from objects, the clusters are mapped back to objects.

The technique works well with clusters of terms extracted from documents. The reverse case is also possible: given a cluster of documents extracted from terms, it is possible to map these back to terms.

What do we gain from such two-way manipulations? A lot. Consider the first scenario: Mapping term clusters back to documents; a tutorial on the second scenario will be available soon.

Back Mapping Term Clusters to Documents

A document is just a distribution over topics while topics are distributions over words. Thus, across a collection of documents there are topics hidden (latent) and waiting to be uncovered. Back mapping allows us to recover these, precisely.

Combinations of terms that do not amount to topics across the collection are discovered as well. Reasonably, one would expect these to be least relevant across other documents than those distributed across the collection. In addition, one would expect documents traced back to clusters to be the most relevant documents, from the collection and with respect to the topics.

The implications of this for search engine optimization and keyword bidding are quite obvious. Implementation is straightforward. To learn more about it, read Part 1 of the tutorial.

Association & Scalar Clusters Tutorial - Part 1

January 22, 2008

I am writing a tutorial series on Cluster Analysis. It is my pleasure to announce that the
Association and Scalar Clusters Tutorial - Part 1: Back Mapping Term Clusters to Documents was uploaded few days ago.

Online publication was announced in advanced to subscribers of the IR Watch - The Newsletter, so they already have an edge over regular readers and visitors of Mi Islita

Abstract follows:

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A “reaction” equation approach is used to break down the classification problem to a sequence of steps. From the initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified. A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters. Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation, and document classification are discussed.

During last night lecture (Web Mining Course), I applied the back mapping technique to scalar clusters generated from LSI. The technique provides additional information and reasons as to how and why documents score as observed after implementing SVD. A clear connection with Fuzzy Set Theory was made.

Students taking the Web Mining Course will find this tutorial quite handy.

A Novel Perspective of Similarity

September 18, 2007

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Random Notes and LauraMansfield

September 12, 2007

These are some late random notes. Sorry for the delay.

1. I am putting together a research project for a graduate student. The topic is quite interesting: homeland security. While researching the topic I came across LauraMansfield.com site. Mansfield’s site is a goldmine of information, especially for those interested in co-occurrence and word association research applied to the terrorist knowledge domain.

2. I am reviewing a graduate thesis in which logistic regression is used for data mining medical claims. Quite interesting the thesis topic. The manuscript needs some rework, though.

3. I am reading bits and pieces of an old paper on the non-transitivity nature of Jaccard’s Coefficient and a proposed indirect similarity measure.

Metric Clusters 2

September 7, 2007

Yesterday I provided a basic explanation of metric weights and metric clusters. This time I want to expand on this topic and provide some how-to calculations.

On Counting Word Distances

As mentioned, the distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) - p(B) |

Some IR authors use a different definition, though. For instance, Baeza-Yates and Ribeiro-Neto in Modern Information Retrieval, page 126, defines the distance between any two words as the number of words between these. The difference between these two definitions reduces to 1 count in the distance scale.

For example, the following position (P) vs. word (W) table depicts a 16-word text stream that mentions A twice and B three times:

P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W A1 X X B1 X X A2 X X X X B2 X X X B3

Adopting the initial definition, d(A1, B1) = 4 - 1 = 3; adopting Baeza-Yates and Ribeiro-Neto’s definition, d(A1, B1) = 2.

The former definition is frequently used since is easier to automate than the later and insures non-zero distance values. To accommodate a distance count to the later definition, subtract 1.

Some authors, like Koberstein and Ng (http://faculty.cs.byu.edu/~dennis/papers/mtcluster.ps), use Baeza-Yates and Ribeiro-Neto’s definition, but then add 1 to the counts to insure that word distances between terms are always non-zero. This is the same as computing distances based on the former definition.

Regardless of how you compute word distances, be consistent. Understand that conversion of distances to metric weights produces relative, comparative data.

In the rest of this post, I use Baeza-Yates and Ribeiro-Neto’s definition of word distances, without adding 1. You can rework the calculations by adding 1 if you wish, but you should be able to arrive to the same general observations.

An Example

It can be demonstrated that for any word pair, A and B, there is a metric weight between the terms such that

mw(A, B) = mw(B, A) = mw(A) = mw(B)

That is, metric weights are symmetric. This is to be expected; metric weights are derived from distances and, by definition, a distance metric is symmetric.

Let consider the previous example.

With respect to A:

mw(A1) = ½ + 1/10 + 1/14 = 0.671

mw(A2) = ½ + ¼ + 1/8 = 0.875

mw(A) = mw(A1) + mw(A2) = 1.546

With respect to B:

mw(B1) = ½ + ½ = 1

mw(B2) = ¼ + 1/10 = 0.35

mw(B3) = 1/8 + 1/14 = 0.196

mw(B) = mw(B1) + mw(B2) + mw(B3) = 1.546

Hence, the metric weight of the A, B pair is

mw(A, B) = mw(B, A) = mw(A) = mw(B) = 1.546

In practice, mw(A, B) is computed with respect to either A or B. In IR textbooks, this quantity is frequently termed an unnormalized correlation factor and used to populate an unnormalized correlation matrix, Cu.

Metric Clusters 1

September 6, 2007

A graduate student asked about distance metrics and metric clusters. Unfortunately, many IR textbooks either ignore the subject or provide abstract explanations, hard to grasp by first-year students.

This post might help graduate students and casual readers to grasp the concept.

Terms that appear closer together in a document are assumed more related than those that appear far apart in the same document. To capture this notion of relatedness, metric clusters have been proposed.

The distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) - p(B) |

This difference is taken for the number of words between the terms. Such distance can be computed before or after filtration–the removal of stop words.

A metric weight, mw, is therefore defined as the inverse of this distance; i.e.

mw = 1/d(A, B)

This is one way of estimating the dispersion between any two words and the distribution of these in a document.

For instance, if A appears once and B appears three times in a document, calculating the metric weight of A reduces to computing the following quantities:

mw(A) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A) = 1/| p(A1) - p(B1) | + 1/| p(A1) - p(B2) | + 1/| p(A1) - p(B3) |

The subscripts in this summation refer to the different instances of the terms. Thus, mw considers the frequency of occurrence, position, and distance between the terms.
 
When a word appears several times in a document, its metric weight is a combination of the weight associated to each instance. For instance, if A appears twice and B appears three times in a document

mw(A1) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A2) = 1/d(A2, B1) + 1/d(A2, B2) + 1/d(A2, B3)
mw(A) = mw(A1) + mw(A2)

This actually is a double summation.

Some have argued to average these.

What applies to words also applies to stems. One just needs to identify sets of words with common stems.

Once metric weights are computed, a term-term co-weight matrix is constructed and clusters are identified by applying standard clustering techniques. These are the so-called metric clusters.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

Constrained Co-Occurrence Searches

September 4, 2007

In the current issue of IRW, “Constrained Co-Occurrence Searches”, we described cc searching in its two variants: proximity searching and adjacency searching. The difference between these two way of searching was explained and illustrated with few examples.

A case was made against the indiscriminate use of the NEAR, proximity, and adjacency searching expressions. A 2005 cc searching algorithm proposed by a research group from the Office of Naval Research (ONR) was also investigated.

In addition, we compared Google’s tilde operator with cc searching. Contrary to SEO opinions, the former is not an LSI operator, but used to conduct a lookup for synonyms; the later allows users to discover on-topic, in-context terms.

In our tests we have found that performance discovery is improved when cc searches are combined with Google’s commands like allintitle: and allinurl: commands, as in

allintitle: “car*insurance”

car rental insurance
car driver insurance
car accident insurance
car motor insurance
car dealers insurance

allinurl: “car*insurance”

…car-teacher-insurance…
…/car-accident/insurance-…
…/car-breakdown-insurance…
…Motor-Car-Import-Insurance…
…car-home-insurance.net…

To expand the text window, add a sequence of asterisks like this:

“car***insurance”

car insurance and home insurance
car rental loss damage insurance
car home and business insurance
car insurance young driver insurance
car life and commercial insurance

This allows users to retrieve documents wherein search terms are separated by at least three terms. To limit the search window to exactly three terms, the ONR algorithm has been suggested. The IRW issue discusses some advantages and limitations of this algorihtm.

Possible applications include SERPs snippet optimization, keyphrase discovery, contextual targeting of terms, and advanced EF-Ratio calculations, amongst other applications. It is clear that Web Mining of answer sets is possible. On-topic analysis is here to stay.

Subscribe to IRW and stay ahead of the curve. Learn about research that normally does not reach mainstream.

Identifying Hidden Sequences

August 27, 2007

In the Row-Pruning Algorithm Tutorial post I introduced an algorithm for identifying terms sequences hidden in collections. The tutorial is available in Mi Islita.com site. Several applications were also presented.

In a nutshell, given a set of terms T={t1, t2..tm} extracted from a document collection, C={d1, d2..dn} the goal is to combine each of these terms with all other terms of T and then find the family of term sequences more frequently present in C. Such families are of the form

“k1″
“k1 + k2″
“K1 + k2 +..km”

The double quotes indicate that these are term sequences.

The algorithm then is aimed at establishing the identity of the k’s using T and C. Once k1 is identified, the algorithm reduces to conducting a binary partition on the corresponding largest answer sets. Normally T consists of few terms (m < threshold value) and as such is a subset of the index of terms. Whenever possible these should be on-topic terms and preselected by a suitable algorithm or heuristic criterion.

Confidence-based pruning is used, but one could as well use other measures like co-occurrence, support, or similar measures. One does not need to restrict the algorithm to just terms.

I see many applications to this, including visitations:

“Users clicking on X and then on Y tend to click next on Z

“Customers visiting store X and then store Y tend to visit next store Z.”

and so forth…

Data Mining and Reports on Terrorism

August 22, 2007

I’m researching the topic of Data Mining (KDD) and Terrorism Information Awareness (TIA) for a graduate course and came across a great old resource:

Data Mining

It is oldie, but the important part are the references.

It may interest IRs conducting similar research.

Here is another great resource:

Data Mining and Homeland Security

Levenshtein Edit-Distance Based Tool

August 20, 2007

As announced, the Levenshtein Edit-Distance Based Tool is now available at Mi Islita.com site.

The tool is meant to be for demonstration purposes; e.g., as in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!

Search Smart with WCC’s ELISE

August 9, 2007

The list of subscribers to IRWatch is growing at fast pace. One of our recent subscribers is a developer at WCC, makers of ELISE Smart Search & Match. This seems to be a quite interesting technology. I highly recommend readers to visit their site http://wcc-group.com/

(more…)

Row-Pruning Algorithm Tutorial

August 8, 2007

In IRW-2007-8 I introduced a row-pruning algorithm (RPA) and its unconditional version (URPA).

Let C be a collection of documents and let T = {t1, t2… tm} be m unique terms extracted from C. Assume that term i is combined with all other terms such that starting with term i a family of term sequences is obtained. Assume that these term sequences are hidden (latent) in C. The purpose of RPA and URPA is to identify the composition of these term sequences and to find those that occur more frequently in C.

(more…)

THESUS: Semantic Subsets Based on WWW Links

August 2, 2007

I came across a 2003 paper on mining WWW text links, in which researchers introduced THESUS. Really interesting piece.

The abstract of THESUS: ORGANIZING WEB DOCUMENT COLLECTIONS BASED ON LINK SEMANTICS follows:

(more…)

DARPA Agent Markup Language (DAML)

August 1, 2007

DARPA Agent Markup language (DAML) site has tons of tools and resources CS/IR graduate students and SEM/SEO practitioners with some IR knowledge can use for data mining purposes. These can help with nice experiments, from ontology-based keyword discovery to the construction of crawlers (or at least really learn how these actually work).

Here is a list of resources.

(more…)

Snake Preview of IR Watch

July 31, 2007

The current issue of IRW-2007-08 should be out within the next few days. The Association Rules Part 2 discuses how association rule mining techniques from market basket can be applied to Web Mining.

(more…)

Association Rule Mining Thesaurus

July 30, 2007

Here is a 2004 paper on Association Thesaurus Construction for Interactive Query Expansion based on Association Rule Mining

The article discusses basic association rule mining concepts like support, confidence, and pruning as we described in Association Rules Part 1 (July issue of IR Watch - The Newsletter). BTW, read Part 2 in the August issue.

(more…)

Conceptual Similarity and Conceptual Mapping

July 26, 2007

I came across a relatively old paper authored by researchers at Indiana University and Institute for Human and Machine Cognition: Assessing Conceptual Similarity to Support Concept Mapping

Indeed, it seems like a great concept.

(more…)

IRW 2007-7: Association Rules - Part 1

July 11, 2007

Association Rules

Association Rules based on co-occurrence can be used to address relationships like: Customers buying X tend to buy Y. These can be used to support business-related services such as marketing promotions, inventory, and CRM programs. Learn how by reading the July 2007 issue of IR Watch - The Newsletter.

(more…)

Still Standing Up

July 3, 2007

Finally we completed some backend changes to http://www.miislita.com.  Sorry for the inconvenient. These changes were necessary, but have the effect of delaying the publication of the July issue of IR Watch.

(more…)

Mining End User Locations

June 19, 2007

Ever wonder how to conduct data mining from end user locations? This is easier to do than you think.

At Mi Islita we have been testing for a while a redirection mechanism that collects directory and file path information from end users. Our goals are:

(a) to illustrate that on the Web privacy is an illusion.
(b) to conduct data mining from user’s behaviors.

(more…)

Data Mining for All Disciplines

June 18, 2007

It looks like I’ll be teaching this Fall a graduate course on Data Mining (DM) for CS and Business students. I often find myself explaining across disciplines that DM is the Discipline of Knowledge (DK), that there is nothing unusual for someone with a background in chemistry, biology, or business to cross the line of university departments and reach computer engineering courses, looking for data mining or knowledge discovery in data bases (KDD). This might explain why search engine companies hire PhDs from all disciplines.

(more…)

What is Data Mining?

May 31, 2007

What is Data Mining? Good question.

After a great one week vacation away from the blog, it is good to be back. During my vacation I was asked to explain the difference between data mining and information retrieval; so this post goes.

Here is a standard definition I wrote for a graduate course syllabus to be taught next fall at a local university:

(more…)