Archive for the ‘Machine Learning’ Category

How Search Engines Do Not Work

April 17, 2008

If you are taking the Search Engines Architecture grad course, by now you should have learned what are the main components of a search engine and how to build a web crawler and a parser. You should know how to build an inverted index, how to use this to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work are either misconception, myths, or just untrue folklore. Eg., While some have an incorrect idea on how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply non sense.

To illustrate visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first one is about an SEO discussing “how search engines work” and use the Vector Space Model. The second is about the State of Washington suing a marketing company for misselling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

Experiment in Parsing Techniques

April 14, 2008

Here are some useful notes for those taking the Search Engines Architecture grad course.

In Lab 4: Experiment in Parsing Techniques, you are building a Query Normalizer, a Document Parser, and a URL Normalizer.

In section 1, be sure the tool removes all kind of HTML instructions from the search interface.

In section 2, the document parser should does document linearization, tokenization, and filtration, but no stemming. The final output should be a list of tuples consisting of unique terms and occurrences.

Regarding section 3.1, Building a URL Normalizer, I’m adding new restrictions to this part.

This part challenges you to use regular expressions, only. No arrays, no scripting loops, no conditionals (if-then), no lookup libraries, no DNS lookups, just regexps. It should work for all valid urls available on the Web. It can be done.

The parser for the URL normalizer tool should remove from a URL all kind of:

protocols (http, https, ftp, etc)
www prefixes
top-level domains (TLD)
ports (:80, :8000, etc)
file extensions (.html, .php, etc)
parameters (name=value pairs)
named anchors or fragments (#)
URL-forbidden characters, international characters, or script lines

Be sure you understand the difference between top-level domains and subdomains. For example, .com and .pr are TLDs, but .com.pr defines a subdomain.

Let say we have the http://www.xxx.com.pr site. If we remove http://www. and .pr from this we end with xxx.com, but .com is not the TLD in this case. The TLD still is .pr, though. Redirection mechanisms are “another twenty bucks” (”otros veinte pesos”), which might confuse the concept.

You should also know that a TLD can be a generic top-level-domain (gTLD), a country-code top-level domain (ccTLD), international TLDs (iTLDs), or US legacy TLDs (usTLDs). When combined, these define a subdomain. For example,

.edu.pr is a subdomain with .pr as TLD
.co.uk is a subdomain with .uk as TLD

That is, for a url on the Web, only one string sequence works as and defines the top-level domain.

Thus, for:

http://www.google.com.pr
https://www.google.com/adsense/login/en_US/?gsessionid=wfx3oxHhgDU
ftp://www.gobierno.pr/index.html
http://search.music.yahoo.com/search/?m=all&p=britney
http://www.telegraph.co.uk
http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s
http://www.lis.ntu.edu.tw/~mctang/index.htm

your tool should return:

google.com
google
gobierno
search.music.yahoo
telegraph.co
video.google.co
lis.ntu.edu

Recommended Lecture Material:

http://en.wikipedia.org/wiki/.pr
http://www.icann.org/meetings/saopaulo/presentation-dns-conrad-07dec06.pdf
http://www.mattcutts.com/blog/seo-glossary-url-definitions/  

Important Terrier Upgrade and Call for Participation

March 20, 2008

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 - 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Principal Component Analysis Tutorial and Other Algos

March 18, 2008

I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is the equivalent of applying Singular Value Decomposition (SVD) on the covariance matrix of a data set. Mistake not PCA for Latent Semantic Indexing (LSI) as these are different animals.

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select a point on the dominant eigenvectors as centroid and apply K-Means.

Depending on the clusters being inspected this simplifies a lot of things. It is not a silver bullet, but just a handy tool in the researcher’s toolbox. It can be combined with or used to support other exploratory analyses, though.

PCA, SVD, Kendall’s Test, Vector Space Theory, and few other statistical algorithms have been applied for years in Chemometrics. Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels, et. al.

Here is a list of some clustering algorithms used by chemists all over the World:

Pattern Recognition Methods

Unsupervised

Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kahonen networks
Rule induction
Probabilistic methods (eg Autoclass).

Supervised

Discriminant analysis
Neural Networks (Back propagation)
SIMCA
K-Nearest Neighbour (KNN)
Probabilistic methods (eg Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back and in the limelight. There is nothing new under the Sun, sort of speak.

I’m glad I also have a background in Chemistry.

Some times it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find soooo hilarious SEO tales about how SVD and LSI algorithms work.

Back Mapping for SEOs

March 13, 2008

A reader asked me how he could apply back mapping (BM) to SEO work.

Essentially, any algorithm like BM that works at the level of collections can be scaled down and used at a page level. One just needs to decompose a page into passages (chunks of text or text windows) and treat each passage as a pseudo document.

First, convert a document into a text stream; i.e., linearize, tokenize, and filtrate the document. You should end with a linearized piece of text.

 If it turns out that the linearized document is almost identical to the original document,  you might want to go back and partition the original document every x paragraphs or sentences. If not, partition the text stream so you end with segments or text windows. You need to decide if you want to do this very x number of words or characters.

Next, treat each segment or chunk of text as a “document” and construct a term-”document” matrix “A”.

If you want to narrow this down to the level of links, construct a term-link matrix.

 In general, an algorithm that applies to a true term-document matrix A can be applied to “A”.

Welcome to scaling.

Google-IBM-NSF Partnership in Data-Intensive Computing

February 28, 2008

We received few days ago the following Call for Proposals:

Dear Colleague:

We have some exciting news to share with all of you: NSF is partnering with Google and IBM to explore data-intensive computing. Through NSF’s reach, Google and IBM are providing software and services running on a large cluster to the broad academic community to explore innovative research and education ideas in data-intensive computing. Google and IBM launched the Academic Cluster Computing Initiative [1] last October with instructional programs at six pilot universities, and the NSF will be joining this initiative as the first research-oriented pilot partner. We are calling the NSF program to provide access to these types of resources the Cluster Exploratory (CluE).

Please read more about data-intensive computing, NSF’s partnership with Google and IBM, and the upcoming CISE solicitation for CluE.

Jeannette M. Wing, Assistant Director, Computer and Information Science and Engineering Directorate Dan Atkins, Director, Office of Cyberinfrastructure

Data-Intensive Computing

Data-intensive computing is a computational paradigm in which the sheer volume of data is the dominant performance parameter. Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. For example, Google runs an average of 100,000 MapReduce jobs per day on its clusters, processing over 20 petabytes daily [2]. This scale of computing effectively supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. The architecture is extremely fault-tolerant and exhibits high degrees of reliability and availability.

Data intensive computing raises important research challenges:

O For science
  - What are the fundamental capabilities and limitations of this paradigm?
  - What new programming abstractions (including models, languages, algorithms) does this computational model suggest?

O For technology
  - How can we automatically manage the hardware and software of these systems?
  - How can we reduce their power consumption?
O For society
  - What (new) applications can best exploit this computing paradigm?

Data-intensive computing is at the forefront of ultra-large-scale commercial data processing. A July 2006 New York Times article [3] notes that “Google, Microsoft and Yahoo are spending vast sums of capital to build out their computing capabilities.” Not only is there an increasing need for advances in data-intensive computing systems software and hardware, but also an increasing demand for a trained workforce to operate and use these systems. To date, however, the academic community has had limited access to such systems.

Enter Google and IBM

On October 8, 2007, Google and IBM announced they had teamed to provide six universities access to a large-scale computing cluster together with the software and services to use it effectively [4]. After several months of discussions, the NSF will be joining this initiative and will be partnering with Google and IBM to broaden the reach of this powerful computing resource to foster more innovation than might be possible in the initial pilot.

Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do fundamental, disruptive research in data-intensive computing and to explore powerful new applications. This facility can also serve as a tool for educating the next generation of scientists and engineers. This partnership is an excellent example of an academic-industry-government relationship that is a win-win-win situation for all.

System Description

The Google-IBM cluster contains well over a thousand processors connected to terabytes of memory and hundreds of terabytes of storage with internal networking as well as a substantial external network connection. The system will be configured with open source software to include Linux and Apache Hadoop [5] - a large-scale distributed computing platform inspired by Google’s MapReduce [6] and the Google File System [7]. IBM’s Tivoli [8] software will also be used for management, monitoring and dynamic resource provisioning of the cluster.

The system will provide a powerful resource for large-scale data analysis, mining and visualization in addition to support for Internet-scale computing applications. Tutorial information describing the programming environment of the Google-IBM academic cluster available via the CluE program can be found on the Google Code for Educators website [9]. Much of this material was developed in collaboration with the University of Washington, and all of it is available under permissive licenses such as the Creative Commons Attribution License.

Upcoming Solicitation

CISE is currently developing a program solicitation that will invite researchers to submit proposals requesting allocations of the Google-IBM cluster for any new, innovative use of the system and to probe the possibilities and fundamental limits of this new computing paradigm. The emphasis of the program will be to develop new approaches and applications that are outside the typical high-performance computing applications running on today’s supercomputers.

The challenge to the academic community is three-fold: to use existing tools and to develop new programming abstractions for such a “computer” to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications. This resource will also provide an opportunity to teach students how to build, use and manage data-intensive computing systems systems that are already being used widely in industry, but are largely unavailable to students and faculty today.

Please look for the new solicitation which will be posted on the CISE web site. CISE looks forward to your bold, creative proposals for CluE!

[1] Official Google Blog: http://googleblog.blogspot.com/2007/10/let-thousand-servers-bloom.html

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. of the ACM, 51(1), January 2008, pp. 107-113.

[3] http://www.nytimes.com/2006/06/14/technology/14search.html?_r=1&pagewanted=2&oref=slogin

[4] See http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html or http://www-03.ibm.com/press/us/en/pressrelease/22414.wss for the text of the press release.

[5] http://hadoop.apache.org/

[6] http://labs.google.com/papers/mapreduce.html

[7] http://labs.google.com/papers/gfs.html

[8] http://www.ibm.com/software/tivoli/

[9] http://code.google.com/edu/content/parallel.html

IRW-2008-2 Snake Preview

February 18, 2008

Scalar Clusters

Masking Effects in Similarity Matrices:
Topics with strong similarities can mask other topics,
making these invisible to the clustering process.

Here is the snake preview of the current issue of IR Watch - The Newsletter. Due to academic duties, it is running late. It should be in subscribers’s inboxes during the day. The topic is Scalar Clusters and Back Mapping and is based on lecture material covered in the Web Mining course. The following topics are covered:

Introduction
On Association and Scalar Clusters
The Neighborhood Similarity Matrix, Mn
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Term Clusters to Documents
Masking Effects in Similarity Matrices
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Not a subscribers? What are you waiting for?

Back Mapping for the Masses

January 31, 2008

In a recent tutorial on association and scalar clusters, http://www.miislita.com/information-retrieval-tutorial/association-scalar-clusters-tutorial-1.pdf, I introduced a back mapping technique wherein once features conforming clusters are extracted from objects, the clusters are mapped back to objects.

The technique works well with clusters of terms extracted from documents. The reverse case is also possible: given a cluster of documents extracted from terms, it is possible to map these back to terms.

What do we gain from such two-way manipulations? A lot. Consider the first scenario: Mapping term clusters back to documents; a tutorial on the second scenario will be available soon.

Back Mapping Term Clusters to Documents

A document is just a distribution over topics while topics are distributions over words. Thus, across a collection of documents there are topics hidden (latent) and waiting to be uncovered. Back mapping allows us to recover these, precisely.

Combinations of terms that do not amount to topics across the collection are discovered as well. Reasonably, one would expect these to be least relevant across other documents than those distributed across the collection. In addition, one would expect documents traced back to clusters to be the most relevant documents, from the collection and with respect to the topics.

The implications of this for search engine optimization and keyword bidding are quite obvious. Implementation is straightforward. To learn more about it, read Part 1 of the tutorial.

Kendall’s Test and PageRank

December 26, 2007

For years marketers have put PageRank-based models into question, via “sandbox” theories and similar tales.

This old paper might help to address the issue on the biased/unbiased nature of these models: Paradoxical Effects in PageRank Incremental Computations.

What interest me the most about the paper is how the authors put into use Kendall’s t theory. Written by Paolo Boldi, Massimo Santini, and Sebastiano Vigna the abstract states:

“Abstract. Deciding which kind of visiting strategy accumulates high-quality pages morequickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectivenessfor crawl strategies: whether the graph obtained after a partial visit is in some senserepresentative of the underlying web graph as far as the computation of PageRank isconcerned. More precisely, we are interested in determining how rapidly the computationof PageRank over the visited subgraph yields node orders that agree with theones computed in the complete graph; orders are compared using Kendall’s t .We describe a number of large-scale experiments that show the following paradoxicaleffect: visits that gather PageRank more quickly (e.g., highest-quality first) arealso those that tend to miscalculate PageRank. Finally, we perform the same kind ofexperimental analysis on some synthetic random graphs, generated using well-knownweb-graph models: the results are almost opposite to those obtained on real webgraphs.”

The authors also state:

“The most classical visit strategies are the following:

Depth-first order: the crawler chooses the next page as the last that wasadded to the frontier; in other words, the visit proceeds in a LIFO fashion.
Random order: the crawler chooses randomly the next page from the frontier.
Breadth-first order: the crawler chooses the next page as the first that wasadded to the frontier; in other words, the visit proceeds in a FIFO fashion.
Omniscient order (or quality-first order): the crawler uses a queue prioritisedby PageRank values [Cho et al. 98]; in other words, it chooses to visitthe page with highest quality among the ones in the frontier. This visit ismeaningless unless a previous PageRank computation of the entire graphhas been performed before the visit, but it is useful for comparisons. A variant of this strategy may also be adopted if we have already performeda crawl and thus we have the (old) PageRank values of (at least some ofthe) pages.”

“Both common sense and experiments (see, in particular, [Boldi et al. 05])suggest that the visits listed above accumulate PageRank in an increasinglyquicker way. This is to be expected, as the omniscient visit will point immediatelyto pages of high quality. The fact that the breadth-first visit yields high-qualitypages was noted in [Najork and Wiener 01].”

“There is, however, a different and also quite relevant problem that has beenpreviously overlooked in the literature: if we assume that the crawler has noprevious knowledge of the web region it has to crawl, it is natural that it will try to detect page quality during the crawl itself, by computing PageRank onthe region it has just seen. We would like to know whether a crawler doing sowill obtain reasonable results or not.”

A Case-Based Experience Sharing Search Engine

November 28, 2007

Mobyen Ahmed, Erik Olsson, Peter Funk, Ning Xiong from Department of Computer Science and Electronics, Malardalen University, Sweden have published the paper “Efficient Condition Monitoring and Diagnosis Using a Case-Based Experiene Sharing System”
(http://www.mrtc.mdh.se/publications/1269.pdf)  at a workshop at the Swedish Artificial Intelligence Society, p 70-80, in May, 2007.

This is a case-based reasoning search engine system that could be used in an industrial environment. That’s quite interesting.

In their paper, the authors kindly referenced me. That’s an honor. It appears that more CS researchers are happy with my Cosine Similarity Tutorial (http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html) and with Mi Islita’s content (http://www.miislita.com/searchito/educational-links.html). If they are happy, I’m happy.

Terrier IR 1.1.1 Update

November 9, 2007

Craig MacDonald, from the Terrier project at University of Glasgow sent me this update few days ago.

“Terrier, IR Platform v 1.1.1 - 24/10/2007. http://ir.dcs.gla.ac.uk/terrier/

Terrier, the open source IR platform from the University of Glasgow has been updated to version 1.1.1.

This is a minor update, which contains mostly bug fixes. Some minor code enhancements and a test harness are also included. Moreover, the Snowball stemmers were added to boost support for languages other than English.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL) and is available from the Terrier website: http://ir.dcs.gla.ac.uk/terrier

Back to Business

November 8, 2007

Back to business. IRW will soon be out. This month issue covers the popular K-Means Algorithm and its variants. A back issue on Genetic Algorithms will also be sent. This issue was supposed to be out last month. Sorry it took that long. As most of you know, I was away from the internet, attending other duties.

My inbox is full, as expected. I will reply to all of your emails, … eventually. Stay with me.

Well, next week begins the Winter semester at Polytechnic University. I’ll be teaching the graduate course:

Web Mining: A First Course in Web Mining, Search Engines, and Business Intelligence.

Description: This is a hands-on, one-full semester course on Web Mining, search engines, and business intelligence. Students will learn by doing: (a) how search engines index and rank web documents, (b) how to conduct business intelligence from online resources, and (c) how to apply Web Mining strategies and algorithms in their research or workplace.Target: Students in Business, Engineering, and Computer Sciences and from other disciplines are encouraged to register for this special course. Requirements: Calculus II or Permission from advisor or department. Grading: Take-home work and a final exam. Topics: The following topics will be covered, not necessarily in this order: 

  • Document Indexing: Indexing of Web sites and text operations used by search engines including document linearization, tokenization, stop word filtration, stemming, and parsing.
  • Search Engine Optimization: Mining search engine relevance algorithms for ranking high Web pages.
  • Intelligence Searching: Covers undocumented (smart) searches in Google and other search engines; includes Hacking and Penetration through customized searches.
  • Keyword Research and Clustering: Discovery of word patterns and keywords for branding and marketing through Association, Scalar, and Metric Clusters.
  • Term Matching Algorithms: Vector Space Models used by search engines. Scoring of local, global, and entropy term weights.
  • Concept Matching Algorithms: Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) models for clustering and ranking.
  • Link Analysis Models: Google’s PageRank, Hubs & Authorities, and other link-based models.
  • Spam Intelligence: Tools and techniques for spamming search engines and web sites. Includes techniques based on scripts, cloacking, keyword spam techniques, link-bombs, email marketing, viral marketing, Web 2.0, and Web 3.0.
  • Introduction to Business Dashboards (BDs): Overview of dashboard technology, including open source, and customized add-on components.
  • Special Topics: On-Topic Analysis, Co-Occurrence Theory, and Latent Graphs. This is my own area of research.

 Textbook: Web Mining evolves on a daily basis; thus, there is no official textbook. However, the following reference books are recommended for research. Additional references will be provided in class. 

  1. Modern Information Retrieval (Baeza-Yates and Ribeiro-Neto; Addison Wesley).
  2. Information Retrieval – Algorithms and Heuristics (Grossman and Frieder; Springer).

This is going to be fun! 

A Novel Perspective of Similarity

September 18, 2007

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Random Notes and LauraMansfield

September 12, 2007

These are some late random notes. Sorry for the delay.

1. I am putting together a research project for a graduate student. The topic is quite interesting: homeland security. While researching the topic I came across LauraMansfield.com site. Mansfield’s site is a goldmine of information, especially for those interested in co-occurrence and word association research applied to the terrorist knowledge domain.

2. I am reviewing a graduate thesis in which logistic regression is used for data mining medical claims. Quite interesting the thesis topic. The manuscript needs some rework, though.

3. I am reading bits and pieces of an old paper on the non-transitivity nature of Jaccard’s Coefficient and a proposed indirect similarity measure.

Brute Force Algorithm

September 10, 2007

The brute force algorithm (BF) consists in evaluating a text stream and determining at all positions between 0 and n-m, whether an occurrence of a pattern starts there or not. After each attempt, it shifts the window pattern by exactly one position to the right. This requires of a constant extra space. Comparisons can be done in any order.

This is quite a brute force approach, hence its name. Two loops are required for its implementation. A C code is given below.

void BF(char *x, int m, char *y, int n)
{
int i, j; 

for(j = 0; j <= n - m; ++j) 

{ 

for (i = 0; i < m && x[i] == y[i + j]; ++i);if (i >= m){output(j);} 

} 

}

This code can be converted straightforward to JavaScript or PHP.
 Modern Information Retrieval, Chapter 8, p. 210, has additional information on this and derivative string matching algorithms. These all differ in the way they check and shift the window.

Metric Clusters 2

September 7, 2007

Yesterday I provided a basic explanation of metric weights and metric clusters. This time I want to expand on this topic and provide some how-to calculations.

On Counting Word Distances

As mentioned, the distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) - p(B) |

Some IR authors use a different definition, though. For instance, Baeza-Yates and Ribeiro-Neto in Modern Information Retrieval, page 126, defines the distance between any two words as the number of words between these. The difference between these two definitions reduces to 1 count in the distance scale.

For example, the following position (P) vs. word (W) table depicts a 16-word text stream that mentions A twice and B three times:

P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W A1 X X B1 X X A2 X X X X B2 X X X B3

Adopting the initial definition, d(A1, B1) = 4 - 1 = 3; adopting Baeza-Yates and Ribeiro-Neto’s definition, d(A1, B1) = 2.

The former definition is frequently used since is easier to automate than the later and insures non-zero distance values. To accommodate a distance count to the later definition, subtract 1.

Some authors, like Koberstein and Ng (http://faculty.cs.byu.edu/~dennis/papers/mtcluster.ps), use Baeza-Yates and Ribeiro-Neto’s definition, but then add 1 to the counts to insure that word distances between terms are always non-zero. This is the same as computing distances based on the former definition.

Regardless of how you compute word distances, be consistent. Understand that conversion of distances to metric weights produces relative, comparative data.

In the rest of this post, I use Baeza-Yates and Ribeiro-Neto’s definition of word distances, without adding 1. You can rework the calculations by adding 1 if you wish, but you should be able to arrive to the same general observations.

An Example

It can be demonstrated that for any word pair, A and B, there is a metric weight between the terms such that

mw(A, B) = mw(B, A) = mw(A) = mw(B)

That is, metric weights are symmetric. This is to be expected; metric weights are derived from distances and, by definition, a distance metric is symmetric.

Let consider the previous example.

With respect to A:

mw(A1) = ½ + 1/10 + 1/14 = 0.671

mw(A2) = ½ + ¼ + 1/8 = 0.875

mw(A) = mw(A1) + mw(A2) = 1.546

With respect to B:

mw(B1) = ½ + ½ = 1

mw(B2) = ¼ + 1/10 = 0.35

mw(B3) = 1/8 + 1/14 = 0.196

mw(B) = mw(B1) + mw(B2) + mw(B3) = 1.546

Hence, the metric weight of the A, B pair is

mw(A, B) = mw(B, A) = mw(A) = mw(B) = 1.546

In practice, mw(A, B) is computed with respect to either A or B. In IR textbooks, this quantity is frequently termed an unnormalized correlation factor and used to populate an unnormalized correlation matrix, Cu.

Metric Clusters 1

September 6, 2007

A graduate student asked about distance metrics and metric clusters. Unfortunately, many IR textbooks either ignore the subject or provide abstract explanations, hard to grasp by first-year students.

This post might help graduate students and casual readers to grasp the concept.

Terms that appear closer together in a document are assumed more related than those that appear far apart in the same document. To capture this notion of relatedness, metric clusters have been proposed.

The distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) - p(B) |

This difference is taken for the number of words between the terms. Such distance can be computed before or after filtration–the removal of stop words.

A metric weight, mw, is therefore defined as the inverse of this distance; i.e.

mw = 1/d(A, B)

This is one way of estimating the dispersion between any two words and the distribution of these in a document.

For instance, if A appears once and B appears three times in a document, calculating the metric weight of A reduces to computing the following quantities:

mw(A) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A) = 1/| p(A1) - p(B1) | + 1/| p(A1) - p(B2) | + 1/| p(A1) - p(B3) |

The subscripts in this summation refer to the different instances of the terms. Thus, mw considers the frequency of occurrence, position, and distance between the terms.
 
When a word appears several times in a document, its metric weight is a combination of the weight associated to each instance. For instance, if A appears twice and B appears three times in a document

mw(A1) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A2) = 1/d(A2, B1) + 1/d(A2, B2) + 1/d(A2, B3)
mw(A) = mw(A1) + mw(A2)

This actually is a double summation.

Some have argued to average these.

What applies to words also applies to stems. One just needs to identify sets of words with common stems.

Once metric weights are computed, a term-term co-weight matrix is constructed and clusters are identified by applying standard clustering techniques. These are the so-called metric clusters.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

How Complex Simplex Words Can Be

August 28, 2007

1. Are you interested in the word frequency effect on users?

2. Want to know why high-frequency words such as car is recognized more quickly than a low-frequency word such as doe?

3. Interested in why processing time is determined not only by the frequency of complex words, but by the frequencies of its constitutents and surrounding terms?

If you answer yes to any of these, this post is for you.

Back in 1997, Schreuder and Harald wrote in How Complex Simplex Words Can Be:

“A series of experiments investigated components of the word frequency effect in visual lexical decision, progressive demasking, and subjective frequency ratings. For simplex, i.e., monomorphemic, nouns in Dutch, we studied the effect of the frequency of the monomorphemic noun itself as well as the effect of the frequencies of morphologically related forms on the processing of these monomorphemic nouns. The experiments show that the frequency of the (unseen) plural forms affects the experimental measures. Nouns with high-frequency plurals are responded to more quickly in visual lexical decision, and they receive higher subjective frequency ratings. However, the summed frequencies of the formations in the morphological family of a given noun (the compounds and derived words in which that noun appears as a constituent) did not affect the experimental measures. Surprisingly, the size of the morphological family, i.e., the number of different words in the family, emerged as a substantial factor. A monomorphemic noun with a large family size elicits higher subjective frequency ratings and shorter response latencies in visual lexical decision than a monomorphemic noun with a small family size. The effect of family size disappears in progressive demasking, a task which taps into the earlier stages of form identification. This suggests that the effect of family size arises at more central, post-identification stages of lexical processing.”

Although oldie, I still find their research relevant.

Identifying Hidden Sequences

August 27, 2007

In the Row-Pruning Algorithm Tutorial post I introduced an algorithm for identifying terms sequences hidden in collections. The tutorial is available in Mi Islita.com site. Several applications were also presented.

In a nutshell, given a set of terms T={t1, t2..tm} extracted from a document collection, C={d1, d2..dn} the goal is to combine each of these terms with all other terms of T and then find the family of term sequences more frequently present in C. Such families are of the form

“k1″
“k1 + k2″
“K1 + k2 +..km”

The double quotes indicate that these are term sequences.

The algorithm then is aimed at establishing the identity of the k’s using T and C. Once k1 is identified, the algorithm reduces to conducting a binary partition on the corresponding largest answer sets. Normally T consists of few terms (m < threshold value) and as such is a subset of the index of terms. Whenever possible these should be on-topic terms and preselected by a suitable algorithm or heuristic criterion.

Confidence-based pruning is used, but one could as well use other measures like co-occurrence, support, or similar measures. One does not need to restrict the algorithm to just terms.

I see many applications to this, including visitations:

“Users clicking on X and then on Y tend to click next on Z

“Customers visiting store X and then store Y tend to visit next store Z.”

and so forth…

Row-Pruning Algorithm Tutorial

August 8, 2007

In IRW-2007-8 I introduced a row-pruning algorithm (RPA) and its unconditional version (URPA).

Let C be a collection of documents and let T = {t1, t2… tm} be m unique terms extracted from C. Assume that term i is combined with all other terms such that starting with term i a family of term sequences is obtained. Assume that these term sequences are hidden (latent) in C. The purpose of RPA and URPA is to identify the composition of these term sequences and to find those that occur more frequently in C.

(more…)

Association Rule Mining Thesaurus

July 30, 2007

Here is a 2004 paper on Association Thesaurus Construction for Interactive Query Expansion based on Association Rule Mining

The article discusses basic association rule mining concepts like support, confidence, and pruning as we described in Association Rules Part 1 (July issue of IR Watch - The Newsletter). BTW, read Part 2 in the August issue.

(more…)

Conceptual Similarity and Conceptual Mapping

July 26, 2007

I came across a relatively old paper authored by researchers at Indiana University and Institute for Human and Machine Cognition: Assessing Conceptual Similarity to Support Concept Mapping

Indeed, it seems like a great concept.

(more…)

The Risks of Thesaurus-based Expansions

July 23, 2007

SEOmoz has a great discussion on why at times search engines don’t return relevant results; that is, why some results perceived by users as being not relevant to their information needs (queries) are ranked high by search engines.

Some bloggers at SEOmoz attribute this in part to precision and recall issues. We have covered these topics in different occasions; so, let revisit some points along those lines.

(more…)

On Recall, Precision, and Relevance

June 25, 2007

Recall, Precision, and Relevance

Two important concepts for estimating the retrieval performance of search systems are recall (R) and precision (P). In laymen terms, picture two partially overlapped circles A and B representing answer sets (group of documents). Let C be the overlapping region between A and B and wherein

(more…)

IR Relevance vs. Suggestion Task Relevance

June 21, 2007

This is a great topic for a graduate thesis: Traditional IR considers the problem of matching documents to a query as a single information need to be satisfied. However, since a system doesn’t know what is in the mind of users, the query itself can be a multiple information need.

When you think thoroughly this is why Web searching, how users search and reformulate queries on the Web, is different from, for example, IR searching –wherein user resource to query expansion and relevance feedback mechanisms.

(more…)

On n-Grams and IR Theses

June 15, 2007

A grad student asked me about n-grams in IR as a thesis topic.

What are n-grams

Well, most of the modern work on n-grams is due to the work by D’Amore and Mah (1985) ONE-TIME COMPLETE INDEXING OF TEXT: THEORY AND PRACTICE

Page 116 of Grossman and Frieder (Information Retrieval: Algorithms and Heuristics) has great introductory material.

(more…)

Subsumptions vs Synonyms - Conceptual Indexing Revisited

June 11, 2007

Back in 1997, William Woods, Principal Scientist and Distinguished Engineer at Sun Microsystems Labs, wrote Conceptual Indexing: A Better Way to Organize Knowledge. Although the notion of conceptual indexing turned out to be a complex thing, his paper is still relevant these days wherein many SEOs make incorrect claims about how search engines use Latent Semantic Indexing (LSI) and wherein others are paying attention to synonymy and phrase processing patents. This post is based in part on Woods’s manuscript.

(more…)

Relevance Gap Analysis as Part of SEO Work

June 5, 2007

IR colleagues and marketers are now reading the IRW 2007-06 issue wherein I elaborate on The User-Machine Relevance Perception Gap. So far, I have received some feedback from these. As part of the process, Ben Pfeiffer from Ranksmart asked me a valid question:

(more…)

Call for Papers to Knowledge Discovery Conference

June 4, 2007

Dr. Ellen Voorhees, Director of TREC, over at NIST.gov informed me by email of this Call for Papers. Over the years, I have received invitations to several TREC tracks and no doubt that the groups that conform these are a great place to be.

For those that want to submit manuscript, here is the full Call:

(more…)