Archive for the ‘Machine Learning’ Category

A CBR Sharing Search Engine System

March 17, 2009

I’m reading with great interest the paper
Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System
, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, Ning Xiong, and presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management, p 305-314, COMADEM 2007, Faro, Portugal,

I’m happy to read they referenced our Tutorial on Cosine Similarity Measures. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.

Their abstract follows:

ABSTRACT
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.

Yahoo! BOSS and Google SearchWiki

November 21, 2008

Search engines are realizing they can earn more revenues by allowing users to manipulate their indexes. Yahoo! BOSS and Google SearchWiki are efforts in that direction.

Essentially these two search platforms provide search refrits of whatever the search engines already have in their indexes, but with some features added for resorting, editing, and making annotations.

We can see some value for using these as aggregation layers for third-party search technologies and for some type of searchers prone to personalization. Such type of searches might be useful for conducting annotated Web Intelligence.

Still these are not platforms designed for searching the Deep Web.

Best Algorithm Combinations for Speech Processing

October 15, 2008

I am happy to read that my Cosine Similarity tutorial is being referenced by Serguei Mokhov in his paper

“Study of best algorithm combinations for speech processing tasks in machine learning using median vs. mean clusters in MARF”

http://portal.acm.org/citation.cfm?id=1370262

The paper was presented in the ACM International Conference Proceeding Series; Vol. 290 Proceedings of the 2008 C3S2E conference and is a great read.

Independence, Disjointness, and IR Flaws

August 11, 2008

Often independent events are mistaken for exclusive (disjoint) events. These are two different animals.

Consider two events, A and B. Let p(A OR B) be their union probability and p(A AND B) their joint probability. The general addition law for probabilities states that for any two events A and B

p(A OR B) = p(A) + p(B) – p(A AND B)

If events are independent

p(A AND B) = p(A)p(B)

Thus,

p(A OR B) = p(A) + p(B) – p(A)p(B)

Whereas if these are exclusive

p(A AND B) = 0

Therefore,

p(A OR B) = p(A) + p(B)

Furthermore,

if p(A AND B) = p(A)p(B) events are independent, occurring by chance.
if p(A AND B) > p(A)p(B) events are positively correlated, occurring more often than by chance.
if p(A AND B) < p(A)p(B) events are negatively correlated, occurring less often than by chance.

Talking in “rice and beans” (Hablando en “arroz con habichelas”):

Exclusive events do not have common outcomes as the occurrence of one excludes the occurrence of the other. By contrast, independent events have common outcomes, but the occurrence of one does not influence the occurrence of the other.

Independence and disjointness are very different things.

In IR, assuming that the IDF of a combination of terms can be taken for the sum of individual term IDF values presumes that terms are independent regardless of the actual data.

Arbitrarily assuming event independence, ignoring the experimental evidence, is one of the main sources of innaccuracies/flaws in many IR models (Cooper, 1991). However, excluding independence altogether is also unreasonable (Sparck-Jones, Walker, and Robertson, 1998).

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (ACM, SIGIR ’91) (pp 57-61). Chicago, Illinois: ACM.

Sparck Jones, K., Walker, S., & Robertson, S. E. (1998). A probabilistic model of information retrieval: development and status. TR 446, September. Computer Laboratory, University of Cambridge.

Sneak Preview of IRW: Graduate Research

August 1, 2008

The current issue of IRW, Graduate Students Research, is out. It consists of short abstracts of research conducted by graduate students.

In this issue:

Introduction
Genetic Algorithms, K-Means, and Fuzzy C-Means
Word Association Patterns
U-Site Search Engine Interface
Enhancement of a U-Site Search Engine Interface
News, Research, and Events
Terms of Use and Copyright

The next issue will go back to its how-to mode.

The Porter Stemmer

July 18, 2008

A grad student asks (name omitted):

Dear Dr. Garcia,

I’m interested in developing a Porter Stemmer for the Irish language.

Would it be possible to send me your lecture notes for Porter Stemmer
development from your graduate course?

I am doing an MSc thesis on developing a search engine for TEI marked up
multilingual texts and hope to use Apache Lucene as a basis.

Thanks for any help,

******
MSc student,
UCC, Cork, Ireland.

Thank you for reading this blog and for emailing me, but I normally don’t release lecture notes. However, the lecture was based on Martin Porter’s site, which can be accessed by visiting the following link:

http://tartarus.org/~martin/PorterStemmer/

You might also want to check the Porter2 Stemmer:
http://snowball.tartarus.org/algorithms/english/stemmer.html

For stemmers in other languages, check the Snowball site:
http://snowball.tartarus.org/

The great thing about the Porter stemmer is that it has been written in many programming flavors and languages.

Claps and Slaps

July 14, 2008

Claps

Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ) , who conducts research in LSI and few other great areas at the intersection of IR, NLP, and AI wrote me to mention that he is almost finishing his grad thesis. Thanks, David for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.

Slaps

There is something funny about SEOs that sell snake oil ( http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, markov chains, LSI, and keyword density are few examples. Ha, Ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.

The following links provide additional information about AIRWeb and EMNLP

AIRWeb
http://irthoughts.wordpress.com/2008/04/29/for-seo-spammers-airweb-2008-presentations/  

EMNLP
http://conferences.inf.ed.ac.uk/emnlp08/

IR Quiz

June 18, 2008

Here is a question I included during the final examination of the Search Engines Architecture course. I am modifying the question. It might serve as a little quiz for non IR readers:

A collection consists of 500 documents. Some documents mention k1 and/or k2 keywords. If 100 mention k1, 200 mention k2, 70 mention k1 and k2, and 25 mention the k1 k2 terms sequence. Calculate the number of results for the following queries first, assuming terms independence and second assuming terms dependence. If the calculation is not possible from the provided data, write NC, ‘Not Computable’.

1. k1 NOT k2

2. k2 NOT k1

3. k1 OR k2 (unconditional OR)

4. k1 OR k2 (conditional OR)

5. NOT k1

6. NOT k2

7. NOT (k1 AND k2)

8. k1 AND k2 NOT (k1 k2)

9. EF-Ratio of the k1 k2 terms sequence

10. c12-index of the k1 k2 terms sequence

11. c12-index of k1 AND k2

12. IDF of k1

13. IDF of k2

14. IDF of k1 AND k2

15. IDF of k1 k2 terms sequence

Total Possible Scores: 15 points for terms independence and 15 points for terms dependence correct results.

Grading Yourself: A (100 – 90), B (89 – 80), C (79 – 70), D (69 -60), F(59 – 0)

Correct answers will be given during the week.

 

PowerSet Semantic Searches in Wikipedia

May 12, 2008

According to Reuters,

Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.

Powerset’s technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.

What Google has to say about the topic?

According to PCWorld:

In an interview in October with IDG News Service, Marissa Mayer, Google’s vice president of Search Products & User Experience, acknowledged that the company’s search engine should — and will — overcome its keyword dependence in time.

“People should be able to ask questions and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like ‘what is this about?’. A lot of people will turn to things like the semantic Web as a possible answer to that,” she said.

But she added that Google’s search engine acts smart thanks to the humongous amount of data it crunches. “With a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force,” she said. As examples, she cited a query like “GM,” which the engine interprets as “General Motors” but if the query is “GM foods,” it delivers results for “genetically-modified foods.” “Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart, like it achieved that semantic understanding, but it hasn’t really,” she said.

Hmm…

A search for GM goods and for GM in Powerset returns results relevant to General Motors, while Google does discriminate these searches possibly using brute force.

By contrast, a search for GM foods and for GM in both are discriminated.

PowerSet, Google, and almost all search engines do not seem to discriminate between the following two semantically different searches, which score against aforementioned semantic analysis claims:

Who is the best college junior?

Who is the best junior college?

A simple change in word order affects meaning and the information needs sought. Semantic searches? It is still a long way to go. This gonna be a nice race to watch, from the architectural side.

Talking about search engines architecture, the current issue of IRWatch – The Newsletter is the very same practice test I am giving to my grad students. Since they need to study for the finals, I thought I could kill two birds with one stone. It should reach subscribers inbox today or, at the latest, tomorrow.

How Search Engines Do Not Work

April 17, 2008

If you are taking the Search Engines Architecture grad course, by now you should have learned what are the main components of a search engine and how to build a web crawler and a parser. You should know how to build an inverted index, how to use this to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work are either misconception, myths, or just untrue folklore. Eg., While some have an incorrect idea on how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply non sense.

To illustrate visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first one is about an SEO discussing “how search engines work” and use the Vector Space Model. The second is about the State of Washington suing a marketing company for misselling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

Experiment in Parsing Techniques

April 14, 2008

Here are some useful notes for those taking the Search Engines Architecture grad course.

In Lab 4: Experiment in Parsing Techniques, you are building a Query Normalizer, a Document Parser, and a URL Normalizer.

In section 1, be sure the tool removes all kind of HTML instructions from the search interface.

In section 2, the document parser should does document linearization, tokenization, and filtration, but no stemming. The final output should be a list of tuples consisting of unique terms and occurrences.

Regarding section 3.1, Building a URL Normalizer, I’m adding new restrictions to this part.

This part challenges you to use regular expressions, only. No arrays, no scripting loops, no conditionals (if-then), no lookup libraries, no DNS lookups, just regexps. It should work for all valid urls available on the Web. It can be done.

The parser for the URL normalizer tool should remove from a URL all kind of:

protocols (http, https, ftp, etc)
www prefixes
top-level domains (TLD)
ports (:80, :8000, etc)
file extensions (.html, .php, etc)
parameters (name=value pairs)
named anchors or fragments (#)
URL-forbidden characters, international characters, or script lines

Be sure you understand the difference between top-level domains and subdomains. For example, .com and .pr are TLDs, but .com.pr defines a subdomain.

Let say we have the http://www.xxx.com.pr site. If we remove http://www. and .pr from this we end with xxx.com, but .com is not the TLD in this case. The TLD still is .pr, though. Redirection mechanisms are “another twenty bucks” (”otros veinte pesos”), which might confuse the concept.

You should also know that a TLD can be a generic top-level-domain (gTLD), a country-code top-level domain (ccTLD), international TLDs (iTLDs), or US legacy TLDs (usTLDs). When combined, these define a subdomain. For example,

.edu.pr is a subdomain with .pr as TLD
.co.uk is a subdomain with .uk as TLD

That is, for a url on the Web, only one string sequence works as and defines the top-level domain.

Thus, for:

http://www.google.com.pr
https://www.google.com/adsense/login/en_US/?gsessionid=wfx3oxHhgDU
ftp://www.gobierno.pr/index.html
http://search.music.yahoo.com/search/?m=all&p=britney
http://www.telegraph.co.uk
http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s
http://www.lis.ntu.edu.tw/~mctang/index.htm

your tool should return:

google.com
google
gobierno
search.music.yahoo
telegraph.co
video.google.co
lis.ntu.edu

Recommended Lecture Material:

http://en.wikipedia.org/wiki/.pr
http://www.icann.org/meetings/saopaulo/presentation-dns-conrad-07dec06.pdf
http://www.mattcutts.com/blog/seo-glossary-url-definitions/  

Important Terrier Upgrade and Call for Participation

March 20, 2008

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 – 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Principal Component Analysis Tutorial and Other Algos

March 18, 2008

I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is the equivalent of applying Singular Value Decomposition (SVD) on the covariance matrix of a data set. Mistake not PCA for Latent Semantic Indexing (LSI) as these are different animals.

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select a point on the dominant eigenvectors as centroid and apply K-Means.

Depending on the clusters being inspected this simplifies a lot of things. It is not a silver bullet, but just a handy tool in the researcher’s toolbox. It can be combined with or used to support other exploratory analyses, though.

PCA, SVD, Kendall’s Test, Vector Space Theory, and few other statistical algorithms have been applied for years in Chemometrics. Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels, et. al.

Here is a list of some clustering algorithms used by chemists all over the World:

Pattern Recognition Methods

Unsupervised

Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kahonen networks
Rule induction
Probabilistic methods (eg Autoclass).

Supervised

Discriminant analysis
Neural Networks (Back propagation)
SIMCA
K-Nearest Neighbour (KNN)
Probabilistic methods (eg Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back and in the limelight. There is nothing new under the Sun, sort of speak.

I’m glad I also have a background in Chemistry.

Some times it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find soooo hilarious SEO tales about how SVD and LSI algorithms work.

Back Mapping for SEOs

March 13, 2008

A reader asked me how he could apply back mapping (BM) to SEO work.

Essentially, any algorithm like BM that works at the level of collections can be scaled down and used at a page level. One just needs to decompose a page into passages (chunks of text or text windows) and treat each passage as a pseudo document.

First, convert a document into a text stream; i.e., linearize, tokenize, and filtrate the document. You should end with a linearized piece of text.

 If it turns out that the linearized document is almost identical to the original document,  you might want to go back and partition the original document every x paragraphs or sentences. If not, partition the text stream so you end with segments or text windows. You need to decide if you want to do this very x number of words or characters.

Next, treat each segment or chunk of text as a “document” and construct a term-”document” matrix “A”.

If you want to narrow this down to the level of links, construct a term-link matrix.

 In general, an algorithm that applies to a true term-document matrix A can be applied to “A”.

Welcome to scaling.

Google-IBM-NSF Partnership in Data-Intensive Computing

February 28, 2008

We received few days ago the following Call for Proposals:

Dear Colleague:

We have some exciting news to share with all of you: NSF is partnering with Google and IBM to explore data-intensive computing. Through NSF’s reach, Google and IBM are providing software and services running on a large cluster to the broad academic community to explore innovative research and education ideas in data-intensive computing. Google and IBM launched the Academic Cluster Computing Initiative [1] last October with instructional programs at six pilot universities, and the NSF will be joining this initiative as the first research-oriented pilot partner. We are calling the NSF program to provide access to these types of resources the Cluster Exploratory (CluE).

Please read more about data-intensive computing, NSF’s partnership with Google and IBM, and the upcoming CISE solicitation for CluE.

Jeannette M. Wing, Assistant Director, Computer and Information Science and Engineering Directorate Dan Atkins, Director, Office of Cyberinfrastructure

Data-Intensive Computing

Data-intensive computing is a computational paradigm in which the sheer volume of data is the dominant performance parameter. Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. For example, Google runs an average of 100,000 MapReduce jobs per day on its clusters, processing over 20 petabytes daily [2]. This scale of computing effectively supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. The architecture is extremely fault-tolerant and exhibits high degrees of reliability and availability.

Data intensive computing raises important research challenges:

O For science
  - What are the fundamental capabilities and limitations of this paradigm?
  - What new programming abstractions (including models, languages, algorithms) does this computational model suggest?

O For technology
  - How can we automatically manage the hardware and software of these systems?
  - How can we reduce their power consumption?
O For society
  - What (new) applications can best exploit this computing paradigm?

Data-intensive computing is at the forefront of ultra-large-scale commercial data processing. A July 2006 New York Times article [3] notes that “Google, Microsoft and Yahoo are spending vast sums of capital to build out their computing capabilities.” Not only is there an increasing need for advances in data-intensive computing systems software and hardware, but also an increasing demand for a trained workforce to operate and use these systems. To date, however, the academic community has had limited access to such systems.

Enter Google and IBM

On October 8, 2007, Google and IBM announced they had teamed to provide six universities access to a large-scale computing cluster together with the software and services to use it effectively [4]. After several months of discussions, the NSF will be joining this initiative and will be partnering with Google and IBM to broaden the reach of this powerful computing resource to foster more innovation than might be possible in the initial pilot.

Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do fundamental, disruptive research in data-intensive computing and to explore powerful new applications. This facility can also serve as a tool for educating the next generation of scientists and engineers. This partnership is an excellent example of an academic-industry-government relationship that is a win-win-win situation for all.

System Description

The Google-IBM cluster contains well over a thousand processors connected to terabytes of memory and hundreds of terabytes of storage with internal networking as well as a substantial external network connection. The system will be configured with open source software to include Linux and Apache Hadoop [5] – a large-scale distributed computing platform inspired by Google’s MapReduce [6] and the Google File System [7]. IBM’s Tivoli [8] software will also be used for management, monitoring and dynamic resource provisioning of the cluster.

The system will provide a powerful resource for large-scale data analysis, mining and visualization in addition to support for Internet-scale computing applications. Tutorial information describing the programming environment of the Google-IBM academic cluster available via the CluE program can be found on the Google Code for Educators website [9]. Much of this material was developed in collaboration with the University of Washington, and all of it is available under permissive licenses such as the Creative Commons Attribution License.

Upcoming Solicitation

CISE is currently developing a program solicitation that will invite researchers to submit proposals requesting allocations of the Google-IBM cluster for any new, innovative use of the system and to probe the possibilities and fundamental limits of this new computing paradigm. The emphasis of the program will be to develop new approaches and applications that are outside the typical high-performance computing applications running on today’s supercomputers.

The challenge to the academic community is three-fold: to use existing tools and to develop new programming abstractions for such a “computer” to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications. This resource will also provide an opportunity to teach students how to build, use and manage data-intensive computing systems systems that are already being used widely in industry, but are largely unavailable to students and faculty today.

Please look for the new solicitation which will be posted on the CISE web site. CISE looks forward to your bold, creative proposals for CluE!

[1] Official Google Blog: http://googleblog.blogspot.com/2007/10/let-thousand-servers-bloom.html

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. of the ACM, 51(1), January 2008, pp. 107-113.

[3] http://www.nytimes.com/2006/06/14/technology/14search.html?_r=1&pagewanted=2&oref=slogin

[4] See http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html or http://www-03.ibm.com/press/us/en/pressrelease/22414.wss for the text of the press release.

[5] http://hadoop.apache.org/

[6] http://labs.google.com/papers/mapreduce.html

[7] http://labs.google.com/papers/gfs.html

[8] http://www.ibm.com/software/tivoli/

[9] http://code.google.com/edu/content/parallel.html

IRW-2008-2 Sneak Preview

February 18, 2008

Scalar Clusters

Masking Effects in Similarity Matrices:
Topics with strong similarities can mask other topics,
making these invisible to the clustering process.

Here is the sneak preview of the current issue of IR Watch – The Newsletter. Due to academic duties, it is running late. It should be in subscribers’s inboxes during the day. The topic is Scalar Clusters and Back Mapping and is based on lecture material covered in the Web Mining course. The following topics are covered:

Introduction
On Association and Scalar Clusters
The Neighborhood Similarity Matrix, Mn
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Term Clusters to Documents
Masking Effects in Similarity Matrices
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Not a subscribers? What are you waiting for?

Back Mapping for the Masses

January 31, 2008

In a recent tutorial on association and scalar clusters, http://www.miislita.com/information-retrieval-tutorial/association-scalar-clusters-tutorial-1.pdf, I introduced a back mapping technique wherein once features conforming clusters are extracted from objects, the clusters are mapped back to objects.

The technique works well with clusters of terms extracted from documents. The reverse case is also possible: given a cluster of documents extracted from terms, it is possible to map these back to terms.

What do we gain from such two-way manipulations? A lot. Consider the first scenario: Mapping term clusters back to documents; a tutorial on the second scenario will be available soon.

Back Mapping Term Clusters to Documents

A document is just a distribution over topics while topics are distributions over words. Thus, across a collection of documents there are topics hidden (latent) and waiting to be uncovered. Back mapping allows us to recover these, precisely.

Combinations of terms that do not amount to topics across the collection are discovered as well. Reasonably, one would expect these to be least relevant across other documents than those distributed across the collection. In addition, one would expect documents traced back to clusters to be the most relevant documents, from the collection and with respect to the topics.

The implications of this for search engine optimization and keyword bidding are quite obvious. Implementation is straightforward. To learn more about it, read Part 1 of the tutorial.

Kendall’s Test and PageRank

December 26, 2007

For years marketers have put PageRank-based models into question, via “sandbox” theories and similar tales.

This old paper might help to address the issue on the biased/unbiased nature of these models: Paradoxical Effects in PageRank Incremental Computations.

What interest me the most about the paper is how the authors put into use Kendall’s t theory. Written by Paolo Boldi, Massimo Santini, and Sebastiano Vigna the abstract states:

“Abstract. Deciding which kind of visiting strategy accumulates high-quality pages morequickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectivenessfor crawl strategies: whether the graph obtained after a partial visit is in some senserepresentative of the underlying web graph as far as the computation of PageRank isconcerned. More precisely, we are interested in determining how rapidly the computationof PageRank over the visited subgraph yields node orders that agree with theones computed in the complete graph; orders are compared using Kendall’s t .We describe a number of large-scale experiments that show the following paradoxicaleffect: visits that gather PageRank more quickly (e.g., highest-quality first) arealso those that tend to miscalculate PageRank. Finally, we perform the same kind ofexperimental analysis on some synthetic random graphs, generated using well-knownweb-graph models: the results are almost opposite to those obtained on real webgraphs.”

The authors also state:

“The most classical visit strategies are the following:

Depth-first order: the crawler chooses the next page as the last that wasadded to the frontier; in other words, the visit proceeds in a LIFO fashion.
Random order: the crawler chooses randomly the next page from the frontier.
Breadth-first order: the crawler chooses the next page as the first that wasadded to the frontier; in other words, the visit proceeds in a FIFO fashion.
Omniscient order (or quality-first order): the crawler uses a queue prioritisedby PageRank values [Cho et al. 98]; in other words, it chooses to visitthe page with highest quality among the ones in the frontier. This visit ismeaningless unless a previous PageRank computation of the entire graphhas been performed before the visit, but it is useful for comparisons. A variant of this strategy may also be adopted if we have already performeda crawl and thus we have the (old) PageRank values of (at least some ofthe) pages.”

“Both common sense and experiments (see, in particular, [Boldi et al. 05])suggest that the visits listed above accumulate PageRank in an increasinglyquicker way. This is to be expected, as the omniscient visit will point immediatelyto pages of high quality. The fact that the breadth-first visit yields high-qualitypages was noted in [Najork and Wiener 01].”

“There is, however, a different and also quite relevant problem that has beenpreviously overlooked in the literature: if we assume that the crawler has noprevious knowledge of the web region it has to crawl, it is natural that it will try to detect page quality during the crawl itself, by computing PageRank onthe region it has just seen. We would like to know whether a crawler doing sowill obtain reasonable results or not.”

A Case-Based Experience Sharing Search Engine

November 28, 2007

Mobyen Ahmed, Erik Olsson, Peter Funk, Ning Xiong from Department of Computer Science and Electronics, Malardalen University, Sweden have published the paper “Efficient Condition Monitoring and Diagnosis Using a Case-Based Experiene Sharing System”
(http://www.mrtc.mdh.se/publications/1269.pdf)  at a workshop at the Swedish Artificial Intelligence Society, p 70-80, in May, 2007.

This is a case-based reasoning search engine system that could be used in an industrial environment. That’s quite interesting.

In their paper, the authors kindly referenced me. That’s an honor. It appears that more CS researchers are happy with my Cosine Similarity Tutorial (http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html) and with Mi Islita’s content (http://www.miislita.com/searchito/educational-links.html). If they are happy, I’m happy.

Terrier IR 1.1.1 Update

November 9, 2007

Craig MacDonald, from the Terrier project at University of Glasgow sent me this update few days ago.

“Terrier, IR Platform v 1.1.1 – 24/10/2007. http://ir.dcs.gla.ac.uk/terrier/

Terrier, the open source IR platform from the University of Glasgow has been updated to version 1.1.1.

This is a minor update, which contains mostly bug fixes. Some minor code enhancements and a test harness are also included. Moreover, the Snowball stemmers were added to boost support for languages other than English.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL) and is available from the Terrier website: http://ir.dcs.gla.ac.uk/terrier

Back to Business

November 8, 2007

Back to business. IRW will soon be out. This month issue covers the popular K-Means Algorithm and its variants. A back issue on Genetic Algorithms will also be sent. This issue was supposed to be out last month. Sorry it took that long. As most of you know, I was away from the internet, attending other duties.

My inbox is full, as expected. I will reply to all of your emails, … eventually. Stay with me.

Well, next week begins the Winter semester at Polytechnic University. I’ll be teaching the graduate course:

Web Mining: A First Course in Web Mining, Search Engines, and Business Intelligence.

Description: This is a hands-on, one-full semester course on Web Mining, search engines, and business intelligence. Students will learn by doing: (a) how search engines index and rank web documents, (b) how to conduct business intelligence from online resources, and (c) how to apply Web Mining strategies and algorithms in their research or workplace.Target: Students in Business, Engineering, and Computer Sciences and from other disciplines are encouraged to register for this special course. Requirements: Calculus II or Permission from advisor or department. Grading: Take-home work and a final exam. Topics: The following topics will be covered, not necessarily in this order: 

  • Document Indexing: Indexing of Web sites and text operations used by search engines including document linearization, tokenization, stop word filtration, stemming, and parsing.
  • Search Engine Optimization: Mining search engine relevance algorithms for ranking high Web pages.
  • Intelligence Searching: Covers undocumented (smart) searches in Google and other search engines; includes Hacking and Penetration through customized searches.
  • Keyword Research and Clustering: Discovery of word patterns and keywords for branding and marketing through Association, Scalar, and Metric Clusters.
  • Term Matching Algorithms: Vector Space Models used by search engines. Scoring of local, global, and entropy term weights.
  • Concept Matching Algorithms: Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) models for clustering and ranking.
  • Link Analysis Models: Google’s PageRank, Hubs & Authorities, and other link-based models.
  • Spam Intelligence: Tools and techniques for spamming search engines and web sites. Includes techniques based on scripts, cloacking, keyword spam techniques, link-bombs, email marketing, viral marketing, Web 2.0, and Web 3.0.
  • Introduction to Business Dashboards (BDs): Overview of dashboard technology, including open source, and customized add-on components.
  • Special Topics: On-Topic Analysis, Co-Occurrence Theory, and Latent Graphs. This is my own area of research.

 Textbook: Web Mining evolves on a daily basis; thus, there is no official textbook. However, the following reference books are recommended for research. Additional references will be provided in class. 

  1. Modern Information Retrieval (Baeza-Yates and Ribeiro-Neto; Addison Wesley).
  2. Information Retrieval – Algorithms and Heuristics (Grossman and Frieder; Springer).

This is going to be fun! 

A Novel Perspective of Similarity

September 18, 2007

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Random Notes and LauraMansfield

September 12, 2007

These are some late random notes. Sorry for the delay.

1. I am putting together a research project for a graduate student. The topic is quite interesting: homeland security. While researching the topic I came across LauraMansfield.com site. Mansfield’s site is a goldmine of information, especially for those interested in co-occurrence and word association research applied to the terrorist knowledge domain.

2. I am reviewing a graduate thesis in which logistic regression is used for data mining medical claims. Quite interesting the thesis topic. The manuscript needs some rework, though.

3. I am reading bits and pieces of an old paper on the non-transitivity nature of Jaccard’s Coefficient and a proposed indirect similarity measure.

Brute Force Algorithm

September 10, 2007

The brute force algorithm (BF) consists in evaluating a text stream and determining at all positions between 0 and n-m, whether an occurrence of a pattern starts there or not. After each attempt, it shifts the window pattern by exactly one position to the right. This requires of a constant extra space. Comparisons can be done in any order.

This is quite a brute force approach, hence its name. Two loops are required for its implementation. A C code is given below.

void BF(char *x, int m, char *y, int n)
{
int i, j; 

for(j = 0; j <= n - m; ++j) 

{ 

for (i = 0; i < m && x[i] == y[i + j]; ++i);if (i >= m){output(j);} 

} 

}

This code can be converted straightforward to JavaScript or PHP.
 Modern Information Retrieval, Chapter 8, p. 210, has additional information on this and derivative string matching algorithms. These all differ in the way they check and shift the window.

Metric Clusters 2

September 7, 2007

Yesterday I provided a basic explanation of metric weights and metric clusters. This time I want to expand on this topic and provide some how-to calculations.

On Counting Word Distances

As mentioned, the distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) – p(B) |

Some IR authors use a different definition, though. For instance, Baeza-Yates and Ribeiro-Neto in Modern Information Retrieval, page 126, defines the distance between any two words as the number of words between these. The difference between these two definitions reduces to 1 count in the distance scale.

For example, the following position (P) vs. word (W) table depicts a 16-word text stream that mentions A twice and B three times:

P 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W A1 X X B1 X X A2 X X X X B2 X X X B3

Adopting the initial definition, d(A1, B1) = 4 – 1 = 3; adopting Baeza-Yates and Ribeiro-Neto’s definition, d(A1, B1) = 2.

The former definition is frequently used since is easier to automate than the later and insures non-zero distance values. To accommodate a distance count to the later definition, subtract 1.

Some authors, like Koberstein and Ng (http://faculty.cs.byu.edu/~dennis/papers/mtcluster.ps), use Baeza-Yates and Ribeiro-Neto’s definition, but then add 1 to the counts to insure that word distances between terms are always non-zero. This is the same as computing distances based on the former definition.

Regardless of how you compute word distances, be consistent. Understand that conversion of distances to metric weights produces relative, comparative data.

In the rest of this post, I use Baeza-Yates and Ribeiro-Neto’s definition of word distances, without adding 1. You can rework the calculations by adding 1 if you wish, but you should be able to arrive to the same general observations.

An Example

It can be demonstrated that for any word pair, A and B, there is a metric weight between the terms such that

mw(A, B) = mw(B, A) = mw(A) = mw(B)

That is, metric weights are symmetric. This is to be expected; metric weights are derived from distances and, by definition, a distance metric is symmetric.

Let consider the previous example.

With respect to A:

mw(A1) = ½ + 1/10 + 1/14 = 0.671

mw(A2) = ½ + ¼ + 1/8 = 0.875

mw(A) = mw(A1) + mw(A2) = 1.546

With respect to B:

mw(B1) = ½ + ½ = 1

mw(B2) = ¼ + 1/10 = 0.35

mw(B3) = 1/8 + 1/14 = 0.196

mw(B) = mw(B1) + mw(B2) + mw(B3) = 1.546

Hence, the metric weight of the A, B pair is

mw(A, B) = mw(B, A) = mw(A) = mw(B) = 1.546

In practice, mw(A, B) is computed with respect to either A or B. In IR textbooks, this quantity is frequently termed an unnormalized correlation factor and used to populate an unnormalized correlation matrix, Cu.

Metric Clusters 1

September 6, 2007

A graduate student asked about distance metrics and metric clusters. Unfortunately, many IR textbooks either ignore the subject or provide abstract explanations, hard to grasp by first-year students.

This post might help graduate students and casual readers to grasp the concept.

Terms that appear closer together in a document are assumed more related than those that appear far apart in the same document. To capture this notion of relatedness, metric clusters have been proposed.

The distance between any two words, A and B, in a document is defined by the absolute difference of the positions of any occurrence of A and B:

d(A, B) = | p(A) – p(B) |

This difference is taken for the number of words between the terms. Such distance can be computed before or after filtration–the removal of stop words.

A metric weight, mw, is therefore defined as the inverse of this distance; i.e.

mw = 1/d(A, B)

This is one way of estimating the dispersion between any two words and the distribution of these in a document.

For instance, if A appears once and B appears three times in a document, calculating the metric weight of A reduces to computing the following quantities:

mw(A) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A) = 1/| p(A1) – p(B1) | + 1/| p(A1) – p(B2) | + 1/| p(A1) – p(B3) |

The subscripts in this summation refer to the different instances of the terms. Thus, mw considers the frequency of occurrence, position, and distance between the terms.
 
When a word appears several times in a document, its metric weight is a combination of the weight associated to each instance. For instance, if A appears twice and B appears three times in a document

mw(A1) = 1/d(A1, B1) + 1/d(A1, B2) + 1/d(A1, B3)
mw(A2) = 1/d(A2, B1) + 1/d(A2, B2) + 1/d(A2, B3)
mw(A) = mw(A1) + mw(A2)

This actually is a double summation.

Some have argued to average these.

What applies to words also applies to stems. One just needs to identify sets of words with common stems.

Once metric weights are computed, a term-term co-weight matrix is constructed and clusters are identified by applying standard clustering techniques. These are the so-called metric clusters.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

How Complex Simplex Words Can Be

August 28, 2007

1. Are you interested in the word frequency effect on users?

2. Want to know why high-frequency words such as car is recognized more quickly than a low-frequency word such as doe?

3. Interested in why processing time is determined not only by the frequency of complex words, but by the frequencies of its constitutents and surrounding terms?

If you answer yes to any of these, this post is for you.

Back in 1997, Schreuder and Harald wrote in How Complex Simplex Words Can Be:

“A series of experiments investigated components of the word frequency effect in visual lexical decision, progressive demasking, and subjective frequency ratings. For simplex, i.e., monomorphemic, nouns in Dutch, we studied the effect of the frequency of the monomorphemic noun itself as well as the effect of the frequencies of morphologically related forms on the processing of these monomorphemic nouns. The experiments show that the frequency of the (unseen) plural forms affects the experimental measures. Nouns with high-frequency plurals are responded to more quickly in visual lexical decision, and they receive higher subjective frequency ratings. However, the summed frequencies of the formations in the morphological family of a given noun (the compounds and derived words in which that noun appears as a constituent) did not affect the experimental measures. Surprisingly, the size of the morphological family, i.e., the number of different words in the family, emerged as a substantial factor. A monomorphemic noun with a large family size elicits higher subjective frequency ratings and shorter response latencies in visual lexical decision than a monomorphemic noun with a small family size. The effect of family size disappears in progressive demasking, a task which taps into the earlier stages of form identification. This suggests that the effect of family size arises at more central, post-identification stages of lexical processing.”

Although oldie, I still find their research relevant.

Identifying Hidden Sequences

August 27, 2007

In the Row-Pruning Algorithm Tutorial post I introduced an algorithm for identifying terms sequences hidden in collections. The tutorial is available in Mi Islita.com site. Several applications were also presented.

In a nutshell, given a set of terms T={t1, t2..tm} extracted from a document collection, C={d1, d2..dn} the goal is to combine each of these terms with all other terms of T and then find the family of term sequences more frequently present in C. Such families are of the form

“k1″
“k1 + k2″
“K1 + k2 +..km”

The double quotes indicate that these are term sequences.

The algorithm then is aimed at establishing the identity of the k’s using T and C. Once k1 is identified, the algorithm reduces to conducting a binary partition on the corresponding largest answer sets. Normally T consists of few terms (m < threshold value) and as such is a subset of the index of terms. Whenever possible these should be on-topic terms and preselected by a suitable algorithm or heuristic criterion.

Confidence-based pruning is used, but one could as well use other measures like co-occurrence, support, or similar measures. One does not need to restrict the algorithm to just terms.

I see many applications to this, including visitations:

“Users clicking on X and then on Y tend to click next on Z

“Customers visiting store X and then store Y tend to visit next store Z.”

and so forth…

Row-Pruning Algorithm Tutorial

August 8, 2007

In IRW-2007-8 I introduced a row-pruning algorithm (RPA) and its unconditional version (URPA).

Let C be a collection of documents and let T = {t1, t2… tm} be m unique terms extracted from C. Assume that term i is combined with all other terms such that starting with term i a family of term sequences is obtained. Assume that these term sequences are hidden (latent) in C. The purpose of RPA and URPA is to identify the composition of these term sequences and to find those that occur more frequently in C.

(more…)