Archive for the ‘IR Tutorials’ Category

PCA and SPCA Tutorial

March 25, 2008

As promised, the PCA and SPCA Tutorial is now available at Mi Islita.

I wrote this tutorial for two reasons:

1. to help graduate students taking the Search Engines Architecture course with a general overview and review of linear algebra concepts.
2. because most tutorials discuss PCA, but ignore SPCA.

If you are enrolled in the course and have any questions, please feel free to post.

Also feel free to comment or make suggestions so we can improve the tutorial.

Search Engines Architecture Week 2

March 14, 2008

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
Breadth-First and Depth-First Web Crawlers
The Terrier Desktop Search Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of $A^T A$ and the trace of this matrix. Do the same for $A_k^T A_k$. Compare results and draw some conclusions. What important property is confirmed?
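To check this task numerically, the following is a minimal NumPy sketch, not the lab's own procedure. The matrix A is a made-up 4 x 3 example, and the helper names eigen_sum_and_trace and rank_k_approximation are hypothetical. It relies on the standard identity that the trace of a square matrix equals the sum of its eigenvalues.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix; the values are made up for illustration.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 3.0],
])


def eigen_sum_and_trace(M):
    """Return (sum of eigenvalues, trace) of the symmetric matrix M^T M."""
    G = M.T @ M
    eigenvalues = np.linalg.eigvalsh(G)  # eigvalsh is meant for symmetric matrices
    return eigenvalues.sum(), np.trace(G)


def rank_k_approximation(M, k):
    """Truncated SVD: rebuild M from its k largest singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


full_sum, full_trace = eigen_sum_and_trace(A)
Ak = rank_k_approximation(A, k=2)
k_sum, k_trace = eigen_sum_and_trace(Ak)

print("A^T A:     eigenvalue sum =", full_sum, " trace =", full_trace)
print("Ak^T Ak:   eigenvalue sum =", k_sum, " trace =", k_trace)
```

In each pair the two numbers should agree up to rounding error, and the values for the rank-2 matrix should not exceed those for the full matrix, since truncation drops the smallest singular value.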

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize $V_k^T$ and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from $A_k^T A_k$. Explain your observations.
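The first half of this task can be sketched as follows. This is only an illustration under stated assumptions: A is a made-up matrix, k = 2 is an arbitrary choice, "column-normalize" is taken to mean scaling each column to unit Euclidean length, and the similarity matrix is taken to be the matrix of cosine similarities between those columns. Cluster extraction itself is not shown.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix; the values are made up for illustration.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 3.0],
])
k = 2  # assumed number of dimensions to keep

# Rows of Vt are the right singular vectors; keeping k rows gives Vk^T (k x n).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vkt = Vt[:k, :]

# Scale every column to unit length (assumption: L2 normalization).
norms = np.linalg.norm(Vkt, axis=0)
norms[norms == 0] = 1.0  # guard against division by zero for an all-zero column
Vkt_unit = Vkt / norms

# With unit-length columns, the product below holds cosine similarities.
S = Vkt_unit.T @ Vkt_unit
print(np.round(S, 3))
```

The result is a symmetric n x n matrix with ones on the diagonal, one row and one column per document.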

In Part 4, section 4.1.1, add the following task:

Using Excel, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286.
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith, L. (2002).
PDF

5. A tutorial on Principal Component Analysis; Shlens, J. (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations; Boldi, P., Santini, M., & Vigna, S. (2007).
PDF

IRW-2008-2 Sneak Preview

February 18, 2008

Scalar Clusters

Masking Effects in Similarity Matrices:
Topics with strong similarities can mask other topics,
making these invisible to the clustering process.

Here is the sneak preview of the current issue of IR Watch - The Newsletter. Due to academic duties, it is running late. It should be in subscribers’ inboxes during the day. The topic is Scalar Clusters and Back Mapping and is based on lecture material covered in the Web Mining course. The following topics are covered:

Introduction
On Association and Scalar Clusters
The Neighborhood Similarity Matrix, Mn
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Term Clusters to Documents
Masking Effects in Similarity Matrices
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Not a subscriber? What are you waiting for?

Back Mapping for the Masses

January 31, 2008

In a recent tutorial on association and scalar clusters, http://www.miislita.com/information-retrieval-tutorial/association-scalar-clusters-tutorial-1.pdf, I introduced a back mapping technique in which, once clusters of features are extracted from objects, the clusters are mapped back to those objects.

The technique works well with clusters of terms extracted from documents. The reverse case is also possible: given a cluster of documents extracted from terms, it is possible to map these back to terms.
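As a rough illustration only, and not the procedure from the tutorial, here is a small sketch of what mapping in both directions could look like. It assumes a binary term-document incidence matrix and assumes that "mapping back" means finding the objects that contain every feature in a cluster. The matrix X, the term list, the clusters, and the helper back_map are all hypothetical.

```python
import numpy as np

# Hypothetical vocabulary and binary term-document incidence matrix
# (rows = terms, columns = documents); all values are made up.
terms = ["search", "engine", "matrix", "eigenvalue"]
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])


def back_map(incidence, cluster_rows):
    """Return the column indices in which every row of the cluster is present.

    Assumption: a cluster maps back to a column only if all of its rows
    have a nonzero entry in that column.
    """
    present = incidence[cluster_rows, :] > 0
    return np.flatnonzero(present.all(axis=0)).tolist()


# Term cluster -> documents: which documents contain both "search" and "engine"?
term_cluster = [0, 1]
docs = back_map(X, term_cluster)

# Document cluster -> terms: transpose the matrix and reuse the same helper.
doc_cluster = [2, 3]
shared_terms = back_map(X.T, doc_cluster)

print("documents for term cluster:", docs)
print("terms for document cluster:", [terms[i] for i in shared_terms])
```

Reusing one helper on the transposed matrix reflects the symmetry described above: the same operation serves both directions once rows and columns swap roles.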

What do we gain from such two-way manipulations? A lot. Consider the first scenario: Mapping term clusters back to documents; a tutorial on the second scenario will be available soon.

Back Mapping Term Clusters to Documents

A document is just a distribution over topics while topics are distributions over words. Thus, across a collection of documents there are topics hidden (latent) and waiting to be uncovered. Back mapping allows us to recover these, precisely.

Combinations of terms that do not amount to topics across the collection are discovered as well. Reasonably, one would expect these to be less relevant across other documents than those distributed across the collection. In addition, one would expect documents traced back to clusters to be the most relevant documents in the collection with respect to the topics.

The implications of this for search engine optimization and keyword bidding are quite obvious. Implementation is straightforward. To learn more about it, read Part 1 of the tutorial.

A Case-Based Experience Sharing Search Engine

November 28, 2007

Mobyen Ahmed, Erik Olsson, Peter Funk, and Ning Xiong from the Department of Computer Science and Electronics, Malardalen University, Sweden, have published the paper “Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System” (http://www.mrtc.mdh.se/publications/1269.pdf) at a workshop of the Swedish Artificial Intelligence Society, pp. 70-80, in May 2007.

This is a case-based reasoning search engine system that could be used in an industrial environment. That’s quite interesting.

In their paper, the authors kindly referenced me. That’s an honor. It appears that more CS researchers are happy with my Cosine Similarity Tutorial (http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html) and with Mi Islita’s content (http://www.miislita.com/searchito/educational-links.html). If they are happy, I’m happy.

Being Referenced Without Being Referenced

September 13, 2007

I have the honor of being “referenced” by Lau Patrick from ETH’s Databases and Information Systems Group in the manuscript Algorithms for Data Base Systems Report to the topic “Abusing Web Search” based on “A web based kernel function for measuring the similarity of short text snippets”.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/LauReport.pdf

Interesting project. I can even recognize some figures and examples of the manuscript taken from my document indexing tutorial,
http://www.miislita.com/information-retrieval-tutorial/indexing.html

In particular, Figure 1 of the tutorial is shown in the manuscript as Figure 2. No link to the tutorial was provided, only a link to a different document at Mi Islita. That’s fine, since I noted that my Figure 1 is a modification of the one at

http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

At least they should have given credit to Ruthven and Lalmas to avoid allegations of student plagiarism.

Memo to Thomas Hofmann (Google), Donald Kossmann, and Peter Widmayer from http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/:

Please! Rather than talking about Web search abuse, similarity of text, and duplicated content, let’s talk about lack of originality and properly referenced work.

Levenshtein Edit-Distance Based Tool

August 20, 2007

As announced, the Levenshtein Edit-Distance Based Tool is now available at the Mi Islita.com site.

The tool is meant for demonstration purposes; e.g., in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!

Upcoming Tool on Edit Distances

August 17, 2007

I’m working on a tool for computing edit distances (number of insertions, deletions, and substitutions) in a text stream.

It will be up and running this Monday. It is great for a hands-on tutorial.

Did you know that changing Democrats into Republicans, and vice versa, requires just 8 edits? :)
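For readers who want to check that count themselves, here is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance, with insertions, deletions, and substitutions each costing 1. It is not the code behind the tool, and the function name levenshtein is just an illustrative choice. Comparison is case-sensitive.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn source into target (each edit costs 1)."""
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + substitution_cost,   # match or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein("Democrats", "Republicans"))  # 8
print(levenshtein("Republicans", "Democrats"))  # 8
```

Only two rows of the table are kept at any time, so memory grows with the length of the target string rather than with the product of both lengths.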

(more…)

Being Quoted at University of Campinas, Brazil

May 24, 2007

More universities are quoting our tutorials. I’m happy to learn that these are read even in the huge and great nation that is Brazil.

Today we found out that Prof. Wu, Shin-Ting, from the Department of Computer Engineering and Industrial Automation, School of Electrical and Computer Engineering, State University of Campinas, Sao Paulo, Brazil, who teaches EA978 Graphic Information Systems, is referencing our Matrix Tutorial 3: Eigenvalues and Eigenvectors.

(more…)

Our Tutorials, Required Readings at University of Maryland

May 18, 2007

Yan Qu over at the College of Information Studies, University of Maryland, teaches the graduate course

LBSC 670 Information Structure

For the course, Qu selected our tutorials as required readings:

(more…)