Archive for March, 2008

Search Engines Architecture Week 3

March 28, 2008

Week 3 Agenda

Lecture Session

Document Indexing
Web Crawling Techniques

Lab Session

For this lab, students should have already signed to download Terrier from http://ir.dcs.gla.ac.uk/terrier. We can use the Desktop API as is, but for development we need JAVA in the local machine.

Lab instructions for using the API will be provided in class. Please read in advanced Terrier documentation. Bring with you a directory (folder) full of documents from the pupr.edu site or your favorite site to play with. This will be analyzed during the lab.

This lab report is due next week.

PCA and SPCA Tutorial

March 25, 2008

As promised, the PCA and SPCA Tutorial is now available at Mi Islita.

I wrote this tutorial for two reasons:

1. to help graduate students taking the Search Engines Architecture course with a general overview and review of linear algebra concepts.
2. because most tutorials discuss PCA, but ignore SPCA.

If you are enrolled in the course and have any question, please feel free to post.

Feel also free to comment or make suggestions so we can improve the tutorial.

Verizon, FCC, and the C Block Competition

March 24, 2008

Now that the B and C blocks of the spectrum has been allotted by the FCC things are set-ready-go to open mobile broadband U.S. networks, broadband IR, and, yes, to a whole new hacking space. It’s a matter of time. The C Block hacking competition is coming. Never ignore what can be done with such new playground.

I wonder how the FCC is going to enforce regulations on the 22-MHz portion of the spectrum, already handled to Verizon. http://www.pcworld.com/article/id,143705-c,industrynews/article.html

Meanwhile, IR research centered around open broadband networks are needed, so as search engines.

Important Terrier Upgrade and Call for Participation

March 20, 2008

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 - 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Principal Component Analysis Tutorial and Other Algos

March 18, 2008

I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is the equivalent of applying Singular Value Decomposition (SVD) on the covariance matrix of a data set. Mistake not PCA for Latent Semantic Indexing (LSI) as these are different animals.

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select a point on the dominant eigenvectors as centroid and apply K-Means.

Depending on the clusters being inspected this simplifies a lot of things. It is not a silver bullet, but just a handy tool in the researcher’s toolbox. It can be combined with or used to support other exploratory analyses, though.

PCA, SVD, Kendall’s Test, Vector Space Theory, and few other statistical algorithms have been applied for years in Chemometrics. Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels, et. al.

Here is a list of some clustering algorithms used by chemists all over the World:

Pattern Recognition Methods

Unsupervised

Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kahonen networks
Rule induction
Probabilistic methods (eg Autoclass).

Supervised

Discriminant analysis
Neural Networks (Back propagation)
SIMCA
K-Nearest Neighbour (KNN)
Probabilistic methods (eg Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back and in the limelight. There is nothing new under the Sun, sort of speak.

I’m glad I also have a background in Chemistry.

Some times it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find soooo hilarious SEO tales about how SVD and LSI algorithms work.

Search Engines Architecture Week 2

March 14, 2008

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
First-Breadth and Deep-First Web Crawlers
The Terrier Desktop Searches Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of ATA and the trace of this matrix. Do the same for AkTAk. Compare results and draw some conclusions. What important property is confirmed?

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize VkT and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from AkTAk. Explain your observations.

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286,
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith Lindsay (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations
Boldi, P., S. Massimo, V. Sebastiano Vigna (2007)
PDF

Back Mapping for SEOs

March 13, 2008

A reader asked me how he could apply back mapping (BM) to SEO work.

Essentially, any algorithm like BM that works at the level of collections can be scaled down and used at a page level. One just needs to decompose a page into passages (chunks of text or text windows) and treat each passage as a pseudo document.

First, convert a document into a text stream; i.e., linearize, tokenize, and filtrate the document. You should end with a linearized piece of text.

 If it turns out that the linearized document is almost identical to the original document,  you might want to go back and partition the original document every x paragraphs or sentences. If not, partition the text stream so you end with segments or text windows. You need to decide if you want to do this very x number of words or characters.

Next, treat each segment or chunk of text as a “document” and construct a term-”document” matrix “A”.

If you want to narrow this down to the level of links, construct a term-link matrix.

 In general, an algorithm that applies to a true term-document matrix A can be applied to “A”.

Welcome to scaling.

How Many SEO Myths In One Sentence?

March 12, 2008

When I thought I have read enough SEO myths here is another SEO “expert” combining many of these in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

Back Mapping Document Clusters to Terms

March 11, 2008
Back Mapping

Back Mapping is a simple technique that
elegantly identifies local and global topics.

Here is a snake preview of the current issue of IR Watch - The Newsletter. It should be in subscribers inbox during the day.

IR Watch 2008-03: Back Mapping Documents to Terms

Keywords: back mapping, documents, terms, association clusters, scalar clusters

In this issue:

Introduction
Revisiting Association and Scalar Clusters
On Back Mapping
Back Mapping Example
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Document Clusters to Terms
Identifying Local and Global Topics
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Search Engines for Penetration Testing Course

March 7, 2008

More on the topic of college education and search engines:

In the Caribbean we don’t have seasons, just early spring breaks –compared to back in the states. After a 2-week spring break vacation, the new trimester just started. I am teaching tomorrow the Search Engines Architecture Course. This is going to be fun.

BTW. I am putting together a new graduate course for the next fall semester: Search Engines for Penetration Testing. I look forward to any suggestion from colleagues and from former  or prospective students. This is going to be a nirvana for ethical hackers/spammers.

Search Engines Architecture Week 1

March 6, 2008

Week 1 Agenda

Lecture Session 

Course overview and description
Review of linear algebra, including vector analysis
Discussion of SVD and PCA

Lab Session

Lab 1. Linear Algebra with EXCEL
     Part 1. Vectors and Search Engine Marketing (SEM)
     Part 2. Vectors and Crawlers
     Part 3. Singular Value Decomposition (SVD) with LSI
     Part 4. Principal Component Analysis (PCA) with SVD

Lecture Material

http://www.miislita.com/information-retrieval-tutorial/information-retrieval-tutorials.html