Archive for the ‘Search Engines Architecture Course’ Category

Search Engines Architecture Week 9

May 9, 2008

Week 9 Agenda

Lecture Session

This lecture is an extension of our previous lecture. A detailed discussion of the search engine architecture floor plan of the following search engines is presented:

WebCrawler
Google

Main components to be discussed include:

crawlers administration, indexing, forward index, inverted index, posting lists intersection, faulty tolerance, redundancy, query servers load balance, etc.

If we don’t run out of time, we will get into Lucene and other architectures.

Lab Session

Complete previous labs.

Important Note

I am finishing the Practice Test and should be in students inbox by Monday.

Search Engines Architecture Week 8

May 2, 2008

Week 8 Agenda

Lecture Session

To understand the present we need to look at the past.

Thus, in this lecture we will take an in-depth look at early and current search engine architectures and their “floor plan”. We look at some published papers that started what we have today. Some hard-to-find material will be used. These are actual pieces of history that explain few “how-did-they-do-that”..

Since we cannot cover all the search engine architectures as we would like to, I have selected few of these. These were classified in three category:early, old glory, and currrent. The last two are open source projects.

I might add few more. Either way, this lecture might cover two weeks.

Early:

Archie
ALIWEB
WWW Wanderer
WWW Worm
JumpStation
RBSE

Old Glory:

WebCrawler
Lycos

Current:

Google
Lucene
Terrier

Lab Session

Complete previous lab.

Search Engines Architecture Week 7

April 25, 2008

Week 7 Agenda

Lecture Session

Review of a typical
Search Engine Architecture
Regular Expressions for Building Porter’s Stemmer

Lab Session

Finishing Lab 3 and 4

Building the Porter Stemmer

April 22, 2008

This post is for grad students taking the Search Engines Architecture course.

You should have in your email inbox a pdf of Porter’s original article from 1980 and a revised version of Lab 5 to follow the nomenclature utilized by Porter, along with the expected results to check against. Please disregard previous lab version.

The Stemmer is easy to build in any language. You can take a look at some versions on the Web to get some ideas, but you cannot copy these. Your tool should only do what is required in the experiment.

Deadline will be negotiated next time we meet. If you have any questions, feel free to blog these, whether these are about programming or content.

Important quotes or notes from Porter’s original article

1. “A consonant is a word other than a, e, i, o, u or a letter other than y preceded by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of itself does not make it ambiguous.) So in toy the consonants are t and y, in syzygy they are s, z, and g. If a letter is not a consonant it is a vowel.”–Porter.

2. Any word has a measure m, defined as the VC frequency (‘shift’) or number of times the VC pair is repeated, where V is a vowel sequence and C is a consonant sequence; e.g.,

m = 0 tr, ee, y, by
m = 1 trouble, oats, trees, ivy
m = 2 troubles, private, oaten, orrery

Search Engines Architecture Week 6

April 18, 2008

Week 6 Agenda

Lecture Session

Review on Regular Expressions
Lovins, Krovetz, and Porter Stemmers
Xu & Croft Stemmer (Word Co-occurrence-based Stemmers)
The Parallel Stemmer (LSI/Vector Space-based Stemmer)

Lab Session

Building a Stemmer

Comments: Lab 3 is due next week, but if you prefer to turning in earlier is ok. Lab 4 and this lab (Lab 5) are similar in nature. We can negotiate tomorrow in class their deadlines. A reminder that all labs are due in hard and electronic format to get full credit.

How Search Engines Do Not Work

April 17, 2008

If you are taking the Search Engines Architecture grad course, by now you should have learned what are the main components of a search engine and how to build a web crawler and a parser. You should know how to build an inverted index, how to use this to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work are either misconception, myths, or just untrue folklore. Eg., While some have an incorrect idea on how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply non sense.

To illustrate visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first one is about an SEO discussing “how search engines work” and use the Vector Space Model. The second is about the State of Washington suing a marketing company for misselling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

Search Engines Architecture Week 5

April 11, 2008

Week 5 Agenda

Lecture Session

Building a Parser
Parsing Techniques and Implementations

Lab Session

Building a Query Normalizer
Building a Document Parser
Building a URL Normalizer

Comments

A reminder that hard and digital copies are required for all lab reports to get full credit. I need to evaluate whether the programs actually work as intended.

Report deadlines: To be announced to accommodate to class needs and previous lab needs.

Search Engines Architecture Week 4

April 4, 2008

Week 4 Agenda

Lecture Session

Building an Inverted Index
Incorporating Query Modes
Building the Term-Doc Matrix from the Inverted Index

Lab Session

Programming the Parser
Programming the Inverted Index

Adressing Some LSI Questions

April 2, 2008

At the last Search Engines Architecture lecture we discussed LSI and Terrier. Great questions were raised. Some of these follows:

Q: How many dimensions to keep?
A: This is done by trial and error. I have a research project on the topic. None of the current ways of addressing this problem convince me.

Q: How do we compute a truncated version of the initial matrix, A?
A: After SVDing A, truncate U, S, and V by retaining the first k columns of U and V (rows of V transpose) and the first k diagonal elements of S. Multiply these as discussed in class to get A truncated.

Q: To compute the query vector in the reduced space, do we need to compute A truncated for each query?
A: No. The new coordinates of this vectors are defined as
q = qTUkSk-1
This means that A can be called from the cache. See the fast track tutorial
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf
over at Mi Islita.com site.

Q: Do I need to compute A truncated each time a new document is added or previous are modified?
A: For small matrices the answer is YES. However, for huge matrices we can resource to updating/appending techniques. Some of these add doc vectors without recomputing the previous matrix. There is a point wherein this can compromise orthogonality, though.

Q: How do I use Desktop Terrier?
A: Follow the instructions provided in the updated version of Lab Report 2.

Search Engines Architecture Week 3

March 28, 2008

Week 3 Agenda

Lecture Session

Document Indexing
Web Crawling Techniques

Lab Session

For this lab, students should have already signed to download Terrier from http://ir.dcs.gla.ac.uk/terrier. We can use the Desktop API as is, but for development we need JAVA in the local machine.

Lab instructions for using the API will be provided in class. Please read in advanced Terrier documentation. Bring with you a directory (folder) full of documents from the pupr.edu site or your favorite site to play with. This will be analyzed during the lab.

This lab report is due next week.

PCA and SPCA Tutorial

March 25, 2008

As promised, the PCA and SPCA Tutorial is now available at Mi Islita.

I wrote this tutorial for two reasons:

1. to help graduate students taking the Search Engines Architecture course with a general overview and review of linear algebra concepts.
2. because most tutorials discuss PCA, but ignore SPCA.

If you are enrolled in the course and have any question, please feel free to post.

Feel also free to comment or make suggestions so we can improve the tutorial.

Important Terrier Upgrade and Call for Participation

March 20, 2008

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 - 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Search Engines Architecture Week 2

March 14, 2008

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
First-Breadth and Deep-First Web Crawlers
The Terrier Desktop Searches Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of ATA and the trace of this matrix. Do the same for AkTAk. Compare results and draw some conclusions. What important property is confirmed?

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize VkT and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from AkTAk. Explain your observations.

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286,
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith Lindsay (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations
Boldi, P., S. Massimo, V. Sebastiano Vigna (2007)
PDF

Search Engines Architecture Week 1

March 6, 2008

Week 1 Agenda

Lecture Session 

Course overview and description
Review of linear algebra, including vector analysis
Discussion of SVD and PCA

Lab Session

Lab 1. Linear Algebra with EXCEL
     Part 1. Vectors and Search Engine Marketing (SEM)
     Part 2. Vectors and Crawlers
     Part 3. Singular Value Decomposition (SVD) with LSI
     Part 4. Principal Component Analysis (PCA) with SVD

Lecture Material

http://www.miislita.com/information-retrieval-tutorial/information-retrieval-tutorials.html

Search Engines Architecture Graduate Course

February 27, 2008

Search Engines Architecture - Lecturer: Dr. Edel Garcia
Code: CECS 7804/31 Special Topics in C&E
Time: Saturday 8:00 A.M. – 12:00 N, Spring 2008
Room: Software Testing Lab, L-210
Academic Calendar: http://www.pupr.edu/academiccalendar/ac-wi05.pdf
Final Examination Date: May 24, 2008

Description: This is a hands-on, one-full semester course on search engines architecture and their algorithms. Each class consists of lecture and lab sessions. Students are expected to build and test their own search engines and related components on a server dedicated for this purpose.

Course Communication: All communications, upgrades/changes to this syllabus to fulfill class needs, answer to questions, clarifications, etc will be made available through this blog. Posts will be indexed in the Search Engine Architecture Course category. Thus, students must read this category on a regular basis. To access the category, just click on the link listed in the Categories section at the right of this blog home page.

Classroom Policies: Taping is not allowed. Lecture Notes are not published online. Students are required to take notes in the traditional way. All lecture material is copyrighted.

Important Note: Students registered in this course automatically receive an invitation to present their projects developed during this course at the Search Engines and Information Security Conference to be held at Polytechnic University, San Juan Campus during October 3 and 4, 2008. Contact Dr. Garcia at admin@miislita.com for additional details or questions regarding the conference.

Target: Students in Business, Engineering, and Computer Sciences and from other disciplines are encouraged to register for this special course.

Requirements: Permission from advisor or department and knowledge of matrix algebra.

Grading: Weekly Lab Reports, Project Presentations, and a Final Exam. The following scoring system will be used:

Course Grade = Ave*(1 - w) + Fin*w

where

Ave = average of all lab reports and group projects. The lowest 2 partial grades are eliminated.

Fin = final exam score
w = an adjustable weight

The letter grade scale is as follows:

A = 100 - 90; B = 89 - 80; C = 79 - 65; D = 64 - 55; F = 54 - 0

Topics: Although not necessarily in this order, some of the topics to be covered, include, but are not limited to the followings:

Linear Algebra Fast Track Tutorial: Brief tutorial on matrix operations with emphasis on vector theory and Singular Value Decomposition

Parser Building: Use of regular expressions to build, test, and use a parser.

Crawler Building: Use of AJAX to build a client-side crawler and a dedicated server-side crawler.

Look-Up Directories: Implementing a look-up directory and pseudo site search tool.

Search Interfaces: Developing and testing search interfaces, their search modes, and advanced search features.

Index and Database Building: Data fragmentation and storing.

Sanitizing and Ranking Answer Sets: Filtering, De-duplicating, and ranking answer set results.

Textbook: There is no official textbook. Open source components will be used or developed by the students. However, the following reference books are recommended for research. Additional references and extended syllabus will be provided in class.

Modern Information Retrieval (Baeza-Yates and Ribeiro-Neto; Addison Wesley).
Information Retrieval – Algorithms and Heuristics (Grossman and Frieder; Springer).