Week 3 Agenda

1. Introduction to Parsing: Building a Query Normalizer (PPT Presentation).
2. Understanding Search Engine Snippets via the SES 2005, San Jose presentation: Patents on Duplicated Content (PDF Presentation).
3. Demonstration of Snippy Software: A Snippet Generator and Constrain Searcher.
4. Introduction to Association and Scalar Clusters: Keyword Clustering through a Similarity Matrix (PDF Presentation).

Required Reading Material

  Note: The Word and PDF versions of this talk are no longer available on the Web.

Bonus for Take-Home Work 1

This bonus is available to individual students, not to student groups. Each student must submit two separate working HTML scripts: (a) one containing the binary interface and (b) one containing the cruncher.

Having learned this week how to build the search interface of a query normalizer, complete the following tasks:

1. Modify this search interface so that it only accepts binary data (1s or 0s).
2. Modify the search interface so that it becomes an HTML cruncher application. Add customized prototype methods so that the cruncher removes tabs, carriage returns, newlines, and unnecessary white space found in a typical HTML document. In addition, it should be non-invasive; i.e., it should not affect the functionality of HTML documents containing scripts, CSS instructions, or comment lines. You might need to test the cruncher thoroughly with real documents from the Web, preferably with spam documents and with the source code of emails. To grab the source of an Outlook Express email, open an email you have received and navigate to:

File > Properties > Details > Message Source

Next, right-click the message source and choose Select All to highlight it, then right-click again and choose Copy. You can also press Ctrl+A to select all and Ctrl+C to copy. Paste the source into your HTML cruncher and crunch it.
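
As a starting point only, the two tasks above might be sketched as follows. This is not the course's reference solution: the names isBinaryInput, PROTECTED_REGION, collapseWhitespace, and crunchHtml are hypothetical, and the sketch assumes that each run of white space is collapsed to a single space (rather than deleted outright) so that adjacent words do not merge. It also assumes that comments and the contents of script, style, pre, and textarea elements should pass through untouched.

```javascript
// Sketch only: all names below are hypothetical, not part of the course materials.

// Task 1: accept only non-empty strings made of the characters 0 and 1.
function isBinaryInput(value) {
  return /^[01]+$/.test(value);
}

// Regions the cruncher must not touch: comments and the contents of
// script, style, pre, and textarea elements.
const PROTECTED_REGION =
  /<!--[\s\S]*?-->|<(script|style|pre|textarea)\b[\s\S]*?<\/\1\s*>/gi;

// Task 2, step 1: collapse each run of spaces, tabs, carriage returns,
// and newlines into one space (assumption: a space is kept so that
// adjacent words do not merge).
String.prototype.collapseWhitespace = function () {
  return this.replace(/[ \t\r\n]+/g, " ");
};

// Task 2, step 2: crunch everything outside the protected regions and
// copy the protected regions through unchanged.
String.prototype.crunchHtml = function () {
  const source = String(this);
  let output = "";
  let last = 0;
  for (const match of source.matchAll(PROTECTED_REGION)) {
    output += source.slice(last, match.index).collapseWhitespace() + match[0];
    last = match.index + match[0].length;
  }
  return (output + source.slice(last).collapseWhitespace()).trim();
};
```

With this sketch, isBinaryInput("10110") returns true and isBinaryInput("10 2") returns false. Calling crunchHtml() on a pasted message source leaves script blocks and comment lines byte-for-byte intact, which is one way to approach the non-invasive requirement; inline event handlers and unusual markup would still need testing against real documents.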