Here are some useful notes for those taking the Search Engines Architecture grad course.

In Lab 4: Experiment in Parsing Techniques, you are building a Query Normalizer, a Document Parser, and a URL Normalizer.

In section 1, be sure the tool removes all kinds of HTML instructions from the search interface.
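
As a rough illustration of the idea, here is a minimal Python sketch of a query normalizer. The function name `normalize_query` is hypothetical, and lowercasing and whitespace collapsing are my assumptions rather than lab requirements. Entities are decoded first so that encoded tags are stripped as well.

```python
import html
import re

def normalize_query(raw: str) -> str:
    """Strip HTML instructions from a raw query string (hypothetical helper)."""
    # Decode entities first, so that &lt;script&gt; is treated like <script>.
    text = html.unescape(raw)
    # Drop whole <script>/<style> elements, including their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", text)
    # Drop HTML comments and any remaining tags.
    text = re.sub(r"(?s)<!--.*?-->|<[^>]*>", " ", text)
    # Assumption: collapse whitespace and lowercase the result.
    return re.sub(r"\s+", " ", text).strip().lower()
```

For instance, a query such as `<b>Search</b> Engines` would come out as plain lowercase text with the tags gone.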

In section 2, the document parser should perform document linearization, tokenization, and filtration, but no stemming. The final output should be a list of tuples, each consisting of a unique term and its number of occurrences.
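
The pipeline can be sketched as follows. This is only one possible shape, not the expected solution: the name `parse_document`, the tiny `STOPWORDS` set, and the alphanumeric tokenization rule are all illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real filter would be much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def parse_document(markup: str) -> list[tuple[str, int]]:
    """Return (term, occurrences) tuples for a marked-up document."""
    # Linearization: flatten the markup into a plain text stream.
    text = re.sub(r"(?s)<[^>]*>", " ", markup)
    # Tokenization: lowercase, then split into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Filtration: drop stopwords. No stemming is applied.
    kept = [t for t in tokens if t not in STOPWORDS]
    # Unique terms with their counts, sorted alphabetically.
    return sorted(Counter(kept).items())
```

Because there is no stemming, "parser" and "parses" stay as separate terms in the output.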

Regarding section 3.1, Building a URL Normalizer, I’m adding new restrictions to this part.

This part challenges you to use regular expressions only. No arrays, no scripting loops, no conditionals (if-then), no lookup libraries, no DNS lookups, just regexps. It should work for all valid URLs available on the Web. It can be done.

The parser for the URL normalizer tool should remove from a URL all kinds of:

protocols (http, https, ftp, etc)
www prefixes
top-level domains (TLD)
ports (:80, :8000, etc)
file extensions (.html, .php, etc)
parameters (name=value pairs)
named anchors or fragments (#)
URL-forbidden characters, international characters, or script lines
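
To give a feel for the regexp-only approach, here is a rough Python sketch that chains substitutions, one per item in the list above, with no loops or conditionals. It is a starting point under stated assumptions, not a complete answer: `normalize_url` and the `example.com` URLs are hypothetical, the character whitelist in the last step is my guess at "forbidden", and cases such as user:password@ prefixes, IP-address hosts, and IPv6 literals are not handled.

```python
import re

def normalize_url(url: str) -> str:
    """Chain of regex substitutions; no loops, no conditionals (sketch only)."""
    s = url.strip()
    # Protocol: http://, https://, ftp://, etc.
    s = re.sub(r"^[a-z][a-z0-9+.-]*://", "", s, flags=re.I)
    # Parameters and named anchors: everything from the first ? or #.
    s = re.sub(r"[?#].*$", "", s)
    # www prefix (also www2., www3., ...).
    s = re.sub(r"^www\d*\.", "", s, flags=re.I)
    # Port: a colon plus digits right after the host.
    s = re.sub(r"^([^/:]+):\d+", r"\1", s)
    # TLD: only the last label of the host, just before the path or the end.
    s = re.sub(r"^([^/]+)\.[a-z0-9-]+(?=/|$)", r"\1", s, flags=re.I)
    # File extension: a trailing .ext on the last path segment only.
    s = re.sub(r"(/[^/]*)\.[a-z0-9]+$", r"\1", s, flags=re.I)
    # Assumption: anything outside this ASCII whitelist counts as forbidden.
    s = re.sub(r"[^A-Za-z0-9./_-]", "", s)
    return s
```

Under these assumptions, `https://www.example.com:8080/docs/index.html?x=1&y=2#top` reduces to `example/docs/index`. Note that the order of the substitutions matters: the query string is dropped before the port step so that a colon inside a parameter is never mistaken for a port.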

Be sure you understand the difference between top-level domains and subdomains. For example, .com and .pr are TLDs, but defines a subdomain.

Let's say we have the site. If we remove http://www. and .pr from this we end up with, but .com is not the TLD in this case. The TLD is still .pr, though. Redirection mechanisms are “another twenty bucks” (“otros veinte pesos”), which might confuse the concept.

You should also know that a TLD can be a generic top-level domain (gTLD), a country-code top-level domain (ccTLD), an international TLD (iTLD), or a US legacy TLD (usTLD). When combined, these define a subdomain. For example, is a subdomain with .pr as TLD is a subdomain with .uk as TLD

That is, for a URL on the Web, only one string sequence works as and defines the top-level domain.
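
The rule that only the last label counts can be checked with a single pattern. The sketch below is illustrative only: `tld_of` and the `example` hostnames are hypothetical, and the pattern assumes a plain ASCII hostname with an optional trailing dot.

```python
import re

# The TLD is the final dot-separated label of the hostname.
TLD_PATTERN = re.compile(r"\.([a-z0-9-]+)\.?$", re.IGNORECASE)

def tld_of(host: str):
    """Return the last label of a hostname, or None if there is no dot."""
    match = TLD_PATTERN.search(host)
    return match.group(1).lower() if match else None
```

With this pattern, a hostname ending in `.com.pr` yields `pr` and one ending in `.co.uk` yields `uk`; the `.com` and `.co` parts belong to the subdomain, not to the TLD.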

Thus, for:

your tool should return:

Recommended Lecture Material:  
