Here are some useful notes for those taking the Search Engines Architecture grad course.
In Lab 4: Experiment in Parsing Techniques, you are building a Query Normalizer, a Document Parser, and a URL Normalizer.
In section 1, be sure the tool removes all kinds of HTML instructions from the search interface.
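As a rough illustration of that requirement, here is a minimal Python sketch that strips HTML instructions from a query string. The function name `normalize_query` and the decision to decode entities before stripping tags are my assumptions, not part of the lab statement, and a real query normalizer would likely need more rules.

```python
import html
import re

# Whole <script>...</script> and <style>...</style> blocks, contents included.
BLOCK_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Anything else that looks like an HTML tag.
TAG_RE = re.compile(r"<[^>]*>")

def normalize_query(raw):
    """Return the query text with HTML instructions removed (hypothetical helper)."""
    text = html.unescape(raw)      # turn &lt;b&gt; into <b> so it gets stripped too
    text = BLOCK_RE.sub(" ", text)  # drop script/style blocks entirely
    text = TAG_RE.sub(" ", text)    # drop the remaining tags
    return " ".join(text.split())   # collapse leftover whitespace
```

For example, `normalize_query("<b>search</b> engines")` returns `"search engines"`.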
In section 2, the document parser should do document linearization, tokenization, and filtration, but no stemming. The final output should be a list of tuples consisting of unique terms and their occurrences.
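The pipeline above can be sketched in Python roughly as follows. This is only an illustration: I am assuming that "filtration" means stopword removal, the `STOPWORDS` set is a tiny made-up list, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list used for the filtration step.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def linearize(html_doc):
    """Flatten markup into plain text: drop script/style blocks, then tags."""
    no_blocks = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html_doc)
    return re.sub(r"(?s)<[^>]*>", " ", no_blocks)

def tokenize(text):
    """Split lowercased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def parse_document(html_doc):
    """Return a sorted list of (term, occurrences) tuples, with no stemming."""
    tokens = [t for t in tokenize(linearize(html_doc)) if t not in STOPWORDS]
    return sorted(Counter(tokens).items())
```

With the input `"<h1>The Search</h1><p>search engines and the Web</p>"`, this returns `[("engines", 1), ("search", 2), ("web", 1)]`.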
Regarding section 3.1, Building a URL Normalizer, I’m adding new restrictions to this part.
This part challenges you to use regular expressions only. No arrays, no scripting loops, no conditionals (if-then), no lookup libraries, no DNS lookups, just regexps. It should work for all valid URLs available on the Web. It can be done.
The parser for the URL normalizer tool should remove from a URL all kinds of:
protocols (http, https, ftp, etc.)
www prefixes
top-level domains (TLD)
ports (:80, :8000, etc.)
file extensions (.html, .php, etc.)
parameters (name=value pairs)
named anchors or fragments (#)
URL-forbidden characters, international characters, or script lines
Be sure you understand the difference between top-level domains and subdomains. For example, .com and .pr are TLDs, but .com.pr defines a subdomain.
Let's say we have the http://www.xxx.com.pr site. If we remove http://www. and .pr from it, we end up with xxx.com, but .com is not the TLD in this case. The TLD is still .pr. Redirection mechanisms are “another twenty bucks” (“otros veinte pesos”), which might confuse the concept.
You should also know that a TLD can be a generic top-level domain (gTLD), a country-code top-level domain (ccTLD), an international TLD (iTLD), or a US legacy TLD (usTLD). When combined, these define a subdomain. For example,
.edu.pr is a subdomain with .pr as TLD
.co.uk is a subdomain with .uk as TLD
That is, for a URL on the Web, only one string sequence acts as, and defines, the top-level domain.
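To make the point concrete, here is a small Python sketch that picks out that single sequence: the last dot-separated label of the host. The helper name `tld_of` is hypothetical, and this is only an illustration of the concept, not the lab's normalizer (it uses a conditional, which the lab forbids).

```python
import re

# The host runs up to the first ':', '/', '?' or '#'.
# Its last dot-separated label is captured as the TLD.
TLD_RE = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?[^/:?#]+\.([a-z0-9-]+)(?=[:/?#]|$)",
    re.IGNORECASE,
)

def tld_of(url):
    """Return the TLD of a URL in lowercase, or None if no host label is found."""
    match = TLD_RE.match(url)
    return match.group(1).lower() if match else None
```

For http://www.xxx.com.pr this returns `"pr"`, not `"com"`, which matches the explanation above.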
Thus, for:
http://www.google.com.pr
https://www.google.com/adsense/login/en_US/?gsessionid=wfx3oxHhgDU
ftp://www.gobierno.pr/index.html
http://search.music.yahoo.com/search/?m=all&p=britney
http://www.telegraph.co.uk
http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s
http://www.lis.ntu.edu.tw/~mctang/index.htm
your tool should return:
google.com
google
gobierno
search.music.yahoo
telegraph.co
video.google.co
lis.ntu.edu
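For those who want a starting point, here is one possible Python sketch that reproduces the results listed above with a single regular-expression substitution. The names `URL_NOISE` and `normalize_url` are my own, and this is not an official solution: it does not deal with URL-forbidden or international characters, user credentials in the URL, or IP-address hosts, so I make no claim that it covers all valid URLs.

```python
import re

# One pattern, two alternatives:
#   1) the protocol and the www prefix at the start of the URL
#   2) the last host label (the TLD) plus everything after it
URL_NOISE = re.compile(
    r"""
    ^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?   # protocol and www prefix
    |
    \.[a-z0-9-]+                           # last host label, i.e. the TLD
    (?::\d+)?                              # optional port
    (?:[/?#].*)?$                          # optional path, parameters, fragment
    """,
    re.IGNORECASE | re.VERBOSE,
)

def normalize_url(url):
    """Strip everything matched by URL_NOISE; no loops or conditionals needed."""
    return URL_NOISE.sub("", url)
```

The second alternative can only match at the last host label, because every earlier label is followed by another dot rather than a port, a path, or the end of the string. For example, `normalize_url("http://www.telegraph.co.uk")` returns `"telegraph.co"`.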
Recommended Reading Material:
http://en.wikipedia.org/wiki/.pr
http://www.icann.org/meetings/saopaulo/presentation-dns-conrad-07dec06.pdf
http://www.mattcutts.com/blog/seo-glossary-url-definitions/
April 15, 2008 at 11:13 am
1. Correction:
During the last lecture, I gave you a CSS file with block lines ending with semicolons. One of the students (Gina) pointed out to me that this should not be done. She is right, as the CSS will not validate, nor will it be interpreted by Firefox (IE is more lenient in this respect). Thanks, Gina.
Please remove the semicolons and you will see the CSS in action in both browsers. When I wrote the CSS files and the companion JavaScript instructions, my text editor was set to end all lines with a semicolon. This explains the outcome. I forgot to double-check the output in both browsers. My fault.
Speaking of semicolons, Yahoo!’s Douglas Crockford, creator of JavaScript Object Notation (JSON), recommends ending function blocks with a semicolon, especially when they are part of prototype lines. http://yuiblog.com/blog/2007/01/24/video-crockford-tjpl/
Interestingly, he also suggests using === instead of == in JavaScript to avoid type coercion. My old scripts use ===, but when I run them with the latest version of IE on Windows Vista they seem to falter. With == they seem to work just fine. I’m researching what causes the glitch in my scripts.
2. Reminder:
A reminder that to get full credit for your lab reports you need to turn in both hard and electronic copies.
Don’t forget to document all findings, including installation how-tos if any are needed.