The course final examination is almost here. The test covers both theory and practice. If you are taking it, bring a No. 2 pencil, eraser, calculator, laptop, and all tools developed during the course (parser, crawler, query/URL normalizers, and a working copy of Terrier). You will need this material for the practicum.
Some of the questions call for discussion and sound reasoning, like the ones we worked through during the review session. Consider this one:
Question. String noise can be generated during markup removal, tokenization, filtering, and stemming, especially if we blindly remove apostrophes, possessives, contractions, and stopwords. In which order should you remove these so that the least noise is generated?
See answer at the end.
Thank you for taking this course. By now you probably understand how search engine architectures are designed and how they actually work. At the very least, you have the basics.
I have been asked to teach a graduate course on Text Mining next fall, under the title
Adversarial IR: Web Spam and Search Engines for Penetration Testing
Between now and the fall, anything can happen.
Answer to question:
Step 1. Remove markup and then tokenize according to rules.
Step 2. Remove contractions and then possessives.
Step 3. Remove apostrophes and then stopwords.
Step 4. If applying stemming, use a variant of Porter’s algorithm.
This strategy is only as good as your regular expressions and parsing rules, and it can only be applied on a case-by-case basis (e.g., take care with rules for hyphenated tokens). It is not perfect, but it is workable.
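Steps 1 through 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser built in the course: the names clean, CONTRACTIONS, STOPWORDS, TAG_RE, and TOKEN_RE are hypothetical, the lookup tables are deliberately tiny, contractions are handled by expanding them (one possible reading of "remove"), and stemming (Step 4) is left out to keep the example free of external libraries.

```python
import html
import re

# Hypothetical, deliberately tiny lookup tables; real ones would be much larger.
CONTRACTIONS = {
    "can't": ["can", "not"],
    "don't": ["do", "not"],
    "it's": ["it", "is"],
    "we'll": ["we", "will"],
}
STOPWORDS = {"a", "the", "is", "it", "do", "not", "can", "we", "will", "and", "of"}

# Assumed rules: anything between < and > is markup; a token is a run of
# letters/digits that may contain inner apostrophes or hyphens and may end
# with an apostrophe (plural possessives such as "dogs'").
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*'?")


def clean(text):
    # Step 1: remove markup, then tokenize.
    text = html.unescape(TAG_RE.sub(" ", text))
    text = text.replace("\u2019", "'").lower()  # normalize curly apostrophes
    tokens = TOKEN_RE.findall(text)

    # Step 2: contractions first (so "it's" is not mistaken for a
    # possessive), then possessives ("dog's" -> "dog", "dogs'" -> "dogs").
    expanded = []
    for tok in tokens:
        expanded.extend(CONTRACTIONS.get(tok, [tok]))
    tokens = [re.sub(r"'s?$", "", tok) for tok in expanded]

    # Step 3: remaining apostrophes, then stopwords.
    tokens = [tok.replace("'", "") for tok in tokens]
    return [tok for tok in tokens if tok and tok not in STOPWORDS]


print(clean("<p>It's the dog's bone and we'll share, don't worry.</p>"))
# -> ['dog', 'bone', 'share', 'worry']
```

The order matters. Had the apostrophes been stripped first, "we'll" would have become "well" and "dog's" would have become "dogs", and neither the contraction table nor the possessive rule could have recovered the original words.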