This post is for grad students taking the Search Engines Architecture course.
You should have received by email a PDF of Porter’s original 1980 article and a revised version of Lab 5 that follows Porter’s nomenclature, along with the expected results to check against. Please disregard the previous version of the lab.
The stemmer is easy to build in any language. You can look at some versions on the Web to get ideas, but you may not copy them. Your tool should do only what is required in the experiment.
The deadline will be negotiated the next time we meet. If you have any questions, whether about programming or content, feel free to post them on the blog.
Important quotes or notes from Porter’s original article
1. “A consonant in a word is a letter other than a, e, i, o or u, and other than y preceded by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of itself does not make it ambiguous.) So in toy the consonants are t and y, and in syzygy they are s, z and g. If a letter is not a consonant it is a vowel.” –Porter.
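2. Any word has a measure m, defined as the VC frequency (‘shift’): the number of times the pair VC repeats when the word is written in the form [C](VC)^m[V], where V is a sequence of one or more vowels, C is a sequence of one or more consonants, and square brackets mark an optional part; e.g.,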
m = 0: tr, ee, y, by
m = 1: trouble, oats, trees, ivy
m = 2: troubles, private, oaten, orrery
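To check your understanding of the two notes above, here is a minimal Python sketch of the consonant test and the measure m. It is only one possible way to write them, not the lab solution. The names `is_consonant` and `measure` are my own and do not come from Porter's article, and the sketch assumes plain alphabetic English words with no digits or punctuation.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's definition.

    Assumes word is already lowercase (hypothetical helper, not from the article).
    """
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant unless it is preceded by a consonant;
        # at the start of a word nothing precedes it, so it is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    word = word.lower()
    pattern = ""
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        # Collapse runs: "trouble" becomes "CVCV", not "CCVVCCV".
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")


if __name__ == "__main__":
    for w in ["tr", "by", "trouble", "ivy", "troubles", "orrery"]:
        print(w, measure(w))
```

Collapsing each run of consonants or vowels into a single C or V first makes counting trivial: m is just the number of times "VC" appears in the collapsed pattern. You can compare the output against the example words listed under note 2.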