A grad student asked me about n-grams in IR as a thesis topic.

What are n-grams

Well, most of the modern work on n-grams is due to the work by D’Amore and Mah (1985) ONE-TIME COMPLETE INDEXING OF TEXT: THEORY AND PRACTICE

Page 116 of Grossman and Frieder (Information Retrieval: Algorithms and Heuristics) has great introductory material.

An n-gram is a fragment of length n of a word of length L. For example, replaces is a word of length L = 8 that has the following set of bigrams (2-grams; n = 2) and trigrams (3-grams; n = 3):

BI    =  {re, ep, pl, la, ac, ce, es} – for a total of N = 7 bigrams
TRI =  {rep, epl, pla, lac, ace, ces} – for a total of N = 6 trigrams

Additional n-grams are extracted for 0 < n <= L.

The total number of n-grams (N) assuming that we are shifting a window of length n one character at a time is given by

N = L – n + 1

Thus, if L = 8 and n = 2, 3, 4, 5, 6, 7

then

N = 7, 6, 5, 4, 3, 2

Evidently, for L = n, N = 1, resulting in an identity n-gram, which is the word itself. Thus, replaces is the identity n-gram of replaces. Similarly a two-term sequence has two identity n-grams, each leading to many n-grams. Some refer to these just as n-grams. Thus, a three- and four-string sequence is also termed a 3-gram and a 4-gram. In this sense, sentence fragments are also n-grams. Examples are given in All Our n-Gram are Belong to You.

What n-grams are good for

The process of matching query terms to terms in documents performs poorly against noisy text (e.g., due to mispellings or noisy OCR). The use of n-grams has been proposed to deal with such noise. The idea is to decompose words into n-grams and then develop n-gram matching algorithms. This strategy has been applied to detect and correct mispellings, to determine the authorship of documents, and to detect duplicated content.

D’Amore and Mah proposed a model to estimate the noise in indexing and to determinate approximate document similarity measures. In their model the weight of n-grams is not based just on occurrences (frequencies), but using a scaling method that normalizes document lengths. This makes sense since:

  1. large documents contain more n-grams than small documents.
  2. unlike words in documents, n-grams in documents do not follow a Zipf distribution.
  3. with scaling we can use thresholds for telling how significant the similarity between any two documents is, regardless of their length.

The authors found that n-grams are not equally likely to occur, but after removal of stopwords are closer to follow a binomial distribution than regular terms.

Read D’Amore and Mah paper. Although they were unable to apply their research work to a large corpus (*) their paper is still a must-read, not to mention that n-gram based IR can be a great thesis topic for a grad student.

Note: * Fortunately this is no longer the case, thanks to Google.

Advertisements