This week I will be posting on glottochronology, although not in back-to-back posts.

Glottochronology is a combination of Greek terms that essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated a few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.

Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the percent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,

A(n+1) = rA(n),

where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0)=1. Therefore, at time k, A(k)=r^k(1) = r^k is the percent of the original list of words still in use, as a decimal.

Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millennium ago, glottochronologists can estimate r^1000. This number r also depends on the particular language. But glottochronologists have found that the number r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that

r^1000 = 0.805

Thus, the per cent of the original list of words that are still in use k years later is

r^k = (0.805)^(0.001 k).
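To make Sandefur's formula concrete, here is a minimal Python sketch of my own (it is not from the book). It assumes the quoted value r^1000 = 0.805; the function names fraction_retained, years_elapsed, and iterate_recurrence are hypothetical, chosen only for illustration. It evaluates the closed form r^k, inverts it to get an elapsed time from an observed retained fraction, and checks the closed form against the year-by-year recurrence A(n+1) = rA(n).

```python
import math

# Retention per millennium quoted by Sandefur; treated here as an assumption.
R_1000 = 0.805


def fraction_retained(k_years, r_1000=R_1000):
    """Closed form: fraction of the original word list still in use after k_years."""
    return r_1000 ** (k_years / 1000.0)


def years_elapsed(fraction, r_1000=R_1000):
    """Invert the closed form: years needed for the retained fraction to reach `fraction`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1000.0 * math.log(fraction) / math.log(r_1000)


def iterate_recurrence(n_years, r_1000=R_1000):
    """Apply A(n+1) = r * A(n) one year at a time, starting from A(0) = 1."""
    r = r_1000 ** (1.0 / 1000.0)  # yearly retention rate implied by r^1000
    a = 1.0
    for _ in range(n_years):
        a *= r
    return a


if __name__ == "__main__":
    for k in (500, 1000, 2000):
        print(k, "years:", fraction_retained(k))
    print("half of the list lost after about", years_elapsed(0.5), "years")
```

Under this assumption, years_elapsed(0.5) comes out at roughly 3,200 years, that is, the time for half of the original list to fall out of use. The inversion is the step a glottochronologist would use for dating: observe the retained fraction, then solve for k.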

Glottochronology is one of those fields that were once popular but that many now doubt, due to questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.