During this week, although not back-to-back, I will be posting on glottochronology.

*Glottochronology* is a combination of Greek terms that essentially means language dating.

Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated a few pages to the topic in his great book *Discrete Dynamical Systems* (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.

Sandefur wrote:

> We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of *L* words (say *L* = 250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.
>
> Let one unit of time be 1 year. Thus, time *n* will be *n* years. Let *A(n)* represent the per cent of the original list of words still in use *n* years later, given as a decimal. The basic assumption is that the percent *A(n+1)* of the original list of words in use at time *n+1* is proportional to the per cent of the original list of words in use at time *n*, that is,
>
> A(n+1) = rA(n),
>
> where *r* is a positive constant less than one. At time 0, all of the original list of words is in use, so *A(0) = 1*. Therefore, at time *k*, *A(k) = r^k(1) = r^k* is the percent of the original list of words still in use, as a decimal.
>
> Since languages change slowly, *r* should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millennium ago, glottochronologists can estimate *r^1000*. This number *r* also depends on the particular language. But glottochronologists have found that the number *r^1000* is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate *r*, we will assume that
>
> r^1000 = 0.805
>
> Thus, the per cent of the original list of words that are still in use *k* years later is
>
> r^k = (0.805)^(0.001 k).
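
To make the arithmetic concrete, here is a minimal Python sketch of the model in the quoted passage. It is my own illustration, not code from Sandefur's book: the function names are made up for this post, and the inverse function (solving *A(k) = r^k* for *k*) is just the algebra one would presumably use to turn a measured retention fraction into a date estimate, under the same assumption that *r^1000 = 0.805*.

```python
import math

# Retention over 1000 years, taken from the quoted passage (r^1000 = 0.805).
R_PER_MILLENNIUM = 0.805


def fraction_retained(years, r_millennium=R_PER_MILLENNIUM):
    """Closed form A(k) = r^k = r_millennium^(0.001 * k), as a decimal."""
    return r_millennium ** (0.001 * years)


def annual_rate(r_millennium=R_PER_MILLENNIUM):
    """Per-year constant r implied by the per-millennium value."""
    return r_millennium ** (1 / 1000)


def retained_by_iteration(years, r_millennium=R_PER_MILLENNIUM):
    """Iterate A(n+1) = r * A(n) from A(0) = 1, one year at a time."""
    r = annual_rate(r_millennium)
    a = 1.0
    for _ in range(years):
        a = r * a
    return a


def years_elapsed(fraction, r_millennium=R_PER_MILLENNIUM):
    """Invert A(k) = r^k for k: k = 1000 * ln(A) / ln(r_millennium).

    Hypothetical helper: this inversion is not in the quoted text.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1000 * math.log(fraction) / math.log(r_millennium)


if __name__ == "__main__":
    for k in (0, 500, 1000, 2000):
        print(k, round(fraction_retained(k), 4))
    print("years for 50% retention:", round(years_elapsed(0.5)))
```

After 2000 years the model predicts 0.805 squared, or roughly 65% of the original list still in use. The year-by-year iteration and the closed form agree up to floating-point rounding, which is all the difference equation *A(n+1) = rA(n)* is saying.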

Glottochronology is one of those fields that were once popular but that many now doubt, because of questionable measurements and assumptions. One of those assumptions is term independence.

It seems that term independence is **The Original Sin** in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.