Please read Glottochronology Part I before reading this post.
Language dating forecasts rest on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability that both occur is
p(A1A2) = p(A1)p(A2)
Some authors, like Sandefur, call this the multiplication principle.
As Sandefur noted, and I quote:
Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.
Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.
Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we would expect A and B to both know PQ per cent of the words.
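To make the Frank and Sue arithmetic concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names expected_overlap and simulate_overlap are mine, not Sandefur's, and the simulation assumes each person's known words are drawn uniformly at random and independently of the other person's.

```python
import random


def expected_overlap(p, q):
    """Fraction of the list both people are expected to know,
    assuming their knowledge is independent (multiplication principle)."""
    return p * q


def simulate_overlap(p, q, list_size=100, trials=2000, seed=0):
    """Estimate the shared fraction by drawing each person's known
    words at random, independently, and averaging over many trials.
    (Hypothetical helper for illustration only.)"""
    rng = random.Random(seed)
    words = range(list_size)
    known_a = round(p * list_size)  # e.g. Frank: 70 of 100 words
    known_b = round(q * list_size)  # e.g. Sue: 80 of 100 words
    shared = 0
    for _ in range(trials):
        a = set(rng.sample(words, known_a))
        b = set(rng.sample(words, known_b))
        shared += len(a & b)
    return shared / (trials * list_size)


predicted = expected_overlap(0.70, 0.80)   # about 0.56
simulated = simulate_overlap(0.70, 0.80)   # close to 0.56 on average
```

The simulated value lands near 0.56 only because the two word sets are drawn independently; that is exactly the assumption the principle depends on.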
In Part III we will provide some examples that apply this principle to the evolution of cultures.
Later in this series we will explain how the independence assumption affects some of the reasoning and claims behind language dating models.
In the meantime: how relevant is this model to IR? Well, assume that A and B are not Franks and Sues, but passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.
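As a rough illustration of the pitfall in the duplicated-content scenario, the sketch below compares what the multiplication principle predicts with what is actually shared when one document largely copies another. Everything here is hypothetical: the toy vocabulary, the two term sets, and the helper names coverage and independence_gap are assumptions made up for this example, not part of any real IR system.

```python
def coverage(doc_terms, vocabulary):
    """Fraction of the vocabulary that appears in a document."""
    return len(doc_terms & vocabulary) / len(vocabulary)


def independence_gap(doc_a, doc_b, vocabulary):
    """Return (predicted, observed) shared fractions.
    'predicted' applies the multiplication principle;
    'observed' counts the terms the two documents really share."""
    p = coverage(doc_a, vocabulary)
    q = coverage(doc_b, vocabulary)
    predicted = p * q
    observed = len(doc_a & doc_b & vocabulary) / len(vocabulary)
    return predicted, observed


# Toy data (assumed): a 100-term vocabulary and two documents,
# where doc_b contains every term of doc_a, as a near-duplicate would.
vocabulary = set(range(100))
doc_a = set(range(0, 70))   # covers 70 per cent of the vocabulary
doc_b = set(range(0, 80))   # covers 80 per cent, including all of doc_a

predicted, observed = independence_gap(doc_a, doc_b, vocabulary)
# predicted is about 0.56, while observed is 0.70
```

In this toy case the independence assumption underestimates the overlap, because the two documents are not independent at all: one was derived from the other.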