In  I explained to the SEO community the concept of document linearization as part of document GAP analysis. Marketers learned what IR graduate students already know: that document linearization (i.e., markup removal) is just one component of document indexing.

Keyword distribution, word distances, phrase matching, etc. are obtained from the text stream that results from linearization, not from the apparent position of text as rendered by a browser and visually inspected by average end users. Document linearization therefore debunks the common SEO Keyword Density (KD) Myth. The apparent distribution of words that end users perceive when they visually scan a document is one thing; the actual word distribution as parsed by a search engine is another. This makes the futility of computing KD values quite obvious.
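To make the gap between rendered position and stream position concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular search engine works: the `Linearizer` class, the `linearize` function, and the sample table markup are all hypothetical, and the sketch assumes that linearization simply emits text nodes in source order after dropping the markup.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Hypothetical helper: collects text tokens in source order, dropping markup."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible text; lowercase and split on whitespace.
        if not self._skip_depth:
            self.tokens.extend(data.lower().split())


def linearize(html):
    """Return the linearized text stream of an HTML string as a list of tokens."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Assumed sample page: in a browser, "cheap" sits directly above "flights"
# in the left column, so the two words look adjacent to a human reader.
sample = (
    "<table>"
    "<tr><td>cheap</td><td>Our store offers many products</td></tr>"
    "<tr><td>flights</td><td>at low prices</td></tr>"
    "</table>"
)

stream = linearize(sample)
distance = stream.index("flights") - stream.index("cheap")
print(stream)
print("word distance between 'cheap' and 'flights':", distance)
```

In the rendered table the two words appear stacked in the same column, yet in the linearized stream five other tokens sit between them. Any word distance or phrase match computed from the stream reflects that source order, not the layout the visitor sees.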

