In "A Comparison of Document, Sentence, and Term Event Spaces", Blake analyses the stability of three global weight models, inverse document frequency (IDF), inverse sentence frequency (ISF), and inverse term frequency (ITF), with respect to document and journal corpora. She uses stratified samples drawn on the basis of term frequency information, and studies both abstracts and section partitions of full-text scientific articles.
We are conducting studies along similar lines, but using web collections. Such studies are especially important in a dynamic environment like the web, which differs from collections of abstracts and scientific journal articles. Web collections often consist of documents of general interest or with noisy content. Documents may also be built with dubious practices such as keyword repetition. Such techniques, commonly known as keyword spam, can introduce biased term frequency data.
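To make the concern concrete, here is a minimal sketch of the three global weights computed over a toy collection, once as-is and once with a keyword repeated in one document. The definitions used (log of total documents, sentences, or term occurrences over the corresponding count for the term), the naive tokenizer and sentence splitter, and the function name `global_weights` are assumptions chosen for illustration and are not necessarily the exact formulation used in Blake's paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    # Naive sentence splitter (assumption): break on ., ! and ?
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def global_weights(docs):
    """Return {term: (idf, isf, itf)} for a list of document strings."""
    doc_freq, sent_freq, term_freq = Counter(), Counter(), Counter()
    n_sents = 0
    n_terms = 0
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))      # documents containing the term
        for sent in split_sentences(doc):
            tokens = tokenize(sent)
            if not tokens:
                continue
            n_sents += 1
            sent_freq.update(set(tokens))        # sentences containing the term
            term_freq.update(tokens)             # raw occurrences of the term
            n_terms += len(tokens)
    n_docs = len(docs)
    return {
        t: (
            math.log(n_docs / doc_freq[t]),
            math.log(n_sents / sent_freq[t]),
            math.log(n_terms / term_freq[t]),
        )
        for t in term_freq
    }


# Toy collection (made up for illustration).
clean = [
    "Cheap flights to Rome. Book early.",
    "Rome hotels are busy in summer.",
    "Train travel is slow.",
]
# Same collection, with the keyword repeated at the end of the first document.
spam = [clean[0] + " Rome Rome Rome Rome Rome"] + clean[1:]

w_clean = global_weights(clean)
w_spam = global_weights(spam)
for name, w in (("clean", w_clean), ("spam", w_spam)):
    idf, isf, itf = w["rome"]
    print(f"{name}: IDF={idf:.3f} ISF={isf:.3f} ITF={itf:.3f}")
```

In this toy case the repetition leaves the IDF of "rome" untouched, since the number of documents containing it does not change, while its ISF and ITF both drop. Whether the same pattern holds on real web collections is exactly the kind of stability question raised here.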
It should be underscored that keywords relevant to products and services are also intentionally repeated across the web. In addition, not all document content found in web collections is valid. Some documents are near duplicates, have been artificially generated, or are the result of vested alliances like link exchange programs. Thus, the stability of IDF, ISF, and ITF with respect to web collections, or subsets of them, in such a noisy environment is still an open question.
Any pointers or references to research on this topic would be greatly appreciated.