Yesterday we posted on Kincaid’s ARI. Although not mentioned, the main reason for blogging about it was that we are testing some readability index (RI) tools to be integrated into Mi Islita.com.
A kind of endless readability “war” has raged since the invention of the first tests. Thanks to the Internet, this war is in full swing among academics. When it comes to computing such scores/indexes, should we:
include or exclude random samples from a given text? (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.5934&rep=rep1&type=pdf)
include or exclude syllables? (http://www.jstor.org/pss/27531033)
include or exclude punctuation?
include or exclude typos?
include or exclude spaces?
include or exclude non-dictionary terms?
Since punctuation can affect readability, semantics, coherence, etc. (http://www.aelfe.org/documents/text4-Sancho.pdf), why remove punctuation from the analysis? Unfortunately, current readability formulas measure only structural difficulty (http://www.iacis.org/iis/2008_iis/pdf/S2008_1071.pdf).
To do or not to do tokenization… that is the question.
It is funny that some algorithms define a “word” as any continuous sequence of non-space characters, regardless of whether these are punctuation marks or valid words. Yet word lists free of punctuation and spaces are then used to assess the performance of those same algorithms. That smells… and it is not lavender.
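To make the point concrete, here is a minimal Python sketch. The sample sentence, the two toy tokenizers, and the character-counting convention are our own assumptions for illustration; the ARI formula is the standard one. Notice that the same text gets a different score depending on whether punctuation runs count as “words”:

```python
import re

def ari(chars, words, sentences):
    # Automated Readability Index
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

text = "Hello, world -- this is a test! It has two sentences."

# Count only letters/digits as "characters", a common ARI convention
chars = sum(c.isalnum() for c in text)
sentences = 2  # assume a sentence splitter found both sentences

# Tokenizer A: a "word" is any maximal run of non-space characters,
# so "Hello," and the bare "--" both count as words
words_a = len(text.split())                    # 11 "words"

# Tokenizer B: a "word" is a maximal run of letters; punctuation excluded
words_b = len(re.findall(r"[A-Za-z]+", text))  # 10 words

print(words_a, round(ari(chars, words_a, sentences), 2))  # 11 -2.41
print(words_b, round(ari(chars, words_b, sentences), 2))  # 10 -1.03
```

One stray “--” token shifts the score by more than a full point on this toy sentence, which is exactly why the include/exclude questions above matter.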
There is also the problem of scoring non-narrative text, as found in most Web documents, with formulas intended for scoring narrative text. How futile is that?
And how futile is scoring Web documents with ever-changing dynamic content?
Do readability formulas work? (http://blogs.wsj.com/numbersguy/do-readability-formulas-work-297/tab/article/)
And if so, which tool is better? It all depends on who is counting, how, and why.