Here is a nice article about the risks of misusing big data.

Here are my comments on the topic:

1. Most traditional tests of statistical significance were designed for small data sets, not for big data, unless stratification of the big data is possible. In general, given a large enough data set, a t-test on even a very small correlation coefficient can be forced to become statistically significant, and therefore misleading. This effect is an artifact of the equations involved: the test statistic grows with the square root of the sample size while the correlation stays fixed (see the first sketch after these comments).

2. Although no data set is exactly normally distributed, most statistical analyses require the data to be approximately normal for their findings to be valid. This is one of the first things a peer reviewer of a statistical article will look for. Methods and techniques for transforming data toward normality prior to analysis do exist, although some data sets may not be transformable, and forcing them to adopt normality can be contraindicated (see the second sketch below).

3. Avoid arithmetically adding and averaging correlation coefficients, standard deviations, slopes, cosine similarity measures, and dissimilar ratios in general. They are simply not additive, regardless of what some outdated meta-analysis articles say or what you hear from groupie search marketers (“Why regurgitate in blogs what you don’t understand?”). 🙂 The third sketch below puts numbers on how far the naive averages drift.
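
To make point 1 concrete, here is a minimal sketch, assuming Python with SciPy (the library choice and the numbers are mine, not the article's), of how a fixed, negligible correlation is driven to "significance" by sample size alone:

```python
# A minimal sketch of point 1: the t-test for a Pearson correlation uses
# t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, so t
# grows like sqrt(n) while the effect size r stays fixed.
import math
from scipy import stats

def p_value_for_r(r, n):
    """Two-sided p-value for H0: rho = 0, given sample r and sample size n."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

r = 0.01  # a practically meaningless correlation (r^2 = 0.0001)
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}   p = {p_value_for_r(r, n):.3g}")
# n =       100   p ~ 0.92   (nowhere near significant)
# n =    10,000   p ~ 0.32
# n = 1,000,000   p ~ 1e-23  ("significant" purely by force of sample size)
```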
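For point 2, a minimal sketch of one common workflow, again assuming SciPy: a Shapiro-Wilk normality check followed by a Box-Cox transform, which is one standard technique and requires strictly positive data. The lognormal sample is an assumed stand-in chosen because it transforms cleanly; a bimodal or zero-inflated data set would not, which is the caveat above:

```python
# A minimal sketch of point 2: test for approximate normality, then try
# a Box-Cox transform when the raw data fail the test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)  # clearly non-normal

_, p_raw = stats.shapiro(skewed[:500])       # Shapiro-Wilk on a subsample
print(f"raw data:      Shapiro-Wilk p = {p_raw:.3g}")  # ~0: reject normality

transformed, lam = stats.boxcox(skewed)      # lambda fitted by max likelihood
_, p_tx = stats.shapiro(transformed[:500])
print(f"after Box-Cox: lambda = {lam:.2f}, Shapiro-Wilk p = {p_tx:.3g}")
# A lognormal sample maps to normal (lambda near 0 means a log transform),
# but a bimodal or zero-inflated data set would not cooperate; forcing
# normality there is exactly the contraindication mentioned above.
```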
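Finally, for point 3, a minimal sketch with hypothetical numbers showing the drift. Variances average (for equal-size, equal-mean groups) but standard deviations do not, and the arithmetic mean of correlation coefficients disagrees with averaging on Fisher's z scale; the z comparison is included only to expose the nonlinearity of the r scale, not to endorse any particular meta-analytic recipe:

```python
# A minimal sketch of point 3: these quantities are simply not additive.
import numpy as np

# -- standard deviations: the pooled SD is not the mean of the SDs --
sd = np.array([1.0, 9.0])                  # two equal-size groups
naive_sd = sd.mean()                       # 5.00 (wrong)
pooled_sd = np.sqrt((sd ** 2).mean())      # sqrt(41) ~ 6.40 (variances average)
print(f"mean of SDs = {naive_sd:.2f}, pooled SD = {pooled_sd:.2f}")

# -- correlations: r lives on a bounded, nonlinear scale --
r = np.array([0.20, 0.90, 0.95])           # three study-level correlations
naive_r = r.mean()                         # 0.683
z_mean_r = np.tanh(np.arctanh(r).mean())   # ~0.82 via z = arctanh(r)
print(f"mean of rs  = {naive_r:.3f}, Fisher-z mean = {z_mean_r:.3f}")
```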
