In a nutshell, because most are based on flawed statistics.

**The Question of Standard Deviations and Variances**

If you have studied for the College Board examination, you should know that standard deviations are not additive. You should also know that variances *are* additive for independent random variables. Read the article Why Variances Add — And Why It Matters. Many SEOs are unaware of this.
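The point is easy to check numerically. A minimal sketch, using simulated independent normal variables (the sample sizes and standard deviations are invented for illustration): the variance of a sum matches the sum of the variances, while the standard deviation of the sum does not match the sum of the standard deviations.

```python
import random
import statistics

random.seed(42)

# Two independent random variables, simulated as large samples.
x = [random.gauss(0, 3) for _ in range(100_000)]   # sd = 3, var = 9
y = [random.gauss(0, 4) for _ in range(100_000)]   # sd = 4, var = 16
s = [a + b for a, b in zip(x, y)]                  # their sum

# Variances add: var(x + y) is close to 9 + 16 = 25.
print(statistics.pvariance(s))

# Standard deviations do NOT add: sd(x + y) is close to 5, not 3 + 4 = 7.
print(statistics.pstdev(s))
```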

**The Question of Correlation Coefficients**

Like standard deviations, correlation coefficients are not additive, period. Since they cannot be added, no arithmetic average can be computed from them. The same can be said of cosines, cosine similarities, slopes, and, in general, of any such ratio of dissimilar quantities. Read the *Communications in Statistics* article The Self-Weighting Model, wherein flaws in the two main meta-analysis models are documented. Again, many SEOs do not understand this point.
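A small sketch of why naively averaging correlation coefficients misleads, with data invented for illustration: two groups each show a perfect positive correlation, so their average r is 1.0, yet the pooled data are strongly negatively correlated.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two hypothetical groups, each perfectly positively correlated...
x1, y1 = [1, 2, 3], [10, 11, 12]
x2, y2 = [4, 5, 6], [1, 2, 3]
avg_r = (pearson(x1, y1) + pearson(x2, y2)) / 2   # exactly 1.0

# ...yet the pooled data are strongly NEGATIVELY correlated.
pooled_r = pearson(x1 + x2, y1 + y2)
print(f"average of r's: {avg_r:.2f}   pooled r: {pooled_r:.2f}")
```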

**The Question of Normality**

Although no data set is exactly normally distributed, most statistical analyses require the data to be approximately normal for their findings to be valid. Otherwise one cannot claim that, for instance, a computed arithmetic mean (average) is a valid estimator of central tendency for the data at hand. Most SEOs and some “web analytics gurus” out there simply take some data and average it without first running a normality test.
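A quick sketch of the problem, using simulated right-skewed data (the lognormal parameters are invented for illustration): for skewed data the mean sits well above the median, so reporting the mean as "typical" misleads. Comparing mean and median is only a crude symmetry check, not a substitute for a proper normality test, but it already catches this case.

```python
import random
import statistics

random.seed(7)

# Heavily right-skewed data (think page views per URL): lognormal sketch.
data = [random.lognormvariate(0, 1) for _ in range(10_000)]

mean = statistics.fmean(data)
median = statistics.median(data)
print(f"mean={mean:.2f}  median={median:.2f}")  # mean well above the median

# Crude symmetry check before trusting the mean as a central tendency:
skewed = (mean - median) / statistics.pstdev(data) > 0.1
print("looks skewed: the mean is a poor summary" if skewed
      else "roughly symmetric")
```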

**The Question of Big Data and the t-Test of Significance**

When the Fathers of Statistics (Fisher and company) came up with the t-test of significance and similar tests, these were meant to be used with small data sets, not big data sets. To illustrate: take a very large data set of N paired results, compute a statistic (e.g., a correlation coefficient), and compare it against a t-table value; eventually it will pass the test of significance. This holds for experimental correlations as small as 0.1, 0.01, 0.001, and so on, provided that N is large enough. Claims of statistical significance are in this case useless. This is why with big data you should try data stratification methods, followed by weighting methods. Big data can lead to big statistical pitfalls.
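The effect is easy to see from the usual t statistic for testing whether a correlation differs from zero, t = r·sqrt((N − 2)/(1 − r²)). A sketch, with r = 0.01 chosen as an example of a practically meaningless correlation:

```python
import math

def t_stat(r, n):
    """t statistic for testing r = 0 with n paired observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))

r = 0.01  # a practically meaningless correlation
for n in (100, 10_000, 1_000_000):
    t = t_stat(r, n)
    # For large n the two-sided 5% critical value is about 1.96.
    print(f"n={n:>9}  t={t:.2f}  passes test: {t > 1.96}")
```

The same r = 0.01 fails the test at N = 100 but sails through at N = 1,000,000: significance here measures the sample size, not the strength of the relationship.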

**The Question of Average of Ratios or Ratio of Averages**

Ratios cannot be added and then averaged arithmetically, period. A ratio of averages must be used instead of an average of ratios. The reason is that a ratio distribution (e.g., the ratio of two normal variables) is Cauchy. A Cauchy distribution is often mistaken for a normal one, but it has no mean, variance, or higher moments: as more samples are taken, the sample mean and variance change with an increasing bias. An arithmetic mean computed from Cauchy-distributed data is not an estimate of central tendency. SEOs should know what they are averaging. Check one of my old posts and the comments that followed at

http://irthoughts.wordpress.com/2012/06/04/when-big-data-leads-to-big-errors/#comment-1469
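A numeric sketch of the average-of-ratios trap, with click-through figures invented for illustration: averaging per-page CTRs gives every page equal weight regardless of volume, while the ratio of averages (total clicks over total impressions) weights pages properly.

```python
# Hypothetical click-through rates for three pages: clicks / impressions.
clicks      = [ 5,   50,   500]
impressions = [10, 1000, 50000]

ratios = [c / i for c, i in zip(clicks, impressions)]
avg_of_ratios = sum(ratios) / len(ratios)        # tiny pages weigh as much as huge ones
ratio_of_avgs = sum(clicks) / sum(impressions)   # overall CTR, properly weighted

print(f"average of ratios: {avg_of_ratios:.4f}")
print(f"ratio of averages: {ratio_of_avgs:.4f}")
```

Here the average of ratios is roughly 0.19 while the overall CTR is about 0.011, an order-of-magnitude distortion driven entirely by one tiny page.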

To sum up, beware of SEO statistical “studies”.

Leonid Boytsov (@srchvrs)

said: It is not a good argument about getting statistical significance with a lot of data. You will eventually get a significant result, and it will demonstrate the true effect size with high probability. That is, if the effect is tiny, you will see that it is tiny and has no practical consequence.

PS: I discussed this issue in more detail a month ago: http://searchivarius.org/blog/statistical-significance-useful

egarcia

said: Agreed. That’s exactly the gist of the post. Forcing very small correlations to pass a significance test by using big data proves nothing.

Leonid Boytsov (@srchvrs)

said: The violation of normality is another story, one that essentially harms nothing. It is well known that the t-test works really well even when its assumptions are stretched to an amazing degree. One good example is information retrieval: people thought they might get much better results with permutation/bootstrap tests. But, alas, the t-test works perfectly and agrees with non-parametric re-sampling tests almost ideally.

egarcia

said: I disagree. There are many scenarios wherein normality must not be violated.

For instance, Zimmerman et al. (2003) showed that arbitrarily applying Fisher’s Z transformation to correlations from distributions that violate bivariate normality can lead to spurious results. In this case both x and y must be normally distributed.

Zimmerman, D. W., Zumbo, B. D., & Williams, R. H. (2003). Bias in estimation and hypothesis testing of correlation. Psicológica 24:133–158.