The hilarious picture above shows how some SEOs look when they play at being scientists. This often happens when they interpret big data.

A few specific scenarios:

1. Applying the statistical theory of small samples to extremely large samples, like …

2. …using a large amount of data to force very small correlation coefficients to become statistically significant.

3. Trying to arithmetically average ratios (like correlation coefficients, standard deviations, slopes, and cosine similarities).

4. Mistaking Cauchy Distributions for Normal Distributions.

5. Adding together intensive properties.
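Scenario 2 can be illustrated with a few lines of arithmetic. The sketch below is only an illustration: the value r = 0.01 and the sample sizes are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large n.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0, given Pearson's r.

    The t statistic for r is compared against a standard normal instead of
    a t distribution with n - 2 degrees of freedom. That substitution is an
    assumption made to keep the sketch dependency-free; it is rough for
    small n and very close for large n.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


r = 0.01  # hypothetical correlation that explains 0.01% of the variance
for n in (100, 10_000, 1_000_000):
    p = correlation_p_value(r, n)
    print(f"n={n:>9,}  r={r}  r^2={r * r:.4f}  p={p:.3g}")
```

The correlation coefficient is the same in every row; only the sample size changes. With enough data the p-value collapses, yet r² still says the relationship explains almost nothing.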

Fortunately, I know of good folks who are doing a great job of educating their search marketing peers (Mike Grehan, Bruce Clay, Danny Sullivan, etc.) without playing at being scientists.


itman1975

said: Surprisingly enough, statistical analysis of IR experiments can also be improved (in many ways). Regarding intensive properties and ratios: actually, it does not always hurt to add them. It depends on the data. I have examples where doing the analysis properly, i.e., by computing geometric means (or means of logs), did not change the outcome.

egarcia

said: Sorry to disagree with you on this.

Intensive properties are not additive at all.

Ratios follow a Cauchy distribution: no means, variance, or higher moments to compute. Sure you can find approximations that “sometimes work”.

When writing scientific papers you must use mathematically valid methods, not methods that sometimes work and sometimes don’t.

Sure you could compute a sample average from ratios, but you should not as it will not be an estimate of central tendency at all. So the result will not be mathematically valid. An average of ratios or a ratio of averages? This is a long-standing debate that we settle in SWM.
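A small numeric illustration of why the "average of ratios or ratio of averages" question matters. The (clicks, impressions) pairs below are hypothetical numbers chosen only to make the gap visible; the sketch does not say which quantity is the right one for a given study, only that the two are not interchangeable.

```python
# Hypothetical (clicks, impressions) pairs for two pages; illustration only.
pages = [(1, 2), (90, 100)]

# Arithmetic average of the per-page ratios: every page counts equally,
# no matter how little data stands behind its ratio.
ratios = [clicks / impressions for clicks, impressions in pages]
average_of_ratios = sum(ratios) / len(ratios)

# Ratio of the totals: add the extensive quantities first, then divide once.
total_clicks = sum(clicks for clicks, _ in pages)
total_impressions = sum(impressions for _, impressions in pages)
ratio_of_totals = total_clicks / total_impressions

print(f"average of ratios: {average_of_ratios:.4f}")
print(f"ratio of totals:   {ratio_of_totals:.4f}")
```

The two results differ because a ratio is an intensive quantity: clicks and impressions can be summed across pages, but the ratios themselves cannot.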

Leonid Boytsov (@srchvrs)

said: I agree that you had better use geometric means with ratios and/or ratio-type analysis. I cannot really argue about mathematical rigor in applying statistics. I note, however, that according to my observations, people use methods even when the assumptions do not hold or hold only approximately.

egarcia

said: I disagree on that as well. Sorry.

The geometric mean is used for log-normal distributions. http://en.wikipedia.org/wiki/Log-normal_distribution

A Cauchy distribution is often mistaken for a normal one, but it has no mean, variance, or higher moments. The sample mean and variance change with an increasing bias as more samples are taken. Computing a mean from a Cauchy distribution is not an estimate of central tendency; you would be computing a meaningless estimate. If you really want to approximate, use the median of a Cauchy-like distribution instead.
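The mean-versus-median point can be checked with a quick simulation. This is a minimal sketch under stated assumptions: it draws from a standard Cauchy distribution via the inverse CDF, and the helper names (`cauchy_sample`, `iqr`, `spread_of_estimates`), the seed, and the number of trials are all hypothetical choices made for illustration. It measures how much the sample mean and the sample median vary from one repeated sample to the next.

```python
import math
import random
from statistics import median


def cauchy_sample(rng: random.Random, n: int) -> list[float]:
    """Draw n standard Cauchy variates using the inverse CDF tan(pi*(u - 1/2))."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


def iqr(values: list[float]) -> float:
    """Crude interquartile range: difference of two order statistics."""
    s = sorted(values)
    return s[(3 * len(s)) // 4] - s[len(s) // 4]


def spread_of_estimates(n: int, trials: int = 200, seed: int = 42) -> tuple[float, float]:
    """Return (IQR of sample means, IQR of sample medians) over repeated samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        x = cauchy_sample(rng, n)
        means.append(sum(x) / n)
        medians.append(median(x))
    return iqr(means), iqr(medians)


for n in (10, 100, 1000):
    mean_spread, median_spread = spread_of_estimates(n)
    print(f"n={n:>5}  spread of means={mean_spread:.3f}  spread of medians={median_spread:.3f}")
```

For a Cauchy distribution the spread of the sample means should not shrink as n grows, while the spread of the sample medians should. That is the practical sense in which the mean is not an estimate of central tendency here and the median is.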

And even doing that, the resultant distribution might be a mixture of Cauchy and a bimodal distribution with a Cauchy predominating, giving the impression of a normal-like distribution (*). The analysis can be challenged on basic statistical arguments.

There are valid math methods to stay away from such approximations. I’m presenting on this at a local seminar this June 7.

PS

(*) Consider the ratio (a + ca)/(b + cb), where a and b are variables. For certain combinations of the constants ca and cb, this mixture scenario can be present. As ca = cb –> 0, the Cauchy scenario predominates (Marsaglia, 2006).

Leonid Boytsov (@srchvrs)

said: OK,

How do you test your distribution assumptions?

egarcia

said: Check here for all related distribution assumption tests:

Marsaglia, G. (2006). Ratios of normal variables. Journal of Statistical Software 16(4):1–10.

Cedilnik, A., Košmelj, K., & Blejec, A. (2004). The distribution of the ratio of jointly normal variables. Metodološki zvezki 1(1):99–108.

Cedilnik, A., Košmelj, K., & Blejec, A. (2006). Ratio of two random variables: A note on the existence of its moments. Metodološki zvezki 3(1):1–7.

gaga bella (@bellamary222)

said: Great info shared. It's very helpful for me. Thanks for sharing.

Pingback: Why Most SEO Statistical “Studies” are Flawed | IR Thoughts