Our tutorial on standard errors is back! It is now available at
We have edited and updated the tutorial and added new material.
Happy to see that The Self-Weighting Model (SWM) paper
was briefly cited in the journal Virus Evolution (Oxford University Press), in the research paper:
Coevolutionary Analysis Identifies Protein–Protein Interaction Sites between HIV-1 Reverse Transcriptase and Integrase
HTML version: http://ve.oxfordjournals.org/content/2/1/vew002.full
This is a great example of applying data mining techniques to HIV research, a major public health issue according to WHO, UNAIDS, and other world health organizations.
The study agrees with the SWM thesis, namely, that correlation coefficients are not additive. Glad to see how SWM influenced their data analysis.
More on SWM below:
Time to restore online my old tutorials on the non-additivity of correlation coefficients so the next generations of scientists are not misled (@SEO quacks and @MOZ pseudo-scientists).
1. Starting with our classic Term Vector Theory series, we are republishing our tutorials on Information Retrieval from the early and mid 2000s. See http://www.minerazzi.com/tutorials/
2. A new miner on the Zika Virus is now available online at http://www.minerazzi.com/zika/
3. Additional miners are listed at http://www.minerazzi.com/
In a nutshell, because most are based on flawed statistics.
The Question of Standard Deviations and Variances
If you have studied for the College Board Examination, you should know that standard deviations are not additive. You should also know that variances are additive for independent random variables. Read the article Why Variances Add — And Why It Matters. Many SEOs are unaware of this.
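The point is easy to verify numerically. A minimal Python sketch with simulated data (the sample size and the standard deviations of 3 and 4 are hypothetical, chosen only for illustration):

```python
import random

random.seed(42)
n = 100_000
x = [random.gauss(0, 3) for _ in range(n)]  # sd = 3, var = 9
y = [random.gauss(0, 4) for _ in range(n)]  # sd = 4, var = 16

def var(v):
    # population variance
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

s = [a + b for a, b in zip(x, y)]

# Variances add for independent variables: var(x + y) is close to 9 + 16 = 25
print(var(s))
# Standard deviations do not: sd(x + y) is close to 5, not 3 + 4 = 7
print(var(s) ** 0.5)
```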
The Question of Correlation Coefficients
Like standard deviations, correlation coefficients are not additive, period. Since they cannot be added, it is not possible to compute an arithmetic average of them. The same can be said about cosines, cosine similarities, slopes, and, in general, any dissimilar ratios. Read the Communications in Statistics article The Self-Weighting Model, in which flaws in the two main meta-analysis models are documented. Again, many SEOs do not understand this point.
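To see why arithmetically averaging correlation coefficients misleads, here is a small Python sketch with hypothetical values. It contrasts the naive arithmetic mean of two correlations with averaging via Fisher's z-transform, a standard alternative (SWM proposes its own weighting scheme; this sketch only shows that the naive average disagrees with a transform-based one):

```python
import math

# Two correlation coefficients from equally sized samples (hypothetical values)
r1, r2 = 0.95, 0.05

naive = (r1 + r2) / 2  # arithmetic mean: 0.50

# Fisher z-transform: average in z-space, then back-transform
z1, z2 = math.atanh(r1), math.atanh(r2)
back = math.tanh((z1 + z2) / 2)  # about 0.74

print(naive, back)  # the two "averages" clearly disagree
```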
The Question of Normality
Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid; otherwise one cannot claim that, for instance, a computed arithmetic mean (average) is a valid estimator of central tendency for the data at hand. Most SEOs and some “web analytics gurus” out there simply take some data and average it without first running a normality test.
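A quick sanity check, sketched in Python with simulated right-skewed data (the lognormal parameters are hypothetical, chosen because many web metrics are long-tailed): when data are heavily skewed, the mean and median disagree, and the arithmetic mean stops being a useful measure of central tendency.

```python
import random
import statistics

random.seed(7)
# Right-skewed data, as is typical of many web metrics (hypothetical)
data = [random.lognormvariate(0, 1) for _ in range(10_000)]

mean = statistics.mean(data)      # pulled upward by the long tail
median = statistics.median(data)  # the typical value

# A large mean/median gap is a red flag: the data are not normal,
# so a plain average misrepresents the "typical" observation.
print(mean, median)
```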
The Question of Big Data and the t-Test of Significance
When the Fathers of Statistics (Fisher and company) came up with the t-test of significance and similar tests, these were meant to be used with small data sets, not big data sets. To illustrate: if you take a very large data set of N paired results, compute a statistic (e.g., a correlation coefficient), and compare it against a t-table value, eventually it will pass the test of significance. This will be true for experimental correlations as small as 0.1, 0.01, 0.001, and so on, provided that N is large enough. Claims of statistical significance are in this case useless. This is why with big data you should try data stratification methods, followed by weighting methods. Big data can lead to big statistical pitfalls.
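The effect is easy to reproduce. The t statistic for testing a correlation r against zero is t = r·sqrt((N − 2)/(1 − r²)); as N grows, t grows without bound no matter how small r is. A minimal Python sketch:

```python
import math

def t_stat(r, n):
    # t statistic for testing a correlation coefficient r against zero
    return r * math.sqrt((n - 2) / (1 - r * r))

r = 0.01  # a practically meaningless correlation
for n in (100, 10_000, 1_000_000):
    t = t_stat(r, n)
    # 1.96 is roughly the two-tailed 5% critical value for large N
    print(n, round(t, 2), "significant" if t > 1.96 else "not significant")
```

At N = 100 the tiny correlation fails the test; crank N up to a million and the same r = 0.01 sails through it.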
The Question of Average of Ratios or Ratio of Averages
Ratios cannot be added and then averaged arithmetically, period. A ratio of averages must be used instead of an average of ratios. The reason is that a ratio distribution is Cauchy. A Cauchy distribution is often mistaken for a normal one, but it has no mean, variance, or higher moments. As more samples are taken, the sample mean and variance change erratically, with increasing bias. The mean computed from Cauchy-distributed data is not an estimate of central tendency. SEOs should know what they are averaging. Check one of my old posts and the comments that followed at
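The pathology is easy to demonstrate in Python with simulated data. The ratio of two independent standard normal variables follows a standard Cauchy distribution; its sample mean never settles down, while the sample median behaves:

```python
import random
import statistics

random.seed(1)

# The ratio of two independent standard normals is standard Cauchy
data = [random.gauss(0, 1) / random.gauss(0, 1) for _ in range(100_000)]

# The sample mean of Cauchy data is itself Cauchy-distributed:
# it does not converge and is not an estimate of central tendency.
print("mean:  ", statistics.mean(data))

# The sample median, by contrast, is a consistent location estimator
print("median:", statistics.median(data))  # close to 0
```

Rerun this with different seeds and the mean jumps around wildly while the median stays pinned near zero.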
To sum up, beware of SEO statistical “studies”.
Search Engine Watch is publishing how MOZ (formerly SEOmoz) is giving another black eye to the search marketing community:
Another embarrassing SEO Quack Science report from the usual suspects debunked, this time by Google’s Matt Cutts http://searchenginewatch.com/article/2290337/Matt-Cutts-Google-1s-Dont-Lead-to-Higher-Ranking
Oh, No. These SEO marketers still don’t understand basic statistics.
Here is a nice article about the risks of misusing big data
Here are my comments on the topic:
1. Most traditional tests of statistical significance were meant to be used with small data sets, not with big data, unless stratification of the big data is possible. In general, if a large enough data set is used, any t-test study of very small correlation coefficients can be forced to become statistically significant, and therefore misleading. This effect is an artifact of the equations involved.
2. Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid. This is one of the first things a peer reviewer of statistical articles will look for. Methods and techniques for transforming data to become normally distributed prior to any analysis do exist, although some data sets might not be transformable and forcing them to adopt normality can be contraindicated.
3. Avoid arithmetically adding and averaging correlation coefficients, standard deviations, slopes, cosine similarity measures, and dissimilar ratios in general. They are simply not additive, regardless of what some outdated meta-analysis articles say or what you hear from groupie search marketers (“Why regurgitate in blogs what you don’t understand?”).
The hilarious picture above shows how some SEOs look when playing scientist. This often occurs when interpreting big data.
A few specific scenarios:
1. Applying the statistical theory of small samples to extremely large samples, like …
2. …using large amounts of data to force very small correlation coefficients to become statistically significant.
3. Trying to arithmetically average ratios (like correlation coefficients, standard deviations, slopes, and cosine similarities).
4. Mistaking Cauchy Distributions for Normal Distributions.
5. Adding together intensive properties.
Fortunately, I know of good folks who are doing a great job at educating their search marketing peers (Mike Grehan, Bruce Clay, Danny Sullivan, etc.) without playing scientist.
More and more SEOs are using separators like pipes, dashes, commas, etc., when writing title tags. (Underscores are excluded from this list, as they are used as concatenators.)
Which of these separators perform better?
From time to time, some SEO “experts” and their cheerleaders promote the idea that a particular separator performs better than the others. See the following link:
And that is despite the fact that Google’s Matt Cutts has mentioned in two different videos that there is no real significant advantage to using one over another. These videos are available at the following links:
I would like to see an analytical study supporting such claims. I believe this is a reasonable request. Don’t you think so?
Nothing better than starting 2011 with more research work.
Check this blog tomorrow, as there will be value-added good news for those interested in conducting research at the interface of information retrieval, statistical analysis, and applied mathematics. You’re welcome to grab a copy of this four-month investigation, for use in your own research, as a teaching tool, or to chase away SEO snake oil.