In a nutshell, because most are based on flawed statistics.
The Question of Standard Deviations and Variances
If you have studied for the College Board Examination, you should know that standard deviations are not additive. You should also know that variances of independent random variables are additive. Read the article Why Variances Add — And Why It Matters. Many SEOs are unaware of this.
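A quick stdlib-Python simulation (the data are simulated, so the figures are illustrative) shows both facts at once: the variance of a sum of independent variables is the sum of their variances, while the standard deviation of the sum is not the sum of the standard deviations:

```python
import random
import statistics

random.seed(42)

# Two independent random variables with different spreads
x = [random.gauss(0, 3) for _ in range(100_000)]   # sd = 3, var = 9
y = [random.gauss(0, 4) for _ in range(100_000)]   # sd = 4, var = 16
s = [a + b for a, b in zip(x, y)]                  # their sum

var_sum = statistics.pvariance(s)   # ~25 = 9 + 16: variances add
sd_sum = statistics.pstdev(s)       # ~5, NOT 3 + 4 = 7: std devs do not

print(f"Var(X+Y) = {var_sum:.2f} (Var(X) + Var(Y) = 25)")
print(f"SD(X+Y)  = {sd_sum:.2f} (sqrt(25) = 5, not 3 + 4 = 7)")
```

The standard deviation of the sum is the square root of the summed variances, which is always less than the sum of the standard deviations when both are positive.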
The Question of Correlation Coefficients
Like standard deviations, correlation coefficients are not additive, period. Since they cannot be added, it is not possible to compute an arithmetic average of them. The same can be said of cosines, cosine similarities, slopes, and in general of any dissimilar ratios. Read the Communications in Statistics article The Self-Weighting Model, which documents flaws in the two main meta-analysis models. Again, many SEOs do not understand this point.
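A small made-up example shows why naive averaging fails: two subgroups can each be perfectly correlated while the pooled data are strongly negatively correlated, so the arithmetic average of the two r values says nothing about the whole:

```python
import statistics

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two subgroups, each perfectly correlated (r = +1)
x1, y1 = [0, 1, 2, 3, 4], [10, 11, 12, 13, 14]
x2, y2 = [10, 11, 12, 13, 14], [0, 1, 2, 3, 4]

r1, r2 = pearson(x1, y1), pearson(x2, y2)
naive_avg = (r1 + r2) / 2               # +1.0
r_pooled = pearson(x1 + x2, y1 + y2)    # about -0.85

print(f"r1 = {r1:.2f}, r2 = {r2:.2f}, naive average = {naive_avg:.2f}")
print(f"pooled r = {r_pooled:.2f}")
```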
The Question of Normality
Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid; otherwise one cannot claim that, for instance, a computed arithmetic mean (average) is a valid estimator of central tendency for the data at hand. Most SEOs and some “web analytics gurus” out there simply take some data and average them without first running a normality test.
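A rough stdlib-only sketch of why this matters (the data are simulated): sample skewness is one quick check, and for strongly skewed data the mean drifts away from the typical value, so it stops being a good measure of central tendency:

```python
import random
import statistics

random.seed(7)

def skewness(data):
    """Sample skewness: near 0 for symmetric, normal-like data."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

normal_like = [random.gauss(100, 10) for _ in range(50_000)]
skewed = [random.expovariate(1 / 100) for _ in range(50_000)]  # right-skewed

print(f"normal-like: skew = {skewness(normal_like):+.2f}, "
      f"mean = {statistics.fmean(normal_like):.1f}, "
      f"median = {statistics.median(normal_like):.1f}")
print(f"skewed:      skew = {skewness(skewed):+.2f}, "
      f"mean = {statistics.fmean(skewed):.1f}, "
      f"median = {statistics.median(skewed):.1f}")
```

For the skewed (exponential-like) set, the mean lands well above the median; averaging it without checking normality first misrepresents the typical observation.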
The Question of Big Data and the t-Test of Significance
When the fathers of statistics (Fisher and company) came up with the t-test of significance and similar tests, these were meant to be used with small data sets, not big ones. To illustrate, if you take a very large data set of N paired results, compute a statistic (e.g., a correlation coefficient), and compare it against a t-table value, eventually it will pass the test of significance. This will be true for experimental correlations as small as 0.1, 0.01, 0.001, and so on, provided that N is large enough. Claims of statistical significance are in this case useless. This is why with big data you should try data stratification methods, followed by weighting methods. Big data can lead to big statistical pitfalls.
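The effect can be sketched directly with the standard t statistic for testing whether a Pearson r differs from zero (the sample sizes below are arbitrary):

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.01  # a practically meaningless correlation
for n in (100, 10_000, 1_000_000):
    t = t_statistic(r, n)
    verdict = "significant" if t > 1.96 else "not significant"
    print(f"r = {r}, N = {n:>9,}: t = {t:6.2f} -> {verdict} (critical ~1.96)")
```

The same r = 0.01 fails the test at N = 100 but sails past the critical value at N = 1,000,000; nothing about the data changed, only the sample size.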
The Question of Average of Ratios or Ratio of Averages
Ratios cannot be added and then averaged arithmetically, period. A ratio of averages must be used instead of an average of ratios. The reason is that the ratio of two normally distributed variables follows a Cauchy distribution. A Cauchy distribution is often mistaken for a normal one, but it has no mean, variance, or higher moments. As more samples are taken, the sample mean and variance do not converge; they keep changing with increasing bias. Thus, the arithmetic mean of samples drawn from a Cauchy distribution is not an estimate of central tendency. SEOs should know what they are averaging. Check one of my old posts and the comments that followed at
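Both points can be sketched with made-up click/impression counts and a simulated Cauchy variable (constructed as a ratio of two normals); the exact figures are illustrative only:

```python
import random
import statistics

# Average of ratios vs. ratio of averages (e.g., clicks / impressions)
clicks = [5, 50, 2]
impressions = [10, 1000, 4]

avg_of_ratios = statistics.fmean(c / i for c, i in zip(clicks, impressions))
ratio_of_avgs = sum(clicks) / sum(impressions)

print(f"average of ratios = {avg_of_ratios:.3f}")  # (0.5 + 0.05 + 0.5)/3 = 0.35
print(f"ratio of averages = {ratio_of_avgs:.3f}")  # 57/1014, about 0.056

# A Cauchy sample mean does not settle down as N grows
random.seed(1)
def cauchy():
    return random.gauss(0, 1) / random.gauss(0, 1)  # ratio of two normals

samples = [cauchy() for _ in range(100_000)]
for n in (100, 10_000, 100_000):
    print(f"N = {n:>7,}: running mean = {statistics.fmean(samples[:n]):.2f}")
```

The two "averages" of the same data differ by a factor of roughly six, and the Cauchy running mean keeps wandering instead of converging to a stable value.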
To sum up, beware of SEO statistical “studies”.
Search Engine Watch is reporting how Moz (formerly SEOmoz) is giving another black eye to the search marketing community:
Another embarrassing SEO Quack Science report from the usual suspects debunked, this time by Google’s Matt Cutts http://searchenginewatch.com/article/2290337/Matt-Cutts-Google-1s-Dont-Lead-to-Higher-Ranking
Oh, No. These SEO marketers still don’t understand basic statistics.
Here is a nice article about the risks of misusing big data
Here are my comments on the topic:
1. Most traditional statistical significance analyses were meant to be used with small data sets, not with big data, unless stratification of the big data is possible. In general, if a large enough data set is used, any t-test study of very small correlation coefficients can be forced to become statistically significant, and thus misleading. This effect is an artifact of the equations involved.
2. Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid. This is one of the first things a peer reviewer of statistical articles will look for. Methods and techniques for transforming data to become normally distributed prior to any analysis do exist, although some data sets might not be transformable and forcing them to adopt normality can be contraindicated.
3. Avoid arithmetically adding and averaging correlation coefficients, standard deviations, slopes, cosine similarity measures, and dissimilar ratios in general. They are simply not additive, regardless of what some outdated meta-analysis articles say or what you hear from groupie search marketers (“Why regurgitate in blogs what you don’t understand?”). :)
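The transformation mentioned in point 2 can be sketched with a log transform, one common technique for right-skewed, strictly positive data (the data below are simulated log-normal values):

```python
import math
import random
import statistics

random.seed(3)

def skewness(data):
    """Sample skewness: near 0 for symmetric, normal-like data."""
    m, s = statistics.fmean(data), statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

# Log-normal data: strongly right-skewed on the raw scale...
raw = [math.exp(random.gauss(0, 1)) for _ in range(50_000)]
# ...but its logarithm is normally distributed
logged = [math.log(x) for x in raw]

print(f"raw skew    = {skewness(raw):+.2f}")     # strongly positive
print(f"logged skew = {skewness(logged):+.2f}")  # near zero
```

As noted above, not every data set responds to such transformations, and forcing normality where none exists can be contraindicated.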
On Words and Strings
In String Frequency Distributions, Mark Liberman blogs about the flaws involved when co-occurrence studies are reported without first defining what a “word” is.
I agree 100% with him. Using a Google data set without defining what a “word” is can be misleading. If no data clean-up is done first, the best that we can do is to call those studies “string frequency co-occurrence”.
A string is a linear sequence of symbols (characters, words, phrases, etc.). If we limit the term “string” to mean a string token, then a string is a sequence of characters. I prefer the following definition: a string (meaning a string token) is a non-space character or a sequence of these. Thus, a string need not be a dictionary word.
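A minimal sketch of this definition: splitting text on whitespace yields string tokens, not dictionary words (the sample text is made up for illustration):

```python
text = "c++ programmers visit http://www.miislita.com 24/7 :-)"

# A string token: any run of non-space characters
tokens = text.split()

print(tokens)
# Only 'programmers' and 'visit' are dictionary words; 'c++',
# the URL, '24/7', and ':-)' are still perfectly valid tokens.
```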
On Co-Occurrence Studies
Back in the early ’00s I conducted research on co-occurrence. In 2004, I introduced SEOs to co-occurrence theory concepts when Danny Sullivan created the Search Engine Watch Forums. I described what is and is not co-occurrence. I also introduced SEOs to two key concepts, the C-Index and the EF-Ratio, which were explained at several SEO sites and conferences through basic Venn diagram theory.
To make a long story short, the first of the above ratios is simply the AND/OR ratio and the latter the EXACT/AND ratio. Both are computable from search results, but as signals contaminated with some noise, since search engines can expand the query and answer sets in the background. Despite being noisy signals, these ratios provide some clue as to whether two or more keywords exhibit some form of relatedness in a database or corpus. Co-occurrence is also applicable to HTML elements like links, anchor texts, titles, etc.
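Given hit counts for each keyword, both ratios can be sketched directly; the OR count follows from inclusion-exclusion on the Venn diagram, and the hit counts below are hypothetical:

```python
def c_index(hits_a, hits_b, hits_and):
    """C-Index: AND/OR ratio, with |A OR B| from inclusion-exclusion."""
    hits_or = hits_a + hits_b - hits_and
    return hits_and / hits_or

def ef_ratio(hits_exact, hits_and):
    """EF-Ratio: EXACT/AND ratio (exact-phrase hits over AND hits)."""
    return hits_exact / hits_and

# Hypothetical hit counts for a two-keyword query
hits_a, hits_b = 1_000_000, 500_000  # each keyword alone
hits_and = 200_000                   # both keywords, any order
hits_exact = 50_000                  # as an exact phrase

print(f"C-Index  = {c_index(hits_a, hits_b, hits_and):.3f}")
print(f"EF-Ratio = {ef_ratio(hits_exact, hits_and):.3f}")
```

Because the underlying counts are noisy estimates, these ratios should be read as rough relatedness signals, not exact measurements.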
These days, a new generation of easy-to-impress SEOs is rediscovering co-occurrence as something “new” around the block. Please…
More than 10 years later, I’m completing research on new ratios and on their complementary relationships with the above two. This includes a search engine that calculates these ratios on the fly in its search result pages. More on this will soon follow.
…and the end of a fiasco.
We have added a new round of similarity measures to our Binary Similarity Calculator, for a total of 30.
We plan to add a few more in the future so that more data miners, researchers, and students can benefit from it. We plan to make this a comprehensive project on similarity analysis. For this purpose, several doctoral theses have been consulted. Any help, corrections, or feedback is appreciated.
The Binary Distance Calculator, a new tool for computing the distance or dissimilarity (lack of resemblance) between any two binary sets of the same size, is now available at http://www.miislita.com. Its FAQs section includes a clear definition of distance in the context of information retrieval and mathematics.
This tool was developed to complement The Binary Similarity Calculator, one of our popular tools.
We have just launched The Binary Similarity Calculator, a new tool for computing binary-based similarity measures.
What it is
The Binary Similarity Calculator (BSC) can be used to compare binary sets, groups consisting of only two types of items or states. These are item sets that can be represented as sequences of 1’s and 0’s.
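For readers new to such measures, here is a minimal sketch of two classic binary similarity measures, Jaccard and Simple Matching, built from the usual a, b, c, d match counts (the example vectors are arbitrary, and this sketch is not the calculator's own code):

```python
def binary_counts(x, y):
    """Counts a, b, c, d used by most binary similarity measures."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # 1-1 matches
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # 1-0 mismatches
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # 0-1 mismatches
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # 0-0 matches
    return a, b, c, d

x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 0, 1, 1, 0]

a, b, c, d = binary_counts(x, y)
jaccard = a / (a + b + c)                    # ignores 0-0 matches
simple_matching = (a + d) / (a + b + c + d)  # counts 0-0 matches

print(f"Jaccard = {jaccard:.3f}, Simple Matching = {simple_matching:.3f}")
```

Whether shared absences (the 0-0 matches) should count toward similarity is exactly the kind of modeling choice that distinguishes one measure from another.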
Who can benefit from it
• Marketing analysts who need to examine Yes/No-type questionnaires about products and services.
• Teachers and examiners who must score Yes/No-type exams or assess plagiarism cases.
• Engineers, mathematicians, and physicists who must evaluate On/Off-type records.
• Statisticians, bioanalysts, and others involved with sequencing analysis.
• In short, anyone who works with binary sets.
Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid. One way of testing for normality is through a quantile-quantile (q-q) plot, a technique for determining if data sets originate from populations with a common distribution.
In this tutorial, you will determine if a data set is normally distributed by comparing its quantiles against those of a theoretical normal distribution. You will also learn how to make a data set nearly normally distributed.
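Under the hood, a q-q comparison amounts to pairing sample quantiles with theoretical normal quantiles and checking that they fall on a straight line; a minimal stdlib sketch with simulated data:

```python
import random
import statistics

random.seed(11)
data = sorted(random.gauss(50, 5) for _ in range(10_000))
n = len(data)

# Fit a normal distribution to the sample's own mean and spread
mu, sigma = statistics.fmean(data), statistics.pstdev(data)
normal = statistics.NormalDist(mu, sigma)

# Compare sample quantiles with theoretical normal quantiles
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    sample_q = data[int(p * n)]     # empirical quantile
    theory_q = normal.inv_cdf(p)    # normal quantile
    print(f"p = {p:.2f}: sample = {sample_q:6.2f}, normal = {theory_q:6.2f}")
```

Since these data really are normal, the two columns agree closely at every quantile; for skewed data the sample quantiles would pull away from the normal ones in the tails.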
Oct 3, 2012 Update: I’ve added a new figure to the article and reworked few lines.