Our tutorial on standard errors is back! It is now available at
We have edited and updated the tutorial. New material was added.
Happy to see that The Self-Weighting Model (SWM) paper
was briefly cited in the Virus Evolution journal published by Oxford University Press, in the research paper:
Coevolutionary Analysis Identifies Protein–Protein Interaction Sites between HIV-1 Reverse Transcriptase and Integrase
HTML version: http://ve.oxfordjournals.org/content/2/1/vew002.full
This is a great example of applying data mining techniques to HIV research, a major public health issue according to WHO, UNAIDS, and other world health organizations.
The study agrees with the SWM thesis; i.e., that correlation coefficients are not additive. Glad to see how SWM influenced their data analysis.
More on SWM below:
Time to restore online my old tutorials on the non-additivity of correlation coefficients so the next generations of scientists are not misled (@SEO quacks and @MOZ pseudo-scientists).
The hilarious picture above shows how some SEOs look when pretending to be scientists. This often occurs when interpreting big data.
A few specific scenarios:
1. Applying the statistical theory of small samples to extremely large samples, like …
2. …using large amounts of data to force very small correlation coefficients to become statistically significant.
3. Trying to arithmetically average ratios (like correlation coefficients, standard deviations, slopes, and cosine similarities).
4. Mistaking Cauchy Distributions for Normal Distributions.
5. Adding together intensive properties.
Fortunately, I know of good folks who are doing a great job at educating their search marketing peers (Mike Grehan, Bruce Clay, Danny Sullivan, etc.) without pretending to be scientists.
This morning I received confirmation from the editors of Communications in Statistics: Theory and Methods that they accepted and will be publishing my peer-reviewed paper on a new model for statistical analysis. It should be out in 2012.
Once published, you will understand the SEO (* SEOmoz, I should say) nonsense of computing arithmetic averages of correlation coefficients and why some meta-analysis studies published in the past (* Hunter-Schmidt; Hedges-Olkin) are flawed and invalid.
It took me several meals and research hours to figure it out. I hope that IRs, dataminers, and statistics colleagues find new applications for the model.
The model can be applied to many fields, including marketing, business, risk analysis, data mining, signal processing, engineering, clinical trials, and almost any field or knowledge domain that involves the calculation of weighted statistics. I look forward to discussing it online once it gets published.
Happy New Year.
PS. (*) I’ve edited this post to make these points obvious. So, the issue of arithmetically averaging correlations has been raised and killed for good before the scientific and statistical community.
PS. Just in: Last night (Jan-03-2012) I received news from one of the editors of the journal that the paper was assigned to issue 41 (8). Check for its title: The Self-Weighting Model (in Spanish it is something like “El Modelo de Autoponderacion”). I forgot to mention that this journal is published biweekly; so, things are moving fast. What a way of ending 2011 and starting 2012!!!
Back in 2008, Jan M. Hoem wrote an interesting reflection paper titled “The reporting of statistical significance in scientific journals” (VOLUME 18, ARTICLE 15, PAGES 437-442; 03 JUNE 2008; http://www.demographic-research.org/volumes/vol18/15/18-15.pdf ). The piece was an expanded version of a previous paper (http://www.demogr.mpg.de/papers/working/wp-2007-037.pdf ).
He wrote (and I quote):
“Scientific journals in most empirical disciplines have regulations about how authors should report the precision of their estimates of model parameters and other model elements. Some journals that overlap fully or partly with the field of demography demand as a strict prerequisite for publication that a p-value, a confidence interval, or a standard deviation accompany any parameter estimate. I feel that this rule is sometimes applied in an overly mechanical manner. Standard deviations and p-values produced routinely by general-purpose software are taken at face value and included without questioning, and features that have too high a p-value or too large a standard deviation are too easily disregarded as being without interest because they appear not to be statistically significant. In my opinion authors should be discouraged from adhering to this practice, and flexibility rather than rigidity should be encouraged in the reporting of statistical significance. One should also encourage thoughtful rather than mechanical use of p-values, standard deviations, confidence intervals, and the like. Here is why:”
Hoem then dissects five points related to the misuse of statistical significance results and automatic software solutions. I’m listing these below.
I completely agree with Hoem.
SEOMOZ and their statistical “studies”
These days search engine optimization marketers (SEOs/SEMs) keep misinterpreting statistical results spat out by software without stopping and thinking about the significance-behind-the-significance, especially when it comes to a correlation coefficient, r (Pearson, Spearman, etc).
When one reads SEO hearsay and urban legends at SEOMOZ that present very small correlation coefficients (0.17, 0.32, etc) derived from large sample sizes as evidence that variables are “highly correlated” or “well correlated”, it is time to stop and put such “studies” into question. For reference see the following links
Fortunately, leading search marketers like Danny Sullivan have put those “studies” into question at a recent search engine conference
and that was even before SEOMOZ admitted the 0.32 result has to be recanted as 0.17.
Sean Golliher, founder and publisher of the Search Engine Marketing Journal (SEMJ.org), has also put their results into question (http://www.seangolliher.com/2010/uncategorized/185/ ), which Hendrickson from SEOMOZ still insists on defending.
Since then they have never disclosed the source of the mistake, dismissing it just as a programming error. Unfortunately, they are still claiming that a 0.15 – 0.30 range validates their “studies” (http://www.seo.co.uk/seo-news/seo-tools/the-seomoz-lda-tool-%E2%80%93-our-disappointing-findings.html) .
Small and Large Sample Sizes
When William Sealy Gosset (aka “Student”) proposed, and Ronald A. Fisher expanded on, the test later termed Student’s t-test of significance, the test was meant to be used to assess information from small samples, not from large samples. I have discussed the case of small sample sizes in another post (https://irthoughts.wordpress.com/2010/10/18/on-correlation-coefficients-and-sample-size/ ).
In order to apply a t-test (and other small sample analysis tests) to large samples, divide-and-conquer techniques, like stratification, were eventually developed. In the case of correlation and regression, the reason for doing this is that applying something like a t-test to, for instance, a single correlation coefficient coming from a huge sample can produce misleading results. Let’s see why.
For large enough sample sizes, eventually any nonzero correlation coefficient, even a very small one, will be significant (t-observed > t-table). At that point it might be tempting to assume that the variables in question are highly correlated. Wrong assumption!
The fact is that statistical significance does not necessarily equate to variables being highly correlated and vice versa. Let’s address this point in two parts: (1) the question of statistical significance and (2) the question of high correlation.
Statistical Significance: Bigger is not always better
As noted in a Wikipedia entry, “given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero.” (http://en.wikipedia.org/wiki/Effect_size ). I have discussed effect size and power analysis in a previous post (https://irthoughts.wordpress.com/2010/10/21/on-power-analysis-and-seo-quack-science/ ).
The reason for the above effect has a lot to do with the definition of statistical significance itself. Statistical significance is the confidence one has that a given result is not due to random chance.
In mathematical terms, the confidence that a result is not due to random chance is given by the following formula by Sackett (http://en.wikipedia.org/wiki/Statistical_significance , http://www.cmaj.ca/cgi/content/full/165/9/1226 ):
Confidence = (Signal/Noise)*Sqrt[Sample Size]
This simple expression, or derivatives of it, appears in many different scenarios and disciplines. It describes a generic Confidence Function, F, in terms of a Signal, a Noise, and a Sample Size; that is, F(Signal, Noise, Sample Size). In general, such a generic function tells us that:
Let’s apply a version of this expression to correlation. To do this, let Y be the dependent variable and let X be the independent variable. Let’s also make the following substitutions:
F(Signal, Noise, Sample Size) = t-observed^2 = [r^2/(1 – r^2)]*[n – 2]
Taking the square root (Sqrt) of both sides, we obtain the so-called formula for a two-tailed t-test.
t-observed = r*Sqrt[(n – 2)/(1 – r^2)]
Evidently, for a given r value, t-observed increases when n increases. By rearranging this expression, it is possible to compute, for a large enough sample size, a critical value above which r values will be significant. For very large samples at a 95% confidence level, t-table = 1.96 (http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values ). Arbitrarily substituting this value in the above expression (t-observed = t-table = t = 1.96) and solving for r, we obtain that the critical r value is given by
r = t/Sqrt[(n – 2) + t^2]
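As a rough sketch of the two expressions above, the following Python snippet computes t-observed for a given r and n, and the critical r for a given n and t. The function names are my own labels for illustration, and t = 1.96 is the arbitrarily set large-sample value discussed in the text.

```python
import math

def t_observed(r, n):
    """t-observed = r*Sqrt[(n - 2)/(1 - r^2)]."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def critical_r(n, t=1.96):
    """Critical r = t/Sqrt[(n - 2) + t^2]."""
    return t / math.sqrt((n - 2) + t * t)

for n in (1_000, 10_000, 100_000, 1_000_000):
    r = critical_r(n)
    # Plugging the critical r back into t_observed should recover t = 1.96.
    print(f"n = {n:>9,}  critical r = {r:.6f}  t check = {t_observed(r, n):.4f}")
```

Plugging the critical r back into the t-observed formula is a cheap way to confirm that the rearrangement was done correctly.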
The following table lists values for very small r values and huge sample sizes. I’m intentionally using several decimal places and ignoring significant figure rules since I want to make a point on the small values used. I’m also using a 0.95 confidence level for illustration purposes, but for the large samples I could and should use other confidence levels as well.
|n||n – 2||t||r||S = r*r||N = 1 – r*r||S/N|
For a sample size of 10,000 observations the critical r is 0.0196 or about 0.02, meaning that for such a huge sample size any r value above this small and critical r value will be significant. However, something interesting is observed from this table: (PS See footnote update)
When one moves to large sample sizes the Noise becomes greater than the Signal. For instance, at n = 10,000 the amount of Signal is very small (0.000384) while the amount of Noise is above 0.9996… or 99.96…%, giving a quite trivial S/N ratio. A similar reasoning can be applied for r = 0.17 (S = 0.0289, N = 0.9711) and r = 0.32 (S = 0.1024, N = 0.8976). The corresponding S/N ratios are trivial.
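The Signal and Noise figures quoted above can be reproduced with a few lines of Python. This is only a sketch that follows the column definitions S = r*r and N = 1 – r*r; the function name is hypothetical.

```python
def signal_noise(r):
    """Return (S, N, S/N) with S = r*r and N = 1 - r*r."""
    s = r * r
    n = 1 - s
    return s, n, s / n

# The critical r at n = 10,000 and the two r values discussed in the text.
for r in (0.0196, 0.17, 0.32):
    s, n, ratio = signal_noise(r)
    print(f"r = {r:.4f}  S = {s:.6f}  N = {n:.6f}  S/N = {ratio:.6f}")
```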
One can also solve the above expression for n to find sample sizes for some small r values and arbitrary t as shown in the following table (PS See footnote update).
|t||r||S = r*r||N = 1 – r*r||S/N||n||n – 2|
Still, note that the amount of Noise completely overcomes the Signal, producing trivial S/N ratios. In general for small r and large n values significance is achieved at the cost of Noise masking the Signal. When this occurs the statistical significance is not a practical guideline for drawing useful conclusions from the data at hand.
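Solving the t-test expression for n gives n = 2 + t^2*(1 – r^2)/r^2. A minimal Python sketch of that rearrangement follows, again with an arbitrarily set t = 1.96 and a function name of my own choosing. Rounding up to a whole observation is my assumption, not something stated in the tables.

```python
import math

def sample_size_for(r, t=1.96):
    """Smallest whole n at which a given r reaches t-observed >= t."""
    n_exact = 2 + t * t * (1 - r * r) / (r * r)
    return math.ceil(n_exact)

for r in (0.01, 0.02, 0.05, 0.17, 0.32):
    s = r * r
    print(f"r = {r:.2f}  S/N = {s / (1 - s):.6f}  n = {sample_size_for(r):,}")
```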
This drives the present discussion to the substantive part of the problem missed by SEOs, and that is …
Statistical Significance Does Not Necessarily Mean Highly Correlated Results
Simply stated, statistical significance does not necessarily imply that the X, Y variables are highly correlated.
A simple scatterplot will convince anyone that for the above small r values there will be no pattern or trend in the data set. The corresponding regression model will be useless for forecasting or inferring anything of value, except that the data spreads so wildly that it has no method to its chaos. What else is to be expected from a data set with a large Noise and small S/N ratio?
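For readers who prefer to see this with numbers, here is a small simulation sketch in Python. It uses synthetic normally distributed data (not any SEO data set), a population correlation of 0.17, and a fixed random seed; all names and settings are assumptions chosen for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(rho, n, seed=0):
    """Draw n synthetic (x, y) pairs whose population correlation is rho."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for x in xs]
    return xs, ys

xs, ys = simulate(rho=0.17, n=20_000)
r = pearson_r(xs, ys)
t = r * math.sqrt((len(xs) - 2) / (1 - r * r))
print(f"sample r = {r:.3f}, t-observed = {t:.1f}")
print(f"share of variation in y left unexplained: {1 - r * r:.1%}")
```

The t-observed value comes out far above 1.96, yet nearly all of the variation in y remains unexplained. Plotting xs against ys with any charting tool should show a shapeless cloud.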
This is something that SEOs/SEMs still don’t seem to understand: t-observed > t-table does not necessarily mean high correlation, and vice versa. I don’t have any personal stake (or take) against them, but when folks like Hendrickson, Fishkin, and others from SEOMOZ ignore Signal-to-Noise ratios and start referring to small r values as evidence that experimental variables are “highly” or “well” correlated, it is more than fair to call such “studies” Quack “Science”. That label might sound harsh, but in this case it is appropriate.
Search engine marketers might be good at selling snakeoil, publishing sloppy “studies”, or recanting on overhyped statements, but not at doing real Science. They should know better; i.e., that
Statistical “significance” only means that any confidence in the data is not by random chance. Therefore, a significant correlation does not necessarily mean a “high”, “well”, or “strong” correlation between variables.
To understand all this we need to distinguish between statistical significance and practical significance.
Statistical Significance vs. Practical Significance
As stated in this Wikipedia entry (emphasis added in boldface; http://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Criticism ):
A common misconception is that a statistically significant result is always of practical significance, or demonstrates a large effect in the population. Unfortunately, this problem is commonly encountered in scientific writing. Given a sufficiently large sample, extremely small and non-notable differences can be found to be statistically significant, and statistical significance says nothing about the practical significance of a difference.
Use of the statistical significance test has been called seriously flawed and unscientific by authors Deirdre McCloskey and Stephen Ziliak. They point out that “insignificance” does not mean unimportant, and propose that the scientific community should abandon usage of the test altogether, as it can cause false hypotheses to be accepted and true hypotheses to be rejected.
Some statisticians have commented that pure “significance testing” has what is actually a rather strange goal of detecting the existence of a “real” difference between two populations. In practice a difference can almost always be found given a large enough sample. The typically more relevant goal of science is a determination of causal effect size. The amount and nature of the difference, in other words, is what should be studied. Many researchers also feel that hypothesis testing is something of a misnomer. In practice a single statistical test in a single study never “proves” anything.
That pretty much settles the question of discerning between statistical significance and practical significance of correlation coefficients, but does not tell us how to quantitatively discern between the two concepts. In an upcoming article, I will derive expressions that might help to quantitatively assess these.
Since the tutorial on correlation coefficients http://www.miislita.com/information-retrieval-tutorial/a-tutorial-on-correlation-coefficients.pdf has been updated several times and is getting too long, I will put that upcoming material on a separate pdf file. As a sneak preview, we will be examining extreme cases (too high/low r values, too high/low sample sizes, and too high/low signal-to-noise ratios, etc.).
PS. I updated this post to fix some little typos.
Footnote. I found it erroneous to include the entries for n = 10 and n = 100 in the first table, so I removed these altogether and limited the discussion to the large n values. A reader asked why I used t-table = 1.96 for all entries. I thought it was clear from the discussion that the above tables are meant to show calculations for arbitrarily set t-values. In a real test, you would need to use the actual t values from statistical tables. For instance, for n = 10 you would have to use a t-table value of t = 2.306 at the 0.95 level. You should get
|n||n-2||t||r||S = r*r||N = 1 – r*r||S/N|
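As a quick check on the footnote, the sketch below computes one row with the columns named in the header above for n = 10 and t = 2.306. The helper name and the dictionary layout are hypothetical; only the formulas come from the text.

```python
import math

def table_row(n, t):
    """One row of the table: n, n - 2, t, critical r, S, N and S/N."""
    r = t / math.sqrt((n - 2) + t * t)
    s = r * r
    noise = 1 - s
    return {"n": n, "n-2": n - 2, "t": t, "r": r, "S": s, "N": noise, "S/N": s / noise}

row = table_row(n=10, t=2.306)
print({key: round(value, 4) for key, value in row.items()})
```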
One of the trickiest aspects of publishing statistical studies is the sample size to be used. Not stipulating a valid procedure for estimating a proper sample size can hurt, for instance, a grant proposal. Ethical committees are concerned about the right number of observations in a study, asking submitters to justify on statistical grounds how they arrived at a given sample size. Research projects with too few or too many observations or no sample size methodology at all often get rejected. This is something those conducting SEO quack “science” don’t seem to understand or are not aware of.
Samples that are too small are unethical, because the researcher cannot be specific enough about the size of, for example, the effect of a drug in a population. Samples that are too large are also unethical, because they represent a waste of funding. It is true that a large sample improves precision, but it might involve an unjustified cost. Stratification is preferred, but it gets too complicated with huge sample sizes, not to mention that statistical significance does not necessarily scale between samples.
As Rahul Dodhia from RavenAnalytics (http://ravenanalytics.com/Articles/Sample_Size_Calculations.htm ) indicates: a 2000-sample might not be very different from a 20000-sample, but a 200-sample may be very different from a 2000-sample even when in each case the sample ratio is 10. So, a large sample is not always justified, even if such a sample size improves statistical significance and precision.
Consider the case of search engine ranking results. Upon a query, search engines are capable of finding many results, frequently in the range of thousands or millions of results per query. Still, search engines and retrieval systems show users a limited answer set. For instance, Google limits its viewable answer set to a maximum of 1,000 results (100 pages, 10 results/page).
As in most retrieval systems, relevant results accumulate on the first few result pages, forming clusters. This is in agreement with Rijsbergen’s Cluster Hypothesis, which states that documents that cluster together have a similar relevance to a given query. Moving down the list of search results one often finds cluster transitions wherein the quality and aboutness of documents is polluted with off-topic content.
Documents buried in a list of results often contain content irrelevant to the initial query or full of spam techniques. If one wants to conduct a statistical study of ranking results versus a particular document feature, one can do better by considering a sample from the first few result pages than from the entire answer set of 1,000 results.
In general, in a non-search engine scenario one cannot just arbitrarily select large samples to “force” the statistical significance of very low correlation coefficients and then use those values to draw conclusions. Furthermore, what is the selection criterion for using 1,000 or 10,000 results?
Simply stated: If 10,000 observations are arbitrarily selected, why not use 100,000 or 1,000,000 instead? We already know that very small correlation coefficients between any arbitrary pair of random variables will be significant at those huge sample levels, anyway. And?
As noted in a Wikipedia entry, “given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero.” (http://en.wikipedia.org/wiki/Effect_size ).
For example, a correlation coefficient of r = 0.04 would be significant at a 95% confidence level if coming from a 10,000-sample (t-calc = 4.003 >> t-table = 1.96) while a correlation coefficient of r = 0.01 would be significant at a 95% confidence level if coming from a 100,000-sample (t-calc = 3.162 >> t-table = 1.96). And? This proves nothing, especially when the magnitude of a “signal” approaches the magnitude of its “noise”.
As noted at the above Wikipedia entry, a correlation coefficient of 0.1 is strongly statistically significant when sample size is 1000, (t-calc = 3.175 >> t-table = 1.96) but reporting only the small p-value from this analysis could be misleading if a correlation of 0.1 is too small to be of interest in a particular application. (http://en.wikipedia.org/wiki/Effect_size ).
Statistical significance of extremely small r values is not surprising, as it is just a mathematical consequence of the fact that a t-value is a function (F) of a weighted ratio: the ratio of explained-to-unexplained variations weighted by the number of degrees of freedom:
F(r, n) = t = SQRT[(r^2/(1 – r^2))*(n – 2)]
F(r, n) = t = r*SQRT[(n – 2)/(1 – r^2)]
For a given r value, increasing n increases t. No surprise here. One thing is what a math equation tells you and another different thing is what the nature and obvious boundaries of a physical system tell you.
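The t-calc values quoted above, and the way t climbs with n while r stays put, can be checked with a short Python sketch. The function name is mine; the formula is the one given in the text.

```python
import math

def t_value(r, n):
    """F(r, n) = t = r*SQRT[(n - 2)/(1 - r^2)]."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# The three cases quoted in the text.
for r, n in ((0.04, 10_000), (0.01, 100_000), (0.1, 1_000)):
    print(f"r = {r}, n = {n:,}: t = {t_value(r, n):.3f}")

# Same r, growing n: t keeps climbing although r^2 never changes.
for n in (100, 1_000, 10_000, 100_000):
    print(f"r = 0.04, n = {n:,}: t = {t_value(0.04, n):.3f}, r^2 = {0.04 ** 2:.4f}")
```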
At trivially low r values any claim with regards to the statistical significance or strength of some results proves nothing and one cannot do much with such trivial r values. For instance for r = 0.04, r^2 = 0.0016, meaning that 1 – r^2 = 0.9984 or 99.84% of the variations in the dependent variable (y) are not explained by variations in the independent variable (x).
In such a scenario, assessing the effect of x on y is a futile exercise. Such a model would be useless for drawing conclusions or predicting anything. And here is the point that many SEOs at SEOMOZ (http://www.seo.co.uk/seo-news/seo-tools/the-seomoz-lda-tool-%E2%80%93-our-disappointing-findings.html , Fishkin, Hendrickson, and others elsewhere) don’t seem to grasp:
When a correlation coefficient is useless for all practical purposes.
If the raw data constantly changes, that’s another “Chaos Layer” that compounds the problem.
Enter Cohen’s Power
According to Cohen’s work, when conducting a sample size study of correlation coefficients, one needs to consider the required confidence level and power of the test, the desired probability for Type I and Type II Errors, and the hypothesized or anticipated correlation coefficient (http://www.medcalc.be/manual/correlation_coefficient.php ). One cannot just use an arbitrary sample size for testing things.
In general, given any three of the following, the fourth one can be determined (http://www.statmethods.net/stats/power.html ):
1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 – P(Type II error) = probability of finding an effect that is there
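To make the "given any three, find the fourth" idea concrete, here is a minimal Python sketch that solves for the sample size needed to detect a hypothesized correlation. It relies on the common Fisher-z normal approximation, n = ((z_alpha/2 + z_beta)/atanh(r))^2 + 3, which is my choice for illustration and not a procedure taken from the sources cited here. Dedicated power analysis software may use exact methods and give slightly different numbers.

```python
import math
from statistics import NormalDist

def required_n(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect correlation r (two-sided test).

    Assumption: Fisher-z normal approximation, not an exact method.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

for r in (0.04, 0.1, 0.3, 0.5):
    print(f"hypothesized r = {r}: n of about {required_n(r):,}")
```

The point of the exercise is that the sample size follows from the effect one expects to detect, rather than being picked arbitrarily.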
One also needs to consider what is the statistical parameter that is undergoing the power analysis. One needs to ask questions like the following:
Are we testing means from a given group? http://www.nss.gov.au/nss/home.nsf/pages/Sample+Size+Calculator+Description?OpenDocument
Are we testing means from different groups? http://www.ncbi.nlm.nih.gov/pmc/articles/PMC137461/
Are we testing correlation coefficients? Read Simon’s take on the impact of sample size on the desired level of precision in correlation coefficients (http://www.childrens-mercy.org/stats/weblog2005/CorrelationCoefficient.asp ).
Are we interested in significance level, effect size, sample size, or power?
When conducting an effect size analysis one must keep in mind that effect sizes estimate the strength of a possible relationship, rather than assigning a significance level. Effect sizes do not determine significance levels, or vice versa.
So, how do we go about implementing Power Analysis?
For those interested in implementing power analysis in the R language, I recommend the libraries at http://www.statmethods.net/stats/power.html
Software for conducting power analysis is also available elsewhere, as shown in the following table. My favorites are G*Power and SPSS SamplePower (http://www.spss.com/software/statistics/samplepower/).
|Power Analysis Software. Source: http://www.epibiostat.ucsf.edu/biostat/sampsize.html|
|G*Power License: Free||Uses both exact and approximate methods to calculate power. It will deal with sample size/power calculations for t-tests, 1-way ANOVAs, regression, correlation, and chi-square goodness of fit. For t-tests and ANOVAs you find the effect size by supplying mean and variance information. For correlation coefficients the effect size is a function of r2. http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/|
|PC-Size License: Free||Deals with sample size/power calculations for t-tests, 1-way and 2-way ANOVA, simple regression, correlation, and comparison of proportions. http://www.esf.edu/efb/gibbs/monitor/usingDSTPLANandPCSIZE.pdf|
|DSTPLAN License: Free||Uses approximate methods to calculate power. It will calculate sample size/power for t-tests, correlation, a difference in proportions, 2xN contingency tables, and various survival analysis designs. http://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=41|
|PS License: Free||Performs sample size/power calculations for t-tests, Chi-square, Fisher’s exact, McNemar’s, simple regression, and survival analysis. http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize|
|TIBCO Spotfire S+ License: Paid||The only commercially-supported statistical analysis software that delivers a cross-platform IDE for the award-winning S programming language, the ability to analyze gigabyte class data sets on the desktop, and a package system for sharing, reuse and deployment of analytics in the enterprise and in validated environments. Used widely in validated production environments (e.g., 21 CFR Part 11). http://spotfire.tibco.com/products/s-plus/statistical-analysis-software.aspx|
|NQuery Advisor License: Paid||Performs sample size/ power calculations for t-tests, 1 and 2 way ANOVAS, tests of contrasts in 1-way ANOVAs, univariate repeated measures designs, regression (simple, multiple and logistic), correlation, difference of proportions, 2XN contingency tables, and survival analyses. http://www.statsol.ie/nquery/nquery.htm|
|PASS License: Paid||Performs sample size/power calculations for z-tests, t-tests, 1, 2, and 3-way ANOVAs, univariate repeated measures designs, regression (simple, multiple and logistic), correlations, difference in proportions, 2xN contingency tables, survival analyses and simple non-parametric analyses. http://www.ncss.com/pass.html|
|Stata License: Paid||It has some simple built-in power and sample size functions. http://www.stata.com/|
|SPSS SamplePower License: Paid||If your sample size is too small, you could miss important research findings. If it’s too large, you could waste valuable time and resources. Finds the right sample size for your research in minutes and test the possible results before you begin your study, with IBM SPSS SamplePower. Strikes the right balance among confidence level, statistical power, effect size, and sample size using IBM SPSS SamplePower. Compares the effects of different study parameters with its flexible analytical tools. http://www.spss.com/software/statistics/samplepower/|
LDA and Google’s ranks well correlated?
After the hilarious example of this guy with the SEOMOZ LDA tool (http://smackdown.blogsblogsblogs.com/2010/09/09/proof-that-the-new-seomoz-tools-is-at-least-half-accurate/ ) I can only laugh out loud. Has anyone tried something like that?
Regarding the new fiasco with their LDA tool. Oh, no, another one… (http://www.seomoz.org/blog/lda-correlation-017-not-032) : What can I say? They sound pathetic and apologetic. The words overhyped, shitty, sloppy, flawed, etc are not enough to describe their “research work”.
What will happen now with those Mute Speakerphones that were misled? Those that listen to fools become one.
I don’t feel any sympathy for their 15 minutes of “honesty”. The damage was done already to naïve readers.
Also, note that this latest flaw was discovered by them. It was not the result of any peer review process from external referees, as those throwing a towel at them would like to believe.
As mentioned before, beware of SEOs statistical “studies” and their quack “science” (https://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/ ), especially if coming from SEOMOZ.
Probably their snakeoil will make a comeback soon. (Oh, no. Again?)
If they still think they have a valid LDA implementation, why not announce it at David Blei’s Topic-Models wherein a community of LDA experts will review it and compare it against other implementations?
Two things can happen:
(a) It will be reviewed.
(b) It will be ignored.
I “invite” them to do so.
Please, just don’t show up with your snakeoil, yellow shoes, your seo mom, paid cheerleaders, vested investors, overhyped claims, etc, etc.
More on their hype machine here: http://skitzzo.com/archives/seomoz-hype-machine.php
It appears that even Danny Sullivan is not buying SEOmoz’s “research” on LDA. Accordingly, “He didn’t think it was the remarkable change that SEOmoz was making it out to be.” (http://outspokenmedia.com/internet-marketing-conferences/evening-forum-with-danny-sullivan/). He even confronted and put into question their “highly correlated” numbers. And that was even before they recanted.
One of the best known techniques for transforming correlation coefficient (r) values into weighted additive quantities is the r-to-Z transformation due to Fisher.
Fisher’s r-to-Z transformation is an elementary transcendental function called the inverse hyperbolic tangent function. The reverse, a Z-to-r transformation, is therefore a hyperbolic tangent function.
On Windows computers, these functions are built into the scientific calculator program, which is accessible by navigating to Start > All Programs > Accessories > Calculator. Microsoft Excel also has these built in as the ATANH and TANH functions.
Fisher’s r-to-Z transformation is applicable only to bivariate normal distributions; i.e., if the (x, y) paired variables both describe bell-shaped curves. Nontrivial errors arise if one of the variables is not normally distributed.
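For those who prefer code to a calculator, the two transformations map directly onto the standard library functions math.atanh and math.tanh in Python. The wrapper names below are hypothetical, and the sketch says nothing about whether the bivariate normality restriction holds for your data.

```python
import math

def r_to_z(r):
    """Fisher's r-to-Z transformation: inverse hyperbolic tangent of r."""
    return math.atanh(r)

def z_to_r(z):
    """Reverse Z-to-r transformation: hyperbolic tangent of Z."""
    return math.tanh(z)

for r in (0.17, 0.32, 0.5, 0.9):
    z = r_to_z(r)
    print(f"r = {r:.2f} -> Z = {z:.6f} -> back to r = {z_to_r(z):.2f}")
```

Note that Z stays close to r for small values and departs from it more and more as r approaches 1.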
5-22-2016 Update: We have developed a tool for easily computing these transformations and explaining the bivariate normality restriction.
The tool is available at
Beware of Sloppy Calculations
Correlation coefficient arithmetic averages are not computable directly from individual values.
Indeed, it is not possible to add, subtract, average or take standard deviations out of raw r values.
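A tiny numeric illustration may help here. The Python sketch below uses made-up data (two small groups of points) and compares the arithmetic mean of the two r values with the r obtained from the combined data. It is not the Self-Weighting Model or any published averaging method; it only shows that raw r values do not behave like additive quantities.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
xa, ya = [1, 2, 3], [1, 2, 3]  # group A
xb, yb = [4, 5, 6], [4, 6, 5]  # group B

r_a = pearson_r(xa, ya)
r_b = pearson_r(xb, yb)
mean_of_rs = (r_a + r_b) / 2
r_pooled = pearson_r(xa + xb, ya + yb)

print(f"r_A = {r_a:.4f}, r_B = {r_b:.4f}")
print(f"arithmetic mean of r values = {mean_of_rs:.4f}")
print(f"r of the combined data      = {r_pooled:.4f}")
```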
Unfortunately some researchers with a limited knowledge of Statistics have published papers containing such gross errors. What is worse, reviewers of those papers are either not statisticians or have been lazy enough to overlook the concept, leading graduate students and post docs into error.
Search marketers are also buying into the error. An example of this is the SEOs from SeoMOZ (now MOZ) promoting quack “science” and sloppy “statistical studies”. If you are an SEO and still want to believe their snakeoil marketing, that’s up to you.
In the 04/23/2010 blog post, Beware of SEO Statistical Studies, we commented on incorrect statistical methods used by SEOs in two different blogs. In that post, it was mentioned that we agreed with the comments made by one dissenting poster with regard to an SEOMOZ post published by its owner Rand Fishkin (aka randfish) and titled The Science of Ranking Correlations. That SEOMOZ post subsumed a methodology implemented by a member of SEOMOZ, Ben Hendrickson (aka ijustwritecode).
We ended our post with the following lines:
“To sum up, beware of SEO ”Science” and their statistical “studies”.
“I hope this helps.”
“PS: I updated this post to add the figure above and few additional comments.”
That was it.
Two days later, on 04/26/2010, the dissenting poster (Branko Rithman; aka neyne, whiteweb_b) stopped by this blog and explained why in his own opinion SEOMOZ was doing incorrect science. We went through an exchange of comments with Rithman that spanned two days and ended on 04/28/2010. All comments exchanged were limited to referencing and commenting on facts from the Statistics literature. No personal attacks were directed at SEOMOZ, Fishkin, or Hendrickson.
On 06/12/2010, almost two months later, that post was referenced by Ted Dziuba in the post SEO is Mostly Quack Science, wherein Dziuba was very critical of SEOs and SEOMOZ, referring to their work as “quack science”.
While we agreed with many of the remarks made by Dziuba, we disagree with at least one. In his post he stated that “A correlation of zero suggests that the two variables are completely independent of one another.”. Actually, this is incorrect, as r = 0 only means that the variables are not linearly correlated. In our IR Watch Newsletter we have explained that r = 0 implies nothing about whether variables are dependent, independent, random (a claim later made by Fishkin), or deterministic, or whether they might or might not be nonlinearly correlated.
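A quick way to see this point is a small numerical sketch. The data below are made up for illustration (they do not come from any of the posts discussed): y is fully determined by x, yet Pearson's r is essentially zero because the relationship is not linear.

```python
import numpy as np

# Hypothetical data: x is symmetric around zero, y depends on x exactly.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2  # deterministic, nonlinear dependence

# Pearson's r only measures linear association.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.6f}")  # essentially zero, despite full dependence
```

So r = 0 here says nothing about independence; it only says there is no linear trend.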
On 06/14/2010, Sean Golliher, founder and publisher of Search Engine Marketing Journal (SEMJ.org), stopped by to add comments on why SEOMOZ’s statistical treatment was incorrect. I responded: “Unfortunately these type of “research” gives a black eye to the SEO industry. I limited the length of that post since I was rushing to put out our monthly IRW newsletter which was already late.
The same day, Hendrickson dropped by this blog and claimed that all of the comments were a personal attack and criticism directed at him, despite the fact that his name was never mentioned or even implied in any of my previous posts. [PS. I just quoted Rithman]
“In your post, you criticize me…” blah, blah, blah…
He then insisted that what they were doing was a correct statistical treatment. Of course, he did not present any mathematical reasoning to support his claims. He repeated that he was right and demanded a retraction. Since he evidently took our posts personally and we were under time constraints to publish IRW, I decided to hold his comment until a full response to his virulent accusations was ready for public consumption.
It was after his remarks, given with no statistical or mathematical evidence, that we called into question whatever knowledge Hendrickson and Fishkin might have about Statistics. Their indiscriminate references to power and exponential laws confirm this perception. It was after their reactions that we called what SEOMOZ was doing “quack science”, as can be seen from this 06/16/2010 post. As of today, I stand by that statement. I ended that post with the following line:
“I am the one that is now demanding a public retraction from you both for putting out crap “science” at SEOMOZ.”
That’s a strong statement and I don’t regret it. Why? Because often search engine marketers and their cheerleaders put out a lot of false knowledge and opinions labeled as “science” or “study”, giving a black eye to the serious, ethical sector of their own industry.
The next day, on 06/17/2010, Dziuba’s post was featured at Sphinn by Jill Whalen, where a heated debate developed between SEOs and where our post was commented on. Looking at that debate, we concluded that the best way to dispel so many myths and so much confusion was through education.
On 06/20/2010, it was mentioned that we planned to write two tutorials on statistics. On 06/25/2010 we announced the release of the first tutorial, A Tutorial on Standard Errors.
In that tutorial we explained the right way of computing standard errors. We explained why it is incorrect to compute the mean of correlation coefficients as an arithmetic average, to treat that mean correlation coefficient as one would treat the mean of x observations, and to compute standard deviations and standard errors from it. We explained that Fisher r-to-Z transformations are needed and that sometimes pooled standard errors must even be computed.
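As a minimal sketch of the r-to-Z idea (not the tutorial's own code), the snippet below builds an approximate 95% confidence interval for a single r by working in Z space. The function name `fisher_ci`, the sample size n = 30, and the 1.96 critical value are assumptions chosen for illustration; the standard error 1/sqrt(n - 3) is the usual large-sample result for Fisher's Z under bivariate normality.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for r via Fisher's r-to-Z transformation."""
    z = math.atanh(r)                # r-to-Z: Z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)      # standard error of Z (assumes bivariate normality)
    lo_z, hi_z = z - z_crit * se, z + z_crit * se
    return math.tanh(lo_z), math.tanh(hi_z)  # Z-to-r back-transformation

# Hypothetical inputs: r = 0.877 observed on n = 30 paired observations.
r, n = 0.877, 30
lo, hi = fisher_ci(r, n)
print(f"r = {r}, approx. 95% CI = ({lo:.3f}, {hi:.3f})")
```

Note that the interval is not symmetric around r: it extends farther below r than above it, which is one symptom of the skewed sampling distribution of r.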
It appears Hendrickson was monitoring that release, since later that Friday (there is a time zone difference of almost 4 hours between Seattle and San Juan) he published a “rebuttal” at SEOMOZ titled Statistics: a win for SEO, wherein he tried to refute the main statements of the tutorial, perhaps as a damage control tactic.
Hendrickson’s “rebuttal” was limited to repeating his previous claims, again with no mathematical proofs. He then made some laughable remarks about Z scores exploding at r = 1 and quoted reference work that he obviously never read thoroughly; otherwise he would have realized that one of the papers he cited was based on a previous work with an obvious theoretical error: computing arithmetic averages of correlation coefficients.
In emails exchanged with a co-author of that work, the author acknowledged that their approach of adding and averaging correlation coefficients was indisputably incorrect after all. You can read the full story and the comments exchanged with him in our second tutorial, A Tutorial on Correlation Coefficients, published on 07/08/2010. The tutorial is also a direct response to the Hendrickson and Fishkin “rebuttal”.
In the SEOMOZ “rebuttal” there were a few comments made by Hendrickson (and later endorsed by Fishkin at Sphinn) that are diversion tactics and plain lies. That is common in the art of misinformation: second-guess your opponent and write a rebuttal to comments that were never made. This relates to my comments on PCA.
What Hendrickson claimed I said about PCA: “Rebuttal To Claim That PCA Is Not A Linear Method”
What I actually said about PCA:
This is what I said and stand by.
About the word “linearity” in PCA: Be careful when you, Ben and Rand, talk about linearity in connection with PCA as no assumption needs to be made in PCA about the distribution of the original data. I doubt you guys know about PCA, a dimensionality reduction technique frequently used in data mining. It is also used in clustering analysis as one way of addressing the so-called K-Means initial centroid problem, though it can fail with overlapping clusters.
PCA can be applied to linear and nonlinear data. Ben, have you ever heard about nonlinear PCA? But even if you don’t, let me say this: Nothing more nonlinear than images, clusters, noisy scattered data, etc–yet PCA can be applied to these scenarios to extract structural components hinting at possible patterns hidden by noisy dimensions.
Linearity in PCA does not refer to the original variables, but about linearity in the transformed variables used in PCA because these are transformed into linear combinations. To understand linear combinations a review on linear algebra helps. According to Gering from MIT (http://people.csail.mit.edu/gering/areaexam/gering-areaexam02.pdf )
“Principle Component Analysis (PCA) replaces the original variables of a data set with a smaller number of uncorrelated variables called the principal components….The method is linear in that the new variables are a linear combination of the original. No assumption is made about the probability distribution of the original variables.”
The linearity assumption is with the basis vectors. And even so, variants of PCA do exist to deal with nonlinearity.
Incidentally, on 06/28/2010, a poster from SEOMOZ, Daniel Deceuster, dropped by our blog wherein he was not happy with SEOMOZ claims. My 06/30/2010 comments to him are available
and are given below:
“Thank you for stopping by. I’m very busy these days, so sorry for not responding before.
Good luck with your tests. Please keep in mind that with a problem with many variables, testing all of them at the same time allows one to account for outliers, noisy variables and possible interactions between variables which often introduce nonlinearity.
A one-variable-at-a-time testing is prone to ignore the above. Arbitrarily removing nonlinearity can introduce spurious linearity. There are methods for multivariate testing: modified sequential simplex optimization, factorial designs, and yes, PCA.
For instance, a problem with many variables N, taken as dimensions, might look nonlinear. PCA creates a new set of variables that are linear combinations of the original variables and linearly independent of each other.
The odds are that when a tester deals with many variables he does not know a priori whether the original data is truly linear or nonlinear, or if there are subspaces wherein it is one way or the other. Fortunately, no assumptions need to be made in PCA about the distribution of the original data.
The goal is to represent the problem using fewer dimensions M such that M < N. This is why PCA is considered a dimensionality reduction technique. The set of M dimensions is obtained by finding the principal components (PCs). The PCs give the direction of greater variability and provide evidence for linearity in a particular direction of a reduced space. So, one can find linearity within apparently nonlinear data.
There are many problems (linear and nonlinear) in which PCA is applied. And there are many advances and new research in the PCA and SVD area to make PCA robust. After the upcoming tutorial on correlation coefficients, I might have to put out new tutorials on the topic to dispel so much misinformation running around.”
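The PCA description in the comment above (new variables that are linear combinations of the originals, uncorrelated with each other, with M < N dimensions kept) can be sketched with a plain SVD. This is only an illustration under stated assumptions: the data are synthetic, the seed, sample size, direction vector, and noise level are made up, and SVD of the centered data is just one common way to compute principal components.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data: N = 3 variables that mostly vary along one hidden direction.
t = rng.normal(size=200)
X = np.outer(t, [1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=(200, 3))

# Center each variable; no distributional assumption is needed for this step.
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions (basis vectors).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores: each new variable is a linear combination of the original variables.
scores = Xc @ Vt.T

# Fraction of total variance carried by each principal component.
explained = S ** 2 / np.sum(S ** 2)
print("explained variance ratio:", np.round(explained, 4))

# Keep M = 1 < N = 3 dimensions: the direction of greatest variability.
reduced = scores[:, :1]

# The new variables are uncorrelated: off-diagonal covariances are ~0.
cov = np.cov(scores, rowvar=False)
print("max off-diagonal covariance:", np.abs(cov - np.diag(np.diag(cov))).max())
```

In this made-up case nearly all of the variance falls on the first component, so a single dimension summarizes the three original variables.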
The same misinformation tactic was used by Hendrickson regarding what I said about Pearson and Spearman correlation coefficients. Here he seems to use bidirectional extrapolation, which many politicians have mastered when debating. [“If A to B is true, then B to A is true.” “If A to B is false, then B to A is false.” “Let the opponents disprove it.”]
We know that bidirectional extrapolation in Statistics and Science is incorrect. For instance, we know that:
If variables are independent, there is no correlation (r = 0). The reverse is not necessarily true.
Pearson’s can be applied to linear paired data. The reverse is not necessarily true. (*)
Nonlinear paired data requires Spearman’s. The reverse is not necessarily true. (*)
*The asterisk is for this: If a linear transformation of variables is possible, both Pearson’s and Spearman’s can be used.
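To make the Pearson versus Spearman distinction concrete, here is a small sketch on made-up data where y grows exponentially with x (monotonic but nonlinear). Spearman's coefficient is computed here as Pearson's r on the ranks, which is valid for this example because the data contain no ties; that shortcut is an assumption of the sketch, not a general-purpose implementation.

```python
import numpy as np

# Hypothetical paired data: monotonic, clearly nonlinear relationship.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

# Pearson's r measures linear association on the raw values.
pearson = np.corrcoef(x, y)[0, 1]

def ranks(values):
    """Ranks 0..n-1; only correct when there are no tied values."""
    return np.argsort(np.argsort(values))

# Spearman's coefficient: Pearson's r computed on the ranks.
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(f"Pearson  = {pearson:.3f}")   # noticeably below 1
print(f"Spearman = {spearman:.3f}")  # 1.000 for a perfectly monotonic relation
```

Taking the logarithm of y would make the relationship linear, which is the situation the asterisk note describes: after such a transformation both coefficients can be used.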
All my comments about Pearson’s and Spearman’s are available at the original blog post for anyone to read.
Some False Claims that Evaporate Under Examination
We have shown in A Tutorial on Correlation Coefficients why the “science” approach and “studies” of SEOMOZ, Hendrickson, and Fishkin are mathematically and statistically incorrect.
First, correlation coefficients are not additive. For centered data, the geometrical equivalent of an r value is a cosine. Adding, subtracting, or averaging cosines does not give a new cosine. For non-centered data there is still a cosine term in the formula for r.
Second, a mean from x observations is an unbiased estimator while a mean correlation coefficient is a biased estimator. Treating an average r value as a mean x is the same as saying that a mean r is an unbiased estimator. Statistically speaking, this is a gross error. This is easy to prove.
Third, professional and academic statisticians know that unlike mean x values, r values have an inherent bias; i.e., their distribution is bivariate and inherently skewed. Thus, the computed mean r value is also a biased estimator. All this affects hypothesis testing. A researcher cannot compute mean values and associated standard errors arbitrarily with one formula just because he likes to pick one in particular.
I will now give an example described in the Appendix section of our tutorial.
Let r1= 0.877 and r2 = 0.577. One might think that the average over these is 0.727 or about 0.73 with an associated Coefficient of Determination of 0.73*0.73 = 0.53 or 53%. Assuming this is the same as implying that the mean is an unbiased estimator, not skewed toward r1 or r2.
However, converting r1 and r2 to Z scores and averaging gives an average Z score of 1.010. Converting this result back to r gives a mean value of 0.766 or about 0.77 with an associated Coefficient of Determination of 0.77*0.77 = 0.59 or 59%. Note that now the mean value is skewed toward r1, confirming that the mean value is a biased estimator. Statisticians know the inherent bias is a property of r values and their mean values.
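The arithmetic of this example can be reproduced in a few lines. This is only a check of the numbers quoted above, using the unweighted average of Z scores as in the example; the variable names are mine.

```python
import math

r1, r2 = 0.877, 0.577

# Naive approach: arithmetic average of the raw r values.
naive_mean = (r1 + r2) / 2

# Fisher approach: r-to-Z, average in Z space, then Z-to-r.
z_mean = (math.atanh(r1) + math.atanh(r2)) / 2
fisher_mean = math.tanh(z_mean)

print(f"naive mean r  = {naive_mean:.3f}, r^2 = {naive_mean ** 2:.2f}")    # 0.727, 0.53
print(f"mean Z        = {z_mean:.3f}")                                     # 1.010
print(f"Fisher mean r = {fisher_mean:.3f}, r^2 = {fisher_mean ** 2:.2f}")  # 0.766, 0.59
```

The two approaches differ by about six percentage points in the Coefficient of Determination, which is far from negligible when interpreting results.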
If SEOMOZ wants to get into the misinformation business and sell snakeoil, sure, they are entitled to that. But that ruins what little credibility they are getting these days with their “science knowledge”. Some of their cheerleaders, groupies, or easy-to-impress visitors might elect to stick with them. That’s fine, as they deserve what they are getting. So far, Statistics is a loss for them, anyway. All I can say is this: Those that listen to fools become one.
I’m not sure if Hendrickson misled Fishkin or if it was the other way around. In real companies, employees get fired for less than that. Since Fishkin cannot fire himself, his burden of proof is now double. And for the fourth time: I am still demanding a retraction.
PS. I updated this post to refine some lines.
I’m pleased to publish the tutorial on correlation coefficients. So, start ignoring the quack science from the usual SEO losers. Statistics is a loss for them, anyway.
Instead of speculating about the scientific work of others or misquoting their work, we did what any serious researcher would do: We contacted them directly.