On SEOMOZ “Knowledge” about Statistics

In the 04/23/2010 blog post, Beware of SEO Statistical Studies, we commented on incorrect statistical methods used by SEOs in two different blogs. In that post, we mentioned that we agreed with the comments made by one dissenting poster regarding an SEOMOZ post published by its owner, Rand Fishkin (aka randfish), and titled The Science of Ranking Correlations. That SEOMOZ post incorporated a methodology implemented by a member of SEOMOZ, Ben Hendrickson (aka ijustwritecode).

We ended our post with the following lines:

“To sum up, beware of SEO “Science” and their statistical “studies”.”

“I hope this helps.”

“PS: I updated this post to add the figure above and few additional comments.”

That was it.

Three days later, on 04/26/2010, the dissenting poster (Branko Rithman; aka neyne, whiteweb_b) stopped by this blog and explained why, in his own opinion, SEOMOZ was doing incorrect science. We went through an exchange of comments with Rithman that spanned two days and ended on 04/28/2010. All comments exchanged were limited to referencing and commenting on facts from the Statistics literature. No personal attacks were directed at SEOMOZ, Fishkin, or Hendrickson.

On 06/12/2010, almost two months later, that post was referenced by Ted Dziuba in the post SEO is Mostly Quack Science, wherein Dziuba was very critical of SEOs and SEOMOZ, referring to their work as “quack science”.

While we agreed with many of the remarks made by Dziuba, we disagree with at least one. In his post he stated that “A correlation of zero suggests that the two variables are completely independent of one another.” Actually, this is incorrect, as r = 0 only means that the variables are not linearly correlated. In our IR Watch Newsletter we have explained that r = 0 implies nothing about whether variables are dependent, independent, random (a claim later made by Fishkin), or deterministic, or whether or not they might be nonlinearly correlated.
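A minimal sketch of this point, using made-up data that is not taken from any of the posts discussed here: y is fully determined by x (y = x^2), yet Pearson's r comes out as zero because the relationship has no linear component over a range symmetric around zero. The helper name pearson_r is just an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends deterministically on x (y = x^2),
# but the dependence is symmetric around x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")  # the variables are dependent, yet r is zero
```

So r = 0 here says only that there is no linear trend; it says nothing about independence.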

On 06/14/2010, Sean Golliher, founder and publisher of Search Engine Marketing Journal (SEMJ.org), stopped by to add comments on why SEOMOZ's statistical treatment was incorrect. I responded: “Unfortunately these type of “research” gives a black eye to the SEO industry.” I limited the length of that post since I was rushing to put out our monthly IRW newsletter, which was already late.

The same day, Hendrickson dropped by this blog and claimed that all the comments were a personal attack and criticism directed at him, despite the fact that his name was never mentioned or even implied in any of my previous posts. [PS. I just quoted Rithman]

Hendrickson stated:

“In your post, you criticize me…” blah, blah, blah…

He then insisted that what they were doing was a correct statistical treatment. Of course, he did not present any mathematical reasoning to support his claims. He repeated that he was right and demanded a retraction. Since he evidently took our posts personally and we were under time constraints to publish IRW, I decided to hold his comment until a full response to his virulent accusations was ready for public consumption.

It was after his remarks, given with no statistical or mathematical evidence, that we called into question whatever valid knowledge Hendrickson and Fishkin might have about Statistics. Their indiscriminate references to power and exponential laws confirm this perception. It was after their reactions that we called what SEOMOZ was doing “quack science”, as can be seen from this 06/16/2010 post. As of today, I stand by that statement. I ended that post with the following line:

“I am the one that is now demanding a public retraction from you both for putting out crap “science” at SEOMOZ.”

That’s a strong statement and I don’t regret it. Why? Because often search engine marketers and their cheerleaders put out a lot of false knowledge and opinions labeled as “science” or “study”, giving a black eye to the serious, ethical sector of their own industry.

The next day, on 06/17/2010, Dziuba’s post was featured at Sphinn by Jill Whalen, where a heated debate developed among SEOs and our post was commented on. Looking at that debate, we concluded that the best way to dispel so many myths and so much confusion was through education.

On 06/20/2010, it was mentioned that we planned to write two tutorials on statistics. On 06/25/2010 we announced the release of the first tutorial, A Tutorial on Standard Errors.

In that tutorial we explained the right way of computing standard errors. We explained why it is incorrect to compute the mean of correlation coefficients as an arithmetic average, to treat that mean correlation coefficient as one would treat the mean of x observations, and to compute standard deviations and standard errors from it. We explained that r-to-Z Fisher Transformations are needed and that sometimes even pooled standard errors must be computed.
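As a rough illustration of the r-to-Z route, here is a sketch that uses the standard textbook result that Z = atanh(r) has an approximate standard error of 1/sqrt(n - 3). The function name fisher_ci and the inputs r = 0.5, n = 28 are hypothetical choices for this example and are not taken from the tutorial.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's r-to-Z.

    r      : sample correlation coefficient
    n      : number of paired observations (must exceed 3)
    z_crit : normal critical value (1.96 for roughly 95% coverage)
    """
    z = math.atanh(r)                    # r-to-Z transformation
    se_z = 1.0 / math.sqrt(n - 3)        # standard error on the Z scale
    lo = math.tanh(z - z_crit * se_z)    # back-transform the lower limit to r
    hi = math.tanh(z + z_crit * se_z)    # back-transform the upper limit to r
    return se_z, lo, hi

# Hypothetical inputs chosen so that sqrt(n - 3) is a round number.
se_z, lo, hi = fisher_ci(r=0.5, n=28)
print(f"SE(Z) = {se_z:.3f}, interval for r: ({lo:.3f}, {hi:.3f})")
```

Note that the resulting interval is not symmetric around r: the upper limit sits closer to r than the lower one, which is one symptom of the skewness that a plain "mean plus or minus standard error" treatment of r ignores.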

It appears Hendrickson was monitoring that release, since later that Friday (there is a time zone difference of almost 4 hours between Seattle and San Juan) he published a “rebuttal” at SEOMOZ titled Statistics: a win for SEO, wherein he tried to refute the main statements of the tutorial, perhaps as a damage control tactic.

Hendrickson’s “rebuttal” was limited to repeating his previous claims, again with no mathematical proofs, to making some derisory remarks about Z scores exploding at r = 1, and to quoting reference work that he obviously never read thoroughly. Otherwise he would have realized that one of the papers he cited was based on a previous work with an obvious theoretical error: computing arithmetic averages of correlation coefficients.

In emails exchanged with a co-author of that work, the author recognized that their approach of adding and averaging correlation coefficients was indisputably incorrect after all. You can read the full story and the comments exchanged with him in our second tutorial, A Tutorial on Correlation Coefficients, published on 07/08/2010. The tutorial is also a direct response to Hendrickson and Fishkin’s “rebuttal”.

In the SEOMOZ “rebuttal” there were a few comments made by Hendrickson (and later endorsed by Fishkin at Sphinn) that are diversion tactics and plain lies. That is common in the art of misinformation: second-guess your opponent and write a rebuttal to comments that were never made. This relates to my comments on PCA.

What Hendrickson claimed I said about PCA: “Rebuttal To Claim That PCA Is Not A Linear Method”

What I actually said about PCA:
This is what I said and stand by.

About the word “linearity” in PCA: Be careful when you, Ben and Rand, talk about linearity in connection with PCA as no assumption needs to be made in PCA about the distribution of the original data. I doubt you guys know about PCA, a dimensionality reduction technique frequently used in data mining. It is also used in clustering analysis as one way of addressing the so-called K-Means initial centroid problem, though it can fail with overlapping clusters.

PCA can be applied to linear and nonlinear data. Ben, have you ever heard about nonlinear PCA? But even if you haven’t, let me say this: Nothing more nonlinear than images, clusters, noisy scattered data, etc–yet PCA can be applied to these scenarios to extract structural components hinting at possible patterns hidden by noisy dimensions.

Linearity in PCA does not refer to the original variables, but to linearity in the transformed variables used in PCA, because these are linear combinations of the original ones. To understand linear combinations, a review of linear algebra helps. According to Gering from MIT (http://people.csail.mit.edu/gering/areaexam/gering-areaexam02.pdf )

“Principle Component Analysis (PCA) replaces the original variables of a data set with a smaller number of uncorrelated variables called the principal components….The method is linear in that the new variables are a linear combination of the original. No assumption is made about the probability distribution of the original variables.”

The linearity assumption is with the basis vectors. And even so, variants of PCA do exist to deal with nonlinearity.
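To make the “linear combination” point concrete, here is a minimal two-variable sketch. It is only an illustration under stated assumptions: the data is made up (y = x^2, clearly nonlinear), the helper names pca_2d and covariance are hypothetical, and the closed-form rotation angle used here applies only to the two-dimensional case. Real PCA software uses a general eigen or SVD decomposition.

```python
import math

def covariance(us, vs):
    """Sample covariance of two equal-length sequences."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / (n - 1)

def pca_2d(xs, ys):
    """PCA for two variables via the closed-form rotation of the covariance matrix."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    xc = [x - mean_x for x in xs]          # centered original variables
    yc = [y - mean_y for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle of the first principal axis (direction of greatest variance).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Each principal component is a linear combination of the centered originals.
    pc1 = [c * a + s * b for a, b in zip(xc, yc)]
    pc2 = [-s * a + c * b for a, b in zip(xc, yc)]
    return (c, s), pc1, pc2

# Hypothetical nonlinear data: y = x^2.
xs = [0, 1, 2, 3, 4, 5]
ys = [x * x for x in xs]

weights, pc1, pc2 = pca_2d(xs, ys)
print("PC1 weights on (x, y):", weights)
print("cov(PC1, PC2):", covariance(pc1, pc2))
```

The new variables are uncorrelated with each other and preserve the total variance, even though nothing was assumed about the shape or distribution of the original data.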

Incidentally, on 06/28/2010, a poster from SEOMOZ, Daniel Deceuster, dropped by our blog and said he was not happy with SEOMOZ’s claims. My 06/30/2010 comments to him are given below:

“Hi, Daniel:
Thank you for stopping by. I’m very busy these days, so sorry for not responding before.

Good luck with your tests. Please keep in mind that with a problem with many variables, testing all of them at the same time allows one to account for outliers, noisy variables and possible interactions between variables which often introduce nonlinearity.

A one-variable-at-a-time testing is prone to ignore the above. Arbitrarily removing nonlinearity can introduce spurious linearity. There are methods for multivariate testing: modified sequential simplex optimization, factorial designs, and yes, PCA.

For instance, a problem with many variables N, taken as dimensions, might look nonlinear. PCA creates a new set of variables that are linear combinations of the original variables and linearly independent of each other.

The odds are that when a tester deals with many variables he does not know a priori whether the original data is truly linear or nonlinear, or if there are subspaces wherein it is one way or the other. Fortunately, no assumptions need to be made in PCA about the distribution of the original data.

The goal is to represent the problem using fewer dimensions M such that M < N. This is why PCA is considered a dimensionality reduction technique. The set of M dimensions is obtained by finding the principal components (PCs). The PCs give the direction of greater variability and provide evidence for linearity in a particular direction of a reduced space. So, one can find linearity within apparently nonlinear data.

There are many problems (linear and nonlinear) in which PCA is applied. And there are many advances and new research in the PCA and SVD area to make PCA robust. After the upcoming tutorial on correlation coefficients, I might have to put out new tutorials on the topic to dispel so much misinformation running around.”

The same misinformation tactic was used by Hendrickson regarding what I said about Pearson and Spearman correlation coefficients. Here he seems to use bidirectional extrapolation, which many politicians have mastered when debating. [“If A to B is true, then B to A is true.” “If A to B is false, then B to A is false.” “Let the opponents disprove it.”]

We know that bidirectional extrapolation in Statistics and Science is incorrect. For instance, we know that:

If variables are independent, there is no correlation (r = 0). The reverse is not necessarily true.

Pearson’s can be applied to linear paired data. The reverse is not necessarily true. (*)

Nonlinear paired data requires Spearman’s. The reverse is not necessarily true. (*)

*The asterisk is for this: If a linear transformation of variables is possible, both Pearson’s and Spearman’s can be used.
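A small sketch of how the two coefficients behave on paired data that is monotonic but not linear. The data (y = e^x for x = 1..10) is made up for illustration, the helper names are hypothetical, and the simple ranking function assumes there are no tied values. Spearman’s is computed here as Pearson’s applied to the ranks.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n for a sequence; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient as Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical monotonic but strongly nonlinear data.
xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]

print(f"Pearson's r    = {pearson_r(xs, ys):.3f}")
print(f"Spearman's rho = {spearman_rho(xs, ys):.3f}")
```

On this data Spearman’s reports a perfect monotonic association while Pearson’s falls well short of 1, because the relationship is not a straight line.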

All my comments about Pearson’s and Spearman’s are available at the original blog post for anyone to read.

Some False Claims that Evaporate Under Examination

We have shown in A Tutorial on Correlation Coefficients why the “science” approach and “studies” of SEOMOZ, Hendrickson, and Fishkin are mathematically and statistically incorrect.

First, correlation coefficients are not additive. For centered data, the geometric equivalent of an r value is a cosine. Adding, subtracting, or averaging cosines does not give a new cosine. For non-centered data there is still a cosine term in the formula for r.
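A sketch of the geometric reading, with made-up data and hypothetical helper names. The first part checks numerically that r from the raw-score formula equals the cosine of the angle between the centered data vectors. The second part shows one way to see the non-additivity: the arithmetic mean of two cosines is not the cosine of the mean angle. The values 0.877 and 0.577 are the ones used in the example further down.

```python
import math

def pearson_raw(xs, ys):
    """Pearson's r from the raw-score (computational) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def cosine_centered(xs, ys):
    """Cosine of the angle between the mean-centered data vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    u = [x - mean_x for x in xs]
    v = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(pearson_raw(xs, ys), cosine_centered(xs, ys))  # the two agree

# Averaging cosines versus averaging the underlying angles.
r1, r2 = 0.877, 0.577
mean_of_cosines = (r1 + r2) / 2
cos_of_mean_angle = math.cos((math.acos(r1) + math.acos(r2)) / 2)
print(mean_of_cosines, cos_of_mean_angle)  # these differ
```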

Second, a mean from x observations is an unbiased estimator while a mean correlation coefficient is a biased estimator. Treating an average r value as a mean x is the same as saying that a mean r is an unbiased estimator. Statistically speaking, this is a gross error. This is easy to prove.

Third, professional and academic statisticians know that unlike mean x values, r values have an inherent bias; i.e., their distribution is bivariate and inherently skewed. Thus, the computed mean r value is also a biased estimator. All this affects hypothesis testing. A researcher cannot compute mean values and associated standard errors arbitrarily with one formula just because he likes to pick one in particular.

I will now give an example described in the Appendix section of our tutorial.

Let r1 = 0.877 and r2 = 0.577. One might think that the average of these is 0.727, or about 0.73, with an associated Coefficient of Determination of 0.73*0.73 = 0.53 or 53%. Assuming this is the same as implying that the mean is an unbiased estimator, not skewed toward r1 or r2.

However, converting r1 and r2 to Z scores and averaging gives an average Z score of 1.010. Converting this result back to r gives a mean value of 0.766, or about 0.77, with an associated Coefficient of Determination of 0.77*0.77 = 0.59 or 59%. Note that now the mean value is skewed toward r1, confirming that the mean value is a biased estimator. Statisticians know the inherent bias is a property of r values and their mean values.
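The arithmetic in this example can be checked with a few lines. This is only a sketch: it uses the fact that Fisher’s r-to-Z transformation is the inverse hyperbolic tangent and that the back-transformation is the hyperbolic tangent. The variable names are illustrative.

```python
import math

r1, r2 = 0.877, 0.577

# Naive approach: arithmetic average of the r values.
naive_mean = (r1 + r2) / 2

# Fisher approach: transform to Z, average on the Z scale, transform back.
z_mean = (math.atanh(r1) + math.atanh(r2)) / 2
r_from_z = math.tanh(z_mean)

print(f"naive mean r      = {naive_mean:.3f}")   # 0.727
print(f"mean Z            = {z_mean:.3f}")       # 1.010
print(f"back-transformed  = {r_from_z:.3f}")     # 0.766
```

The back-transformed value lies above the arithmetic average, that is, closer to r1, which matches the figures quoted above.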

If SEOMOZ wants to get into the misinformation business and sell snake oil, sure, they are entitled to that. But that ruins what little credibility they are getting these days with their “science knowledge”. Some of their cheerleaders, groupies, or easy-to-impress visitors might elect to stick with them. That’s fine, as they deserve what they are getting. So far, Statistics is a loss for them, anyway. All I can say is this: Those that listen to fools become one.

I’m not sure if Hendrickson misled Fishkin or if it was the other way around. In real companies, employees end up fired for less than that. Since Fishkin cannot fire himself, his burden of proof is now double. And for the fourth time: I am still demanding a retraction.

PS. I updated this post to refine some lines.


  1. I would like to ask you something about the use of Spearman correlation as you seem to have explored the thing in depth in your post here (http://www.seomoz.org/blog/statistics-a-win-for-seo ). How could you justify that SEOmoz (http://www.seomoz.org/article/search-ranking-factors#metrics-1) uses Spearman for the correlation of Page Authority or Domain Authority since they are not definitely monotonic and why don’t you use another correlation metric? I mean lets say you get the first 10 pages for a group of 10 queries from Google search engine and their corresponding Page Authority. So in total you have 100 results and you have 100 pairs of PA and rank. If you display this in a 2 dimensional x-y graph it does not have to be monotonic. Could you help me a bit on this because I am highly interested in the topic. I am not into statistics in depth so I would like to know if my question makes even sense?

    1. Thank you for stopping by.

      I don’t justify or trust anything from SEOmoz, nor do I care what they are up to.

      Since a search engine might elect to expand the query and the answer set of documents retrieved in the background, through lookup references (thesauri, lookup tables, etc.) or by incorporating users’ search behaviors, such “correlation studies” are at best extracted from data lacking normality and are therefore probably useless.

      They still believe and teach their naive peers that correlation coefficients can be added together and averaged, which has been proved faulty here: http://www.tandfonline.com/doi/abs/10.1080/03610926.2011.654037
