Beware of SEO Statistical Studies

Beware of SEO statistical studies as their findings can easily evolve into urban legends and myths when taken at face value. Why? Keep reading.

From time to time, SEOs publish statistical studies containing incorrect reasoning about statistical analysis and PageRank.

For instance, back in 2007, an SEO published a study trying to find correlations between the rank of a page in the search results and its PageRank. We wrote two rebuttals to that study:

https://irthoughts.wordpress.com/2007/11/20/a-pagerank-rank-correlation/
https://irthoughts.wordpress.com/2008/08/29/how-not-to-use-correlation-coefficients/

Evidently, we cannot read too much into correlation coefficients, particularly without a t-test analysis. In fact, x-y paired variables can be almost orthogonal (perpendicular), suggesting independence, and still have high correlation coefficients, which was actually the case in the mentioned “study”.

Data can also be nonlinear and still have a high correlation coefficient, as many statistics books show. Furthermore, a low or zero correlation coefficient does not necessarily mean that the data is uncorrelated. This is illustrated in the following figure, taken from “Statistics for Analytical Chemistry” (J.C. Miller and J. N. Miller, 1984).

[Figure: nonlinear data sets with high and low correlation coefficients]

Note that coefficients (Pearson in this case) can be high or low while the data is nonlinear AND well correlated. These two results just show that the data does not fit a linear model. Simply looking at a coefficient and inferring no correlation because it is low is misleading (bottom figure). Similarly, if the coefficient is high we cannot neglect non-linearity (top figure) and assume linearity. Please enlarge the figure and read the text above it to understand why.
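These caveats are easy to reproduce numerically. Below is a minimal sketch with made-up synthetic data (not taken from any study mentioned here) showing that Pearson's r can be high on clearly curved data and near zero on perfectly dependent data:

```python
# Synthetic illustration (assumption: invented data, not from the cited study):
# Pearson's r alone cannot tell linear from nonlinear association.
import numpy as np

# Curved but monotonic data: r is high even though the model is not linear.
x = np.linspace(1, 10, 100)
y_curved = x ** 2
r_curved = np.corrcoef(x, y_curved)[0, 1]

# A symmetric parabola: perfect dependence, yet r is essentially zero.
x_sym = np.linspace(-5, 5, 100)
y_parab = x_sym ** 2
r_parab = np.corrcoef(x_sym, y_parab)[0, 1]

print(round(r_curved, 3))  # high, despite the curvature
print(round(r_parab, 3))   # ~0, despite perfect dependence
```

Neither coefficient, on its own, tells you whether a linear model is appropriate; only a scatter plot (and further tests) can.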

In addition, to find a pattern in the data, other tests like PCA or SPCA are necessary. See the tutorial at

http://www.miislita.com/information-retrieval-tutorial/pca-spca-tutorial.pdf

The selection of one coefficient over another can be made with or without a priori statistical knowledge about the data. While prior knowledge about the nature of the data can help, it can also hurt statistical reasoning: a tester can easily become biased toward finding what he or she wants to find, overlooking other types of pattern-finding analyses or, worse, arbitrarily dropping outliers to fit the results to a preconceived data model.

Now that I’m on the subject of choosing one correlation coefficient over another: in a recent post, another SEO (Randfish) published the article “The Science of Ranking Correlations” (http://www.seomoz.org/blog/the-science-of-ranking-correlations), in which he states:

“Why did we use Spearman rather than Pearson correlation?

Pearson’s correlation is only good at measuring linear correlation, and many of the values we are looking at are not.  If something is well exponentially correlated (like link counts generally are), we don’t want to score them unfairly lower.”

Almost immediately, another poster (whiteweb_b) contradicted that assertion and wrote:

“Rand your (or Ben’s) reasoning for using Spearman correlation instead of Pearson is wrong. The difference between two correlations is not that one describes linear and the other exponential correlation, it is that they differ in the type of variables that they use.”

The poster is right on that one, especially since no t-test analysis or additional pattern/trend-finding techniques were provided. So, which coefficient should be used, when, and why: Pearson or Spearman? That’s a fair question. We covered this in the following post:

https://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/
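As a quick numerical illustration of the difference (a sketch with synthetic data, not drawn from any study discussed here): Spearman works on ranks, so any strictly monotonic relationship yields a rank correlation of essentially 1, while Pearson's r reflects only the linear component.

```python
import numpy as np

def ranks(a):
    # Rank transform; assumes no ties, which holds for this synthetic data
    return np.argsort(np.argsort(a)).astype(float)

x = np.linspace(0, 5, 50)
y = np.exp(x)  # strictly monotonic, strongly nonlinear

r_pearson = np.corrcoef(x, y)[0, 1]                   # well below 1
rho_spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]  # ~1: rank orders match

print(round(r_pearson, 3), round(rho_spearman, 3))
```

Note that neither value, by itself, says anything about significance; that still requires a t-test.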

Something that few realize is the connection of cosine similarity with the Pearson and Spearman coefficients. This was also discussed in the following post:

https://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/

To sum up, beware of SEO “science” and its statistical “studies”.

I hope this helps.

PS: I updated this post to add the figure above and a few additional comments.

37 Comments

  1. Hi,

    Thank you for reviewing this important issue. I was the one that left the comment on SEOMoz that you are referring to. While not being a statistician (my field is molecular biology), I am aware of how statistics is often abused and misinterpreted to support pre-formulated conclusions.

I kinda retreated from that discussion as I did not feel I am knowledgeable enough to argue with people on these issues; however, I could not ignore the feeling that there are things wrong with the method described in the post:

1. The correlation coefficients are very low. As far as I understand how this works, a correlation of 0.3 means that only 30% of the variation of the data can be explained by the regression and the remaining 70% cannot. In that respect, Rand’s claim that only a coefficient of 0 means randomness and anything above or below 0 implies correlation is very inaccurate.

2. Those coefficients could be low and still be significant, but without t-tests to establish how significant a correlation is, they are meaningless. Furthermore, comparing correlation coefficients for different parameters (without even touching upon the different nature of those parameters) without performing a t-analysis is meaningless.

    3. The standard error that Rand provided in the post (as far as I understand) relates to different coefficients that he got for different keywords. That standard error could only explain the variation of coefficients from their mean, not the significance of correlation between different parameters and rankings.

I would appreciate it if you could comment on the above remarks and their accuracy and validity. It would really be great to know if and where I am wrong, so I can improve the validity of my own analysis (which, frustratingly, I am not publishing, exactly because of the lingering suspicion that it is not valid or verified in a correct manner).

2. Thank you for stopping by. I’d love to hear about information retrieval in molecular biology databases. If you have conducted work related to the subject, let me know and we can feature it in our newsletter. If so, let’s discuss that through regular email. I advised a student who did a thesis on genetic algorithms with applications along those lines.

    I’ll limit this post to statistical facts.

    A reliable way of producing a statistical study is one based on a double-blind approach, wherein the generator of the data set and the statistician doing the analysis on the data set don’t know each other, so there is no conflict of interests or vested interests. We cannot say the same when employees or associates are doing an analysis for a boss or a friend.

Regarding explained or unexplained variations, these are computed with the coefficient of determination (r^2), not with the correlation coefficient (r). An example follows.

    Let r = 0.90, then r^2 = 0.81. This means that 81% of the variations can be explained by the regression model and 1 – r^2 = 0.19 or 19% of the variations cannot be explained by the model.

    So, if r = 0.30 then r^2 = 0.09 meaning that 9% of the variations can be explained by the regression model and 91% cannot be explained by the model, which is even worse.

    Regarding assessing the significance of correlation coefficients, this is done with a two-tailed t-test at a given confidence level and n – 2 degrees of freedom, using

t = (| r |*(n – 2)^(1/2))/((1 – r^2)^(1/2))
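In code, this significance test is a one-liner (a sketch; the example r and n below are made-up numbers, purely for illustration):

```python
import math

def t_for_r(r, n):
    # t statistic for H0: no correlation, with n - 2 degrees of freedom
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Illustrative values: r = 0.30 measured over n = 100 paired points
t = t_for_r(0.30, 100)
print(round(t, 2))  # compare against a two-tailed t-table value with 98 df
```

If the computed t exceeds the tabulated value at the chosen confidence level, the correlation is significant; otherwise no conclusion about correlation is warranted.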

Regarding using standard deviations for assessing correlation: in general, quantities based on absolute differences (like standard deviations) are poor appraisers of correlation, a point also made in the Handbook of Applied Mathematics for Engineers and Scientists (Max Kurtz, 1991, McGraw-Hill, New York).

Regarding supporting statistical arguments, this should be done with books on statistical analysis. I’m not sure that relying on a cesspool of contaminated knowledge, like marketing media screens (including Wikipedia), to make a statistical argument is a wise thing to do. Those media screens have their own dose of interest groups posing as reputed editors.

    1. Hi Ed

I read your blog periodically. Good insight. I spent considerable time studying graduate-level statistics and non-linear equations while completing degrees in engineering and physics. In this case we are dealing with graduate-level statistics (or close to it), so you can get into trouble quickly, which is what you are pointing out with these graphs.

      Anyway, I saw this study and read it over. I don’t have time to post a full explanation right now of what the errors are but the first point I would make is indicated in the graph you have above.

PageRank is scaled logarithmically. Counting up links via other tools is a linear calculation (summing them up). You can’t compare this data by mixing non-linear AND linear equations. A zero correlation indicates the absence of a linear relationship between a pair. So you have to do a scatter plot first to determine the type of relationship we are dealing with; there needs to be a graph. At first glance it appears they have proven that there is a non-linear relationship between the variables they are looking at (PageRank), which we already knew.

There are many people in search marketing who analyze statistics in many different areas of online marketing and do a good job of it. I have found many papers of interest; they are just not in the spotlight or discussed often. The problem is that reading them typically requires graduate-level mathematics, so the audience is automatically limited. Therefore it is hard to get material reviewed correctly.

Things are improving on the industry side but there is still a long way to go. A first step is for people to get used to the concept of peer review for material that is published, or at least collaboration. There is a lot of money being made on the front end, so the motive is inherently skewed. The subject of search marketing is also “made up” (it could be different, whereas physics cannot be different). The elements were created by commercial entities, the search engines. So universities will have a difficult time teaching fundamentals or generic concepts, unlike IR, machine learning, etc., where the topics are independent of any commercial product.

      Thanks for your time.

      1. @seangolliher

        Hi, sean:

        Thank you for stopping by.

Indeed, the paper and methodology used have too many flaws. Unfortunately, this type of “research” gives a black eye to the SEO industry. In the upcoming issue of IR Watch Newsletter we discuss the notion of uncorrelated and correlated variables. This distinction is best understood in terms of expectation values.

  3. One more thing, Neyne:

    With regard to the problem of comparing different correlation coefficients (r values):

    For samples from the same data set or different features from the same data set:

A mere comparison of standard errors or r values can be misleading. Use the Raghunathan, Rosenthal, and Rubin method. See:

Raghunathan, Rosenthal, & Rubin (1996). Psychological Methods, 1: 178-183.

Meng, Rosenthal, & Rubin (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111: 172-175.

    Their method is well described at the following link

    http://core.ecu.edu/psyc/wuenschk/stathelp/ZPF.doc

    For samples from different data sets or features from different data sets:

    Again, a mere comparison of standard errors or r values can be misleading. One needs to compare both the slopes and intercepts from the regression curves. It is recommended to do this by plotting the curves and the data points on the same graph.

    It could happen that different r’s give identical slopes or that different slopes give similar r’s.

    The prescribed procedure for doing this, developed by Fisher in 1921, is well explained by Karl L. Wuensch at the following link:

    http://core.ecu.edu/psyc/wuenschk/docs30/CompareCorrCoeff.doc

You would need to apply Fisher’s r-to-z (log) transformation to the r values and compute a z-score to get a p-value for the null hypothesis.

    To compare slopes you would need to do a t-test analysis on the slopes.
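A minimal sketch of the Fisher procedure for two independent samples (the r and n values below are made-up, purely for illustration):

```python
import math

def fisher_z(r):
    # Fisher's r-to-z transformation
    return 0.5 * math.log((1 + r) / (1 - r))

def z_for_independent_rs(r1, n1, r2, n2):
    # z-score for H0: the two population correlations are equal
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Illustrative: r = 0.50 vs r = 0.30, each from a sample of 100 points
z = z_for_independent_rs(0.50, 100, 0.30, 100)
print(round(z, 2))  # look up a two-tailed p-value for this z
```

For correlated (overlapping) correlations from the same data set, the Meng/Raghunathan variants referenced above apply instead.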

Last but not least, regarding the assertion that Spearman should be used when the relationship between X and Y is nonlinear, as inferred from Randfish’s choosing of Spearman over Pearson, Jan Hauke and Tomasz Kossowski (http://mtriad07.amu.edu.pl/pdf/hauke_kossowski.pdf) state, emphasis added:

“Spearman’s rank correlation coefficient (denoted here by rs) is a nonparametric (distribution-free) rank statistic proposed by Charles Spearman as a measure of the strength of the associations between two variables. It is a measure of a monotone association that is used when the distribution of the data makes Pearson’s correlation coefficient undesirable or misleading. Spearman’s coefficient is not a measure of the linear relationship between two variables, as some “statisticians” declare. It assesses how well an arbitrary monotonic function can describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. Unlike Pearson’s product-moment correlation coefficient, it does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales; it can be used for variables measured at the ordinal level.”

    To conclude

All in all, it is not as simple as comparing standard errors or crude differences in r values. Such analyses would be invalidated on the spot by any professional statistician or journal if submitted as a “scientific study”.

Indeed, I’m not sure such SEO statistical “studies” would pass a litmus test at any peer-reviewed statistics journal.

    I hope this helps.

PS. I edited this comment to add a few more lines and refine others.

  4. Thank you a lot for taking time in answering this. I have a few more questions, if you don’t mind:

1. Would the different data that was gathered in Rand’s investigation be considered data from the same dataset or from different datasets?

2. What would be the correct tests to measure and compare the impact different parameters have on ranking? From what I see, Spearman’s is the correct test to use (as long as the independent variables are transformed to ranked data), as it doesn’t make assumptions about the nature of the correlation.

PS. I think that Rand said they chose Spearman’s because the correlation between the parameters and ranking could not be assumed to be linear. He went on to assume that link count is exponentially correlated with rankings, which is an unfounded assumption, but I think the basic reasoning for using Spearman’s is correct.

    Again thank you a lot for providing the explanations and the further reading. Understanding of these issues is crucial both for performing valid analysis and being able to assess when someone is publishing invalid analysis under the title of “scientific research”.

I would be glad to provide any examples you may find interesting from my own research. I believe that I registered with an email address to leave comments here, but if not, I can be reached at theney at gmail dot com.

  5. Hi, Neyne:

    To the first point, that’s a call SEOMoz needs to answer as they should know how they generated their data sets.

A data set is defined as the data points under inspection. Let’s say I have a “universe” to be analyzed. If 10 data points are extracted from it, that’s a data set of 10 elements.

For instance, if I query an inverted index, the inverted index is the “universe”. The search results for the query are an answer set, or data set. If the top N results are taken and inspected, that would be another data set of N points. The two sets are not independent of one another and are said to be related. This is just one example of data set relatedness.

Data set relatedness can also arise from the features to be inspected. If a data set consists of three variables X, Y, and Z, the X-Y paired data is a data set, as are the X-Z and Y-Z paired data. These four sets are not entirely independent and are said to be related.

In all these cases, we can test the null hypothesis that they are not related and go from there before doing other statistical analyses on these data sets. Was this done by SEOMoz? It is not evident from the report that they did.

With regard to “the correct tests to measure and compare an impact different parameters have on ranking”, this can be addressed using PCA, or as a sequential optimization problem using simplex optimization (the Nelder-Mead algorithm, or improved modifications of it; http://oldweb.cecm.sfu.ca/AAS/coope.pdf). Before discovering fractals, my research work was on this algorithm, circa the mid ‘80s.

Regarding the selection of Spearman: it is implemented without making any assumption about the presence or lack of linearity, or about prior knowledge (or lack of it) of the data set. It is in this sense a “blind” approach. This did not come across in Rand’s reasoning for using one coefficient over the other.

    Once an r value is computed (Spearman or Pearson), coefficients of determination can shed some light about the fraction of variations that can be explained by the regression model adopted.

The assertion that link count is exponentially correlated with rankings is just more SEO snakeoil. Isn’t links-equal-rankings what link farmers have claimed all along? I have many documents with low link counts, some as recent as a few weeks old, that are ranked in the top positions in Google for my intended keywords.

Many marketers are just dying to see SEO accepted at the level of science. I don’t see this happening any time soon. With the above “studies” they are just giving a black eye to their industry.

    Please send what you have about your research work (abstract, technical article, etc).

    I hope this helps.

6. E. Garcia, I find your criticisms odd. Let me try to respond to what I think your major points are. In the future, feel free to give me a heads up so I can respond quicker rather than finding your remarks months later.

    NULL HYPOTHESIS
You suggest we ought to have done the math to reject the null hypotheses (that the various variables are uncorrelated) before showing correlation coefficients. We did publish stderr, so you can see quite quickly that we will reject most of these null hypotheses by over 10 standard deviations of certainty. I think when it is obvious a null hypothesis will be overwhelmingly rejected, it is OK to publish stderr instead of computing out an incredibly tiny probability that will be rounded to zero by an online calculator.

    NON-LINEARITY
In your post, you criticize me for using Spearman’s over Pearson’s because of my concern that some of our measurements would be correlated in ways besides linearly. In a comment, you now acknowledge that Spearman’s, unlike Pearson’s, doesn’t assume linearity. Unless you a priori know all correlations would be linear, how can you be skeptical of my use of Spearman’s? Your position seems particularly weird given that at the start of your post you criticized someone else for assuming linearity when measuring PageRank correlation.

On a related issue, you seem incredulous that there is an exponential correlation between the number of links and ranking. If you say it would actually persuade you to publish a retraction, I’ll do the computations to show they are more exponentially correlated than linearly, but somehow I doubt even doing the math would persuade you. So instead, I’ll just point out that in-link counts are power-law distributed (besides being intuitive, a quick Google search will find many peer-reviewed papers pointing this fact out) while ranking positions are not, so it hardly should be a surprise that the log of the link counts is closer to a linear correlation than the raw link counts themselves. It would be quite surprising any other way.

    NO SPCA/PCA
    Given the top of your post criticizes someone else for using Pearson’s because of linearity issues, isn’t it kinda odd to suggest another linear method?

Anyway, I answered most of these issues more carefully and in more detail in the comments of our blog shortly after we posted the results. You can see my comments in that discussion:
    1) http://www.seomoz.org/blog/the-science-of-ranking-correlations#jtc109526
    2) http://www.seomoz.org/blog/the-science-of-ranking-correlations#jtc109555

    Cheers,
    Ben

  7. This is a full response to Ben and Rand.

Before responding to ijustwritecode (Ben Hendrickson from SEOMoz) about the piece that he and Rand put together and called “Science”, and in order to put things in context, I am reproducing an email he sent me just yesterday.

    Hi Dr. E. Garcia,

    This morning I stumbled across your blog and noticed it had a couple
    of points criticizing some statistics I’ve done at SEOmoz. I am
    hardly perfect with statistics – I mostly just write code – but I am
    certainly interested in learning what I did wrong so as to improve for
    the next time (or correct this current batch of statistics).
    Furthermore, I don’t think your criticism was entirely accurate, and I
    left comments to that effect on your posts.

    I would be interested in talking about this more directly if you are.
    When I have stumbled across your blog in the past, it impressed me as
    a good resources. There is certainly a lot of bullshit in the SEO
    industry, as you have pointed out, but I certainly hoped by me
    publishing fairly precise statistics it would help push the industry
    more in the direction of rigor and not less. So, of course, I was
    quite disappointed when I found your criticisms of my statistics this
    morning.

    Cheers,
    Ben

He sent this email after two previous posts he dropped at my blog. The wording sounds a bit more polite now, doesn’t it? I did not quickly activate and respond to his previous comments due to time constraints. I’m running again against deadlines with the IRW Newsletter and other projects. BTW, this month’s issue is about uncorrelated and correlated variables, and the July issue will probably be about PCA or standard error analysis. I also did not respond immediately because a full response was warranted.

    I am hardly perfect with statistics – I mostly just write code – but I am
    certainly interested in learning what I did wrong so as to improve for
    the next time (or correct this current batch of statistics).

From this admission by Ben, I conclude he is not a statistician, but someone who just writes code (a programmer?). Ben’s comments about exponential laws, power laws, and PCA confirm an incorrect knowledge of statistics (more on this later).

Simply put, Ben and Rand rushed to put out a piece about poorly correlated data, from which comparisons were made and opinions were drawn, and then unfortunately labeled it “Science”. Isn’t this the signature of quackery? How is this any different from selling snakeoil? This is precisely what gives a black eye and bad reputation to the SEO industry, in which a tiny sector is certainly doing a fine job.

In the future, feel free to give me a heads up so I can respond quicker rather than finding your remarks months later.

It is odd that you learned about the post just now. Long ago, Rand tweeted that he read and knew about it. Are you both having communication problems? Anyway, if you want to know what is posted here, drop by often.

    NULL HYPOTHESIS
You suggest we ought to have done the math to reject the null hypotheses (that the various variables are uncorrelated) before showing correlation coefficients. We did publish stderr, so you can see quite quickly that we will reject most of these null hypotheses by over 10 standard deviations of certainty. I think when it is obvious a null hypothesis will be overwhelmingly rejected, it is OK to publish stderr instead of computing out an incredibly tiny probability that will be rounded to zero by an online calculator.

    Please don’t second guess me. I suggested nothing about any null hypothesis.

    One more thing.

Evidently, you don’t know how to calculate the standard error of a correlation coefficient. The procedure is described at the following links in very simple terms:

    Click to access BioStatistics%20Topic%2023.pdf

    Click to access correlat.pdf

The first link also shows that correlation coefficients are compared using their pooled standard errors.

    The general formula for a t-test is: t = (parameter)/(standard error of parameter). For a correlation coefficient the standard error (SE) is computed as follows.

We first compute the coefficient of determination, r^2. This gives the fraction of variation that can be explained by the model. Next we compute the amount of variation that cannot be explained by the model, 1 – r^2. This is then normalized by the degrees of freedom, which for a two-tailed test is n – 2. Finally we take the square root. The standard error of a correlation coefficient is thus computed as

    SE = sqrt[(1 – r^2)/(n – 2)]

Once we compute SE, what do we do with it? We calculate a t-value as

    t = r/SE

This is done to test whether the correlation coefficient is significant. If the correlation coefficients are not significant, any comparison between them is useless.

The calculated t is then compared against a t-table value at a given confidence level. The null hypothesis in this case is that there is no correlation between x and y. If the calculated t is greater than the t-table value, the null hypothesis is rejected and we conclude that a significant correlation does exist.
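The walkthrough above, in code (a sketch: the r value of 0.18 is the report's PR-vs-Google figure quoted below, while the sample size n is an ASSUMED number, since the report's n is not restated here):

```python
import math

# r = 0.18 as quoted from the report; n = 10,000 is an assumption for
# illustration only (the actual sample size is not given in this post).
r, n = 0.18, 10_000

se = math.sqrt((1 - r ** 2) / (n - 2))  # standard error of r
t = r / se                              # a t-value, not "standard deviations"

print(round(se, 4), round(t, 1))
# With df = n - 2, compare t against a two-tailed t-table value (~1.96 at 95%)
```

Notice the conclusion of the test is about significance of the correlation, not about "standard deviations of certainty".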

    Note that the standard error of the mean and the standard error of a correlation coefficient are two different things. Moreover, the standard deviation of the mean is not used to calculate the standard error of a correlation coefficient or to compare correlation coefficients or their statistical significance.

In the original article, you and Rand talk about standard errors in convoluted ways. Rand placed a Wikipedia link to the standard error of the mean, and then you and a few posters talk about standard errors of a correlation coefficient, and then go on to combine these concepts. For instance, you wrote in the original report:

    “Consider the correlation of PR to Google.com. The correlation coefficient is greater than 0.18, standard error is less than 0.0056, so the null hypothesis being right would be an event of more than 32.143(=.18/0.0056) standard deviations.”

You divided a correlation coefficient by a standard error and called that ratio “standard deviations”. You cannot do that, and it invalidates your treatment and reasoning. First, r/SE, where SE is the standard error of the correlation coefficient, is a t-value, not a standard deviation. If you used SE as the standard error of the mean, that’s even worse.

One more thing, and this is for Rand. Rand, r = 0 does not imply the variables are random. It only means they are not linearly correlated; they might be nonlinearly correlated and not necessarily random. Stop misleading the public.

    NON-LINEARITY
In your post, you criticize me for using Spearman’s over Pearson’s because of my concern that some of our measurements would be correlated in ways besides linearly. In a comment, you now acknowledge that Spearman’s, unlike Pearson’s, doesn’t assume linearity. Unless you a priori know all correlations would be linear, how can you be skeptical of my use of Spearman’s? Your position seems particularly weird given that at the start of your post you criticized someone else for assuming linearity when measuring PageRank correlation.

    Given the misevaluation of the significance of a correlation coefficient with standard errors above, “nothing more with the witness, your honor.”

On a related issue, you seem incredulous that there is an exponential correlation between the number of links and ranking. If you say it would actually persuade you to publish a retraction, I’ll do the computations to show they are more exponentially correlated than linearly, but somehow I doubt even doing the math would persuade you.

So instead, I’ll just point out that in-link counts are power-law distributed (besides being intuitive, a quick Google search will find many peer-reviewed papers pointing this fact out) while ranking positions are not, so it hardly should be a surprise that the log of the link counts is closer to a linear correlation than the raw link counts themselves. It would be quite surprising any other way.

You first make reference to an exponential law and then to a power law. First of all, these are two different animals. Even so, both laws (exponential and power laws) can be converted into a linear form. Once in a linear form, linear regression can be applied to the data and statistics computed (correlation coefficients, regression coefficients, etc.). So the fact that paired data follow either one does not exclude a linear regression treatment at all.

    I am the one that is demanding a public retraction from you, Rand, and SEOMoz for putting out nonsense.

    Some comments on in-links

It has been well known for many years that in-links follow a power law. You don’t need to rush to publish sloppy results to know that. Fifteen years ago, I wrote a thesis on applied fractals. I also run a subsite on the topic, so I know something about power laws. I also know that a power law can be converted into a linear law by using double-log plots. Once in a double-log plot, linear regression analysis can be applied and tests for correlation can be performed. If there is a valid power law, the power exponent can be computed from the slope of the regression equation.
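The double-log linearization is straightforward to sketch (synthetic data with an assumed exponent of 1.5, invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed synthetic power law y = 2 * x^1.5 with multiplicative noise
x = np.linspace(1, 100, 200)
y = 2.0 * x ** 1.5 * np.exp(rng.normal(0.0, 0.05, x.size))

# In a double-log plot the power law becomes a line: log y = log c + a*log x,
# so ordinary linear regression recovers the exponent a as the slope.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

print(round(slope, 2))  # close to the assumed exponent, 1.5
```

Once the data is in this linear form, the usual correlation and significance tests apply, which is exactly why a power law does not rule out a linear regression treatment.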

I also remember that back in May of 2000 I read Andrew Tomkins’s so-called Bow Tie theory paper on in-links and power laws. In 2006, I met Tomkins at the Institute for Pure and Applied Mathematics (UCLA, IPAM) and asked him about that paper while covering the event for SEW. Here is the link on that:

    http://forums.searchenginewatch.com/archive/index.php/t-9711.html

Rand is on record that this was when he learned about the Bow Tie theory for the first time, precisely from that post. Back then he wrote: “Dr. Garcia – first off, thanks for the great coverage. This is something we’d never be exposed to without you. Second – can you elaborate on Andrew Tomkins’ work:”

    For your perusal, a comprehensive report on the IPAM workshop is provided here:

    Click to access ipam-document-space-workshop.pdf

    NO SPCA/PCA
    Given the top of your post criticizes someone else for using Pearson’s because of linearity issues, isn’t it kinda odd to suggest another linear method?

    PCA
About the word “linearity” in PCA: be careful when you, Ben and Rand, talk about linearity in connection with PCA, as no assumption needs to be made in PCA about the distribution of the original data. I doubt you guys know about PCA, a dimensionality-reduction technique frequently used in data mining. It is also used in clustering analysis as one way of addressing the so-called k-means initial centroid problem, though it can fail with overlapping clusters.

PCA can be applied to linear and nonlinear data. Ben, have you ever heard about nonlinear PCA? Even if you haven’t, let me say this: nothing is more nonlinear than images, clusters, noisy scattered data, etc., yet PCA can be applied to these scenarios to extract structural components hinting at possible patterns hidden by noisy dimensions.

    Linearity in PCA does not refer to the original variables but to the transformed variables used in PCA, because these are constructed as linear combinations of the originals. To understand linear combinations, a review of linear algebra helps. According to Gering from MIT (http://people.csail.mit.edu/gering/areaexam/gering-areaexam02.pdf ):

    “Principle Component Analysis (PCA) replaces the original variables of a data set with a smaller number of uncorrelated variables called the principal components….The method is linear in that the new variables are a linear combination of the original. No assumption is made about the probability distribution of the original variables.”

    The linearity assumption concerns the basis vectors. And even so, variants of PCA exist to deal with nonlinearity.
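    To make that concrete, here is a minimal PCA-via-SVD sketch on simulated, deliberately non-Gaussian data (the data itself is hypothetical); note that no distributional assumption is used anywhere, and each component is just a linear combination of the original variables:

```python
import numpy as np

# Three correlated variables driven by a hidden, non-Gaussian factor.
rng = np.random.default_rng(1)
t = rng.uniform(-1.0, 1.0, 200)       # uniform, not normal, on purpose
X = np.column_stack([t, 2.0 * t + rng.normal(0.0, 0.1, 200), t ** 2])

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the loading vectors: each principal component score is a
# linear combination of the original (centered) variables.
scores = Xc @ Vt.T

# The transformed variables are mutually uncorrelated.
cov = scores.T @ scores / (len(X) - 1)
```

    The off-diagonal entries of `cov` vanish: the new variables are uncorrelated even though the input was nonlinear and non-Gaussian.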

    I am the one that is now demanding a public retraction from you both for putting out crap “science” at SEOMoz.

  8. I am now putting together two tutorials:

    1. A Tutorial on Standard Errors
    2. A Tutorial on Correlation Coefficients

    These will probably be out this week. After reading these tutorials, SEOs will understand why the SEOMoz “statistical treatment” has no theoretical or experimental basis.

    Will they ever publish a public retraction for putting out, promoting, and trying to save face on their quack “science”? Time will tell.

  9. Hey, I agree with you. The folks at SEOmoz have done a lot to get science and stats on the radar in the SEO industry, but I feel they are going about it all wrong. They aren’t conducting true experiments if you ask me. I would like to get your opinion on something. I just started a website at realseoscience.com. My goal is to do very simple experiments. Make a bunch of web pages. Observe their rankings. Make one identical change (H1 tag, alt tag, etc.) to all pages. See where they rank afterwards. Run a paired samples t-test to see if the change is at all statistically significant. I will do this for several rankings factors.

    Does this sound pretty solid to you? I ran the stats research lab in college, but I’m doubting I’m at the level you are at with this kind of thing, would just like to get some feedback from someone with more experience in the industry. Thanks!
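    The before/after comparison described above might be sketched like this; the rank numbers and the two-position effect below are entirely made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical experiment: record each page's rank before and after one
# identical on-page change, then run a paired-samples t-test.
rng = np.random.default_rng(2)
rank_before = rng.integers(1, 50, size=30).astype(float)
# Simulate a 2-position improvement plus noise (purely illustrative).
rank_after = rank_before - 2.0 + rng.normal(0.0, 1.5, size=30)

t_stat, p_value = stats.ttest_rel(rank_before, rank_after)
# A small p_value suggests the change shifted rankings; pair this with an
# effect-size measure before drawing practical conclusions.
```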

  10. Hi, Daniel:

    Thank you for stopping by. I’m very busy these days, so sorry for not responding before.

    Good luck with your tests. Please keep in mind that in a problem with many variables, testing all of them at the same time allows one to account for outliers, noisy variables, and possible interactions between variables, which often introduce nonlinearity.

    One-variable-at-a-time testing is prone to miss all of the above. Arbitrarily removing nonlinearity can introduce spurious linearity. There are methods for multivariate testing: modified sequential simplex optimization, factorial designs, and, yes, PCA.

    For instance, a problem with many variables N, taken as dimensions, might look nonlinear. PCA creates a new set of variables that are linear combinations of the original variables and linearly independent of each other.

    The odds are that when a tester deals with many variables he does not know a priori whether the original data is truly linear or nonlinear, or whether there are subspaces in which it is one way or the other. Fortunately, no assumptions need to be made in PCA about the distribution of the original data.

    The goal is to represent the problem using fewer dimensions M such that M < N. This is why PCA is considered a dimensionality reduction technique. The set of M dimensions is obtained by finding the principal components (PCs). The PCs give the direction of greater variability and provide evidence for linearity in a particular direction of a reduced space. So, one can find linearity within apparently nonlinear data.
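    A short sketch of that reduction on simulated data (the ten variables and the hidden two-dimensional structure are assumptions for illustration):

```python
import numpy as np

# N = 10 noisy variables that actually live near a 2-dimensional subspace.
rng = np.random.default_rng(3)
factors = rng.normal(size=(500, 2))         # hidden 2-D structure
mixing = rng.normal(size=(2, 10))           # spread it across 10 variables
X = factors @ mixing + rng.normal(0.0, 0.05, (500, 10))

# Principal components via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)         # variance fraction per component

# Keep M = 2 components: almost all variability in far fewer dimensions.
M = 2
scores = Xc @ Vt[:M].T
```

    The first two components capture nearly all the variability, so the 10-dimensional problem can be represented in M = 2 dimensions with little loss.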

    There are many problems (linear and nonlinear) to which PCA is applied. And there is much new research in the PCA and SVD area aimed at making PCA robust. After the upcoming tutorial on correlation coefficients, I might have to put out new tutorials on the topic to dispel so much misinformation running around.

  11. Pingback: | SeanGolliher.com
  12. Another thing that amounts to SEOMOZ Quack “Science” is that they fail to understand that combining very large sample sizes with very small correlation coefficients leads to meaningless statistical significance.

    It is well known that for a sufficiently large sample size, such a combination will always lead to a t-observed value higher than the t-table value, regardless of how small the corresponding r value is. In such circumstances the statistical significance of such a small r value is meaningless or useless. See https://irthoughts.wordpress.com/2010/10/21/on-power-analysis-and-seo-quack-science/
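    The arithmetic behind that point is easy to reproduce; the r and n below are hypothetical but typical of the situation described:

```python
import numpy as np
from scipy import stats

# Significance test for a correlation coefficient:
# t = r * sqrt(n - 2) / sqrt(1 - r**2), with n - 2 degrees of freedom.
r, n = 0.02, 100_000                  # tiny correlation, very large sample
t_observed = r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)
p_value = 2.0 * stats.t.sf(t_observed, df=n - 2)

# "Statistically significant" at any common alpha, yet r**2 shows the
# variable accounts for only 0.04% of the variance: no practical meaning.
r_squared = r ** 2
```

    The test passes overwhelmingly, yet the effect size (r squared) is negligible, which is exactly why statistical significance alone is useless here.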

  13. At http://www.seangolliher.com/2010/uncategorized/185/, Mr. Hendrickson echoed outdated research to claim that correlation coefficients can be added and averaged. He provides no evidence, but just this comment:

    “If one checks page 82 of “Meta-Analysis of Correlations: correcting error and bias in research findings” it asks the question “is the weighted average always better than the simple average?” In then cites an example where the unweighted analysis does better, ….”

    Since that blog post was closed, I’m responding to this claim here.

    Essentially this is outdated AND incorrect research.

    Arithmetically adding correlation coefficients runs against basic mathematical principles and produces mean values with poor discriminatory power. The proof can be found here:

    Click to access on-the-non-additivity-correlation-coefficients.pdf

    Using an arithmetic average of correlation coefficients leads to fallacious models.
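    For what it is worth, the standard remedy in the statistics literature is to combine coefficients through Fisher’s z-transformation rather than arithmetically; the r values below are made up to show how the two averages disagree:

```python
import numpy as np

# Fisher's z-transformation: z = atanh(r) is approximately additive, so
# averages are taken in z-space and transformed back with tanh.
r_values = np.array([0.2, 0.5, 0.9])        # hypothetical coefficients

naive_mean = r_values.mean()                         # plain arithmetic average
fisher_mean = np.tanh(np.arctanh(r_values).mean())   # back-transformed z average

# The two disagree, and the gap widens as the r values approach +/- 1.
```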

    I encourage marketers to do real scientific marketing research before taking SEO comments at face value.

    Your’re welcome.

    PS. It seems SEOMOZ and Fishkin keep computing small average correlations and making a big deal out of these.
    http://www.seomoz.org/blog/google-places-seo-lessons-learned-from-rank-correlation-data

    Fortunately, even some readers have found critical flaws in their tests. Instead of nibbling at small average correlations and “statistical significance” they should look at power analysis and the practical significance of effect sizes.

    They should read Cohen’s Primer on Power Analysis:

    Click to access CohenJ1992a.pdf

    They should also read Carson’s paper:
    Carson, C. (2006). “The effective use of effect size indices in institutional research”.

    Click to access effect_size.pdf

    And quote:

    “Researchers have criticized the use of null hypothesis statistical significance tests since the early 1900’s (Huberty, 2002). Psychologist, scientist, and philosopher Paul Meehl passionately described a statistically significant test result as: “a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring” (Meehl, 1967, p.265). By its nature, a statistically significant test lays a seductive trap that a researcher can easily fall into because it “does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!” (Cohen, 1994, p. 997)”
