IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: November 2010

Understanding Correlation

29 Monday Nov 2010

Posted by egarcia in Data Mining, SEO Myths, Statistics and Mathematics

I came across Professor R.J. Rummel’s page on Understanding Correlation. This is an old, but still relevant, book-like Web page on how to properly interpret correlation coefficients.

In Chapter 4 he discusses the proper way of looking at correlation coefficient values. He writes, and I quote (emphasis added in boldface):

“As a matter of routine it is the squared correlations that should be interpreted. This is because the correlation coefficient is misleading in suggesting the existence of more covariation than exists, and this problem gets worse as the correlation approaches zero. Consider the following correlations and their squares.”

“Note that as the correlation r decrease by tenths, the $r^2$ decreases by much more. A correlation of .50 only shows that 25 percent variance is in common; a correlation of .20 shows 4 percent in common; and a correlation of .10 shows 1 percent in common (or 99 percent not in common). Thus, squaring should be a healthy corrective to the tendency to consider low correlations, such as .20 and .30, as indicating a meaningful or practical covariation.”
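The effect Rummel describes is easy to tabulate. The following is only an illustrative sketch, not Rummel's original table; the helper name `shared_variance` is hypothetical and simply squares r.

```python
def shared_variance(r: float) -> float:
    """Return r squared: the fraction of variance two variables have in common."""
    return r * r


# Walk r down by tenths, from 1.0 to 0.1, and print r next to its square.
for tenth in range(10, 0, -1):
    r = tenth / 10
    print(f"r = {r:.2f}   r^2 = {shared_variance(r):.2f}   in common = {shared_variance(r):.0%}")
```

Each step of 0.1 in r removes a different amount of shared variance, which is why the squared values, not the raw coefficients, are the ones to read.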

Rummel’s page is very relevant these days, when SEOs from SEOMOZ and a few other snakeoil marketing sites are buying the bogus discourse from Fishkin and Hendrickson that low correlation coefficients in about that range are evidence of LDA scores and Google ranks being “highly” correlated.

As mentioned before at this blog, SEO marketers are good at selling that kind of snakeoil or “quack” science.

Statistical significance does not equate to high correlation. For large enough sample sizes even very low r values (0.1, 0.01, etc) eventually become significant, but these do not equate to high correlation.

On a side note, I’m reading an IR thesis wherein Spearman’s and Kendall’s coefficients are used. Quite interesting.

First PS

According to a Sloan Consulting article at the ISIXSIGMA.COM site, and I quote (emphasis added):

“As a rule of thumb a strong correlation or relationship has an r-value range of between 0.85 to 1, or -0.85 to -1. In a moderate correlation, the r-value ranges from 0.75 to 0.85 or, -0.75 to -0.85. In a weak correlation, one that is not a very helpful predictor, r ranges from 0.60 to 0.74 or -0.60 to 0.74. Though an entirely random relationship equals, 0.00, any relationship that has a correlation r-value that is 0.59 and below is not considered to be a reliable predictor.”

According to this Intel Teach Program, a correlation between 0 and 0.19 is very weak, while one between 0.2 and 0.39 is weak.

It is true that there are many correlation charts out there and that some do not agree on specific degrees or ranges, but they all tend to agree on one thing: a correlation value below 0.20 is a very, very weak correlation, never deemed evidence of variables being “highly correlated”, as claimed by SEOMOZ in their LDA fiasco posts.

Second PS

Here is a list of reference links wherein these marketers make correlation claims based on quite weak correlation values and in the process keep misleading naive peers and the public. “Highly correlated”? “Remarkably well correlated?” Evidently Statistics is a Loss for SEOs.

http://www.seomoz.org/blog/lda-correlation-017-not-032

http://www.seomoz.org/blog/lda-and-googles-rankings-well-correlated

http://www.seomoz.org/blog/google-vs-bing-correlation-analysis-of-ranking-elements

IRW:11-2010: Tables of Correlation Features

26 Friday Nov 2010

Posted by egarcia in Newsletters, Statistics and Mathematics

The current issue of IRW is out and should reach subscribers during the day. 

In this issue we delve into our Tables of Correlation Features.  

The QA section addresses the question of how university administrators can allocate faculty to programs using an interesting formula.

The Who’s Who in CS is dedicated to one of my heroes: Wesley A. Clark.

Enjoy it and happy Holiday Season.

Expanding on Tables of Correlation Features

25 Thursday Nov 2010

Posted by egarcia in Data Mining, Statistics and Mathematics

I’ve updated and expanded the Tables of Correlation Features article to include:

1. How-to instructions for reproducing the tables.
2. Additional statistics theory.
3. Working examples on how to use the tables.
4. An example on how results compare with G*Power software.

Enjoy it and Happy Thank-Day.

Tables of Correlation Features

15 Monday Nov 2010

Posted by egarcia in Data Mining, Statistics and Mathematics

I have constructed several Tables of Correlation Features. I find these quite useful for quickly determining the statistical significance of r values and for discriminating between several correlation features. They are great for data mining and for use in other disciplines or fields.

These are presented for the convenience of analysts and for use with statistical and practical significance tests. Readers requiring additional theory or statistical data corresponding to confidence levels and/or degrees of freedom not covered in the tables are referred to the literature.

In the future I will provide some practical applications of these. I hope you like the tables.

On Statistical Significance and SEO Statistical “Studies”

08 Monday Nov 2010

Posted by egarcia in Quack Science, SEO Myths, Statistics and Mathematics

Back in 2008, Jan M. Hoem wrote an interesting reflection paper titled “The reporting of statistical significance in scientific journals” (VOLUME 18, ARTICLE 15, PAGES 437-442; 03 JUNE 2008; http://www.demographic-research.org/volumes/vol18/15/18-15.pdf ). The piece was an expanded version of a previous paper (http://www.demogr.mpg.de/papers/working/wp-2007-037.pdf ).

He wrote (and I quote):

“Scientific journals in most empirical disciplines have regulations about how authors should report the precision of their estimates of model parameters and other model elements. Some journals that overlap fully or partly with the field of demography demand as a strict prerequisite for publication that a p-value, a confidence interval, or a standard deviation accompany any parameter estimate. I feel that this rule is sometimes applied in an overly mechanical manner. Standard deviations and p-values produced routinely by general-purpose software are taken at face value and included without questioning, and features that have too high a p-value or too large a standard deviation are too easily disregarded as being without interest because they appear not to be statistically significant. In my opinion authors should be discouraged from adhering to this practice, and flexibility rather than rigidity should be encouraged in the reporting of statistical significance. One should also encourage thoughtful rather than mechanical use of p-values, standard deviations, confidence intervals, and the like. Here is why:”

Hoem then dissects five points related to the misuse of statistical significance results and automatic software solutions. I’m listing these below.

  1. The scientific importance of an empirical finding depends much more on its contribution to the development or falsification of a substantive theory than on the values of indicators of statistical significance.
  2. Measures of statistical significance may be misleading. When a model has been developed through repeated use of tests of significance to include and exclude covariates, to split or combine levels on categorical covariates, and to determine other model features, the user often loses control over statistical-significance values, and the values computed by standard software may be completely misleading.
  3. Standard p-values can be insufficiently precise indicators of statistical significance, particularly if their values are given only in grouped levels, which are often indicated by asterisks beside parameter estimates (“* = p<0.1, ** = p<0.05, *** = p<0.01”, and so on).
  4. It may be more important for an understanding of demographic behavior or other phenomena studied to know whether the inclusion of a categorical covariate in its entirety contributes significantly to an improvement of the model than to know the significance indicators of each of its levels.
  5. Standard deviations, when used, should be reported for interesting contrasts, not for features selected automatically by statistical software.

I completely agree with Hoem.

SEOMOZ and their statistical “studies”

These days search engine optimization marketers (SEOs/SEMs) keep misinterpreting statistical results spat out by software without stopping to think about the significance-behind-the-significance, especially when it comes to a correlation coefficient, r (Pearson, Spearman, etc.).

When one reads SEO hearsay and urban legends at SEOMOZ presenting very small correlation coefficients (0.17, 0.32, etc.) derived from large sample sizes as evidence that variables are “highly correlated” or “well correlated”, it is time to stop and question such “studies”. For reference, see the following links:

http://www.seomoz.org/blog/lda-and-googles-rankings-well-correlated
http://www.seomoz.org/blog/lda-correlation-017-not-032
http://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/  

Fortunately, leading search marketers like Danny Sullivan have questioned those “studies” at a recent search engine conference

http://outspokenmedia.com/internet-marketing-conferences/evening-forum-with-danny-sullivan/ ,

and that was even before SEOMOZ admitted that the 0.32 result had to be corrected to 0.17.

Sean Golliher, founder and publisher of the Search Engine Marketing Journal (SEMJ.org), has also questioned their results (http://www.seangolliher.com/2010/uncategorized/185/ ), which Hendrickson from SEOMOZ still insists on defending.

Since then they have never disclosed the source of the mistake, dismissing it just as a programming error. Unfortunately, they are still claiming that a 0.15 – 0.30 range validates their “studies” (http://www.seo.co.uk/seo-news/seo-tools/the-seomoz-lda-tool-%E2%80%93-our-disappointing-findings.html) .

Small and Large Sample Sizes

When William Sealy Gosset (aka “Student”) proposed, and Ronald A. Fisher expanded on, the test later termed Student’s t-test of significance, the test was meant to be used to assess information from small samples, not from large samples. I have discussed the case of small sample sizes in another post (http://irthoughts.wordpress.com/2010/10/18/on-correlation-coefficients-and-sample-size/ ).

In order to apply a t-test (and other small sample analysis tests) to large samples, divide-and-conquer techniques, like stratification, were eventually developed. In the case of correlation and regression, the reason for doing this is that applying something like a t-test to, for instance, a single correlation coefficient coming from a huge sample can produce misleading results. Let’s see why.

For large enough sample sizes, any nonzero correlation coefficient, however small, will eventually be significant (t-observed > t-table). At that point it might be tempting to assume that the variables in question are highly correlated. Wrong assumption!

The fact is that statistical significance does not necessarily equate to variables being highly correlated, and vice versa. Let’s address this point in two parts: (1) the question of statistical significance and (2) the question of high correlation.

Statistical Significance: Bigger Is Not Always Better

As noted in a Wikipedia entry, “given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero” (http://en.wikipedia.org/wiki/Effect_size ). I have discussed effect size and power analysis in a previous post (http://irthoughts.wordpress.com/2010/10/21/on-power-analysis-and-seo-quack-science/ ).

The reason for the above effect has a lot to do with the definition of statistical significance itself. Statistical significance is the confidence one has that a given result is not due to random chance.

In mathematical terms, the confidence that a result is not by random chance is given by the following formula by Sackett (http://en.wikipedia.org/wiki/Statistical_significance , http://www.cmaj.ca/cgi/content/full/165/9/1226 ):

Confidence = (Signal/Noise)*Sqrt[Sample Size]

This simple expression, or derivatives of it, appear in many different scenarios and disciplines. It describes a generic Confidence Function, F, in terms of a Signal, a Noise, and a Sample Size; that is, F(Signal, Noise, Sample Size). In general, such a generic function tells us that:

  • Confidence is proportional to a Signal source (S).
  • Confidence is inversely proportional to a Noise source (N).
  • Confidence is proportional to a Signal-to-Noise ratio (S/N).
  • Confidence increases with the Sample Size (in the expression above, as its square root).

Let’s apply a squared version of this expression to correlation. To do this, let Y be the dependent variable and let X be the independent variable. Let’s also make the following substitutions:

  • Confidence: expressed as $t^2$.
  • Signal: expressed as $r^2$; i.e., the fraction of the variation in Y explained by X.
  • Noise: expressed as $1 - r^2$; i.e., the fraction of the variation in Y left unexplained.
  • Sample Size: expressed as degrees of freedom; i.e., n – 2 when testing a correlation between two variables.

$$F(\text{Signal}, \text{Noise}, \text{Sample Size}) = t_{\text{observed}}^2 = \frac{r^2}{1 - r^2}\,(n - 2)$$

Taking the square root of both sides, we obtain the usual formula for the t-test of a correlation coefficient.

$$t_{\text{observed}} = r\,\sqrt{\frac{n - 2}{1 - r^2}}$$
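To get a feel for how this statistic behaves, here is a minimal sketch that evaluates it for a fixed r and growing n. The function name `t_observed` and the chosen sample sizes are illustrative assumptions; r = 0.17 is used only because it is the value discussed in this post.

```python
import math


def t_observed(r: float, n: int) -> float:
    """t statistic for a correlation coefficient r computed from n paired observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Same r, larger and larger samples: the statistic keeps growing with n.
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6,}   t-observed = {t_observed(0.17, n):.3f}")
```

The correlation itself never changes in this loop; only the sample size does, and that alone is enough to push the statistic upward.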

Evidently, for a given r value, t-observed increases when n increases. By rearranging this expression, it is possible to compute, for a large enough sample size, a critical value above which r values will be significant. For very large samples at a 95% confidence level (two-tailed), t-table = 1.96 (http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values ). Setting t-observed = t-table = t = 1.96 in the above expression and solving for r, we obtain that the critical r value is given by

$$r = \frac{t}{\sqrt{(n - 2) + t^2}}$$
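For readers who want the intermediate steps, the rearrangement can be spelled out as follows, writing t for the common value t-observed = t-table and taking the positive root:

```latex
\begin{aligned}
t^2 &= \frac{r^2\,(n - 2)}{1 - r^2} \\
t^2\,(1 - r^2) &= r^2\,(n - 2) \\
t^2 &= r^2\,\bigl[(n - 2) + t^2\bigr] \\
r &= \frac{t}{\sqrt{(n - 2) + t^2}}
\end{aligned}
```

Solving the second line for n instead of r gives $n = 2 + t^2\,(1 - r^2)/r^2$.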

The following table lists values for very small r values and huge sample sizes. I’m intentionally using several decimal places and ignoring significant figure rules since I want to make a point on the small values used. I’m also using a 0.95 confidence level for illustration purposes, but for the large samples I could and should use other confidence levels as well.

| n | n – 2 | t | r | S = r*r | N = 1 – r*r | S/N |
|---|---|---|---|---|---|---|
| 1,000 | 998 | 1.96 | 0.0619 | 0.003835 | 0.996165 | 0.003849 |
| 10,000 | 9998 | 1.96 | 0.0196 | 0.000384 | 0.999616 | 0.000384 |
| 100,000 | 99998 | 1.96 | 0.0062 | 0.000038 | 0.999962 | 0.000038 |
| 1,000,000 | 999998 | 1.96 | 0.0020 | 0.000004 | 0.999996 | 0.000004 |
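The rows above can be recomputed with a few lines of code. This is a minimal sketch under the same assumption as the table (t fixed arbitrarily at 1.96); the function name `critical_r` is hypothetical.

```python
import math


def critical_r(n: int, t: float = 1.96) -> float:
    """Smallest r that reaches the given t value with n - 2 degrees of freedom."""
    return t / math.sqrt((n - 2) + t * t)


# Recompute r, Signal, Noise, and S/N for each sample size in the table.
for n in (1_000, 10_000, 100_000, 1_000_000):
    r = critical_r(n)
    signal = r * r          # S = r*r
    noise = 1 - signal      # N = 1 - r*r
    print(f"{n:>9,}  {n - 2:>7}  {r:.4f}  {signal:.6f}  {noise:.6f}  {signal / noise:.6f}")
```

Passing a different `t` (for example, a value read from a t-table for a small sample) gives the critical r for that case.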

For a sample size of 10,000 observations the critical r is 0.0196 or about 0.02, meaning that for such a huge sample size any r value above this small and critical r value will be significant. However, something interesting is observed from this table: (PS See footnote update)

At the critical r for large sample sizes, the Noise vastly exceeds the Signal. For instance, at n = 10,000 the amount of Signal is very small (0.000384) while the amount of Noise is above 0.9996… or 99.96…%, giving a quite trivial S/N ratio. A similar reasoning can be applied for r = 0.17 (S = 0.0289, N = 0.9711) and r = 0.32 (S = 0.1024, N = 0.8976). The corresponding S/N ratios (about 0.030 and 0.114) are still small.

One can also solve the above expression for n to find sample sizes for some small r values and arbitrary t as shown in the following table (PS See footnote update).

| t | r | S = r*r | N = 1 – r*r | S/N | n | n – 2 |
|---|---|---|---|---|---|---|
| 1.96 | 0.0200 | 0.000400 | 0.999600 | 0.000400 | 9602 | 9600 |
| 1.96 | 0.1500 | 0.022500 | 0.977500 | 0.023018 | 169 | 167 |
| 1.96 | 0.1700 | 0.028900 | 0.971100 | 0.029760 | 131 | 129 |
| 1.96 | 0.3000 | 0.090000 | 0.910000 | 0.098901 | 41 | 39 |
| 1.96 | 0.3200 | 0.102400 | 0.897600 | 0.114082 | 36 | 34 |
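The sample sizes in this table follow from the same expression solved for n. The sketch below is one way to reproduce them, again with t set arbitrarily to 1.96; the function name `sample_size` is hypothetical, and the degrees of freedom are rounded to the nearest integer, which is an assumption that happens to match the table.

```python
def sample_size(r: float, t: float = 1.96) -> int:
    """Approximate n at which a given r just reaches the given t value."""
    # Degrees of freedom from t^2 = [r^2 / (1 - r^2)] * (n - 2), i.e. t^2 divided by S/N.
    degrees_of_freedom = t * t * (1 - r * r) / (r * r)
    return round(degrees_of_freedom) + 2


for r in (0.02, 0.15, 0.17, 0.30, 0.32):
    n = sample_size(r)
    print(f"r = {r:.4f}   n = {n}   n - 2 = {n - 2}")
```

Because the degrees of freedom equal t squared divided by S/N, a small S/N ratio is exactly what forces the required sample size up.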

Still, note that the amount of Noise completely overcomes the Signal, producing trivial S/N ratios. In general for small r and large n values significance is achieved at the cost of Noise masking the Signal. When this occurs the statistical significance is not a practical guideline for drawing useful conclusions from the data at hand.

This drives the present discussion to the substantive part of the problem missed by SEOs, and that is …

Statistical Significance Does Not Necessarily Mean Highly Correlated Results

Simply stated, statistical significance does not necessarily imply that the X, Y variables are highly correlated.

A simple scatterplot will convince anyone that for the above small r values there will be no pattern or trend in the data set. The corresponding regression model will be useless for forecasting or inferring anything of value, except that the data spreads so wildly that it has no method to its chaos. What else is to be expected from a data set with a large Noise and small S/N ratio?

This is something that SEOs/SEMs still don’t seem to understand: t-observed > t-table does not necessarily mean high correlation, and vice versa. I don’t have any personal stake (or take) against them, but when folks like Hendrickson, Fishkin, and others from SEOMOZ ignore Signal-to-Noise ratios and start referring to small r values as evidence that experimental variables are “highly” or “well” correlated, it is more than fair to call such “studies” Quack “Science”. That label might sound harsh, but in this case it is appropriate.

Search engine marketers might be good at selling snakeoil, publishing sloppy “studies”, or recanting on overhyped statements, but not at doing real Science. They should know better; i.e., that

  • “significance” does not mean “correlation”.
  • “significance” does not mean “important”.
  • “insignificance” does not mean “unimportant”.

Statistical “significance” only means that the observed result is unlikely to be due to random chance. Therefore, a significant correlation does not necessarily mean a “high”, “good”, or “strong” correlation between variables.

To understand all this we need to distinguish between statistical significance and practical significance.

Statistical Significance vs. Practical Significance

As stated at this Wikipedia entry (emphasis added in boldface; http://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Criticism ):

A common misconception is that a statistically significant result is always of practical significance, or demonstrates a large effect in the population. Unfortunately, this problem is commonly encountered in scientific writing. Given a sufficiently large sample, extremely small and non-notable differences can be found to be statistically significant, and statistical significance says nothing about the practical significance of a difference.

Use of the statistical significance test has been called seriously flawed and unscientific by authors Deirdre McCloskey and Stephen Ziliak. They point out that “insignificance” does not mean unimportant, and propose that the scientific community should abandon usage of the test altogether, as it can cause false hypotheses to be accepted and true hypotheses to be rejected.

Some statisticians have commented that pure “significance testing” has what is actually a rather strange goal of detecting the existence of a “real” difference between two populations. In practice a difference can almost always be found given a large enough sample. The typically more relevant goal of science is a determination of causal effect size. The amount and nature of the difference, in other words, is what should be studied. Many researchers also feel that hypothesis testing is something of a misnomer. In practice a single statistical test in a single study never “proves” anything.

That pretty much settles the question of discerning between statistical significance and practical significance of correlation coefficients, but does not tell us how to quantitatively discern between the two concepts. In an upcoming article, I will derive expressions that might help to quantitatively assess these.

Since the tutorial on correlation coefficients (http://www.miislita.com/information-retrieval-tutorial/a-tutorial-on-correlation-coefficients.pdf ) has been updated several times and is getting too long, I will put that upcoming material in a separate PDF file. As a sneak preview, we will be examining extreme cases (too high/low r values, too high/low sample sizes, too high/low signal-to-noise ratios, etc.).

PS. I updated this post to fix some little typos.

Footnote. I found it erroneous to include the entries for n = 10 and n = 100 in the first table, so I removed these altogether and limited the discussion to the large n values. A reader asked why I used t-table = 1.96 for all entries. I thought it was clear from the discussion that the above tables are meant to show calculations for arbitrarily set t-values. In a real test, you would need to use the actual t values from statistical tables. For instance, for n = 10 you would have to use a t-table value of t = 2.306 at the 0.95 level. You should get

| n | n – 2 | t | r | S = r*r | N = 1 – r*r | S/N |
|---|---|---|---|---|---|---|
| 10 | 8 | 2.306 | 0.6319 | 0.399293 | 0.600707 | 0.664705 |