IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: IR Tutorials

The Scope Hypothesis in IR: Who is Right?

13 Saturday Aug 2011

Posted by egarcia in Data Mining, IR Tools, IR Tutorials, Queries

≈ 8 Comments

In previous posts, we have presented two tutorials on Okapi BM25 and BM25F, which are based on the Verbosity and Scope Hypotheses.

However…

Here I would like to reference research on both sides of the Scope Hypothesis.

In the abstract of “Revisiting the relationship between document length and relevance” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.3786&rep=rep1&type=pdf), Losada, D.E., Azzopardi, L. and Baillie, M. (2008) state:

“The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.”

Really….?

However, in the abstract of “Enhancing ad-hoc relevance weighting using probability density estimation” (http://www.sigir2011.org/papershow.asp?PID=104), Zhou, Huang, and He (2011) state:

“Classical probabilistic information retrieval (IR) models, e.g. BM25, deal with document length based on a trade-off between the Verbosity hypothesis, which assumes the independence of a document’s relevance of its length, and the Scope hypothesis, which assumes the opposite. Despite the effectiveness of the classical probabilistic models, the potential relationship between document length and relevance is not fully explored to improve retrieval performance. In this paper, we conduct an in-depth study of this relationship based on the Scope hypothesis that document length does have its impact on relevance. We study a list of probability density functions and examine which of the density functions fits the best to the actual distribution of the document length. Based on the studied probability density functions, we propose a length-based BM25 relevance weighting model, called BM25L, which incorporates document length as a substantial weighting factor. Extensive experiments conducted on standard TREC collections show that our proposed BM25L markedly outperforms the original BM25 model, even if the latter is optimized.”

My take…

I haven’t reviewed BM25L vs. BM25F yet. Still, the question about the Scope Hypothesis is intriguing. From what I can tell (and this is solely my opinion), if an author writes more about a topic or several topics in a given document, he will more likely use more instances of index terms. A cluster of the top index term density values (IDs) spread over said document should give some insight into its scope. We have developed a tool that computes these clusters. We are now testing whether that translates into improved relevance.

Assuming that Web IR systems out there (e.g., search engines) use these algorithms or derivatives of these: What would be the implications for content writers trying to understand algos based on the Verbosity and Scope Hypotheses? Hello, copywriters, SEOs, etc. This puppy is nice to watch.

BM25 and BM25F: Implications to SEO and Web Design

04 Thursday Aug 2011

Posted by egarcia in IR Tutorials

≈ 1 Comment

Yesterday we published two great tutorials on the BM25 and BM25F algorithms.

The take-home points from the theory behind these algorithms:

1. A term (e.g., a keyword) has more information gain when it occurs for the very first time.

2. A term most likely weighs more in a title field than in other fields.

3. The weight of a term and its occurrence frequency are not linearly related.

4. A linear combination of field scores that destroys term dependencies is contraindicated (See BM25F).

Most SEOs know well about 1 and 2.
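Points 1 and 3 can be illustrated with a minimal sketch of the term-frequency part of the classic BM25 formula. The parameter values k1 = 1.2 and b = 0.75 are commonly quoted defaults and are assumptions here, as are the function name and the document lengths; the IDF factor is left out on purpose.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25 (IDF factor omitted)."""
    # Length normalization: equals k1 for a document of average length.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


# A document of average length, so the length normalization is neutral.
weights = [bm25_tf_weight(tf, 100, 100) for tf in range(0, 6)]

# Gain contributed by each additional occurrence of the term.
gains = [later - earlier for earlier, later in zip(weights, weights[1:])]

for tf, w in enumerate(weights):
    print(f"tf={tf}: weight={w:.3f}")
print("gain per extra occurrence:", [round(g, 3) for g in gains])
```

The first occurrence contributes the largest gain, each later occurrence contributes less, and the weight never exceeds k1 + 1, so weight and frequency are not linearly related.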

As a term has more information gain during its first occurrences, a document about specific terms should mention these at the beginning, particularly in the title tag. For testing purposes, and since end users assume that a large headline is the actual title of a document (which it is not), we like to repeat the title tag content in an h1 header placed prominently at the beginning of the copy. Keywords from the title are then repeated early in the document body. In this way, one can write for both end users and search engines. If a search engine uses some form of the above algorithms (and we don’t know whether any does), that base is covered, too. You don’t have to adopt this strategy unless you want to. It is just our way of conducting tests, but it is a flexible approach.

New Tutorials: Okapi BM25F and BM25

03 Wednesday Aug 2011

Posted by egarcia in IR Tutorials, Queries

≈ 5 Comments

We have a new tutorial on Okapi Simple BM25 with Extension to Multiple Fields.

http://www.miislita.com/information-retrieval-tutorial/okapi-simple-bm25f-tutorial.pdf

Unlike BM25, this model (known as Simple BM25F) incorporates the structure of documents into the scoring process.
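As a rough sketch of what incorporating document structure can mean, the snippet below contrasts two ways of handling fields: saturating each field separately and then adding the field scores, versus merging the weighted field frequencies into one pseudo-frequency before saturating, which is the idea behind Simple BM25F. The field weights, function names, and k1 value are hypothetical, and length normalization and IDF are omitted, so this is an illustration and not the tutorial's full formula.

```python
def saturate(tf, k1=1.2):
    """BM25-style saturation without length normalization (simplification)."""
    return tf * (k1 + 1.0) / (tf + k1)


def linear_combination_score(field_tfs, field_weights, k1=1.2):
    """Saturate each field on its own, then add the weighted field scores."""
    return sum(field_weights[f] * saturate(tf, k1) for f, tf in field_tfs.items())


def simple_bm25f_score(field_tfs, field_weights, k1=1.2):
    """Merge weighted field frequencies first, then saturate once."""
    pseudo_tf = sum(field_weights[f] * tf for f, tf in field_tfs.items())
    return saturate(pseudo_tf, k1)


# Hypothetical field weights: a title hit counts three times a body hit.
field_weights = {"title": 3.0, "body": 1.0}
# The term occurs once in the title and once in the body.
field_tfs = {"title": 1, "body": 1}

linear = linear_combination_score(field_tfs, field_weights)
merged = simple_bm25f_score(field_tfs, field_weights)
print("sum of per-field scores:", round(linear, 3))
print("merged-then-saturated score:", round(merged, 3))
```

In the per-field sum, a term matched in two fields can score above the saturation ceiling of k1 + 1, while the merged version keeps the nonlinear relation between weight and frequency intact.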

 

In addition, we’ve uploaded a new, improved, and expanded version of the Okapi Best Match 25 tutorial.

http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf

 

Have a great IR day!

A Tutorial on Okapi BM25

01 Friday Jul 2011

Posted by egarcia in IR Tutorials, Queries

≈ 4 Comments

We have uploaded a new tutorial: Okapi BM25. See http://www.miislita.com/information-retrieval-tutorial/okapi-bm25-tutorial.pdf

This is a tutorial on the classic Okapi Best Match 25.

Enjoy it.

 

On the Non-Additivity of Correlation Coefficients

07 Friday Jan 2011

Posted by egarcia in Data Mining, IR Tutorials, Statistics and Mathematics

Regardless of your research field, sooner or later you need to generate average statistics, for instance a weighted correlation coefficient between any two variables, x and y.

Computing weighted averages of correlation coefficients depends on the weighting strategy used: unit weights, sample size, optimal weights, within/between study variances, etc. Most textbooks advocate the use of Fisher’s Z Transformation to, for instance, compute confidence intervals and average correlations.

One thing that has been bothering me for a long time now is this: what would be the discriminatory power of such weighting strategies if we are in the presence of identical data sets of correlation values, but coming from samples with different variability effects in the dependent variable?

Research conducted over the last four months led me to an alternate approach to the above weighting strategies.

At that time I was putting together a new tutorial series on meta-analysis, so this problem diverted my attention and was always in the back of my head.

So after many meals and long nights, I finally decided to include my research findings as Part 1 of the tutorial series, which you can read here: On the Non-Additivity of Correlation Coefficients.

I hope you like it. Since this is relevant to many research areas, please send your feedback through private, confidential email and not through this blog.

PS. If others are interested in testing how the proposed approach compares with other weighting strategies, feel free to contact me. I’m interested in testing with real (non-simulated) data from any field: science, engineering, education, behavioral & social sciences, allied health, literature, politics, marketing, etc.

On Correlation Coefficients and Sample Size

18 Monday Oct 2010

Posted by egarcia in IR Tutorials, SEO Myths, Spam, Statistics and Mathematics

≈ 1 Comment

Today I updated my Tutorial on Correlation Coefficients to include a new section on the effect of sample size on the significance of correlation coefficients. This was motivated by some comments from search engine marketers on correlation strengths. (http://searchenginewatch.com/3641002). The new material might help those interested in learning whether a reported correlation coefficient is statistically different from zero. It is given below. Enjoy it.

The problem with correlation strength scales is that these say nothing about how the size of a sample impacts the significance of a correlation coefficient. This is a very important issue that is now addressed.

Consider three different correlation coefficients: 0.50, 0.35, and 0.17. Assume that we want to test that there is no significant relationship between the two variables at hand. The null hypothesis (H0) to be tested is that these r values are not statistically different from zero (rho = 0). How to proceed?

As recommended by Stevens (17), for rho = 0, H0 can be tested using a two-tailed (i.e., two-sided) t-test at a given confidence level, usually the 95% level. If t_calculated ≥ t_table, H0 is rejected. However, if t_calculated < t_table, H0 is not rejected and there is no significant correlation between the variables.

Here t_calculated is computed as r/SE_r = r*SQRT[(n – 2)/(1 – r^2)], where SE_r is the standard error of r, while t_table values are obtained from the literature (http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values ). Table 2 summarizes the result of testing the null hypothesis at different sample size values.

Table 2. H0 tests at different sample sizes; two-tailed, 95% confidence.

| n | df = n – 2 | r | SE_r | t_calculated | t_table (0.95) | Reject (H0: rho = 0)? |
|---|---|---|---|---|---|---|
| 5 | 3 | 0.50 | 0.50 | 1.000 | 3.182 | don’t reject |
| 10 | 8 | 0.50 | 0.31 | 1.633 | 2.306 | don’t reject |
| 12 | 10 | 0.50 | 0.27 | 1.826 | 2.228 | don’t reject |
| 14 | 12 | 0.50 | 0.25 | 2.000 | 2.179 | don’t reject |
| 20 | 18 | 0.50 | 0.20 | 2.449 | 2.101 | reject |
| 30 | 28 | 0.50 | 0.16 | 3.055 | 2.048 | reject |
| 40 | 38 | 0.50 | 0.14 | 3.559 | 2.024 | reject |
| 50 | 48 | 0.50 | 0.13 | 4.000 | 2.011 | reject |
| 5 | 3 | 0.35 | 0.54 | 0.647 | 3.182 | don’t reject |
| 10 | 8 | 0.35 | 0.33 | 1.057 | 2.306 | don’t reject |
| 12 | 10 | 0.35 | 0.30 | 1.182 | 2.228 | don’t reject |
| 14 | 12 | 0.35 | 0.27 | 1.294 | 2.179 | don’t reject |
| 20 | 18 | 0.35 | 0.22 | 1.585 | 2.101 | don’t reject |
| 30 | 28 | 0.35 | 0.18 | 1.977 | 2.048 | don’t reject |
| 40 | 38 | 0.35 | 0.15 | 2.303 | 2.024 | reject |
| 50 | 48 | 0.35 | 0.14 | 2.589 | 2.011 | reject |
| 5 | 3 | 0.17 | 0.57 | 0.299 | 3.182 | don’t reject |
| 10 | 8 | 0.17 | 0.35 | 0.488 | 2.306 | don’t reject |
| 12 | 10 | 0.17 | 0.31 | 0.546 | 2.228 | don’t reject |
| 14 | 12 | 0.17 | 0.28 | 0.598 | 2.179 | don’t reject |
| 20 | 18 | 0.17 | 0.23 | 0.732 | 2.101 | don’t reject |
| 30 | 28 | 0.17 | 0.19 | 0.913 | 2.048 | don’t reject |
| 40 | 38 | 0.17 | 0.16 | 1.063 | 2.024 | don’t reject |
| 50 | 48 | 0.17 | 0.14 | 1.195 | 2.011 | don’t reject |

The table addresses at which size level an r value is high enough to be statistically significant.

For n = 14, all three r values (0.50, 0.35, and 0.17) are not statistically different from zero.

For n = 30, r = 0.50 is statistically different from zero while r = 0.35 and r = 0.17 are not.

Conversely, r = 0.50 is not statistically different from zero when n is equal to or less than 14, while r = 0.35 is not different from zero when n is equal to or less than 30.

Finally, r = 0.17 is not statistically different from zero at any of the sample sizes tested.
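For readers who want to reproduce Table 2, here is a minimal sketch of the test. The critical values are copied from the table itself rather than computed, and the function names are just illustrative choices.

```python
import math

# Two-tailed critical t values at the 95% level, keyed by degrees of freedom.
# These are the t_table values listed in Table 2.
T_TABLE_95 = {3: 3.182, 8: 2.306, 10: 2.228, 12: 2.179,
              18: 2.101, 28: 2.048, 38: 2.024, 48: 2.011}


def standard_error_r(r, n):
    """Standard error of r under H0: rho = 0."""
    return math.sqrt((1.0 - r * r) / (n - 2))


def t_calculated(r, n):
    """t statistic: r divided by its standard error."""
    return r / standard_error_r(r, n)


def reject_h0(r, n):
    """True when t_calculated reaches the tabulated critical value."""
    return abs(t_calculated(r, n)) >= T_TABLE_95[n - 2]


for r in (0.50, 0.35, 0.17):
    for n in (5, 10, 12, 14, 20, 30, 40, 50):
        verdict = "reject" if reject_h0(r, n) else "don't reject"
        print(f"n={n:2d} r={r:.2f} SE_r={standard_error_r(r, n):.2f} "
              f"t={t_calculated(r, n):.3f} -> {verdict}")
```

Only the sample sizes in the table are supported, since the dictionary holds just those critical values.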

Understanding Accuracy and Precision

14 Thursday Oct 2010

Posted by egarcia in IR Tutorials, SEO Myths

≈ 1 Comment

Students often have a hard time understanding the difference between accuracy and precision, particularly when they read quack “science” “studies” when surfing the Web. This post might help them to grasp these concepts.

What is Accuracy?

Accuracy is a term describing deviation of an experimental value from a target value. A target value is a value accepted as ‘true’. Constants, fundamental quantities, and theoretical values are considered ‘true values’. Thus, accuracy is proximity to a true value.

To illustrate, assume that a quantity x is measured. Its true value is x_t = 1.00 and we report an experimental value x_e of 0.90. The absolute error of this observation is | x_e – x_t | = 0.10 and its relative error is (| x_e – x_t |/ x_t)*100 = 10%. The accuracy is the ratio of the experimental value to the true value. When expressed as a percent, it is called relative accuracy. In this case, x_e/ x_t = 0.90/1.00. This corresponds to a 90% accuracy.
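The arithmetic of this example can be written out as a short sketch; the variable names are mine and the values are the ones used above.

```python
x_true = 1.00  # accepted 'true' value
x_exp = 0.90   # reported experimental value

absolute_error = abs(x_exp - x_true)
relative_error_pct = absolute_error / x_true * 100
relative_accuracy_pct = x_exp / x_true * 100

print(f"absolute error:    {absolute_error:.2f}")
print(f"relative error:    {relative_error_pct:.0f}%")
print(f"relative accuracy: {relative_accuracy_pct:.0f}%")
```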

What is Precision?

Precision has been loosely defined as how reproducible experimental results are. However, modern convention makes a careful distinction between reproducibility (between-run precision) and repeatability (within-run precision). Furthermore, according to Freiser (1992),

  • Repeatability is the closeness of agreement between individual experimental results obtained with the same method on identical test material or samples, under the same conditions (same operator, same apparatus, same laboratories, and same intervals of time).
  • Reproducibility is the closeness of agreement between individual experimental results obtained with the same method on identical test material or samples, but under different conditions (different operator, different apparatus, different laboratories, and different intervals of time).

Note that the source of dispersion and errors in the experimental results is different in each case. Therefore arbitrarily expressing the precision of results in terms of standard deviations without considering how the data was collected (within- or between-run precision) should be avoided.

Similarly, comparing any two standard deviations, or standard errors for that matter, without regard for how the data was collected (experimental conditions, number of degrees of freedom, different sampling times, etc.) should also be avoided. In particular, estimating precision or comparing precisions from data sets that constantly change within sampling times is a futile exercise.
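To see why the source of dispersion matters, consider a small sketch with made-up measurements (the numbers are hypothetical and only serve the illustration). The standard deviation within each run describes repeatability, while the spread between run means describes between-run precision; the two can differ widely.

```python
import statistics

# Hypothetical measurements of the same sample in two separate runs
# (for example, different operators or different days).
run_a = [10.1, 10.2, 10.0, 10.1]
run_b = [10.6, 10.5, 10.7, 10.6]

# Within-run precision: spread of results inside each run.
within_a = statistics.stdev(run_a)
within_b = statistics.stdev(run_b)

# Between-run precision: spread of the run means.
between = statistics.stdev([statistics.mean(run_a), statistics.mean(run_b)])

# Naively pooling everything mixes both sources of dispersion.
pooled_naive = statistics.stdev(run_a + run_b)

print(f"within-run SD (run A): {within_a:.3f}")
print(f"within-run SD (run B): {within_b:.3f}")
print(f"between-run SD of means: {between:.3f}")
print(f"SD of all data lumped together: {pooled_naive:.3f}")
```

Quoting only the lumped standard deviation would hide the fact that each run is tight on its own and that most of the dispersion comes from the difference between runs.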

Last but not least, the precision of a measurement depends on the measuring scale used. For instance, saying “He is about 55 years old.” is less precise than saying “He is 660 months old.” or than saying “He is 20,075 days old”.

References

Freiser, H. (1992). Concept Calculations in Analytical Chemistry. Chapter 12, p. 203. CRC Press, Boca Raton.

Miller, J. C. & Miller, J. N. (1984). Statistics for Analytical Chemistry. Chapter 1, p.19. Wiley, New York.

PS. I misplaced repeatability and reproducibility and fixed a few more typos. Well and done. Thanks Dr. J. C. for pointing that out.

Understanding Fisher’s Z Transformations

20 Friday Aug 2010

Posted by egarcia in SEO Myths, IR Tutorials, Quack Science

As mentioned in my Tutorial on Correlation Coefficients, the best known technique for transforming correlation coefficient (r) values into weighted additive quantities is the r-to-Z transformation due to Fisher.

Fisher’s r-to-Z transformation is an elementary transcendental function called the inverse hyperbolic tangent function. The reverse, a Z-to-r transformation, is therefore a hyperbolic tangent function.

On Windows computers, these functions are built into the scientific calculator program, which is accessible by navigating to Start > All Programs > Accessories > Calculator. Microsoft Excel also has these built in as the ATANH and TANH functions.

These transformations are needed to compute a weighted mean correlation coefficient and for hypothesis testing. Note that averaged correlation coefficients are not computable directly from raw r values.

Indeed, it is not possible to add, subtract, average or take standard deviations out of raw r values.
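As a minimal sketch, the two transformations map directly onto `math.atanh` and `math.tanh` in Python. The weighting by n – 3 below is the common inverse-variance choice and is an assumption on my part, not something prescribed in this post; the r values and sample sizes are hypothetical.

```python
import math


def r_to_z(r):
    """Fisher's r-to-Z transformation (inverse hyperbolic tangent)."""
    return math.atanh(r)


def z_to_r(z):
    """Reverse Z-to-r transformation (hyperbolic tangent)."""
    return math.tanh(z)


def weighted_mean_r(rs, ns):
    """Average in Z space with weights n - 3, then convert back to r."""
    weights = [n - 3 for n in ns]
    z_bar = sum(w * r_to_z(r) for w, r in zip(weights, rs)) / sum(weights)
    return z_to_r(z_bar)


# Hypothetical correlation coefficients and their sample sizes.
rs = [0.30, 0.60, 0.80]
ns = [20, 50, 30]

print("weighted mean r via Fisher Z:", round(weighted_mean_r(rs, ns), 3))
print("naive arithmetic mean of r:  ", round(sum(rs) / len(rs), 3))
```

Note that `math.atanh` raises an error at r = 1 or r = –1, where the transformation is undefined.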

Unfortunately, some researchers with limited knowledge of Statistics have published papers containing such gross errors. What is worse, reviewers of those papers either are not statisticians or have been lazy enough to overlook the concept, leading graduate students and postdocs into error.

Search marketers are also buying into the error. An example of this is the SEOs from SEOMOZ promoting quack “science” and sloppy “statistical studies”. If you are an SEO and still want to believe their snake oil marketing, that’s up to you.

On Correlation Strength Scales and a tutorial update

21 Wednesday Jul 2010

Posted by egarcia in SEO Myths, IR Tutorials, Data Mining, Spam

Today, I’ve updated the Tutorial on Correlation Coefficients in order to add a new section on correlation strength scales. I feel this is warranted.

In a 7/16/2010 Search Engine Watch post, a search marketer reported an r value of 0.67 and stated:

“If 1 is perfectly correlative, then 0.67 is certainly a strong correlative relationship and a figure of some interest, when we consider there are a couple hundred factors that reportedly contribute to rank.” (http://searchenginewatch.com/3641002).

This raises the question of how to characterize correlation strengths. Several attempts have been made at classifying r values as ‘weak’, ‘poor’, ‘moderate’, ‘strong’, or ‘very strong’ using scales of correlations.

The problem with these scales is that their boundaries are often defined using subjective arguments, not to mention that not all researchers agree with using such boundaries or scales at all.

Feel free to read the updated version now.

BTW, during the updating process I found an involuntary missing “not” in one line of example 3 in the tutorial. It should read as follows: “The difference between r1 and r2 is not significant at the 95% confidence level.” This should be obvious from reading the null hypothesis. My mistake, nevertheless.

Another correction made was in the Olkin-Pratt formula for bias. In this case, there is a missing parenthesis. Instead of 2n – 3, this should be written as 2(n -3). The parenthesis was originally included in the Excel program used to draw the graphs. So the graphs were not affected.

In the future and if I can find the time, I may add an exercise section applied to IR and search engines.

Cheers.

On SEOMOZ “Knowledge” about Statistics

12 Monday Jul 2010

Posted by egarcia in SEO Myths, IR Tutorials, Quack Science

≈ 2 Comments

In the 04/23/2010 blog post, Beware of SEO Statistical Studies, we commented on incorrect statistical methods used by SEOs in two different blogs. In that post, we mentioned that we agreed with the comments made by one dissenting poster with regard to an SEOMOZ post published by its owner, Rand Fishkin (aka randfish), and titled The Science of Ranking Correlations. That SEOMOZ post subsumed a methodology implemented by a member of SEOMOZ, Ben Hendrickson (aka ijustwritecode).

We ended our post with the following lines:

“To sum up, beware of SEO “Science” and their statistical “studies”.”

“I hope this helps.”

“PS: I updated this post to add the figure above and few additional comments.”

That was it.

Two days later, on 04/26/2010, the dissenting poster (Branko Rithman; aka neyne, whiteweb_b) stopped by this blog and explained why, in his opinion, SEOMOZ was doing incorrect science. We went through an exchange of comments with Rithman that spanned two days and ended on 04/28/2010. All comments exchanged were limited to referencing and commenting on facts from the Statistics literature. No personal attacks were directed at SEOMOZ, Fishkin, or Hendrickson.

On 06/12/2010, almost two months later, that post was referenced by Ted Dziuba in the post SEO is Mostly Quack Science, wherein Dziuba was very critical of SEOs and SEOMOZ, referring to their work as “quack science”.

While we agreed with many of the remarks made by Dziuba, we disagree with at least one. In his post he stated that “A correlation of zero suggests that the two variables are completely independent of one another.”. Actually, this is incorrect, as r = 0 only means the variables are not linearly correlated. In our IR Watch Newsletter we have explained that r = 0 implies nothing about whether variables are dependent, independent, random (a claim later made by Fishkin) or deterministic, or whether they might or might not be nonlinearly correlated.
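A tiny sketch makes the point concrete. Below, y is completely determined by x through y = x^2, yet Pearson's r over a symmetric range of x comes out as zero. The data points and the helper name are mine, chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # y depends on x deterministically, but not linearly

r = pearson_r(xs, ys)
print("r =", r)
```

The variables here are as dependent as they can be, so a zero r cannot be read as evidence of independence.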

On 06/14/2010, Sean Golliher, founder and publisher of Search Engine Marketing Journal (SEMJ.org), stopped by to add comments on why SEOMOZ’s statistical treatment was incorrect. I responded: “Unfortunately these type of “research” gives a black eye to the SEO industry.” I limited the length of that post since I was rushing to put out our monthly IRW newsletter, which was already late.

The same day, Hendrickson dropped by this blog and claimed that all comments were a personal attack and criticism directed at him, despite the fact that his name was never mentioned or ever inferred in any of the previous posts by me. [PS. I just quoted Rithman]

Hendrickson stated:

“In your post, you criticize me…” blah, blah, blah…

He then insisted that what they were doing was a correct statistical treatment. Of course, he did not present any mathematical reasoning to support his claims. He repeated that he was right and demanded a retraction. Since he evidently took our posts personally and we were under time constraints to publish IRW, I decided to hold his post until a full response to his virulent accusations was ready for public consumption.

It was after his remarks, given with no statistical or mathematical evidence, that we called into question any valid knowledge Hendrickson and Fishkin might have about Statistics. Their indiscriminate references to power and exponential laws confirm this perception. It was after their reactions that we called what SEOMOZ was doing “quack science”, as can be seen from this 06/16/2010 post. As of today, I stand by that statement. I ended that post with the following line:

“I am the one that is now demanding a public retraction from you both for putting out crap “science” at SEOMOZ.”

That’s a strong statement and I don’t regret it. Why? Because often search engine marketers and their cheerleaders put out a lot of false knowledge and opinions labeled as “science” or “study”, giving a black eye to the serious, ethical sector of their own industry.

The next day, on 06/17/2010, Dziuba’s post was featured at Sphinn by Jill Whalen, where a heated debate developed among SEOs and our post was commented on. Looking at that debate, we concluded that the best way to dispel so many myths and so much confusion was through education.

On 06/20/2010, it was mentioned that we planned to write two tutorials on statistics. On 06/25/2010 we announced the release of the first tutorial, A Tutorial on Standard Errors.

In that tutorial we explained the right way of computing standard errors. We explained why it is incorrect to compute means of correlation coefficients as arithmetic averages, to treat those mean correlation coefficients as one would treat the mean of x observations, and to compute standard deviations and standard errors with them. We explained that Fisher r-to-Z transformations are needed and that sometimes even pooled standard errors must be computed.

It appears Hendrickson was monitoring that release since later that Friday (there is almost a time zone difference of 4 hours between Seattle and San Juan), he published a “rebuttal” at SEOMOZ titled Statistics: a win for SEO wherein he tried to refute the main statements of the tutorial, perhaps as a damage control tactic.

Hendrickson’s “rebuttal” was limited to repeating his previous claims, again with no mathematical proofs. He then made some laughable remarks about Z scores exploding at r = 1 and quoted reference work that he obviously never read thoroughly; otherwise he would have realized that one of the papers he cited was based on previous work with an obvious theoretical error: computing arithmetic averages of correlation coefficients.

In emails exchanged with a co-author of that work, the author recognized that their approach of adding and averaging correlation coefficients was indisputably incorrect after all. You can read the full story and comments exchanged with him in our second tutorial, A Tutorial on Correlation Coefficients, published on 07/08/2010. The tutorial is also a direct response to Hendrickson and Fishkin’s “rebuttal”.

In the SEOMOZ “rebuttal” there were a few comments made by Hendrickson (and later endorsed by Fishkin at Sphinn) that are diversion tactics and plain lies. That is common in the art of misinformation: second-guess your opponent and write a rebuttal to comments that were never made. This relates to my comments on PCA.

What Hendrickson claimed I said about PCA: “Rebuttal To Claim That PCA Is Not A Linear Method”

What I actually said about PCA:
This is what I said and stand by.

PCA
About the word “linearity” in PCA: Be careful when you, Ben and Rand, talk about linearity in connection with PCA as no assumption needs to be made in PCA about the distribution of the original data. I doubt you guys know about PCA, a dimensionality reduction technique frequently used in data mining. It is also used in clustering analysis as one way of addressing the so-called K-Means initial centroid problem, though it can fail with overlapping clusters.

PCA can be applied to linear and nonlinear data. Ben, have you ever heard about nonlinear PCA? But even if you haven’t, let me say this: Nothing more nonlinear than images, clusters, noisy scattered data, etc–yet PCA can be applied to these scenarios to extract structural components hinting at possible patterns hidden by noisy dimensions.

Linearity in PCA does not refer to the original variables, but to linearity in the transformed variables used in PCA because these are transformed into linear combinations. To understand linear combinations a review on linear algebra helps. According to Gering from MIT (http://people.csail.mit.edu/gering/areaexam/gering-areaexam02.pdf )

“Principle Component Analysis (PCA) replaces the original variables of a data set with a smaller number of uncorrelated variables called the principal components….The method is linear in that the new variables are a linear combination of the original. No assumption is made about the probability distribution of the original variables.”

The linearity assumption is with the basis vectors. And even so, variants of PCA do exist to deal with nonlinearity.
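To make the linear-combination point tangible, here is a minimal two-dimensional sketch that uses only the standard library. It applies PCA to points on a parabola, which is clearly nonlinear data, through the closed-form rotation that diagonalizes a 2x2 covariance matrix. The data set and all names are hypothetical, and a real analysis would normally rely on a numerical library instead.

```python
import math


def pca_2d(points):
    """PCA for 2-D data via the closed-form rotation of the covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the first principal axis (direction of greatest variance).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # Each score is a linear combination of the centered original variables.
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    return (c, s), (-s, c), scores


def covariance(us, vs):
    """Sample covariance of two equally long sequences."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / (n - 1)


# Hypothetical nonlinear data: points on the parabola y = x^2.
data = [(x / 2.0, (x / 2.0) ** 2) for x in range(0, 9)]

pc1_axis, pc2_axis, scores = pca_2d(data)
pc1 = [p for p, _ in scores]
pc2 = [q for _, q in scores]

print("PC1 axis:", pc1_axis)
print("variance along PC1:", covariance(pc1, pc1))
print("variance along PC2:", covariance(pc2, pc2))
print("covariance between PC scores:", covariance(pc1, pc2))
```

The new variables are linear combinations of the originals and are uncorrelated with each other, even though the original data follows a curve.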

Incidentally, on 06/28/2010, a poster from SEOMOZ, Daniel Deceuster, dropped by our blog to say he was not happy with SEOMOZ’s claims. My 06/30/2010 comments to him are available and are given below:

“Hi, Daniel:
Thank you for stopping by. I’m very busy these days, so sorry for not responding before.

Good luck with your tests. Please keep in mind that with a problem with many variables, testing all of them at the same time allows one to account for outliers, noisy variables and possible interactions between variables which often introduce nonlinearity.

A one-variable-at-a-time testing is prone to ignore the above. Arbitrarily removing nonlinearity can introduce spurious linearity. There are methods for multivariate testing: modified sequential simplex optimization, factorial designs, and yes, PCA.

For instance, a problem with many variables N, taken as dimensions, might look nonlinear. PCA creates a new set of variables that are linear combinations of the original variables and linearly independent of each other.

The odds are that when a tester deals with many variables he does not know a priori whether the original data is truly linear or nonlinear, or if there are subspaces wherein it is one way or the other. Fortunately, no assumptions need to be made in PCA about the distribution of the original data.

The goal is to represent the problem using fewer dimensions M such that M < N. This is why PCA is considered a dimensionality reduction technique. The set of M dimensions is obtained by finding the principal components (PCs). The PCs give the direction of greater variability and provide evidence for linearity in a particular direction of a reduced space. So, one can find linearity within apparently nonlinear data.

There are many problems (linear and nonlinear) in which PCA is applied. And there are many advances and new research in the PCA and SVD area to make PCA robust. After the upcoming tutorial on correlation coefficients, I might have to put out new tutorials on the topic to dispel so much misinformation running around.”

The same misinformation tactic was used by Hendrickson regarding what I said about Pearson and Spearman correlation coefficients. Here he seems to use bidirectional extrapolation, which many politicians have mastered when debating. [“If A to B is true, then B to A is true.” “If A to B is false, then B to A is false.” “Let the opponents disprove it.”]

We know that bidirectional extrapolation in Statistics and Science is incorrect. For instance, we know that:

If variables are independent, there is no correlation (r = 0). The reverse is not necessarily true.

Pearson’s can be applied to linear paired data. The reverse is not necessarily true. (*)

Nonlinear paired data requires Spearman’s. The reverse is not necessarily true. (*)

*The asterisk is for this: If a linear transformation of variables is possible, both Pearson’s and Spearman’s can be used.

All my comments about Pearson’s and Spearman’s are available at the original blog post for anyone to read.

Some False Claims that Evaporate Under Examination

We have shown in A Tutorial on Correlation Coefficients why the “science” approach and “studies” of SEOMOZ, Hendrickson, and Fishkin are mathematically and statistically incorrect.

First, correlation coefficients are not additive. For centered data, the geometrical equivalent of an r value is a cosine. Adding, subtracting, or averaging cosines does not give a new cosine. For non-centered data there is still a cosine term in the formula for r.
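The cosine equivalence for centered data can be checked with a short sketch. The paired data below is hypothetical and the function names are mine; Pearson's r is computed from the covariance and standard deviations, and the cosine from the dot product of the mean-centered vectors.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson's r as covariance over the product of standard deviations."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


def cosine(us, vs):
    """Cosine of the angle between two vectors."""
    dot = sum(u * v for u, v in zip(us, vs))
    norm_u = math.sqrt(sum(u * u for u in us))
    norm_v = math.sqrt(sum(v * v for v in vs))
    return dot / (norm_u * norm_v)


# Hypothetical paired observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Center each variable on its mean before taking the cosine.
centered_x = [x - statistics.mean(xs) for x in xs]
centered_y = [y - statistics.mean(ys) for y in ys]

print("Pearson r:                  ", pearson_r(xs, ys))
print("cosine of centered vectors: ", cosine(centered_x, centered_y))
```

Both numbers agree up to floating-point rounding, which is why averaging r values amounts to averaging cosines.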

Second, a mean from x observations is an unbiased estimator while a mean correlation coefficient is a biased estimator. Treating an average r value as a mean x is the same as saying that a mean r is an unbiased estimator. Statistically speaking, this is a gross error. This is easy to prove.

Third, professional and academic statisticians know that unlike mean x values, r values have an inherent bias; i.e., their distribution is bivariate and inherently skewed. Thus, the computed mean r value is also a biased estimator. All this affects hypothesis testing. A researcher cannot compute mean values and associated standard errors arbitrarily with one formula just because he likes to pick one in particular.

I will give now an example described in the Appendix section of our tutorial.

Let r1= 0.877 and r2 = 0.577. One might think that the average over these is 0.727 or about 0.73 with an associated Coefficient of Determination of 0.73*0.73 = 0.53 or 53%. Assuming this is the same as implying that the mean is an unbiased estimator, not skewed toward r1 or r2.

However, converting r1 and r2 to Z scores and averaging gives an average Z score of 1.010. Converting this result back to r gives a mean value of 0.766 or about 0.77 with an associated Coefficient of Determination of 0.77*0.77 = 0.59 or 59%. Note that now the mean value is skewed toward r1, confirming that the mean value is a biased estimator. Statisticians know the inherent bias is a property of r values and their mean values.
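The numbers in this example can be reproduced with a few lines; the variable names are mine.

```python
import math

r1, r2 = 0.877, 0.577

# Naive arithmetic average of the raw r values.
arithmetic_mean = (r1 + r2) / 2

# Average in Fisher Z space, then convert back to r.
z_mean = (math.atanh(r1) + math.atanh(r2)) / 2
fisher_mean = math.tanh(z_mean)

print(f"arithmetic mean of r: {arithmetic_mean:.3f}")
print(f"mean Z score:         {z_mean:.3f}")
print(f"mean r via Fisher Z:  {fisher_mean:.3f}")
```

The Fisher-based mean lands closer to r1 than the arithmetic mean does, which is the skew described above.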

If SEOMOZ wants to get into the misinformation business and sell snakeoil, sure they are entitled to that. But that ruins any little credibility they are getting these days with their “science knowledge”. Some of their cheerleaders, groupies, or easy-to-impress visitors might elect to stick to them. That’s fine as they deserve what they are getting. So far, Statistics is a loss for them, anyway. All I can say is this: Those that listen to fools become one.

I’m not sure if Hendrickson misled Fishkin or if it was the other way around. In real companies, employees end up fired for less than that. Since Fishkin cannot fire himself, his burden of proof is now double. And for the fourth time: I am still demanding a retraction.

PS. I updated this post to refine some lines.
