IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: June 2010

A Tutorial on Standard Errors

25 Friday Jun 2010

Posted by egarcia in Data Mining, IR Tutorials, SEO Myths, Web Mining Course

≈ 3 Comments

Sooner or later, those conducting data mining studies will need to compute standard errors for several statistics.

Every statistic computed from a sample distribution has a standard error that is specific to that statistic. Using the wrong definition of a standard error invalidates a research study.

A tutorial on standard errors is now available from miislita.com.
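To illustrate why the definition matters, here is a minimal sketch that computes two different standard errors. It assumes the textbook formulas s/sqrt(n) for the sample mean and sqrt((1 - r^2)/(n - 2)) for Pearson's r (the form used in the usual t-test). The data and the value r = 0.5 are hypothetical, chosen only for illustration, and are not taken from the tutorial.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Standard error of the sample mean: s / sqrt(n)."""
    n = len(values)
    return statistics.stdev(values) / math.sqrt(n)


def standard_error_of_pearson_r(r, n):
    """Standard error of Pearson's r as used in the t-test: sqrt((1 - r^2) / (n - 2))."""
    return math.sqrt((1.0 - r * r) / (n - 2))


# Hypothetical sample, for illustration only.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

se_mean = standard_error_of_mean(sample)
se_r = standard_error_of_pearson_r(0.5, len(sample))  # r = 0.5 is an assumed value

print(f"SE of the mean: {se_mean:.4f}")
print(f"SE of r:        {se_r:.4f}")
```

The two functions take different inputs and return different quantities, which is the point: plugging the formula for one statistic into an analysis of another gives a meaningless error estimate.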

IRW-2010-June: Uncorrelation & Correlation: The Expectation Values Way

18 Friday Jun 2010

Posted by egarcia in IR Tutorials, Newsletters, SEO Myths

≈ 3 Comments

The current issue of the IRW Newsletter is out and should reach subscribers' inboxes during the day. The featured column is about Uncorrelation and Correlation: The Expectation Values Way.

“Soon or later, those conducting data mining studies will need to deal with uncorrelated and correlated variables. In this issue of the newsletter we use expectation values to differentiate between these two terms. “

The article debunks the common myth that a vanishing correlation coefficient (r = 0) implies that the variables have to be random or independent, or that they are not related at all. Easy-to-follow numerical and graphical examples are used to illustrate this point as well as the difference between uncorrelated and correlated variables.
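A small sketch of the point, using data of my own choosing rather than the newsletter's examples: y is completely determined by x through y = x^2, yet Pearson's r is zero because the relationship is not linear and the x values are symmetric about zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from sums of centered products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data: y depends on x exactly, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

So r = 0 only says there is no linear association; it says nothing about independence.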

The Q&A section explains the difference between exponential and power laws and how these can be converted into a linear form. Once linearized, all kinds of statistics can be computed (correlation coefficient errors, standard deviations, etc.).
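As a rough sketch of the linearization idea (not the newsletter's own worked example): an exponential law y = a*exp(b*x) becomes linear as ln y = ln a + b*x, and a power law y = a*x^b becomes linear as ln y = ln a + b*ln x. The parameter values below are hypothetical, and `linear_fit` is a plain least-squares helper written for this illustration.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Exponential law with assumed a = 2.0, b = 0.5: regress ln(y) on x.
exp_ys = [2.0 * math.exp(0.5 * x) for x in xs]
b_exp, ln_a_exp = linear_fit(xs, [math.log(y) for y in exp_ys])

# Power law with assumed a = 3.0, b = 1.5: regress ln(y) on ln(x).
pow_ys = [3.0 * x ** 1.5 for x in xs]
b_pow, ln_a_pow = linear_fit([math.log(x) for x in xs],
                             [math.log(y) for y in pow_ys])

print(f"exponential: a = {math.exp(ln_a_exp):.3f}, b = {b_exp:.3f}")
print(f"power law:   a = {math.exp(ln_a_pow):.3f}, b = {b_pow:.3f}")
```

The only difference between the two cases is whether x is left as is or replaced by ln x before fitting the straight line.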

I’m thinking of putting a tutorial online so others can get the many myths promoted by *certain* SEOs out of their heads.

Enjoy it.

On SEO Quackery

16 Wednesday Jun 2010

Posted by egarcia in SEO Myths

≈ 1 Comment

This month's issue of IRW will be out soon. Because I was focused on it, I was not able to follow the discussion on SEO quackery.

I am now publishing a full response to Rand and Ben, demanding a public retraction from both for putting out SEO quackery. SEO quack “science” is worse than mere SEO myths because it involves using scientific concepts in a convoluted way. I guess I will need to open a new category with that name.

You can see my responses at the following links:

http://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/

http://irthoughts.wordpress.com/2010/06/10/on-spearmans-correlation-coefficients-with-excel/

On Spearman’s Correlation Coefficients with Excel

10 Thursday Jun 2010

Posted by egarcia in SEO Myths

≈ 9 Comments

I know I mentioned this before, but I keep getting the same question: how do you compute Spearman's correlation coefficient (rho) with Excel for a data set consisting of a relatively small number of paired values (n)?

Unlike Pearson’s correlation coefficient (r), Spearman’s does not have a built-in function in Excel. You need to work around this.

Once you have the paired data in a rank format, there are several ways to proceed. Here are at least two.

1. Compute the rank difference for each data point and square it. Next, compute the sum of squared differences (SSD) and program into Excel the formula rho = 1 – (6*SSD/(n*(n*n – 1))). This formula assumes there are no tied ranks.

2. You can also simply construct a scatterplot of the ranked data and fit the data to a linear regression line. I do this in Windows 7 by selecting the scatterplot and choosing Layout 9 from the Design tab of Excel. This gives you a visual overlay of how the ranks fluctuate around a straight regression line. It also gives you a Coefficient of Determination, R², the square root of which is Pearson’s r (with the sign of the slope), but because the data is ranked, this Pearson’s r is also Spearman’s rho. So the assertion that nonlinear data cannot be treated with Pearson’s correlation is not entirely correct. An example is given below.

The data was taken from http://www.mnstate.edu/wasson/ed602calccorr.htm
 

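| Person   | Judge A Rank | Judge B Rank | D  | D² |
|----------|--------------|--------------|----|----|
| Alan     | 1            | 2            | -1 | 1  |
| Beth     | 2            | 1            | 1  | 1  |
| Carl     | 3            | 4            | -1 | 1  |
| Don      | 4            | 3            | 1  | 1  |
| Edgar    | 5            | 6            | -1 | 1  |
| Frances  | 6            | 7            | -1 | 1  |
| Gertrude | 7            | 5            | 2  | 4  |
| Sum      |              |              |    | 10 |

Spearman's rho = 0.8214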

[Figure: Spearman or Pearson]

Using the second method, Excel gives the regression equation
y = 0.8214x + 0.7143, with R² = 0.6747

Hence, the square root of 0.6747... is 0.8214...

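Note that, for ranked data without ties, Spearman’s rho can be read from either the slope or the Coefficient of Determination: both rank variables have the same variance, so the regression slope equals r, and the square root of R² gives its magnitude.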
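For readers who want to check the numbers outside Excel, here is a minimal sketch that applies both methods to the judges' ranks from the table. It is plain Python rather than an Excel macro, and the variable names are my own.

```python
import math

# Ranks taken from the table above (Alan through Gertrude).
judge_a = [1, 2, 3, 4, 5, 6, 7]
judge_b = [2, 1, 4, 3, 6, 7, 5]
n = len(judge_a)

# Method 1: sum of squared rank differences (assumes no tied ranks).
ssd = sum((a - b) ** 2 for a, b in zip(judge_a, judge_b))
rho_formula = 1 - 6 * ssd / (n * (n * n - 1))

# Method 2: least-squares regression of Judge B ranks on Judge A ranks.
mean_a = sum(judge_a) / n
mean_b = sum(judge_b) / n
saa = sum((a - mean_a) ** 2 for a in judge_a)
sbb = sum((b - mean_b) ** 2 for b in judge_b)
sab = sum((a - mean_a) * (b - mean_b) for a, b in zip(judge_a, judge_b))

slope = sab / saa
intercept = mean_b - slope * mean_a
r_squared = sab ** 2 / (saa * sbb)

# The square root of R^2 loses the sign, so take it from the slope.
rho_regression = math.copysign(math.sqrt(r_squared), slope)

print(f"SSD = {ssd}, rho (formula) = {rho_formula:.4f}")
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r_squared:.4f}")
print(f"rho (regression) = {rho_regression:.4f}")
```

Both routes give the same value here because the ranks contain no ties; with ties, the SSD formula is only approximate and Pearson's r on the ranks is the safer choice.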

For a large number of data points, you are better off writing an Excel macro. For a really huge data set, you need an industrial-strength software solution.

Note from the above results that Spearman’s rho (equivalent to Pearson’s r for ranked data) is about 0.82. The Coefficient of Determination indicates that about 67% of the variation can be explained by the regression model, so about 33% of the variation cannot. The smaller the correlation coefficient, the more scattered the data points are likely to be on the graph. Without considering scatterplots, t-test significance analysis, and slope analyses, it is easy to misinterpret correlation coefficients.
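As a sketch of the t-test significance check mentioned above, the usual statistic for a correlation coefficient is t = r*sqrt((n - 2)/(1 - r^2)) with n - 2 degrees of freedom. The critical value 2.571 (two-tailed, alpha = 0.05, 5 degrees of freedom) is taken from standard t tables. Keep in mind that for a Spearman coefficient with only 7 pairs this t approximation is rough, and exact critical-value tables for rho are preferable.

```python
import math


def t_statistic_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


r = 23 / 28   # rho from the judges example, about 0.8214
n = 7         # number of paired ranks

t = t_statistic_for_r(r, n)

# Two-tailed critical value for alpha = 0.05 and df = 5, from standard t tables.
T_CRITICAL = 2.571
significant = abs(t) > T_CRITICAL

print(f"t = {t:.3f}, critical = {T_CRITICAL}, significant: {significant}")
```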

On a side note, trying to compare small correlation coefficients (Spearman’s or Pearson’s) frequently leads to flawed or useless conclusions. I don’t understand why some search engine marketers still insist on doing that (http://www.seomoz.org/blog/google-vs-bing-correlation-analysis-of-ranking-elements ). It would be interesting to see what their scatterplots look like. That’s how many SEO myths start: with flawed methodologies and reasoning.

As we may search…Not with Google Scholar

09 Wednesday Jun 2010

Posted by egarcia in Data Mining, IR Tools

≈ 11 Comments

Five years ago, in the article “As we may search – Comparison of major features of the Web of Science, Scopus, and Google Scholar citation-based and citation-enhanced databases”, Peter Jacso from the Department of Information and Computer Science, University of Hawaii, compared several indexes, particularly Google Scholar (G-S). The article found G-S to have too many flaws. He concluded:

“Unfortunately, G-S gives a bad name to autonomous citation indexing. It shows lack of competence, and understanding of basic issues of citation indexing. G-S fails even in implementing the most basic Boolean OR operation correctly. Riding on the waves of the regular Google software which is great for processing the unstructured heap of billions of Web pages, G-S cannot handle even the meticulously tagged, metadata-enriched few million journal articles graciously offered to it by many publishers for free.”

In my opinion, not much has changed since then. That’s why I use G-S as my last option. I prefer public scholarly search resources for getting, at no cost, articles that are not found in G-S or that I would otherwise need to pay for through journal indexes. Some of these free resources are designed to retrieve a PDF version of the manuscripts sought. More than one university and private company prefers to make its own open source index available to anyone while blocking Googlebot.

Independence, Orthogonality, and Uncorrelation

03 Thursday Jun 2010

Posted by egarcia in Newsletters

≈ 1 Comment

I made a typo in the last issue of the newsletter, which was corrected right away, but many subscribers still got the uncorrected version. Instead of writing C(X,Y) = 0 for uncorrelation, I wrote P(X,Y) = 0. This is a nontrivial error, as C(X,Y) = 0 means that the covariance between the X and Y variables is zero (uncorrelation), while P(X,Y) = 0 is often used in reference to disjoint (mutually exclusive) events. Ah, the power of a single-letter typo… That kind of thing happens when you are pressed by so many deadlines. My apologies for that. I’ll repeat this clarification in the next issue of IRW since it covers the second part of the subject of correlation. The purpose of this post is to give you a sneak preview of what to expect from the June issue.

Variables or events are said to be independent, correlated, uncorrelated, orthogonal, or unrelated depending on the type of scenario.

Independence implies that P(X,Y) = P(X)*P(Y), where P is probability; i.e., the probability that X and Y both occur is equal to the product of their individual probabilities.
Correlation is a measure of linear association. Variables can be correlated without having any causal relationship, or can have a causal relationship and yet be uncorrelated.

Uncorrelation implies that C(X,Y) = 0; i.e., the covariance between the variables is zero. This implies that their correlation coefficient, r, is also zero, since r = C(X,Y)/(S(X)*S(Y)), where S stands for standard deviation. There is a generalization of this using expectation values; i.e., X and Y are said to be uncorrelated if and only if E(XY) = E(X)*E(Y), where E(XY) is the expectation value of the product XY.
Describing variables as being either orthogonal or uncorrelated is easier to understand in terms of vectors.

If vectors representing raw variables are perpendicular (dot product is zero) then the variables are orthogonal.

Now, if the variables are centered and represented as vectors, and their dot product is still zero, they are uncorrelated.

In other words, orthogonal denotes that the raw (non-centered) variables are perpendicular, while uncorrelated means that the centered variables are perpendicular. The covariance is proportional to the dot product of the centered vectors, so a zero dot product gives C(X,Y) = 0.
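The distinction can be checked numerically. The sketch below uses two small vector pairs of my own choosing (they are not from the newsletter or from the Rodgers article): one pair is orthogonal but correlated, the other is uncorrelated but not orthogonal. The covariance here is the population form, i.e. the dot product of the centered vectors divided by n.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean(v):
    return sum(v) / len(v)


def centered(v):
    """Subtract the mean from every component."""
    m = mean(v)
    return [a - m for a in v]


def covariance(u, v):
    """Population covariance: dot product of centered vectors divided by n."""
    return dot(centered(u), centered(v)) / len(u)


# Orthogonal (raw dot product is zero) but correlated.
x1, y1 = [1, 0], [0, 1]

# Uncorrelated (centered dot product is zero) but not orthogonal.
x2, y2 = [1, 2, 3], [3, 0, 3]

for x, y in ((x1, y1), (x2, y2)):
    # Expectation-values check: C(X,Y) = E(XY) - E(X)*E(Y).
    e_xy = mean([a * b for a, b in zip(x, y)])
    print(f"raw dot = {dot(x, y)}, C(X,Y) = {covariance(x, y)}, "
          f"E(XY) - E(X)E(Y) = {e_xy - mean(x) * mean(y)}")
```

In both cases the last two printed values agree, which is the expectation-values statement that C(X,Y) = E(XY) - E(X)*E(Y).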
There is an old but still relevant article, Linearly Independent, Orthogonal, and Uncorrelated Variables, written by Rodgers et al., which explains all this.

IRW:May, 2010 – Correlation Coefficients in IR

01 Tuesday Jun 2010

Posted by egarcia in Data Mining, Newsletters, SEO Myths

The May issue of IRW is out, a bit delayed since I was pressed to meet other deadlines. The abstract follows:

“Soon or later, those conducting information retrieval or data mining studies will need to use correlation coefficients to assess how variables from data sets are correlated. In a recent blog post we warned readers about search engine marketing statistical studies that claim to find correlations between PageRank and conveniently selected variables, or that try to compare correlation coefficients derived from different Web tools or datasets
(http://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/ ).

Unfortunately, correlation coefficients are frequently misused and abused. What drives an analyst to make incorrect inferences about correlation? In general, most correlation coefficient myths derive from not realizing that independence, uncorrelation, and unrelatedness are not equivalent terms. In this issue of the newsletter we list 21 of the most common myths about correlation coefficients that can put into question the credibility of a statistical study and its proponents. ”
