Here is a nice article about the risks of misusing big data
Here are my comments on the topic:
1. Most traditional tests of statistical significance were meant to be used with small data sets, not with big data, unless stratification of the big data is possible. In general, with a large enough data set, a t-test on a very small correlation coefficient can be forced to become statistically significant, and therefore misleading. This effect is an artifact of the equations involved.
2. Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid. This is one of the first things a peer reviewer of statistical articles will look for. Methods and techniques for transforming data to become normally distributed prior to any analysis do exist, although some data sets might not be transformable and forcing them to adopt normality can be contraindicated.
3. Avoid arithmetically adding and averaging correlation coefficients, standard deviations, slopes, cosine similarity measures, and dissimilar ratios in general. They are simply not additive, regardless of what some outdated meta-analysis articles say or what you hear from groupie search marketers (“Why regurgitate in blogs what you don’t understand?”).
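The effect described in point 1 is easy to reproduce: the t statistic for testing a correlation grows with the square root of n, so a negligible r becomes "significant" once n is large enough. A quick sketch (the function name is mine; the formula is the standard t-test for a correlation coefficient):

```python
import math

def correlation_t_stat(r, n):
    """t statistic for testing H0: rho = 0, given sample correlation r and size n."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# A negligible correlation with a small sample: nowhere near significance.
t_small = correlation_t_stat(0.02, 100)      # ~0.20

# The same negligible correlation with a huge sample: wildly "significant".
t_big = correlation_t_stat(0.02, 1_000_000)  # ~20.0

print(t_small, t_big)
```

The correlation is identical in both cases; only the sample size changed, which is exactly why significance alone is misleading with big data.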
Our popular tool, The Web Crawler, is back! This new iteration of the tool is much faster because it is based on a different strategy: it first extracts sets of HREFs and then refines these into URLs that qualify for status checks. As a result, the tool also works as a link checker.
Another advantage of the above strategy is this:
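A minimal sketch of the extract-then-refine idea (hypothetical names and a made-up page; not the tool's actual code): pass one collects raw HREF values, pass two resolves them against the base URL and keeps only the http(s) URLs that qualify for a status check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class HrefExtractor(HTMLParser):
    """Pass 1: collect every raw HREF value found on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def refine(hrefs, base_url):
    """Pass 2: resolve relative links and keep only http(s) URLs
    that qualify for a status check (drops mailto:, javascript:, bare fragments)."""
    urls = set()
    for href in hrefs:
        path = href.split("#")[0]          # strip fragments
        if not path:
            continue                       # href was only a fragment, e.g. "#top"
        absolute = urljoin(base_url, path)
        if urlparse(absolute).scheme in ("http", "https"):
            urls.add(absolute)
    return sorted(urls)

html = '<a href="/about">About</a> <a href="mailto:x@y.z">Mail</a> <a href="#top">Top</a>'
parser = HrefExtractor()
parser.feed(html)
print(refine(parser.hrefs, "http://example.com/"))  # → ['http://example.com/about']
```

The refined set is typically much smaller than the raw HREF set, which is one reason the two-pass strategy is fast: status checks, the expensive step, run only against qualified URLs.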
We have just published this short article, based on The Color Miner tool:
Fractalettes: A Fractal Design Strategy for Color Mining and Learning through Discovery - Based on fractal geometry, fractalettes are color palettes within color palettes, where each cell contains color space information and relationships. These architectures engage end users in data mining, critical thinking, and learning through discovery.
Indeed, the AZZOO measure outperforms all conventional measures in iris biometrics and handwritten character recognition applications.
At least that’s what is claimed.
The hilarious picture above shows how some SEOs look when playing at being scientists. This often occurs when they interpret big data.
A few specific scenarios:
1. Applying the statistical theory of small samples to extremely large samples, like …
2. …using large amounts of data to force very small correlation coefficients to become statistically significant.
3. Trying to arithmetically average ratios (like correlation coefficients, standard deviations, slopes, and cosine similarities).
4. Mistaking Cauchy Distributions for Normal Distributions.
5. Adding together intensive properties.
Fortunately, I know good folks who are doing a great job of educating their search marketing peers (Mike Grehan, Bruce Clay, Danny Sullivan, etc.) without playing at being scientists.
Correlation coefficients, coefficients of variation, standard deviations, slopes, tangents, cosines, densities, temperatures, dissimilar ratios, and intensive properties in general are not additive. Therefore, arithmetic averages cannot be computed from any of these.
Still, from time to time some “experts” and pseudo-“scientists” do just that.
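For the special case of correlation coefficients, the classical textbook workaround (itself debated in the meta-analysis literature criticized here, and not the model discussed in this post) is to average in Fisher z space rather than summing the r values directly. A minimal sketch with made-up values:

```python
import math

def fisher_z(r):
    """Fisher's z transformation: z = atanh(r)."""
    return math.atanh(r)

def inverse_fisher_z(z):
    """Back-transform: r = tanh(z)."""
    return math.tanh(z)

def average_correlations(rs):
    """Average correlations in z space instead of adding the r's directly."""
    mean_z = sum(fisher_z(r) for r in rs) / len(rs)
    return inverse_fisher_z(mean_z)

rs = [0.10, 0.95]
naive = sum(rs) / len(rs)           # 0.525 -- the naive arithmetic average
proper = average_correlations(rs)   # ~0.747 -- averaged via Fisher z
print(naive, proper)
```

The two results differ substantially, which already shows that treating correlation coefficients as additive quantities is not harmless.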
Want to know why this is not mathematically or statistically possible? That is the subject of a paper I wrote, which is about to be published in Communications in Statistics – Theory and Methods (Taylor & Francis).
Incidentally, I will give the search marketing community a preview of the topic. Thanks to my dear friend Mike Grehan, this will be the topic I’ll be speaking about at SES New York in March 2012.
This morning I received confirmation from the editors of Communications in Statistics: Theory and Methods that they have accepted my peer-reviewed paper on a new model for statistical analysis and will be publishing it. It should be out in 2012.
Once it is published, you will understand the SEO (* SEOmoz, I should say) nonsense of computing arithmetic averages of correlation coefficients, and why some meta-analysis studies published in the past (* Hunter-Schmidt; Hedges-Olkin) are flawed and invalid.
It took me several meals and research hours to figure it out. I hope that IR researchers, data miners, and statistics colleagues find new applications for the model.
The model can be applied to many fields, including marketing, business, risk analysis, data mining, signal processing, engineering, clinical trials, and almost any field or knowledge domain that involves the calculation of weighted statistics. I look forward to discussing it online once it gets published.
Happy New Year.
PS. (*) I’ve edited this post to make these points obvious. So, the issue of arithmetically averaging correlations has been raised, and settled for good, before the scientific and statistical community.
PS. Just in: Last night (Jan-03-2012) I received news from one of the editors of the journal that the paper was assigned to issue 41 (8). Look for its title: The Self-Weighting Model (in Spanish, something like “El Modelo de Autoponderación”). I forgot to mention that this journal is published biweekly, so things are moving fast. What a way to end 2011 and start 2012!
The current issue of IRW should reach subscribers’ inboxes during the day.
This is Part Two of the series on the statistical analysis of n-grams, a text mining technique widely used in information retrieval and data mining in general. In this issue we cover the implementation of association measures derived from contingency tables.
The Q&A section explains how to conduct a Chi-Square Test for tables with many items; i.e., beyond the usual 2 x 2 contingency tables.
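The general recipe for a chi-square test of independence on an r x c table is the same as for 2 x 2: compute expected counts from the marginal totals, then sum (O − E)²/E over all cells, with (r − 1)(c − 1) degrees of freedom. A sketch with made-up counts (not an example from the article):

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# A 2 x 3 table, e.g. occurrences of a term across three document classes (made-up counts).
table = [[30, 20, 10],
         [10, 20, 30]]
chi2, dof = chi_square(table)
print(round(chi2, 2), dof)  # → 20.0 2
```

The statistic is then compared against the chi-square distribution with the computed degrees of freedom to obtain a p-value.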
Sooner or later, those conducting data mining studies will need to compute standard errors for several statistics.
Every statistic from a sampling distribution has a standard error that is specific to that statistic. Using the incorrect definition of a standard error invalidates a research study.
A tutorial on standard errors is now available from miislita.com.
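To illustrate the point that each statistic has its own standard error, here are a few textbook large-sample formulas, sketched with hypothetical helper names (consult the tutorial for exact definitions and small-sample corrections):

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean, given sample standard deviation s."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

def se_correlation(r, n):
    """Standard error of a sample correlation, as used in the t-test for r."""
    return math.sqrt((1 - r * r) / (n - 2))

print(se_mean(10, 100))          # 1.0
print(se_proportion(0.5, 100))   # 0.05
```

Note how the three formulas share nothing but a square root of the sample size: plugging a mean's standard error into a test about a proportion or a correlation gives meaningless results.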