IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: October 2010

IRW-10-2010:Inverted Index Architectures Part Three

25 Monday Oct 2010

Posted by egarcia in Machine Learning, Newsletters, Programming

Subscribe to IRW

The current issue of IRW should arrive in subscribers’ inboxes today. It is full of meaty stuff. The featured article is Part Three of the series on inverted index architectures. We cover positional inverted indexes. A simple example shows how these indexes process advanced searches (AND, NEAR, and EXACT) in order to retrieve documents.
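For readers who have not seen a positional inverted index, here is a minimal Python sketch of the idea. It is not the example from the newsletter: the toy documents, the function names, and the unordered interpretation of NEAR are all assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, a, b):
    """Documents containing both terms, in any position."""
    return sorted(set(index.get(a, {})) & set(index.get(b, {})))

def search_near(index, a, b, k):
    """Documents where a and b occur within k positions of each other (either order)."""
    hits = []
    for doc_id in search_and(index, a, b):
        pos_a, pos_b = index[a][doc_id], index[b][doc_id]
        if any(abs(i - j) <= k for i in pos_a for j in pos_b):
            hits.append(doc_id)
    return hits

def search_exact(index, a, b):
    """Documents containing the two-word phrase 'a b' (b immediately follows a)."""
    hits = []
    for doc_id in search_and(index, a, b):
        positions_b = set(index[b][doc_id])
        if any(i + 1 in positions_b for i in index[a][doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical toy collection, not taken from the newsletter.
docs = {
    1: "inverted index architectures",
    2: "index of all inverted files",
    3: "an inverted positional index",
}
index = build_positional_index(docs)
print(search_and(index, "inverted", "index"))      # [1, 2, 3]
print(search_near(index, "inverted", "index", 2))  # [1, 3]
print(search_exact(index, "inverted", "index"))    # [1]
```

The three queries narrow the answer set step by step: AND only needs the document lists, while NEAR and EXACT also need the stored positions.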

The QA column covers hypothesis testing with correlation coefficients at a given sample size and confidence level.

Enjoy it!

On Power Analysis and SEO Quack Science

21 Thursday Oct 2010

Posted by egarcia in Data Mining, IR Tools, Quack Science, SEO Myths, Statistics and Mathematics

One of the trickiest aspects of publishing statistical studies is the sample size to be used. Not stipulating a valid procedure for estimating a proper sample size can hurt, for instance, a grant proposal. Ethical committees are concerned about the right number of observations in a study, asking submitters to justify on statistical grounds how they arrived at a given sample size. Research projects with too few or too many observations or no sample size methodology at all often get rejected. This is something those conducting SEO quack “science” don’t seem to understand or are not aware of.

Samples that are too small are unethical because the researcher cannot be specific enough about the size of, for example, the effect of a drug in a population. Samples that are too large are also unethical because they represent a waste of funding. It is true that a large sample improves precision, but it might involve an unjustified cost. Stratification is preferred, but it gets too complicated with huge sample sizes, not to mention that statistical significance does not necessarily scale between samples.

As Rahul Dodhia from RavenAnalytics (http://ravenanalytics.com/Articles/Sample_Size_Calculations.htm ) indicates, a sample of 2,000 might not be very different from a sample of 20,000, but a sample of 200 may be very different from a sample of 2,000, even though in each case the sample ratio is 10. So a large sample is not always justified, even if such a sample size improves statistical significance and precision.
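One way to see why the same tenfold ratio matters less at larger sizes is to look at how a margin of error shrinks with n. The sketch below is only an illustration and is not Dodhia's calculation: it assumes a simple random sample, a proportion of 0.5, and the normal approximation with z = 1.96.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n observations."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (200, 2000, 20000):
    print(n, round(margin_of_error(n), 4))
# 200 0.0693
# 2000 0.0219
# 20000 0.0069
```

Going from 200 to 2,000 observations cuts the margin by almost five percentage points, while going from 2,000 to 20,000 cuts it by only about one and a half.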

Consider the case of search engine ranking results. For a given query, search engines are capable of finding many results, frequently in the range of thousands or millions of results per query. Still, search engines and retrieval systems show users a limited answer set. For instance, Google limits its viewable answer set to a maximum of 1,000 results (100 pages, 10 results/page).

As in most retrieval systems, relevant results accumulate in the first few result pages, forming clusters. This is in agreement with van Rijsbergen’s Cluster Hypothesis, which states that documents that cluster together have a similar relevance to a given query. Moving down the list of search results, one often finds cluster transitions wherein the quality and aboutness of documents is polluted with off-topic content.

Documents buried in a list of results often contain content irrelevant to the initial query or full of spam techniques. If one wants to conduct a statistical study of ranking results versus a particular document feature, one can do better by considering a sample from the first few result pages than from the entire answer set of 1,000 results.

In general, in a non-search engine scenario one cannot just arbitrarily select large samples to “force” the statistical significance of very low correlation coefficients and then use those values to draw conclusions. Furthermore, what is the selection criterion for using 1,000 or 10,000 results?

Simply stated: If 10,000 observations are arbitrarily selected, why not use 100,000 or 1,000,000 instead? We already know that very small correlation coefficients between any arbitrary pair of random variables will be significant at those huge sample levels, anyway. And?

As noted in a Wikipedia entry, “given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero” (http://en.wikipedia.org/wiki/Effect_size ).

For example, a correlation coefficient of r = 0.04 would be significant at a 95% confidence level if coming from a 10,000-sample (t-calc = 4.003 >> t-table = 1.96) while a correlation coefficient of r = 0.01 would be significant at a 95% confidence level if coming from a 100,000-sample (t-calc = 3.162 >> t-table = 1.96). And? This proves nothing, especially when the magnitude of a “signal” approaches the magnitude of its “noise”.

As noted in the above Wikipedia entry, a correlation coefficient of 0.1 is strongly statistically significant when the sample size is 1,000 (t-calc = 3.175 >> t-table = 1.96), but reporting only the small p-value from this analysis could be misleading if a correlation of 0.1 is too small to be of interest in a particular application (http://en.wikipedia.org/wiki/Effect_size ).
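The three t-calc values quoted above can be checked with a few lines of Python. This is a minimal sketch using the t expression for a correlation coefficient, r*SQRT[(n – 2)/(1 – r^2)]; the function name `t_from_r` is made up for this example.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, given a sample correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1 - r * r))

for r, n in [(0.04, 10_000), (0.01, 100_000), (0.10, 1_000)]:
    t = t_from_r(r, n)
    # r^2 is the fraction of variation explained; it stays tiny even when t is large.
    print(f"r = {r}, n = {n}: t-calc = {t:.3f}, r^2 = {r * r:.4f}")
# r = 0.04, n = 10000: t-calc = 4.003, r^2 = 0.0016
# r = 0.01, n = 100000: t-calc = 3.162, r^2 = 0.0001
# r = 0.1, n = 1000: t-calc = 3.175, r^2 = 0.0100
```

All three clear the 1.96 threshold, yet none explains more than 1% of the variation.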

Statistical significance of extremely small r values is not surprising, as it is just a mathematical consequence of the fact that a t-value is a function (F) of a weighted ratio: the ratio of explained-to-unexplained variations weighted by the number of degrees of freedom:

$$F(r, n) = t = \sqrt{\frac{r^2}{1 - r^2}\,(n - 2)}$$

$$F(r, n) = t = r\,\sqrt{\frac{n - 2}{1 - r^2}}$$

For a given r value, increasing n increases t. No surprise here. What a math equation tells you is one thing; what the nature and obvious boundaries of a physical system tell you is quite another.
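Turning the equation around makes the point concrete: for any nonzero r there is a sample size beyond which it becomes "significant". Solving t ≥ t-table for n gives n ≥ 2 + t-table^2 (1 – r^2)/r^2. The sketch below assumes the large-sample critical value 1.96 used in the examples above (for small n the critical value is larger), and the function names are hypothetical.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, given r and n."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def min_n_for_significance(r, t_crit=1.96):
    """Smallest n at which r reaches t_crit, assuming t_crit does not depend on n."""
    return math.ceil(2 + t_crit ** 2 * (1 - r * r) / (r * r))

for r in (0.10, 0.04, 0.01):
    print(r, min_n_for_significance(r))
# 0.1 383
# 0.04 2400
# 0.01 38415
```

In other words, under this approximation r = 0.04 is "significant" with about 2,400 observations, long before one reaches 10,000.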

At trivially low r values, any claim regarding the statistical significance or strength of some results proves nothing, and one cannot do much with such trivial r values. For instance, for r = 0.04, r^2 = 0.0016, meaning that 1 – r^2 = 0.9984, or 99.84% of the variations in the dependent variable (y) are not explained by variations in the independent variable (x).

In such a scenario, assessing the effect of x on y is a futile exercise. Such a model would be useless for drawing conclusions or predicting anything. And here is the point that many SEOs at SEOMOZ (http://www.seo.co.uk/seo-news/seo-tools/the-seomoz-lda-tool-%E2%80%93-our-disappointing-findings.html , Fishkin, Hendrickson, and others elsewhere) don’t seem to grasp:

When a correlation coefficient is useless for all practical purposes.

If the raw data constantly changes, that’s another “Chaos Layer” that compounds the problem.

Enter Cohen’s Power

According to Cohen’s work, when conducting a sample size study of correlation coefficients, one needs to consider the required confidence level and power of the test, the desired probability for Type I and Type II Errors, and the hypothesized or anticipated correlation coefficient (http://www.medcalc.be/manual/correlation_coefficient.php ). One cannot just use an arbitrary sample size for testing things.

In general, given any three of the following, the fourth one can be determined (http://www.statmethods.net/stats/power.html ):

1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 – P(Type II error) = probability of finding an effect that is there
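As a rough illustration of how fixing three of these quantities determines the fourth for a correlation coefficient, here is a minimal Python sketch. It uses the Fisher z approximation for a two-sided test, which is an assumption of this example and not necessarily what the tools mentioned in this post compute; the function names are hypothetical, and dedicated software will give slightly different numbers.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

def power_for_correlation(r, n, alpha=0.05):
    """Approximate power to detect a correlation r with n observations (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong tail is ignored.
    return NormalDist().cdf(math.atanh(abs(r)) * math.sqrt(n - 3) - z_alpha)

print(n_for_correlation(0.30))                     # 85
print(round(power_for_correlation(0.30, 85), 2))   # 0.8
```

Here the anticipated effect size (r = 0.30), the significance level (0.05), and the power (0.80) are given, and the sample size follows; the second function solves for power instead.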

One also needs to consider which statistical parameter is undergoing the power analysis. One needs to ask questions like the following:

Are we testing means from a given group? http://www.nss.gov.au/nss/home.nsf/pages/Sample+Size+Calculator+Description?OpenDocument

Are we testing means from different groups? http://www.ncbi.nlm.nih.gov/pmc/articles/PMC137461/

Are we testing correlation coefficients? Read Simon’s take on the impact of sample size on the desired level of precision in correlation coefficients (http://www.childrens-mercy.org/stats/weblog2005/CorrelationCoefficient.asp ).

Are we interested in significance level, effect size, sample size, or power?

When conducting an effect size analysis one must keep in mind that effect sizes estimate the strength of a possible relationship rather than assigning a significance level. Effect sizes do not determine significance levels, nor vice versa.

So, how do we go about implementing Power Analysis?

For those interested in implementing power analysis in the R language, I recommend the libraries described at http://www.statmethods.net/stats/power.html

Software for conducting power analysis is also available elsewhere, as shown in the following table. My favorites are G*Power and SPSS SamplePower (http://www.spss.com/software/statistics/samplepower/).

Power Analysis Software. Source: http://www.epibiostat.ucsf.edu/biostat/sampsize.html

| Software | License | Remarks |
| --- | --- | --- |
| G*Power | Free | Uses both exact and approximate methods to calculate power. It will deal with sample size/power calculations for t-tests, 1-way ANOVAs, regression, correlation, and chi-square goodness of fit. For t-tests and ANOVAs you find the effect size by supplying mean and variance information. For correlation coefficients the effect size is a function of r^2. http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/ |
| PC-Size | Free | Deals with sample size/power calculations for t-tests, 1-way and 2-way ANOVA, simple regression, correlation, and comparison of proportions. http://www.esf.edu/efb/gibbs/monitor/usingDSTPLANandPCSIZE.pdf ftp://ftp.simtel.net/pub/simtelnet/msdos/statstcs/size102.zip |
| DSTPLAN | Free | Uses approximate methods to calculate power. It will calculate sample size/power for t-tests, correlation, a difference in proportions, 2xN contingency tables, and various survival analysis designs. http://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=41 |
| PS | Free | Performs sample size/power calculations for t-tests, Chi-square, Fisher’s exact, McNemar’s, simple regression, and survival analysis. http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize |
| TIBCO Spotfire S+ | Paid | The only commercially-supported statistical analysis software that delivers a cross-platform IDE for the award-winning S programming language, the ability to analyze gigabyte class data sets on the desktop, and a package system for sharing, reuse and deployment of analytics in the enterprise and in validated environments. Used widely in validated production environments (e.g., 21 CFR Part 11). http://spotfire.tibco.com/products/s-plus/statistical-analysis-software.aspx |
| NQuery Advisor | Paid | Performs sample size/power calculations for t-tests, 1- and 2-way ANOVAs, tests of contrasts in 1-way ANOVAs, univariate repeated measures designs, regression (simple, multiple and logistic), correlation, difference of proportions, 2xN contingency tables, and survival analyses. http://www.statsol.ie/nquery/nquery.htm |
| PASS | Paid | Performs sample size/power calculations for z-tests, t-tests, 1-, 2-, and 3-way ANOVAs, univariate repeated measures designs, regression (simple, multiple and logistic), correlations, difference in proportions, 2xN contingency tables, survival analyses and simple non-parametric analyses. http://www.ncss.com/pass.html |
| Stata | Paid | It has some simple built-in power and sample size functions. http://www.stata.com/ |
| SPSS SamplePower | Paid | If your sample size is too small, you could miss important research findings. If it’s too large, you could waste valuable time and resources. Finds the right sample size for your research in minutes and tests the possible results before you begin your study, with IBM SPSS SamplePower. Strikes the right balance among confidence level, statistical power, effect size, and sample size using IBM SPSS SamplePower. Compares the effects of different study parameters with its flexible analytical tools. http://www.spss.com/software/statistics/samplepower/ |

On Correlation Coefficients and Sample Size

18 Monday Oct 2010

Posted by egarcia in IR Tutorials, SEO Myths, Spam, Statistics and Mathematics

Today I updated my Tutorial on Correlation Coefficients to include a new section on the effect of sample size on the significance of correlation coefficients. This was motivated by some comments from search engine marketers on correlation strengths (http://searchenginewatch.com/3641002). The new material might help those interested in learning whether a reported correlation coefficient is statistically different from zero. It is given below. Enjoy it.

The problem with correlation strength scales is that these say nothing about how the size of a sample impacts the significance of a correlation coefficient. This is a very important issue that is now addressed.

Consider three different correlation coefficients: 0.50, 0.35, and 0.17. Assume that we want to test that there is no significant relationship between the two variables at hand. The null hypothesis (H0) to be tested is that these r values are not statistically different from zero (rho = 0). How to proceed?

As recommended by Stevens (17), for rho = 0, H0 can be tested using a two-tailed (i.e., two-sided) t-test at a given confidence level, usually 95%. If t_calculated ≥ t_table, H0 is rejected. However, if t_calculated < t_table, H0 is not rejected and there is no significant correlation between the variables.

Here t_calculated is computed as r/SE_r = r*SQRT[(n – 2)/(1 – r^2)], while t_table values are obtained from the literature (http://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values ). Table 2 summarizes the result of testing the null hypothesis at different sample size values.

Table 2. H0 tests at different sample sizes; two-tailed, 95% confidence.

| n | df = n – 2 | r | SE_r | t(calc) | t(0.95) | Reject H0: rho = 0? |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 3 | 0.50 | 0.50 | 1.000 | 3.182 | don’t reject |
| 10 | 8 | 0.50 | 0.31 | 1.633 | 2.306 | don’t reject |
| 12 | 10 | 0.50 | 0.27 | 1.826 | 2.228 | don’t reject |
| 14 | 12 | 0.50 | 0.25 | 2.000 | 2.179 | don’t reject |
| 20 | 18 | 0.50 | 0.20 | 2.449 | 2.101 | reject |
| 30 | 28 | 0.50 | 0.16 | 3.055 | 2.048 | reject |
| 40 | 38 | 0.50 | 0.14 | 3.559 | 2.024 | reject |
| 50 | 48 | 0.50 | 0.13 | 4.000 | 2.011 | reject |
| 5 | 3 | 0.35 | 0.54 | 0.647 | 3.182 | don’t reject |
| 10 | 8 | 0.35 | 0.33 | 1.057 | 2.306 | don’t reject |
| 12 | 10 | 0.35 | 0.30 | 1.182 | 2.228 | don’t reject |
| 14 | 12 | 0.35 | 0.27 | 1.294 | 2.179 | don’t reject |
| 20 | 18 | 0.35 | 0.22 | 1.585 | 2.101 | don’t reject |
| 30 | 28 | 0.35 | 0.18 | 1.977 | 2.048 | don’t reject |
| 40 | 38 | 0.35 | 0.15 | 2.303 | 2.024 | reject |
| 50 | 48 | 0.35 | 0.14 | 2.589 | 2.011 | reject |
| 5 | 3 | 0.17 | 0.57 | 0.299 | 3.182 | don’t reject |
| 10 | 8 | 0.17 | 0.35 | 0.488 | 2.306 | don’t reject |
| 12 | 10 | 0.17 | 0.31 | 0.546 | 2.228 | don’t reject |
| 14 | 12 | 0.17 | 0.28 | 0.598 | 2.179 | don’t reject |
| 20 | 18 | 0.17 | 0.23 | 0.732 | 2.101 | don’t reject |
| 30 | 28 | 0.17 | 0.19 | 0.913 | 2.048 | don’t reject |
| 40 | 38 | 0.17 | 0.16 | 1.063 | 2.024 | don’t reject |
| 50 | 48 | 0.17 | 0.14 | 1.195 | 2.011 | don’t reject |

The table shows at which sample size an r value is high enough to be statistically significant.

For n = 14, all three r values (0.50, 0.35, and 0.17) are not statistically different from zero.

For n = 30, r = 0.50 is statistically different from zero while r = 0.35 and r = 0.17 are not.

Conversely, among the sample sizes tested, r = 0.50 is not statistically different from zero when n is equal to or less than 14, while r = 0.35 is not different from zero when n is equal to or less than 30.

Finally, r = 0.17 is not statistically different from zero at any of the sample sizes tested.
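The rows of Table 2 can be reproduced with a short Python sketch. The critical values are copied from the table itself rather than computed, and the function names are made up for this example.

```python
import math

# Two-tailed 95% critical t values copied from Table 2, keyed by n (df = n - 2).
T_CRIT = {5: 3.182, 10: 2.306, 12: 2.228, 14: 2.179,
          20: 2.101, 30: 2.048, 40: 2.024, 50: 2.011}

def h0_test(r, n):
    """Return (SE_r, t_calc, reject) for H0: rho = 0 at a tabulated sample size n."""
    se_r = math.sqrt((1 - r * r) / (n - 2))  # standard error of r
    t_calc = r / se_r
    return se_r, t_calc, t_calc >= T_CRIT[n]

def smallest_significant_n(r):
    """Smallest tabulated n at which r is statistically different from zero, or None."""
    for n in sorted(T_CRIT):
        if h0_test(r, n)[2]:
            return n
    return None

for r in (0.50, 0.35, 0.17):
    print(r, smallest_significant_n(r))
# 0.5 20
# 0.35 40
# 0.17 None
```

The output matches the reading of the table: r = 0.50 first becomes significant at n = 20, r = 0.35 at n = 40, and r = 0.17 at none of the sizes tested.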

On Inverted Indexes

15 Friday Oct 2010

Posted by egarcia in Data Mining, Newsletters

The upcoming issue of IRW will be out soon. This will be Part 3 of the series on inverted index architectures.

In Part 1 we covered different types of inverted indexes: Boolean, non-positional indexes.

In Part 2 we covered some techniques for fast indexing.

In Part 3 we will be covering more on positional inverted indexes, examples, and techniques for fast intersection of posting lists.

Understanding Accuracy and Precision

14 Thursday Oct 2010

Posted by egarcia in IR Tutorials, SEO Myths

Students often have a hard time understanding the difference between accuracy and precision, particularly when they read quack “science” “studies” while surfing the Web. This post might help them to grasp these concepts.

What is Accuracy?

Accuracy is a term describing deviation of an experimental value from a target value. A target value is a value accepted as ‘true’. Constants, fundamental quantities, and theoretical values are considered ‘true values’. Thus, accuracy is proximity to a true value.

To illustrate, assume that a quantity x is measured. Its true value is x_t = 1.00 and we report an experimental value x_e of 0.90. The absolute error of this observation is |x_e – x_t| = 0.10 and its relative error is (|x_e – x_t| / x_t)*100 = 10%. The accuracy is the ratio of the experimental value to the true value. When expressed as a percent, it is called relative accuracy. In this case, x_e / x_t = 0.90/1.00. This corresponds to a 90% accuracy.
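The same arithmetic as a small Python sketch, using the definitions given above; the function name is hypothetical.

```python
def accuracy_report(x_e, x_t):
    """Return (absolute error, relative error in %, relative accuracy in %)."""
    abs_err = abs(x_e - x_t)
    rel_err = abs_err / abs(x_t) * 100
    rel_acc = x_e / x_t * 100
    return abs_err, rel_err, rel_acc

abs_err, rel_err, rel_acc = accuracy_report(x_e=0.90, x_t=1.00)
print(f"absolute error = {abs_err:.2f}")       # absolute error = 0.10
print(f"relative error = {rel_err:.0f}%")      # relative error = 10%
print(f"relative accuracy = {rel_acc:.0f}%")   # relative accuracy = 90%
```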

What is Precision?

Precision has been loosely defined as how reproducible experimental results are. However, modern convention makes a careful distinction between reproducibility (between-run precision) and repeatability (within-run precision). Furthermore, according to Freiser (1992),

  • Repeatability is the closeness of agreement between individual experimental results obtained with the same method on identical test material or samples, under the same conditions (same operator, same apparatus, same laboratories, and same intervals of time).
  • Reproducibility is the closeness of agreement between individual experimental results obtained with the same method on identical test material or samples, but under different conditions (different operator, different apparatus, different laboratories, and different intervals of time).

Note that the source of dispersion and errors in the experimental results is different in each case. Therefore, arbitrarily expressing the precision of results in terms of standard deviations without considering how the data was collected (within- or between-run precision) should be avoided.

Similarly, comparing any two standard deviations, or standard errors for that matter, without regard for how the data was collected (experimental conditions, number of degrees of freedom, different sampling times, etc.) should also be avoided. In particular, estimating or comparing precisions from data sets that constantly change between sampling times is a futile exercise.
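To make the within-run versus between-run distinction concrete, here is a minimal Python sketch. The three runs are hypothetical numbers invented for illustration, and using the standard deviation of the run means as the between-run figure is a simplification (a full treatment would use a variance-components analysis).

```python
from statistics import mean, stdev

# Hypothetical replicate measurements from three separate runs (invented data).
runs = {
    "run A": [10.1, 10.2, 10.0],
    "run B": [10.6, 10.5, 10.7],
    "run C": [9.8, 9.9, 9.7],
}

def within_run_sd(runs):
    """Pooled standard deviation of the replicates inside each run (repeatability)."""
    num = sum((len(v) - 1) * stdev(v) ** 2 for v in runs.values())
    den = sum(len(v) - 1 for v in runs.values())
    return (num / den) ** 0.5

def between_run_sd(runs):
    """Standard deviation of the run means (a rough view of reproducibility)."""
    return stdev([mean(v) for v in runs.values()])

print(round(within_run_sd(runs), 3))   # 0.1
print(round(between_run_sd(runs), 3))  # 0.404
```

Quoting either number alone as "the" standard deviation of the method would hide the fact that the two measure different sources of dispersion.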

Last but not least, the precision of a measurement depends on the measuring scale used. For instance, saying “He is about 55 years old.” is less precise than saying “He is 660 months old.” or than saying “He is 20,075 days old”.

References

Freiser, H. (1992). Concept Calculations in Analytical Chemistry. Chapter 12, p. 203. CRC Press, Boca Raton.

Miller, J. C. & Miller, J. N. (1984). Statistics for Analytical Chemistry. Chapter 1, p.19. Wiley, New York.

PS. I had misplaced repeatability and reproducibility, and I fixed a few more typos. Well and done. Thanks to Dr. J. C. for pointing that out.

Introduction to Nemeth Uniform Braille System (NUBS)

07 Thursday Oct 2010

Posted by egarcia in Data Mining, Programming

I’m currently playing with and trying to develop an encryption method using the Braille System and some kabbalistic elements as mapping components. There is something about the beauty of this system that has attracted me for a long time.

The implications of 6- and 8-dot matrix notation systems for IR are many. After reading on the Nemeth Uniform Braille System, you might grasp the point.

Check also the Unicode entities for braille at http://unicode.org/book/ch12.pdf
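As a small illustration of why dot-matrix cells map so naturally onto bits, here is a Python sketch based on the Unicode Braille Patterns block, which starts at U+2800 and assigns dot k of a cell to bit k – 1 of the offset. The function names are hypothetical, and this is not the encryption method mentioned above.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block

def braille_cell(dots):
    """Return the Unicode braille character with the given raised dots (numbered 1 to 8)."""
    code = BRAILLE_BASE
    for d in dots:
        if not 1 <= d <= 8:
            raise ValueError(f"dot number out of range: {d}")
        code |= 1 << (d - 1)  # dot k sets bit k - 1
    return chr(code)

def dots_of(cell):
    """Return the raised dots of a Unicode braille character as a sorted list."""
    bits = ord(cell) - BRAILLE_BASE
    return [d for d in range(1, 9) if bits & (1 << (d - 1))]

print(braille_cell([1]))            # ⠁
print(braille_cell([1, 2]))         # ⠃
print(dots_of(braille_cell([1, 2, 3, 5])))  # [1, 2, 3, 5]
```

A 6-dot cell therefore has 2^6 = 64 possible patterns and an 8-dot cell has 2^8 = 256, one per byte value.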
