IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: April 2010

Beware of SEO Statistical Studies

23 Friday Apr 2010

Posted by egarcia in Data Mining, Marketing Research, SEO Myths

Beware of SEO statistical studies as their findings can easily evolve into urban legends and myths when taken at face value. Why? Keep reading.

From time to time SEOs publish statistical studies in which incorrect reasoning is given in relation to statistical analysis and PageRank.

For instance back in 2007, an SEO published a study trying to find correlations between the rank of a page in the search results and PageRank. We wrote two rebuttals about that study:

http://irthoughts.wordpress.com/2007/11/20/a-pagerank-rank-correlation/
http://irthoughts.wordpress.com/2008/08/29/how-not-to-use-correlation-coefficients/

Evidently, we cannot read too much into correlation coefficients, particularly when no t-test analysis is provided. In fact, x-y paired variables can be almost orthogonal (perpendicular), suggesting independence, and still have high correlation coefficients, which was actually the case in the mentioned “study”.
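
To make the t-test remark concrete, here is a minimal sketch of the usual significance test for a Pearson coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom. The function name and the sample values of r and n are hypothetical and are not taken from the study discussed above.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for the null hypothesis of no linear correlation.

    r is a Pearson coefficient computed from n paired observations;
    the statistic is compared against Student's t with n - 2 degrees of freedom.
    """
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1.0:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Hypothetical values: the same r is judged very differently at different n.
t_small = correlation_t_statistic(0.8, 5)    # about 2.31
t_large = correlation_t_statistic(0.8, 30)   # about 7.06

# Two-tailed 5% critical values from a standard t table:
# 3.182 for 3 degrees of freedom, 2.048 for 28 degrees of freedom.
print(f"n = 5:  t = {t_small:.2f} (critical 3.182)")
print(f"n = 30: t = {t_large:.2f} (critical 2.048)")
```

With five pairs, r = 0.8 does not reach significance at the 5% level; with thirty pairs it does. A coefficient quoted without n and without the test says little.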

Data can also be non-linear and still have a high correlation coefficient, as many books on statistics show. Furthermore, a low or zero correlation coefficient does not necessarily mean that the data are not correlated. This is illustrated in the following figure, taken from “Statistics for Analytical Chemistry” (J. C. Miller and J. N. Miller, 1984).

[Figure: nonlinear correlation coefficients]

Note that coefficients (Pearson in this case) can be high or low and the data can still be nonlinear AND well correlated. These two results just show that the data do not fit a linear model. Simply looking at a coefficient and inferring no correlation because it is too low is misleading (bottom figure). Similarly, if the coefficient is high we cannot just neglect non-linearity (top figure) and assume linearity. Please enlarge the figure and read the text above it to understand why.
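
Both situations are easy to reproduce. The sketch below uses made-up data (not the data behind the figure): a parabola sampled symmetrically around zero gives a Pearson coefficient of exactly zero even though y is fully determined by x, while the same parabola sampled on one side only gives a coefficient of about 0.96 even though the relationship is a curve.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only: y = x^2 in both cases.
xs_sym = [-3, -2, -1, 0, 1, 2, 3]
r_low = pearson(xs_sym, [x * x for x in xs_sym])    # exact dependence, r = 0

xs_pos = list(range(11))
r_high = pearson(xs_pos, [x * x for x in xs_pos])   # about 0.96, still a curve

print(f"symmetric parabola: r = {r_low:.3f}")
print(f"one-sided parabola: r = {r_high:.3f}")
```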

In addition, to find a pattern in the data, other tests like PCA or SPCA are necessary. See the tutorial at

http://www.miislita.com/information-retrieval-tutorial/pca-spca-tutorial.pdf

The selection of one coefficient over the other can be made with or without any a priori statistical knowledge about the data. However, while previous knowledge about the nature of the data can help, it can also hurt statistical reasoning, as a tester can easily be biased toward finding what he/she wants to find, overlooking other types of pattern-finding analyses and reasoning or, worse, arbitrarily dropping outliers to fit results to a preconceived data model.

Now that I’m on the subject of choosing one correlation coefficient over the other, in a recent post another SEO (Randfish) published an article, “The Science of Ranking Correlations” (http://www.seomoz.org/blog/the-science-of-ranking-correlations), in which he states:

“Why did we use Spearman rather than Pearson correlation?

Pearson’s correlation is only good at measuring linear correlation, and many of the values we are looking at are not.  If something is well exponentially correlated (like link counts generally are), we don’t want to score them unfairly lower.”

Almost immediately, another poster (whiteweb_b) contradicted that assertion and wrote:

“Rand your (or Ben’s) reasoning for using Spearman correlation instead of Pearson is wrong. The difference between two correlations is not that one describes linear and the other exponential correlation, it is that they differ in the type of variables that they use.”

The poster is right on that one, especially when no t-test analysis or additional pattern/trend finder techniques were provided. So, which coefficient should we use, when, and why: Pearson or Spearman? That’s a fair question. We covered this in the following post:

http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/
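
As a quick illustration of what each coefficient operates on, the sketch below computes Spearman's coefficient as Pearson's coefficient applied to ranks (tied values get the average of their positions). The helper names and the exponential sample data are made up for illustration; this is not the procedure used in the SEOmoz article or in the linked post.

```python
import math

def pearson(xs, ys):
    """Pearson coefficient computed on the values themselves."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman coefficient: Pearson applied to the ranks of the data."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical, strictly increasing (exponential) data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 ** x for x in xs]
print(f"Pearson  = {pearson(xs, ys):.2f}")    # about 0.85
print(f"Spearman = {spearman(xs, ys):.2f}")   # 1.00, the ranks agree exactly
```

Spearman only sees the ordering of the values, so any strictly increasing relationship scores 1. Pearson works on the measured values themselves. Neither number by itself identifies the shape of the relationship.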

Something that few realize is the connection between cosine similarity and the Pearson and Spearman coefficients. This was also discussed in the following post:

http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/
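
One standard way to state that connection (not necessarily the derivation used in the linked post) is that Pearson's coefficient is the cosine similarity of the two vectors after each has been mean-centered, and Spearman's coefficient is the same quantity computed on ranks. A minimal sketch with made-up data:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centered(v):
    """Subtract the mean from every component."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def pearson_via_cosine(x, y):
    """Pearson coefficient as the cosine of the mean-centered vectors."""
    return cosine(centered(x), centered(y))

# Hypothetical data. Both lists are permutations of 1..5, so they are
# already ranks and the result is also Spearman's coefficient.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_via_cosine(x, y)

# Cross-check with the classic no-ties Spearman formula 1 - 6*sum(d^2)/(n(n^2-1)).
n = len(x)
d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
rho = 1 - 6 * d_squared / (n * (n * n - 1))

print(f"cosine of centered vectors = {r:.3f}")   # 0.800
print(f"Spearman formula           = {rho:.3f}") # 0.800
```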

To sum up, beware of SEO “Science” and its statistical “studies”.

I hope this helps.

PS: I updated this post to add the figure above and a few additional comments.

Keywords Matrix Extraction: Another use for the old meta keywords tag

22 Thursday Apr 2010

Posted by egarcia in Data Mining, Newsletters

As described in the current issue of the IRW newsletter, meta information (structured or unstructured) can be extracted from Web documents with dashboard technology.

To illustrate this, we have incorporated a non-intrusive dashboard in the Fractal Resource Index sub-site of Mi Islita.com (http://www.miislita.com/fractals/index.html). One of its channels reads the content of the meta keywords tag and generates a keyword matrix of bigrams (two-term keywords).
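
A minimal sketch of how such a matrix could be built is shown below. It is not the dashboard's actual code: it assumes the tag content is a comma-separated list of terms, and the function name and sample keywords are hypothetical.

```python
def bigram_matrix(meta_keywords: str):
    """Build a square matrix whose cell (i, j) holds the bigram 'term_i term_j'.

    Assumption: the meta keywords content is a comma-separated list of terms.
    Diagonal cells are None because a term is not paired with itself.
    """
    terms = []
    for raw in meta_keywords.split(","):
        term = raw.strip().lower()
        if term and term not in terms:     # drop blanks and duplicates
            terms.append(term)
    matrix = [
        [None if a == b else f"{a} {b}" for b in terms]
        for a in terms
    ]
    return terms, matrix

# Hypothetical tag content.
terms, matrix = bigram_matrix("fractals, chaos, geometry")
for term, row in zip(terms, matrix):
    print(term, row)
```

Keeping both word orders (cell (i, j) and cell (j, i)) is a design choice here, since a search engine may treat "fractals chaos" and "chaos fractals" differently.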

Clicking on a bigram opens a new browser window and submits the bigram as a query to a search engine (in this case, Google). This allows one to estimate keyword co-occurrence statistics and to see which pairs of terms are relevant to the current document. In addition, if each column (or row) is treated as a topic vector, this allows one to identify which topics are associated with a set of terms. Another advantage is that a second matrix can be constructed by inspecting a current relevance matrix. This is done as follows.

A cell bigram is first clicked. If the current page is among the top N ranked documents, the cell is color-coded in black; otherwise, in white. Alternatively, cells can be coded with the rank obtained. This simple exercise allows one to identify a cluster of terms that is semantically relevant to the current page.
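
The coding step can be sketched as follows. The ranks are assumed to have been noted by hand after clicking each cell; the `observed_ranks` dictionary, the sample matrix, and the function name are all hypothetical.

```python
def relevance_matrix(matrix, observed_ranks, top_n=10):
    """Color-code each bigram cell: 'black' if the current page ranked
    within the top N results for that bigram, 'white' otherwise.

    observed_ranks maps a bigram to the rank seen for the current page;
    bigrams missing from it are treated as not ranked.
    """
    coded = []
    for row in matrix:
        coded_row = []
        for bigram in row:
            if bigram is None:                 # diagonal cell, nothing to test
                coded_row.append(None)
                continue
            rank = observed_ranks.get(bigram)
            in_top = rank is not None and rank <= top_n
            coded_row.append("black" if in_top else "white")
        coded.append(coded_row)
    return coded

# Hypothetical bigram matrix and hand-recorded ranks.
sample = [
    [None, "fractals chaos"],
    ["chaos fractals", None],
]
observed_ranks = {"fractals chaos": 4, "chaos fractals": 37}
coded = relevance_matrix(sample, observed_ranks, top_n=10)
print(coded)   # [[None, 'black'], ['white', None]]
```

To code cells with the rank instead of a color, the same loop can append `rank` in place of the color string.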

Be advised that results can change over time or across search engines. Once relevant bigrams have been identified, trigrams (three-term keywords) containing one or two terms from the relevant bigrams can be constructed and tested. Indeed, this is how we construct trigrams for some of the sub-site pages.

This allows one to monitor, test, and, if necessary, change keywords. Originally, matrices were created out of anchor text and also the entire document. However, the resultant matrices were too big. Testing and maintenance were also a formidable task. Another alternative tested was selecting the top N terms according to their term weights (vector space based), constructing the matrix out of these, and then carrying out the analysis as before.
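
For the term-weight alternative, a minimal sketch is given below. It assumes a plain tf·idf weight (term frequency times log(N/df)), which is only one of many vector space weighting schemes and not necessarily the one we used. The tiny corpus and the function name are made up.

```python
import math
from collections import Counter

def top_n_terms(doc_tokens, corpus, n):
    """Return the n terms of one document with the largest tf * idf weight.

    corpus is a list of tokenized documents and should include the document
    itself, so every term has a document frequency of at least 1.
    """
    tf = Counter(doc_tokens)
    num_docs = len(corpus)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = freq * math.log(num_docs / df) if df else 0.0
    # Heaviest first; ties broken alphabetically for a stable result.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical three-document corpus.
corpus = [
    ["fractal", "geometry", "fractal", "dimension"],
    ["search", "engine", "geometry"],
    ["search", "fractal", "index"],
]
selected = top_n_terms(corpus[0], corpus, n=2)
print(selected)   # ['dimension', 'fractal']
```

The selected terms can then be paired into a bigram matrix in the same way as the terms taken from the meta keywords tag.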

Still, we settled for the meta keywords tag purely for convenience, as terms can anyway be selected based on their term weights, stored in a single HTML element, and easily tested.

IRWatch:2010-04 – Meta information extraction through dashboard technology

16 Friday Apr 2010

Posted by egarcia in Data Mining, Newsletters

Meta Information Extraction

The current issue of the IRW newsletter will arrive in subscribers’ inboxes over the weekend. In this issue, we examine how to extract hidden information through dashboard technology:

“Every HTML document contains hidden meta information (i.e., information about information) that is usable to businesses and average users. This information can be either structured or unstructured.

Structured data can be extracted from the Document Object Model (DOM) by processing its markup tags; for instance, by extracting its meta tags. In the case of unstructured data this type of information is accessible through statistical analysis and other forms of math analysis.”
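
As a rough illustration of the structured case, the sketch below pulls name/content pairs out of meta tags with Python's standard `html.parser` module. It is not the newsletter's dashboard code; the class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def extract_meta(html: str) -> dict:
    """Return a dict mapping each meta tag name to its content."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return parser.meta

# Hypothetical page markup.
page = (
    '<html><head>'
    '<meta name="keywords" content="fractals, chaos">'
    '<meta name="Description" content="A fractal index" />'
    '</head><body></body></html>'
)
print(extract_meta(page))
```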

Enjoy it.

The Readability War

05 Monday Apr 2010

Posted by egarcia in Machine Learning, Programming

Yesterday we posted on Kincaid’s ARI. Although not mentioned, the main reason for blogging about it was that we are testing some readability index (RI) tools to be integrated into Mi Islita.com.

There has been a kind of endless readability “war” since the invention of the first tests. Thanks to the Internet this war is in full swing among academics. When it comes to computing such scores/indexes, should we:

include or exclude random samples from a given text? (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.5934&rep=rep1&type=pdf)
include or exclude syllables? (http://www.jstor.org/pss/27531033)
include or exclude punctuation?
include or exclude typos?
include or exclude spaces?
include or exclude non-dictionary terms?

Since punctuation can affect readability, semantics, coherence, etc. (http://www.aelfe.org/documents/text4-Sancho.pdf), why remove punctuation from the analysis? Unfortunately, current readability formulas only measure structural difficulty (http://www.iacis.org/iis/2008_iis/pdf/S2008_1071.pdf).

To do or not to do tokenization… that is the question.

Funny that some algorithms define “words” as a continuous sequence of nonspaces, regardless of whether these are punctuation characters or valid words. But then, word lists free from punctuation and spaces are used for assessing the performance of such algorithms. That smells,… and it is not lavender.
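
The mismatch is easy to see with a toy example. In the sketch below, one counter treats every run of non-space characters as a word, while the other strips punctuation first; the function names and sample sentence are made up.

```python
import string

def whitespace_tokens(text):
    """Every run of non-space characters counts as a 'word'."""
    return text.split()

def stripped_words(text):
    """Strip leading and trailing punctuation; drop tokens left empty."""
    words = []
    for token in text.split():
        word = token.strip(string.punctuation)
        if word:
            words.append(word)
    return words

sample = "Well , that smells ... and it is not lavender !"
print(len(whitespace_tokens(sample)))   # 11 "words"
print(len(stripped_words(sample)))      # 8 words
```

Any index that divides by the word count will give a different score depending on which of the two definitions is used.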

There is also the problem of scoring non-narrative text as found in most Web documents with formulas intended to be used for scoring narrative text. How futile is that?

And how futile is scoring Web documents with ever changing dynamic content?

Do readability formulas work? (http://blogs.wsj.com/numbersguy/do-readability-formulas-work-297/tab/article/)

And if so: which tool is better? It all depends on who is counting, how, and why.

Kincaid’s ARI: The Original Paper

04 Sunday Apr 2010

Posted by egarcia in Programming

Here is the original article describing Kincaid’s Automated Readability Index (ARI). Apparently, there are several tools online that do not calculate ARI grade levels the right way.

The Instructions for Automated Readability Index section explains how this is done.

See also Appendix E of this PDF.
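
For orientation only, here is a minimal sketch of an ARI-style calculation of the form a·(characters/words) + b·(words/sentences) − c. The default coefficients (4.71, 0.5, 21.43) are the commonly cited ones and are an assumption here; the paper's own instructions take precedence, so the coefficients are left as parameters. The counting rules (alphanumeric characters only, whitespace-delimited words, sentences split on ., ! and ?) are also assumptions, and are exactly the kind of detail on which tools disagree.

```python
import re

def ari_score(text, char_coef=4.71, sentence_coef=0.5, intercept=21.43):
    """ARI-style grade estimate; coefficients and counting rules are assumptions."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(1 for ch in text if ch.isalnum())   # letters and digits only
    chars_per_word = characters / len(words)
    words_per_sentence = len(words) / len(sentences)
    return char_coef * chars_per_word + sentence_coef * words_per_sentence - intercept

# Hypothetical sample: 26 characters, 9 words, 2 sentences.
sample_text = "The cat sat on the mat. It was warm."
print(round(ari_score(sample_text), 2))
```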

Historical Typo: Note that in the Document Resume section of the paper, the author’s name was misspelled as Kinkaid. This has caused a spurious record across some journals and reference works, as no one known as Kinkaid wrote the article. This reminds me of The most influential paper that Gerard Salton never wrote.

Edward Roberts, Father of Altair Dies

03 Saturday Apr 2010

Posted by egarcia in Programming

Ed Roberts, creator of the first inexpensive microcomputer, the MITS Altair, dies at 68.

http://www.nytimes.com/2010/04/03/business/03roberts.html

Before the modern Internet and PC revolution,… Dr. Roberts already figured things out.

Sad to say, life is not about who got there first, but who cashed in first.

Ask Edison, Gates, or Google.

A great ppt presentation on Latent Semantic Indexing

01 Thursday Apr 2010

Posted by egarcia in Latent Semantic Indexing

Here is a nice ppt presentation from Prabhakar Raghavan, Christopher Manning, and Thomas Hofmann’s lectures on Latent Semantic Indexing. I was happy to see in the notes of slide 25 that my SVD and LSI Tutorial was referenced.

For those who were gamed by some unethical SEOs using LSI verbose crap, check the following oldie, but still relevant, post:

http://irthoughts.wordpress.com/2007/05/01/irwatch-may-issue-demystifying-lsi/
