IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: August 2008

How Not to Use Correlation Coefficients

29 Friday Aug 2008

Posted by egarcia in Data Mining, SEO Myths

≈ 3 Comments

This is a continuation of yesterday's post.

In this post http://irthoughts.wordpress.com/2007/11/20/a-pagerank-rank-correlation/ we discussed a study that claims a PageRank-Rank correlation. The study was criticized by many SEOs on the grounds that it was flawed.

They were right, but none of them provided a statistical assessment, which is what prompted our post back then.

The study seems to perpetuate SEO myths regarding ranking results.

If you look at the study, the reported slopes were very small, indicating that variables were almost orthogonal (independent), despite the correlation coefficients.

Second, no t-test confidence analysis of the correlation coefficients or slopes was provided.

Third, no transparency in the data sampling process was provided.

This is an example of how not to do stat analysis.

Spearman and Pearson Correlation Coefficients

28 Thursday Aug 2008

Posted by egarcia in Data Mining

≈ 4 Comments

I’ve been asked to explain the difference between Spearman (S) and Pearson (P) Correlation Coefficients. Good question as these are frequently used in data mining studies.

 I hope this helps.

S is equivalent to P computed on the variables after these have been transformed into rank-orders. In that case, we can determine S from the coefficient of determination (D) of a linear regression on the ranks. For instance, if D = 0.49, then |S| = 0.7. BTW, D = 0.49 means that 49% of the variation can be explained by the regression model, while 51% cannot. Thus, to compute S with Excel, simply rank-order the variables, apply linear regression on a scatter plot of the ranks, and take the square root of the coefficient of determination. Also inspect the slope: the square root only gives the magnitude, and S takes the sign of the slope.

Any change in the original variables that does not affect the rank-order should not change S, although it may change P. For a given set of variables, if S > P, we might conclude that the variables are consistently correlated, but not in a linear fashion. However, if S and P are very similar and different from zero, there is an indication of a linear relationship.
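The rank-then-correlate recipe can be sketched in a few lines of Python. This is only an illustration: the data are made up, the helper names (`ranks`, `pearson`, `spearman`) are ours, and ties are handled by average ranks, which is one common convention rather than the only one.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson coefficient P of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman coefficient S: Pearson computed on the rank-orders."""
    return pearson(ranks(x), ranks(y))

# Made-up data: y = x**3 is monotonic but not linear.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
p = pearson(x, y)
s = spearman(x, y)
d = s ** 2  # coefficient of determination of the regression on the ranks
```

Here the rank-orders of x and y coincide, so S reaches 1 while P stays below 1, which is the S > P pattern of a consistent but nonlinear relationship.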

Pros/Cons of S

It is less sensitive to bias due to outliers and does not require the data to be metrically scaled or to meet normality assumptions, although it does rest on assumptions about the symmetry of a Gaussian-like distribution. It is applied to ordinal variables. Ties must be factored into the computations, and the calculations are tedious.

Pros/Cons of P

It is easy to compute. It assumes normality in both variables. It is sensitive to outliers.

Important Notes on Correlation

A correlation coefficient varies from +1 to -1. If it is zero, the variables show no linear (for P) or monotonic (for S) association. If it is positive, these are positively correlated: one increases when the other increases. If it is negative, these are negatively correlated: one increases when the other decreases and vice versa.

Correlation is not causality. It is just a measure of association between variables that addresses whether these covary. It is not necessary to prejudge these as dependent or independent before estimating correlation.

To determine whether these covary in a significant fashion, we can apply a t-test to the correlation coefficient at n – 2 degrees of freedom and a given confidence level, usually 95%.
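As a rough sketch of that test, the usual statistic is t = r·sqrt((n – 2)/(1 – r²)), which is then compared against a tabulated critical value for n – 2 degrees of freedom. The function name and the sample numbers below are illustrative assumptions, not taken from any particular study.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for the null hypothesis of no correlation (n - 2 d.f.)."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * sqrt((n - 2) / (1 - r ** 2))

# Hypothetical example: r = 0.7 observed on n = 10 pairs.
t = correlation_t_statistic(0.7, 10)
degrees_of_freedom = 10 - 2
```

For these numbers t is about 2.77, which exceeds the two-tailed critical value of 2.306 for 8 degrees of freedom at the 95% level, so the correlation would be judged significant. With the same r and fewer observations it might not be.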

PS.

In a more recent post (http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/), I explained the connection between Pearson and Spearman coefficients with cosine similarities and dot products, and a particular case wherein all these are equivalent.

References

https://www.msu.edu/~nurse/classes/summer2002/813/week8spearman.htm

http://www.chipst2c.org/lectures/Stat_lecture_correlation.pdf

http://www.statpac.com/statistics-calculator/correlation-regression.htm

TREC 2009 Track Proposals

22 Friday Aug 2008

Posted by egarcia in Conferences

≈ Leave a Comment

Yesterday I received an email from Dr. Ellen Voorhees, Chair of the TREC Program Committee, informing us of the call for track proposals for TREC 2009. For those not familiar with it, TREC, which is part of NIST, is where participants gather to explore new IR and search technologies, retrieval models, frameworks, etc.

Sorry, but SEO nonsense and hearsay are not allowed at TREC.

For additional information, visit http://trec.nist.gov/overview.html.

To disseminate the good news to our IR audience, I am reproducing the call below.

Dear TREC community,

TREC uses a track proposal mechanism to select the set of tracks to be run in a given TREC. We are currently soliciting proposals for tracks to include in TREC 2009. All candidate tracks (both existing and newly proposed) must submit a proposal by September 15, 2008. The submission deadline is in mid-September so that the TREC program committee will have time to make track selections before the TREC 2008 meeting in November. This allows the track discussions held at the TREC meeting to be informed as to the status of the track for the following year.

The criteria for judging a track proposal are as before: a strong advocate who is willing to be the track coordinator (track coordinator is a volunteer position); a large enough core of interested researchers to make the track viable; the availability of sufficient resources such as appropriate corpora and assessors with expertise in the area; and the fit with other tracks.

Proposals need to contain enough information for the PC to assess the criteria above. Proposals should contain an explicit statement of the goals of the track (i.e., what is expected to be learned and/or what infrastructure would be created if the track were run). If relevance judging (or some similar sort of annotation) is required, the proposal needs to include where the judging would occur (NIST or elsewhere?), any special qualifications the assessors would need (special domain expertise required?), as well as an estimate of the amount of time such assessing would require. Any special constraints on the document sets needed should also be noted. Finally, proposals must contain full contact details of the proposer.

Proposals should be sent as either a postscript, PDF, or ASCII document to trec@nist.gov.

Ellen Voorhees
Chair, TREC program committee

On Online Hackers, Marketers, and Criminals

19 Tuesday Aug 2008

Posted by egarcia in Hacking, Homeland Security, Spam

≈ Leave a Comment

Hackers that market themselves are fully getting into the crime scene.

We have seen marketers getting into hacking and vice versa: hackers getting into marketing. Designing web pages that rank high in the search engines for the sole purpose of using these to spread malicious resources and tools is one example. We call them hacketers = hackers + marketers.

Now hackers are getting physical.

Back in March 2008, it was reported how hackers were causing harm to folks suffering from epilepsy. Some usability and accessibility marketers are using those incidents to better promote their own services a la your-problem-is-my-opportunity.

Other marketers are creating reputation management problems and then ‘go back through the kitchen’ to market “reputation management” solutions. A scam not any different from the click fraud scam promoted by marketers who are part of a mob organization. Hah, Hah.

Now, we have the news of a hacker allegedly kidnapping and torturing another alleged hacker.

These probably are the first cases of hackers physically hurting others.

What is next? Google worse than ISP Snooping? –as AT&T claims.

Sometimes controlling information is worse than physically controlling others.

Ah, the many faces of opportunism.

IR and SEO Misnomers

14 Thursday Aug 2008

Posted by egarcia in Latent Semantic Indexing, SEO Myths

≈ Leave a Comment

Dictionary defense definition lovers, here is one for you: misnomer.

According to Dictionary.com, a misnomer is:

1. a misapplied or inappropriate name or designation.
2. an error in naming a person or thing.

i.e., an inappropriate designation.

Once in a while, misnomers are found in IR. Here are at least two:

1. Binary Independence Model
2. Latent Semantic Indexing (LSI)

Here is a quick overview

1. Binary Independence IR Model

Back in 1991 Cooper wrote “Some Inconsistencies and Misnomers in Probabilistic Information Retrieval”.

His article’s abstract states:

“The probabilistic theory of information retrieval involves the construction of mathematical models based on statistical assumptions of various sorts. One of the hazards inherent in this kind of theory construction is that the assumptions laid down may be inconsistent with the data to which they are applied. Another hazard is that the stated assumptions may not be the real assumptions on which the derived modelling equations or resulting experiments are actually based. Both kinds of error have been made repeatedly in research on probabilistic information retrieval. One consequence of these lapses is that the statistical character of certain probabilistic IR models, including the so-called ‘binary independence’ model, has been seriously misapprehended.”

In that article Cooper wrote:

“Since one can derive all possible assertions from an inconsistent theory, such a theory must be meaningless – entirely lacking in significance or predictive power. It makes no sense that good experimental results could come out of an inconsistent theory.”

“It is tempting to explain this conundrum by suggesting that the inconsistencies in question were only minor ones. However, there is no such thing as a theory that is just ‘a little bit’ inconsistent. A theory cannot be just a little bit inconsistent, any more than the scientist proposing it can be just a little bit pregnant. A logical inconsistency, if its implications are followed out, destroys a theory utterly. It is a disaster, and is totally unacceptable. If rationality is to be preserved, inconsistencies simply cannot be tolerated.”

2. Latent Semantic Indexing

Contrary to SEO myths, LSI is not an “indexing” model, but an SVD matrix decomposition technique. Despite what was written in the early literature on LSI, today we know that LSI does not assess semantics, but gets its power from high-order co-occurrence relationships. Furthermore, one can grasp latent (i.e., hidden) structures present in a system by using diverse techniques other than LSI. In the early papers on LSI, structured collections of journal articles and abstracts about specific topics (medical, science, and computers) were used. These were rich in synonyms. Evidently, the latent semantic structures identified these by forming clusters of synonyms and related terms.

Soon after, SEOs that misread those papers developed a synonymity myth around LSI. The fact is that in LSI we can arrive at the so-called “LSI latent clusters” regardless of whether terms are synonyms or not. This myth is easy to debunk. Once the clusters have been obtained, arbitrarily replace one term in the term-doc matrix that is known to belong to a cluster with a new arbitrary term not present in the collection, keeping the weights intact. Run the SVD algorithm again. You should arrive at the same latent structure, but the new term might not be semantically related at all to the terms in the cluster, since the algorithm only understands and processes numbers, not semantics. Thus, it processes a la garbage in-garbage out. This is a great topic for another ‘SEO Myth Debunking’ IRW article.

For those SEOs, marketers, and spammers that claim to know what LSI is, like Aaron Wall and the like, read: http://irthoughts.wordpress.com/2007/05/01/irwatch-may-issue-demystifying-lsi/
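The term-swap experiment can be tried with a toy matrix. The sketch below assumes NumPy is available; the terms, the weights, and the `term_coordinates` helper are all made up for illustration and are not from any LSI paper.

```python
import numpy as np

# Hypothetical term-document weights (rows = terms, columns = documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
weights = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

def term_coordinates(matrix, k=2):
    """Rank-k term coordinates U_k * S_k from a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

before = dict(zip(terms, term_coordinates(weights)))

# Replace one label with an arbitrary string, keeping the weights intact.
swapped_terms = ["zqxv" if t == "automobile" else t for t in terms]
after = dict(zip(swapped_terms, term_coordinates(weights)))

# The meaningless label lands exactly where "automobile" was.
same_position = np.allclose(before["automobile"], after["zqxv"])
```

The decomposition never sees the labels, only the weights, so the nonsense string inherits the position of the term it replaced.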

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (ACM, SIGIR ’91) (pp 57-61). Chicago, Illinois: ACM.

Mike Grehan
Lies, Lies, and LSI by Mike Grehan
http://www.clickz.com/showPage.html?page=3623571

Bill Slawski
Personalization Through Tracking Triplets of Users, Queries, and Web Pages
http://www.seobythesea.com/?p=535

Rand Fishkin
InfoSearch Media & ContentLogic – Purveyors of Falsehoods
http://www.seomoz.org/blog/infosearch-media-contentlogic-purveyors-of-falsehoods

Lee Odden
5 Myths about SEO
http://www.toprankblog.com/2006/12/5-myths-about-seo/

Marios Alexandrou
The History of Latent Semantic Indexing
http://www.searchgrit.com/history-of-latent-semantic-indexing/

Carson
Web content and LSI mega-rant. Part Two…
http://contentdonebetter.com/2007/03/30/web-content-and-lsi-mega-rant-part-two/

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/  

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

Independence, Disjointness, and IR Flaws

11 Monday Aug 2008

Posted by egarcia in Machine Learning, Vector Space Models

≈ Leave a Comment

Often independent events are mistaken for exclusive (disjoint) events. These are two different animals.

Consider two events, A and B. Let p(A OR B) be their union probability and p(A AND B) their joint probability. The general addition law for probabilities states that for any two events A and B

p(A OR B) = p(A) + p(B) – p(A AND B)

If events are independent

p(A AND B) = p(A)p(B)

Thus,

p(A OR B) = p(A) + p(B) – p(A)p(B)

Whereas if these are exclusive

p(A AND B) = 0

Therefore,

p(A OR B) = p(A) + p(B)

Furthermore,

if p(A AND B) = p(A)p(B) events are independent, occurring by chance.
if p(A AND B) > p(A)p(B) events are positively correlated, occurring more often than by chance.
if p(A AND B) < p(A)p(B) events are negatively correlated, occurring less often than by chance.

Talking in “rice and beans” (Hablando en “arroz con habichuelas”):

Exclusive events do not have common outcomes as the occurrence of one excludes the occurrence of the other. By contrast, independent events have common outcomes, but the occurrence of one does not influence the occurrence of the other.

Independence and disjointness are very different things.
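A small enumeration makes the contrast concrete. The sketch below uses one roll of a fair six-sided die; the events and helper names are our own choices for illustration.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # one roll of a fair six-sided die

def p(event):
    """Probability of an event, given equally likely outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def relation(a, b):
    """Compare p(A AND B) with p(A)p(B)."""
    joint, product = p(a & b), p(a) * p(b)
    if joint == product:
        return "independent"
    return "positively correlated" if joint > product else "negatively correlated"

def is_exclusive(a, b):
    """Exclusive (disjoint) events have zero joint probability."""
    return p(a & b) == 0

even = frozenset({2, 4, 6})
at_most_four = frozenset({1, 2, 3, 4})
odd = frozenset({1, 3, 5})

# The general addition law holds for any pair of events.
assert p(even | at_most_four) == p(even) + p(at_most_four) - p(even & at_most_four)
assert p(even | odd) == p(even) + p(odd)  # exclusive case: joint term is zero
```

Here `even` and `at_most_four` share outcomes yet are independent (1/3 = 1/2 × 2/3), while `even` and `odd` are exclusive and therefore far from independent: their joint probability is 0, not 1/4.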

In IR, assuming that the IDF of a combination of terms can be taken for the sum of individual term IDF values presumes that terms are independent regardless of the actual data.
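To see why the sum presumes independence, take IDF as log(N/df), i.e., the negative log of the fraction of documents containing a term. That definition is a common one and is assumed here. If two terms are independent, the fraction of documents containing both is the product of the individual fractions, so the logs add; otherwise they do not. The toy collection below is invented for illustration.

```python
from math import log, isclose

def idf(docs, *terms):
    """IDF of the event 'document contains all of the given terms'.

    Toy helper: raises ZeroDivisionError if no document matches.
    """
    df = sum(1 for doc in docs if all(t in doc for t in terms))
    return log(len(docs) / df)

# Made-up collection of 8 documents, each reduced to a set of terms.
# "a" and "b" co-occur exactly as often as chance predicts (2 of 8).
# "x" and "y" always appear together (4 of 8).
docs = [
    {"a", "b", "x", "y"},
    {"a", "b"},
    {"a", "x", "y"},
    {"a"},
    {"b", "x", "y"},
    {"b"},
    {"x", "y"},
    {"z"},
]

independent_ok = isclose(idf(docs, "a", "b"), idf(docs, "a") + idf(docs, "b"))
dependent_gap = idf(docs, "x", "y") < idf(docs, "x") + idf(docs, "y")
```

For the independent pair the sum reproduces the IDF of the combination; for the pair that always co-occurs, the sum overstates it.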

Arbitrarily assuming event independence, ignoring the experimental evidence, is one of the main sources of inaccuracies/flaws in many IR models (Cooper, 1991). However, excluding independence altogether is also unreasonable (Sparck Jones, Walker, and Robertson, 1998).

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (ACM, SIGIR ’91) (pp 57-61). Chicago, Illinois: ACM.

Sparck Jones, K., Walker, S., & Robertson, S. E. (1998). A probabilistic model of information retrieval: development and status. TR 446, September. Computer Laboratory, University of Cambridge.

Search Interfaces and Visual Clues

06 Wednesday Aug 2008

Posted by egarcia in Human-Computer Interaction

≈ 1 Comment

This is a continuation of yesterday’s post on Search Interface Usability. This time we want to touch upon visual clues and search interfaces.

Such clues should be obvious; i.e., they should guide users without explicitly having to explain anything.  Often users interpret such elements as a friendly environment.

Whoever is in charge of Google’s interface is good at it.

A screenshot of Google’s Book results for lsi tutorial illustrates this:

 

[Hmm. How convenient a query, you may think, since our LSI/SVD tutorial is referenced in a book that ranks #1. And you bet you are right :) ]

Nevertheless, back to the post.

Do a search in Google and note that its search interface has great usability clues in the form of anchor text consisting of action terms (below, in bold font).

At the top of the page you have a crumb menu ending with a “more” link. This invites the user to find more options. The arrow next to “more” suggests to the user that this triggers a pulldown menu.

At the far right corner there are two links, “MyLibrary” and “Sign in”, the latter instructing users to sign in; i.e., to register.

This is followed by visual clues like:

“Search Books” in the search button

“Advanced Book Search”

“Google Book Search Help”

“Showing” (next to a form selection menu)

“View all web results for…”

Then, there is also at the far right the following action text:

“List view” and “Cover view”

Note that to improve usability we don’t need to reinvent the wheel or mess with what users perceive as a “standard” search interface.

Search Interface Usability Issues

05 Tuesday Aug 2008

Posted by egarcia in IR Tools

≈ 4 Comments

Average search engine users don’t reformulate queries as we do in IR. They often recycle their query terms, using short queries of 2 to 3 terms. Frequently, their search sessions describe ‘query chains’ and use the default search mode.

Unless they are advanced-searching, most do not use query operators or shortcut search commands. Many do not consult a lookup list, thesaurus, or query logs to expand a query as we do in IR. Most don’t keep expanding a query. After a few sessions they simply move on to another Web resource that might satisfy their information requirements.

Most don’t care about searching for very rare terms or terms with a high discriminative power, preferring to search for ‘what is hot’ or for whatever terms satisfy their information needs. Period.

Most are lazy users whose mentality is: “Don’t make me think!” or “I’m too busy to deal with a cluttered interface or learn new how-tos”. Many are so lazy or busy that they don’t even scroll down a page. Others have a blue-linker mentality; i.e., they assume that blue underlined text has to be a link.

Thus, when a search interface is designed it should take into account the user’s search behavior, shaped mentality, and prejudgments. Their search experience should be guided by intuition and should be obvious, without requiring extra information in order to search and find relevant documents.

Search engines that do not provide users with a ‘lazy search experience’ often do not attract enough users, visitors, or advertisers. I hope to be wrong about this one, but the two new search engines, Cuil and SearchCloud, most likely will not make the A-List, in part because of several usability issues.

Size and hype are not enough.

PS. I forgot to mention that:

1. of the two above, Cuil is a bit more user friendly, but its entry page/result page design is awful.

2. “search interface” to me means anything the end-user must interact with in order to search and find. Among other things, this includes the query box, entry page, and the results page.

Claps and Slaps, the LSI Way

04 Monday Aug 2008

Posted by egarcia in Latent Semantic Indexing, SEO Myths, Spam

≈ 2 Comments

Claps

We are happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at the Indian Institute of Technology in Madras, India is using our SVD LSI tutorial as lecture material for his course: CS625, Memory Based Reasoning in AI. http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf 

Another investigator, this time from the cancer research field, congratulated us for the LSI tutorials. Jaime Fernandez Vera from Structural Biology and Biocomputing, Centro Nacional de Investigaciones Oncologicas, Madrid, Spain wrote (contact info removed):

Dear Dr. García:

Thank you very much for making your magnificent practical guides available to the community, especially the one on LSI, which is the one I followed.

Warm regards,

Jaime Fernández Vera

Structural Biology and Biocomputing
Centro Nacional de Investigaciones Oncológicas

Our LSI/SVD tutorials are also listed in http://www-timc.imag.fr/Benoit.Lemaire/lsa.html, a huge repository of LSI research resources.

For additional IR resources quoting our tutorials, check the following link at http://www.miislita.com.

http://www.miislita.com/searchito/educational-links.html.

Slaps

Talking about LSI…

Spammers disguised as ethical SEOs who promote LSI crap are now hiding. There is less talk in the blogosphere about “SEO LSI” and “LSI-friendly SEO Optimization” myths. As we always say, these crooks are a black eye to the ethical sector of the search marketing industry.

Their signature seems to be the promotion of crap tools and services like Keyword Density tools, Markov Chain generators (if you believe that crap), TFIDF rarity calculators, “semantic page strength” estimators, lookup lists based on “LSI operators”, etc. What will be their next effort at misleading the public? Latent Dirichlet Allocation (LDA) tools?

However, in an effort to save face, the usual suspects are still engaging in verbal gymnastics. They are desperate. It is clear that our efforts at exposing these crook marketers through IR knowledge are working.

Many are learning why they should stay away from the incorrect knowledge promoted by marketers that occasionally use IR jargon to pretend they know what they are talking about. They often make these attempts at IR-like talk to promote their image as “experts” before either naive or ignorant followers. We still cannot tell who is dumber, the snake-oil sellers or their groupies. They even game each other.

When we expose SEO myths from their competitors they praise us as long as the debunking works for them, but when their own myths are exposed they get angry at us. Ha, Ha.

Posts somewhat related to this post

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/ 

http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

Sneak Preview of IRW: Graduate Research

01 Friday Aug 2008

Posted by egarcia in Graduate Courses, Machine Learning, Marketing Research, Theses

≈ Leave a Comment

The current issue of IRW, Graduate Students Research, is out. It consists of short abstracts of research conducted by graduate students.

In this issue:

Introduction
Genetic Algorithms, K-Means, and Fuzzy C-Means
Word Association Patterns
U-Site Search Engine Interface
Enhancement of a U-Site Search Engine Interface
News, Research, and Events
Terms of Use and Copyright

The next issue will go back to its how-to mode.
