IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: October 2008

IRW Sneak Preview: Fraudulent Web Analytics

31 Friday Oct 2008

Posted by egarcia in Marketing Research, Newsletters, Spam

This post is the monthly sneak preview of the next issue of IR Watch Newsletter, now in its new format.

In this issue, the featured article aims to raise awareness of some of the schemes used to defraud those who make business decisions based on Web Analytics. If you are an advertiser or investor, you must read it. Don’t be gamed by unethical marketers and spammers.

The article exposes how some marketers/spammers engineer the fraud by gaming the wisdom of crowds. We expose how traffic fraud, click-through injections, and form injections are used within viral networks to produce bogus Web Analytics that advertisers might be paying for or using to make critical decisions.

The Question of the Month column is dedicated to precision vs. recall.

In the Who is Who in IR section, the late Karen Sparck Jones is featured.

In the Top CS Departments, the CS Dept of Stanford University is featured.

We have a new column dedicated to historical notes on computers, search engines, and IR. In the current column, Hewlett-Packard origins are highlighted.

Last, but not least, more IR blogs and graduate theses are listed.

Now, some great news! Please keep reading.

We are currently in negotiations with a local university to co-launch an interesting start-up at the intersection of IR, search engines, and business research.

The way we see it, a bad economy presents opportunities. The time is right for such a unique project.

Similarity, Pearson, and Spearman Coefficients

29 Wednesday Oct 2008

Posted by egarcia in Data Mining, SEO Myths, Vector Space Models

This post shows how Pearson’s and Spearman’s Correlation Coefficients are connected to Cosine Similarity and Dot Products. As mentioned in previous posts:

Pearson’s Coefficient is equivalent to Spearman’s Coefficient for raw data that has been transformed into ranks.

http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/

Similarity is not Distance.

http://irthoughts.wordpress.com/2008/10/15/seos-and-their-semantic-distance-myths/

The ‘similarity distance’ expression is an oxymoron that should be avoided.

http://irthoughts.wordpress.com/2008/10/20/having-fun-with-oxymorons/

Arbitrary transformations between Distance and Similarity must be avoided.

http://irthoughts.wordpress.com/2007/09/17/binary-similarity-calculator/

It is important to know how to define Distance.

http://irthoughts.wordpress.com/2008/10/23/why-defining-distance-is-important/

Transforming Pearson’s Coefficients into Cosine Similarities

Pearson’s Correlation Coefficient is equivalent to the cosine of the angle between paired data represented as vectors, provided that the raw data is first transformed into z-scores:

z(xi) = (xi - mean_x)/s_x
z(yi) = (yi - mean_y)/s_y

where the mean is subtracted and the difference is normalized with respect to the standard deviation. Thus, we end up with mean-centered, deviation-normalized data. Z-score data is often loosely called ‘centered data’. Strictly, subtracting the mean is the step that matters here: the cosine of an angle does not change when a vector is rescaled, so dividing by the standard deviation leaves the result unchanged.

An Example

To convince yourself, consider the following data, for which Pearson’s Correlation Coefficient is 0.8837…

xi   yi
2    3
4    2
6    5
20   7

If you center the data and calculate its cosine similarity, it should be the same.

If the centered data is converted into column unit vectors, the dot product of the vectors is also 0.8837…, since for unit vectors cosine angle = dot product.

Now try this.

Rank the raw data. Next, center it, and then convert into unit vectors. Convince yourself that in this case

Spearman’s = Pearson’s = Cosine Angle = Dot Product = 0.8000…

Since Similarity is not Distance, these are not Distances.
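For readers who prefer to check the numbers by script rather than by hand, here is a minimal Python sketch of the exercise above. The helper names (center, unit, ranks, and so on) are my own choices for illustration, and the ranking helper assumes there are no tied values, which holds for this small data set.

```python
import math

def center(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def unit(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def ranks(values):
    """Rank 1 = smallest value. Assumes no ties (true for the data below)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

x = [2, 4, 6, 20]
y = [3, 2, 5, 7]

# Cosine of the centered raw data, i.e. Pearson's coefficient.
pearson = cosine(center(x), center(y))

# Dot product of the centered data converted into unit vectors.
dot_of_units = dot(unit(center(x)), unit(center(y)))

# Same recipe applied to the ranks, i.e. Spearman's coefficient.
spearman = cosine(center(ranks(x)), center(ranks(y)))

print(pearson, dot_of_units, spearman)
```

The first two printed values should agree with each other and with the 0.8837… quoted above, and the third with 0.8000…, up to floating point rounding.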

Why Defining Distance is Important

23 Thursday Oct 2008

Posted by egarcia in Latent Semantic Indexing, SEO Myths

In previous posts we mentioned the difference between distance (dissimilarity) and similarity. Both can be used to describe proximity; i.e., how alike objects are. A distance is a metric function, while similarity is a relative judgment of proximity. A generalization of distance is the Minkowski Distance, which includes the Manhattan or City Block Distance (order 1) and the Euclidean Distance (order 2) as special cases. The Euclidean Distance cannot be greater than the Manhattan Distance, so named because taxicab drivers in Manhattan can only go from one point to another by driving around rectangular city blocks.

We could expand on this topic and provide math arguments, explaining the Minkowski Distance or what a metric space is, wherein a distance ‘lives’. We could also add insult to injury and explain why many SEOs don’t have a clue when they mistake distance for similarity, or the nonsense of talking about ‘similarity distance’ and ‘semantic distance’ in a hyperdimensional space (say ‘Hi’ to SEOs that sell LSI Snake Oil), blah, blah, blah.

Instead, this time I want to provide a real-case scenario (taken from Chapter 2 of my upcoming ebook, Keyword Clustering Analysis with Excel) that should help readers and students understand the importance of properly defining distance.

In 2005, the New York Times reported on a case brought before the Court of Appeals in which a man named James Robbins was accused of selling drugs within 1,000 feet of a school. He was arrested in March 2002 on the corner of Eighth Avenue and 40th Street in Manhattan and charged with selling drugs to an undercover police officer. The nearest school, Holy Cross, is on 43rd Street between Eighth and Ninth Avenues (http://www.nytimes.com/2005/11/23/nyregion/23drugs.html?_r=1&oref=slogin).

The defendant’s lawyers argued that the distance should be measured as a pedestrian would walk city blocks; i.e., using the Manhattan or City Block Distance. They claimed the school was more than 1,000 feet away by walking from the site of the arrest.

Law enforcement officials calculated the Euclidean Distance, taking the distance up Eighth Avenue (764 feet) as one side of a right triangle and the distance to the church along 43rd Street (490 feet) as the other, to find that the hypotenuse was about 908 feet.
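The arithmetic behind the two positions can be sketched in a few lines of Python. This is only an illustration: the minkowski helper is my own, and the coordinates are an assumption that places the arrest site at the origin and uses the two legs quoted above (490 and 764 feet). The walking figure computed here is simply the sum of those two legs, not a number taken from the court record.

```python
def minkowski(p, q, order):
    """Minkowski distance of the given order between points p and q."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)

# Assumed layout: arrest site at the origin, school 490 ft along
# 43rd Street and 764 ft up Eighth Avenue (the two legs quoted above).
arrest = (0.0, 0.0)
school = (490.0, 764.0)

manhattan = minkowski(arrest, school, 1)  # order 1: walk around the blocks
euclidean = minkowski(arrest, school, 2)  # order 2: straight line

print("Manhattan (feet):", manhattan)
print("Euclidean (feet):", round(euclidean, 1))
print("Within 1,000 feet by Manhattan?", manhattan <= 1000)
print("Within 1,000 feet by Euclidean?", euclidean <= 1000)
```

Under these assumptions the same two points fall outside the 1,000-foot limit with one definition of distance and inside it with the other, which is the whole point of the case.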

The Court of Appeals upheld his conviction and determined that the Legislature’s intention effectively extended the boundaries of school grounds outward in order to encompass all public areas within a 1,000-foot radius of the school (http://www.courts.state.ny.us/ctapps/decisions/nov05/162opn05.pdf). Read the reactions to the ruling (http://volokh.com/posts/1132938765.shtml). It is a hilarious discussion.

Having Fun with Oxymorons

20 Monday Oct 2008

Posted by egarcia in Latent Semantic Indexing, SEO Myths

Over the weekend some people asked me to expand on oxymorons, so here goes this post.

As mentioned in our previous post, an oxymoron is a combination of contradictory terms. All antonyms are contradictory terms, but not all contradictory terms are antonyms. For instance ‘alone together’ and ‘big baby’ are oxymorons, but only the former consists of antonyms. Thus, it is possible to extract these two types of clusters from a list of oxymorons. We can also extract clusters of oxymorons with a common term or theme.

We must not mistake oxymorons for misnomers. A misnomer is an incorrect designation. Not all misnomers are oxymorons. For example, ‘binary independence’ is a misnomer, but not an oxymoron. All this is explained in my upcoming ebook Keyword Clustering Analysis with Excel.

As mentioned in my previous post, ‘similarity distance’ is an oxymoron since distance is dissimilarity. The expression is also a misnomer.

Unfortunately, some IR authors and many SEOs have used the ‘similarity distance’ expression. The problem here seems to be a combination of poor selection of words and lack of knowledge about basic IR cluster analysis concepts, particularly of LSI.

In Cluster Analysis, objects are grouped into clusters using Proximity, a criterion of how ‘close’ or alike objects are. Proximity can be defined as Distance (dissimilarity) or Similarity. Clustering by distance is a minimization problem, whereas clustering by similarity is a maximization problem.

There are more definitions for similarity than for distance. Which type of proximity and definition to use depends on the type and scale of the attributes of the data.
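To make the minimization versus maximization point concrete, here is a small Python sketch. The two centroids and the point are made-up values chosen only for illustration, and Euclidean distance and cosine similarity are just one possible pair of proximity definitions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical cluster centroids and a point to assign.
centroids = {"A": (10.0, 10.0), "B": (1.0, 0.0)}
point = (1.0, 1.0)

# Distance: pick the centroid that MINIMIZES the value.
by_distance = min(centroids, key=lambda name: euclidean(point, centroids[name]))

# Similarity: pick the centroid that MAXIMIZES the value.
by_similarity = max(centroids, key=lambda name: cosine_similarity(point, centroids[name]))

print("Assigned by distance:", by_distance)
print("Assigned by similarity:", by_similarity)
```

With these values the two criteria assign the same point to different clusters, so the choice of proximity definition is not a cosmetic one.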

In a very high-dimensional space the notion of distance or similarity is useless, if not meaningless. Thus, in a high-dimensional space, talking about a ‘semantic distance’ is a waste. We can try to do dimensionality reduction with LSI, but we end up computing cosine similarity, not distance.

The current state of the art is that LSI is also a misnomer as (a) its clustering power is the result of high-order word co-occurrence not semantics and (b) it is not exactly a document indexing method (before applying LSI, documents must be already indexed).

With regard to similarity and distance, it is tempting to think that one is just the numerical complement of the other or that we can blindly transform one into the other. These are short-sighted views. Fortunately, few IR researchers believe this. Unfortunately, we cannot say the same about search marketers and SEOs.

Want to know more about the difference between these two concepts? Study the topic or read my ebook. Don’t be fooled by self-proclaimed SEO “experts” or any “seobook”.

SEOs and their Semantic Distance Myths

15 Wednesday Oct 2008

Posted by egarcia in Latent Semantic Indexing, SEO Myths

As many of our readers know, one of the goals of this blog is to debunk SEO myths through IR knowledge. At times, we also try to clarify incorrect statements made in the IR field. It is not surprising to find, from time to time, IR papers with gross errors and misnomers. That’s why we believe the name of this blog, IR Thoughts, is on target.

Today’s SEO myth to debunk is the so-called Semantic Distance Myth in connection with LSI. But, first some definitions. The following material is taken from chapter 1 of my upcoming ebook Keyword Clustering Analysis with Excel.

Dissimilarity characterizes how different the objects in a cluster are.

Similarity indicates how similar the objects are.

In IR, it is customary to use the term ‘distance’ to mean dissimilarity. This is because, unlike similarity, dissimilarity is a distance metric. Thus, similarity and distance (dissimilarity) are opposite terms. We can convert similarities into distances and vice versa, but not without first understanding the model at hand * (http://www.miislita.com/searchito/binary-similarity-calculator.html).

Beware of Oxymorons and Misnomers

The expression ‘similarity distance’ is an oxymoron or a combination of contradictory terms (like a ‘small giant’, ‘approximately equal’, etc.). Unfortunately, the expression has been used in the IR literature (http://www.google.com/search?q=%22similarity+distance%22). Avoid it.

Some search marketers, when trying to explain Latent Semantic Indexing (LSI), have used expressions like the ‘semantic distance between words’ when in fact what is being discussed is word similarity or how words relate to each other or to a topic. Used in this context, their discourses are oxymoronic if not dumb, let alone the fact that LSI does not measure any ‘semantic distance’.

Some IR authors have also used the expression ‘distance between words’ in reference to the number of words between any two words. The expression is loosely accepted to describe word spacing/distribution.

Perhaps loosely excluding the last one, all these expressions are misnomers (incorrect designations) since they do not conform to the definition of a distance metric. Let us address this point.

Distance Defined

A function f is called a distance if it exhibits reflexivity, symmetry, and triangular inequality. To grasp these concepts, visualize three points (a, b, and c) describing a triangle.

Reflexivity means that the distance from a point to itself is zero; e.g., f(a, a) = f(b, b) = f(c, c) = 0.

Symmetry refers to the fact that the distance between any two points, measured from either one, is the same; e.g., f(a, b) = f(b, a).

Triangular Inequality requires that the distance between any two points must be equal to or less than the distance between them measured through a third point; e.g., f(a, b) + f(b, c) >= f(a, c).

If these conditions are not met, the function in question is not a distance.

Finally, note that distances cannot be negative and have no upper bound, unless their scales have been normalized.
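The three conditions above can be tested on sample points with a short Python sketch. The function names and the three sample points are my own choices for illustration. The second function, 1 minus the cosine similarity, is one common example of a similarity "blindly" turned into a so-called distance. Keep in mind that passing on a handful of points proves nothing in general, while a single failure is enough to show that a function is not a distance.

```python
import itertools
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_minus_cosine(u, v):
    """Numerical complement of the cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def failed_conditions(f, points, tol=1e-9):
    """Return the names of the conditions that f violates on the sample points."""
    failed = set()
    for a in points:
        if abs(f(a, a)) > tol:                      # f(a, a) = 0
            failed.add("reflexivity")
    for a, b in itertools.permutations(points, 2):
        if abs(f(a, b) - f(b, a)) > tol:            # f(a, b) = f(b, a)
            failed.add("symmetry")
    for a, b, c in itertools.permutations(points, 3):
        if f(a, b) + f(b, c) < f(a, c) - tol:       # f(a, b) + f(b, c) >= f(a, c)
            failed.add("triangular inequality")
    return failed

# Three sample points describing a triangle.
points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

print("euclidean fails:", failed_conditions(euclidean, points))
print("1 - cosine fails:", failed_conditions(one_minus_cosine, points))
```

On these points the Euclidean function violates none of the conditions, while the complement of the cosine breaks the triangular inequality, so it does not qualify as a distance in the sense defined above.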

* You might also want to check:

http://irthoughts.wordpress.com/2008/01/02/simcalc-binary-similarity-calculator/

or

http://www.cs.ualberta.ca/~lindek/papers/sim.pdf

PS.

Note to spammers and SEOs:  Embedding oxymorons and misnomers in documents, particularly in links, could be used as a search engine persuasion trick.

Some might argue whether the expression ‘loosely excluding’ might also qualify as a near oxymoron. For a list of oxymorons or near oxymorons, check these links:

http://www.ethanwiner.com/oxymoron.html

http://cs.bilgi.edu.tr/~hobbittr/critical/carlin_oxymorons.html

http://www.atlantamortgagegroup.com/oxymoronlist.htm

Best Algorithm Combinations for Speech Processing

15 Wednesday Oct 2008

Posted by egarcia in Machine Learning

I am happy to read that my Cosine Similarity tutorial is being referenced by Serguei Mokhov in his paper

“Study of best algorithm combinations for speech processing tasks in machine learning using median vs. mean clusters in MARF”

http://portal.acm.org/citation.cfm?id=1370262

The paper appears in the ACM International Conference Proceeding Series, Vol. 290, Proceedings of the 2008 C3S2E conference, and is a great read.

Jaxer, a new playground for kids and script kiddies

14 Tuesday Oct 2008

Posted by egarcia in Programming

We recently featured Jaxer, an end-to-end AJAX server. According to Ian Selby, kids are now discovering how many great things they can do with it.

http://www.gen-x-design.com/categories/jaxer/

Jaxer is some of the best news I have heard this year in the programming field. We mentioned this server platform before in
http://irthoughts.wordpress.com/2008/09/03/more-good-reasons-for-learning-javascript/

Script kiddies, spammers, and viral marketers, take note.

Getting Ready for AIRWeb2009

13 Monday Oct 2008

Posted by egarcia in Conferences, Hacking, Newsletters, Spam

For the last few years I have served as a program committee (PC) member of AIRWeb. I just received and accepted an invitation to be a PC member for AIRWeb 2009.

For those of you not familiar with it, the International Workshop on Adversarial Information Retrieval on the Web (AIRWeb, http://airweb.cse.lehigh.edu/) has been held four times, in conjunction with WWW’05, SIGIR’06, WWW’07, and WWW’08.

Topics discussed at the workshops include all forms of search engine spamming and hacking practices. SEO spamming practices are exposed and countermeasures are tested. It is a lot of fun examining in advance manuscripts describing these malicious practices, months before the accepted papers hit the mainstream.

Incidentally, the next issue of the IR Watch newsletter features Fraudulent Web Analytics, an article on adversarial techniques. We expose several practices spammers and hackers use to produce fake analytics and to defraud advertisers.

Zeeker: A Topic-Based Search Engine

07 Tuesday Oct 2008

Posted by egarcia in Theses

I came across a 2007 graduate thesis that describes the Zeeker Search Engine. It is accessible at

http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/5502/pdf/imm5502.pdf

I am happy to learn it quotes my article The Keyword Density of Non Sense and even republishes two of my images from that article on pages 9 and 10. While it mentions this author and the article’s title, no reference was included in the body or reference sections of the thesis, so readers would need to search the Web to find the article.

Fortunately, the thesis mentions the importance of document linearization, a topic we discussed in the ‘KD Non-Sense’ article and one that many SEOs still don’t get.

The thesis is worth a read and explains in easy terms the Spherical K-Means and NMF algorithms. I believe it deserves to be nominated in the next issue of IRW. Read the thesis and let me know what you think.

A temporary holding page is available at

http://fredmus.com/

The music version of Zeeker is available at

http://muzeeker.com/Index.aspx

Outstanding Graduate Theses

06 Monday Oct 2008

Posted by egarcia in Theses

The following have been nominated in IRW as Outstanding Graduate Theses:

Identification of Saudi Arabian License Plates

http://library.kfupm.edu.sa/lib-downloads/A1Y685.pdf

A Language-Based Approach to Categorical Analysis

http://alumni.media.mit.edu/~cameron/cv/pubs/01-thesis.pdf

Just in time Information Retrieval

http://www.bradleyrhodes.com/Papers/rhodes-phd-JITIR.pdf
