IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: August 2010

Redirection Harvesting

31 Tuesday Aug 2010

Posted by egarcia in Newsletters, Spam

Here in the Caribbean we are “surviving” hurricane Earl, so the IRW issue has been delayed. This and the upcoming two issues will be a giveaway about different types of inverted index architectures and fast indexing techniques. This is what I plan to cover:

Part One:    Inverted Index Types

Part Two:    Fast Indexing Techniques

Part Three:  Fast Posting Lists Intersecting & Sharding

The QA section will cover redirection harvesting. I’m now going to do something I have not done before: release the QA section in advance, so those not subscribed to IRW will realize what they are missing.

Q: What is Redirection Harvesting?

A: Redirection harvesting is a phishing technique wherein a hacker or spammer identifies a trusted site that redirects users to specific pages by appending name-value commands to the redirection mechanism.

The mechanism is often a form or a URL without a security layer for filtering appended URLs. The idea is to replace the landing URL with the hacker's or spammer’s URL, which is often obfuscated.

Although it no longer works, the best-known example of this involved eBay. For details, check http://www.google.com/search?q=ebay+redirections

The URL mechanism abused was

http://cgi4.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=obfuscatedURLgoeshere

Note that a trusted site (eBay) is the one doing the redirection.

Naïve or unaware users receiving an email with such doctored URLs might think the links belong to eBay and will redirect to a page within eBay, when in fact they take users to a malicious page. Once there, users are exposed to all kinds of attacks.

Many large and popular sites, including educational and government sites, are still guilty of allowing this to happen. The lesson here is that redirection mechanisms without URL filtering layers can and will be abused.
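To make the idea of a URL filtering layer concrete, here is a minimal sketch of one, assuming the site keeps an allowlist of hosts it is willing to redirect to. The host names, the function names, and the fallback URL are all hypothetical; this is not how eBay or any other site actually implemented its fix.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts this site agrees to redirect to.
ALLOWED_HOSTS = {"www.example.com", "pages.example.com"}


def is_safe_redirect(target: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    # Reject backslashes and control characters, which are common obfuscation tricks.
    if any(ch in target for ch in "\\\r\n\t"):
        return False
    try:
        parts = urlsplit(target)
    except ValueError:
        # Malformed input (for example, a broken IPv6 literal) is never trusted.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # hostname is lowercased and excludes any userinfo ("user@") and port.
    return parts.hostname in ALLOWED_HOSTS


def resolve_redirect(target: str, fallback: str = "https://www.example.com/") -> str:
    """Return the requested landing URL if it passes the filter, else a safe default."""
    return target if is_safe_redirect(target) else fallback


if __name__ == "__main__":
    for url in ("https://www.example.com/item?id=1",
                "http://www.example.com@evil.test/",
                "//evil.test/"):
        print(url, "->", resolve_redirect(url))
```

The check compares the parsed host, not a substring of the raw URL, so a doctored link that merely contains a trusted name (as in the second example) is sent to the fallback page.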

Matrix Algebra for Search Marketing

23 Monday Aug 2010

Posted by egarcia in IR Quizzes, Latent Semantic Indexing, Search Engines Architecture Course

Today I feel like giving away a little quiz material on applied linear algebra. The topic is relevant these days, when some misleading SEOs are playing the we-do-“science” game (quack “science”, after all).

The following is taken from the Search Engines Architecture grad course I taught back in 2008. I’m providing only one exercise with multiple parts. The quiz with answers might be a great topic for an IRW issue.

1.1 A search engine has three types of revenue channels: pay-per-click (PPC), pay-per-placement (PPP), and pay-for-conversion (PFC). In quarter 1, the revenues, in millions of dollars, were 20, 4, and 9, respectively. In quarter 2, PPC revenues were 20% less, PPP revenues doubled, and PFC revenues remained constant.

1.1.1 Write a matrix M1 expressing the revenue and quarter vectors for the first two quarters.

1.1.2 If the goal in quarter 3 is to increase by 20% all revenues earned in quarter 2, update M1 so it reflects such a goal as a new matrix M2.

1.1.3 If the goal in quarter 4 is to meet, for each channel, the average revenue of the previous quarters, update M2 such that it reflects that goal as a new matrix M3.

1.1.4 Express the above quarters as column unit vectors. Inspecting either rows or columns, construct a nearest neighbor similarity matrix Mnn and construct scalar clusters of quarters. Ignore cosine similarity deviations of 0.02 units or less. How similar are the quarters?
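For readers who want to check their arithmetic, here is a small sketch of the mechanics only, under one possible reading of the exercise: channels as rows, quarters as columns, and cosine similarity computed between quarter columns after scaling them to unit length. It covers just the first two quarters; the function names are illustrative and the remaining parts are left to the reader.

```python
import math

# Quarter 1 revenues in millions of dollars, in the order (PPC, PPP, PFC).
q1 = [20.0, 4.0, 9.0]
# Quarter 2: PPC down 20%, PPP doubled, PFC unchanged.
q2 = [q1[0] * 0.8, q1[1] * 2.0, q1[2]]

# Assumed layout: rows are channels, columns are quarters.
M1 = [list(row) for row in zip(q1, q2)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: the dot product of the two unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def similarity_matrix(columns):
    """Pairwise cosine similarities between the given column vectors."""
    return [[cosine(u, v) for v in columns] for u in columns]


if __name__ == "__main__":
    for row in M1:
        print(row)
    for row in similarity_matrix([q1, q2]):
        print([round(s, 4) for s in row])
```

Adding the quarter 3 and quarter 4 columns to the list passed to similarity_matrix gives the full matrix that part 1.1.4 asks about.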

Have fun.

Understanding Fisher’s Z Transformations

20 Friday Aug 2010

Posted by egarcia in SEO Myths, IR Tutorials, Quack Science

As mentioned in my Tutorial on Correlation Coefficients, the best known technique for transforming correlation coefficient (r) values into weighted additive quantities is the r-to-Z transformation due to Fisher.

Fisher’s r-to-Z transformation is an elementary transcendental function called the inverse hyperbolic tangent function. The reverse, a Z-to-r transformation, is therefore a hyperbolic tangent function.

On Windows computers, these functions are built into the scientific calculator program, which is accessible by navigating to Start > All Programs > Accessories > Calculator. Microsoft Excel also provides them as the ATANH and TANH functions.

These transformations are needed to compute a weighted mean correlation coefficient and for hypothesis testing. Note that averaged correlation coefficients are not computable directly from raw r values.

Indeed, it is not valid to add, subtract, average, or take standard deviations of raw r values.

Unfortunately, some researchers with a limited knowledge of statistics have published papers containing such gross errors. What is worse, reviewers of those papers are either not statisticians or have been careless enough to overlook the issue, leading graduate students and postdocs into error.

Search marketers are also buying into the error. One example is the SEOs from SEOmoz promoting quack “science” and sloppy “statistical studies”. If you are an SEO and still want to believe their snake oil marketing, that’s up to you.
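As a rough illustration of the averaging step, the sketch below converts each r to Z with the inverse hyperbolic tangent, Z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)), takes a weighted mean of the Z values, and converts that mean back with the hyperbolic tangent. The weights n - 3 (the inverse of the usual large-sample variance of Z) and the sample values are assumptions made for illustration; they are not taken from the tutorial.

```python
import math


def weighted_mean_r(samples):
    """Weighted mean correlation from (r, n) pairs via Fisher's r-to-Z transformation.

    Each sample is weighted by n - 3, an assumed (though common) choice.
    """
    num = 0.0
    den = 0.0
    for r, n in samples:
        if not -1.0 < r < 1.0:
            raise ValueError("r must lie strictly between -1 and 1")
        if n <= 3:
            raise ValueError("each sample size must exceed 3")
        w = n - 3
        num += w * math.atanh(r)   # r-to-Z
        den += w
    return math.tanh(num / den)    # Z-to-r


if __name__ == "__main__":
    # Hypothetical correlation coefficients and sample sizes.
    data = [(0.30, 50), (0.60, 100), (0.90, 20)]
    naive = sum(r for r, _ in data) / len(data)
    print("naive mean of raw r:", round(naive, 4))
    print("Fisher weighted mean:", round(weighted_mean_r(data), 4))
```

Running it shows that the two numbers differ, which is the point: the arithmetic mean of raw r values is not the mean correlation.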

On trusting search engine stemming and matching results

12 Thursday Aug 2010

Posted by egarcia in Data Mining, Queries

If you believe in or care about this, then the following resources might interest you.

On stemming and counting search results

http://jis.sagepub.com/content/early/2009/05/28/1363459309336801

Our results indicate that Google uses a document-based algorithm for stemming. It evaluates each document separately and makes a decision to index or not for the conflated forms of the words it has. It indexes documents only for word forms that are semantically strongly correlated. While it indexes documents for singulars and plurals frequently, it rarely indexes documents for word forms with the postfixes of -able or -tively.

http://jis.sagepub.com/content/35/4/469.short

This study investigates the accuracy of search engine hit counts for search queries. We investigate the accuracy of hit counts for Google, Yahoo and Microsoft Live Search, and the accuracy of single and multiple term queries. In addition, we investigate the consistency of hit count estimates for 15 days. The results show that all three provide estimates for the number of matching documents and the estimation patterns of their counting algorithms differ greatly. The accuracy of hit counts for multiple word queries has not been studied before. The results of our study show that the number of words in queries affects the accuracy of estimations significantly. The percentages of accurate hit count estimations are reduced almost by half when going from single word to two word query tests in all three search engines. With the increase in the number of query words, the error in estimation increases and the number of accurate estimations decreases.

http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf

 

This article reviews the rewards and limitations of either acquiring Web content and processing it into a static corpus or else accessing it directly as a dynamic corpus, a distinction captured in or / as corpus. In the process it surveys typical applications of such data to both academic analysis and real-world.

 

http://gplsi.dlsi.ua.es/congresos/qwe10/fitxers/QWE10_Funahashi.pdf

Abstract. In this paper, we investigate the trustworthiness of search engines’ hit counts, numbers returned as search result counts. Since many studies adopt search engines’ hit counts to estimate the popularity of input queries, the reliability of hit counts is indispensable for archiving trustworthy studies. However, hit counts are unreliable because they change, when a user clicks the “Search” button more than once or clicks the “Next” button on the search results page, or when a user queries the same term on separate days. In this paper, we analyze the characteristics of hit count transition by gathering various types of hit counts over two months by using 10,000 queries. The results of our study show that the hit counts with the largest search offset just before search engines adjust their hit counts are the most reliable. Moreover, hit counts are the most reliable when they are consistent over approximately a week.

For those who still believe in counting search results, despite the evidence from the above articles:

http://www2006.org/programme/item.php?id=3047

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine’s index using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding “weight”, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine’s index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
 

Great Resources on NonNegative Matrix Factorization (NMF)

06 Friday Aug 2010

Posted by egarcia in Data Mining

Here are some great resources on NMF or NonNegative Matrix Factorization. I might put out my own tutorial on the subject, if I can find the time to do so.

Tutorial:

 http://spinner.cofc.edu/~langvillea/NISS-NMF.pdf?referrer=webcluster&

Research Papers:

http://research.microsoft.com/pubs/119077/DNMF.pdf

http://www.inma.ucl.ac.be/publi/441048.pdf

http://www.bsp.brain.riken.jp/publications/2008/MLSP-08-Cichocki-Phan-fin_corrected_CesarV1.pdf

http://lgm.fri.uni-lj.si/matic/clanki/smc2009.pdf

Will NMF be the next stop of the usual snake oil marketers, SEO quack “science” sellers, and purveyors of falsehood? (Who said that last expression before? One crook that is now becoming one… ha, ha.)

PS: I’ve added the following resources:

http://hebb.mit.edu/people/seung/papers/nmfconverge.pdf

http://spinner.cofc.edu/~langvillea/SIAMSEAS-NMF.pdf

http://www.csie.ntu.edu.tw/~cjlin/nmf/index.html

http://www.stanford.edu/~vcs/papers/NMFCDP.pdf
