IR Thoughts

~ Thoughts on Information Retrieval, Search Engines, Data Mining, and Science & Engineering

Monthly Archives: August 2010

Redirection Harvesting

31 Tuesday Aug 2010

Posted by egarcia in Newsletters, Spam

Here in the Caribbean we are “surviving” Hurricane Earl, so the IRW issue has been delayed. This and the upcoming two issues will be a giveaway about different types of inverted index architectures and fast indexing techniques. This is what I plan to cover:

Part One:    Inverted Index Types

Part Two:    Fast Indexing Techniques

Part Three:  Fast Posting Lists Intersecting & Sharding

The QA section will cover redirection harvesting. I’m now going to do something not done before: releasing the QA section in advance, so those not subscribed to IRW will realize what they are missing.

Q: What is Redirection Harvesting?

A: Redirection harvesting is a phishing technique wherein a hacker or spammer identifies a trusted site that redirects users to specific pages by appending name-value commands to the redirection mechanism.

The mechanism is often a form or a URL without a security layer for filtering appended URLs. The idea is to replace the landing URL with the hacker’s or spammer’s URL, which is often obfuscated.

Although it no longer works, the best-known example of this involved eBay. For details, check http://www.google.com/search?q=ebay+redirections

The URL mechanism abused was

http://cgi4.ebay.com/ws/eBayISAPI.dll?MfcISAPICommand=RedirectToDomain&DomainUrl=obfuscatedURLgoeshere

Note that a trusted site (eBay) is the one doing the redirection.

Naïve or unaware users receiving an email with such a doctored URL might think that it belongs to eBay and that it will redirect to a page within eBay, when in fact it takes them to a malicious page. Once there, users are exposed to all kinds of attacks.

Many large and popular sites, including educational and government sites, are still guilty of allowing this to happen. The lesson here is that redirection mechanisms without URL filtering layers can and will be abused.
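To make the idea of a URL filtering layer concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the host names in ALLOWED_HOSTS, the fallback URL, and the function names are hypothetical, and a production filter would need more checks than these.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the only hosts this site agrees to redirect to.
ALLOWED_HOSTS = {"www.example.com", "shop.example.com"}


def is_safe_redirect(target: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    # Reject backslashes and control characters, which some browsers
    # normalize in ways that can defeat a naive host check.
    if "\\" in target or any(ord(ch) < 0x20 for ch in target):
        return False
    parts = urlparse(target)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname  # already lowercased; excludes user info and port
    return host is not None and host in ALLOWED_HOSTS


def resolve_redirect(target: str, fallback: str = "https://www.example.com/") -> str:
    """Return the requested landing URL if it passes the filter, else a fallback."""
    return target if is_safe_redirect(target) else fallback


if __name__ == "__main__":
    print(resolve_redirect("https://shop.example.com/item?id=1"))
    print(resolve_redirect("http://evil.example.net/login"))
```

The design choice is to allow only known hosts rather than to block known bad ones, since obfuscated URLs are easy to vary and hard to enumerate.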

Matrix Algebra for Search Marketing

23 Monday Aug 2010

Posted by egarcia in IR Quizzes, Latent Semantic Indexing, Search Engines Architecture Course

Today I feel like giving away a little quiz material on applied linear algebra. The topic is relevant these days, when some misleading SEOs are playing the we-do-“science” game (quack “science”, after all).

The following is taken from the Search Engines Architecture grad course I lectured back in 2008. I’m providing only one exercise with multiple parts. The quiz with answers might be a great topic for an IRW issue.

1.1 A search engine has three types of revenue channels: pay-per-click (PPC), pay-per-placement (PPP), and pay-for-conversion (PFC). In quarter 1, the revenues, in millions of dollars, were 20, 4, and 9, respectively. In quarter 2, PPC revenues were 20% lower, PPP revenues doubled, and PFC revenues remained constant.

1.1.1 Write a matrix M1 expressing the revenue and quarter vectors for the first two quarters.

1.1.2 If the goal in quarter 3 is to increase by 20% all revenues earned in quarter 2, update M1 so it reflects such a goal as a new matrix M2.

1.1.3 If the goal in quarter 4 is to meet, for each channel, the average revenue of the previous quarters, update M2 so that it reflects that goal as a new matrix M3.

1.1.4 Express the above quarters as column unit vectors. Inspecting either rows or columns, construct a nearest neighbor similarity matrix Mnn and construct scalar clusters of quarters. Ignore cosine similarity deviations of 0.02 units or less. How similar are the quarters?

Have fun.
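For readers who want to check their arithmetic on a computer, the following Python sketch sets up the first three quarters only. It is one possible reading of the exercise, not the official answer key: the layout (rows as channels, columns as quarters) and all variable names are my assumptions for illustration, and quarter 4, M3, and the clustering are left to the reader.

```python
import math

# Rows are channels (PPC, PPP, PFC); columns are quarters. This layout is an
# assumption; the exercise does not prescribe it.
q1 = [20.0, 4.0, 9.0]                    # quarter 1, millions of dollars
q2 = [q1[0] * 0.8, q1[1] * 2.0, q1[2]]   # PPC down 20%, PPP doubled, PFC flat
q3 = [1.2 * x for x in q2]               # goal: 20% above quarter 2

M1 = [[a, b] for a, b in zip(q1, q2)]         # 3 x 2 matrix
M2 = [row + [c] for row, c in zip(M1, q3)]    # 3 x 3 matrix


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


if __name__ == "__main__":
    quarters = {"Q1": q1, "Q2": q2, "Q3": q3}
    for a in quarters:
        print(a, [round(cosine(quarters[a], quarters[b]), 3) for b in quarters])
```

Because quarter 3 is a scalar multiple of quarter 2 under this reading, their unit vectors coincide and their cosine similarity is 1 up to rounding.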

Understanding Fisher’s Z Transformations

20 Friday Aug 2010

Posted by egarcia in IR Tutorials, Quack Science, SEO Myths

12-20-2018 Update:

Not all researchers know that score-to-rank transformations can change the sampling distribution of a statistic (e.g. a correlation coefficient) and that Fisher transformations are sensitive to normality violations. Combining both types of transformations (scores to ranks and then applying Fisher Transformations) is a recipe for a statistical disaster.

See https://irthoughts.wordpress.com/2018/11/01/on-the-non-additivity-of-correlation-coefficients-part-3-the-bias-nature-of-correlation-coefficients/

 

9-3-2016 Update:

One of the best-known techniques for transforming correlation coefficient (r) values into weighted additive quantities is the r-to-Z transformation due to Fisher.

Fisher’s r-to-Z transformation is an elementary transcendental function called the inverse hyperbolic tangent function. The reverse, a Z-to-r transformation, is therefore a hyperbolic tangent function.

On Windows computers, these functions are built into the scientific calculator program, which is accessible by navigating to Start > All Programs > Accessories > Calculator. Microsoft Excel also has them built in as the ATANH and TANH functions.

Fisher’s r-to-Z transformation is applicable only to bivariate normal distributions; i.e., when the paired (x, y) variables both describe bell-shaped curves. Nontrivial errors arise if one of the variables is not normally distributed.
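For those who prefer code to a calculator, the two transformations can be sketched in Python with the standard math module. The function names r_to_z and z_to_r are mine, chosen for illustration. Note that the code does not check the bivariate normality restriction; that is still up to you.

```python
import math


def r_to_z(r: float) -> float:
    """Fisher's r-to-Z transformation: the inverse hyperbolic tangent of r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)  # same as 0.5 * ln((1 + r) / (1 - r))


def z_to_r(z: float) -> float:
    """The reverse, Z-to-r transformation: the hyperbolic tangent of Z."""
    return math.tanh(z)


if __name__ == "__main__":
    z = r_to_z(0.5)
    print(z, z_to_r(z))
```

The ValueError guard is there because the inverse hyperbolic tangent is undefined at r = 1 and r = -1.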

5-22-2016 Update: We have developed a tool for easily computing these transformations and explaining the bivariate normality restriction.

The tool is available at

http://www.minerazzi.com/tools/fisher/transformations.php

Beware of Sloppy Calculations

Correlation coefficient arithmetic averages are not computable directly from individual values.

Indeed, it is not valid to add, subtract, average, or take standard deviations of raw r values.
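To see the difference in numbers, here is a small Python sketch that compares a naive arithmetic mean of r values with an average taken through Fisher's Z. The weights n - 3 follow the usual textbook convention for Fisher's Z (its variance is approximately 1/(n - 3)); the function name, the r values, and the sample sizes are made up for illustration, and the bivariate normality restriction still applies.

```python
import math


def mean_correlation(rs, ns):
    """Average correlations through Fisher's Z, weighting each Z by n - 3."""
    if len(rs) != len(ns):
        raise ValueError("rs and ns must have the same length")
    weights = [n - 3 for n in ns]
    if any(w <= 0 for w in weights):
        raise ValueError("every sample size must exceed 3")
    z_bar = sum(w * math.atanh(r) for r, w in zip(rs, weights)) / sum(weights)
    return math.tanh(z_bar)  # transform the mean Z back to the r scale


# Hypothetical data: three correlations, each from a sample of 23 pairs.
rs = [0.2, 0.5, 0.8]
ns = [23, 23, 23]

naive = sum(rs) / len(rs)          # the sloppy calculation
pooled = mean_correlation(rs, ns)  # the average taken through Fisher's Z

if __name__ == "__main__":
    print("naive mean of r:", naive)
    print("mean via Fisher Z:", pooled)
```

With these made-up values the two results differ, and the gap grows as the individual r values move toward the extremes.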

Unfortunately, some researchers with a limited knowledge of statistics have published papers containing such gross errors. What is worse, reviewers of those papers either are not statisticians or have been lazy enough to overlook the issue, leading graduate students and postdocs into error.

Search marketers are also buying into the error. One example is the SEOs from SeoMOZ (now MOZ) promoting quack “science” and sloppy “statistical studies”. If you are an SEO and still want to believe their snake-oil marketing, that’s up to you.

On trusting search engine stemming and matching results

12 Thursday Aug 2010

Posted by egarcia in Data Mining, Queries

If you believe in or care about this, then the following resources might interest you.

On stemming and counting search results

http://jis.sagepub.com/content/early/2009/05/28/1363459309336801

Our results indicate that Google uses a document-based algorithm for stemming. It evaluates each document separately and makes a decision to index or not for the conflated forms of the words it has. It indexes documents only for word forms that are semantically strongly correlated. While it indexes documents for singulars and plurals frequently, it rarely indexes documents for word forms with the postfixes of -able or -tively.

http://jis.sagepub.com/content/35/4/469.short

This study investigates the accuracy of search engine hit counts for search queries. We investigate the accuracy of hit counts for Google, Yahoo and Microsoft Live Search, and the accuracy of single and multiple term queries. In addition, we investigate the consistency of hit count estimates for 15 days. The results show that all three provide estimates for the number of matching documents and the estimation patterns of their counting algorithms differ greatly. The accuracy of hit counts for multiple word queries has not been studied before. The results of our study show that the number of words in queries affects the accuracy of estimations significantly. The percentages of accurate hit count estimations are reduced almost by half when going from single word to two word query tests in all three search engines. With the increase in the number of query words, the error in estimation increases and the number of accurate estimations decreases.

http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf

 

This article reviews the rewards and limitations of either acquiring Web content and processing it into a static corpus or else accessing it directly as a dynamic corpus, a distinction captured in or / as corpus the process it surveys typical applications of such data to both academic analysis and real-world.

 

http://gplsi.dlsi.ua.es/congresos/qwe10/fitxers/QWE10_Funahashi.pdf

Abstract. In this paper, we investigate the trustworthiness of search engines’ hit counts, numbers returned as search result counts. Since many studies adopt search engines’ hit counts to estimate the popularity of input queries, the reliability of hit counts is indispensable for archiving trustworthy studies. However, hit counts are unreliable because they change, when a user clicks the “Search” button more than once or clicks the “Next” button on the search results page, or when a user queries the same term on separate days. In this paper, we analyze the characteristics of hit count transition by gathering various types of hit counts over two months by using 10,000 queries. The results of our study show that the hit counts with the largest search offset just before search engines adjust their hit counts are the most reliable. Moreover, hit counts are the most reliable when they are consistent over approximately a week.

For those who still believe in counting search results, despite the evidence from the above articles:

http://www2006.org/programme/item.php?id=3047

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine’s index using only the search engine’s public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding “weight”, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine’s index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.
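The role of the weights mentioned in the abstract above can be illustrated with a toy rejection sampling sketch in Python. To be clear, this is not the paper's algorithm: the three documents, their selection probabilities, and the function name are hypothetical, and a real search engine index does not hand out these probabilities so conveniently.

```python
import random
from collections import Counter

# Hypothetical toy "index": each document and the known probability with
# which a biased sampler returns it (the "weight" attached to each sample).
proposal = {"doc_a": 0.6, "doc_b": 0.3, "doc_c": 0.1}


def near_uniform_sample(n_draws, rng):
    """Thin biased draws by rejection so the kept documents are near-uniform."""
    docs = list(proposal)
    probs = [proposal[d] for d in docs]
    p_min = min(probs)
    accepted = []
    for doc in rng.choices(docs, weights=probs, k=n_draws):
        # Accept with probability p_min / p(doc): the more over-sampled a
        # document is, the more of its draws are rejected.
        if rng.random() < p_min / proposal[doc]:
            accepted.append(doc)
    return accepted


counts = Counter(near_uniform_sample(60_000, random.Random(0)))

if __name__ == "__main__":
    print(counts)
```

Each document is expected to survive about 6,000 times out of 60,000 biased draws, so the accepted counts should come out roughly equal even though the sampler favors doc_a six to one over doc_c.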
 

Great Resources on NonNegative Matrix Factorization (NMF)

06 Friday Aug 2010

Posted by egarcia in Data Mining

Here are some great resources on NMF or NonNegative Matrix Factorization. I might put out my own tutorial on the subject, if I can find the time to do so.

Tutorial:

 http://spinner.cofc.edu/~langvillea/NISS-NMF.pdf?referrer=webcluster&

Research Papers:

http://research.microsoft.com/pubs/119077/DNMF.pdf

http://www.inma.ucl.ac.be/publi/441048.pdf

http://www.bsp.brain.riken.jp/publications/2008/MLSP-08-Cichocki-Phan-fin_corrected_CesarV1.pdf

http://lgm.fri.uni-lj.si/matic/clanki/smc2009.pdf

Will NMF be the next stop of the usual snake-oil marketers, SEO quack “science” sellers, and purveyors of falsehood? (Who said that last expression before? One crook that is now becoming one… ha, ha.)

PS: I’ve added the following resources:

http://hebb.mit.edu/people/seung/papers/nmfconverge.pdf

http://spinner.cofc.edu/~langvillea/SIAMSEAS-NMF.pdf

http://www.csie.ntu.edu.tw/~cjlin/nmf/index.html

http://www.stanford.edu/~vcs/papers/NMFCDP.pdf
