IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: May 2009

On Term Repetition and Local Models

27 Wednesday May 2009

Posted by egarcia in SEO Myths, Vector Space Models

≈ 6 Comments

I’m putting together a piece on several local term weight models. It should be ready in a few weeks.

It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.

The article touches on some aspects of the problem of trusting models that lack attenuation. Here is one snippet on the subject:

<last nail in KD coffin  style=”intensity:100%;”>

“It should be stressed that term repetition does not necessarily satisfy users’ queries, nor is it evidence of:

Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.

Aboutness (A); e.g., that the document is x times more about the term.

Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.

Relevance (R); i.e., that a document repeating a term x times is x times more relevant.

Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks attenuation.”

</last nail in KD coffin>
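
To make the attenuation point concrete, here is a minimal Python sketch comparing a raw count with two commonly used attenuated local weights. The logarithmic and saturating forms are textbook examples chosen for illustration; they are not necessarily the models derived in the paper, and the function names and the parameter k are my own assumptions.

```python
import math

def raw_tf(f):
    """Unattenuated local weight: grows linearly with the raw count f."""
    return float(f)

def log_tf(f):
    """Log-attenuated local weight: 1 + ln(f) for f > 0, otherwise 0."""
    return 1.0 + math.log(f) if f > 0 else 0.0

def saturating_tf(f, k=1.0):
    """Saturating local weight f / (f + k); always below 1.
    k is an illustrative parameter, not a value taken from the post."""
    return f / (f + k)

# Compare how each weight reacts as a term is repeated more often.
for f in (1, 2, 10, 100):
    print(f, raw_tf(f), round(log_tf(f), 3), round(saturating_tf(f), 3))
```

With the raw count, a term repeated 100 times weighs 100 times more than a term that occurs once. With the logarithmic form the ratio is under 6, and the saturating form never exceeds 1.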

Defining Data Mining and Database

25 Monday May 2009

Posted by egarcia in Data Mining

What is the (^H^H^H) best definition for data mining and database? It depends on who you ask and in which context.

According to Section 126 of the USA Patriot Act,

(1) DATA-MINING- The term `data-mining’ means a query or search or other analysis of one or more electronic databases, where

(A) at least one of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement;

(B) the search does not use personal identifiers of a specific individual or does not utilize inputs that appear on their face to identify or be associated with a specified individual to acquire information; and

(C) a department or agency of the Federal Government is conducting the query or search or other analysis to find a pattern indicating terrorist or other criminal activity.

(2) DATABASE- The term `database’ does not include telephone directories, information publicly available via the Internet or available by any other means to any member of the public, any databases maintained, operated, or controlled by a State, local, or tribal government (such as a State motor vehicle database), or databases of judicial and administrative opinions.

Asking the government or a KDDM researcher this question, and then using LSI to cluster the answers, can be a futile exercise.

It is like asking President Obama or Vice President Cheney to agree on: “What is Torture?”

When Noise is a Good Thing.

22 Friday May 2009

Posted by egarcia in Latent Semantic Indexing

Today, a reader (name removed to protect confidentiality) asked me:

My name is **** ****. I working as a junior research fellow in a project in India. I red the SVD techniques from the web page http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html#right-eigenvectors. I found it is quite satisfactory for me. Now I can understand how SVD works. But I have a query as follows.

query:

As mentioned in this tutorial that we have arrange these eigen-values in descending order. Cold you please tell me if I put these values in ascending order or arbitrary what will be wrong with the SVD.

Looking forward your early kind response.

Thanking you.

With best regards.

*******

My answer follows.

It depends on what you are trying to address.

SVD is used to identify singular values, which are interpreted as dimensions. When SVD is used as a dimensionality reduction technique, the largest N singular values are normally retained, so retaining the smaller singular values serves no purpose there. The largest singular values capture most of the information in the original data set, and keeping them is therefore a noise minimization approach.

If the retention criterion is reversed (the smaller singular values are retained), the noisier dimensions are kept, and the reconstructed matrix will be a matrix of the hidden (latent) data noise. This is a noise maximization approach.

If the retention criterion is based on a random selection, the resultant reconstructed matrix might be one representing a data structure with randomized noise.

These scenarios depend on the original data under examination. 
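
The retention scenarios above can be sketched with NumPy. This is only an illustration under my own assumptions: the matrix is a made-up rank-2 "signal" plus small random noise, and `reconstruct` is a hypothetical helper, not code from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a rank-2 signal matrix plus a small amount of noise.
signal = (np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0])
          + np.outer([1.0, -1.0, 1.0, -1.0], [0.0, 2.0, 0.0]))
A = signal + 0.01 * rng.standard_normal(signal.shape)

# numpy returns the singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def reconstruct(idx):
    """Rebuild a matrix from the singular triplets listed in idx."""
    return (U[:, idx] * s[idx]) @ Vt[idx, :]

top = reconstruct([0, 1])   # keep the two largest singular values
bottom = reconstruct([2])   # keep only the smallest singular value

err_top = np.linalg.norm(A - top)        # Frobenius error, equals s[2]
err_bottom = np.linalg.norm(A - bottom)  # error when only the smallest is kept

print("singular values:", s)
print("error keeping the largest two:", err_top)
print("error keeping the smallest one:", err_bottom)

# Listing all triplets in a different order does not change the product.
print(np.allclose(reconstruct([2, 1, 0]), A))
```

Note that reordering the singular values together with their matching singular vectors leaves the full product unchanged. Descending order is a convention that makes truncation convenient; what changes the result is which singular values are retained.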

In image compression, these approaches have already been explored. If the goal is a stability study and not just SVD dimensionality reduction, “the ratio between the highest singular value and the lowest singular value of the Jacobian matrix quantifies the spread of the Jacobian’s singular values, which in practice, reflects the extent of the solution’s instability with respect to small changes in the observation” (Horesh’s thesis).
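
As a rough illustration of that ratio (the condition number), here is a small NumPy sketch. The 2x2 matrix J and the vectors are made-up values standing in for a Jacobian and its observations; they do not come from the thesis.

```python
import numpy as np

# Hypothetical, nearly singular matrix standing in for a Jacobian.
J = np.array([[1.0, 1.0],
              [1.0, 1.0001]])

s = np.linalg.svd(J, compute_uv=False)  # singular values, descending
ratio = s[0] / s[-1]                    # largest over smallest singular value

# A small change in the observation b produces a large change in the solution x.
b = np.array([2.0, 2.0001])
b_perturbed = b + np.array([0.0, 0.0001])
x = np.linalg.solve(J, b)
x_perturbed = np.linalg.solve(J, b_perturbed)

print("ratio of singular values:", ratio)
print("solution:", x)
print("solution after a tiny perturbation:", x_perturbed)
```

The ratio matches what `np.linalg.cond(J)` reports for the 2-norm. Here a change of 0.0001 in one observation moves the solution from about (1, 1) to about (0, 2).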

Having said all that, we should not regard noise in a data set as something that must be discarded at all costs.

This is intimately linked with the so-called Inverse Problem. Incorporating noise and a priori SVD information can provide the complete information in a linear sense. Qianqian Fang has a beautiful PPT presentation “Look Closer to Inverse Problem” on the subject. If you want to visualize the MATRIX Problem, this presentation is for you.

I’m thinking of putting together a tutorial on the Singular Value Expansion (SVE) algorithm, if I ever find the time.

I hope this helps.

Ethical Hacking: An Oxymoron, a Misnomer, or Both?

18 Monday May 2009

Posted by egarcia in Hacking, Spam

According to a report from the British Computer Society (BCS) covering a Security Panel Strategic Forum, “ethical hacking” is an oxymoron.

The report highlights dos and don’ts when it comes to defining terms like “hacker”, “ethical hacking”, “penetration tester”, “white/black hats”, and derivative terms. These labels are frequently used in the IT industry. The report also underscores which terms should not be used by schools offering IT courses.

The problem with defining and redefining such labels is that there will always be others disagreeing with/circumventing said definitions.

For instance, in the December 1986 issue of MicroTimes, Bob Bickford wrote:

“A Hacker is any person who derives joy from discovering ways to circumvent limitations.”

If we accept this definition, then a person who doesn’t derive any joy from discovering ways to circumvent limitations is not a hacker. Similarly, a cheating spouse, an SEO, a spammer, a politician, a mobster, or a kid trying to get some candy from mom is a hacker.

I am taking this extreme, off-topic interpretation to illustrate the problem of semantics when it comes to defining things.

Whether you agree or disagree, partially or totally, with the report, it is a good read. It will certainly be a good piece for students planning to take my AIRWeb graduate course.

Google Accused of Conversion-Inflation Syndication Fraud

15 Friday May 2009

Posted by egarcia in Marketing Research

According to Ben Edelman, Google is engaged in a conversion-inflation syndication fraud.

These tactics are nothing new.

In the featured article of the November 2008 issue of IR Watch, “Fraudulent Web Analytics – Engineering the Fraud”, we covered how in-the-middle mechanisms are part of Web Analytic Frauds and Business Collusion Schemes.

As in man-in-the-middle attacks found in information security settings, the underlying goal is the same: the crafting of deceiving intermediary events.

Expect a PR damage-control campaign soon from the useful idiots/moles.

What is next? A class action lawsuit?

Still, I have a little taste of satisfaction in my mouth when crooks disguised as advertisers/search marketers are gamed. Gaming the gamers: life’s ironies!

IR Quiz: Matrices

13 Wednesday May 2009

Posted by egarcia in IR Quizzes

≈ 1 Comment

Explain and give an example of each of the following matrices used in IR:

1. Term-document occurrence matrix.

2. Term-term cooccurrence matrix.

3. Term-term correlation matrix.

4. Term-term similarity matrix.

5. Term-term coweights matrix.

6. Term-term distance matrix (*).

7. Covariance matrix (*).

 

(*) PS. I forgot to list these other matrices.

Vector Normalization with Excel – Part II

07 Thursday May 2009

Posted by egarcia in IR Tutorials, Newsletters

Back in March, we explained how to normalize column vectors with Excel. But, what about normalizing row vectors? This question is addressed in the current QA column of IRW. I think it might be useful sharing the answer with readers since many of these are students struggling with similar questions. So, here we go.

The following table emulates an Excel array consisting of three columns (A, B, and C) and six rows (1-6).

     A     B     C
1    1     2     3
2    4     5     6
3    7     8     9
4    0.27  0.53  0.80
5    0.46  0.57  0.68
6    0.50  0.57  0.65

Rows 1, 2, and 3 are row vectors. Rows 4, 5, and 6 are the corresponding normalized vectors, also known as unit vectors because their length is 1. To compute these, do as follows:

1. In cell A4, enter the formula =A1/(SQRT(SUMSQ($A1:$C1))). The result should be as given in this cell.

2. Copy this formula, select cells A5 and A6 and paste the formula in these.

3. Finally, copy cells A4 through A6 at once, select the remaining empty cells of the array (cells B4 through C6), and paste the formulas into them.
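
For readers who prefer to check the spreadsheet result outside Excel, the same row normalization can be sketched in Python. The function name `normalize_rows` is my own; the data are rows 1 through 3 of the table above.

```python
import math

def normalize_rows(rows):
    """Divide each row by its Euclidean length, mirroring
    =A1/(SQRT(SUMSQ($A1:$C1))) applied across a row."""
    unit_rows = []
    for row in rows:
        length = math.sqrt(sum(x * x for x in row))
        if length == 0:
            # Excel would show #DIV/0! for an all-zero row.
            raise ValueError("cannot normalize a zero vector")
        unit_rows.append([x / length for x in row])
    return unit_rows

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
unit_rows = normalize_rows(rows)

# Print each unit vector rounded to two decimals, as in rows 4 through 6.
for unit in unit_rows:
    print([round(x, 2) for x in unit])
```

Rounded to two decimals, the output agrees with rows 4, 5, and 6 of the table, and each resulting row has length 1.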

NSA/DHS Designates PUPR as a CAE

05 Tuesday May 2009

Posted by egarcia in Homeland Security, Newsletters

As blogged yesterday, the current issue of IRW should reach subscribers’ inboxes today. The Top CS Departments column features Polytechnic University of Puerto Rico, where I teach graduate courses. As mentioned a few days ago, PUPR has been designated a CAE. This is great news, and it is making a splash across academic centers in the U.S., the Caribbean region, and Latin America whose mission includes research relevant to homeland security.

The Associate Director for Computer Science, Dr. Alfredo Cruz, sent me an official announcement, which I am reproducing.

Polytechnic University of Puerto Rico (PUPR) is Designated National Center of Academic Excellence in Information Assurance Education by NSA and DHS

PUPR was recently designated as a National Center of Academic Excellence in Information Assurance Education (CAE/IAE) by the National Security Agency (NSA) and the Department of Homeland Security (DHS) on April 22, 2009. The goal of these centers is to reduce the vulnerability of the national information infrastructure by promoting higher education and research in Information Assurance (IA) and Security through the development of a growing number of professionals with IA expertise in various related disciplines. PUPR will be recognized as the first institution in Puerto Rico to be designated as a CAE/IAE on June 3, 2009 in Seattle, Washington. Dr. Alfredo Cruz from the Department of Electrical & Computer Engineering and Computer Science will be present to receive the designation. He is the Director of the Center of Information Assurance for Research and Education (CIARE) at PUPR. Dr. Cruz is the person responsible for this designation. PUPR is one of the very few Hispanic-serving institutions (HSIs) in the Nation to receive this designation, and is among the first 100 institutions nationwide; this is a very special recognition. This designation requires that the President of the United States send the Governor of Puerto Rico a certification that should be handed to the president of PUPR designating the Institution as a CAE/IAE at a National level. The Congress and all the respective Congressional Committees are also notified.

Some of the benefits of the CAE/IAE designation are:
• PUPR will receive formal recognition from the U.S. Government as well as opportunities for prestige and publicity for our role in securing the Nation’s information systems.
• This designation increases collaboration opportunities between designated and aspiring institutions at local and national levels. This includes internships, faculty and student exchange, research, and publications, among other activities.
• With this designation as a CAE/IAE PUPR can obtain scholarships that can help outstanding students to pursue graduate studies in IA, enabling them to work with the Federal Government or other federal institutions and agencies.
• PUPR can compete and benefit from proposal calls (RFP) that are specifically for designated CAE/IAE institutions. These proposals offer millions of dollars from the DoD, NSF, NSA and “Homeland Security”, among others, for research and infrastructure.
• Student scholarships offered under the NSF’s Scholarship for Service (SFS) program. The SFS scholarship offers the following:
–2-year scholarship, includes 8K stipend (12K for graduate students), plus tuition and nominal room and board expenses.
–Paid summer internship in a federal agency.
–Placement in federal government at the end of the scholarship period.

IRW: RIA Vulnerabilities

04 Monday May 2009

Posted by egarcia in Hacking, Newsletters, Spam

The current issue of IRW should reach subscribers’ inboxes tomorrow.

In this issue:

Featured article: RIA Vulnerabilities

This issue of the newsletter discusses how hackers might be exploiting Web vulnerabilities found in Rich Internet Applications (RIAs). As mentioned in our previous issue, some RIAs are based on Adobe technologies like Flash, Flex, or AIR. Some are designed to be run online or offline. Their rising popularity has attracted developers and marketers and, as expected, hackers and spammers.

QA: Excel Vector Normalization: How do I convert a row vector into a unit vector?
Who is Who in IR: C.J. van Rijsbergen
Top CS Departments: Polytechnic University of Puerto Rico
Historical Notes: ENIAC Computer
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…
