Archive for May, 2007

What is Data Mining?

May 31, 2007

What is Data Mining? Good question.

After a great one-week vacation away from the blog, it is good to be back. During my vacation I was asked to explain the difference between data mining and information retrieval, so here goes this post.

Here is a standard definition I wrote for a graduate course syllabus to be taught next fall at a local university:

(more…)

Being Quoted at University of Campinas, Brazil

May 24, 2007

More universities are quoting our tutorials. I’m happy to learn that these are read even in the huge and great nation that is Brazil.

Today we found out that Prof. Wu, Shin - Ting, of the Department of Computer Engineering and Industrial Automation, School of Electrical and Computer Engineering, State University of Campinas, Sao Paulo, Brazil, who teaches EA978 Graphic Information Systems, is referencing our Matrix Tutorial 3: Eigenvalues and Eigenvectors.

(more…)

The Problem with Translations Software

May 23, 2007

The simplest litmus test I have come up with to know if translation software is free from flaws consists of checking its output under a recursive translation between two languages. I like to call this “iterlation”. I like Spanish and English, so these are my preferred languages.
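The round-trip idea can be sketched in a few lines. This is only an illustration under stated assumptions: `en_to_es` and `es_to_en` are hypothetical word-for-word stand-ins for real translation software, and `iterlate` is a name made up for this sketch.

```python
from typing import Callable, List

def iterlate(text: str,
             forward: Callable[[str], str],
             backward: Callable[[str], str],
             max_steps: int = 5) -> List[str]:
    """Round-trip text through two translators, keeping the result of each step."""
    history = [text]
    for _ in range(max_steps):
        round_trip = backward(forward(history[-1]))
        history.append(round_trip)
        # Stop once a round trip no longer changes the text.
        if round_trip == history[-2]:
            break
    return history

# Toy dictionaries (hypothetical): two English words collapse into one Spanish word.
EN_TO_ES = {"the": "el", "house": "casa", "home": "casa", "is": "es", "big": "grande"}
ES_TO_EN = {"el": "the", "casa": "house", "es": "is", "grande": "big"}

def en_to_es(text: str) -> str:
    return " ".join(EN_TO_ES.get(word, word) for word in text.split())

def es_to_en(text: str) -> str:
    return " ".join(ES_TO_EN.get(word, word) for word in text.split())

history = iterlate("the home is big", en_to_es, es_to_en)
drifted = history[0] != history[-1]  # True when the round trip altered the text
```

With these toy translators the text drifts from "home" to "house" after one round trip and then stabilizes, which is the kind of flaw the litmus test is meant to expose.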

To iterlate text I normally proceed as follows. Defining x as an iteration step, do this:

(more…)

Minerazzi: What in a name?

May 22, 2007

I’m building a client-side suite of text mining tools for extracting intelligence from text files, Web pages, and email documents. It comes in four versions: basic, intermediate, advanced, and pro. The basic version provides the following reports:

(more…)

Eigenvectors and Reggaeton Music = Eiggaeton

May 21, 2007

Eigenvectors and eigenvalues come in pairs; that is why we use the term eigenpair. Some have asked me about practical applications of eigenpairs, so here goes this post.

Did you know the connection between eigenvectors and Reggaeton Music (or music in general)? How about eigenvectors and bridges, car designers, speakers, architecture, or oil companies?
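As a reminder of what an eigenpair is, here is a minimal sketch that computes both eigenpairs of a small symmetric matrix and checks that each pair satisfies A v = λ v. The matrix values and the helper names (`eigenpairs_2x2_symmetric`, `mat_vec`) are illustrative assumptions, and the closed form below only covers the 2x2 symmetric case with a nonzero off-diagonal entry.

```python
import math

def eigenpairs_2x2_symmetric(a, b, d):
    """Return [(eigenvalue, eigenvector), ...] for [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (A - lam*I) v = 0 when b != 0.
        pairs.append((lam, (b, lam - a)))
    return pairs

def mat_vec(a, b, d, v):
    """Multiply [[a, b], [b, d]] by the vector v."""
    return (a * v[0] + b * v[1], b * v[0] + d * v[1])

# Example matrix [[2, 1], [1, 2]] (arbitrary choice for illustration).
pairs = eigenpairs_2x2_symmetric(2.0, 1.0, 2.0)
for lam, v in pairs:
    Av = mat_vec(2.0, 1.0, 2.0, v)
    # Each eigenvalue goes with its own eigenvector: A v equals lam * v.
    assert all(math.isclose(x, lam * y) for x, y in zip(Av, v))
```

Each eigenvalue is tied to its own eigenvector; swapping them between pairs breaks the identity, which is why it makes sense to speak of eigenpairs.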

(more…)

Our Tutorials, Required Readings at University of Maryland

May 18, 2007

Yan Qu over at the College of Information Studies, University of Maryland, teaches the graduate course

LBSC 670 Information Structure

For the course Qu selected as required readings our tutorials:

(more…)

LSA: A Goldmine for Educators and Curriculum Developers

May 17, 2007

Marco Kalz, M.A., over at the Educational Technology Expertise Centre, Open University of the Netherlands, informed me months ago that the university was organizing the 1st European Workshop on LSA in Technology-Enhanced Learning. Marco is part of the Scientific Committee responsible for organizing the event and co-author of the workshop proceedings.

It is my pleasure to inform our readers that the event was a complete success. I will ask Marco for additional inside information, to perhaps include in our next issue of IRW Newsletter.

(more…)

Upcoming IPAM Workshops

May 16, 2007

Dr. Mark Green, Director of the Institute for Pure and Applied Mathematics at UCLA (IPAM), informed me by email of the upcoming workshops IPAM is organizing. I met Dr. Green last year during a one-week workshop they organized (The Document Space Workshop).

I am listing below the new workshops relevant to search engines:

(more…)

The New Iteration of Mi Islita

May 15, 2007

Today I uploaded the new iteration of the Mi Islita.com site.

I’ve added or updated the following resource pages:

IR Thoughts Archives - A sample of posts from this blog, powered by a homemade AJAX reader and regexps.

IR Calls - A list of conferences and industry events we recommend attending.

IR Tutorials - Tutorials on Vector Space and LSI Models, Matrix Algebra, and more.

Educational Links - Graduate theses and research projects referencing Mi Islita.

Marketing Links - Search engine marketing articles referencing Mi Islita.

(more…)

Open Source Machine Learning Software

May 14, 2007

If you are an IR researcher looking for some open source software, this post is for you.

During the second day of the OJOBuscador Congress 2.0 held in Madrid, Spain (March 8, 9) I attended the IR with Usability track.

The first speaker was Dr. Carlos Castillo from Yahoo! Research Spain. He presented on IR with Adversarial and Web Spam.

Then the next speaker and IR practitioner, Jose Ramon-Perez Aguera, presented several open source software packages. I want to share with you a handy list, thanks to Jose’s presentation:

(more…)

Zoom in this Theme: The LSI Myth

May 11, 2007

A few days ago, Michael Duz had a great post, The LSI Myth, wherein he describes the nonsense promoted by snake oil SEO marketers. He has a list of common taglines used by these people:

(more…)

Intelligent Extraction of Information

May 10, 2007

Interested in Intelligent Extraction from Graphs and High Dimensional Data? Then watch videos or read papers from IPAM’s Graduate Summer School: Intelligent Extraction of Information from Graphs and High Dimensional Data, held July 11 - 29, 2005.

(more…)

Thesis: A Hybrid Knowledge-based/Content-based Recommender

May 10, 2007

Here is some great news:

1. I am getting ready for my presentation at the Intektel International Conference and Expo. On the second day of the conference I am presenting “The Impact of Search Engines in the Internet”.

2. Next week we have the ARIN (American Registry for Internet Numbers) Conference in Puerto Rico, and in June we have the 29th ICANN Conference, also in San Juan, PR. WOW!

3. Taschuk Morgan has written an excellent Honour Thesis which kindly references our tutorial on Cosine Similarity and Term Weights. Morgan writes:

(more…)

Thesis: Understanding LSI via the Term-Term Truncated Matrix

May 10, 2007

As we mentioned in IR Watch - The Newsletter (got a free subscription?), although LSI (LSA) itself is not first-order co-occurrence (see Prof. Tom Landauer: Introduction to Latent Semantic Analysis), a recent thesis from Regis Newo shows that high-order co-occurrence might be at the heart of LSI and is what makes the technique work. This 2005 thesis abstract on Understanding LSI via the Truncated Term-Term Matrix states:

(more…)

Thesis: Information Retrieval with Genetic Programming

May 10, 2007

Here is the 2002 master's thesis of Nir Oren, University of the Witwatersrand, Johannesburg:

Improving the effectiveness of information retrieval with genetic programming

where he proposes an interesting approach to IR using genetic algorithms. Part of his abstract states:

(more…)

Thesis: A Language-Based Approach to Categorical Analysis

May 10, 2007

I am finishing reading the 2001 master's thesis of Cameron Alexander Marlow, from MIT:

A Language-Based Approach to Categorical Analysis

where he proposed the use of Synchronic Imprints (SI) combined with LSI. Great thesis. Essentially, SI incorporates a spring model in which term frequencies are inversely proportional to their distances.

(more…)

Keyword Density Myth - The Devil’s Advocate

May 9, 2007

Those who promote keyword density (KW) myths are now claiming that search engines use KW as an inexpensive spam detection mechanism; i.e.,

if (KW = fij/lj > upper threshold value) {// raise the spam red flag }
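To make the claim concrete, here is a minimal sketch of the alleged filter. It assumes fij is the number of times term i occurs in document j and lj is the length of document j in words; the excerpt does not define them. The threshold of 0.2 and the function names are arbitrary choices for illustration, and nothing here suggests that search engines actually work this way.

```python
def keyword_density(term, words):
    """KW = f_ij / l_j: occurrences of the term divided by the document's word count."""
    if not words:
        return 0.0
    return words.count(term) / len(words)

def raises_spam_flag(term, text, upper_threshold=0.2):
    """Apply the claimed rule: flag the document when KW exceeds the upper threshold."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    return keyword_density(term.lower(), words) > upper_threshold

stuffed = "cheap pills cheap pills buy cheap pills now"
normal = "we sell cheap flights to madrid and many other cities"

stuffed_flag = raises_spam_flag("cheap", stuffed)  # 3 of 8 words
normal_flag = raises_spam_flag("cheap", normal)    # 1 of 10 words
```

Note that the ratio knows nothing about where the term occurs or what surrounds it, so two very different documents can share the same KW value.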

(more…)

IR Watch - The Newsletter

May 8, 2007

The goal of IR Watch - The Newsletter is to disseminate recent advances, research, and news from the information retrieval world. The current issue (IRW-2007-5) is a summary of my presentation at the OJOBuscador Congress 2 (March 8, 9 - Madrid, Spain),

Demystifying LSI for SEOs.

(more…)

Keyword Density (KD): Revisiting an SEO Myth

May 7, 2007

Back in March of 2005 I wrote The Keyword Density of Non Sense article for Mike Grehan’s newsletter. An expanded and improved version was also published at Mi Islita.com. After these articles, many SEOs saw the light.

However, in an attempt at perpetuating KD myths, a few SEOs tried to reformulate the alleged importance or usefulness of keyword density by presenting KD as a spam detection filter used by search engines. Good try, but this is still nonsense and another SEO myth.

(more…)

SEOs Blogging LSI Non Sense

May 6, 2007

At this SEOMoz.org blog, posters are discussing search engines' semantic capabilities, including LSI.

I stopped by to clarify several things since many of them present their hearsay as valid statements.

(more…)

Representing Documents, Terms, and Queries in the Same Space

May 6, 2007

A reader asked me an interesting question: Without using LSI, how do you represent documents, terms, and queries in the same space?

(more…)

PCA Is Not LSI

May 5, 2007

The fact that singular value decomposition (SVD) is used in principal component analysis (PCA) and in latent semantic indexing (LSI) has led some (even some “johnnycomeslate-to-IR” assistant professors) to think that PCA is LSI.

(more…)

On SVD and PCA: Some Applications

May 5, 2007

Some readers have asked me to clarify the difference between SVD and PCA, since these have many overlapping heritages. This was clarified at a TREC9 presentation. For those interested in a mathematical explanation or in ongoing research using these, the following might help.

(more…)

How to Populate a Matrix for SVD

May 4, 2007

SVD has been applied to different scenarios like IR, Economics, Computational Chemistry, and BioComputation. In all these cases, one must pay attention to how one populates the initial matrix to be “SVDied”.

(more…)

Demystifying LSA, LSI, SVD, PCA, and IS ACRONYMS

May 3, 2007

If you are interested in learning what the LSA, LSI, SVD, and PCA acronyms mean, this post is for you.

(more…)

Two SEO Blogonomies

May 3, 2007

As I mentioned in a ClickZ column written by Mike Grehan, The Myths and Maths of SEOs, a blogonomy is the dissemination of false knowledge through electronic forums, especially through blogs. Today I want to comment on two LSI blogonomies promoted by several SEO firms.

(more…)

“LSI-Friendly” Documents: No Such Thing

May 3, 2007

Indeed, this was the topic of a post I made at this Cre8asiteForums thread.

Quoting myself in part:

“When LSI is applied to a term-document matrix representing a collection of documents in the zillions, the co-occurrence phenomenon that affects the LSI scores becomes a global effect, occurring between documents in the collection.

(more…)

Latest SEO Incoherences (LSI)

May 3, 2007

One of the reasons I started the SVD and LSI Tutorial series was to debunk so many myths about latent semantic indexing. These myths come mostly from a given sector of the search engine marketing industry. In the 1800s and 1900s, when new drugs and medicines were discovered, an interesting phenomenon took place in the old wild west: unscrupulous marketers started to sell “amazing potions” and “miracle syrups”. These “snake oil sellers” are nothing new since each decade has its versions.

(more…)

SEO Blogonomies: The Search Engine Markov Chain

May 3, 2007

Note: I added this post's content to the Stochastic Matrix tutorial.

The spreading of incorrect knowledge, or at best inaccurate representation of concepts, is prevalent in circles associated with search engine optimization (SEO). This is a social phenomenon more notorious in the blogosphere and in public forums (sites and discussion forums). Because of this, we call instances of this phenomenon “blogonomies”.

(more…)

Some Definitions

May 3, 2007

While working on Part 1 of the math tutorial I was asked to define “blogonomies”, a term I like to use in reference to an interesting social blog behavior.

Well, I have other definitions, equally interesting and worth a study: “blogorrhea” and “linkphilis”.

I call “blogonomies” the dissemination of false knowledge through blogs and “blogorrhea” when a false concept is promoted for profit.

A blogonomy can be the result of ignorance or speculation; nothing that a damage control campaign cannot fix to save face. I have seen many of these in some blogs and discussion forums.

(more…)