Archive for May, 2007
May 31, 2007
What is Data Mining? Good question.
After a great one-week vacation away from the blog, it is good to be back. During my vacation I was asked to explain the difference between data mining and information retrieval, so here goes.
Here is a standard definition I wrote for a graduate course syllabus to be taught next fall at a local university:
(more…)
Posted in Data Mining, Machine Learning | No Comments »
May 24, 2007
Posted in IR Tutorials | No Comments »
May 23, 2007
The simplest litmus test I have come up with to know whether translation software is free from flaws consists of checking its output under a recursive translation between two languages. I like to call this “iterlation”. I like Spanish and English, so these are my preferred languages.
To iterlate text, I normally proceed as follows, defining x as an iteration step:
(more…)
Posted in Machine Learning | No Comments »
May 22, 2007
I’m building a client-side suite of text mining tools for extracting intelligence from text files, Web pages, and email documents. It comes in four versions: basic, intermediate, advanced, and pro. The basic version provides the following reports:
(more…)
Posted in Miscellaneous | No Comments »
May 21, 2007
Eigenvectors and eigenvalues come in pairs; that is why we use the term eigenpair. Some have asked me about practical applications of eigenpairs, so here goes.
Did you know the connection between eigenvectors and Reggaeton Music (or music in general)? How about eigenvectors and bridges, car designers, speakers, architecture, or oil companies?
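For readers new to the term: an eigenpair of a square matrix A is a value λ together with a nonzero vector v such that A v = λ v. Here is a minimal Python sketch that checks this condition for a small symmetric matrix. The matrix, the candidate pairs, and the helper names are chosen only for illustration.

```python
def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def is_eigenpair(A, lam, v, tol=1e-12):
    """Return True if A v equals lam * v component by component (within tol)."""
    Av = matvec(A, v)
    return all(abs(av - lam * x) <= tol for av, x in zip(Av, v))


# Illustrative 2x2 symmetric matrix.
A = [[2.0, 1.0],
     [1.0, 2.0]]

# Candidate eigenpairs (eigenvalue, eigenvector) for A.
eigenpairs = [
    (3.0, [1.0, 1.0]),
    (1.0, [1.0, -1.0]),
]

for lam, v in eigenpairs:
    print(lam, v, is_eigenpair(A, lam, v))
```

Each eigenvalue is reported next to the eigenvector it belongs to, which is the pairing the term eigenpair refers to.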
(more…)
Posted in Legacy Posts, Miscellaneous | No Comments »
May 18, 2007
Yan Qu, over at the College of Information Studies, University of Maryland, teaches the graduate course
LBSC 670 Information Structure
For the course, Qu selected our tutorials as required readings:
(more…)
Posted in IR Tutorials, Legacy Posts, Vector Space Models | No Comments »
May 17, 2007

Marco Kalz, M.A., over at the Educational Technology Expertise Centre, Open University of the Netherlands, informed me months ago that the Open University was organizing the 1st European Workshop on LSA in Technology-Enhanced Learning. Marco is part of the Scientific Committee responsible for organizing the event and co-author of the workshop proceedings.
It is my pleasure to inform our readers that the event was a complete success. I will ask Marco for additional inside information, to perhaps include in our next issue of IRW Newsletter.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts | No Comments »
May 16, 2007
Dr. Mark Green, Director of the Institute for Pure and Applied Mathematics at UCLA (IPAM), informed me by email of the upcoming workshops IPAM is organizing. I met Dr. Green last year during a one-week workshop they organized (The Document Space Workshop).
I am listing below the new workshops relevant to search engines:
(more…)
Posted in Conferences, Machine Learning | No Comments »
May 15, 2007
Today I uploaded the new iteration of the Mi Islita.com site.
I’ve added or updated the following resource pages:
IR Thoughts Archives - A sample of posts from this blog, powered by a homemade AJAX reader and regexps.
IR Calls - A list of conferences and industry events we recommend attending.
IR Tutorials - Tutorials on Vector Space and LSI Models, Matrix Algebra, and more.
Educational Links - Graduate theses and research projects referencing Mi Islita.
Marketing Links - Search engine marketing articles referencing Mi Islita.
(more…)
Posted in Miscellaneous | No Comments »
May 14, 2007
If you are an IR researcher looking for some open source software, this post is for you.
During the second day of the OJOBuscador Congress 2.0 held in Madrid, Spain (March 8, 9) I attended the IR with Usability track.
The first speaker was Dr. Carlos Castillo from Yahoo! Research Spain. He presented on IR with Adversarial and Web Spam.
The next speaker, IR practitioner Jose Ramon-Perez Aguera, presented several open source software packages. I want to share with you a handy list, thanks to Jose’s presentation:
(more…)
Posted in Machine Learning | No Comments »
May 11, 2007
A few days ago, Michael Duz had a great post, The LSI Myth, wherein he describes the nonsense promoted by snake oil SEO marketers. He has a list of common taglines used by these people:
(more…)
Posted in Latent Semantic Indexing, SEO Myths | No Comments »
May 10, 2007
Posted in Conferences | No Comments »
May 10, 2007
Here is some great news:
1. I am getting ready for my presentation at the Intektel International Conference and Expo. I am presenting on the second day of the conference on “The Impact of Search Engines in the Internet”.
2. Next week we have the ARIN Conference (American Registry for Internet Numbers) in Puerto Rico, and in June we also have the 29th ICANN Conference in San Juan, PR. WOW!
3. Taschuk Morgan has written an excellent Honours Thesis that kindly references our tutorial on Cosine Similarity and Term Weights. Morgan writes:
(more…)
Posted in Legacy Posts, Theses | No Comments »
May 10, 2007
I am finishing reading the 2001 Master’s Thesis of Cameron Alexander Marlow, from MIT:
A Language-Based Approach to Categorical Analysis
where he proposed the use of Synchronic Imprints (SI) combined with LSI. Great thesis. Essentially, SI incorporates a spring model in which term frequencies are inversely proportional to their distances.
(more…)
Posted in Legacy Posts, Theses | No Comments »
May 9, 2007
Those who promote keyword density (KW) myths are now claiming that search engines use KW as an inexpensive spam detection mechanism; i.e.,
if (KW = fij/lj > upper threshold value) {// raise the spam red flag }
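To make the claimed check concrete, here is a minimal Python sketch of it. It assumes fij is the number of times term i occurs in document j and lj is the total number of words in document j. The tokenizer, the function names, and the 0.10 threshold are made up for illustration. The sketch only shows what the claim amounts to; it is not a description of anything search engines are known to do.

```python
import re


def keyword_density(text, term):
    """Return f_ij / l_j: occurrences of `term` divided by total words in `text`."""
    # Naive tokenizer (an assumption): lowercase runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


def exceeds_threshold(text, term, upper_threshold=0.10):
    """The alleged red-flag test: KW > upper threshold value.

    The default threshold is a hypothetical value, not a known one.
    """
    return keyword_density(text, term) > upper_threshold


doc = "cheap pills cheap pills buy cheap pills now"
kd = keyword_density(doc, "cheap")  # 3 occurrences out of 8 words
print(kd, exceeds_threshold(doc, "cheap"))
```

Note that the ratio depends only on counts and document length. It carries no information about where the terms occur or what surrounds them.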
(more…)
Posted in SEO Myths | 2 Comments »
May 8, 2007
The goal of IR Watch - The Newsletter is to disseminate recent advances, research, and news from the information retrieval world. The current issue (IRW-2007-5) is a summary of my presentation at the OJOBuscador Congress 2 (March 8, 9 - Madrid, Spain),
Demystifying LSI for SEOs.
(more…)
Posted in Latent Semantic Indexing, Newsletters | No Comments »
May 7, 2007
Back in March of 2005 I wrote The Keyword Density of Non Sense article for Mike Grehan’s newsletter. An expanded and improved version was also published at Mi Islita.com. After these articles, many SEOs saw the light.
However, in an attempt to perpetuate KD myths, a few SEOs tried to reformulate the alleged importance or usefulness of keyword density by presenting KD as a spam detection filter used by search engines. Good try, but this is still nonsense and another SEO myth.
(more…)
Posted in SEO Myths | 2 Comments »
May 6, 2007
At this SEOMoz.org blog, posters are discussing search engines’ semantic capabilities, including LSI.
I stopped by to clarify several things, since many of them present hearsay as valid statements.
(more…)
Posted in Latent Semantic Indexing, SEO Myths | 2 Comments »
May 6, 2007
A reader asked me an interesting question: Without using LSI, how do you represent documents, terms, and queries in the same space?
(more…)
Posted in Legacy Posts, Vector Space Models | No Comments »
May 5, 2007
The fact that singular value decomposition (SVD) is used in principal component analysis (PCA) and in latent semantic indexing (LSI) has led some (even some “johnny-come-lately-to-IR” assistant professors) to think that PCA is LSI.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts | 1 Comment »
May 5, 2007
Some readers have asked me to clarify the difference between SVD and PCA, since the two share much of their heritage. This was clarified at a TREC9 presentation. For those interested in a mathematical explanation or in ongoing research using these, the following might help.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts | No Comments »
May 4, 2007
SVD has been applied in many different fields, such as IR, economics, computational chemistry, and biocomputation. In all these cases, one must pay attention to how one populates the initial matrix to be “SVDied”.
(more…)
Posted in Latent Semantic Indexing | 4 Comments »
May 3, 2007
If you are interested in learning what the LSA, LSI, SVD, and PCA acronyms mean, this post is for you.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts | No Comments »
May 3, 2007
As I mentioned in a ClickZ column written by Mike Grehan, The Myths and Maths of SEOs, a blogonomy is the dissemination of false knowledge through electronic forums, especially through blogs. Today I want to comment on two LSI blogonomies promoted by several SEO firms.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts, SEO Myths | 1 Comment »
May 3, 2007
Indeed, this was the topic of a post I made at this Cre8asiteForums thread.
Quoting myself in part:
“When LSI is applied to a term-document matrix representing a collection of documents in the zillions, the co-occurrence phenomenon that affects the LSI scores becomes a global effect, occurring between documents in the collection.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts | No Comments »
May 3, 2007
One of the reasons I started the SVD and LSI Tutorial series was to debunk so many myths about latent semantic indexing. These myths come mostly from a given sector of the search engine marketing industry. In the 1800s and 1900s, when new drugs and medicines were discovered, an interesting phenomenon took place in the old Wild West: unscrupulous marketers started to sell “amazing potions” and “miracle syrups”. These “snake oil sellers” are nothing new, since each decade has its own version.
(more…)
Posted in Latent Semantic Indexing, Legacy Posts | 1 Comment »
May 3, 2007
Note: I added this post content to the Stochastic Matrix tutorial.
The spreading of incorrect knowledge, or at best inaccurate representations of concepts, is prevalent in circles associated with search engine optimization (SEO). This social phenomenon is most notorious in the blogosphere and in public forums (sites and discussion forums). Because of this, we call instances of the phenomenon “blogonomies”.
(more…)
Posted in Legacy Posts, SEO Myths | No Comments »
May 3, 2007
While working on Part 1 of the math tutorial I was asked to define “blogonomies”, a term I like to use in reference to an interesting social blog behavior.
Well, I have other definitions, equally interesting and worth a study: “blogorrhea” and “linkphilis”.
I call “blogonomies” the dissemination of false knowledge through blogs and “blogorrhea” when a false concept is promoted for profit.
A blogonomy can be the result of ignorance or speculation; nothing that a damage control campaign cannot fix to save face. I have seen many of these in some blogs and discussion forums.
(more…)
Posted in Legacy Posts | No Comments »