Archive for August, 2007

Sneak Preview of IRW-2007-9

August 31, 2007

Constrained Co-Occurrence Searches

“Unlike AND and EXACT searches, constrained co-occurrence searches (cc searches) consist of searching within a text window wherein search terms are either unordered, as in proximity searching, or ordered, as in adjacency searching. Thus, cc searching is a contextual way of searching within similar neighboring terms.”

Here is a sneak preview of the September issue of IR Watch - The Newsletter.

IRW-2007-9: Constrained Co-Occurrence Searches

In this issue:

Introduction
AND Searches
EXACT Searches
CC Searches Defined
Proximity Searches
Automated Search Modes
Adjacency Searches
CC Searches at ONR
Differential CC Searches
Testing the ONR Algorithm
CC Searches and Keyword Research
CC Searches and Search Commands
CC Searches vs. Google’s Tilde Operator
Conclusion
References
IR Thoughts - News, Research, and Events
Terms of Use and Copyright

If you are a subscriber, this issue should arrive in your inbox during the day.
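The difference between the unordered (proximity) and ordered (adjacency) modes mentioned in the quote above can be illustrated with a small sketch. This is not the ONR algorithm discussed in the issue. The function names, the tokenizer, and the meaning of windowSize (the maximum distance in token positions between the two terms) are assumptions made only for illustration.

```javascript
// Split text into lowercase word tokens (a simplifying assumption).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Hypothetical helper: returns true if termA and termB co-occur within
// windowSize token positions. If ordered is true, termA must come first.
function ccMatch(text, termA, termB, windowSize, ordered) {
  const tokens = tokenize(text);
  const a = termA.toLowerCase();
  const b = termB.toLowerCase();
  for (let i = 0; i < tokens.length; i++) {
    if (tokens[i] !== a) continue;
    for (let j = 0; j < tokens.length; j++) {
      if (tokens[j] !== b || i === j) continue;
      const gap = j - i; // positive when termB follows termA
      const inWindow = Math.abs(gap) <= windowSize;
      if (inWindow && (!ordered || gap > 0)) return true;
    }
  }
  return false;
}

const text = "information retrieval systems rank documents";
console.log(ccMatch(text, "retrieval", "information", 1, false)); // unordered
console.log(ccMatch(text, "retrieval", "information", 1, true));  // ordered
```

With ordered set to false the order of the two terms does not matter; with ordered set to true the same query only matches when the first term precedes the second inside the window.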

Random Notes

August 30, 2007

I’m putting the final touches on IR Watch, now in its first year of publication. I started the project a year ago. Thank you for your support.

Tomorrow I will post a sneak preview of the September issue. This one is about research conducted at the Office of Naval Research in the area of search modes. If you are a keyword researcher you need to read this issue.

I’m also researching a large repository of obscure databases, accessible through FTP. If you are a KDD researcher, you will love to know about these.

A Call to Expose SEO Liars

August 29, 2007

Since A Call to SEOs claiming to Sell LSI, many are finally realizing they were taken/gamed by crooked SEOs selling snake oil in the form of spurious LSI arguments. It is now time to issue a call to expose all these sinister marketers who are giving a black eye to the search marketing industry. So, you are welcome to join great guys like Mike Duz, David Petar, and Mike Grehan and expose these people.

If you prefer, do like Dan Thies and blog about their myths. In Lies, Damn Lies, Thies has exposed another old SEO myth: keyword density. Here are additional reasons against this myth, which many marketers are still hanging on to:

2007/05/09 Keyword Density Myth - The Devil’s Advocate

2007/05/07 Keyword Density (KD): Revisiting an SEO Myth

On the Evolution of SEO Myths

The evolution of KD myths and KD tools within SEO circles is anecdotal. It is quite similar to the evolution of LSI-based SEO myths promoted by almost the same marketers. There is a clear pattern of deception:

Repeat a hearsay many times, spin it, play with words, convince cheerleaders to repeat your hearsay like parrots, and then repeat everything again until many cheerleaders, peers, and “experts” repeat your nonsense in blogs and SEO books. Invent formulas out of thin air and tools that support these, etc.

If you prefer, misquote or copy/dump IR papers and patents in your blog to give the impression you know about information retrieval. Then, stretch these IR papers or patents to your heart’s content or to fit whatever you are trying to sell or promote. That can be your own image or other crooks’ services.

Two wings of the same bird

That’s how the KD and LSI SEO myths have survived all these years. These are two wings of the same bird. Unfortunately, the very same marketers go to fancy search marketing conferences, blogs, forums, and a few other channels to spread the same misinformation or to induce others into error. No wonder Mike Grehan has called these ‘hot air’.

Take, for instance, those marketers who have preached about LSI for years or sold “LSI-like services” without even having a clue about how SVD actually works. They either do so to build an image as “experts” or to intentionally deceive their peers and clients, because of vested interests.

When caught with their pants down, they often have two choices:

1. recanting.
2. recoiling.

A few raise the royal “we” and “honest” flag and then resort to throwing dirt rather than proving their case regarding their LSI claims. As far as I’m concerned, they can throw dirt or scream like babies all they want. They deserve to have their heads hammered any day of the week.

These are the very same folks that give a black eye to the damn search marketing industry, by deceiving the public and prospective clients while posing as honest business guys. No wonder so many IR folks perceive SEOs just as vulgar spammers.

As I always say to peer IRs and graduate students, not all SEOs are deceivers. Some are indeed ethical and quite honest. However, the bad apples are easy to spot.

Most likely, the more vocal the “SEO experts” are, the less they know about information retrieval and search engines. To be on the safe side, stay away from those that peer marketers call “SEO experts”. As we say in Spanish: ‘Ante la duda, saluda’.

Many of these have been exposed many times and in different places. Here are some references for your perusal:

SVD and LSI Tutorial 1: Understanding SVD and LSI

SEOs and their LSI Misconceptions

LSI Blog Posts and SEOs

When SEOs are caught in Lies

How Complex Simplex Words Can Be

August 28, 2007

1. Are you interested in the word frequency effect on users?

2. Want to know why a high-frequency word such as car is recognized more quickly than a low-frequency word such as doe?

3. Interested in why processing time is determined not only by the frequency of a complex word, but also by the frequencies of its constituents and surrounding terms?

If you answer yes to any of these, this post is for you.

Back in 1997, Schreuder and Baayen wrote in How Complex Simplex Words Can Be:

“A series of experiments investigated components of the word frequency effect in visual lexical decision, progressive demasking, and subjective frequency ratings. For simplex, i.e., monomorphemic, nouns in Dutch, we studied the effect of the frequency of the monomorphemic noun itself as well as the effect of the frequencies of morphologically related forms on the processing of these monomorphemic nouns. The experiments show that the frequency of the (unseen) plural forms affects the experimental measures. Nouns with high-frequency plurals are responded to more quickly in visual lexical decision, and they receive higher subjective frequency ratings. However, the summed frequencies of the formations in the morphological family of a given noun (the compounds and derived words in which that noun appears as a constituent) did not affect the experimental measures. Surprisingly, the size of the morphological family, i.e., the number of different words in the family, emerged as a substantial factor. A monomorphemic noun with a large family size elicits higher subjective frequency ratings and shorter response latencies in visual lexical decision than a monomorphemic noun with a small family size. The effect of family size disappears in progressive demasking, a task which taps into the earlier stages of form identification. This suggests that the effect of family size arises at more central, post-identification stages of lexical processing.”

Although it is an oldie, I still find their research relevant.

Identifying Hidden Sequences

August 27, 2007

In the Row-Pruning Algorithm Tutorial post I introduced an algorithm for identifying term sequences hidden in collections. The tutorial is available at the Mi Islita.com site. Several applications were also presented.

In a nutshell, given a set of terms T = {t1, t2, …, tm} extracted from a document collection C = {d1, d2, …, dn}, the goal is to combine each of these terms with all other terms of T and then find the family of term sequences more frequently present in C. Such families are of the form

“k1”
“k1 + k2”
“k1 + k2 + … + km”

The double quotes indicate that these are term sequences.

The algorithm is then aimed at establishing the identity of the k’s using T and C. Once k1 is identified, the algorithm reduces to conducting a binary partition on the corresponding largest answer sets. Normally T consists of a few terms (m < threshold value) and as such is a subset of the index of terms. Whenever possible these should be on-topic terms preselected by a suitable algorithm or heuristic criterion.

Confidence-based pruning is used, but one could as well use other measures like co-occurrence, support, or similar measures. One does not need to restrict the algorithm to just terms.
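To make the idea of growing a family of term sequences more concrete, here is a minimal greedy sketch. It is not the RPA from the tutorial. It assumes that documents are arrays of tokens, that a term sequence is a contiguous phrase, and that confidence is support(sequence + term) / support(sequence). The names containsPhrase, support, and growSequences are hypothetical.

```javascript
// True if the token array contains the phrase as a contiguous run.
function containsPhrase(tokens, phrase) {
  for (let i = 0; i + phrase.length <= tokens.length; i++) {
    let ok = true;
    for (let j = 0; j < phrase.length; j++) {
      if (tokens[i + j] !== phrase[j]) { ok = false; break; }
    }
    if (ok) return true;
  }
  return false;
}

// Number of documents in which the phrase occurs.
function support(docs, phrase) {
  return docs.filter(function (d) { return containsPhrase(d, phrase); }).length;
}

// Greedily extend the sequence with the term that keeps the largest
// answer set; stop when confidence drops below minConfidence.
function growSequences(docs, terms, minConfidence) {
  const family = [];
  let current = [];
  let currentSupport = docs.length;
  let remaining = terms.slice();
  while (remaining.length > 0) {
    let best = null;
    let bestSupport = 0;
    for (const t of remaining) {
      const s = support(docs, current.concat(t));
      if (s > bestSupport) { best = t; bestSupport = s; }
    }
    if (best === null) break; // no extension occurs in any document
    const confidence = bestSupport / currentSupport;
    if (current.length > 0 && confidence < minConfidence) break; // prune
    current = current.concat(best);
    currentSupport = bestSupport;
    family.push({ sequence: current.join(" "), support: bestSupport });
    remaining = remaining.filter(function (t) { return t !== best; });
  }
  return family;
}

const docs = [
  ["latent", "semantic", "indexing", "tutorial"],
  ["latent", "semantic", "analysis"],
  ["latent", "semantic", "indexing"],
  ["latent", "variables"]
];
console.log(growSequences(docs, ["latent", "semantic", "indexing"], 0.6));
```

Swapping the confidence ratio for another measure, such as raw co-occurrence counts, only requires changing the pruning test.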

I see many applications to this, including visitations:

“Users clicking on X and then on Y tend to click next on Z.”

“Customers visiting store X and then store Y tend to visit next store Z.”

and so forth…

ACM Document Engineering 2007 Conference

August 24, 2007

DocEng 2007 will be held at the University of Manitoba, Winnipeg, Canada, from August 28 to 31, 2007.

The ACM Symposium on Document Engineering is an annual international academic conference devoted to the dissemination of research on models, tools and processes that improve our ability to create, manage and maintain documents.

Documents are one of the centerpieces of globally interconnected systems that store information drawn from many media and deliver that information as required by users. A document may be stored in final presentation form or may be generated on-the-fly, undergoing substantial transformations in the process. Documents may include extensive hyperlinks, thereby permitting virtual documents, and also making available structured collections of information on which to anchor automated reasoning, such as promoted through the Semantic Web. Furthermore, document technologies like XML are having a profound impact on data modeling, in part because of the way these technologies bridge and integrate a variety of paradigms.

To attend next week’s conference or for additional information, visit

http://www.cs.umanitoba.ca/~doceng07/

Applications of Edit Distances

August 23, 2007

After uploading the Levenshtein Edit Distance Tool I received several recommendations for its implementation. No doubt this is a similarity measure for the masses. Here is a current list.

The Levenshtein Edit Distance Algorithm can be used:

  1. for automatic marking of musical dictations.
  2. for approximate matching of regular expressions.
  3. to identify if two genetic sequences have similar functions.
  4. to filter blocks of email lists (candidate spam addresses) within a LED threshold value.
  5. as the ultimate baby name explorer.
  6. to name products and services like domains, brands, etc.
  7. to conduct fuzzy search matches in EXCEL or your preferred environment.
  8. for spamdexing search engines - by randomly converting text into gibberish.
  9. for spam stemming search engines - by systematically appending edits to valid stems.
  10. as part of a spell checker routine.
  11. to identify duplicated content and plagiarism.
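For readers who want to try some of these applications, here is a minimal sketch of the classic dynamic-programming computation of the Levenshtein distance, with unit cost for insertions, deletions, and substitutions. It is not the code behind the tool. The fuzzyMatches helper and its threshold parameter are hypothetical and only illustrate the fuzzy-matching use case.

```javascript
// Levenshtein distance between strings a and b, keeping two rows of
// the dynamic-programming table at a time.
function levenshtein(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j); // "" -> b[0..j]
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // a[0..i] -> ""
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: keep candidates within a distance threshold.
function fuzzyMatches(query, candidates, threshold) {
  return candidates.filter(function (c) {
    return levenshtein(query, c) <= threshold;
  });
}

console.log(levenshtein("kitten", "sitting"));
console.log(fuzzyMatches("retrieval", ["retreival", "retrieve", "ranking"], 2));
```

The comparison is case-sensitive, so lowercase both strings first if case should not count as an edit.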

Got an idea, suggestion, or reference? Let me know. Don’t forget to include a link.

Data Mining and Reports on Terrorism

August 22, 2007

I’m researching the topic of Data Mining (KDD) and Terrorism Information Awareness (TIA) for a graduate course and came across a great old resource:

Data Mining

It is an oldie, but the important part is the references.

It may interest IRs conducting similar research.

Here is another great resource:

Data Mining and Homeland Security

JavaScript Tips

August 21, 2007

This is not a post about an IR topic, but since at some point IR projects resort to programming, I believe the post is relevant to this blog, especially when many IR tools used in a classroom demonstration setting are written in JavaScript.

I’m reading Douglas Crockford’s great video/ppt presentations on JavaScript via http://101out.com/js.php. There are many things average programmers don’t know about JavaScript, the most misunderstood programming language on the planet. For those not familiar with Crockford, a few years ago he pioneered the right way of writing JavaScript. Haven’t heard of JSON?

He gives many great tips in those videos and ppt slides. Here are some of them:

Tip #1

//Instead of

if(a==null) {...} //which does coercion

//do this:

if(a===null){...}

//Also use !== instead of !=
//Avoid == and != altogether in your code. The === operator compares without type coercion: primitives must have the same type and value, and objects are equal only if both operands reference the same object.
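A short sketch makes the difference visible. The function names looseIsNull and strictIsNull are made up for this illustration. Note that === compares primitives by type and value and compares objects by reference.

```javascript
// == null is true for both null and undefined because of coercion.
function looseIsNull(a) { return a == null; }

// === null is true only for null itself.
function strictIsNull(a) { return a === null; }

console.log(looseIsNull(undefined));  // coercion treats undefined like null
console.log(strictIsNull(undefined)); // no coercion
console.log(0 == "");                 // coercion makes these equal
console.log(0 === "");                // different types
console.log("abc" === "abc");         // primitives: same type and value
console.log({} === {});               // objects: two distinct references
```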

Tip #2

//Instead of
if(a){return a.member;}else{return a;}

//do this, which is shorter:

return a && a.member;
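Wrapped in a function, the short form can be checked against a few inputs. The name getMember is hypothetical. The && operator returns its left operand when that operand is falsy and its right operand otherwise, which is why both versions behave the same.

```javascript
// Returns a.member when a is truthy, otherwise returns a itself.
function getMember(a) {
  return a && a.member;
}

console.log(getMember({ member: 42 })); // truthy object: yields its member
console.log(getMember(null));           // falsy: yields null unchanged
console.log(getMember(undefined));      // falsy: yields undefined unchanged
```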

Tip #3

//Use || to set default values
//do this, which requires less typing:

var last=input||nr_items;

//if input is truthy, last is input, otherwise set last to nr_items
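One caveat worth seeing in code: || falls back on any falsy value, not only on a missing one, so a legitimate 0 or empty string is replaced too. The function names below are hypothetical, and the second version is just one possible explicit check.

```javascript
// Default via ||: any falsy input (undefined, null, 0, "") is replaced.
function lastWithOr(input, nr_items) {
  var last = input || nr_items;
  return last;
}

// Explicit check: only a missing (undefined) input is replaced.
function lastWithCheck(input, nr_items) {
  var last = input === undefined ? nr_items : input;
  return last;
}

console.log(lastWithOr(5, 10));         // input is truthy, so it is kept
console.log(lastWithOr(undefined, 10)); // falls back to nr_items
console.log(lastWithOr(0, 10));         // 0 is falsy, so it is replaced
console.log(lastWithCheck(0, 10));      // 0 is kept
```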

Tip #4

//Statements can have labels. Break statements can refer to labels. Use labels only on do, for, switch, and while.

//do this

loop: for(; ;)
{
//do something
if(…){break loop;}
//do something
}
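Labels pay off mostly with nested loops, where a plain break would leave only the inner loop. Here is a small runnable sketch; the function name findPosition and the sample matrix are made up for illustration.

```javascript
// Returns [row, column] of the first cell equal to target, or null.
function findPosition(matrix, target) {
  var found = null;
  outer: for (var r = 0; r < matrix.length; r++) {
    for (var c = 0; c < matrix[r].length; c++) {
      if (matrix[r][c] === target) {
        found = [r, c];
        break outer; // leaves both loops at once
      }
    }
  }
  return found;
}

console.log(findPosition([[1, 2], [3, 4]], 3));
console.log(findPosition([[1, 2], [3, 4]], 9));
```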

There are more great tips, but it is better if you assimilate these at your own pace. Time to use literals more often. So,

//it is time to use {} instead of new Object() and [] instead of new Array().

For code conventions for the JavaScript programming Language visit

http://javascript.crockford.com/code.html

I must agree with him that most JavaScript code on the web is crap.

Levenshtein Edit-Distance Based Tool

August 20, 2007

As announced, the Levenshtein Edit-Distance Based Tool is now available at the Mi Islita.com site.

The tool is meant to be for demonstration purposes; e.g., as in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!

Upcoming Tool on Edit Distances

August 17, 2007

I’m working on a tool for computing edit distances (number of insertions, deletions, and substitutions) in a text stream.

It will be up and running this Monday. It is great for a hands-on tutorial.

Did you know that changing Democrats into Republicans, and vice versa, requires just 8 edits? :)

(more…)

Reviewing Papers: How-To

August 16, 2007

As a reviewer of journal manuscripts and conference papers, I normally look to see if the piece before me answers the following questions:

1. WHAT-WHY: What is the scientific problem at hand and why is it important?
2. WHO-WHAT-WHY: Who proposed what previous solutions and why are these inadequate or incomplete?
3. WHAT-YOUR-WHY: What is your proposed solution and why is it better?
4. HOW-WHAT: How is the solution implemented and what are the benefits or practical applications?
5. PROS-CONS-WHAT: What are the possible pros and cons of your solution and what are the next areas of research?

(more…)

Computer Science & Engineering, Berlin

August 15, 2007

The Fourth International Conference on Computer Science and Engineering (CSE 2007) will be held in Berlin, Germany during August 24-26, 2007.

CSE 2007 aims to bring together researchers, scientists, engineers, and students to exchange and share their experiences, new ideas, and research results about all aspects of Computer Science and Engineering, and discuss the practical challenges encountered and the solutions adopted.

(more…)

Special Matrices You Should Know About

August 14, 2007

A reader asked me about matrices, so I referred him to my series of Tutorials on Matrices and IR. He then asked me about some special kind of matrices. I answered his questions with some examples.  He then replied with some analogies.

(more…)

Argentine Symposium on Artificial Intelligence

August 13, 2007

ASAI, the Argentine Symposium on Artificial Intelligence, is an annual event intended to be the main forum of the Artificial Intelligence (AI) community in Argentina. The symposium aims at providing a forum for researchers and AI community members to discuss and exchange ideas and experiences on diverse topics of AI. Previous ASAI editions stimulated presentations on both applications of AI and new tools and foundations currently under development.

(more…)

When Local Relevancy is Irrelevant to Locals

August 10, 2007

When is local relevancy irrelevant to locals? In other words, when is local not important to locals?

That depends on whom you ask. For instance, at times news relevant to a location is not known by locals because of manipulation by media moguls. When globals know more than locals about local news, you know that something is not working right.

(more…)

Search Smart with WCC’s ELISE

August 9, 2007

The list of subscribers to IRWatch is growing at a fast pace. One of our recent subscribers is a developer at WCC, makers of ELISE Smart Search & Match. This seems to be quite an interesting technology. I highly recommend that readers visit their site: http://wcc-group.com/

(more…)

Row-Pruning Algorithm Tutorial

August 8, 2007

In IRW-2007-8 I introduced a row-pruning algorithm (RPA) and its unconditional version (URPA).

Let C be a collection of documents and let T = {t1, t2… tm} be m unique terms extracted from C. Assume that term i is combined with all other terms such that starting with term i a family of term sequences is obtained. Assume that these term sequences are hidden (latent) in C. The purpose of RPA and URPA is to identify the composition of these term sequences and to find those that occur more frequently in C.

(more…)

2007 SIGIR - The Conference Papers

August 7, 2007

The 30th Annual International ACM SIGIR Conference was over 10 days ago (23-27 July 2007, Amsterdam). I didn’t have time to list the accepted papers/posters/demos. Here they are.

http://www.sigir2007.org has all the glory. While you are there, and if you have an account, check Karen Spärck Jones’s online video.

(more…)

ADSAM: Emotional Response Modeling

August 6, 2007

I had the pleasure of learning about Dr. Jon Morris, Professor at the University of Florida and CEO of ADSAM. He specializes in Emotional Response Modeling. His company is at the leading edge of the field and has incredible research articles and studies applied to advertising and marketing. I highly recommend that those interested in emotional advertising read about his work.

(more…)

TREC 2008 Call for Papers

August 3, 2007

Dr. Ellen Voorhees, over at TREC, a NIST.gov dependency, sent me the TREC 2008 Call for Papers.

If you are a colleague, feel free to submit or to forward it to your college faculty, across departments and disciplines.

(more…)

THESUS: Semantic Subsets Based on WWW Links

August 2, 2007

I came across a 2003 paper on mining WWW text links, in which researchers introduced THESUS. Really interesting piece.

The abstract of THESUS: ORGANIZING WEB DOCUMENT COLLECTIONS BASED ON LINK SEMANTICS follows:

(more…)

DARPA Agent Markup Language (DAML)

August 1, 2007

The DARPA Agent Markup Language (DAML) site has tons of tools and resources that CS/IR graduate students and SEM/SEO practitioners with some IR knowledge can use for data mining purposes. These can help with nice experiments, from ontology-based keyword discovery to the construction of crawlers (or at least with really learning how these actually work).

Here is a list of resources.

(more…)