IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: August 2007

Sneak Preview of IRW-2007-9

31 Friday Aug 2007

Posted by egarcia in Newsletters

Constrained Co-Occurrence Searches

“Unlike AND and EXACT searches, constrained co-occurrence searches (cc searches) consist of searching within a text window wherein search terms are either unordered, as in proximity searching, or ordered, as in adjacency searching. Thus, cc searching is a contextual way of searching within similar neighboring terms.”

Here is a sneak preview of the September issue of IR Watch – The Newsletter.

IRW-2007-9: Constrained Co-Occurrence Searches

In this issue:

Introduction
AND Searches
EXACT Searches
CC Searches Defined
Proximity Searches
Automated Search Modes
Adjacency Searches
CC Searches at ONR
Differential CC Searches
Testing the ONR Algorithm
CC Searches and Keyword Research
CC Searches and Search Commands
CC Searches vs. Google’s Tilde Operator
Conclusion
References
IR Thoughts – News, Research, and Events
Terms of Use and Copyright

If you are a subscriber, this issue should arrive in your inbox during the day.
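
As a rough illustration of the definition quoted at the top of this post, a cc search can be pictured as a window check over tokens. This is only a sketch under stated assumptions, not the ONR algorithm covered in the newsletter: the function name `ccMatch`, its parameters, and the simple tokenizer are all hypothetical.

```javascript
// Hypothetical helper (not the ONR algorithm): a plain window check.
// Returns true if termA and termB co-occur within `windowSize` token positions.
// ordered = true  -> termA must appear before termB (adjacency-style)
// ordered = false -> either order is accepted (proximity-style)
function ccMatch(text, termA, termB, windowSize, ordered) {
  // Naive tokenizer: lowercase and split on non-word characters.
  var tokens = text.toLowerCase().split(/\W+/).filter(function (t) {
    return t.length > 0;
  });
  var a = termA.toLowerCase();
  var b = termB.toLowerCase();
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] !== a) { continue; }
    for (var j = 0; j < tokens.length; j++) {
      if (i === j || tokens[j] !== b) { continue; }
      var gap = j - i;
      if (ordered && gap < 0) { continue; }
      if (Math.abs(gap) <= windowSize) { return true; }
    }
  }
  return false;
}

var sample = "Search engines rank documents by relevance";
var orderedHit = ccMatch(sample, "search", "rank", 2, true);
var reversedOrdered = ccMatch(sample, "rank", "search", 2, true);
var reversedUnordered = ccMatch(sample, "rank", "search", 2, false);
```

With the sample sentence, the ordered check succeeds only when the terms are given in document order, while the unordered check accepts both.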

Random Notes

30 Thursday Aug 2007

Posted by egarcia in Miscellaneous, Newsletters

I’m putting the final touches on IR Watch, now in its first year of publication. I started the project a year ago. Thank you for your support.

Tomorrow I will post a sneak preview of the September issue. This one is about research conducted at the Office of Naval Research in the area of search modes. If you are a keyword researcher you need to read this issue.

I’m also researching a large repository of obscure databases, accessible through FTP. If you are a KDD researcher, you will love to know about these.

A Call to Expose SEO Liars

29 Wednesday Aug 2007

Posted by egarcia in Latent Semantic Indexing, SEO Myths

Since A Call to SEOs claiming to Sell LSI, many are finally realizing they were taken in and gamed by crooked SEOs selling snake oil in the form of spurious LSI arguments. It is now time to issue a call to expose all these sinister marketers who are giving a black eye to the search marketing industry. So, you are welcome to join great guys like Mike Duz, David Petar, and Mike Grehan and expose these people.

If you prefer, do like Dan Thies and blog about their myths. In Lies, Damn Lies, Thies has exposed another old SEO myth: keyword density. Here are additional arguments against this myth, which many marketers still cling to:

2007/05/09 Keyword Density Myth – The Devil’s Advocate

2007/05/07 Keyword Density (KD): Revisiting an SEO Myth

On the Evolution of SEO Myths

The evolution of KD myths and KD tools within SEO circles is anecdotal. It is quite similar to the evolution of LSI-based SEO myths promoted by almost the same marketers. There is a clear pattern of deception:

Repeat a piece of hearsay many times, spin it, play with words, convince cheerleaders to parrot your hearsay, and then repeat everything again until many cheerleaders, peers, and “experts” repeat your nonsense in blogs and SEO books. Invent formulas out of thin air and tools that support these, etc.

If you prefer, misquote or copy and dump IR papers and patents in your blog to give the impression you know about information retrieval. Then, stretch these IR papers or patents to your heart’s content or to fit whatever you are trying to sell or promote. That can be your own image or other crooks’ services.

Two wings of the same bird

That’s how the KD and LSI SEO myths have survived all these years. These are two wings of the same bird. Unfortunately, the very same marketers go to fancy search marketing conferences, blogs, forums, and a few other channels to spread the same misinformation or to lead others into error. No wonder Mike Grehan has called these ‘hot air’.

Take, for instance, those marketers who have preached about LSI for years or sold “LSI-like services” without even having a clue about how SVD actually works. They do so either to build an image as “experts” or to intentionally deceive their peers and clients, because of vested interests.

When caught with their pants down they often have two choices:

1. recanting.
2. recoiling.

A few raise the royal “we” and the “honest” flag and then resort to throwing dirt rather than proving their case regarding their LSI claims. As far as I’m concerned, they can throw dirt or scream like babies all they want. They deserve to be hammered any day of the week.

These are the very same folks that give a black eye to the damn search marketing industry, by deceiving the public and prospective clients while posing as honest business guys. No wonder so many IR folks perceive SEOs just as vulgar spammers.

As I always say to peer IRs and graduate students, not all SEOs are deceivers. Some are indeed ethical and quite honest. However, the bad apples are easy to spot.

More likely than not, the more vocal “SEO experts” are, the less they know about information retrieval and search engines. To be on the safe side, stay away from those whom peer marketers call “SEO experts”. As we say in Spanish: ‘Ante la duda, saluda’ (‘When in doubt, salute’).

Many of these have been exposed many times and in different places. Here are some references for your perusal:

SVD and LSI Tutorial 1: Understanding SVD and LSI

SEOs and their LSI Misconceptions

LSI Blog Posts and SEOs

When SEOs are caught in Lies

How Complex Simplex Words Can Be

28 Tuesday Aug 2007

Posted by egarcia in Machine Learning

1. Are you interested in the word frequency effect on users?

2. Want to know why a high-frequency word such as car is recognized more quickly than a low-frequency word such as doe?

3. Interested in why processing time is determined not only by the frequency of a complex word, but also by the frequencies of its constituents and surrounding terms?

If you answer yes to any of these, this post is for you.

Back in 1997, Schreuder and Baayen wrote in How Complex Simplex Words Can Be:

“A series of experiments investigated components of the word frequency effect in visual lexical decision, progressive demasking, and subjective frequency ratings. For simplex, i.e., monomorphemic, nouns in Dutch, we studied the effect of the frequency of the monomorphemic noun itself as well as the effect of the frequencies of morphologically related forms on the processing of these monomorphemic nouns. The experiments show that the frequency of the (unseen) plural forms affects the experimental measures. Nouns with high-frequency plurals are responded to more quickly in visual lexical decision, and they receive higher subjective frequency ratings. However, the summed frequencies of the formations in the morphological family of a given noun (the compounds and derived words in which that noun appears as a constituent) did not affect the experimental measures. Surprisingly, the size of the morphological family, i.e., the number of different words in the family, emerged as a substantial factor. A monomorphemic noun with a large family size elicits higher subjective frequency ratings and shorter response latencies in visual lexical decision than a monomorphemic noun with a small family size. The effect of family size disappears in progressive demasking, a task which taps into the earlier stages of form identification. This suggests that the effect of family size arises at more central, post-identification stages of lexical processing.”

Although it is an oldie, I still find their research relevant.

Identifying Hidden Sequences

27 Monday Aug 2007

Posted by egarcia in Data Mining, Machine Learning

In the Row-Pruning Algorithm Tutorial post I introduced an algorithm for identifying term sequences hidden in collections. The tutorial is available at the Mi Islita.com site. Several applications were also presented.

In a nutshell, given a set of terms T = {t1, t2, …, tm} extracted from a document collection C = {d1, d2, …, dn}, the goal is to combine each of these terms with all other terms of T and then find the family of term sequences most frequently present in C. Such families are of the form

“k1”
“k1 + k2”
“k1 + k2 + … + km”

The double quotes indicate that these are term sequences.

The algorithm is then aimed at establishing the identity of the k’s using T and C. Once k1 is identified, the algorithm reduces to conducting a binary partition on the corresponding largest answer sets. Normally T consists of a few terms (m < threshold value) and as such is a subset of the index of terms. Whenever possible these should be on-topic terms, preselected by a suitable algorithm or heuristic criterion.

Confidence-based pruning is used, but one could as well use other measures like co-occurrence, support, or similar measures. One does not need to restrict the algorithm to just terms.

I see many applications for this, including visitation patterns:

“Users clicking on X and then on Y tend to click next on Z.”

“Customers visiting store X and then store Y tend to visit next store Z.”

and so forth…
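
To make the idea concrete, here is a minimal greedy sketch of growing a family of sequences with confidence-based pruning. It is not the Row-Pruning Algorithm from the tutorial; it only assumes that each document (or click session) is an array of tokens, that sequences are contiguous, and that confidence is the share of the current answer set that still matches after adding one more term. The names `containsSequence`, `growSequence`, and `minConfidence` are hypothetical.

```javascript
// Does `doc` (array of tokens) contain `seq` (array of tokens) contiguously?
function containsSequence(doc, seq) {
  for (var i = 0; i + seq.length <= doc.length; i++) {
    var hit = true;
    for (var j = 0; j < seq.length; j++) {
      if (doc[i + j] !== seq[j]) { hit = false; break; }
    }
    if (hit) { return true; }
  }
  return false;
}

// Greedy sketch: grow "k1", "k1 k2", ... from the terms in T.
// At each step keep only the documents that still match (the answer set)
// and stop when the confidence of the best extension is below minConfidence.
function growSequence(T, C, minConfidence) {
  var seq = [];
  var answerSet = C;
  var family = [];
  while (seq.length < T.length) {
    var best = null;
    var bestDocs = [];
    T.forEach(function (t) {
      if (seq.indexOf(t) !== -1) { return; }
      var candidate = seq.concat([t]);
      var docs = answerSet.filter(function (d) {
        return containsSequence(d, candidate);
      });
      if (docs.length > bestDocs.length) { best = t; bestDocs = docs; }
    });
    if (best === null) { break; }
    var confidence = bestDocs.length / answerSet.length;
    // The first term is chosen by support alone; later terms must pass the threshold.
    if (seq.length > 0 && confidence < minConfidence) { break; }
    seq = seq.concat([best]);
    answerSet = bestDocs;
    family.push(seq.join(" "));
  }
  return family;
}

// Four hypothetical click sessions over pages X, Y, Z, W.
var sessions = [
  ["X", "Y", "Z"],
  ["X", "Y", "Z"],
  ["X", "Y", "W"],
  ["Y", "X"]
];
var clickFamily = growSequence(["X", "Y", "Z"], sessions, 0.6);
```

In this toy data three of the four sessions containing X continue with Y, and two of those three continue with Z, so the family grows to three members at a 0.6 threshold and stops at two members at 0.7.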

ACM Document Engineering 2007 Conference

24 Friday Aug 2007

Posted by egarcia in Conferences

DocEng 2007 will be held at the University of Manitoba, Winnipeg, Canada, from August 28 to 31, 2007.

The ACM Symposium on Document Engineering is an annual international academic conference devoted to the dissemination of research on models, tools and processes that improve our ability to create, manage and maintain documents.

Documents are one of the centerpieces of globally interconnected systems that store information drawn from many media and deliver that information as required by users. A document may be stored in final presentation form or may be generated on-the-fly, undergoing substantial transformations in the process. Documents may include extensive hyperlinks, thereby permitting virtual documents, and also making available structured collections of information on which to anchor automated reasoning, such as promoted through the Semantic Web. Furthermore, document technologies like XML are having a profound impact on data modeling, in part because of the way these technologies bridge and integrate a variety of paradigms.

To attend next week’s conference, or for additional information, visit


http://www.cs.umanitoba.ca/~doceng07/

Applications of Edit Distances

23 Thursday Aug 2007

Posted by egarcia in IR Tools, Programming

After uploading the Levenshtein Edit Distance Tool I received several suggestions for its application. No doubt this is a similarity measure for the masses. Here is the current list.

The Levenshtein Edit Distance Algorithm can be used:

  1. for automatic marking of musical dictations.
  2. for approximate matching of regular expressions.
  3. to identify if two genetic sequences have similar functions.
  4. to filter blocks of email lists (candidate spam addresses) within a Levenshtein edit distance (LED) threshold value.
  5. as the ultimate baby name explorer.
  6. to name products and services like domains, brands, etc.
  7. to conduct fuzzy search matches in EXCEL or your preferred environment.
  8. for spamdexing search engines – by randomly converting text into gibberish.
  9. for spam stemming search engines – by systematically appending edits to valid stems.
  10. as part of a spell checker routine.
  11. to identify duplicated content and plagiarism.

Got an idea, suggestion, or reference? Let me know. Don’t forget to include a link.
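
For readers who want to try some of the applications listed above, here is a minimal sketch of the standard dynamic-programming Levenshtein distance, plus a small threshold filter of the kind used for fuzzy matching. It is not the code behind the tool; the function names `levenshtein` and `fuzzyMatches` are hypothetical.

```javascript
// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, and substitutions needed to turn s into t.
// Uses two rows of the dynamic-programming table instead of the full matrix.
function levenshtein(s, t) {
  var prev = [];
  var curr = [];
  var i, j;
  for (j = 0; j <= t.length; j++) { prev[j] = j; }
  for (i = 1; i <= s.length; i++) {
    curr = [i];
    for (j = 1; j <= t.length; j++) {
      var cost = s.charAt(i - 1) === t.charAt(j - 1) ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Keep only the candidates within maxDistance edits of the query.
function fuzzyMatches(query, candidates, maxDistance) {
  return candidates.filter(function (word) {
    return levenshtein(query, word) <= maxDistance;
  });
}

var distance = levenshtein("kitten", "sitting"); // 3
var close = fuzzyMatches("color", ["colour", "collar", "cooler", "dollar"], 1);
```

The same threshold filter is the basis for the spell-checker and fuzzy-search items in the list: compute the distance to each candidate and keep those below a cutoff.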

Data Mining and Reports on Terrorism

22 Wednesday Aug 2007

Posted by egarcia in Data Mining, Homeland Security

I’m researching the topic of Data Mining (KDD) and Terrorism Information Awareness (TIA) for a graduate course and came across a great old resource:

Data Mining

It is an oldie, but the important part is the references.

It may interest IRs conducting similar research.

Here is another great resource:

Data Mining and Homeland Security

JavaScript Tips

21 Tuesday Aug 2007

Posted by egarcia in Miscellaneous, Programming

This is not a post about an IR topic, but since at some point IR projects resort to programming, I believe the post is relevant to this blog, especially when many IR tools used in a classroom demonstration setting are written in JavaScript.

I’m going through Douglas Crockford’s great video/ppt presentations on JavaScript via http://101out.com/js.php. There are many things average programmers don’t know about JavaScript, the most misunderstood programming language on the planet. For those not familiar with Crockford, a few years ago he pioneered the right way of writing JavaScript. Haven’t heard of JSON?

He gives many great tips in those videos and ppt slides. Here are some of them:

Tip #1

//Instead of

if(a==null) {...} //which does coercion: also true when a is undefined

//do this:

if(a===null){...} //true only when a is exactly null

//Also, instead of != use !==
//Avoid == and != altogether in your code. The === operator compares without type coercion: primitives must have the same type and value, and objects are equal only if both operands are the same object.
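
A small sketch of how the two operators differ in practice. The variable names and the `isMissing` helper are made up for illustration; the comparisons themselves are standard JavaScript behavior.

```javascript
// Loose equality (==) converts operand types before comparing.
var loose = {
  zeroVsEmptyString: 0 == "",        // true
  stringVsNumber: "1" == 1,          // true
  nullVsUndefined: null == undefined // true
};

// Strict equality (===) never converts: different types are never equal.
var strict = {
  zeroVsEmptyString: 0 === "",        // false
  stringVsNumber: "1" === 1,          // false
  nullVsUndefined: null === undefined // false
};

// Primitives of the same type are compared by value.
var primitives = {
  sameString: "abc" === "abc", // true
  sameNumber: 3 === 3.0        // true
};

// Objects are compared by reference, not by contents.
var first = { n: 1 };
var second = { n: 1 };
var alias = first;
var objects = {
  sameContents: first === second, // false: two distinct objects
  sameReference: first === alias  // true: one object, two names
};

// Note that a == null also catches undefined, so when switching to ===
// both cases may need to be spelled out.
function isMissing(a) {
  return a === null || a === undefined;
}
```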

Tip #2

//Instead of

if(a){return a.member;}else{return a;}

//do this, which is shorter:

return a && a.member;
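
The guard idiom wrapped in a function so it can be run as-is. The name `getMember` is hypothetical. Note that when `a` is falsy the expression returns `a` itself, not `false`.

```javascript
// && returns its first operand if that operand is falsy,
// otherwise it returns its second operand.
function getMember(a) {
  return a && a.member;
}

var fromObject = getMember({ member: 42 }); // 42
var fromNull = getMember(null);             // null
var fromUndefined = getMember(undefined);   // undefined
```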

 Tip #3

//Use || to set default values
//do this, which requires less typing:

var last=input||nr_items;

//if input is truthy, last is input, otherwise set last to nr_items
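
One caveat worth a quick sketch: `||` falls back to the default for every falsy value, including `0` and the empty string. The function names below are hypothetical; the second one shows an explicit check for the case where `0` is a legitimate input.

```javascript
// Default via ||: any falsy input (undefined, null, 0, "", NaN, false)
// is replaced by nr_items.
function pickLast(input, nr_items) {
  var last = input || nr_items;
  return last;
}

// Explicit check: only a missing value triggers the default.
function pickLastStrict(input, nr_items) {
  return (input === undefined || input === null) ? nr_items : input;
}

var given = pickLast(5, 10);            // 5
var missing = pickLast(undefined, 10);  // 10
var zeroLoose = pickLast(0, 10);        // 10, the 0 is discarded
var zeroStrict = pickLastStrict(0, 10); // 0
```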

Tip #4

//Statements can have labels. Break statements can refer to labels. Use labels only on do, for, switch, and while.

//do this

loop: for(;;)
{
//do something
if(...){break loop;}
//do something
}
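
A runnable variant of the same pattern, where the label lets an inner loop leave the outer loop in one step. The function name `findCell` and the sample grid are made up for illustration.

```javascript
// Returns [row, column] of the first cell equal to target, or null.
function findCell(matrix, target) {
  var found = null;
  outer: for (var r = 0; r < matrix.length; r++) {
    for (var c = 0; c < matrix[r].length; c++) {
      if (matrix[r][c] === target) {
        found = [r, c];
        break outer; // leaves both loops at once
      }
    }
  }
  return found;
}

var grid = [[1, 2], [3, 4]];
var position = findCell(grid, 3); // [1, 0]
var absent = findCell(grid, 9);   // null
```

Without the label, a plain `break` would only leave the inner loop and the outer loop would keep scanning.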

There are more great tips, but it is better if you assimilate these at your own pace. Time to use literals more often. So,

//it is time to use {} instead of new Object() and [] instead of new Array().

For code conventions for the JavaScript programming language, visit


http://javascript.crockford.com/code.html

I must agree with him that most JavaScript code on the web is crap.

Levenshtein Edit-Distance Based Tool

20 Monday Aug 2007

Posted by egarcia in Data Mining, IR Tools, IR Tutorials

As announced, the Levenshtein Edit-Distance Based Tool is now available at the Mi Islita.com site.

The tool is meant for demonstration purposes, e.g., in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!
