Archive for the ‘Programming’ Category

Jaxer, a new playground for kids and script kiddies

October 14, 2008

We recently featured Jaxer, and end-to-end AJAX server. According to Ian Selby,kids are now discovering how many great things they can do with it.

http://www.gen-x-design.com/categories/jaxer/

Jaxer is one of the best news I heard this year in the programming field. We mentioned this server platform before in http://irthoughts.wordpress.com/2008/09/03/more-good-reasons-for-learning-javascript/

Script kiddies, spammers, and viral marketers, take note.

More Good Reasons for Learning JavaScript

September 3, 2008

If you are one of those folks that dislike JavaScript and do the impossible to avoid it, you probably are living in oblivious and denial. If you are one of those that think the language is just client-side, I have some news for you:

Good reasons for start learning JavaScript are:

1. JAXER

This is the first AJAX Server. HTML, JavaScript, CSS, XMLHTTPRequests, JSON, DOM, scripting, etc are native to JAXER. It has access to databases, files, networking, logging, process management, scalability, security, integration API, PHP, Rubi, etc.

Check http://www.aptana.com/jaxer

2. GOOGLE CHROME BROWSER

Built on JavaScript and other technologies, this is the long awaiting Google’s Browser to rival IE and Firefox’s dominance. How does running native code sound to you?

Check http://www.google.com/googlebooks/chrome/

Also check http://googleblog.blogspot.com/2008/09/fresh-take-on-browser.html

3. ADOBE AIR

Adobe’s Integrated Runtime applications can be developed with AJAX and HTML.

Check http://livedocs.adobe.com/labs/air/1/jslr/

The Porter Stemmer

July 18, 2008

A grad student asks (name omitted):

Dear Dr. Garcia,

I’m interested in developing a Porter Stemmer for the Irish language.

Would it be possible to send me your lecture notes for Porter Stemmer
development from your graduate course?

I am doing an MSc thesis on developing a search engine for TEI marked up
multilingual texts and hope to use Apache Lucene as a basis.

Thanks for any help,

******
MSc student,
UCC, Cork, Ireland.

Thank you for reading this blog and for emailing me, but I normally don’t release lecture notes. However, the lecture was based on Martin Porter’s site, which can be accessed by visiting the following link:

http://tartarus.org/~martin/PorterStemmer/

You might also want to check the Porter2 Stemmer:
http://snowball.tartarus.org/algorithms/english/stemmer.html

For stemmers in other languages, check the Snowball site:
http://snowball.tartarus.org/

The great thing about the Porter stemmer is that it has been written in many programming flavors and languages.

Search Engines Architecture Week 7

April 25, 2008

Week 7 Agenda

Lecture Session

Review of a typical
Search Engine Architecture
Regular Expressions for Building Porter’s Stemmer

Lab Session

Finishing Lab 3 and 4

Google-IBM-NSF Partnership in Data-Intensive Computing

February 28, 2008

We received few days ago the following Call for Proposals:

Dear Colleague:

We have some exciting news to share with all of you: NSF is partnering with Google and IBM to explore data-intensive computing. Through NSF’s reach, Google and IBM are providing software and services running on a large cluster to the broad academic community to explore innovative research and education ideas in data-intensive computing. Google and IBM launched the Academic Cluster Computing Initiative [1] last October with instructional programs at six pilot universities, and the NSF will be joining this initiative as the first research-oriented pilot partner. We are calling the NSF program to provide access to these types of resources the Cluster Exploratory (CluE).

Please read more about data-intensive computing, NSF’s partnership with Google and IBM, and the upcoming CISE solicitation for CluE.

Jeannette M. Wing, Assistant Director, Computer and Information Science and Engineering Directorate Dan Atkins, Director, Office of Cyberinfrastructure

Data-Intensive Computing

Data-intensive computing is a computational paradigm in which the sheer volume of data is the dominant performance parameter. Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. For example, Google runs an average of 100,000 MapReduce jobs per day on its clusters, processing over 20 petabytes daily [2]. This scale of computing effectively supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. The architecture is extremely fault-tolerant and exhibits high degrees of reliability and availability.

Data intensive computing raises important research challenges:

O For science
  - What are the fundamental capabilities and limitations of this paradigm?
  - What new programming abstractions (including models, languages, algorithms) does this computational model suggest?

O For technology
  - How can we automatically manage the hardware and software of these systems?
  - How can we reduce their power consumption?
O For society
  - What (new) applications can best exploit this computing paradigm?

Data-intensive computing is at the forefront of ultra-large-scale commercial data processing. A July 2006 New York Times article [3] notes that “Google, Microsoft and Yahoo are spending vast sums of capital to build out their computing capabilities.” Not only is there an increasing need for advances in data-intensive computing systems software and hardware, but also an increasing demand for a trained workforce to operate and use these systems. To date, however, the academic community has had limited access to such systems.

Enter Google and IBM

On October 8, 2007, Google and IBM announced they had teamed to provide six universities access to a large-scale computing cluster together with the software and services to use it effectively [4]. After several months of discussions, the NSF will be joining this initiative and will be partnering with Google and IBM to broaden the reach of this powerful computing resource to foster more innovation than might be possible in the initial pilot.

Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do fundamental, disruptive research in data-intensive computing and to explore powerful new applications. This facility can also serve as a tool for educating the next generation of scientists and engineers. This partnership is an excellent example of an academic-industry-government relationship that is a win-win-win situation for all.

System Description

The Google-IBM cluster contains well over a thousand processors connected to terabytes of memory and hundreds of terabytes of storage with internal networking as well as a substantial external network connection. The system will be configured with open source software to include Linux and Apache Hadoop [5] – a large-scale distributed computing platform inspired by Google’s MapReduce [6] and the Google File System [7]. IBM’s Tivoli [8] software will also be used for management, monitoring and dynamic resource provisioning of the cluster.

The system will provide a powerful resource for large-scale data analysis, mining and visualization in addition to support for Internet-scale computing applications. Tutorial information describing the programming environment of the Google-IBM academic cluster available via the CluE program can be found on the Google Code for Educators website [9]. Much of this material was developed in collaboration with the University of Washington, and all of it is available under permissive licenses such as the Creative Commons Attribution License.

Upcoming Solicitation

CISE is currently developing a program solicitation that will invite researchers to submit proposals requesting allocations of the Google-IBM cluster for any new, innovative use of the system and to probe the possibilities and fundamental limits of this new computing paradigm. The emphasis of the program will be to develop new approaches and applications that are outside the typical high-performance computing applications running on today’s supercomputers.

The challenge to the academic community is three-fold: to use existing tools and to develop new programming abstractions for such a “computer” to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications. This resource will also provide an opportunity to teach students how to build, use and manage data-intensive computing systems systems that are already being used widely in industry, but are largely unavailable to students and faculty today.

Please look for the new solicitation which will be posted on the CISE web site. CISE looks forward to your bold, creative proposals for CluE!

[1] Official Google Blog: http://googleblog.blogspot.com/2007/10/let-thousand-servers-bloom.html

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. of the ACM, 51(1), January 2008, pp. 107-113.

[3] http://www.nytimes.com/2006/06/14/technology/14search.html?_r=1&pagewanted=2&oref=slogin

[4] See http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html or http://www-03.ibm.com/press/us/en/pressrelease/22414.wss for the text of the press release.

[5] http://hadoop.apache.org/

[6] http://labs.google.com/papers/mapreduce.html

[7] http://labs.google.com/papers/gfs.html

[8] http://www.ibm.com/software/tivoli/

[9] http://code.google.com/edu/content/parallel.html

SIMCALC, Binary Similarity Calculator

January 2, 2008

SIMCALC, my binary similarity calculator is officially online.

The calculator was designed to compute similarity measures between any two binary strings of identical length. To use the calculator users must be familiar with Data Mining and similarity analysis. Read the instructions. Since strings are treated as vectors, the tool also works as a vector analyzer.

Who could benefit from this tool?

Scholars

IR/Statistic teachers, students, and researchers can use this calculator for classroom demonstrations or to compare results or exams of the Right (1), Wrong (0) type.

Investigators

Investigators and testers can use it to examine possible cases of duplicated content, fraud, or plagiarism.

Marketers

Marketing and sales executives can use the tool to score consumers’ satisfaction questionnaires of the Yes (1), No (0) type.

Business Intelligence Analysts

Analysts can use it to extract patterns and correlations from polls, surveys, and similar intelligence instruments.

The following figure depicts SIMCALC sample results for the 1010101 and 1010101 vectors:

Binary Similarity Calculator

Terrier IR 1.1.1 Update

November 9, 2007

Craig MacDonald, from the Terrier project at University of Glasgow sent me this update few days ago.

“Terrier, IR Platform v 1.1.1 – 24/10/2007. http://ir.dcs.gla.ac.uk/terrier/

Terrier, the open source IR platform from the University of Glasgow has been updated to version 1.1.1.

This is a minor update, which contains mostly bug fixes. Some minor code enhancements and a test harness are also included. Moreover, the Snowball stemmers were added to boost support for languages other than English.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL) and is available from the Terrier website: http://ir.dcs.gla.ac.uk/terrier

Disgression to Numerical Dynamics and Chaos

September 27, 2007

I recently reviewed a thesis project wherein the student used recursion to minimize a function. The student used number of iterations as a stopping criterion. When I revised the manuscript, it gave me a flashback to 1991.

Back then I was inmersed into Numerical Dynamics and Chaos Theory conferences. I wrote a paper wherein one little component consisted in stopping a routine if it reaches 10,000 iterations. The piece landed in the hands of a picky reviewer. It was rejected on the grounds that number of iterations was not a good stopping criterion.

The reviewer was absolutely right. I missed the point back then, though.

Why use 1,000, 10,000, or 1,000,000 iterations? The point is that no matter how many iterations one uses there will always be someone out there asking: Why use n number of iterations? Why not more?

Indeed, number of iterations as a stopping criterion is not recommended, especially if the goal is optimization. Not to mention that this is a subjective approach which requires initializing an extra parameter. The parameter might be sensitive to other variables or initial conditions.

Instead of number of iterations, a better approach consists in using a comparative function F. F can be defined as the absolute error or absolute relative error between a sequence of results.

If F is less than a threshold value, we stop the iterative procedure; otherwise, we continue with it.

Such functions are given below:

F = | (X(n) – X(n+1)) | * 100 < threshold value
F = | (X(n) – X(n+1))/X(n+1) | * 1000 < threshold value

and so forth.

For example if F is less than, let say, 10^-6, we stop the recursions.

In this case, F is based on an objective, statistical criterion: the relative error between two consecutive results. Note that:

1. the analyst states the threshold value that best sweets his/her precision needs.

2. F is independent of any initial condition or parameter.

Of course, for this to work the recursion should be toward what we call in Chaos Theory, an attractive fixed point. Methods for evaluating if this is the case do exist in Numerical Dynamics (e.g., the Fixed Point Theorem).

Fortunately, the student’s problem was not of this kind and he was not dealing with a dynamical system at all.

Binary Similarity Calculator

September 17, 2007

In my previous post I mentioned the Hamming Distance. I was thinking in adding to my Levenshtein Edit Distance calculator the ability to compute this statistic, but I changed my mind.

Rather, why not design a whole new tool that computes, in addition to the Hamming Distance, other coefficients?

So,…

Welcome to the Binary Similary Calculator and Vector Analyzer.

This is a new tool I’ve built over the weekend. It will be uploaded in the next few days.

The tool compares the similarity of a pair of strings. These should consist of binary characters and be of the same length. Since strings are treated as vectors, the tool also works as a vector analyzer; hence, its name.

The following similarity measures are computed from a contingency table of positive and negative matches and mismatches: Sokal-Michener (i.e., Simple Matching), Jaccard, Russell-Rao, Hamann, Sorensen (i.e., Dice), antiDice, Sneath-Sokal, Rodger-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson, and Gower2.

The tool also computes Dot Products and Cosine Coefficients.

Although a distance metric, the Hamming Distance (number of positive and negative mismatches) has been included because of its close relation with some of the aforementioned similarity measures.

Hamming Distance

September 14, 2007

After uploading the Levenshtein Edit Distance tool, several readers asked if I could incorporate a version to account for the Hamming Distance (HMD). Some even asked if could clarify the difference between the two.

I’m not sure if I can find time for the former, but many scripts are already online for computing Hamming Distances. One only needs to write few lines of code to compare the number of different characters between strings of same length. This is the Hamming Distance.

Here is a nice exercise:

Compute LED and HMD for derivatives of “ELVIS”. Cluster results by LED and HMD values.

Brute Force Algorithm

September 10, 2007

The brute force algorithm (BF) consists in evaluating a text stream and determining at all positions between 0 and n-m, whether an occurrence of a pattern starts there or not. After each attempt, it shifts the window pattern by exactly one position to the right. This requires of a constant extra space. Comparisons can be done in any order.

This is quite a brute force approach, hence its name. Two loops are required for its implementation. A C code is given below.

void BF(char *x, int m, char *y, int n)
{
int i, j; 

for(j = 0; j <= n - m; ++j) 

{ 

for (i = 0; i < m && x[i] == y[i + j]; ++i);if (i >= m){output(j);} 

} 

}

This code can be converted straightforward to JavaScript or PHP.
 Modern Information Retrieval, Chapter 8, p. 210, has additional information on this and derivative string matching algorithms. These all differ in the way they check and shift the window.

Applications of Edit Distances

August 23, 2007

After uploading the Levenshtein Edit Distance Tool I received several recommendations for its implementation. No doubt that this is a simlarity measure for the masses. Here is a current list.

The Levenshtein Edit Distance Algorithm can be used:

  1. for automatic marking of musical dictations.
  2. for regular expressions approximate matching.
  3. to identify if two genetic sequences have similar functions.
  4. to filter blocks of email lists (candidate spam addresses) within a LED threshold value.
  5. as the ultimate baby name explorer.
  6. to name products and services like domains, brands, etc.
  7. to conduct fuzzy search matches in EXCEL or your preferred environment.
  8. for spamdexing search engines – by randomly converting text into gibberish.
  9. for spam stemming search engines – by systematically appending edits to valid stems.
  10. as part of a spell checker routine.
  11. to identify duplicated content and plagiarism.

Got an idea, suggestion, or reference? Let me know. Don’t forget to include a link.

JavaScript Tips

August 21, 2007

This is not a post about an IR topic, but since at some point IR projects resource to programming, I believe the post is relevant to this blog –especially when many IR tools used in a classroom demonstration setting are written in JavaScript.

I’m reading Douglas Crockford great video/ppt presentations on JavaScript via http://101out.com/js.php. There are many things average programmers don’t know about JavaScript, the most misunderstood programming language on the Planet. For those not familiar with Crockford, few years ago he pioneered the right way of writing JavaScript. Haven’t heard of JSON?

He is giving so much great tips in those videos and ppt slides. Here are some tips:

Tip #1

//Instead of

if(a==null) {...} //which does coercion

//do this:

if(a===null){...}

//Also instead of != use !==//Avoid altogether == and != in your code. The === operator compares objects references, not values. It is true only if both operands are the same object

Tip #2

//Instead ofif(a){return a.member;}else{return a;}

//do this, which is shorter:

return a && a.member;

 Tip #3

//Use || to set default values
//do this, which requires less typing:

var last=input||nr_items;

//if input is truthy, last is input, otherwise set last to nr_items

Tip #4

//Statements can have labels. Break statements can refer to labels. Use labels only on do, for, switch, and while.

//do this

loop: for(;;)
{
//do something
if(...){break loop;}
//do something
}

There are more great tips, but is better if you assimilate these at your own pace. Time to use literals more often. So,

//it is time to use () instead of new Object() and [] instead of new Array().

For code conventions for the JavaScript programming Language visit

http://javascript.crockford.com/code.html

I must agree with him that most JavaScript code on the web is crap.