Archive for the ‘IR Tools’ Category

SIMCALC, Binary Similarity Calculator

January 2, 2008

SIMCALC, my binary similarity calculator, is officially online.

The calculator was designed to compute similarity measures between any two binary strings of identical length. To use the calculator, users should be familiar with data mining and similarity analysis, so read the instructions first. Since strings are treated as vectors, the tool also works as a vector analyzer.

Who could benefit from this tool?

Scholars

IR/Statistics teachers, students, and researchers can use this calculator for classroom demonstrations or to compare results or exams of the Right (1), Wrong (0) type.

Investigators

Investigators and testers can use it to examine possible cases of duplicated content, fraud, or plagiarism.

Marketers

Marketing and sales executives can use the tool to score consumers’ satisfaction questionnaires of the Yes (1), No (0) type.

Business Intelligence Analysts

Analysts can use it to extract patterns and correlations from polls, surveys, and similar intelligence instruments.

The following figure depicts SIMCALC sample results for the 1010101 and 1010101 vectors:

[Figure: Binary Similarity Calculator]

Binary Similarity Calculator

September 17, 2007

In my previous post I mentioned the Hamming Distance. I was thinking of adding the ability to compute this statistic to my Levenshtein Edit Distance calculator, but I changed my mind.

Rather, why not design a whole new tool that computes, in addition to the Hamming Distance, other coefficients?

So,…

Welcome to the Binary Similarity Calculator and Vector Analyzer.

This is a new tool I’ve built over the weekend. It will be uploaded in the next few days.

The tool compares the similarity of a pair of strings. These should consist of binary characters and be of the same length. Since strings are treated as vectors, the tool also works as a vector analyzer; hence, its name.

The following similarity measures are computed from a contingency table of positive and negative matches and mismatches: Sokal-Michener (i.e., Simple Matching), Jaccard, Russell-Rao, Hamann, Sorensen (i.e., Dice), antiDice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson, and Gower2.
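As a rough illustration of how such a contingency table can be built, here is a minimal Python sketch. It is not the calculator's actual code. The names `contingency`, `ratio`, and `similarities` are hypothetical, the cell labels a, b, c, d follow the usual textbook convention, and only four of the listed coefficients are shown, using their standard definitions.

```python
def contingency(x, y):
    """Return (a, b, c, d) for two equal-length binary strings.

    a: positions where both strings have 1 (positive matches)
    b: positions where x has 1 and y has 0 (mismatches)
    c: positions where x has 0 and y has 1 (mismatches)
    d: positions where both strings have 0 (negative matches)
    """
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    if set(x + y) - {"0", "1"}:
        raise ValueError("strings must contain only 0 and 1")
    pairs = list(zip(x, y))
    return (pairs.count(("1", "1")), pairs.count(("1", "0")),
            pairs.count(("0", "1")), pairs.count(("0", "0")))


def ratio(numerator, denominator):
    """Divide, returning NaN when the coefficient is undefined."""
    return numerator / denominator if denominator else float("nan")


def similarities(x, y):
    """Compute a few coefficients from the contingency table."""
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    return {
        "simple_matching": ratio(a + d, n),       # Sokal-Michener
        "jaccard": ratio(a, a + b + c),
        "russell_rao": ratio(a, n),
        "dice": ratio(2 * a, 2 * a + b + c),      # Sorensen
    }


print(similarities("1100", "1010"))
```

Returning NaN for an undefined coefficient (for example, Jaccard on two all-zero strings) is a design choice of this sketch; a real tool might report an error instead.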

The tool also computes Dot Products and Cosine Coefficients.
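For readers who want to check these two quantities by hand, a minimal Python sketch follows. The helper names `to_vector`, `dot`, and `cosine` are hypothetical and are not taken from the tool itself.

```python
import math


def to_vector(bits):
    """Turn a binary string such as '1010' into a list of integers."""
    return [int(ch) for ch in bits]


def dot(u, v):
    """Dot product of two vectors of the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(p * q for p, q in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(u, v) / norm


u, v = to_vector("1100"), to_vector("1010")
print(dot(u, v), cosine(u, v))
```

For binary vectors the dot product simply counts the positive matches, and the cosine takes the same value as the Ochiai coefficient.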

Although a distance metric, the Hamming Distance (number of positive and negative mismatches) has been included because of its close relation with some of the aforementioned similarity measures.
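One such relation can be written out explicitly. The symbols below are assumed notation, not the tool's: a and d count the positive and negative matches, b and c count the two kinds of mismatches, and n is the string length.

```latex
\begin{aligned}
n &= a + b + c + d \\
\mathrm{HMD} &= b + c \\
S_{\text{Sokal-Michener}} &= \frac{a + d}{n} = 1 - \frac{\mathrm{HMD}}{n}
\end{aligned}
```

In other words, the Simple Matching coefficient is one minus the Hamming Distance divided by the string length.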

Hamming Distance

September 14, 2007

After uploading the Levenshtein Edit Distance tool, several readers asked if I could incorporate a version to account for the Hamming Distance (HMD). Some even asked if I could clarify the difference between the two.

I’m not sure if I can find time for the former, but many scripts are already online for computing Hamming Distances. One only needs to write a few lines of code to count the positions at which two strings of the same length differ. This count is the Hamming Distance.
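To show how little code is involved, here is a minimal Python sketch. The function name `hamming_distance` is hypothetical and is not taken from any of the online scripts mentioned.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of the same length")
    return sum(ch_s != ch_t for ch_s, ch_t in zip(s, t))


print(hamming_distance("ELVIS", "LIVES"))
```

Unlike an edit distance, this only compares characters position by position, so it is undefined for strings of different lengths; the sketch raises an error in that case.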

Here is a nice exercise:

Compute LED and HMD for derivatives of “ELVIS”. Cluster results by LED and HMD values.

Applications of Edit Distances

August 23, 2007

After uploading the Levenshtein Edit Distance Tool I received several suggestions for its application. No doubt this is a similarity measure for the masses. Here is the current list.

The Levenshtein Edit Distance Algorithm can be used:

  1. for automatic marking of musical dictations.
  2. for approximate matching of regular expressions.
  3. to identify if two genetic sequences have similar functions.
  4. to filter blocks of email lists (candidate spam addresses) within a LED threshold value.
  5. as the ultimate baby name explorer.
  6. to name products and services like domains, brands, etc.
  7. to conduct fuzzy search matches in EXCEL or your preferred environment.
  8. for spamdexing search engines - by randomly converting text into gibberish.
  9. for spam stemming search engines - by systematically appending edits to valid stems.
  10. as part of a spell checker routine.
  11. to identify duplicated content and plagiarism.

Got an idea, suggestion, or reference? Let me know. Don’t forget to include a link.

Levenshtein Edit-Distance Based Tool

August 20, 2007

As announced, the Levenshtein Edit-Distance Based Tool is now available at the Mi Islita.com site.

The tool is meant for demonstration purposes; e.g., in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!

Upcoming Tool on Edit Distances

August 17, 2007

I’m working on a tool for computing edit distances (number of insertions, deletions, and substitutions) in a text stream.

It will be up and running this Monday. It is great for a hands-on tutorial.

Did you know that changing Democrats into Republicans, and vice versa, requires just 8 edits? :)
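Anyone who wants to check that figure can use the standard dynamic-programming formulation of the Levenshtein distance. The Python sketch below is a minimal version of that textbook algorithm, not the code behind the tool. The function name `levenshtein` is hypothetical, and the comparison is case-sensitive.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn source into target (case-sensitive)."""
    # previous[j] holds the distance between the first i-1 characters
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        current = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            current.append(min(
                previous[j] + 1,         # delete s_ch
                current[j - 1] + 1,      # insert t_ch
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Democrats", "Republicans"))  # prints 8
```

Only two rows of the table are kept in memory at a time, which is enough when just the distance is needed and not the edit sequence itself.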
