Archive for the ‘IR Tools’ Category

Microsoft, Inter-Metro to Co-Launch a MIC

April 29, 2009

This afternoon, Microsoft, in partnership with The Interamerican University of Puerto Rico, Metropolitan Campus (Inter-Metro), will announce that they are officially co-launching the Microsoft Innovation Center (MIC) of Puerto Rico.

This will be the first MIC in the region. A two-story building has been fitted out within the Inter-Metro campus for this project. As a member of the MIC steering committee, I have been invited to the presentation by President Manuel J. Fernos.

They have also provided me with office and lab space in the MIC building to put together the Internet Business Development Center (IBDC). The objective of the MIC is the development and commercialization of ecommerce-related software tools. Emphasis will be given to egovernment and ebusiness solutions.

It looks like I will split my schedule between being the IBDC principal investigator, MIC meetings, doing research at Inter-Metro, teaching at PUPR, and writing IRWs. This is exciting news. Let's see how things go, especially with the other great news that PUPR’s ECE&CS department has been accredited by the NSA as a CAE.

When OR Clusters Behave as AND

January 30, 2009

We are currently doing some testing with a new experimental engine. The experiment consists of using OR as the default mode and IDF only for scoring terms. IDF is precomputed straight from the inverted index, which is also computed at query time. We are also trying to replace IDF with Entropy scores.

With large collections, the inverted index is written to a text file and read at query time.

Since local information (e.g., term frequency) is ignored, keyword spam is not an issue.

Instead of a Vector Space Model, we use a cumulative sum of IDF scores, so that it is not necessary to compute cosine similarities (*).
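To make the scoring idea concrete, here is a minimal sketch in Python. It is not the experimental engine itself: the toy collection, the function names, and the log(N/df) form of IDF are assumptions made only for illustration. Each document is scored by adding up the IDF of every distinct query term it contains, with term frequency ignored.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def idf_scores(index, n_docs):
    """IDF read straight from the index (assumed form: log(N / df))."""
    return {term: math.log(n_docs / len(postings))
            for term, postings in index.items()}


def search_or(query, index, idf):
    """OR mode: every document matching at least one query term is scored.

    The score is the cumulative sum of the IDF of each distinct query
    term found in the document; term frequency plays no role.
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += idf[term]
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {  # hypothetical toy collection
    "d1": "apple banana",
    "d2": "apple apple apple",
    "d3": "banana cherry",
    "d4": "cherry",
}
index = build_inverted_index(docs)
idf = idf_scores(index, len(docs))
ranking = search_or("apple banana", index, idf)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In this toy run the only document containing both query terms lands at the top, and repeating "apple" three times in d2 earns it nothing extra over d3.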

So far the experiment shows that with multi-term queries two extreme clusters are obtained:

1. the top N ranked documents behave almost as if queried in AND mode and as if obeying the Cluster Hypothesis.

2. the M ranked documents at the bottom behave as if queried either in EXACT mode or with a single-term query. (**)

Between these extremes we have some noisy results.

If anyone has tried this before, we would love to hear about it. Contact us by email.

 

PS.

(*) In this way we don’t need to make independence assumptions.

(**) With a few changes, M now behaves as if queried with single-term queries or with few query terms, which is what we expected. The N set is still the more interesting one. The middle cases are now quite noisy.

A JavaScript Client-Side Search Engine Challenge

January 15, 2009

Last year, a student taking my Search Engine Architecture course asked me if client-side search utilities (such as JavaScript-based tools) could be used to grasp how search engines process queries.

My answer is given below:

Well, it depends on how these are scripted.

With few exceptions, most of those scripts teach you how to manipulate objects, constructors, etc.

The vast majority of these do not teach how to build the architecture of a search engine, such as its inverted index, or how, at search time, the inverted index is consulted and records are retrieved. Most so-called “javascript search engines” do not incorporate the creation of pseudo-documents, procedures for normalizing queries/urls, attenuating term frequencies, a valid ranking algorithm, crawling agents, dispatcher, query server, etc.

Many so-called “search engines” are just oversized site-search tools with a poor sorting subroutine that lacks relevance judgments or valid ranking criteria.

Thus, it seems like a nice student term paper project or challenge competition: build a 100% client-side search tool mimicking as many of the architectural components of a real search engine as possible, with nothing more than JavaScript and the IE browser. Here 100% means no ‘frankensteins’ like mixing JavaScript with other programming languages, additional platforms, or plug-ins.

Search Interface Usability Issues

August 5, 2008

Average search engine users don’t reformulate queries as we do in IR. They often recycle their query terms, using short queries of 2 to 3 terms. Frequently, their search sessions form ‘query chains’ and use the default search mode.

Unless they are doing an advanced search, most do not use query operators or shortcut search commands. Many do not consult a lookup list, thesaurus, or query logs to expand a query as we do in IR. Most don’t keep expanding a query. After a few sessions they simply move on to another Web resource that might satisfy their information requirements.

Most don’t care about searching for very rare terms or terms with a high discriminative power, preferring to search for ‘what is hot’ or for whatever terms meet their information needs. Period.

Most are lazy users whose mentality is: “Don’t make me think!” or “I’m too busy to deal with a cluttered interface or learn new how-tos”. Many are so lazy or busy that they don’t even scroll down a page. Others have a blue-linker mentality; i.e., they assume that blue underlined text has to be a link.

Thus, when a search interface is designed it should take into account the user’s search behavior, shaped mentality, and prejudgments. Their search experience should be guided by intuition and should be obvious, without requiring extra information in order to search and find relevant documents.

Search engines that do not provide users with a ‘lazy search experience’ often do not attract enough users, visitors, or advertisers. I hope to be wrong about this one, but the two new search engines, Cuil and SearchCloud, most likely will not make the A-List, in part because of several usability issues.

Size and hype are not enough.

PS. I forgot to mention that:

1. of the two above, Cuil is a bit more user friendly, but its entry page/result page design is awful.

2. “search interface” to me means anything the end-user must interact with in order to search and find. Among other things, this includes the query box, entry page, and the results page.

SIMCALC, Binary Similarity Calculator

January 2, 2008

SIMCALC, my binary similarity calculator, is officially online.

The calculator was designed to compute similarity measures between any two binary strings of identical length. To use the calculator, users must be familiar with Data Mining and similarity analysis. Read the instructions. Since strings are treated as vectors, the tool also works as a vector analyzer.

Who could benefit from this tool?

Scholars

IR/Statistics teachers, students, and researchers can use this calculator for classroom demonstrations or to compare results or exams of the Right (1), Wrong (0) type.

Investigators

Investigators and testers can use it to examine possible cases of duplicated content, fraud, or plagiarism.

Marketers

Marketing and sales executives can use the tool to score consumers’ satisfaction questionnaires of the Yes (1), No (0) type.

Business Intelligence Analysts

Analysts can use it to extract patterns and correlations from polls, surveys, and similar intelligence instruments.

The following figure depicts SIMCALC sample results for the 1010101 and 1010101 vectors:

[Figure: Binary Similarity Calculator]

Binary Similarity Calculator

September 17, 2007

In my previous post I mentioned the Hamming Distance. I was thinking of adding to my Levenshtein Edit Distance calculator the ability to compute this statistic, but I changed my mind.

Rather, why not design a whole new tool that computes, in addition to the Hamming Distance, other coefficients?

So,…

Welcome to the Binary Similarity Calculator and Vector Analyzer.

This is a new tool I’ve built over the weekend. It will be uploaded in the next few days.

The tool compares the similarity of a pair of strings. These should consist of binary characters and be of the same length. Since strings are treated as vectors, the tool also works as a vector analyzer; hence, its name.

The following similarity measures are computed from a contingency table of positive and negative matches and mismatches: Sokal-Michener (i.e., Simple Matching), Jaccard, Russell-Rao, Hamann, Sorensen (i.e., Dice), antiDice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson, and Gower2.

The tool also computes Dot Products and Cosine Coefficients.

Although a distance metric, the Hamming Distance (number of positive and negative mismatches) has been included because of its close relation with some of the aforementioned similarity measures.
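As a rough illustration of how such coefficients fall out of the contingency table, here is a minimal Python sketch covering only a handful of them. It is not the tool's code: the function names and the choice to return NaN for undefined ratios are assumptions. The counts are a (1,1 positive matches), b (1,0 mismatches), c (0,1 mismatches), and d (0,0 negative matches).

```python
import math


def contingency(x, y):
    """Count (a, b, c, d) for two binary strings of identical length."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    if set(x + y) - {"0", "1"}:
        raise ValueError("strings must contain only 0 and 1")
    a = sum(p == "1" and q == "1" for p, q in zip(x, y))  # positive matches
    b = sum(p == "1" and q == "0" for p, q in zip(x, y))  # mismatches (1,0)
    c = sum(p == "0" and q == "1" for p, q in zip(x, y))  # mismatches (0,1)
    d = sum(p == "0" and q == "0" for p, q in zip(x, y))  # negative matches
    return a, b, c, d


def _ratio(num, den):
    """Return num/den, or NaN when the coefficient is undefined."""
    return num / den if den else float("nan")


def similarities(x, y):
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    return {
        "simple_matching": _ratio(a + d, n),           # Sokal-Michener
        "jaccard": _ratio(a, a + b + c),
        "russell_rao": _ratio(a, n),
        "dice": _ratio(2 * a, 2 * a + b + c),          # Sorensen
        "ochiai": _ratio(a, math.sqrt((a + b) * (a + c))),  # cosine for binary vectors
        "dot_product": a,
        "hamming": b + c,                              # total mismatches
    }


for name, value in similarities("1100", "1010").items():
    print(name, value)
```

Note how the Hamming Distance is just b + c, the same mismatch counts that appear in the denominators of Jaccard and Dice, which is the close relation mentioned above.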

Hamming Distance

September 14, 2007

After uploading the Levenshtein Edit Distance tool, several readers asked if I could incorporate a version to account for the Hamming Distance (HMD). Some even asked if I could clarify the difference between the two.

I’m not sure if I can find time for the former, but many scripts are already online for computing Hamming Distances. One only needs to write a few lines of code to count the positions at which two strings of the same length differ. This is the Hamming Distance.
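Those few lines might look like the following Python sketch. The function name and the sample words are only illustrative.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(a != b for a, b in zip(s, t))


# Same-length rearrangements and variants of a five-letter word.
for word in ("ELVIS", "ELVES", "LIVES", "LEVIS"):
    print(word, hamming_distance("ELVIS", word))
```

Unlike an edit distance, this only compares characters position by position, so it is undefined for strings of different lengths.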

Here is a nice exercise:

Compute LED and HMD for derivatives of “ELVIS”. Cluster results by LED and HMD values.

Applications of Edit Distances

August 23, 2007

After uploading the Levenshtein Edit Distance Tool I received several recommendations for its implementation. No doubt this is a similarity measure for the masses. Here is a current list.

The Levenshtein Edit Distance Algorithm can be used:

  1. for automatic marking of musical dictations.
  2. for approximate regular expression matching.
  3. to identify if two genetic sequences have similar functions.
  4. to filter blocks of email lists (candidate spam addresses) within a LED threshold value.
  5. as the ultimate baby name explorer.
  6. to name products and services like domains, brands, etc.
  7. to conduct fuzzy search matches in EXCEL or your preferred environment.
  8. for spamdexing search engines – by randomly converting text into gibberish.
  9. for spam stemming search engines – by systematically appending edits to valid stems.
  10. as part of a spell checker routine.
  11. to identify duplicated content and plagiarism.

Got an idea, suggestion, or reference? Let me know. Don’t forget to include a link.

Levenshtein Edit-Distance Based Tool

August 20, 2007

As announced, the Levenshtein Edit-Distance Based Tool is now available at the Mi Islita.com site.

The tool is meant for demonstration purposes; e.g., as in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!

Upcoming Tool on Edit Distances

August 17, 2007

I’m working on a tool for computing edit distances (number of insertions, deletions, and substitutions) in a text stream.

It will be up and running this Monday. It is great for a hands-on tutorial.

Did you know that changing Democrats into Republicans, and vice versa, requires just 8 edits? :)
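For readers who want to check that count themselves, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. It is not the code behind the tool; the function name is an assumption, and the comparison is case-sensitive.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn source into target (case-sensitive)."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string: i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Democrats", "Republicans"))  # 8
print(levenshtein("Republicans", "Democrats"))  # 8
```

Only two rows of the table are kept at a time, which is enough when just the distance, not the edit sequence, is needed.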
