A Tutorial on the Levenshtein Distance



A short tutorial on the Levenshtein Distance is available now at


Did you know that Levenshtein Distance is at the heart of sequence analysis and text mining-based technologies? It is so simple, elegant, and relevant to many research fields.
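
To make the idea concrete, here is a minimal sketch of the classic dynamic programming approach to the Levenshtein Distance. It assumes the standard definition with unit costs for insertions, deletions, and substitutions. The function name levenshtein is ours, chosen for illustration, and is not taken from the tutorial.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as b

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the distance table are kept in memory at any time, which is enough when you need the distance itself rather than the sequence of edits.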

Levenshtein Distance Calculator



The Levenshtein Distance Calculator is back. This tool was removed from our old site, but is now available at


This visual, interactive tool is great for sequence analysis, text mining, and teaching. A tutorial listing practical applications will soon follow.

A Tutorial on Distance and Similarity



The first of a series of companion tutorials for some of our tools is available now at http://www.minerazzi.com/tutorials/index.php

In this tutorial we present a general overview of two association measures used in data mining and information retrieval: distance and similarity. To learn the difference between the two, visit


Two Essential Tools for Data Miners



We are moving two tools from our old site to Minerazzi.com. Both are essential to data miners interested in comparing data sets:

1. The Binary Distance Calculator, available at http://www.minerazzi.com/tools/distance/binary-distance-calculator.php
2. The Binary Similarity Calculator, available at http://www.minerazzi.com/tools/similarity/binary-similarity-calculator.php

These can also be used by teachers and others who need to score or grade any two data sets in terms of their dissimilarities (distances) and resemblances (similarities). The sets to be compared must be binary and of the same size.
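
As a rough illustration of scoring two binary sets, the sketch below computes one common distance (Hamming) and two common similarities (simple matching and Jaccard). These particular measures are our assumption for the example; the calculators may use other definitions. All function names are hypothetical.

```python
def check_binary_pair(x, y):
    """Enforce the two requirements: same size, and binary (0/1) values."""
    if len(x) != len(y) or not x:
        raise ValueError("sets must be non-empty and of the same size")
    if any(v not in (0, 1) for v in (*x, *y)):
        raise ValueError("sets must be binary (0 or 1)")


def hamming_distance(x, y):
    """Number of positions where the two sets disagree."""
    check_binary_pair(x, y)
    return sum(a != b for a, b in zip(x, y))


def simple_matching(x, y):
    """Fraction of positions where the two sets agree."""
    check_binary_pair(x, y)
    return sum(a == b for a, b in zip(x, y)) / len(x)


def jaccard_similarity(x, y):
    """Shared 1s divided by positions where at least one set has a 1."""
    check_binary_pair(x, y)
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 1.0  # convention: two all-zero sets match


x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(hamming_distance(x, y))    # 2
print(simple_matching(x, y))     # 0.6
print(jaccard_similarity(x, y))  # 0.5
```

Note that simple matching counts shared 0s as agreement while Jaccard ignores them, so the two similarities can differ for the same pair of sets.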

We plan to add more distance and similarity definitions to the tools. We are also working on a companion tutorial for these tools, to help users understand the difference between the above concepts.

The Library Recrawl Project



The Library Recrawl Project (http://www.minerazzi.com/lrp) is a new miner built with Minerazzi. It allows users to recrawl all of the world's top libraries, their catalogs, and information gateways. Users can search inside results and uncover vast amounts of resources. To search, users can enter a library name, keywords, or country.

Search Examples

Query [ library of congress ], [ world catalog ], [ national archives ], [ national libraries ], [ public libraries ], [ state libraries ], and so on. Or, if you prefer, search by country.

By recursively searching inside results with our Search Inside tool, you will discover entry points to vast amounts of new resources (libraries, catalogs, etc.). Have fun.

Retro Searches

Query [ z3950 ] to find libraries around the world that use the old Z39.50 search implementation.

Wikiminer: Mining Wikileaks



Wikiminer is a new miner built with Minerazzi (http://www.minerazzi.com/wikiminer) exclusively for mining Wikileaks.

It allows users to find secret information, news leaks, and classified media from anonymous sources by mining Wikileaks' link graph. Search by keywords or location.

Search example

Query [ cablegate ] in this miner. Then locate the result whose URL is https://wikileaks.org/cablegate.html and click the Search Inside tool icon below that result. By recursively searching inside a result, you will be walking a portion of Wikileaks' link graph.


Open Source Projects – A New Minerazzi Miner



Open Source Projects is a new miner available at http://www.minerazzi.com/osp. It allows you to find or submit all kinds of open source projects. Access open source community resources. Search by software, hardware, or project name.

Looking for open source projects relevant to Apache, Linux, or Windows? Need to be more specific in your search (e.g., Weka, JQuery, Aptana, JNode, Ubuntu, or Mozilla)? Want to build your own open source collections? If so, this miner is for you.

Mining Wikileaks with Minerazzi




Wikileaks.org is one of those huge sites where researchers and investigative reporters can feel like they are in heaven.

That is, provided that they have a way to move across Wikileaks' complex link structure. Simply put, they need a tool that allows them to understand the relationships between links and quickly move in and out of specific link paths of interest. This needs to be done at different levels of the link graph, while current resources are pulled out of said structure in almost real time.

That is hard to do by just searching or by crafting site, command, or custom searches, or even by using Wikileaks' own search engine.

Fortunately, you can do the above with Minerazzi's recrawling features, at least to some degree.

Although Minerazzi technology is evolving and not perfect, moving from searching indexes to mining user-driven recrawls is a step in the right direction.

However, there might be a broad spectrum of starting experimental conditions, each requiring a different crawling strategy.

The purpose of this post is not to discuss solutions for all possible experimental conditions. It is assumed that users are familiar with Minerazzi’s complementary Recrawl It (RI) and Search Inside (SI) tools. To keep things simple, the recrawls below are done with SI.

Example 1: Initial URL is not given.

Search for [wikileaks] in the Investigative Journalism miner (http://www.minerazzi.com/journalism). Find a result that might interest you.

A good starting point is the result whose URL is https://www.wikileaks.org/wiki, as it contains links to the latest leaks and recent analyses. Click the Search Inside tool icon below this result.

That should retrieve all links from this result, with the tool icon now to the right of each new result. You should see three output sections. The first one logs the current URL being crawled. The other two are the External Links and Internal Links sections.
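
For readers curious about what splitting a page's links into internal and external sections involves, here is a minimal sketch using only the Python standard library. It is not how Search Inside is implemented. The names LinkCollector and split_links are hypothetical, the sample HTML is made up, and "internal" is assumed to mean "same host as the crawled page".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(page_url, html):
    """Return (internal, external) lists of absolute http(s) links."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        target = internal if parsed.netloc.lower() == base_host else external
        if absolute not in target:  # keep each link once, in page order
            target.append(absolute)
    return internal, external


# Made-up page content, for illustration only.
sample = '<a href="/the-spyfiles.html">Spy Files</a> <a href="https://example.org/x">Elsewhere</a>'
internal, external = split_links("https://www.wikileaks.org/wiki", sample)
print(internal)  # ['https://www.wikileaks.org/the-spyfiles.html']
print(external)  # ['https://example.org/x']
```

Each link in either list could then serve as the starting URL for the next crawl step.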

You can now recursively recrawl results by clicking their SI icon and, again, check how the above sections are updated. That is, you will be walking a portion of Wikileaks' link graph. At any given step you can walk backward or forward along the link graph by clicking the SI icons in the above sections.

This mechanism works as expected with the latest versions of Firefox, Opera, Safari, and Chrome browsers. However, sometimes the state of the logged section is not preserved in IE. We are working on fixing this anomaly.

Example 2: Initial URL is given.

If the initial URL is given or obtained through a search or previous crawl, recrawl its links as in Example 1. A good starting point is https://www.wikileaks.org/the-spyfiles.html

You can always submit a particular Wikileaks URL for indexing in the above miner. Once indexed, you can use it as a starting point.

Example 3: What if I still want to combine searching with recrawling?

You can always do that. Wikileaks' link graph has many URLs with the pattern [keyword].wikileaks.org, which can be easily mined.

For instance, search for [file wikileaks] and recrawl with SI the result whose URL is https://file.wikileaks.org. Next, from the results page, recrawl the result whose URL is https://file.wikileaks.org/file. You will be presented with over a thousand interesting results. Have a field day!
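
If you collect URLs during a recrawl and want to pick out those that follow the [keyword].wikileaks.org pattern, a small helper like the one below can do it. This is only a sketch: the function name subdomain_keyword is hypothetical, and treating "www" as a non-keyword is our assumption.

```python
from typing import Optional
from urllib.parse import urlparse


def subdomain_keyword(url: str, domain: str = "wikileaks.org") -> Optional[str]:
    """Return the keyword in [keyword].<domain>, or None if the URL does not fit."""
    host = (urlparse(url).hostname or "").lower()
    suffix = "." + domain
    if not host.endswith(suffix):
        return None  # different site, or the bare domain with no subdomain
    keyword = host[: -len(suffix)]
    # Assumption: "www" and multi-level subdomains are not keywords.
    if not keyword or keyword == "www" or "." in keyword:
        return None
    return keyword


print(subdomain_keyword("https://file.wikileaks.org/file"))  # file
print(subdomain_keyword("https://www.wikileaks.org/wiki"))   # None
```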

What is next?

Because Wikileaks is so huge, perhaps it is time for us to start building a miner exclusively for mining the Wikileaks.org site. Such a miner will help us address starting point and link walk issues.

Mining Cuba Newspapers and Resources



Looking to mine newspapers and all kinds of resources from Cuba?
With the US and Cuba reaching out to each other, there is an increasing interest in data mining resources from that beautiful Caribbean island.
For companies interested in jumping on the bandwagon (e.g., marketing, tourism, and technology companies), the following might be relevant.
We have added a whole new set of newspapers to the News miner (http://www.minerazzi.com/news), to include newspapers not just from Cuba, but from all the Caribbean islands and all 50 US states. Whether you want to build a curated collection of resources from Cuba, the Dominican Republic, or the Virgin Islands, use this miner to your heart's content.

Mining HuffingtonPost, DrudgeReport, Topix, Google News, and a Few Other News Services



The news miner (http://www.minerazzi.com/news) was built for indexing and mining newspapers. However, you can use it to mine news aggregation sites like HuffingtonPost, DrudgeReport, Topix, Google News, Yahoo News, Bing News, and many more. Just visit the above link and search for any of those sites.

After that you can recursively crawl these with Minerazzi’s Search Inside and Recrawl It tools. These are complementary tools so if one returns no results, try the other one.

To illustrate, the HuffingtonPost and DrudgeReport are two of the most user-friendly and content-rich news sites on the Web. These are great sources for building news collections about relevant topics like politics.

By searching for [ huffingtonpost ] or for [ drudgereport ] you can discover additional news services and even follow specific authors and their posts. You can then start building curated collections of news services, authors, and their posts.

When building collections from news services, if a remote host is busy you may want to retry it at another time. However, if the remote host denies you service, you are out of luck. This is not really a drawback. As there are zillions of friendly hosts out there that will provide you with rich content, the ones that refuse connections are expendable.
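
If you script your own collection building, the retry-or-skip policy described above can be sketched as follows. Everything here is hypothetical: HostBusy, HostDenied, and fetch_with_retry are names made up for the example, and the fetch function that raises those exceptions is something you would supply yourself. It is not part of Minerazzi.

```python
import time


class HostBusy(Exception):
    """Raised by the caller's fetch function when the remote host is busy."""


class HostDenied(Exception):
    """Raised by the caller's fetch function when the remote host denies service."""


def fetch_with_retry(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Retry a busy host a few times; give up at once on a host that denies service.

    Returns whatever fetch(url) returns, or None if the host should be skipped.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except HostDenied:
            return None  # expendable host: move on to the next one
        except HostBusy:
            if attempt == attempts:
                return None  # still busy after the last attempt
            sleep(delay * 2 ** (attempt - 1))  # wait longer before each retry
    return None
```

Passing the sleep function as a parameter keeps the sketch easy to try out without actually waiting between attempts.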