Mining Cuba Newspapers and Resources



Looking to mine newspapers and all kinds of resources from Cuba?
With the US and Cuba reaching out to each other, there is increasing interest in mining data resources from that beautiful Caribbean island.
Companies interested in jumping on the bandwagon (e.g., marketing, tourism, and technology companies) might find the following relevant.
We have added a whole new set of newspapers to the News miner, covering newspapers not just from Cuba but from all the Caribbean islands and the 50 US states. Whether you want to build a curated collection of resources from Cuba, the Dominican Republic, or the Virgin Islands, use this miner to your heart's content.

Mining HuffingtonPost, DrudgeReport, Topix, Google News, and a few other news services



The news miner was built for indexing and mining newspapers. However, you can also use it to mine news aggregation sites like HuffingtonPost, DrudgeReport, Topix, Google News, Yahoo News, Bing News, and many more. Just visit the news miner and search for any of those sites.

After that, you can recursively crawl these sites with Minerazzi’s Search Inside and Recrawl It tools. These are complementary tools, so if one returns no results, try the other.

To illustrate, HuffingtonPost and DrudgeReport are two of the most user-friendly and content-rich news sites on the Web. They are great sources for building news collections about topics like politics.

By searching for [ huffingtonpost ] or for [ drudgereport ] you can discover additional news services and even follow specific authors and their posts. You can then start building curated collections of news services, authors, and their posts.

When building collections from news services, if a remote host is busy you may want to retry it at another time. However, if the remote host denies you service, you are out of luck. This is not really a drawback. As there are zillions of friendly hosts out there that will provide you with rich content, the ones that eventually refuse connections are expendable.
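The retry-or-skip rule above can be sketched in a few lines of Python. This is only an illustration, not how Minerazzi itself works: the mapping of "busy" to HTTP status codes 429 and 503, and of "denied" to 401 and 403, is an assumption, and the function names and host names are hypothetical.

```python
# Assumed mapping of HTTP status codes to the two cases described above.
RETRY_LATER = {429, 503}  # host is busy: try again at another time
DENIED = {401, 403}       # host denies service: treat it as expendable


def next_action(status_code: int) -> str:
    """Return 'keep', 'retry', or 'skip' for a fetched host."""
    if 200 <= status_code < 300:
        return "keep"
    if status_code in RETRY_LATER:
        return "retry"
    if status_code in DENIED:
        return "skip"
    return "skip"  # anything else is also treated as expendable in this sketch


def triage(results: dict[str, int]) -> dict[str, list[str]]:
    """Group hosts by the action suggested for each status code."""
    groups: dict[str, list[str]] = {"keep": [], "retry": [], "skip": []}
    for host, status in results.items():
        groups[next_action(status)].append(host)
    return groups
```

Calling `triage` on a small mapping of hypothetical hosts to status codes would leave the busy hosts in the `retry` group for a later pass and drop the ones that refused service.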

Mining Google Scholar with Minerazzi: Building curated collections of topics and authors



At the time of writing, curated collections of topics and authors can easily be built by mining Google Scholar results with Minerazzi. This can be illustrated with the following examples.

Mining Topics Example

1. Search [ pagerank ] with the Information Retrieval Collection  miner at

2. Locate the result whose URL is and click the Search Inside tool icon located below said result.

3. Note from the output of step 2 that, for Google Scholar, some of the links discovered by the tool are about co-authors discovered by Google. Locate a co-author and click the Search Inside tool icon, this time located to the right of said result.

4. You will be presented with a new list of results each with the Search Inside icon. Some of these include co-authors.

By recursively using Search Inside you can build a curated collection on the pagerank topic or a curated collection of co-authors, without having to resubmit the query.

This approach assumes that the initial Google Scholar URL to be mined is already in the IRC microindex. For other queries, you need to query Google Scholar and submit the search results URL for indexing in IRC. Once indexed, it can be mined as described above.

However, if a user discovers a Google Scholar URL when using the Search Inside tool on a previous result, said URL can be recrawled and mined as described above, so it does not need to be in the IRC microindex at all.

In general, any URL searchable with Search Inside can be mined, unless the tool hits a dead end (no links accessible or to follow).
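The recursive use of Search Inside described above amounts to walking a link graph until a dead end is reached. The following Python sketch shows that idea on a small in-memory graph. It is not Minerazzi's implementation: the function name, the `max_depth` parameter, and the toy graph are all hypothetical, and no real pages are fetched.

```python
from collections import deque


def walk_links(graph, start, max_depth=2):
    """Breadth-first walk of a link graph, up to max_depth hops from start.

    graph maps a URL to the list of links found on that page. A URL with no
    entry (or an empty list) is a dead end: there are no links to follow.
    Returns (URLs in discovery order, dead-end URLs).
    """
    seen = {start}
    collected = [start]
    dead_ends = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        links = graph.get(url, [])
        if not links:
            dead_ends.append(url)
            continue
        if depth == max_depth:
            continue  # links exist, but the walk stops at this depth
        for link in links:
            if link not in seen:
                seen.add(link)
                collected.append(link)
                queue.append((link, depth + 1))
    return collected, dead_ends


# Hypothetical stand-in for a results page, its papers, and its co-authors.
toy_graph = {
    "scholar-results": ["paper-1", "coauthor-a"],
    "coauthor-a": ["paper-2", "coauthor-b"],
    "paper-1": [],
}

collection, dead_ends = walk_links(toy_graph, "scholar-results")
```

The `seen` set keeps the walk from revisiting a page, which matters because co-author pages often link back to each other.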

How to start a curated collection about the Death Penalty with Minerazzi




To start a curated collection about the Death Penalty, search for [ cornell ] in the Human and Civil Rights Collection miner ( Use the Search Inside tool on the third result whose URL is

The tool will retrieve a list of External Links. From the listed links, now Search Inside the fourth result whose URL is

You should get over 2,800 links, enough to start a curated collection on the topic. This is a practical example of building collections with Minerazzi. It should be useful for attorneys, law students, and others interested in the topic.

Walking the Link Graph of Sites with Minerazzi



Today we are adding a new tool that reports to users which sites they have visited while walking the link graph of recrawled web pages. The tool works when they use the Search Inside tool of any miner built with Minerazzi. At this time we are limiting it to the last 10 visits per result per query session.
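One way to picture the "last 10 visits per result per query session" limit is a bounded queue keyed by session and result. The Python sketch below is only an illustration of that idea. The class name, method names, and identifiers are hypothetical and do not describe how the tool is actually built.

```python
from collections import defaultdict, deque

MAX_VISITS = 10  # the limit stated in the post


class VisitTrail:
    """Keep the most recent visits for each (session, result) pair."""

    def __init__(self, limit=MAX_VISITS):
        self._limit = limit
        # Each trail is a deque that silently drops its oldest entry when full.
        self._trails = defaultdict(lambda: deque(maxlen=self._limit))

    def record(self, session_id, result_url, visited_url):
        """Append a visited URL to the trail of this session and result."""
        self._trails[(session_id, result_url)].append(visited_url)

    def visits(self, session_id, result_url):
        """Return the stored visits, oldest first, without creating a trail."""
        return list(self._trails.get((session_id, result_url), ()))


trail = VisitTrail()
for i in range(12):
    trail.record("session-1", "result-1", f"page-{i}")
```

After the loop, only the ten most recent pages remain for that session and result, while the two oldest have been dropped.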
On other matters, we welcome Google and its Ara Mobile project to Puerto Rico.

The Human and Civil Rights Collection Miner



The Human and Civil Rights Collection is a new miner built with the Minerazzi platform. It is available at To understand its capabilities, do a search for [ minnesota human rights library ]. Then find the result with the URL

Click the Search Inside icon at the right of this result to recrawl said result. By doing so, you already have access to almost the entire online collection of human rights resources from the University of Minnesota Library. By recursively using the Search Inside tool, you can keep discovering new resources.

You can do similar searches in large repositories like the Library of Congress. You just need to find said repository in the miner or as a secondary URL while recrawling.

The discovery and data mining capabilities of Minerazzi have prompted us to launch a new and ambitious project: The World Libraries Recrawl Project.

Mining SEO conference sites and their speakers with Minerazzi





The SEOMiner can be used to illustrate the power of recursively crawling sites and social networks with Minerazzi.

In general, the goal of Minerazzi is to turn searchers into data miners. We try to accomplish this with the dozens of tools that come with the platform.

For instance, we have two complementary tools: Recrawl It (a URL crawler) and Search Inside (a link crawler). In this post, we want to discuss Search Inside so you can grasp its power.

Go to the seominer ( and search for [ ] to find said site link. Click the Search Inside icon (a black square of contiguous arrows) of that result.

You should see a list of external and internal links. In the internal links list, locate the SES London result and click the Search Inside icon again, this time located at the right of said result. The tool will retrieve more results.
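The split into internal and external links can be approximated by comparing each link's host with the host of the page being crawled. The Python sketch below uses only the standard library and works on an HTML string, so no site is fetched. It is an assumption about how such a split could be made, not a description of Search Inside, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(page_url, html):
    """Return (internal, external) absolute links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        target = internal if parts.netloc.lower() == page_host else external
        if absolute not in target:  # drop duplicates, keep page order
            target.append(absolute)
    return internal, external
```

Comparing full host names is a deliberately simple choice: under this rule a subdomain such as a blog hosted on a different host name counts as external.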

By recursively searching inside, you can extract more resources and, if you are lucky enough, hit a blog, discussion forum, portal, etc., which you can then keep mining.

If you prefer, you can select results by clicking the {S} red link at the top-right of the lists. You can then export the results by copying and pasting them to an external destination (e.g., Excel or a txt file). In this way you can build your own curated collection or organize links to your heart's content.

You can also start the above mining by just searching for [ speaker profiles ] in the seominer, doing a Search Inside in the SES London result, and then doing a Search Inside on a result as done before.

In some tests, we were able to get inside a discussion forum and reach all the replies to a specific post. In another test, we were able to mine zillions of posts by Twitter and Facebook users.

Proceeding as described above, you will be walking and mining the link structure of sites across the Web. That is, you will be mining while searching, which, as said at the beginning of this post, is our primary goal: to turn searchers into data miners.

Mining Stanford, Cornell, and MIT Universities



Yesterday we launched the US University Sites Collection. This is a miner built with the Minerazzi platform that allows users to search, mine, and recrawl all top university sites from the US. Note that in this miner your query should be about university sites. Our topic-specific miners are, in general, not for generic queries. All this is explained at our site.
You can try it by using [ stanford ], [ cornell ], or [ MIT ] as the query. Then just click the Search Inside link of any of the search results and you are on your way to discovering or building collections out of these university sites. You can also use any of the dozens of extraction tools of the platform to mine each result.
You can do the same with any of our miners. For instance, using the Information Retrieval Collection miner, search for [ Gerard Salton ]. Soon you will see dozens of Salton articles and you will be on a journey discovering nice resources. If you go to the first result and click the chained arrow icon at the far right, the Search Inside tool will recrawl that result.
Then go to the Internal Links list, find the link result that says “Browse All”, and click the same icon again to see over 1,400 results from Cornell’s dSpace community list. Soon you will be discovering more resources.
You can also search for [ ecommons ] and mine Cornell’s eCommons database as well and any subsequent database, directory, or library resource that you might come across during the discovery journey.
Note that with Minerazzi, users can search and mine records straight from the search result pages, something that cannot be done with Google, Bing or Yahoo. So the platform might benefit researchers, librarians, students, and the general public.
To sum up, what the Minerazzi platform proposes is a new search paradigm: to turn searchers into data miners.

The US University Sites Collection



We have launched a new miner: The US University Sites Collection (US). Available at, this miner lets you mine all top US university sites, organize links, and build collections customized to your research or scholarly needs.

In addition, we have added a data set of all Nasdaq ticker symbols to the mystocks miner. Soon we will be adding a NYSE ticker data set.

Mining Hubs with Minerazzi



This morning we added an entirely new data set to the Puerto Rico Collection. The set was added by indexing the portal, which is itself a hub to many data sets. To verify this, search for [ puerto rico data portal ]. Economists, statisticians, and local researchers are familiar with that portal.

This is an example of using Minerazzi to build customized collections from a hub site. One only needs to use the Search Inside feature from a search result to start discovering a whole new world of data mining possibilities.

BTW, US government versions are in the pipeline. As many agencies are too big, these will be agency-specific.

Sure, you can do a site search in Google by appending the site: operator to and submit that as a query. However, you will not be able to data mine the results and organize curated collections while searching, nor will you be able to search inside a Google result. To me, this is an interesting search paradigm. Whether it might be a game changer, time will tell. For now, I’m just happy with what we have accomplished with the platform.

In addition, when using Minerazzi’s Search Inside crawler, the tool will pull current versions of documents as they actually exist on remote servers. That is, these are not precached records from an index.

On other matters, as an example of mining authority hubs with Minerazzi, this coming week we will be launching the University Sites Collection (USC). Librarians, students, and teachers will be able to build curated university resource collections to their heart's content using this new Minerazzi miner. A worldwide universities version is in the pipeline.

Overall, with Minerazzi, it is really simple to extract and build collections on practically any subject, business, or knowledge domain.