Text Mining Tool: A Positional Posting List Generator



A positional inverted index is essentially a set of posting lists that store term weights, term positions, document IDs (docids), and similar data from a collection of documents.

Posting lists can also be generated from a single piece of text.

Such lists come in handy when we want to conduct text forensics or analyze writing styles; for instance, to check for evidence of plagiarism, to attribute authorship, or to analyze how a writer distributes stopwords, rare words, or specific combinations of terms across paragraphs, chapters, and so on.

However, counting words and term positions by hand can be time-consuming, unless you have a tool that does it for you.

We have developed precisely such a tool. It is available at


The tool generates posting lists of the form:

term: {frequency value, [array of positions]}

where the frequency value serves as the term weight and each term is associated with an array of the positions at which it occurs.
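As a rough illustration of this format, here is a minimal Python sketch. It is not the tool's actual code: the function name `build_posting_lists`, the dictionary keys `freq` and `positions`, and the simple lowercase tokenizer are all assumptions made for the example, and positions are counted as zero-based token offsets.

```python
import re
from collections import defaultdict


def build_posting_lists(text):
    """Map each term to {"freq": count, "positions": [token offsets]}.

    Hypothetical helper for illustration only; the real tool may
    tokenize and number positions differently.
    """
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    positions = defaultdict(list)
    for offset, term in enumerate(tokens):
        positions[term].append(offset)

    # The raw frequency doubles as the term weight, as described above.
    return {
        term: {"freq": len(offsets), "positions": offsets}
        for term, offsets in positions.items()
    }


postings = build_posting_lists("the cat sat on the mat")
for term, entry in postings.items():
    print(f"{term}: {{{entry['freq']}, {entry['positions']}}}")
```

Keeping positions as a plain list per term makes it straightforward to ask follow-up questions, such as how far apart two occurrences of a term are.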

At this time, the tool analyzes plain text only.

With minor modifications, it can be used to build a positional inverted index where Robertson’s BM25 weights are stored.
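To give an idea of what storing BM25 weights instead of raw frequencies could involve, here is a hedged Python sketch of a per-term, per-document BM25 weight. The function name `bm25_weight` and its parameter names are hypothetical, the defaults `k1=1.2` and `b=0.75` are commonly used values rather than anything prescribed by the tool, and the IDF shown is one common non-negative variant (with 1 added inside the logarithm), so treat it as an assumption rather than the definitive formulation.

```python
import math


def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Return a BM25 weight for one term in one document (illustrative).

    tf          -- term frequency in the document
    df          -- number of documents containing the term
    n_docs      -- number of documents in the collection
    doc_len     -- length of the document, in tokens
    avg_doc_len -- average document length in the collection
    """
    # Assumed IDF variant: adding 1 inside the log keeps it non-negative.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: longer-than-average documents are damped.
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm


# Example: a term occurring twice in an average-length document,
# found in 3 of 100 documents.
weight = bm25_weight(tf=2, df=3, n_docs=100, doc_len=250, avg_doc_len=250)
print(round(weight, 4))
```

In a positional index, a weight like this would take the place of the frequency value in each posting, while the array of positions stays unchanged.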

The Microsoft-Nokia Fiasco



Reality Check! The Microsoft-Nokia Fiasco.



This RSS news item was found with Minerazzi’s Social Pulse Parser (SPP).

A bye bye and a welcome



Mi Islita.com is now a miner, dedicated exclusively to indexing sites relevant to Puerto Rico. Its legacy content is getting a second life as it is moved to the Tools and Tutorials sections of Minerazzi.com. All these changes are part of a broader effort to place the latter at the center of the action.

Bye bye Mi Islita (2001-2015). Welcome, Mi Islita (2015 – ?).

Say “Hello” to Mind Retrieval



Back in 2010, in a Spanish-language interview, I mentioned the notion of Mind Retrieval and search engines powered by bio-components as something that would soon be a reality.


(Access the link through Google and translate it to read the interview.)

Well, it was only a matter of time. The following news, gathered from Geek.com with our RSS SPP service, shows that things are moving really fast in the HCI field: mind-to-text/audio retrieval.


The next search marketing frontier!

Say “Hello” to Mind Retrieval.

More R Resources Available through the CRAN Miner



More R resources are now available through the CRAN Miner (http://www.minerazzi.com/cranminer).

We have expanded its index of R packages. In addition, thanks to the SPP service, its main page now features RSS news from R bloggers.

As there are so many R blogs out there, we are building a separate miner for them so users can have a large chunk of the vast community of R bloggers accessible from one place.

SPP as a Sitewide Service



We are turning the Social Pulse Parser (SPP) into a sitewide service.

This means that in addition to mining social signals generated by rss news and user-defined urls, users can now quickly check the signals of every single search result across all miners built with Minerazzi.

Essentially, when a user clicks the SPP icon displayed in a search result, the corresponding url is passed to the SPPv2 tool.

To complete the process, we are testing a feature that allows users to switch between topic-specific news (i.e., news specific to a miner) and general news when they visit a miner. So far, this feature is available only with the miner at http://www.minerazzi.com/prbusca

The goal of all these changes is the same: to turn searchers into data miners.

From RSS to SPP… to SPPV2



We originally developed the Social Pulse Parser (SPP) as a tool for reporting the reach of rss news across several social networks.

“Reach” is used in this case to mean how many opinion signals syndicated news urls provoke across the most popular networks of the social landscape.

While this was a good idea, something was missing: what about the reach of urls pointing to products and services?

To meet that need, we have developed a second version of SPP that accepts user-defined urls. With this new tool, SPPv2, users can input a set of web addresses (urls) and follow the reach of specific products and services (e.g., branded domains, ads, blog posts, business sites, …) across the social networks.

SPPv2 is available at http://www.minerazzi.com/tools/spp/sppv2.php

Below are some suggested exercises for you to try with the tool.

  • Compare the reach of a branded domain name under different top-level domains, or TLDs (e.g., .com, .net, .org).
  • Repeat the exercise with a set of competing brands and the same TLDs (e.g., pepsi.com, coke.com, …).
  • Repeat the exercise with a set of competing brands and different TLDs (e.g., pepsi.com, coke.net, …).
  • Compare the reach of the top search marketing companies or newspapers from your local area or country.
  • Collect query metrics on a time basis (hrs, days, weeks,…) and do a time series analysis.

For instance, try the following set of search marketing urls:


If you are from Puerto Rico, try the following local newspaper urls:


If you want to share additional practical applications for the tool, let us know.

The Social Pulse Parser



THE SOCIAL PULSE PARSER has arrived at Minerazzi (http://www.minerazzi.com)
What do you get when you combine RSS feeds with social metrics? A social pulse parser. Today, we did just that!
Essentially, we have applied the power of our RSS news parser to social networks, so it can now report Google +1, Pin, and Tweet counts, as well as Facebook like, share, and comment statistics, for each syndicated RSS news item parsed.
This innovation allows readers to compare the reach and impact of several channels and sites across the Web in no time.
Just visit any of our miners to have a taste of what is coming to Minerazzi.