Keywords Spam Detector Tool



This is a new tool, available at

Term repetition abuse is considered an adversarial IR practice known as keyword spam. See list of practices we fought at AIRWEB at

This tool can help you to write better titles, abstracts, descriptions, paragraphs, or full text by allowing you to detect and fix over-repeated terms. The tool uses a proprietary algorithm for detecting frequency-based spam.

Once detected, over-repeated terms can be edited by either reducing their term frequency or diluting the input by adding unique terms not present in the original text.
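The detect-then-fix idea above can be sketched in a few lines of Python. This is only a minimal illustration and not the tool's proprietary algorithm: the function names, the `max_share` and `min_count` thresholds, and the sample title are all assumptions made up for this example.

```python
import re
from collections import Counter

def over_repeated_terms(text, max_share=0.25, min_count=3):
    """Return {term: count} for terms that occur at least min_count times
    and account for more than max_share of all tokens (assumed thresholds)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c for t, c in counts.items()
            if c >= min_count and c / total > max_share}

def dilute(text, extra_terms):
    """Append unique terms that are not in the original text."""
    return text + " " + " ".join(extra_terms)

title = "cheap shoes cheap shoes buy cheap shoes online"
flagged = over_repeated_terms(title)

# Fix 1: reduce the term frequency by rewriting the title.
edited = "buy cheap shoes online"
# Fix 2: dilute the input with unique terms.
diluted = dilute(title, ["for", "men", "women", "kids", "sale"])

print(flagged)
print(over_repeated_terms(edited), over_repeated_terms(diluted))
```

A share-of-tokens threshold like this one would also flag common function words in longer texts, so a fuller version would probably need a stop list.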

GFR, Another newspaper company getting into Search Engine Marketing



Back on June 27, 2002, about 14 years ago, I presented the seminar “When they search for you, but find your competition” at the local EDP University before a small audience of about 30, mostly marketing firms and academics. The main topic was Search Marketing and the Semantic Competitor.

I spent the time trying to convince the audience that the future of traditional ad agencies and multimedia companies was search engine marketing (SEM) and social digital platforms. Back then SEO and SEM were unfamiliar three-letter acronyms. There was no Twitter or Facebook. And very few in PR knew about Google or search engines in general. Many regional and weekly newspapers were not interested in digital marketing, viewing it as minor competition.

Times have changed with search, social, and everything else under the hood. Now many traditional publishers are moving toward offering SEM, flexing their corporate muscle. Along came Infopaginas and a few others.

The latest one is Grupo Ferre-Rangel, a multimedia company and owner of the largest local newspaper in PR, El Nuevo Día. Yep, they are going full blast into search engine marketing. Back in the States it is the same: traditional newspapers are getting into SEM.

Move over, little local SEM firms. Resistance is futile, especially with the blessing of you know who…

While many are currently fighting the good war in the Social Marketing arena, let's be ready for the next waves: Internet of Things Marketing (ITM), in smart houses, buildings, entire cities, outer space… Planet-to-Planet Internet, anyone?

Very soon.



One of the first papers on LSI



Probably one of the first official papers on LSI that is still available online. Save it before it no longer is.

Found with the [ lsi ] query through the IRC miner at

The year was 1988. What were you doing back then?


Virus Evolution Citation



Happy to see that The Self-Weighting Model (SWM) paper


was briefly cited in the Virus Evolution journal published by Oxford University Press, in the research paper:

Coevolutionary Analysis Identifies Protein–Protein Interaction Sites between HIV-1 Reverse Transcriptase and Integrase

HTML version:

PDF version:

This is a great example of applying data mining techniques to HIV research, a major public health issue according to WHO, UNAIDS, and other world health organizations.

The study agrees with the SWM thesis; i.e., that correlation coefficients are not additive. Glad to see how SWM influenced their data analysis.
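A toy numerical check makes the non-additivity point concrete. The sketch below is not taken from the SWM paper or the cited study; the two small data sets and the `pearson` helper are assumptions chosen only for illustration. Each subset is perfectly correlated, yet averaging the two coefficients does not give the correlation of the pooled data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Two hypothetical samples, each perfectly correlated on its own.
x1, y1 = [1, 2, 3], [1, 2, 3]
x2, y2 = [4, 5, 6], [1, 2, 3]

r1 = pearson(x1, y1)
r2 = pearson(x2, y2)

mean_of_rs = (r1 + r2) / 2                # naive arithmetic average of r values
pooled_r = pearson(x1 + x2, y1 + y2)      # correlation of the combined data

print(r1, r2, mean_of_rs, pooled_r)
```

Here the naive average is 1 while the pooled coefficient is 4/√70, roughly 0.48, so the coefficients cannot simply be added and divided.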

More on SWM below:

Time to put my old tutorials on the non-additivity of correlation coefficients back online so the next generations of scientists are not misled (@SEO quacks and @MOZ pseudo-scientists).




The Panama Papers Miner



The Panama Papers is a new miner available at

Find resources and entry points to the Panama Papers, the largest data leak exposing deception and corruption. Search by name, subject, or country.

A brief illustrated guide to building curated collections with Minerazzi is also available at

An example of its application to the Panama Papers is provided.

04-15-2016 update: 300+ additional records were indexed this morning.

Big Data Sources Miner & Search Engine



Big Data Sources is a new miner available at

This is a searchable collection of big data sources from around the world, all now at your fingertips.

Search by company, location, or service.

Additional miners are available at

An Introduction to Local Weight Models



This is Part 4 of a tutorial series on Term Vector Theory. An introduction to several local weight models is presented.

The tutorial is available at

Local weights come in different flavors. This tutorial covers the following models:













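For readers unfamiliar with the term, a local weight scores a term using only its frequency within a single document. The sketch below shows a few local weights commonly found in the IR literature; the selection, the function names, and the sample counts are assumptions for illustration and are not necessarily the models covered in the tutorial.

```python
from math import log

def binary(f):
    """1 if the term occurs in the document, else 0."""
    return 1.0 if f > 0 else 0.0

def raw_frequency(f):
    """The term frequency itself."""
    return float(f)

def log_weight(f):
    """1 + ln(f) for f > 0, which dampens the effect of repetition."""
    return 1.0 + log(f) if f > 0 else 0.0

def augmented(f, max_f):
    """0.5 + 0.5 * f / max_f, scaled by the most frequent term in the document."""
    return 0.5 + 0.5 * f / max_f if f > 0 else 0.0

# Hypothetical term counts for one document.
counts = {"search": 4, "engine": 2, "spam": 1}
max_f = max(counts.values())

for term, f in counts.items():
    print(term, binary(f), raw_frequency(f),
          round(log_weight(f), 3), augmented(f, max_f))
```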


We have developed a new tool called the FQU Bot which is available at

The tool extracts fully qualified URLs (FQUs) from a piece of text or from a previous URL.

  • An FQU is one that starts with a scheme like http(s) and has the complete path to a file, including its directories (folders), if any; i.e., it is not a relative URL. Sometimes you just want to extract FQUs from a piece of input text or from the front end or back end of a file. The file may consist of an email, blog, forum, Twitter, or Facebook message, or can be an htm(l), asp(x), doc(x), js, css, txt, or php file. Sometimes you also want to extract FQUs from a web page residing on a remote server. Said page might be a search results page, an online catalog, a link hub page, or one from your competitor or prospective client. This tool helps you extract FQUs from all of these sources.
  • Just paste the piece of text you want to extract FQUs from into the tool's textarea.
  • If extracting FQUs from previous FQUs, you must submit only one web address at a time. If a remote server blocks your URL request, visit the target URL and submit the visible text or source code associated with said web address.
  • To export the output, click the results textarea, then copy and paste the results as you usually would. Disclaimer notice: to prevent abuse, we have limited the tool to processing the first 100,000 input characters.
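The extraction step described above can be approximated with a regular expression. This is a rough sketch under stated assumptions, not the FQU Bot's actual implementation: the pattern, the `extract_fqus` name, the trailing-punctuation trimming, and the de-duplication are all choices made for this example. Only the 100,000-character limit comes from the notice above.

```python
import re

MAX_INPUT_CHARS = 100_000  # input limit mentioned in the disclaimer notice

# Assumed pattern: an http(s) scheme followed by anything up to
# whitespace, a quote, or an angle bracket.
FQU_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_fqus(text):
    """Return unique absolute http(s) URLs in order of first appearance."""
    found = []
    for match in FQU_PATTERN.findall(text[:MAX_INPUT_CHARS]):
        # Trim punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)")
        if url not in found:
            found.append(url)
    return found

sample = (
    'Visit <a href="https://example.com/docs/index.html">docs</a>, '
    "see http://example.org/a/b.php. Relative: /img/logo.png"
)
urls = extract_fqus(sample)
print(urls)
```

Relative URLs such as `/img/logo.png` are skipped because they carry no scheme. Trimming a trailing `)` is a simplification that would truncate the rare URL that legitimately ends in a parenthesis.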


Illustrative Exercises

  • Submit an initial URL or source code of a web page known to list absolute URLs.
  • Extract FQUs from an email message by submitting its visible text or source code.
  • Extract FQUs from a search results page or remote web address, like Bing, Google, DMOZ, Minerazzi (yeah; why not?), your own site, your competitor’s site, this site, etc.


Internet of Things (IoT): A New Miner



Internet of Things is a new miner available now at

Use it to find Internet of Things resources. Search by company or machine-to-machine products and services. Use its recrawling power to build your own curated collections.

Additional resources will soon be indexed.