Big Data Sources Miner & Search Engine



Big Data Sources is a new miner available at

This is a searchable collection of big data sources from around the world, all now at your fingertips.

Search by company, location, or service.

Additional miners are available at

An Introduction to Local Weight Models



This is Part 4 of a tutorial series on Term Vector Theory. An introduction to several local weight models is presented.

The tutorial is available at

Local weights come in different flavors. This tutorial covers the following models:















We have developed a new tool called the FQU Bot which is available at

The tool extracts fully qualified URLs (FQUs) from a piece of text or from a previous URL.

  • An FQU is one that starts with a scheme like http(s) and has the complete path to a file, including its directories (folders), if any; i.e., it is not a relative URL. Sometimes you just want to extract FQUs from a piece of input text or from the front end or back end of a file. The file may consist of an email, blog, forum, Twitter, or Facebook message, or can be an htm(l), asp(x), doc(x), js, css, txt, or php file. Sometimes you also want to extract FQUs from a web page residing on a remote server. That page might be a search results page, an online catalog, a link hub page, or one from your competitor or prospective client. This tool helps you extract FQUs from all of these sources.
  • Just paste the piece of text you want to extract FQUs from into the tool textarea.
  • If extracting FQUs from previous FQUs, you must submit only one web address at a time. If a remote server blocks your URL request, visit the target URL and submit the visible text or source code associated with that web address.
  • To export the output, click the results textarea. Copy/paste the results as you usually would.

Disclaimer notice: To prevent abuse, we have limited the tool to processing the first 100,000 input characters.
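
For readers who want to experiment offline, the idea can be sketched in a few lines of Python. This is only a minimal illustration, not the FQU Bot's actual code: the regular expression, the function name `extract_fqus`, and the choice to strip trailing punctuation are all assumptions made for this example. Only the 100,000-character cap comes from the notice above.

```python
import re

# Input cap taken from the disclaimer above.
MAX_INPUT_CHARS = 100_000

# Assumed pattern: an http(s) scheme followed by everything up to
# whitespace or a common delimiter (quotes, angle brackets, parentheses).
FQU_PATTERN = re.compile(r"https?://[^\s\"'<>()]+", re.IGNORECASE)


def extract_fqus(text):
    """Return the unique fully qualified URLs found in text, in order."""
    found = []
    for match in FQU_PATTERN.finditer(text[:MAX_INPUT_CHARS]):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in found:
            found.append(url)
    return found


sample = (
    'See <a href="https://example.com/docs/page.html">docs</a>, '
    "/relative/path.html, and http://example.org/a/b.txt."
)
print(extract_fqus(sample))
```

The relative path `/relative/path.html` is skipped because it has no scheme, which matches the definition of an FQU given above.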


Illustrative Exercises

  • Submit an initial URL or source code of a web page known to list absolute URLs.
  • Extract FQUs from an email message by submitting its visible text or source code.
  • Extract FQUs from a search results page or remote web address, like Bing, Google, DMOZ, Minerazzi (yeah; why not?), your own site, your competitor’s site, this site, etc.


Internet of Things (IoT): A New Miner



Internet of Things is a new miner available now at

Use it to find Internet of Things resources. Search by company or machine-to-machine products and services. Use its recrawling power to build your own curated collections.

Additional resources will soon be indexed.

Data Mining Technologies: A New Miner



Data Mining Technologies is a new miner available now at

Use it to find technology companies and software for data mining, analytics, and knowledge discovery. Search by company, product, or service.

An easy way of mining hashtag text



We have added a built-in TRIM function to our Editor and Curator tool at

Normally we use TRIM to remove unwanted sections of a piece of text.

Here is one practical application:

Suppose you have a long list of URLs with anchor fragments or hashtagged text collected from blog posts or Twitter posts. Such lists can be collected with, for instance, a Firefox plugin that grabs links from web pages on the fly.

To use TRIM, paste the list of URLs in the tool's input set field. For this example, don't worry about the tool's other default options.

Next, use the tool built-in trimmer (TRIM) to trim strings from the item sets starting or ending at a given input character. For instance,

  • Enter # in the start field of TRIM to remove hashtag text; e.g., http://blabla#mmm becomes http://blabla.
  • Enter # in the end field of TRIM to keep hashtag text; e.g., http://blabla#mmm becomes mmm.

Note: Normally you should use one field or the other. If you use both fields, the start field action is performed first.

Click the submit button. You are done. Easy and to the point. Works great for extracting and mining hashtag text from a long list of URLs or text from Twitter hashtags.
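
The two TRIM fields described above can be mimicked with a short Python sketch. This is not the tool's own code. The function name `trim` is hypothetical, and two behaviors are assumptions: that the cut happens at the first occurrence of the character, and that items lacking the character are left unchanged. The rule that the start field is applied first comes from the note above.

```python
def trim(item, start=None, end=None):
    """Hypothetical re-creation of the TRIM behavior described in the post."""
    # Start field: drop everything from the first occurrence of `start` onward.
    if start and start in item:
        item = item.split(start, 1)[0]
    # End field: drop everything up to and including the first occurrence of `end`.
    if end and end in item:
        item = item.split(end, 1)[1]
    return item


urls = [
    "http://blabla#mmm",
    "http://example.com/post#intro",
    "http://example.com/plain",
]

# Remove the hashtag text (start field).
print([trim(u, start="#") for u in urls])

# Keep only the hashtag text (end field).
print([trim(u, end="#") for u in urls])
```

With both fields set to #, this sketch removes the hashtag text first, so nothing is left for the end field to match. That is consistent with the advice to use one field or the other.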

The Classic TF-IDF Vector Space Model



This is Part 3 of an introductory tutorial series on Term Vector Theory. The classic term frequency-inverse document frequency (TF-IDF) model is discussed, along with its advantages and limitations.
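
As a quick taste of the model, here is a minimal Python sketch. It assumes the commonly cited form of the classic weight, term frequency times log10(D/d), where D is the number of documents and d is the number of documents containing the term. The tutorial's exact notation may differ, and the function name `tfidf_weights` and the three toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Weight each term as f * log10(D / d) under the assumed classic form."""
    # Term counts per document (whitespace tokenization, lowercased).
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n_docs = len(docs)

    # d: number of documents in which each term appears.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    return {
        name: {
            term: f * math.log10(n_docs / doc_freq[term])
            for term, f in term_counts.items()
        }
        for name, term_counts in counts.items()
    }


# Toy collection, invented for illustration only.
docs = {
    "d1": "big data mining",
    "d2": "data mining tools",
    "d3": "big data sources",
}
weights = tfidf_weights(docs)
for name, row in weights.items():
    print(name, {term: round(w, 4) for term, w in row.items()})
```

Note that "data" appears in all three documents, so its weight is zero everywhere. With this formula, a term present in every document cannot help tell the documents apart.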

The tutorial is available at

For more tutorials, visit

PS. Exercises were added to the tutorial and a few typos were removed.

Binary and Term Count Models Tutorial



This is Part 2 of our introductory tutorial series on Term Vector Theory as used in Information Retrieval and Data Mining. The Binary (BNRY) and Term Count (FREQ) models are discussed.
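
To give a rough idea of the difference between the two models, here is a small Python sketch. It assumes that BNRY records only whether a term is present (1 or 0) and that FREQ records how many times it occurs. The function name `term_vectors`, the vocabulary, and the sample text are all invented for this example and are not taken from the tutorial.

```python
from collections import Counter


def term_vectors(text, vocabulary):
    """Return (BNRY, FREQ) vectors for text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    # FREQ: raw number of occurrences of each vocabulary term.
    freq = [counts[term] for term in vocabulary]
    # BNRY: 1 if the term occurs at least once, otherwise 0.
    bnry = [1 if c > 0 else 0 for c in freq]
    return bnry, freq


vocabulary = ["data", "mining", "tools"]
bnry, freq = term_vectors("data mining and more data", vocabulary)
print("BNRY:", bnry)  # [1, 1, 0]
print("FREQ:", freq)  # [2, 1, 0]
```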

The tutorial is available at

Vector Space Calculations without Linear Algebra



We have published the new tutorial,

Vector Space Calculations without Linear Algebra

as a complement for a previous one, titled

A Linear Algebra Approach to the Vector Space Model

The calculations presented in the new tutorial are so simple that they can be carried out with a spreadsheet, an online calculator, or by hand. Thus, the article is suitable for those interested in learning about vector space models but who lack a linear algebra background.

EVE – Conversations with Computers



A sea change is coming: “So, which one is it?” The effect of alternative incremental architectures in a high-performance game-playing agent

The potential for Information Retrieval and Data Mining by conversing with computers is obvious.

Abstract follows:

This paper introduces Eve, a high performance agent that plays a fast-paced image matching game in a spoken dialogue with a human partner. The agent can be optimized and operated in three different modes of incremental speech processing that optionally include incremental speech recognition, language understanding, and dialogue policies. We present our framework for training and evaluating the agent’s dialogue policies. In a user study involving 125 human participants, we evaluate three incremental architectures against each other and also compare their performance to human-human game play. Our study reveals that the most fully incremental agent achieves game scores that are comparable to those achieved in human-human game play, are higher than those achieved by partially and non-incremental versions, and are accompanied by improved user perceptions of efficiency, understanding of speech, and naturalness of interaction.