New Records Added to WikiMiner

A new set of records was added to the WikiMiner at http://www.minerazzi.com/wikiminer using our newest contextual crawler (in discrete mode, at one level of depth).

For those who, at this inning of the game, are still interested in the old Podesta-Clinton email saga (not sure why), try this:

A search for [podesta] in the WikiMiner retrieves records relevant to the Podesta emails. Once on the results page, you can recursively discover new records by recrawling the results.

For instance, search for [podesta] and click the Links Tool (black rectangular chain icon) under the second result (https://www.wikileaks.com/podesta-emails) to discover new records. You will be presented with two sets of records: Externals and Internals.

The Links Tool now appears to the right of each new record. Click the tool again, for instance on one of the Internal results, to discover more records. By repeating the process, you keep recrawling and discovering new records.

You can explore some of the other miners available at http://www.minerazzi.com in the same way.

An In-Context Topic Crawler

I have completed a breadth-first in-context crawler that traverses the Web, recursively discovering links in two modes:

1. continuous mode: without stopping.
2. discrete mode: controlled by the user.

The first mode quickly fills a small database, but it makes the crawler act like a vacuum cleaner, collecting all kinds of garbage; i.e., links that are irrelevant to a topic-specific database are also grabbed.

The second mode is slower, but it lets me decide, based on topic criteria, whether to continue or stop the crawl at a given recursion level, which reduces the amount of garbage gathered. Relative URLs are automatically resolved into absolute ones, a somewhat tricky task when recursion is involved.
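The two modes and the URL resolution step can be illustrated with a minimal Python sketch. This is not the crawler's actual code: the `fetch` and `keep_going` callbacks, the function names, and the choice to drop URL fragments are all assumptions made for illustration. Passing `max_depth=None` approximates continuous mode, while `keep_going` stands in for the user's decision in discrete mode.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Resolve each href against the page it was found on."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # urljoin turns "../b/page.html" into an absolute URL;
        # urldefrag drops "#fragment" so the same page is not queued twice.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def crawl(seed, fetch, max_depth=1, keep_going=lambda depth, frontier: True):
    """Breadth-first crawl, one level at a time.

    fetch(url) returns an HTML string or None (hypothetical, caller-supplied).
    keep_going(depth, frontier) is asked before each level; returning False
    stops the crawl, which models discrete mode.
    """
    seen = {seed}
    frontier = [seed]
    depth = 0
    while (frontier
           and (max_depth is None or depth < max_depth)
           and keep_going(depth, frontier)):
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            if html is None:
                continue
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
        depth += 1
    return seen
```

Because each link is resolved against the page where it was found, not against the seed, relative paths stay correct no matter how deep the recursion goes.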

In addition to links, the crawler selectively extracts strings that match specific patterns, such as email addresses, phone numbers, and zip codes. Contextual keywords surrounding the patterns are also collected so that patterns and keywords can be mapped to each other. The goal is to develop a service that consumes in-context, pattern-specific databases, which are great for people searches, intelligence, and marketing. So far, the project makes building topic miners even simpler.
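As a rough illustration of in-context pattern extraction, here is a small Python sketch. The regular expression, the word-window size, and the function names are assumptions, not the crawler's actual rules; the token-based approach shown here also only suits patterns without internal spaces, such as email addresses.

```python
import re
from collections import defaultdict

# A deliberately simple email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_with_context(text, pattern=EMAIL, window=3):
    """Return (match, keywords) pairs, where keywords are the
    `window` words found on each side of the matched token."""
    tokens = text.split()
    records = []
    for i, token in enumerate(tokens):
        match = pattern.search(token)
        if not match:
            continue
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        keywords = [w.strip(".,;:()").lower() for w in neighbors]
        records.append((match.group(), [k for k in keywords if k]))
    return records


def reciprocal_maps(records):
    """Build both lookups: pattern -> keywords and keyword -> patterns."""
    by_pattern = defaultdict(set)
    by_keyword = defaultdict(set)
    for value, keywords in records:
        for keyword in keywords:
            by_pattern[value].add(keyword)
            by_keyword[keyword].add(value)
    return by_pattern, by_keyword
```

With both maps in hand, one can look up which keywords surround a given address, or which addresses appear near a given keyword.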

Reference
http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf

Mapping IPs: The Wikipedia Way?

Here are some services apparently used by Wikipedia for mapping IPs to other resources. They are great for business intelligence, tracking users and spammers, marketing research, and more.

We used them to test yahoo.com’s IP (98.139.183.24).

Replace any instance of 98.139.183.24 with the IP address you want to test.

https://www.robtex.com/ip/98.139.183.24.html
http://www.robtex.com/whois/98.139.183.24.html
https://db-ip.com/98.139.183.24
http://whatismyipaddress.com/ip/98.139.183.24

https://tools.wmflabs.org/guc/index.php?user=98.139.183.24
http://www.dnsstuff.com/tools/
http://reportcard.wmflabs.org/
https://petscan.wmflabs.org/
http://tools.wmflabs.org/render-tests/catcycle-dev/catcycle.py

http://whois.arin.net/rest/ip/98.139.183.24.html
http://wq.apnic.net/apnic-bin/whois.pl?searchtext=98.139.183.24
http://www.afrinic.net/cgi-bin/whois?searchtext=98.139.183.24
http://www.ripe.net/fcgi-bin/whois?searchtext=98.139.183.24
https://rdap.lacnic.net/rdap-web/ip?key=98.139.183.24
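If you test many addresses, the substitution can be scripted. The short Python sketch below is one way to do it, using a few of the URL patterns listed above as templates; the `TEMPLATES` list and the `lookup_urls` name are illustrative choices, and the sketch only builds the URLs without contacting any service.

```python
import ipaddress

# URL patterns copied from the list above, with the IP replaced by a placeholder.
TEMPLATES = [
    "https://www.robtex.com/ip/{ip}.html",
    "https://db-ip.com/{ip}",
    "http://whatismyipaddress.com/ip/{ip}",
    "http://whois.arin.net/rest/ip/{ip}.html",
    "https://rdap.lacnic.net/rdap-web/ip?key={ip}",
]


def lookup_urls(ip):
    """Return one lookup URL per template for the given IP address."""
    # ip_address() raises ValueError for malformed input such as "999.1.1.1".
    normalized = str(ipaddress.ip_address(ip))
    return [template.format(ip=normalized) for template in TEMPLATES]


for url in lookup_urls("98.139.183.24"):
    print(url)
```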

MUST Tool Enhancements

We have enhanced the MUST tool, available at

http://www.minerazzi.com/tools/must/must.php

This is a redirection checker: when a URL redirects, the tool reports the initial and final status codes, URLs, and IP addresses.

The tool now:

1. accepts up to 500 URLs per submission.
2. summarizes broken and active URLs.

This is one of several tools that we use in-house for re-indexing databases and cleaning up crawl results; the in-house version has no URL limit.
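For readers curious about what a redirection checker does, here is a minimal Python sketch of the idea. It is not the MUST tool's code: the `fetch` callback, the function names, the hop limit, and the rule that only a 2xx final status counts as "active" are all assumptions. IP address lookup is left out, since it would require DNS queries.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(url, fetch, max_hops=10):
    """Follow a redirect chain and report its two ends.

    fetch(url) returns (status_code, location_or_None); it is a
    hypothetical caller-supplied function that does the actual request.
    """
    hops = []
    seen = set()
    current = url
    while len(hops) < max_hops and current not in seen:
        seen.add(current)  # guards against redirect loops
        status, location = fetch(current)
        hops.append((current, status))
        if status in REDIRECT_CODES and location:
            # Location headers may be relative, so resolve them.
            current = urljoin(current, location)
        else:
            break
    return {
        "initial_url": hops[0][0],
        "initial_status": hops[0][1],
        "final_url": hops[-1][0],
        "final_status": hops[-1][1],
        "redirects": len(hops) - 1,
    }


def summarize(results):
    """Count active (final status 2xx) versus broken URLs."""
    active = sum(1 for r in results if 200 <= r["final_status"] < 300)
    return {"active": active, "broken": len(results) - active}
```

Injecting `fetch` keeps the sketch testable offline; in practice it would wrap an HTTP client configured not to follow redirects automatically.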

Shopping & Shopper Miner

This miner has been reindexed with new content, ahead of the Black Friday and holiday season special offers and deals.

It is available at http://www.minerazzi.com/shopper/

The miner now features an SPP-powered news service with consumer alerts, recalls, and fraud reports.

Use it to find coupon deals, price comparisons, holiday offers, consumer reports, and more.

E-Paper Miner

This is a new miner, available at

http://www.minerazzi.com/e-paper

This is a more than appropriate miner on electronic paper and ink technologies, now that the news is out about Microsoft’s sticky notes using e-ink powered by ambient light.

E-Paper and E-Ink technologies are set to simplify our lives: http://www.psfk.com/2016/10/microsofts-e-paper-note-runs-on-ambient-light.html

https://www.technologyreview.com/s/602710/this-e-ink-post-it-never-needs-to-be-charged/

Mobile Technology Miner

This is a new miner available at

http://www.minerazzi.com/mobile

Find apps, software, and vendors for mobile devices, smartphone technology, and more. Search by vendors, makers, models, or keywords. Submit web pages relevant to these topics.

The miner includes an RSS News section powered by SPP to help you find industry-related news, releases, events, and more, so far from AT&T, Samsung, Sprint, and Verizon. Feel free to submit a relevant RSS URL for inclusion.

Excel File for Quantile-Quantile Analysis

This is an Excel .xlsx file for reproducing Table 1 of our tutorial on Quantile-Quantile Plots. Now anyone with Excel installed can play with and explore this simple technique for checking whether a data set is normally distributed.
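For those without Excel, the core of a normal Q-Q analysis fits in a few lines of Python. This is a minimal sketch and not a reproduction of the tutorial's Table 1: the plotting positions (i - 0.5)/n are one common convention and an assumption here, and the function names are illustrative.

```python
from statistics import NormalDist, mean


def qq_points(data):
    """Pair each sorted observation with its standard normal quantile."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i - 0.5) / n is an assumed convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))


def straightness(points):
    """Pearson correlation of the Q-Q points; values near 1 mean the
    points fall close to a straight line, as expected for normal data."""
    mt = mean(t for t, _ in points)
    mx = mean(x for _, x in points)
    cov = sum((t - mt) * (x - mx) for t, x in points)
    var_t = sum((t - mt) ** 2 for t, _ in points)
    var_x = sum((x - mx) ** 2 for _, x in points)
    return cov / (var_t * var_x) ** 0.5
```

Plotting the pairs from `qq_points` gives the Q-Q plot itself; strong curvature or a low `straightness` value suggests the data depart from normality.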

To download the Excel file, access the most recent update of the tutorial, available at

http://www.minerazzi.com/tutorials/quantile-quantile-tutorial.pdf

We also removed a few extra “)” typos that went undetected in previous copies.

Have a great Q-Q day!🙂

Text Streamer Tool

The Text Streamer is a new tool available at

http://www.minerazzi.com/tools/text-streamer/streamer.php

Streamline text by removing non-printable or encoded characters and multiple spaces.

The tool converts non-printable characters, including tabs, returns, and newlines, as well as multiple spaces, into single spaces. Users can opt to remove all encodes; these are characters encoded in %, decimal, or hexadecimal notation.
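The same idea can be sketched in a few lines of Python. This is an approximation, not the tool's actual code: it assumes "encodes" means percent-encoded bytes (such as %20) and HTML numeric entities in decimal or hexadecimal form (such as &#169; or &#xA9;), and it replaces each encode with a space so that neighboring words do not fuse together.

```python
import re

# Runs of whitespace and ASCII control characters (tabs, returns, newlines, ...).
WHITESPACE_OR_CONTROL = re.compile(r"[\s\x00-\x1f\x7f]+")

# Assumed meaning of "encodes": %XX, &#NNN; and &#xHH; sequences.
ENCODES = re.compile(r"%[0-9A-Fa-f]{2}|&#[xX][0-9A-Fa-f]+;|&#\d+;")


def streamline(text, remove_encodes=False):
    """Collapse whitespace/control runs into single spaces;
    optionally strip encoded characters first."""
    if remove_encodes:
        text = ENCODES.sub(" ", text)
    return WHITESPACE_OR_CONTROL.sub(" ", text).strip()


print(streamline("Hello\t\tworld\r\n  again"))
print(streamline("50%25 off&#33;", remove_encodes=True))
```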

To use the tool, just enter your input text and submit the form. To remove all encodes, check the form checkbox. Click the output text to select it, then copy and paste it as you usually would.

It comes in handy for users who need to copy and paste streamlined (plain) text from one file type to another, or post it through HTML forms on blogs, discussion forums, social network sites, or any other site for that matter.