An In-Context Topic Crawler

I have completed a breadth-first in-context crawler that traverses the Web, recursively discovering links in two modes:

1. continuous mode: without stopping.
2. discrete mode: controlled by the user.

The first mode quickly fills a small database, but it makes the crawler act like a vacuum cleaner, collecting all kinds of garbage; that is, it also grabs links that might be irrelevant to a topic-specific database.

The second mode is slower, but it lets me decide, based on topic criteria, whether to continue or stop crawling at a given recursion level, which reduces the amount of garbage gathered. Relative URLs are automatically resolved into absolute ones, a somewhat tricky task when recursion is involved.
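As a rough illustration (not the crawler's actual code), here is a minimal Python sketch of a breadth-first traversal with a recursion-level limit. The function name `crawl`, the `get_links` callback, and the `max_depth` parameter are hypothetical; the depth limit merely stands in for the user's stop/continue decision. The point it tries to show is that each relative link must be resolved against the page where it was found, not against the seed URL.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(seed, get_links, max_depth=2):
    """Breadth-first traversal from seed.

    get_links(url) is a hypothetical callback that returns the raw href
    values found on that page. Returns (url, depth) pairs in visit order.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand pages at the deepest allowed level
        for href in get_links(url):
            # Resolve against the page the link came from, then drop "#fragment".
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return order
```

Passing a function that downloads a page and returns its href values as `get_links` would turn this into a working crawler; a dictionary lookup is enough for experimenting.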

In addition to links, the crawler selectively extracts strings that match specific patterns like email addresses, phone numbers, zip codes, etc. Contextual keywords surrounding the patterns are also collected so that the two can be reciprocally mapped. The goal is to develop a service that consumes in-context, pattern-specific databases, which are great for people searches, intelligence, and marketing. So far, the project makes building topic miners even simpler.
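To make the idea of in-context extraction concrete, here is a small Python sketch. It is only an assumption about how such a step could look: the two regular expressions are simplified, and the names `PATTERNS`, `extract_with_context`, and `window` are hypothetical. Each match is returned together with a few words found before and after it.

```python
import re

# Simplified, illustrative patterns; real-world ones would need to be stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract_with_context(text, window=3):
    """Return (kind, match, context_words) for every pattern hit in text.

    context_words holds up to `window` lowercased alphabetic words before
    the match followed by up to `window` words after it.
    """
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            before = re.findall(r"[A-Za-z]+", text[:m.start()])[-window:]
            after = re.findall(r"[A-Za-z]+", text[m.end():])[:window]
            hits.append((kind, m.group(), [w.lower() for w in before + after]))
    return hits
```

Storing the results both as pattern-to-keywords and keywords-to-pattern would give the reciprocal mapping mentioned above.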

Reference
http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf

Mapping IPs: The Wikipedia Way?

Here are some services apparently used by Wikipedia for mapping IPs to other resources. They are great for business intelligence, tracking users, spammers, marketing research, and more.

We used them to test yahoo.com’s IP (98.139.183.24).

Replace every instance of 98.139.183.24 with the IP address you want to test.

https://www.robtex.com/ip/98.139.183.24.html
http://www.robtex.com/whois/98.139.183.24.html
https://db-ip.com/98.139.183.24
http://whatismyipaddress.com/ip/98.139.183.24

https://tools.wmflabs.org/guc/index.php?user=98.139.183.24
http://www.dnsstuff.com/tools/
http://reportcard.wmflabs.org/
https://petscan.wmflabs.org/
http://tools.wmflabs.org/render-tests/catcycle-dev/catcycle.py

http://whois.arin.net/rest/ip/98.139.183.24.html
http://wq.apnic.net/apnic-bin/whois.pl?searchtext=98.139.183.24
http://www.afrinic.net/cgi-bin/whois?searchtext=98.139.183.24
http://www.ripe.net/fcgi-bin/whois?searchtext=98.139.183.24
https://rdap.lacnic.net/rdap-web/ip?key=98.139.183.24
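The substitution described above can also be scripted. The following Python sketch is just one way to do it: the `TEMPLATES` list copies a few of the URLs listed here with the IP replaced by a placeholder, and the `lookup_urls` name is hypothetical. The standard `ipaddress` module is used to reject malformed input before any URL is built.

```python
import ipaddress

# A few of the lookup URLs listed above, with the IP replaced by {ip}.
TEMPLATES = [
    "https://www.robtex.com/ip/{ip}.html",
    "https://db-ip.com/{ip}",
    "http://whois.arin.net/rest/ip/{ip}.html",
    "https://rdap.lacnic.net/rdap-web/ip?key={ip}",
]

def lookup_urls(ip):
    """Return the lookup URLs for one IP; raises ValueError if ip is malformed."""
    ip = str(ipaddress.ip_address(ip))
    return [template.format(ip=ip) for template in TEMPLATES]
```

For example, `lookup_urls("98.139.183.24")` rebuilds the corresponding links shown in this post.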

MUST Tool Enhancements

We have enhanced the MUST tool, available at

http://www.minerazzi.com/tools/must/must.php

This is a redirection checker that, when a URL redirects, reports the initial and final status codes, URLs, and IP addresses.

The tool now:

1. accepts up to 500 URLs per submission.
2. summarizes broken and active URLs.

This is one of several tools that we use in-house for re-indexing databases and cleaning up crawl results, although the in-house version has no URL limit.
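For readers curious about what a redirection check involves, here is a hedged Python sketch. It does not reproduce MUST's code. The `fetch` callback, which returns a status code and an optional Location value, is a hypothetical stand-in for a real HTTP request. The rule that a final status of 400 or higher (or an unresolved redirect loop) counts as broken is also an assumption, and IP lookup is left out.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def trace_redirects(url, fetch, max_hops=10):
    """Follow redirects from url and report initial and final status.

    fetch(url) is a hypothetical callback returning (status_code, location),
    where location is None when the response carries no Location header.
    """
    initial_status = None
    current = url
    for _ in range(max_hops + 1):
        status, location = fetch(current)
        if initial_status is None:
            initial_status = status
        if status in REDIRECT_CODES and location:
            current = urljoin(current, location)  # Location may be relative
            continue
        return {"initial_url": url, "initial_status": initial_status,
                "final_url": current, "final_status": status}
    # Too many hops: treat the chain as unresolved.
    return {"initial_url": url, "initial_status": initial_status,
            "final_url": current, "final_status": None}

def summarize(reports):
    """Split reports into active and broken initial URLs (assumed rule)."""
    summary = {"active": [], "broken": []}
    for r in reports:
        is_broken = r["final_status"] is None or r["final_status"] >= 400
        summary["broken" if is_broken else "active"].append(r["initial_url"])
    return summary
```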

Shopping & Shopper Miner

This miner has been reindexed with new content, ahead of the Black Friday and Holiday Season special offers and deals.

It is available at http://www.minerazzi.com/shopper/

The miner now features an SPP-powered news service with consumer alerts, recalls, and fraud reports.

Use it to find coupon deals, price comparisons, holiday offers, consumer reports, and more.

E-Paper Miner

This is a new miner, available at

http://www.minerazzi.com/e-paper

A miner on electronic paper and ink technologies is more than timely, now that the news is out about Microsoft’s sticky notes using e-ink powered by ambient light.

E-Paper and E-Ink technologies are set to simplify our lives:

http://www.psfk.com/2016/10/microsofts-e-paper-note-runs-on-ambient-light.html

https://www.technologyreview.com/s/602710/this-e-ink-post-it-never-needs-to-be-charged/

Mobile Technology Miner

This is a new miner available at

http://www.minerazzi.com/mobile

Find apps, software, and vendors for mobile devices, smartphone technology, and more. Search by vendor, maker, model, or keyword. Submit web pages relevant to these topics.

The miner includes an RSS News section, powered by SPP, to help you find industry-related news, releases, events, and more, so far from AT&T, Samsung, Sprint, and Verizon. Feel free to submit a relevant RSS URL for inclusion.

Excel File for Quantile-Quantile Analysis

This is an Excel .xlsx file for reproducing Table 1 of our tutorial on Quantile-Quantile Plots. Now anyone with Excel installed can experiment with this simple technique for checking whether a data set is normally distributed.

To download the Excel file, access the most recent update of the tutorial, available at

http://www.minerazzi.com/tutorials/quantile-quantile-tutorial.pdf
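For those without Excel, the core of a normal Q-Q analysis can be sketched in a few lines of Python. This is not a reproduction of Table 1: the function names are hypothetical, and the plotting position (i - 0.5)/n is one common convention that may differ from the one used in the tutorial. Each sorted observation is paired with the standard-normal quantile for its rank, and the correlation of the pairs indicates how close they lie to a straight line.

```python
import math
from statistics import NormalDist

def qq_points(data):
    """Pair each sorted observation with its theoretical standard-normal quantile."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i - 0.5) / n is an assumed convention.
    return [(std_normal.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

def qq_correlation(points):
    """Pearson correlation between theoretical and observed quantiles."""
    n = len(points)
    mz = sum(z for z, _ in points) / n
    mx = sum(x for _, x in points) / n
    szx = sum((z - mz) * (x - mx) for z, x in points)
    szz = sum((z - mz) ** 2 for z, _ in points)
    sxx = sum((x - mx) ** 2 for _, x in points)
    return szx / math.sqrt(szz * sxx)
```

Plotting the pairs and finding them close to a straight line (a correlation near 1) suggests that the data are approximately normal.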

We also removed a few extra “)” typos that had gone undetected in previous copies.

Have a great Q-Q day! 🙂

Text Streamer Tool

The Text Streamer is a new tool available at

http://www.minerazzi.com/tools/text-streamer/streamer.php

Streamline text by removing non-printable or encoded characters and multiple spaces.

The tool converts non-printable characters, including tabs, returns, and newlines, as well as multiple spaces, into single spaces. Users can opt to remove all encodes, that is, characters encoded in %, decimal, or hexadecimal notation.
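The conversion can be approximated with two regular expressions, as in the Python sketch below. This is an assumption about the behavior, not the tool's source: the names are hypothetical, the encoded forms are taken to be %XX, &#NNN;, and &#xHH;, each encode is replaced by a space rather than deleted outright, and leading and trailing spaces are trimmed.

```python
import re

# Assumed encoded forms: %XX, decimal &#NNN;, and hexadecimal &#xHH;
ENCODED = re.compile(r"%[0-9A-Fa-f]{2}|&#[0-9]+;|&#[xX][0-9A-Fa-f]+;")
# ASCII control characters (tabs, returns, newlines, ...), spaces, and DEL.
NON_PRINTABLE_OR_SPACE = re.compile(r"[\x00-\x20\x7f]+")

def streamline(text, remove_encodes=False):
    """Collapse runs of non-printable characters and spaces into single spaces."""
    if remove_encodes:
        text = ENCODED.sub(" ", text)  # assumed: replace, then collapse below
    return NON_PRINTABLE_OR_SPACE.sub(" ", text).strip()
```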

To use the tool, just enter your input text and submit the form. To remove all encodes, check the form checkbox. Click the output text to select it, then copy and paste it as you usually would.

It comes in handy for users who need to copy and paste streamlined (plain) text from one file type to another, or post it through HTML forms on blogs, discussion forums, social network sites, or any other site for that matter.

BM25F Model Tutorial

We have restored, expanded, and updated our tutorial on the BM25 Extension to Multiple Weighted Fields Model, best known as BM25F. It is now available at

http://www.minerazzi.com/tutorials/bm25f-model-tutorial.pdf

Active links were also added to the References section.
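As a quick taste of the model, here is a hedged Python sketch of a BM25F-style score. It follows the commonly published form, in which per-field term frequencies are length-normalized, weighted, and summed into a pseudo-frequency before saturation with k1. The tutorial's exact notation and IDF definition may differ; the function name, parameter names, and the non-negative IDF variant used here are assumptions.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, field_b,
                avg_field_len, doc_freq, n_docs, k1=1.2):
    """Score one document whose fields are given as lists of tokens.

    doc_fields:    {field: [tokens]}
    field_weights: {field: boost}
    field_b:       {field: length-normalization parameter in [0, 1]}
    avg_field_len: {field: average token count of that field in the collection}
    doc_freq:      {term: number of documents containing the term}
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            b = field_b[field]
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights[field] * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        # Assumed IDF variant that stays non-negative for frequent terms.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

With a title weight larger than the body weight, a term found in the title contributes more to the score than the same term found only in the body.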

Enjoy it.