I have completed a breadth-first in-context crawler that traverses the Web, recursively discovering links in two modes (a sketch of the crawl loop follows the list):

1. continuous mode: crawls without stopping.
2. discrete mode: controlled by the user.
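
A minimal sketch of what the two-mode crawl loop could look like in Python; the fetch_links and save callables, the discrete flag, and the per-level prompt are assumptions for illustration, not the actual implementation:

    from collections import deque

    def crawl(seeds, fetch_links, save, discrete=False, max_depth=3):
        # Queue of (url, recursion level) pairs, visited breadth-first.
        queue = deque((url, 0) for url in seeds)
        seen = set(seeds)
        current_level = 0
        while queue:
            url, level = queue[0]
            # Discrete mode: pause at each new recursion level and let
            # the user decide, on topic criteria, whether to go deeper.
            if discrete and level > current_level:
                if input("Continue to level %d? [y/n] " % level).lower() != "y":
                    break
                current_level = level
            queue.popleft()
            save(url)  # store the link in the database
            if level < max_depth:
                for link in fetch_links(url):  # absolute URLs expected here
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, level + 1))

Running with discrete=False gives the vacuum-cleaner behavior described next; discrete=True trades speed for control.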

The first mode quickly fills a small database, but it makes the crawler act like a vacuum cleaner, collecting all kinds of garbage; links that are irrelevant to a topic-specific database get grabbed along with everything else.

The second mode is slower, but it lets me decide, at each recursion level and based on topic criteria, whether to continue or stop the crawl, which reduces the amount of garbage gathered. Relative URLs are automatically resolved into absolute ones, a task that gets tricky when recursion is involved.
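
Python's standard urllib.parse covers most of that resolution; the sketch below assumes each fetched page's own URL is used as the base at every recursion level, which is the part that is easy to get wrong once the crawl goes deep:

    from urllib.parse import urldefrag, urljoin

    def resolve(base_url, hrefs):
        # Resolve each href against the page it appeared on, and drop
        # the #fragment so the same page is not queued twice.
        return [urldefrag(urljoin(base_url, href))[0] for href in hrefs]

    print(resolve("http://example.com/a/b/page.html",
                  ["c.html", "../d.html", "//cdn.example.com/e"]))
    # ['http://example.com/a/b/c.html',
    #  'http://example.com/a/d.html',
    #  'http://cdn.example.com/e']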

In addition to links, the crawler selectively extracts strings that match specific patterns such as email addresses, phone numbers, and ZIP codes. Contextual keywords surrounding each match are also collected, so patterns and keywords can be mapped to each other. The goal is to develop a service that consumes these in-context, pattern-specific databases, which would be useful for people searches, intelligence, and marketing. So far, the project has also made the building of topic miners even simpler.
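
A rough sketch of that extraction step under stated assumptions: the two regexes, the 40-character context window, and the keyword filter are illustrative, not the patterns the crawler actually uses:

    import re

    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    }

    def extract_with_context(text, window=40):
        # Collect each match together with the words around it, so
        # patterns and contextual keywords can be reciprocally mapped.
        hits = []
        for name, pattern in PATTERNS.items():
            for m in pattern.finditer(text):
                before = text[max(0, m.start() - window):m.start()]
                after = text[m.end():m.end() + window]
                keywords = re.findall(r"[A-Za-z]{4,}", before + " " + after)
                hits.append({"type": name, "match": m.group(), "keywords": keywords})
        return hits

    print(extract_with_context("Contact sales at info@example.com, Boston MA 02134."))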
