On 03-25-11 we released the minerazzi web crawler and link checker tool. As a beta, it is not perfect and needs improvement. It is the online version of the crawler used by the minerazzi search architecture (beta). Hopefully, it will evolve into a diagnostic tool.
We mentioned that the online version would undergo several changes, all intended to provide an online “web crawler for the masses”. The idea is to put users in control of the crawling process, since current crawlers lack human intuition about which URL to crawl next from a to-do list.
We are pleased to announce the following changes.
04-19-11: Robots Text File detection capabilities added. Metadata parsing changes. (*)
04-18-11: Title and Meta Tags detection capabilities added. Layout changes.
04-16-11: User Environment detection capabilities added.
04-14-11: Timer capabilities added.
04-10-11: Deduplication capabilities added.
04-09-11: Color palette reporting capabilities added.
04-05-11: DNS and MX reporting capabilities added.
04-03-11: Source code reporting capabilities added.
03-30-11: Relative URL resolving capabilities added.
03-28-11: Hypertext wrapping, IP, and headers reporting capabilities added.
(*) Just added.
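Two of the entries above, relative URL resolving and deduplication, can be illustrated with a minimal Python sketch. This is not the crawler's actual code: the helper name `normalize_links` and the example.com URLs are assumptions made for illustration, and only the standard library is used.

```python
from urllib.parse import urljoin, urldefrag


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping order.

    Hypothetical helper for illustration; not the minerazzi implementation.
    """
    seen = set()
    resolved = []
    for href in hrefs:
        # urljoin turns a relative reference into an absolute URL.
        absolute = urljoin(base_url, href)
        # Drop the #fragment so the same page is not listed twice.
        absolute, _fragment = urldefrag(absolute)
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


if __name__ == "__main__":
    links = normalize_links(
        "http://example.com/docs/index.html",
        ["about.html", "../contact.html", "about.html#team", "/sitemap.xml"],
    )
    for link in links:
        print(link)
```

Here `about.html` and `about.html#team` collapse into a single entry, which is one simple way to keep a to-do list free of repeats. A real crawler would probably normalize more aggressively (letter case of the host, trailing slashes, query parameters).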
There is something for everyone here.
Web Designers: Want to use a color palette from another site or tweak yours? Easy. Launch a crawl of a CSS file already discovered by the crawler.
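For readers curious how a palette might be pulled out of a style sheet, here is a rough sketch under stated assumptions. It is not how the tool itself works: `extract_palette` is a hypothetical function, and it only looks at hex color codes, ignoring named colors and `rgb()` values.

```python
import re
from collections import Counter

# Matches 3- or 6-digit hex colors such as #fff or #1A2B3C.
# Caveat: an ID selector spelled only with hex letters (e.g. #decade)
# would also match; a real tool would need a proper CSS parser.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def extract_palette(css_text):
    """Count hex colors in css_text, normalized to lowercase #rrggbb."""
    palette = Counter()
    for match in HEX_COLOR.findall(css_text):
        color = match.lower()
        if len(color) == 4:
            # Expand shorthand notation: #abc becomes #aabbcc.
            color = "#" + "".join(ch * 2 for ch in color[1:])
        palette[color] += 1
    return palette


if __name__ == "__main__":
    css = "body { color: #FFF; background: #1a2b3c; } a { color: #fff; }"
    for color, count in extract_palette(css).most_common():
        print(color, count)
```

Counting occurrences gives a quick sense of which colors dominate a design, since `#FFF` and `#fff` are treated as the same color.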
Web Developers: Want to see diamonds? View the source of any file, including PDFs.
Data miners: Need to mine links? Crawl a document. Better: launch a crawl of a site map file already discovered by the crawler.
Researchers: Want to check system configurations? Check IPs, DNS, MX and header traces (including data from cookies/sessions, etc.).
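As an illustration of link mining, the sketch below collects `href` values from the anchor tags of an HTML document using Python's standard `html.parser` module. The names `LinkCollector` and `mine_links` are hypothetical, and this is only one possible approach, not the tool's own method.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def mine_links(html_text):
    """Return every non-empty href found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = '<p><a href="a.html">A</a> <A HREF="/b">B</A> <a name="x">none</a></p>'
    for link in mine_links(page):
        print(link)
```

The links come back exactly as written in the page, so relative ones would still need to be resolved against the page URL before they could be crawled.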
More updates are coming soon!