We have improved the Minerazzi platform (http://www.minerazzi.com) by adding new features.
These include an internal filter for deduplicating URLs, which is currently being tested. The interface is also a lot cleaner.
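To give a feel for what URL deduplication involves, here is a minimal Python sketch. It is not the Minerazzi filter, whose internals are not described here. The function names `normalize_url` and `dedupe_urls` and the normalization rules (lowercasing the scheme and host, dropping default ports, fragments, and trailing slashes) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparison key (illustrative rules, not Minerazzi's)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"
    # Assumption: "/a/" and "/a" point to the same resource.
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe_urls(urls):
    """Return URLs in first-seen order, skipping normalized duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


sample = [
    "http://Example.com/a/",
    "http://example.com/a#top",
    "http://example.com:80/a",
    "http://example.com/b",
]
unique_urls = dedupe_urls(sample)
```

The first three sample URLs collapse to the same key, so only the first of them and the `/b` URL survive. A production filter would likely need more rules (query parameter ordering, tracking parameters, redirects), and treating trailing slashes as equivalent is not safe for every site.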
We found that with the recrawling features of Minerazzi, it is extremely easy to build topic-specific collections from sites like Wikipedia.
Once a Wikipedia record shows up in a search result, recrawling its links simplifies the discovery of relevant URLs, which can then be exported to a personal collection project. Cool and nice!
Feel the power of micro-indexing
That’s the great thing about micro-indexing. It allows you to do things you cannot do with a static, monster-sized index. That is, you start with a tiny index and explore its reach as you use it to navigate the Web. It is the reach of an index, not its size, that really matters to a data miner gathering topic-specific resources.
Unlike traditional search engines, where you search a huge index of cached records (often irrelevant or outdated), you go on a discovery journey through records as they currently exist online! You start with a seed of highly relevant URLs (the micro-index) and then mine and build while searching. In my book, that’s a nice search paradigm.
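The seed-and-expand idea can be sketched in a few lines of Python. This is only an illustration of the paradigm, not how Minerazzi recrawls. The function `expand_reach`, its `fetch_links` and `is_relevant` callbacks, the `max_depth` limit, and the toy link graph are all hypothetical. A real version would fetch live pages instead of reading from a dictionary.

```python
from collections import deque


def expand_reach(seeds, fetch_links, is_relevant, max_depth=2):
    """Breadth-first expansion from seed URLs, keeping only relevant pages."""
    collected = []
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if not is_relevant(url):
            continue  # irrelevant pages are neither kept nor expanded
        collected.append(url)
        if depth == max_depth:
            continue  # keep the page, but do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return collected


# Toy link graph standing in for live pages (hypothetical data).
toy_web = {
    "wiki/Information_retrieval": ["wiki/Vector_space_model", "wiki/Cooking"],
    "wiki/Vector_space_model": ["wiki/Tf-idf"],
    "wiki/Cooking": ["wiki/Recipes"],
}

found = expand_reach(
    seeds=["wiki/Information_retrieval"],
    fetch_links=lambda url: toy_web.get(url, []),
    is_relevant=lambda url: "Cook" not in url and "Recipe" not in url,
)
```

With a single seed, the collection grows to three on-topic pages while the off-topic branch is pruned. That is the sense in which the reach of a small index matters more than its size.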
On other matters, the final touches of the Information Retrieval Collection (IRC) are being implemented.
So far, IRC includes records from NIST’s TREC conference proceedings, the Gerard Salton Collection in Cornell University’s eCommons database, Wikipedia, IR books, and lecture material and research articles from top-level IR researchers.