Building topic-specific collections, the easy way

We have improved the Minerazzi platform (http://www.minerazzi.com) by adding new features.

These include an internal filter for deduplicating URLs, which is currently being tested. The interface is also a lot cleaner.
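To make the idea of a deduplication filter concrete, here is a minimal sketch in Python. It is not the filter used by Minerazzi, whose internals are not described here; the normalization rules (lowercasing scheme and host, dropping default ports and fragments, treating an empty path as `/`) and the names `normalize_url` and `deduplicate` are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` for duplicate detection (assumed rules)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    # Treat an empty path and "/" as the same resource.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def deduplicate(urls):
    """Yield each URL once, keeping the first spelling that was seen."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield url
```

With these rules, `http://www.minerazzi.com`, `HTTP://WWW.MINERAZZI.COM/`, and `http://www.minerazzi.com/#top` would all count as one URL. A production filter would likely need more rules (for example, sorting query parameters), so treat this as a starting point only.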

We found that with the recrawling features of Minerazzi, it is extremely easy to build topic-specific collections from sites like Wikipedia.

Once you get a Wikipedia record in a search result, recrawling its links simplifies the discovery of relevant URLs, which can then be exported to a personal collection project. Cool and nice!

Feel the power of micro-indexing

That’s the great thing about micro-indexing. It allows you to do things you cannot do with a static, monster-sized index. That is, you start with a tiny index and explore its reach as you use it to navigate the Web. It is the reach of an index, not its size, that really matters to a data miner gathering topic-specific resources.

Unlike traditional search engines, where you search a huge index of cached records (often irrelevant or outdated), here you go on a discovery journey through records as they currently exist online! You start with a seed of highly relevant URLs (the micro-index) and then mine and build while searching. In my book, that’s a nice search paradigm.

On other matters, the final touches of the Information Retrieval Collection (IRC) are being implemented.

So far, IRC includes records from NIST’s TREC conference proceedings, the Gerard Salton Collection from Cornell University’s eCommons database, Wikipedia, IR books, and lecture material or research articles from top-level IR researchers.

Lessons learned from building an IR collection

We are currently building the Information Retrieval Collection (IRC) with the Minerazzi platform. URLs pointing to resources like articles from scholarly journals and teaching material from scholarly web pages are crawled and indexed.

So far we have found that a nontrivial number of these URLs point to teaching material (lecture notes, tutorials, ...) in a markup format with title tags or file names that poorly describe their content, which makes indexing difficult. Some don’t even have any useful information in their HTML head section. Something similar occurs with the URLs of .pdf and .ps files: we often found no useful file names.

For journal articles, this is understandable, as it could be the result of editorial policies; not so for teaching materials, though.

Sadly, the impression is that either university webmasters or the scholars who wrote the resources are sloppy or go by the “do as I teach, not as I do” rule.

In any case, the above documents have descriptors that are too short or poorly written to help a human or robot with the indexing. One can do better by going with the anchor text displayed in the documents, but again, that workaround relies on the content quality of said text.

We are working on a partial solution to the problem using temporary tagging instead of extracting summaries after full-text indexing. It is an interesting trick for dynamically building collections, but again it is not a perfect solution.

Improving the Data Structures and Algorithms Collection

We have almost doubled the index of the Data Structures and Algorithms (DSAC) miner. In addition, we are moving to indexing books relevant to this miner. Some changes were also made to the output of the search result pages, so these are now less cluttered.

Essentially, users now have the option of viewing sections of the search results. We did not do this before because of collisions with div ids.

These changes to the SERPs apply to all current miners built with the Minerazzi platform, too. Enjoy!

Unveiling Link Honey Pots with Minerazzi

In Web Spam Taxonomy, Gyongyi and Garcia-Molina describe several web spam techniques, one being honey pots.

They describe these as “a set of pages that provide some useful resource …, but that also have (hidden) links to the target spam page(s).”

To target non-human visitors (e.g., web crawlers), said links could be placed in HTML elements that are made invisible (e.g., division tags with a display:none CSS rule), with rel=”dofollow” added to the anchor tags.

These types of tricks can be easily unveiled with the recrawling feature of Minerazzi.

For instance, searching for [ heap sort ] in http://www.minerazzi.com/dsac retrieves several records, one being http://www.aihorizon.com/resources/sourcecode/trees/heap_h.htm.

Clicking the “Recrawl it” link retrieves several URLs, one being http://www.aihorizon.com/index.htm.

Recrawling that link quickly reveals that said index page has several hidden links to porn sites. Looking at the source code of that URL shows that the above adversarial technique was used.
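For readers who want to check a page's source for this pattern themselves, the following is a minimal sketch using only Python's standard library. It is not how Minerazzi detects anything; the class `HiddenLinkFinder` and the function `find_hidden_links` are hypothetical names. It only catches anchors nested inside elements hidden with an inline `display:none` style, so links hidden through external stylesheets or scripts would go unnoticed.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be stacked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenLinkFinder(HTMLParser):
    """Collect href values of anchors nested in inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.stack = []          # (tag, is_hidden) for each open element
        self.hidden_links = []

    def _inside_hidden(self):
        return any(hidden for _, hidden in self.stack)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        hidden = bool(HIDDEN_STYLE.search(attrs.get("style") or ""))
        if tag == "a" and attrs.get("href") and (hidden or self._inside_hidden()):
            self.hidden_links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_hidden_links(html: str):
    """Return the hrefs of all anchors found inside inline-hidden elements."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_links
```

Calling `find_hidden_links(page_source)` on a page's HTML returns the list of suspicious link targets, which can then be compared against the visible links on the same page.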

Minerazzi: Allowing Users to Recrawl Search Results

Effective immediately, Minerazzi (http://www.minerazzi.com) allows users to recursively recrawl search results.

Why is recrawling so important?

The purpose of allowing users to recrawl URLs is to expose them to new content, to involve them in learning through discovery, and to turn their searches into a mining activity. This makes more sense than limiting their search experience to inspecting zillions of cached records from a search engine index. The problem with the latter is that those records are frequently either outdated or irrelevant, not to mention that in that scenario the users are simply passive spectators.

Allowing users to recrawl search results has many advantages and possibilities. For instance, users can use the discovered URLs to build curated collections, self-guide investigative work, or gather link intelligence from sites, directories, blogs, forums, or social networks. In general, recrawling allows users to discover hidden paths to fresh, new, or rich content.

Considering that the total number of primary and secondary URLs defines the reach of a microindex, in theory recrawling should result in an endless reach.

At this time, we do not recrawl .css and .pdf files, but we recrawl the most common file formats (.php, .asp, .aspx, .html, .htm, .js, etc.). However, if the content of a file is dynamic, obfuscated, or poorly coded, recrawling it will most likely return garbage or nothing.
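The notion of reach can be illustrated with a small sketch. The code below is not Minerazzi's crawler: `reach`, `is_recrawlable`, and the toy link graph `TOY_WEB` are hypothetical, and a dictionary stands in for fetching live pages. It counts the distinct URLs discovered from a set of seed URLs after a given number of recrawl rounds, skipping the two file types mentioned above.

```python
from collections import deque
from urllib.parse import urlsplit

SKIPPED_EXTENSIONS = (".css", ".pdf")  # formats the post says are not recrawled


def is_recrawlable(url: str) -> bool:
    """True unless the URL path ends in a skipped file extension."""
    return not urlsplit(url).path.lower().endswith(SKIPPED_EXTENSIONS)


def reach(seeds, get_links, max_depth):
    """Count distinct URLs found from `seeds` within `max_depth` recrawls."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth or not is_recrawlable(url):
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return len(seen)


# Hypothetical link graph standing in for live pages.
TOY_WEB = {
    "http://a.example/": ["http://a.example/notes.html", "http://a.example/style.css"],
    "http://a.example/notes.html": ["http://b.example/", "http://a.example/paper.pdf"],
    "http://b.example/": ["http://a.example/"],
}


def toy_links(url):
    """Return the outgoing links of `url` in the toy graph."""
    return TOY_WEB.get(url, [])
```

On this closed toy graph the reach stops growing after two rounds. On the live Web, each round tends to uncover pages that link to yet more pages, which is the sense in which the reach is endless in theory.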

Having said that, we invite you to try the recrawling experience with our public miners.

The Investigative Journalism Collection

The Investigative Journalism Collection (http://www.minerazzi.com/journalism) is the most recent miner built with Minerazzi. It is a great tool for investigative reporters, freelancers, and students of journalism and their teachers.

Use it to find sites, sources, organizations, tools, and other resources relevant to investigative reporting.

Use its recrawling feature to discover information-rich links from http://wikileaks.org, http://aclu.org, and zillions of other sites like

http://cironline.org

http://periodismoinvestigativo.com

http://investigativenewsnetwork.org

http://talkingpointsmemo.com

http://www.truthdig.com

http://www.opensecrets.org

http://www.propublica.org

http://www.publicintegrity.org

and many more!!!

The Newspapers Collection

The Newspapers Collection is now available at Minerazzi.com (http://www.minerazzi.com/news).

Find top world newspapers in one place. As we keep growing this collection, feel free to recrawl a search result to access hard-to-find paths to more resources. A great tool for students and investigative reporters.

Coming Soon: The IR Collection.

Efficiently Consuming Web Resources with Minerazzi

As you might know by now, Minerazzi is a platform for building ‘miners’. We define a miner as a topic-specific search engine that allows end-users to search, index, mine, and recrawl online resources. The goal is to turn web searchers into data miners as a natural evolution of the traditional concept of searching.

Minerazzi proposes a different search paradigm. Each miner built with the platform puts users at the center of the search experience, as participants rather than mere spectators.

For instance, consider one of the main features of Minerazzi: Recrawling.

Recrawling is a guided search activity, a discovery learning technique that allows users to quickly discover new resources associated with a given search result.

To illustrate, suppose you are using the Data Structures and Algorithms Collection (DSAC) miner (http://www.minerazzi.com/dsac). At the time of writing, searching for something as specific as [ bloom filter ] returns two records with the following URLs:

1. http://xlinux.nist.gov/dads/HTML/bloomFilter.html
2. http://en.wikipedia.org/wiki/Bloom_filter

Recrawling the first URL retrieves 17 secondary URLs: 10 External (58.82%) and 7 Internal (41.18%), for a 1.43 ratio.

Recrawling the second URL retrieves 308 secondary URLs: 159 External (51.62%) and 149 Internal (48.38%), for a 1.07 ratio.

Secondary URLs are sorted by type and then alphabetically. These are URLs that are somehow related to a resource’s content (e.g., front-end) and design (e.g., back-end).

Clicking on a secondary URL allows users to visit said resource, whereas clipboard-copying an entire list of URLs can be done by clicking the top-right { S } link and pressing the Ctrl + C keys. After that, the user can export the list from the clipboard to a text file or spreadsheet.
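The figures above can be reproduced with a few lines of Python. This is only a sketch under two assumptions that the post does not state explicitly: a secondary URL is "internal" when its host matches the host of the recrawled page, and the ratio is external divided by internal (which agrees with 10/7 ≈ 1.43 and 159/149 ≈ 1.07). The function name `link_stats` is hypothetical.

```python
from urllib.parse import urlsplit


def link_stats(page_url, secondary_urls):
    """Split secondary URLs into internal/external and summarize the counts."""
    page_host = urlsplit(page_url).hostname
    # Group by type first, then sort alphabetically within each group.
    internal = sorted(u for u in secondary_urls if urlsplit(u).hostname == page_host)
    external = sorted(u for u in secondary_urls if urlsplit(u).hostname != page_host)
    total = len(internal) + len(external)
    return {
        "external": external,
        "internal": internal,
        "external_pct": round(100 * len(external) / total, 2) if total else 0.0,
        "internal_pct": round(100 * len(internal) / total, 2) if total else 0.0,
        # External-to-internal ratio; undefined when there are no internal links.
        "ratio": round(len(external) / len(internal), 2) if internal else None,
    }
```

Feeding it 10 external and 7 internal URLs gives 58.82%, 41.18%, and a ratio of 1.43, matching the first example.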

We are now testing the consumption of URLs from third-party search engine result pages. The goal is to allow users to quickly exhaust results from those third parties.

There is no way back to merely searching.

Expanding the Data Structures and Algorithms Collection

To help with the dissemination of research work from others, we have expanded the Data Structures and Algorithms Collection (http://www.minerazzi.com/dsac).

In addition to NIST’s DADS, this miner now includes URLs pointing to lecture notes and other educational resources.

Feel free to submit relevant resources if these are not already indexed. Teaching material, tutorials, articles, and other resources describing data structures and algorithms are accepted.

Mining SEO and Internet Marketing Competitor Sites

Are you an SEO or part of a search marketing company? Want to compare SEO competitors or know how search marketing company home pages were designed? Which CSS style rules and colors were used for their look and feel, or which JavaScript “gems” were used? Or perhaps you need to check for possible misconfigurations or robots.txt exploits, keyword distributions, etc.? Or maybe you just want to quickly check if they have added new or fresh content?

Use the Internet Marketing Search Engine (http://www.minerazzi.com/seominer). Its index, now on the order of thousands, is a curated collection of SEO and search marketing resources. Feel free to submit a URL if it is not already indexed.