The Investigative Journalism Collection

The Investigative Journalism Collection (http://www.minerazzi.com/journalism) is the most recent miner built with Minerazzi. It is a great tool for investigative reporters, freelancers, and journalism students and their teachers.

Use it to find sites, sources, organizations, tools, and other resources relevant to investigative reporting.

Use its recrawling feature to discover information-rich links from http://wikileaks.org, http://aclu.org, and zillions of other sites, such as

http://cironline.org

http://periodismoinvestigativo.com

http://investigativenewsnetwork.org

http://talkingpointsmemo.com

http://www.truthdig.com

http://www.opensecrets.org

http://www.propublica.org

http://www.publicintegrity.org

and many more!

The Newspapers Collection

The Newspapers Collection is now available at Minerazzi.com (http://www.minerazzi.com/news).

Find top world newspapers in one place. As we keep growing this collection, feel free to recrawl a search result to access hard-to-find paths to more resources. A great tool for students and investigative reporters.

Coming Soon: The IR Collection.

Efficiently Consuming Web Resources with Minerazzi

As you might know by now, Minerazzi is a platform for building ‘miners’. We define a miner as a topic-specific search engine that allows end-users to search, index, mine, and recrawl online resources. The goal is to turn web searchers into data miners as a natural evolution of the traditional concept of searching.

Minerazzi proposes a different search paradigm. Each miner built with the platform puts users at the center of the search experience, as participants rather than mere spectators.

For instance, consider one of the main features of Minerazzi: Recrawling.

Recrawling is a guided search activity, a discovery learning technique that allows users to quickly discover new resources associated with a given search result.

To illustrate, suppose you are using the Data Structures and Algorithms Collection (DSAC) miner (http://www.minerazzi.com/dsac). At the time of writing, searching for something as specific as [ bloom filter ] returns two records with the following URLs:

1. http://xlinux.nist.gov/dads/HTML/bloomFilter.html
2. http://en.wikipedia.org/wiki/Bloom_filter

Recrawling the first URL retrieves 17 secondary URLs: 10 External (58.82%) and 7 Internal (41.18%), for an External-to-Internal ratio of 1.43.

Recrawling the second URL retrieves 308 secondary URLs: 159 External (51.62%) and 149 Internal (48.38%), for an External-to-Internal ratio of 1.07.

Secondary URLs are sorted by type and then alphabetically. These are URLs related in some way to a resource's content (e.g., front-end) or design (e.g., back-end).
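The counts, percentages, and ratios reported for a recrawl are easy to reproduce. The Python sketch below is only an illustration, not Minerazzi's implementation. It assumes that a secondary URL counts as Internal when it shares the host of the recrawled page (ignoring a leading "www."), which is our guess at a reasonable rule; the names host_of and summarize_links are hypothetical.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def summarize_links(page_url, secondary_urls):
    """Group secondary URLs by type and compute their shares.

    Assumption: 'internal' means same host as the recrawled page.
    """
    page_host = host_of(page_url)
    unique = sorted(set(secondary_urls))  # de-duplicate, then sort alphabetically
    internal = [u for u in unique if host_of(u) == page_host]
    external = [u for u in unique if host_of(u) != page_host]
    total = len(unique)
    return {
        "external": external,
        "internal": internal,
        "external_pct": round(100 * len(external) / total, 2) if total else 0.0,
        "internal_pct": round(100 * len(internal) / total, 2) if total else 0.0,
        # External-to-Internal ratio; undefined when there are no internal links
        "ratio": round(len(external) / len(internal), 2) if internal else None,
    }
```

Feeding this function a page URL and the list of links extracted from that page yields the two sorted groups plus the same kind of summary figures shown for the bloom filter example.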

Clicking a secondary URL takes the user to that resource. To copy an entire list of URLs to the clipboard, click the top-right { S } link and press Ctrl + C. The list can then be pasted from the clipboard into a text file or spreadsheet.

We are now testing the consumption of URLs from third-party search engine result pages. The goal is to allow users to quickly exhaust results from those third parties.

There is no way back to merely searching.

Expanding the Data Structures and Algorithms Collection

To help with the dissemination of research work from others, we have expanded the Data Structures and Algorithms Collection (http://www.minerazzi.com/dsac).

In addition to NIST’s DADS, this miner now includes URLs pointing to lecture notes and other educational resources.

Feel free to submit relevant resources if they are not already indexed. Teaching material, tutorials, articles, and other resources describing data structures and algorithms are accepted.

Mining SEO and Internet Marketing Competitor Sites

Are you an SEO or part of a search marketing company? Want to compare SEO competitors or see how the home pages of search marketing companies were designed? Which CSS style rules and colors were used for their look and feel, or which JavaScript “gems” were used? Perhaps you need to check for possible misconfigurations, robots.txt exploits, keyword distributions, and so on? Or maybe you just want to quickly check whether they have added new content?

Use the Internet Marketing Search Engine (http://www.minerazzi.com/seominer). Its index, now on the order of thousands of records, is a curated collection of SEO and search marketing resources. Feel free to submit a URL if it is not already indexed.

The Data Structures and Algorithms Collection

The Data Structures and Algorithms Collection is a new miner available at http://www.minerazzi.com/dsac. This is an example of using Minerazzi to turn a pre-existing link directory or already curated collection of links into a data mining repository. We have built this miner from NIST’s Dictionary of Algorithms and Data Structures (DADS) repository (http://www.nist.gov).

Although it is currently limited to NIST’s DADS, in the coming days we will be adding URLs to this miner from other, larger repositories on data structures and algorithms available across the Web. In this way students, teachers, and researchers can have all these valuable programming resources in one place. It is a great tool for graduate students and members of the general public interested in computer programming.

The Job Searches and Career Resources Collection

The Job Searches and Career Resources Collection is a new miner available at http://www.minerazzi.com

Use this miner to search job databases and the best career-oriented sites. Find job lists and career advice resources. Check prospective employers and recruiting companies. Submit job offers.

As usual, you can use it to mine the results from a search or compare how the resources listed were designed.

Staying Away from Spamked Domains

I call ‘spamked domains’ those parked domains that spam your eyes and search experience across the web.

How pervasive is this problem, and how do you fight the war against web spam?

In a recent check of URLs using MUST (1), we found that the domain http://www.metalib.com is a spamked domain that resolves to the IP 69.172.201.208.

Checking this IP with MHM (2), we found that 50,673 domains share the same IP. A random sample of these domains reveals that they are either spamked domains or return ‘not found’ responses.

So how do you avoid them all? Good question.

A partial solution consists of building a collection of spamked IPs and URLs that can then be queried with a candidate IP/URL. Using a dedicated crawler to recrawl these records should help keep the collection up to date as IPs/URLs change, something you would expect from the usual suspects.
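The idea of a queryable collection can be sketched in a few lines of Python. This is a minimal, in-memory illustration under our own assumptions, not how MUST or MHM work: the class name SpamkedCollection and its methods are hypothetical, and the resolver defaults to the standard library's socket.gethostbyname but can be swapped out (for example, for offline testing).

```python
import socket
from urllib.parse import urlparse


class SpamkedCollection:
    """Hypothetical in-memory collection of spamked hosts and IPs."""

    def __init__(self, resolver=socket.gethostbyname):
        self.hosts = set()
        self.ips = set()
        self._resolve = resolver  # maps a host name to an IP string

    @staticmethod
    def _host(url_or_host):
        # Accept full URLs as well as bare hosts or IPs.
        text = url_or_host if "://" in url_or_host else "//" + url_or_host
        return (urlparse(text).hostname or "").lower()

    def add(self, url, ip=None):
        """Record a known spamked URL; resolve its IP unless one is given."""
        host = self._host(url)
        self.hosts.add(host)
        self.ips.add(ip or self._resolve(host))

    def refresh(self):
        """Re-resolve known hosts and record any newly seen IPs."""
        for host in list(self.hosts):
            try:
                self.ips.add(self._resolve(host))
            except OSError:
                pass  # host no longer resolves; keep what we already know

    def is_spamked(self, candidate):
        """Check a candidate IP/URL against the collection."""
        host = self._host(candidate)
        if not host:
            return False
        if host in self.hosts or host in self.ips:
            return True
        try:
            return self._resolve(host) in self.ips
        except OSError:
            return False  # unresolvable candidates are not flagged
```

For example, after calling add("http://www.metalib.com", ip="69.172.201.208"), a query such as is_spamked("69.172.201.208") returns True, as does any candidate URL whose host resolves to that IP. Flagging by shared IP is a deliberately blunt rule, with the collateral damage discussed next.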

This might not be a bullet-proof solution, as it may catch some innocent victims who keep ‘bad company’ (host provider) and happen to be in the ‘wrong place’ (server) at the ‘wrong time’ (crawl timestamp), but hey, innocent casualties are part of any war.

Update: Our MUST and MHM tools found that the URL http://ww1.theslammer.com/ resolves to IP 72.52.4.90, which is shared by 57,451 domains. Also, the URL http://www.bestedeal.com resolves to IP 208.73.211.250, which is shared by 10,742 domains.

Thus, with just three initial IPs, we were able to identify more than 100,000 cases (50,673 + 57,451 + 10,742 = 118,866 domains). Building a collection of spamked domains is not that hard to do after all.

References

1. MUST (http://www.minerazzi.com/tools/must/must.php).

2. MHM (http://www.minerazzi.com/tools/mhm/mhm.php).