

Wikileaks.org is one of those huge sites where researchers and investigative reporters can feel as if they were in heaven.

That is, provided that they have a way to move across Wikileaks’ complex link structure. Simply put, they need a tool that lets them understand the relationships between links and quickly move in and out of specific link paths of interest. This needs to be done at different levels of the link graph, with current resources pulled out of that structure in almost real time.

That is hard to do by just searching or by crafting site, command, or custom searches, or even by using Wikileaks’ own search engine.

Fortunately, you can do the above with Minerazzi’s recrawling features, at least to some degree.

Although Minerazzi technology is evolving and not perfect, moving from searching indexes to mining user-driven recrawls is a step in the right direction.

However, there might be a broad spectrum of starting experimental conditions, each requiring a different crawling strategy.

The purpose of this post is not to discuss solutions for all possible experimental conditions. It is assumed that users are familiar with Minerazzi’s complementary Recrawl It (RI) and Search Inside (SI) tools. To keep things simple, the recrawls are done with SI.

Example 1: Initial URL is not given.

Search for [wikileaks] in the Investigative Journalism miner (http://www.minerazzi.com/journalism). Find a result that might interest you.

A good starting point is the result whose URL is https://www.wikileaks.org/wiki as it contains links to the latest leaks and recent analyses. Click the Search Inside tool icon below this result.

That should retrieve all links from this result, with the tool icon now at the right of each of the new results. You should see three output sections. The first one logs the current URL being crawled. The other two are the External Links and Internal Links sections.
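For readers who want a feel for what splitting a page into Internal and External Links involves, here is a minimal Python sketch. It is not Minerazzi's implementation, which is not described in this post. The names LinkCollector and split_links are hypothetical, and the sketch assumes that "internal" means "same host as the crawled page", so a link to a different subdomain counts as external.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(page_url, html):
    """Return (internal, external) lists of absolute URLs found in html.

    Assumption: a link is internal when its host equals the host of page_url.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        target = internal if parts.netloc.lower() == base_host else external
        if absolute not in target:  # keep first occurrence only
            target.append(absolute)
    return internal, external
```

Calling split_links with the URL of a page and its HTML source returns two de-duplicated lists, in the order the links appear on the page.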

You can now recursively recrawl results by clicking their SI icon and, again, check how the above sections are updated. That is, you will be walking a portion of Wikileaks’ link graph. At any given step you can walk backward or forward along the link graph by clicking the SI icons in the above sections.
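The idea of walking a link graph step by step, with the option to go backward or forward, can be illustrated with a small Python sketch. This is only a browser-style history model under stated assumptions, not how SI keeps its logged section. LinkWalker is a hypothetical name, fetch_links is any function you supply that returns the links of a URL, and the toy graph uses made-up example.org URLs.

```python
class LinkWalker:
    """Tracks a walk over a link graph with back/forward history (illustrative)."""

    def __init__(self, fetch_links):
        self.fetch_links = fetch_links  # callable: url -> list of URLs
        self.history = []               # URLs visited along the current path
        self.position = -1              # index of the current URL in history

    def current(self):
        """Return the URL at the current step, or None before the first recrawl."""
        return self.history[self.position] if self.history else None

    def recrawl(self, url):
        """Visit url, dropping any forward history, and return its links."""
        self.history = self.history[: self.position + 1]
        self.history.append(url)
        self.position += 1
        return self.fetch_links(url)

    def back(self):
        """Step one URL backward, if possible, and return the current URL."""
        if self.position > 0:
            self.position -= 1
        return self.current()

    def forward(self):
        """Step one URL forward, if possible, and return the current URL."""
        if self.position < len(self.history) - 1:
            self.position += 1
        return self.current()


# Toy link graph with made-up URLs, standing in for real crawl results.
TOY_GRAPH = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/a/1"],
    "https://example.org/b": [],
}

walker = LinkWalker(lambda url: TOY_GRAPH.get(url, []))
links = walker.recrawl("https://example.org/")
walker.recrawl(links[0])  # walk into the first link found
```

After these two recrawls, walker.back() returns to the starting URL and walker.forward() moves to the first link again. Recrawling a different link after going back discards the old forward path, which is one possible design choice among several.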

This mechanism works as expected with the latest versions of Firefox, Opera, Safari, and Chrome browsers. However, sometimes the state of the logged section is not preserved in IE. We are working on fixing this anomaly.

Example 2: Initial URL is given.

If the initial URL is given or obtained through a search or previous crawl, recrawl its links as in Example 1. A good starting point is https://www.wikileaks.org/the-spyfiles.html

You can always submit a particular Wikileaks URL for indexing in the above miner. Once indexed, you can use it as a starting point.

Example 3: What if I still want to combine searching with recrawling?

You can always do that. Wikileaks’ link graph has many URLs with the pattern [keyword].wikileaks.org, which can be easily mined.

For instance, search for [file wikileaks] and recrawl with SI the result whose URL is https://file.wikileaks.org. Next, from the results page, recrawl the result whose URL is https://file.wikileaks.org/file. You will be presented with over a thousand interesting results. Have a field day!
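If you collect URLs while recrawling, the [keyword].wikileaks.org pattern is easy to detect programmatically. The following Python sketch shows one way to do it. The function name wikileaks_keyword is hypothetical, and treating www as "not a keyword" and ignoring multi-level subdomains are assumptions made here for illustration.

```python
from urllib.parse import urlparse


def wikileaks_keyword(url):
    """Return the keyword of a [keyword].wikileaks.org URL, or None.

    Assumptions: "www" is not treated as a keyword, and hosts with more
    than one label before wikileaks.org are ignored.
    """
    host = (urlparse(url).hostname or "").lower()
    suffix = ".wikileaks.org"
    if not host.endswith(suffix):
        return None
    keyword = host[: -len(suffix)]
    if not keyword or keyword == "www" or "." in keyword:
        return None
    return keyword
```

Applied to https://file.wikileaks.org/file, the function returns the string "file". Applied to https://www.wikileaks.org/wiki, it returns None.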

What is next?

Because Wikileaks is so huge, perhaps it is time for us to start building a miner exclusively for mining the Wikileaks.org site. Such a miner would help us address the starting point and link walk issues.