This is a new miner, available now at http://www.minerazzi.com/
Chemistry tutorials, tools, demos, lecture notes, and more for students, teachers, and researchers. Build chemistry-specific collections. Search by topic or site.
Wikileaks.org is one of those huge sites where researchers and investigative reporters can feel like they are in heaven.
That is, provided they have a way to move across Wikileaks' complex link structure. Simply put, they need a tool that lets them understand the relationships between links and quickly move in and out of specific link paths of interest. This needs to be done at different levels of the link graph, while current resources are pulled out of said structure in almost real time.
That is hard to do by just searching, or by crafting site, command, or custom searches, and not even by using Wikileaks' own search engine.
Fortunately, you can do the above with Minerazzi's recrawling features, at least to some degree.
Although Minerazzi's technology is evolving and not perfect, moving from searching indexes to mining user-driven recrawls is a step in the right direction.
However, there might be a broad spectrum of starting experimental conditions, each one requiring a different crawling strategy.
The purpose of this post is not to discuss solutions for all possible experimental conditions. It is assumed that users are familiar with Minerazzi’s Recrawl It (RI) and Search Inside (SI) complementary tools. To simplify, the recrawls are done with SI.
Example 1: Initial URL is not given.
Search for [wikileaks] in the Investigative Journalism miner (http://www.minerazzi.com/journalism). Find a result that might interest you.
A good starting point is the result whose URL is https://www.wikileaks.org/wiki as it contains links to latest leaks and recent analyses. Click the Search Inside tool icon below this result.
That should retrieve all links from this result, with the tool icon now at the right of each of the new results. You should see three output sections. The first one logs the current URL being crawled. The other two are the External Links and Internal Links sections.
You can now recursively recrawl results by clicking their SI icon and, again, check how the above sections are updated. That is, you will be walking a portion of Wikileaks' link graph. At any given step you can walk backward or forward through the link graph by clicking the SI icons in the above sections.
This mechanism works as expected with the latest versions of Firefox, Opera, Safari, and Chrome browsers. However, sometimes the state of the logged section is not preserved in IE. We are working on fixing this anomaly.
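For readers who want to script a similar walk outside the browser, the internal/external split that SI reports can be approximated in a few lines. This is only a sketch under my own assumption about how the tool classifies links (same host = internal); the function name and inputs are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def split_links(base_url, hrefs):
    """Resolve hrefs against base_url and split them into internal
    and external links, loosely mimicking the External/Internal
    sections that a Search Inside-style tool reports."""
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # make relative links absolute
        host = urlparse(absolute).netloc
        (internal if host == base_host else external).append(absolute)
    return internal, external
```

For example, splitting the links found on https://www.wikileaks.org/wiki would send `/wiki/Leaks` to the internal list and any off-site href to the external list.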
Example 2: Initial URL is given.
If the initial URL is given or obtained through a search or previous crawl, recrawl its links as in Example 1. A good starting point is https://www.wikileaks.org/the-spyfiles.html.
You can always submit for indexing in the above miner a particular Wikileaks URL. Once indexed, you can use it as a starting point.
Example 3: What if I still want to combine searching with recrawling?
You can always do that. Wikileaks link graph has many URLs with the pattern [keyword].wikileaks.org which can be easily mined.
For instance, search for [file wikileaks] and recrawl with SI the result whose URL is https://file.wikileaks.org. Next, from the results page, recrawl the result whose URL is https://file.wikileaks.org/file. You will be presented with over a thousand interesting results. Have a field day!
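If you collect URLs from such searches, the [keyword].wikileaks.org pattern can be pulled out with a short regular expression. A minimal sketch; the `urls` list here is hypothetical sample data:

```python
import re

# Hypothetical list of URLs harvested from a search or recrawl.
urls = [
    "https://file.wikileaks.org/file",
    "https://search.wikileaks.org/",
    "https://www.example.com/wikileaks",
]

# Match hosts of the form [keyword].wikileaks.org and capture the keyword.
pattern = re.compile(r"^https?://([a-z0-9-]+)\.wikileaks\.org(/|$)", re.I)

subdomains = [m.group(1) for u in urls if (m := pattern.match(u))]
print(subdomains)  # prints ['file', 'search']
```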
What is next?
Because Wikileaks is so huge, perhaps it is time for us to start building a miner exclusively for mining the Wikileaks.org site. Such a miner would help us address starting-point and link-walk issues.
Over the years, I’ve been asked about the most effective way of writing peer-reviewed articles for scientific journals.
My response is always the same: Think like a referee/editor. Here is a list of items that they want to see accomplished:
Referees/editors like to see that the content and format of the title, abstract, document body, tables, images, graphics, appendices, and references follow their journal guidelines.
In general, referees/editors like to see in the first page of the printed version of an article:
1. Statement of the problem – what is the problem to be solved.
2. Purpose of the article – how the present research solves the problem.
3. Organization – how the article is organized and what is covered in each section.
This is a general practice across scientific journals. So, whenever possible, I try to accomplish 1–3 in the first three paragraphs of the first page of the printed article. To do this, you need to avoid lengthy introductions and wordiness. Be concise and ‘get to the point’.
Referees/editors also like to see the article as a whole semantic unit. So they like to see:
1. Transitional statements; i.e., sections ending with an introduction to the next section.
2. One paragraph, one idea; i.e., each paragraph discussing one main idea.
3. Short paragraphs; i.e., each paragraph of about five sentences or fewer, with sentences of appropriate length. This provides a natural stop to the reading. In general, short paragraphs and sentences are easier to read than long ones. Use compound sentences with caution.
4. Facts supported by pertinent references.
5. Opinions written as opinions, not as facts.
Of course, there are other tips to think about, but in my opinion, the above can make a difference… well, in my opinion :)
The current issue of IRW is out and should arrive in subscribers’ inboxes today.
In this issue of IRW, we introduce the Minerazzi project and two useful tools available at minerazzi.com:
Web Crawler and Link Checker Tool (http://www.minerazzi.com/labs/crawlinker.php).
Multiple Whois Domain Name Tool (http://www.minerazzi.com/labs/whois.php).
This is a research project conducted in association with several scholars and the private sector.
The current issue of IRW should reach subscribers’ inboxes during the day.
This is Part Two of the series on the statistical analysis of n-grams, a text mining technique widely used in information retrieval and in data mining in general. In this issue we cover the implementation of association measures derived from contingency tables.
The QA section explains how to conduct a Chi-Square Test for tables with many items; i.e., beyond the usual 2 x 2 contingency tables.
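For an r x c table, the test generalizes naturally: expected counts come from row and column totals, and the degrees of freedom become (r - 1)(c - 1). A minimal sketch of the computation (not necessarily how the QA section implements it):

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for an r x c
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = chi_square([[10, 20, 30], [20, 20, 20]])
print(chi2, df)  # prints 5.333... and 2
```

The resulting statistic is then compared against the chi-square distribution with `df` degrees of freedom at the chosen significance level.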
Back to blogging. I’ve been very busy putting together a paper on a weighting model and answering feedback received from colleagues on it.
So this might explain why the January IRW newsletter is delayed. It should arrive in subscribers’ inboxes during the day. The February issue will be out in about one week. These are back-to-back issues on the Statistical Analysis of N-Grams.
Part 1: N-Grams & Contingency Tables
Part 2: N-Grams & Association Measures
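As a taste of what an association measure over bigram counts looks like, here is pointwise mutual information, one of the measures commonly derived from a 2 x 2 contingency table of bigram counts (the newsletter may cover a different selection of measures):

```python
import math
from collections import Counter

def bigram_pmi(tokens, w1, w2):
    """Pointwise mutual information of the bigram (w1, w2):
    log2 of the ratio between the observed bigram count and the
    count expected if w1 and w2 occurred independently."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair = Counter(bigrams)[(w1, w2)]            # joint count
    left = sum(1 for a, _ in bigrams if a == w1)  # w1 in first slot
    right = sum(1 for _, b in bigrams if b == w2) # w2 in second slot
    if not (pair and left and right):
        return float("-inf")  # bigram never observed
    return math.log2(pair * n / (left * right))

tokens = "new york new york city".split()
print(bigram_pmi(tokens, "new", "york"))  # prints 1.0
```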
On other matters, a few years ago a PhD student published an excellent application of the Vector Space Model to protein analysis. You can revisit the post at https://irthoughts.wordpress.com/2008/11/12/vector-space-model-and-protein-retrieval/.
If you know of other applications of the VSM in other disciplines, let me know. I’m interested in multidisciplinary work.
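The heart of any VSM application, whether over documents or over protein sequences treated as "documents" of k-mers, is cosine similarity between term-weight vectors. A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors,
    the core similarity measure of the Vector Space Model."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine([1, 0, 1], [1, 1, 0]))  # prints 0.5
```

In the protein-retrieval setting, each vector component would hold the weight of a particular k-mer, just as it would hold a term weight in text retrieval.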
Nothing better than starting 2011 with more research work.
Check this blog tomorrow, as there will be good news for those interested in conducting research at the interface of information retrieval, statistical analysis, and applied mathematics. You’re welcome to grab a copy of this four-month investigation for use in your own research, as a teaching tool, or to chase away SEO snake oil.
The current issue of the Information Retrieval Newsletter is out! Due to the Holiday Season, it is a short issue.
The article section is dedicated to Fisher’s Z Transformation, its origins, advantages, and limitations.
We have included a lesser-known visualization of it. Using a geometric interpretation helps one understand how the transformation works.
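For reference, the transformation itself is just the inverse hyperbolic tangent of the correlation coefficient, which is what makes a geometric interpretation possible. A quick sketch (not the visualization from the newsletter):

```python
import math

def fisher_z(r):
    """Fisher's Z transformation of a correlation coefficient r:
    z = arctanh(r) = 0.5 * ln((1 + r) / (1 - r)).
    Maps r in (-1, 1) onto the whole real line, approximately
    normalizing the sampling distribution of r."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform: r = tanh(z)."""
    return math.tanh(z)
```

For example, `fisher_z(0.5)` is 0.5 * ln(3) ≈ 0.549, and `inverse_fisher_z` recovers the original r.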
Enjoy it and Happy New Year!
I’m putting together a four-part series on meta-analysis in information retrieval, with feedback from several researchers. It will be an interesting series to follow. Several myths will be dispelled once and for all. The current issue of IRW provides a sneak preview. The newsletter and the first article of the series will probably be out this week or next.
The current issue of IRW is out and should reach subscribers during the day.
In this issue we delve into our Tables of Correlation Features.
The QA section addresses the question of how university administrators can allocate faculty to programs using an interesting formula.
The Who’s Who in CS is dedicated to one of my heroes: Wesley A. Clark.
Enjoy it and happy Holiday Season.