Web Scraping

The current issue of IR Watch – The Newsletter is out. The featured article’s abstract follows.

“Web Mining is a subfield of Data Mining where patterns are derived from the Web. If scraping tools are used for Web Mining, this is referred to as Web Scraping (WS).

A scraper is a program designed to extract information from online documents. Scrapers work by matching document source code against libraries of regular expressions.

WS is widely used, in part due to the rising popularity of scripting technologies like Asynchronous JavaScript and XML (AJAX), which allow users to retrieve source code and manipulate the Document Object Model (DOM). WS is a form of Information Extraction, where the tools are not necessarily scrapers and the repositories are not necessarily the Web.

For the last 10 years we have been developing scrapers to simplify the collection and analysis of intelligence from the Web or local machines. Over the last 4 years these have gradually been converted to AJAX. In this issue of the newsletter, we want to share with readers our experience using several scrapers.”
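
The abstract describes scrapers as programs that match document source code against libraries of regular expressions. A minimal sketch of that idea in Python follows. It is not the newsletter's own code: the pattern names, the `scrape` function, and the sample document are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical "regular expression library": each name maps to a compiled
# pattern. These patterns are illustrative assumptions, not from the article.
PATTERNS = {
    "title": re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "links": re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE),
    "emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def scrape(source: str) -> dict:
    """Match the document source against every pattern and collect the hits."""
    return {
        name: [match.strip() for match in pattern.findall(source)]
        for name, pattern in PATTERNS.items()
    }


# Made-up document source standing in for a page retrieved from the Web
# or read from a local machine.
SAMPLE = """<html><head><title> Example Page </title></head>
<body><a href="https://example.com/a">A</a>
<a class="x" href='/b'>B</a> Contact: info@example.com</body></html>"""

if __name__ == "__main__":
    for name, hits in scrape(SAMPLE).items():
        print(name, hits)
```

Regular expressions of this kind are easy to write but brittle against irregular markup, which is one reason a scraper might work on the parsed DOM instead of the raw source.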