The current issue of IRW features Web Scraping as a vehicle for conducting Web Mining.
As mentioned in the newsletter, there are many things that can be done with scrapers. For instance, below is a comparison of the number of script tags (<script …>…</script>) and link tags (<link … />) declared in several index pages, extracted with two scrapers mentioned in the IRW article: the Script and Link Tag Scrapers. As expected, pages with a lot of content tend to have more scripts.
|Search Engines||Script Tags||Link Tags|
|Socially-oriented Sites||Script Tags||Link Tags|
* At the time of the analysis, Yahoo.com redirected to the m.yahoo.com alias, but the same results were obtained.
** Wikipedia.org and Wikipedia.com return the same results.
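The kind of count reported above can be approximated with a few lines of Python. This is only a minimal sketch using the standard library, not the actual Script and Link Tag Scrapers from the IRW article; the names `TagCounter`, `count_tags`, and `fetch`, as well as the User-Agent string, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TagCounter(HTMLParser):
    """Counts opening <script> and <link> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "link": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <LINK> and <link> both match.
        # Self-closing tags such as <link ... /> also arrive here, because
        # the default handle_startendtag() calls handle_starttag().
        if tag in self.counts:
            self.counts[tag] += 1


def count_tags(html):
    """Return a dict with the number of script and link tags in `html`."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def fetch(url):
    """Download a page and decode it (hypothetical helper, assumed UTF-8 fallback)."""
    request = Request(url, headers={"User-Agent": "tag-counter-sketch/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `count_tags(fetch(url))` on an index page would give one row of such a comparison. Note that this counts only the tags present in the HTML as served; tags injected later by JavaScript are not seen, so the numbers may differ from those of a scraper that works differently.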
On the other hand, Web Scraping can unveil potential Web vulnerabilities in an architecture, so there is a positive side to the story.
In the right hands, scrapers can do great things. In the wrong ones, they can be a nightmare.
Unfortunately, hackers know well that scrapers can be embedded into malware to get their hands on source code. Ask victims of such scrapers, like Google and other companies (http://www.wired.com/threatlevel/2010/01/google-hack-attack/).
Besides legal issues and an unfriendly landscape (censorship), it appears Google got tired of Chinese hackers picking on it, so it is pulling out of China, or threatening to do so.
Beaten at their own game: brain power.