The current issue of IRW features Web Scraping as a vehicle for conducting Web Mining.
As mentioned in the newsletter, there are many things that can be done with scrapers. For instance, the table below compares the number of script tags (<script …>…</script>) and link tags (<link … />) declared in several index pages, extracted with two scrapers mentioned in the IRW article: the Script and Link Tag Scrapers. As expected, pages with a lot of content tend to have more scripts.
| Search Engines | Script Tags | Link Tags |
| Yahoo.com * | 15 | 2 |
| Bing.com | 12 | 1 |
| Ask.com | 10 | 0 |
| Google.com | 4 | 0 |
| Gigablast.com | 1 | 0 |
| Socially-oriented Sites | Script Tags | Link Tags |
| Searchenginewatch.com | 38 | 5 |
| Twitter.com | 9 | 3 |
| Seomoz.org | 7 | 13 |
| Facebook.com | 5 | 4 |
| Wikipedia.com ** | 1 | 6 |
* At the time of the analysis, Yahoo.com redirected to the m.yahoo.com alias, but the same results were obtained.
** Wikipedia.org and Wikipedia.com return the same results.
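For readers who want to try a similar count themselves, here is a minimal sketch in Python. It is not the Script and Link Tag Scrapers from the IRW article; the names TagCounter and count_tags are made up for illustration, and it only uses the standard library's html.parser. Fetching the page is left out, so the sketch assumes you already have the HTML source as a string.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts opening <script> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "link": 0}

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase, so <SCRIPT> counts too.
        # Self-closing tags such as <link ... /> also reach this method.
        if tag in self.counts:
            self.counts[tag] += 1


def count_tags(html):
    """Return a dict with the number of script and link tags in html."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample page, used only to exercise the function.
sample = """<html><head>
<link rel="stylesheet" href="a.css" />
<script src="a.js"></script>
<script>var x = 1;</script>
</head><body><p>Hello</p></body></html>"""

print(count_tags(sample))
```

Counts obtained this way reflect only the tags present in the raw source, not tags injected later by scripts, so they may differ from what a browser reports.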
On the other hand, Web Scraping can unveil potential Web Vulnerabilities in an architecture, so there is a positive side to the story.
In the right hands, scrapers can do great things. In the wrong ones, they can be a nightmare.
Unfortunately, hackers know well that scrapers can be embedded in malware to get their hands on source code. Ask victims of such scrapers like Google and other companies (http://www.wired.com/threatlevel/2010/01/google-hack-attack/).
Besides legal issues and an unfriendly landscape (censorship), it appears Google got tired of Chinese hackers picking on them, so they are pulling out of China, or threatening to do so.
Beaten at their own game: brain power.
The Web Mining Studio was nicely described in the Newsletter. Is this a public service? I could not find it. Thanks.
Glad you read the newsletter. As mentioned in it, it is not public.