The current issue of IRW features Web Scraping as a vehicle for conducting Web Mining.

As mentioned in the newsletter, there are many things that can be done with scrapers. For instance, below is a comparison of the number of script tags (<script …>…</script>) and link tags (<link … />) declared in several index pages, extracted with two scrapers mentioned in the IRW article: the Script and Link Tag Scrapers. As expected, pages with a lot of content tend to have more scripts.

Search Engines           Script Tags    Link Tags
Yahoo.com *              15             2
Bing.com                 12             1
Ask.com                  10             0
Google.com               4              0
Gigablast.com            1              0

 

Socially-oriented Sites  Script Tags    Link Tags
Searchenginewatch.com    38             5
Twitter.com              9              3
Seomoz.org               7              13
Facebook.com             5              4
Wikipedia.com **         1              6

 

* At the time of the analysis, Yahoo.com redirected to the m.yahoo.com alias, but the same results were obtained.

** Wikipedia.org and Wikipedia.com return the same results.
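
For readers who want to try a similar count themselves, here is a minimal sketch in Python. It is not the Script and Link Tag Scrapers from the IRW article, whose internals are not shown here. It only illustrates the general idea with the standard library. The names TagCounter, count_tags and fetch_html are made up for this example, and the User-Agent string is an arbitrary placeholder. Counts from live pages will likely differ from the tables above, since pages change over time and this sketch only sees the static HTML returned by the server.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TagCounter(HTMLParser):
    """Counts opening <script> and <link> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "link": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        # Self-closing tags such as <link ... /> also end up here.
        if tag in self.counts:
            self.counts[tag] += 1


def count_tags(html):
    """Return a dict with the number of script and link tags in html."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def fetch_html(url):
    """Download a page as text (assumption: a plain GET is enough)."""
    request = Request(url, headers={"User-Agent": "tag-counter-sketch"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = (
        '<html><head><link rel="stylesheet" href="a.css" />'
        '<script src="a.js"></script><script>var x = 1;</script>'
        "</head><body></body></html>"
    )
    print(count_tags(sample))
```

To count tags on a live index page, pass the result of fetch_html to count_tags, for example count_tags(fetch_html("http://www.gigablast.com/")). Check a site's terms of use before scraping it.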

On the other hand, Web Scraping can unveil potential Web vulnerabilities in an architecture, so there is a positive side to the story.

In the right hands, scrapers can do great things. In the wrong ones, they can be a nightmare.

Unfortunately, hackers know well that scrapers can be embedded into malware to get their hands on source code. Ask victims of such scrapers, like Google and other companies (http://www.wired.com/threatlevel/2010/01/google-hack-attack/).

Besides legal issues and an unfriendly landscape (censorship), it appears they got tired of Chinese hackers picking on them, so they are pulling out of China, or threatening to do so.

http://online.wsj.com/article/SB10001424052748704362004575000440265987982.html?mod=rss_Today’s_Most_Popular

Beaten at their own game: brain power.
