I call spamked domains those parked domains that spam your eyes and search experience across the web.
How pervasive is this problem, and how do you fight the war against web spam?
In a recent check of URLs using MUST (1), we found that the domain http://www.metalib.com is a spamked domain that resolves to the IP 126.96.36.199.
Checking this IP with MHM (2), we found that 50,673 domains share it. A random sample of these shows that they are either spamked domains or return "not found" errors.
So how do you avoid them all? Good question.
A partial solution consists of building a collection of spamked IPs and URLs that can then be queried with a candidate IP/URL. Recrawling these records with a dedicated crawler should keep the collection current whenever IPs/URLs change, something you would expect from the usual suspects.
This is not a bullet-proof solution, as it may catch some innocent victims who keep ‘bad company’ (host provider) and happen to be in the ‘wrong place’ (server) at the ‘wrong time’ (crawl timestamp), but hey, innocent casualties are part of any war.
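The collection-and-lookup idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how MUST or MHM work: the class name `SpamkedCollection`, its methods, and the helper `host_of` are hypothetical, and DNS resolution is done with the standard library's `socket.gethostbyname`, which can be swapped for any other resolver.

```python
import socket
import time
from urllib.parse import urlparse


def host_of(candidate):
    """Return the lowercase host of a URL; bare hosts and IPs pass through."""
    # urlparse only fills in the host when a scheme or a leading '//' is present.
    parsed = urlparse(candidate if "://" in candidate else "//" + candidate)
    return (parsed.hostname or candidate).lower()


class SpamkedCollection:
    """Hypothetical in-memory collection of spamked hosts and IPs."""

    def __init__(self, resolver=socket.gethostbyname):
        self.resolver = resolver  # function mapping a hostname to an IPv4 string
        self.ips = {}             # ip -> timestamp of the last crawl
        self.hosts = {}           # host -> (ip, timestamp of the last crawl)

    def add(self, candidate):
        """Record a known spamked URL/host/IP together with the IP it resolves to."""
        host = host_of(candidate)
        ip = self.resolver(host)
        now = time.time()
        self.ips[ip] = now
        self.hosts[host] = (ip, now)
        return ip

    def is_spamked(self, candidate):
        """Flag a candidate that is a known host, a known IP, or resolves to one."""
        host = host_of(candidate)
        if host in self.hosts or host in self.ips:
            return True
        try:
            return self.resolver(host) in self.ips
        except OSError:  # socket.gaierror (failed lookup) is a subclass of OSError
            return False

    def recrawl(self):
        """Re-resolve every stored host so the collection follows IP changes."""
        now = time.time()
        fresh_ips = {}
        for host in list(self.hosts):
            try:
                ip = self.resolver(host)
            except OSError:
                del self.hosts[host]  # host no longer resolves; drop it
                continue
            self.hosts[host] = (ip, now)
            fresh_ips[ip] = now
        self.ips = fresh_ips
```

A typical session would call `add()` on a confirmed spamked URL, then `is_spamked()` on each candidate. Because the lookup matches on the shared IP, it flags every domain on that server, which is exactly where the innocent casualties mentioned above come from.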
Update: Our MUST and MHM tools found that the URL http://ww1.theslammer.com/ resolves to IP 188.8.131.52, which is shared by 57,451 domains. Also, the URL http://www.bestedeal.com resolves to IP 184.108.40.206, which is shared by 10,742 domains.
Thus, starting from just a few IPs, we were able to identify more than 100,000 cases (50,673 + 57,451 + 10,742 = 118,866). Building a collection of spamked domains is not that hard to do after all.
1. MUST (http://www.minerazzi.com/tools/must/must.php).
2. MHM (http://www.minerazzi.com/tools/mhm/mhm.php).