Staying Away from Spamked Domains

I call spamked domains those parked domains that spam your eyes and search experience across the web.

How pervasive is this problem, and how do you fight the war against web spam?

In a recent check of URLs using MUST (1), we found that the domain http://www.metalib.com is a spamked domain that resolves to the IP 69.172.201.208.

Checking this IP with MHM (2), we found that 50,673 domains share the same IP. Random sampling reveals that these are either spamked domains or return "not found" pages.
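The shared-IP check can be approximated locally. The sketch below does not reproduce how MHM works (its internals are not described here); it only groups a list of hostnames by the IP each one resolves to, so that crowded IPs stand out. The names `group_by_ip`, `resolve`, and `FAKE_DNS`, and every `.example` hostname, are illustrative assumptions.

```python
import socket
from collections import defaultdict


def group_by_ip(hostnames, resolve=socket.gethostbyname):
    """Map each IP to the sorted list of hostnames that resolve to it.

    `resolve` is injectable so the function can be tried without network
    access; by default it uses the standard library's DNS lookup.
    """
    groups = defaultdict(set)
    for host in hostnames:
        try:
            ip = resolve(host)
        except OSError:
            # Unresolvable hosts are skipped (socket.gaierror is an OSError).
            continue
        groups[ip].add(host)
    return {ip: sorted(hosts) for ip, hosts in groups.items()}


# Offline demonstration with a made-up lookup table. Only the first entry
# comes from the post; the other hosts are hypothetical.
FAKE_DNS = {
    "www.metalib.com": "69.172.201.208",
    "parked-one.example": "69.172.201.208",
    "parked-two.example": "69.172.201.208",
    "ordinary.example": "192.0.2.10",
}


def fake_resolve(host):
    if host not in FAKE_DNS:
        raise OSError("unknown host")
    return FAKE_DNS[host]


shared = group_by_ip(list(FAKE_DNS) + ["missing.example"], resolve=fake_resolve)
# Keep only IPs that host at least three of the sampled names.
crowded = {ip: hosts for ip, hosts in shared.items() if len(hosts) >= 3}
print(crowded)
```

With real data you would drop the `resolve` argument and let the default DNS lookup run; the threshold of three is arbitrary and would be far higher for IPs like the one above.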

So how do you avoid them all? Good question.

A partial solution consists of building a collection of spamked IPs and URLs that can then be queried with a candidate IP/URL. Recrawling these records with a dedicated crawler should help keep the collection current whenever IPs/URLs change, something you would expect from the usual suspects.
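As a rough illustration of that idea, here is a minimal in-memory sketch. It is an assumption-laden toy, not the collection we actually run: the class name `SpamkedCollection`, its methods, and the `resolve` callback are all hypothetical, and a real collection would live in a database and be fed by the crawler.

```python
from urllib.parse import urlsplit


class SpamkedCollection:
    """A tiny in-memory collection of spamked IPs and hostnames."""

    def __init__(self):
        self.ips = set()
        self.hosts = {}  # hostname -> last known IP

    @staticmethod
    def _host(url_or_host):
        # Accept either a full URL or a bare hostname.
        text = url_or_host.strip().lower()
        if "://" not in text:
            text = "//" + text
        return (urlsplit(text).hostname or "").rstrip(".")

    def add(self, url_or_host, ip):
        self.ips.add(ip)
        self.hosts[self._host(url_or_host)] = ip

    def is_spamked(self, candidate):
        """True if the candidate is a known IP or a known hostname."""
        if candidate in self.ips:
            return True
        return self._host(candidate) in self.hosts

    def recrawl(self, resolve):
        """Refresh stored IPs using `resolve(hostname) -> ip`."""
        for host in list(self.hosts):
            try:
                self.hosts[host] = resolve(host)
            except OSError:
                continue  # keep the stale record if the lookup fails
        # Design choice: IPs no longer used by any stored host are dropped.
        self.ips = set(self.hosts.values())


collection = SpamkedCollection()
collection.add("http://www.metalib.com", "69.172.201.208")
collection.add("http://ww1.theslammer.com/", "72.52.4.90")

print(collection.is_spamked("72.52.4.90"))
print(collection.is_spamked("www.metalib.com"))
print(collection.is_spamked("http://example.org/"))
```

Passing `socket.gethostbyname` as `resolve` would turn `recrawl` into a live refresh. Whether old IPs should be forgotten after a recrawl, as done here, is debatable; keeping them would be the more aggressive policy.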

This might not be a bullet-proof solution, as it may result in some innocent victims keeping ‘bad company’ (host provider) and being in the ‘wrong place’ (server) at the ‘wrong time’ (crawl timestamp), but hey, innocent casualties are part of any war.

Update: Our MUST and MHM tools found that the URL http://ww1.theslammer.com/ resolves to the IP 72.52.4.90, which is shared by 57,451 domains. Thus, with only two IPs, we were able to identify more than 100,000 cases (50,673 + 57,451 = 108,124). Building a collection of spamked domains is not that hard to do after all.

References

1. MUST (http://www.minerazzi.com/tools/must/must.php).

2. MHM (http://www.minerazzi.com/tools/mhm/mhm.php).

Chikungunya Virus Resources

A new miner about the nasty Chikungunya Virus is now available at http://www.minerazzi.com/chikun/.
Conduct research on this hot topic. Search the scientific literature for Chikungunya, a viral disease transmitted by mosquitoes. Read newspaper articles and research on CHIKV from CDC, NIH, WHO, and other sources.
This is a much-needed miner built with Minerazzi. Great for students, researchers, and those who were hit by this nasty virus.
Researchers and students: Have a field day.

Finding Financial Data and News Stories with Minerazzi

Now you can find financial data and news stories with these tools:

(1) The Financial Search Engine – This is a public miner for searching and mining financial web sites. It is available at http://www.minerazzi.com/mystocks

(2) The News Center – This is a free service that eventually will be available to all miners built with the platform. So far the service returns stock market summaries and news relevant to a miner. A user can submit multiple ticker symbols or use the default symbols. News sources are randomly displayed when the page is refreshed.

Crimes Against Meta Data: Sec.gov and Investor.gov sites

While building a financial miner, we found sites committing a lot of crimes against meta data. The most recent are courtesy of the SEC.gov and Investor.gov sites. Perhaps the result of bad copy rewritten by software or humans?

These are great sites for finding financial and business information, but some of their pages contain poorly written meta tag data that make indexers go ga-ga gu-gu.

To illustrate, check the meta description tags of the pages at the following URLs:

http://www.sec.gov/News/Page/List/Page/1356125649507

http://www.investor.gov/introduction-markets/role-sec/how-submit-comments-sec#.U-E3frR0yyZ

Links and CSS style instructions declared as meta description data? Great!
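A check like this can be automated. The sketch below pulls the meta description out of a page and flags content that looks like markup, links, or CSS. The patterns in `SUSPICIOUS` are crude heuristics of my own, the names `MetaDescriptionParser` and `audit_meta_description` are hypothetical, and the two sample pages are invented for the demonstration; they are not copies of the SEC.gov or Investor.gov pages.

```python
import re
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collect the content of every <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "description":
            self.descriptions.append(attr.get("content") or "")


# Heuristic patterns (assumptions): tags, link attributes, CSS rule bodies.
SUSPICIOUS = [
    ("html markup", re.compile(r"</?[a-z][^>]*>", re.I)),
    ("link", re.compile(r"href\s*=", re.I)),
    ("css", re.compile(r"\{[^}]*:[^}]*\}")),
]


def audit_meta_description(html):
    """Return (description, [problem labels]) for each description found."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return [
        (text, [label for label, pattern in SUSPICIOUS if pattern.search(text)])
        for text in parser.descriptions
    ]


# Invented sample pages; the markup in BAD_PAGE is entity-escaped, which the
# parser decodes before the patterns are applied.
BAD_PAGE = (
    '<html><head><meta name="description" content="'
    "&lt;a href=&quot;/about&quot;&gt;About&lt;/a&gt; .nav{margin:0}"
    '"></head></html>'
)
GOOD_PAGE = (
    '<html><head><meta name="description" '
    'content="Guidance on how to submit comments."></head></html>'
)

for page in (BAD_PAGE, GOOD_PAGE):
    for text, problems in audit_meta_description(page):
        print(problems or "ok", "|", text)
```

To audit a live page you would fetch its HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `audit_meta_description`.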

Finally…Enjoy the Ride!

Finally, Minerazzi is here and open for business, after 1 year in beta.
Enjoy the ride without registration at

Sentiment Viz: A Great Tweet Sentiment Visualization Tool

While developing new tools for our platform (http://www.minerazzi.com), we came across Prof. Christopher Healey’s work on data mining and visualization. For those interested in these subjects, Prof. Healey has an incredible site, full of superb resources for data mining the Web. For instance, he has developed several great tools for analyzing sentiment from tweets. These are available at

  1. Healey, C., Ramaswamy, S. (Accessed on 6-17-2014). Visualizing Twitter Sentiment.
  2. Healey, C. (Accessed on 6-17-2014). Sentiment Viz. Tweet Sentiment Visualization Tool.

Amazon’s, Facebook’s, and Twitter’s URL naming patterns

Discovering Amazon’s, Facebook’s, and Twitter’s URL naming patterns is easy with MHM, http://www.minerazzi.com/tools/mhm/mhm.php

An MHM search for amazon.com retrieves

affiliate-program.amazon.com
amazoin.com
amazommusic.com
amazomvisa.com
amazon-associate.com
amazon-books.org
amazon-coffee.com
amazon-posters.com
amazon-tools.com
amazon.com
amazon.com.mk
amazon.info
amazon.mk
amazonabebooks.ch
amazonabebooks.cz
amazonabebooks.nl
amazonbestseller.com
amazonbestsellersecrets.com
amazongenerics.com
amazonmarketplace.com
amazonmystery.com
amazonpreserves.com
amazonpromo.com
amazonrock.com
amazonsupplements.com
amazontools.com
amzazon.com
amzn.com
auctions.amazon.com
babyamazon.com
bestsellers.com
borderstores.amazon.com
ccs.webcrawler.com
coolest-gadgets.com
cs.webcrawler.com
dailyfreebooks.com
deals.woot.com
ec.cooks.com
freephonetracer.com
imbroke.org
imdb.com
leecao.com
liveaa.com
m.amazon.com
media-server.amazon.com
occupyweb.org
onehundredfreebooks.com
onhop.ca
p.ly
peoplesmart.com
pepsistuff.amazon.com
priveazy.com
projectorquest.com
promotions.amazon.com
rcm.amazon.com
registrar.amazon.com
simply-amazon.com
spokeo.com
stat.dealtime.com
static.amazon.com
voipequipment.voipreview.org
voipstores.myvoipprovider.com
youtube.com

Whereas an MHM search for facebook.com retrieves

apps.facebook.com
ar-ar.fi-fi.connect.facebook.com
axeroom.com
az-az.connect.facebook.com
baxleybook.com
blog.facebook.com
c.facebook.com
campverdebugleonline.com
chinovalleyreview.com
cs-cz.es-la.fbjs.facebook.com
cs-cz.ro-ro.fbjs.facebook.com
cy-gb.fr-fr.connect.facebook.com
cy-gb.vm.connect.facebook.com
da-dk.vn-vi.connect.facebook.com
de-de.el-gr.connect.facebook.com
developers.facebook.com
ebizlatam.com
edge-star-shv-13-frc1.facebook.com
en-gb.facebook.com
en-gb.vn-vn.connect.facebook.com
en-pi.en-gb.connect.facebook.com
en-pi.facebook.com
es-es.sk-sk.connect.facebook.com
es-la.connect.connect.connect.facebook.com
es-la.da-dk.connect.facebook.com
es-la.ro-ro.fbjs.facebook.com
es-la.vi-vn.connect.facebook.com
es-la.vn.connect.facebook.com
eu-es.facebook.com
facbook.com
facebok.com
faceboo.com
facebook.at
facebook.be
facebook.biz
facebook.ca
facebook.co
facebook.co.id
facebook.co.nz
facebook.com
facebook.com.br
facebook.com.es
facebook.com.tw
facebook.de
facebook.dk
facebook.es
facebook.fr
facebook.ie
facebook.in
facebook.it
facebook.jp
facebook.net.nz
facebook.nl
facebook.org
facebook.pl
facebook.se
facebook.us
facebookembed.com
facebookinc.com
facebooknya-rian.com
facebooksuppliers.com
facebooksupplierstest.com
farmtown-links.com
fb.com
fb.com.
fb.me
fbcdn.net
fbdocs.com
fbjs.facebook.com
fbquestions.com
fbsbx.com
fcaebook.com
fr-ca.f.connect.facebook.com
fr-fr.fbjs.facebook.com
gentlemanbyraw.com
gossipwithmarkzuckerberg.com
groupon.com
gsmarena.com
hornsqueal.com
hs.facebook.com
https:
hu-hu.fbjs.facebook.com
id-id.cv.connect.facebook.com
indexjournal.com
internet.org
iphone.facebook.com
jheddy.com
ka-ge.connect.facebook.com
ka-ge.facebook.com
lite.facebook.com
lite.lite.facebook.com
lol.khakim.co.vu
m.connect.facebook.com
m.facebook.com
m.fbjs.facebook.com
mbasic.facebook.com
media.newvoicemedia.org
moochspot.com
nn-no.connect.connect.facebook.com
pa-in.facebook.com
paloverdevalleytimes.com
ph.connect.facebook.com
postmaster.facebook.com
prescottvalleytribune.com
pt-br.conect.connect.facebook.com
pvtrib.com
r00tz.us
ro-ro.connect.connect.connect.connect.connect.connect.facebook.com
ro-ro.connect.facebook.com
ro-ro.fbjs.facebook.com
roomleft.com
s.ar.ar.lite.facebook.com
sv-se.connect.connect.connect.facebook.com
te-in.facebook.com
thefacebook.at
thefacebook.com
thefacebook.de
touch.facebook.com
upload.facebook.com
vn-vi.connect.facebook.com
wral.com
xenotypee.com
yourdailyglobe.com
zh-cn.ar-ar.fbjs.facebook.com
zh-cn.ck.connect.facebook.com
zh-cn.cv.connect.facebook.com
zh-cn.de-de.connect.facebook.com
zh-cn.el-gr.connect.facebook.com
zh-cn.pl-pl.connect.facebook.com
zh-cn.pt-br.connect.facebook.com
zh-cn.vivin.connect.facebook.com
zh-cn.zh-tw.fbjs.facebook.com
zh-hk.connect.facebook.com

Finally, an MHM search for twitter.com returns

199.59.149.230
adlog.com.com
become.com
bigbugnews.com
blogs.reuters.com
bomarpublishing.com
business.twitter.com
campverdebugleonline.com
chinovalleyreview.com
cubuffs.com
dcourier.com
dispatch.com
drew.edu
elmbridgetoday.co.uk
es.twitter.com
fightingsioux.com
fr.twitter.com
freestockcharts.com
groupon.com
hirunews.lk
immj.co.cc
jeantoledo.com.br
jp.twitter.com
minews26.com
nathanigel.co.cc
pages.tma-email.com
paloverdevalleytimes.com
prescottvalleytribune.com
ptleader.com
pvtrib.com
rdsig.yahoo.co.jp
svenskafans.com
translate.twttr.com
tw1.kappacs.com
twitter.ad.vu
twitter.checkorphan.org
twitter.com
twitter.fr
twitter.joemoreno.com
twitter.rs
twitter.uz
twt.si
twtr.jp
twurl.nl
vmikeydets.com
wkusports.com
womenspress.com
worldofwatches.com
wral.com
www4.twitter.com
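
The naming patterns in result lists like the ones above are easier to see once the hosts are sorted into groups. The following sketch is one possible way to do that; the group names, the `classify_hosts` function, and the 0.8 similarity threshold are assumptions made for illustration, and the typo detection relies on a generic string-similarity ratio rather than anything MHM does.

```python
from difflib import SequenceMatcher


def classify_hosts(hosts, domain, threshold=0.8):
    """Sort result hosts into rough naming-pattern groups for `domain`."""
    label = domain.split(".")[0]  # e.g. "amazon" for "amazon.com"
    groups = {
        "target_or_subdomain": [],  # amazon.com, m.amazon.com
        "same_label": [],           # same leading label, other TLD or parent
        "contains_label": [],       # label embedded in a longer name
        "lookalike": [],            # probable typo of the label
        "other": [],
    }
    for raw in hosts:
        host = raw.strip().lower().rstrip(".")
        first = host.split(".")[0]
        if host == domain or host.endswith("." + domain):
            groups["target_or_subdomain"].append(host)
        elif first == label:
            groups["same_label"].append(host)
        elif label in host:
            groups["contains_label"].append(host)
        elif SequenceMatcher(None, label, first).ratio() >= threshold:
            groups["lookalike"].append(host)
        else:
            groups["other"].append(host)
    return groups


# A few hosts taken from the amazon.com results above.
sample = [
    "m.amazon.com",
    "amazon.com",
    "amazon.mk",
    "amazon-books.org",
    "amazoin.com",
    "imdb.com",
]

for group, members in classify_hosts(sample, "amazon.com").items():
    print(group, members)
```

Hosts that land in "other" (news sites, link shorteners, and so on) share no naming pattern with the target, so they are the ones worth inspecting by hand.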