Archive for the ‘AIRWeb Course’ Category

AIRWeb 2009 Proceedings

April 28, 2009

Here are the papers from the AIRWeb 2009 proceedings, available at http://airweb.cse.lehigh.edu/2009/proceedings.html

OK, SEOs, Spammers, and Hackers: start your engines and let the fun begin.

If you are a PUPR graduate student and are planning to take my AIR course, it might be a good idea to start browsing through these “gems”. Also check the previous AIRWeb proceedings.

Invited Talks

The Potential for Research and Development in Adversarial Information Retrieval — slides

Brian D. Davison

Web Spam Challenges: Looking Backward and Forward — slides

Carlos Castillo

Temporal Analysis

Looking into the Past to Better Classify Web Spam - slides

Na Dai, Brian D. Davison and Xiaoguang Qi

Web spamming techniques aim to achieve undeserved rankings in
search results. Research has been widely conducted on identifying
such spam and neutralizing its influence. However, existing spam
detection work only considers current information. We argue that
historical web page information may also be important in spam
classification. In this paper, we use content features from historical
versions of web pages to improve spam classification. We use
supervised learning techniques to combine classifiers based on
current page content with classifiers based on temporal features.
Experiments on the WEBSPAM-UK2007 dataset show that our
approach improves spam classification F-measure performance by
30% compared to a baseline classifier which only considers current
page content.

A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshots - slides

Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa

In this paper, we study the overall link-based spam structure
and its evolution which would be helpful for the development
of robust analysis tools and research for Web spamming as a
social activity in the cyber space. First, we use strongly connected
component (SCC) decomposition to separate many
link farms from the largest SCC, so called the core. We
show that denser link farms in the core can be extracted by
node filtering and recursive application of SCC decomposition
to the core. Surprisingly, we can find new large link
farms during each iteration and this trend continues until at
least 10 iterations. In addition, we measure the spamicity
of such link farms. Next, the evolution of link farms is examined
over two years. Results show that almost all large
link farms do not grow anymore while some of them shrink,
and many large link farms are created in one year.
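
For students who want to see what the SCC step in the abstract above looks like in practice, here is a minimal Python sketch. It is not the authors' code: it only shows a plain strongly connected component decomposition (Kosaraju's two-pass method) on a toy host graph. The function name, the edge format, and the toy data are assumptions made for illustration, and the node filtering and recursive re-application to the core described in the paper are left out.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (source, target) pairs.

    Components are returned largest first, so the first entry plays the
    role of the "core" in this toy setting.
    """
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Toy graph (hypothetical): a 3-host cycle, a 2-host reciprocal pair, one sink.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"),
             ("c", "x"), ("x", "y"), ("y", "x"), ("y", "z")]
toy_components = strongly_connected_components(toy_edges)
```

In this toy graph the cycle a, b, c comes out as the largest component, and the reciprocal pair x, y is a separate, smaller component of the kind one would then inspect as a link farm candidate.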

Web Spam Filtering in Internet Archives - slides

Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi

While Web spam is targeted for the high commercial value of top-ranked
search-engine results, Web archives observe quality deterioration
and resource waste as a side effect. So far Web spam filtering
technologies are rarely used by Web archivists but planned in the
future as indicated in a survey with responses from more than 20
institutions worldwide. These archives typically operate on a modest
level of budget that prohibits the operation of standalone Web
spam filtering but collaborative efforts could lead to a high quality
solution for them.
In this paper we illustrate spam filtering needs, opportunities and
blockers for Internet archives via analyzing several crawl snapshots
and the difficulty of migrating filter models across different
crawls via the example of the 13 .uk snapshots performed
by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.

Content Analysis

Web Spam Identification Through Language Model Analysis - slides

Juan Martinez-Romo and Lourdes Araujo

This paper applies a language model approach to different
sources of information extracted from a Web page, in order
to provide high quality indicators in the detection of
Web Spam. Two pages linked by a hyperlink should be
topically related, even though this were a weak contextual
relation. For this reason we have analysed different sources
of information of a Web page that belongs to the context of
a link and we have applied Kullback-Leibler divergence on
them for characterising the relationship between two linked
pages. Moreover, we combine some of these sources of information
in order to obtain richer language models. Given
the different nature of internal and external links, in our
study we also distinguished these types of links getting a
significant improvement in classification tasks. The result
is a system that improves the detection of Web Spam on
two large and public datasets such as WEBSPAM-UK2006 and
WEBSPAM-UK2007.
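
The Kullback-Leibler idea in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system from the paper: the unigram models, the add-alpha smoothing, the whitespace tokenizer, the function names, and the example strings are all hypothetical choices. The paper works with several sources of information around a link, which this sketch does not reproduce.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(P || Q) in bits; both distributions must share the same support."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text):
    """Divergence between the text around a link and the linked page text."""
    vocabulary = sorted(set(source_text.lower().split())
                        | set(target_text.lower().split()))
    source_model = unigram_model(source_text, vocabulary)
    target_model = unigram_model(target_text, vocabulary)
    return kl_divergence(source_model, target_model)


# Hypothetical anchor context and two hypothetical target pages.
anchor = "cheap flights to london"
related = link_divergence(anchor, "book cheap flights and hotels in london")
unrelated = link_divergence(anchor, "online casino poker bonus free spins")
```

With these toy strings the divergence for the topically related target is lower than for the unrelated one, which is the kind of signal a classifier could take as a feature. Smoothing is needed because an unsmoothed model would assign zero probability to unseen words and make the divergence undefined.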

An Empirical Study on Selective Sampling in Active Learning for Splog Detection - slides

Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara

This paper studies how to reduce the amount of human supervision
for identifying splogs / authentic blogs in the context
of continuously updating splog data sets year by year.
Following the previous works on active learning, against the
task of splog / authentic blog detection, this paper empirically
examines several strategies for selective sampling in
active learning by Support Vector Machines (SVMs). As a
confidence measure of SVMs learning, we employ the distance
from the separating hyperplane to each test instance,
which have been well studied in active learning for text classification.
Unlike those results of applying active learning
to text classification tasks, in the task of splog / authentic
blog detection of this paper, it is not the case that adding
least confident samples performs best.
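
To make the notion of selective sampling concrete, here is a small Python sketch. It assumes the signed distances from the separating hyperplane have already been computed by some trained SVM and are passed in as a plain dictionary. The function name, the two strategy names, and the sample scores are hypothetical, and the sketch does not claim to reproduce the strategies evaluated in the paper.

```python
def select_for_labeling(decision_values, k, strategy="least_confident"):
    """Pick k unlabeled instances to send to a human annotator.

    decision_values maps an instance id to its signed distance from the
    SVM separating hyperplane; the absolute value is used as confidence.
    """
    if k <= 0:
        return []
    # Rank instances from closest to the hyperplane to farthest from it.
    ranked = sorted(decision_values, key=lambda i: abs(decision_values[i]))
    if strategy == "least_confident":
        return ranked[:k]
    if strategy == "most_confident":
        return ranked[-k:][::-1]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical scores: positive means "splog", negative means "authentic".
scores = {"blog1": 2.1, "blog2": -0.1, "blog3": 0.4, "blog4": -1.7}
uncertain = select_for_labeling(scores, 2)
confident = select_for_labeling(scores, 2, strategy="most_confident")
```

Keeping the strategy as a parameter makes it easy to compare sampling policies on the same pool, which is the kind of comparison the abstract reports.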

Linked Latent Dirichlet Allocation in Web Spam Filtering - slides

István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003)
is a fully generative statistical language model on the content
and topics of a corpus of documents. In this paper
we apply an extension of LDA for web spam classification.
Our linked LDA technique takes also linkage into account:
topics are propagated along links in such a way that the
linked document directly influences the words in the linking
document. The inferred LDA model can be applied for
classification as dimensionality reduction similarly to latent
semantic indexing. We test linked LDA on the WEBSPAM-UK2007
corpus. By using BayesNet classifier, in terms of
the AUC of classification, we achieve 3% improvement over
plain LDA with BayesNet, and 8% over the public link features
with C4.5. The addition of this method to a log-odds
based combination of strong link and content baseline classifiers
results in a 3% improvement in AUC. Our method
even slightly improves over the best Web Spam Challenge
2008 result.

Social Spam

Social Spam Detection - slides

Benjamin Markines, Ciro Cattuto and Filippo Menczer

The popularity of social bookmarking sites has made them prime
targets for spammers. Many of these systems require an administrator’s
time and energy to manually filter or remove spam. Here
we discuss the motivations of social spam, and present a study
of automatic detection of spammers in a social tagging system.
We identify and analyze six distinct features that address various
properties of social spam, finding that each of these features provides
for a helpful signal to discriminate spammers from legitimate
users. These features are then used in various machine learning
algorithms for classification, achieving over 98% accuracy in detecting
social spammers with 2% false positives. These promising
results provide a new baseline for future efforts on social spam. We
make our dataset publicly available to the research community.

Tag Spam Creates Large Non-Giant Connected Components - slides

Nicolas Neubauer, Robert Wetzker and Klaus Obermayer

Spammers in social bookmarking systems try to mimic
bookmarking behaviour of real users to gain the attention
of other users or search engines. Several methods have been
proposed for the detection of such spam, including domain specific
features (like URL terms) or similarity of users to
previously identified spammers. However, as shown in our
previous work, it is possible to identify a large fraction of
spam users based on purely structural features. The hypergraph
connecting documents, users, and tags can be decomposed
into connected components, and any large, but non-giant
components turned out to be almost entirely inhabited
by spam users in the examined dataset. Here, we test
to what degree the decomposition of the complete hypergraph
is really necessary, examining the component structure
of the induced user/document and user/tag graphs.
While the user/tag graph’s connectivity does not help in
classifying spammers, the user/document graph’s connectivity
is already highly informative. It can however be augmented
with connectivity information from the hypergraph.
In our view, spam detection based on structural features, like
the one proposed here, requires complex adaptation strategies
from spammers and may complement other, more traditional
detection approaches.
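
The structural feature described in the abstract above, membership in a large but non-giant connected component of the user/document graph, can be sketched with a small union-find in Python. This is an illustration only: the function names, the (user, document) input format, the size threshold, and the toy bookmarks are assumptions, and the hypergraph variant from the paper is not covered.

```python
from collections import defaultdict


def user_components(bookmarks):
    """Group users by connected component of the user/document graph.

    bookmarks is an iterable of (user, document) pairs.
    Returns a list of user sets, largest component first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag nodes by kind so a user and a document may share a name.
    for user, doc in bookmarks:
        union(("user", user), ("doc", doc))

    groups = defaultdict(set)
    for kind, name in list(parent):
        if kind == "user":
            groups[find((kind, name))].add(name)
    return sorted(groups.values(), key=len, reverse=True)


def large_non_giant(components, min_size):
    """Components other than the largest one with at least min_size users."""
    return [c for c in components[1:] if len(c) >= min_size]


# Hypothetical bookmarks: one big community, one tight cluster, one loner.
toy_bookmarks = [("u1", "d1"), ("u2", "d1"), ("u2", "d2"), ("u3", "d2"),
                 ("u4", "d2"), ("s1", "x1"), ("s2", "x1"), ("s3", "x1"),
                 ("solo", "y1")]
components = user_components(toy_bookmarks)
suspects = large_non_giant(components, min_size=2)
```

Here the cluster s1, s2, s3 is flagged because it is sizeable yet disconnected from the giant component. On real data the threshold would have to be tuned, and the flagged users would still need to be checked against labels.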

Spam Research Collections

Nullification Test Collections for Web Spam and SEO - slides

Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell

Research in the area of adversarial information retrieval has
been facilitated by the availability of the UK-2006/UK-2007
collections, comprising crawl data, link graph, and spam labels.
However, research into nullifying the negative effect
of spam or excessive search engine optimisation (SEO) on
the ranking of non-spam pages is not well supported by
these resources. Nor is the study of cloaking techniques
or of click spam. Finally, the domain-restricted nature of a
.uk crawl means that only parts of link-farm icebergs may
be visible in these crawls. We introduce the term nullification
which we define as “preventing problem pages from
negatively affecting search results”. We show some important
differences between properties of current .uk-restricted
crawls and those previously reported for the Web as a whole.
We identify a need for an adversarial IR collection which is
not domain-restricted and which is supported by a set of
appropriate query sets and (optimistically) user-behaviour
data. The billion-page unrestricted crawl being conducted
by CMU (web09-bst) and which will be used in the 2009
TREC Web Track is assessed as a possible basis for a new
AIR test collection. We discuss the pros and cons of its scale,
and the feasibility of adding resources such as query lists to
enhance the utility of the collection for AIR research.

Web Spam Challenge Proposal for Filtering in Archives - slides

András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi

In this paper we propose new tasks for a possible future Web Spam
Challenge motivated by the needs of the archival community. The
Web archival community consists of several relatively small institutions
that operate independently and possibly over different top
level domains (TLDs). Each of them may have a large set of historic
crawls. Efficient filtering would hence require (1) enhanced
use of the time series of domain snapshots and (2) collaboration by
transferring models across different TLDs. Corresponding Challenge
tasks could hence include the distribution of crawl snapshot
data for feature generation as well as classification of unlabeled
new crawls of the same or even different TLDs.

Hackers Hit Pentagon

April 22, 2009

It happened again: Thanks to Web vulnerabilities, hackers were able to hit the Pentagon. 

According to CNN (http://www.cnn.com/2009/US/04/21/pentagon.hacked/),

Thousands of confidential files on the U.S. military’s most technologically advanced fighter aircraft have been compromised by unknown computer hackers over the past two years, according to senior defense officials.

The Internet intruders were able to gain access to data related to the design and electronics systems of the Joint Strike Fighter through computers of Pentagon contractors in charge of designing and building the aircraft, according to the officials, who did not want to be identified because of the sensitivity of the issue.

In addition to files relating to the aircraft, hackers gained entry into the Air Force’s air traffic control systems, according to the officials. Once they got in, the Internet hackers were able to see such information as the locations of U.S. military aircraft in flight.

This news is quite relevant to my Fall 2009 Web Vulnerability graduate course (http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf)

BTW, Dr. Alfredo Cruz, Associate Director of the CS Department at PUPR.edu and also a colleague and friend, called me two days ago with some great news: the department has been accredited for 2009-2014 as a National Center of Academic Excellence in Information Assurance Education. Soon it will be listed with the other members of this exclusive “club” on the National Security Agency web site (http://www.nsa.gov/ia/academic_outreach/nat_cae/institutions.shtml)

An official press release and a formal presentation before the pertinent authorities are being coordinated for the next few weeks.

The next issue of IR Watch – The Newsletter provides additional coverage of this exciting news.

I have tied these two news items together in a single post to underscore the need for IR/data mining courses at the intersection with Information Security, which is precisely the mission of IRW, now reaching more than 300 investigators/research centers.

McAfee Report: Email Spam and the Environment

April 16, 2009

According to a McAfee report,

Until now, spam’s impact has been measured in time, money, and aggravation. It turns out there is a massive environmental impact as well. McAfee recently commissioned climate-change consultant ICF International and spam expert Richi Jennings to calculate the environmental impact of spam. The results that came back were startling: The energy consumed in transmitting and deleting spam is equivalent to the electricity used in 2.4 million U.S. homes, with greenhouse gas (GHG) emissions equivalent to 3.1 million passenger cars (http://resources.mcafee.com/content/NACarbonFootprintSpam)

I first learned about these findings through ABC. Essentially,

Anything powered by electricity also emits greenhouse gases. McAfee researchers say each junk e-mail emits 0.3 grams of the greenhouse gas carbon dioxide (CO2). That may not sound like much, but when you consider the volume of global annual spam, it all adds up. (http://abcnews.go.com/Technology/GlobalWarming/story?id=7343518&page=1).

Following that reasoning, spamdexing search engines, like any other adversarial information retrieval (AIR) practice, also adds insult to injury, as do many other things that come to mind.

I will tell that to students of my Fall 2009 AIRWeb Course.

Hmm, shocking: AIR vs. Environment.

I never thought about such an obvious connection.  :)

AIRWeb Course Announcement

April 2, 2009

During the Fall of 2009, I will be teaching 

 Adversarial Information Retrieval on the Web:  A Graduate Course on Web Spam and Internet Vulnerabilities

This is a new full-semester graduate course to be offered at Polytechnic University of Puerto Rico. It is based on the material presented at the annual AIRWeb Workshops. KDDM graduate students are encouraged to enroll. An early announcement and preliminary syllabus are available at

http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf

BTW, on November 5, 2008, PUPR became the first academic institution in the Caribbean to be certified by the Committee on National Security Systems (CNSS). Additional information is available at http://www.pupr.edu/ias.html

Their goal is to become a Center of Academic Excellence in Information Assurance Education (CAE/IAE). This is great news. Nationwide, how many universities do you know that are in such an exclusive “club”?