Here are the proceedings papers of AIRWeb 2009, available at

OK, SEOs, Spammers, and Hackers: start your engines and let the fun begin.

If you are a PUPR graduate student and are planning to take my AIR course, it might be a good idea to start browsing through these “gems”. Check also previous proceedings of AIRWeb.

Invited Talks

The Potential for Research and Development in Adversarial Information Retrieval — slides

Brian D. Davison

Web Spam Challenges: Looking Backward and Forward — slides

Carlos Castillo

Temporal Analysis

Looking into the Past to Better Classify Web Spam — slides

Na Dai, Brian D. Davison and Xiaoguang Qi

Web spamming techniques aim to achieve undeserved rankings in
search results. Research has been widely conducted on identifying
such spam and neutralizing its influence. However, existing spam
detection work only considers current information. We argue that
historical web page information may also be important in spam
classification. In this paper, we use content features from historical
versions of web pages to improve spam classification. We use
supervised learning techniques to combine classifiers based on
current page content with classifiers based on temporal features.
Experiments on the WEBSPAM-UK2007 dataset show that our
approach improves spam classification F-measure performance by
30% compared to a baseline classifier which only considers current
page content.

A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshots — slides

Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa

In this paper, we study the overall link-based spam structure
and its evolution which would be helpful for the development
of robust analysis tools and research for Web spamming as a
social activity in the cyber space. First, we use strongly connected
component (SCC) decomposition to separate many
link farms from the largest SCC, so called the core. We
show that denser link farms in the core can be extracted by
node filtering and recursive application of SCC decomposition
to the core. Surprisingly, we can find new large link
farms during each iteration and this trend continues until at
least 10 iterations. In addition, we measure the spamicity
of such link farms. Next, the evolution of link farms is examined
over two years. Results show that almost all large
link farms do not grow anymore while some of them shrink,
and many large link farms are created in one year.
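
The recursive decomposition described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which node filter they use, so the degree threshold `min_degree` and the helper names `strongly_connected_components` and `peel_core` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm, written iteratively to avoid recursion limits."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS finish time on the forward graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def peel_core(nodes, edges, min_degree=2, max_iterations=10):
    """Split off non-core SCCs, filter the core, and repeat on what is left."""
    nodes = set(nodes)
    farms_per_iteration = []
    for _ in range(max_iterations):
        edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
        components = sorted(strongly_connected_components(nodes, edges),
                            key=len, reverse=True)
        if not components:
            break
        core, rest = components[0], components[1:]
        # Non-trivial SCCs outside the core are the link-farm candidates.
        farms_per_iteration.append([c for c in rest if len(c) > 1])
        # Assumed filter: keep core nodes with enough links inside the core.
        core_edges = [(u, v) for u, v in edges if u in core and v in core]
        out_deg = Counter(u for u, _ in core_edges)
        in_deg = Counter(v for _, v in core_edges)
        filtered = {n for n in core
                    if out_deg[n] >= min_degree and in_deg[n] >= min_degree}
        if not filtered or filtered == core:
            break
        nodes = filtered
    return farms_per_iteration


# Toy graph: two dense cliques joined only through weak bridge nodes x and y.
clique_a, clique_b = ["a1", "a2", "a3"], ["b1", "b2", "b3"]
toy_edges = [(u, v) for group in (clique_a, clique_b)
             for u in group for v in group if u != v]
toy_edges += [("a1", "x"), ("x", "b1"), ("b1", "y"), ("y", "a1")]
toy_nodes = clique_a + clique_b + ["x", "y"]
farms = peel_core(toy_nodes, toy_edges)
```

In the toy graph everything sits in one SCC at first, so the first iteration finds nothing; once the weakly connected bridge nodes are filtered out, the second iteration separates one clique from the other, which mirrors the abstract's observation that new farms keep appearing in later iterations.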

Web Spam Filtering in Internet Archives — slides

Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi

While Web spam is targeted for the high commercial value of top-ranked
search-engine results, Web archives observe quality deterioration
and resource waste as a side effect. So far Web spam filtering
technologies are rarely used by Web archivists but planned in the
future as indicated in a survey with responses from more than 20
institutions worldwide. These archives typically operate on a modest
level of budget that prohibits the operation of standalone Web
spam filtering but collaborative efforts could lead to a high quality
solution for them.
In this paper we illustrate spam filtering needs, opportunities and
blockers for Internet archives via analyzing several crawl snapshots
and the difficulty of migrating filter models across different
crawls via the example of the 13 .uk snapshots performed
by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.

Content Analysis

Web Spam Identification Through Language Model Analysis — slides

Juan Martinez-Romo and Lourdes Araujo

This paper applies a language model approach to different
sources of information extracted from a Web page, in order
to provide high quality indicators in the detection of
Web Spam. Two pages linked by a hyperlink should be
topically related, even though this were a weak contextual
relation. For this reason we have analysed different sources
of information of a Web page that belongs to the context of
a link and we have applied Kullback-Leibler divergence on
them for characterising the relationship between two linked
pages. Moreover, we combine some of these sources of information
in order to obtain richer language models. Given
the different nature of internal and external links, in our
study we also distinguished these types of links getting a
significant improvement in classification tasks. The result
is a system that improves the detection of Web Spam on
two large and public datasets such as WEBSPAM-UK2006 and WEBSPAM-UK2007.
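
To make the Kullback-Leibler idea concrete, here is a small sketch of how the divergence between two text sources around a link could be computed. It is an illustration under stated assumptions, not the paper's system: the unigram models, the add-one smoothing (`alpha`), and the function names are all choices made for this example.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Smoothed unigram language model over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in bits; both models must cover the same vocabulary."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text):
    """Divergence between a source of link context and the linked page."""
    vocabulary = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocabulary)
    q = unigram_model(target_text, vocabulary)
    return kl_divergence(p, q)


anchor = "cheap flights to london"
related_page = "book cheap flights to london and paris"
unrelated_page = "casino poker bonus win money online"

related_score = link_divergence(anchor, related_page)
unrelated_score = link_divergence(anchor, unrelated_page)
print(related_score < unrelated_score)
```

A low divergence suggests the linked page talks about what the link context promises; a high one is the kind of signal a spam classifier could use as a feature.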

An Empirical Study on Selective Sampling in Active Learning for Splog Detection — slides

Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara

This paper studies how to reduce the amount of human supervision
for identifying splogs / authentic blogs in the context
of continuously updating splog data sets year by year.
Following the previous works on active learning, against the
task of splog / authentic blog detection, this paper empirically
examines several strategies for selective sampling in
active learning by Support Vector Machines (SVMs). As a
confidence measure of SVMs learning, we employ the distance
from the separating hyperplane to each test instance,
which has been well studied in active learning for text classification.
Unlike those results of applying active learning
to text classification tasks, in the task of splog / authentic
blog detection of this paper, it is not the case that adding
least confident samples performs best.
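
The confidence measure mentioned here, the distance from the separating hyperplane, can be sketched as below. The example assumes a linear SVM has already been trained and that its weight vector `w` and bias `b` are available; the numbers, the function names, and the two strategy labels are hypothetical and are not taken from the paper.

```python
import math


def hyperplane_distance(w, b, x):
    """Geometric distance of instance x from the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


def select_samples(w, b, pool, k, strategy="least_confident"):
    """Pick k unlabeled instances to send to a human annotator."""
    ranked = sorted(pool, key=lambda x: hyperplane_distance(w, b, x))
    if strategy == "least_confident":
        return ranked[:k]   # closest to the boundary
    if strategy == "most_confident":
        return ranked[-k:]  # farthest from the boundary
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical linear model and unlabeled pool of 2-D feature vectors.
w, b = [1.0, 0.0], 0.0
pool = [(0.1, 5.0), (3.0, 0.0), (-2.0, 1.0), (0.5, 0.0)]

nearest = select_samples(w, b, pool, k=2)
farthest = select_samples(w, b, pool, k=2, strategy="most_confident")
```

Both strategies are exposed because, as the abstract notes, picking the least confident samples is the usual choice in text classification but did not perform best for splog detection.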

Linked Latent Dirichlet Allocation in Web Spam Filtering — slides

István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003)
is a fully generative statistical language model on the content
and topics of a corpus of documents. In this paper
we apply an extension of LDA for web spam classification.
Our linked LDA technique takes also linkage into account:
topics are propagated along links in such a way that the
linked document directly influences the words in the linking
document. The inferred LDA model can be applied for
classification as dimensionality reduction similarly to latent
semantic indexing. We test linked LDA on the WEBSPAM-UK2007
corpus. By using BayesNet classifier, in terms of
the AUC of classification, we achieve 3% improvement over
plain LDA with BayesNet, and 8% over the public link features
with C4.5. The addition of this method to a log-odds
based combination of strong link and content baseline classifiers
results in a 3% improvement in AUC. Our method
even slightly improves over the best Web Spam Challenge
2008 result.

Social Spam

Social Spam Detection


Benjamin Markines, Ciro Cattuto and Filippo Menczer

The popularity of social bookmarking sites has made them prime
targets for spammers. Many of these systems require an administrator’s
time and energy to manually filter or remove spam. Here
we discuss the motivations of social spam, and present a study
of automatic detection of spammers in a social tagging system.
We identify and analyze six distinct features that address various
properties of social spam, finding that each of these features provides
for a helpful signal to discriminate spammers from legitimate
users. These features are then used in various machine learning
algorithms for classification, achieving over 98% accuracy in detecting
social spammers with 2% false positives. These promising
results provide a new baseline for future efforts on social spam. We
make our dataset publicly available to the research community.

Tag Spam Creates Large Non-Giant Connected Components — slides

Nicolas Neubauer, Robert Wetzker and Klaus Obermayer

Spammers in social bookmarking systems try to mimic
bookmarking behaviour of real users to gain the attention
of other users or search engines. Several methods have been
proposed for the detection of such spam, including domain-specific
features (like URL terms) or similarity of users to
previously identified spammers. However, as shown in our
previous work, it is possible to identify a large fraction of
spam users based on purely structural features. The hypergraph
connecting documents, users, and tags can be decomposed
into connected components, and any large, but non-giant
components turned out to be almost entirely inhabited
by spam users in the examined dataset. Here, we test
to what degree the decomposition of the complete hypergraph
is really necessary, examining the component structure
of the induced user/document and user/tag graphs.
While the user/tag graph’s connectivity does not help in
classifying spammers, the user/document graph’s connectivity
is already highly informative. It can however be augmented
with connectivity information from the hypergraph.
In our view, spam detection based on structural features, like
the one proposed here, requires complex adaptation strategies
from spammers and may complement other, more traditional
detection approaches.
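
The structural idea in this abstract, that large but non-giant components of the user/document graph are mostly spam, can be sketched with a union-find pass. This is only an illustration: the input format, the `min_size` threshold, measuring component size in users, and the function names are assumptions, not details from the paper.

```python
from collections import defaultdict


def user_components(bookmarks):
    """Group users by connected component of the user/document graph.

    bookmarks is an iterable of (user, document) pairs.
    Returns a list of user sets, largest first.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for user, document in bookmarks:
        root_user, root_doc = find(("user", user)), find(("doc", document))
        if root_user != root_doc:
            parent[root_user] = root_doc

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "user":
            groups[find(node)].add(node[1])
    return sorted(groups.values(), key=len, reverse=True)


def suspicious_users(bookmarks, min_size=3):
    """Flag users in components that are large but not the giant one."""
    components = user_components(bookmarks)
    flagged = [c for c in components[1:] if len(c) >= min_size]
    return set().union(*flagged)


# Toy data: five ordinary users share popular pages, three accounts share
# only one promoted page, and one user bookmarks a page nobody else has.
bookmarks = [
    ("u1", "news"), ("u2", "news"), ("u2", "wiki"), ("u3", "wiki"),
    ("u4", "wiki"), ("u5", "news"),
    ("s1", "pills"), ("s2", "pills"), ("s3", "pills"),
    ("l1", "diary"),
]
flagged_users = suspicious_users(bookmarks)
```

Here the three accounts that bookmark only the promoted page form a sizeable component cut off from the giant one, so they are flagged, while the lone user falls below the size threshold.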

Spam Research Collections

Nullification Test Collections for Web Spam and SEO

Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell

Research in the area of adversarial information retrieval has
been facilitated by the availability of the UK-2006/UK-2007
collections, comprising crawl data, link graph, and spam labels.
However, research into nullifying the negative effect
of spam or excessive search engine optimisation (SEO) on
the ranking of non-spam pages is not well supported by
these resources. Nor is the study of cloaking techniques
or of click spam. Finally, the domain-restricted nature of a
.uk crawl means that only parts of link-farm icebergs may
be visible in these crawls. We introduce the term nullification
which we define as “preventing problem pages from
negatively affecting search results”. We show some important
differences between properties of current .uk-restricted
crawls and those previously reported for the Web as a whole.
We identify a need for an adversarial IR collection which is
not domain-restricted and which is supported by a set of
appropriate query sets and (optimistically) user-behaviour
data. The billion-page unrestricted crawl being conducted
by CMU (web09-bst) and which will be used in the 2009
TREC Web Track is assessed as a possible basis for a new
AIR test collection. We discuss the pros and cons of its scale,
and the feasibility of adding resources such as query lists to
enhance the utility of the collection for AIR research.

Web Spam Challenge Proposal for Filtering in Archives — slides

András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi

In this paper we propose new tasks for a possible future Web Spam
Challenge motivated by the needs of the archival community. The
Web archival community consists of several relatively small institutions
that operate independently and possibly over different top
level domains (TLDs). Each of them may have a large set of historic
crawls. Efficient filtering would hence require (1) enhanced
use of the time series of domain snapshots and (2) collaboration by
transferring models across different TLDs. Corresponding Challenge
tasks could hence include the distribution of crawl snapshot
data for feature generation as well as classification of unlabeled
new crawls of the same or even different TLDs.