Archive for the ‘Spam’ Category

Ethical Hacking: An Oxymoron, a Misnomer, or Both?

May 18, 2009

According to a report from the British Computer Society (BCS) covering a Security Panel Strategic Forum, “ethical hacking” is an oxymoron.

The report highlights dos and don’ts when it comes to defining terms like “hacker”, “ethical hacking”, “penetration tester”, “white/black hats”, and derivative terms. These labels are frequently used in the IT industry. The report also underscores which terms should not be used by schools offering IT courses.

The problem with defining and redefining such labels is that there will always be others disagreeing with/circumventing said definitions.

For instance, in the December 1986 issue of MicroTimes, Bob Bickford wrote:

“A Hacker is any person who derives joy from discovering ways to circumvent limitations.”

If we accept this definition, then a person who doesn’t derive any joy from discovering ways to circumvent limitations is not a hacker. By the same token, a cheating spouse, an SEO, a spammer, a politician, a mobster, or a kid trying to get some candy from mom is a hacker.

I am taking this extreme, off-topic interpretation to illustrate the problem of semantics when it comes to defining things.

Whether you agree or disagree, partially or totally, with the report, it is a good read. It will certainly be a good piece for students planning to take my AIRWeb graduate course.

IRW: RIA Vulnerabilities

May 4, 2009

The current issue of IRW should reach subscribers’ inboxes tomorrow.

In this issue:

Featured article: RIA Vulnerabilities

This issue of the newsletter discusses how hackers might be exploiting Web vulnerabilities found in Rich Internet Applications (RIAs). As mentioned in our previous issue, some RIAs are based on Adobe’s technologies like Flash, Flex, or AIR. Some are designed to be run online or offline. Their rising popularity has attracted developers and marketers and, as expected, hackers and spammers.

QA: Excel Vector Normalization: How do I convert a row vector into a unit vector?
Who is Who in IR: C.J. van Rijsbergen
Top CS Departments: Polytechnic University of Puerto Rico
Historical Notes: ENIAC Computer
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…

No-Caching is Spammers’ Best Friend

April 30, 2009

Today I feel like giving a piece of advice to spammers, which should force the bar higher in the “we versus them” of the Spam War. Think of this as a love-hate relationship.

C’mon spammers, I know you can do better. Don’t make our IR life so easy when it comes to neutralizing your tactics. He, He.

At the recent AIRWeb Workshops, Brian Davison presented the paper Looking into the Past to Better Classify Web Spam, which received high reviews from referees and the audience.

Wannabe spammers, if you are really committed to spamdexing, at least know the how-tos. Don’t leave a temporal fingerprint of your web presence. Try this:

1. Prevent online resources from caching your web pages, like the Wayback Machine and commercial search engines.

2. Use No-Cache and No-Archive.

3. Switch hosts whenever you can.

4. Constantly mutate your link structure.

5. Don’t profile yourself with easy-to-detect or predictable honeypots, link swapping, strongly-connected component structures, etc.
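From the defender's side, items 1 and 2 are themselves a signal that can be checked. The sketch below is only an illustration, not anything taken from the AIRWeb papers: it assumes the page HTML and HTTP response headers are already in hand, and the names `RobotsMetaParser` and `blocks_archiving` are made up for this example. It looks for the `noarchive` robots directive (in a meta tag or an `X-Robots-Tag` header) and for `no-cache`/`no-store` in `Cache-Control`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def blocks_archiving(html, headers=None):
    """Return True if the page or its headers ask not to be cached or archived."""
    parser = RobotsMetaParser()
    parser.feed(html)
    lowered = {k.lower(): v.lower() for k, v in (headers or {}).items()}
    header_tokens = set()
    for name in ("x-robots-tag", "cache-control"):
        header_tokens.update(t.strip() for t in lowered.get(name, "").split(","))
    in_meta = "noarchive" in parser.directives
    in_headers = bool({"noarchive", "no-cache", "no-store"} & header_tokens)
    return in_meta or in_headers
```

Plenty of legitimate sites use these directives too, so on its own this would be a weak feature at best.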

Why give this advice? Check the current AIRWeb “gems”.

AIRWeb 2009 Proceedings

April 28, 2009

Here are the proceeding papers of AIRWeb 2009, available at http://airweb.cse.lehigh.edu/2009/proceedings.html

OK, SEOs, Spammers, and Hackers: start your engines and let the fun begin.

If you are a PUPR graduate student and are planning to take my AIR course, it might be a good idea to start browsing through these “gems”. Check also previous proceedings of AIRWeb.

Invited Talks

The Potential for Research and Development in Adversarial Information Retrieval — slides

Brian D. Davison

Web Spam Challenges: Looking Backward and Forward — slides

Carlos Castillo

Temporal Analysis

Looking into the Past to Better Classify Web Spam (slides)

Na Dai, Brian D. Davison and Xiaoguang Qi

Web spamming techniques aim to achieve undeserved rankings in
search results. Research has been widely conducted on identifying
such spam and neutralizing its influence. However, existing spam
detection work only considers current information. We argue that
historical web page information may also be important in spam
classification. In this paper, we use content features from historical
versions of web pages to improve spam classification. We use
supervised learning techniques to combine classifiers based on
current page content with classifiers based on temporal features.
Experiments on the WEBSPAM-UK2007 dataset show that our
approach improves spam classification F-measure performance by
30% compared to a baseline classifier which only considers current
page content.

A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshots (slides)

Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa

In this paper, we study the overall link-based spam structure
and its evolution which would be helpful for the development
of robust analysis tools and research for Web spamming as a
social activity in the cyber space. First, we use strongly connected
component (SCC) decomposition to separate many
link farms from the largest SCC, so called the core. We
show that denser link farms in the core can be extracted by
node filtering and recursive application of SCC decomposition
to the core. Surprisingly, we can find new large link
farms during each iteration and this trend continues until at
least 10 iterations. In addition, we measure the spamicity
of such link farms. Next, the evolution of link farms is examined
over two years. Results show that almost all large
link farms do not grow anymore while some of them shrink,
and many large link farms are created in one year.
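For readers new to SCC decomposition, here is a minimal sketch of the basic operation the abstract builds on. It is not the authors' code: it uses Kosaraju's two-pass algorithm on a hypothetical toy edge list, and it leaves out the node filtering and recursive re-application to the core that the paper describes.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (source, target) pairs."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion (iterative, no recursion limit).
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Hypothetical toy web graph: a three-page ring plus a two-page ring.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"), ("e", "d")]
farms = strongly_connected_components(toy_edges)
```

On a real crawl the largest component would be the core, and the smaller dense ones would be the link-farm candidates.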

Web Spam Filtering in Internet Archives (slides)

Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi

While Web spam is targeted for the high commercial value of top-ranked
search-engine results, Web archives observe quality deterioration
and resource waste as a side effect. So far Web spam filtering
technologies are rarely used by Web archivists but planned in the
future as indicated in a survey with responses from more than 20
institutions worldwide. These archives typically operate on a modest
level of budget that prohibits the operation of standalone Web
spam filtering but collaborative efforts could lead to a high quality
solution for them.
In this paper we illustrate spam filtering needs, opportunities and
blockers for Internet archives via analyzing several crawl snapshots
and the difficulty of migrating filter models across different
crawls via the example of the 13 .uk snapshots performed
by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.

Content Analysis

Web Spam Identification Through Language Model Analysis (slides)

Juan Martinez-Romo and Lourdes Araujo

This paper applies a language model approach to different
sources of information extracted from a Web page, in order
to provide high quality indicators in the detection of
Web Spam. Two pages linked by a hyperlink should be
topically related, even though this were a weak contextual
relation. For this reason we have analysed different sources
of information of a Web page that belongs to the context of
a link and we have applied Kullback-Leibler divergence on
them for characterising the relationship between two linked
pages. Moreover, we combine some of these sources of information
in order to obtain richer language models. Given
the different nature of internal and external links, in our
study we also distinguished these types of links getting a
significant improvement in classification tasks. The result
is a system that improves the detection of Web Spam on
two large and public datasets such as WEBSPAM-UK2006 and
WEBSPAM-UK2007.
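To give a feel for the Kullback-Leibler idea mentioned in the abstract, here is a rough sketch. It is not the paper's method: it assumes whitespace tokenization and add-one smoothed unigram models over the shared vocabulary, and the function names and sample texts are invented for illustration.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    """D_KL(P || Q) in bits; both distributions must share the same support."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def link_divergence(source_text, target_text):
    """Divergence between the text around a link and the page it points to."""
    vocabulary = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocabulary)
    q = unigram_model(target_text, vocabulary)
    return kl_divergence(p, q)


# Hypothetical anchor text and two candidate target pages.
anchor = "cheap flights to madrid"
related_page = "book cheap flights and hotels in madrid"
unrelated_page = "buy pills online casino poker bonus"
```

With these toy strings, `link_divergence(anchor, related_page)` comes out lower than `link_divergence(anchor, unrelated_page)`, which is the kind of gap a classifier could use as a feature.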

An Empirical Study on Selective Sampling in Active Learning for Splog Detection (slides)

Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara

This paper studies how to reduce the amount of human supervision
for identifying splogs / authentic blogs in the context
of continuously updating splog data sets year by year.
Following the previous works on active learning, against the
task of splog / authentic blog detection, this paper empirically
examines several strategies for selective sampling in
active learning by Support Vector Machines (SVMs). As a
confidence measure of SVMs learning, we employ the distance
from the separating hyperplane to each test instance,
which have been well studied in active learning for text classification.
Unlike those results of applying active learning
to text classification tasks, in the task of splog / authentic
blog detection of this paper, it is not the case that adding
least confident samples performs best.

Linked Latent Dirichlet Allocation in Web Spam Filtering (slides)

István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003)
is a fully generative statistical language model on the content
and topics of a corpus of documents. In this paper
we apply an extension of LDA for web spam classification.
Our linked LDA technique takes also linkage into account:
topics are propagated along links in such a way that the
linked document directly influences the words in the linking
document. The inferred LDA model can be applied for
classification as dimensionality reduction similarly to latent
semantic indexing. We test linked LDA on the WEBSPAM-UK2007
corpus. By using BayesNet classifier, in terms of
the AUC of classification, we achieve 3% improvement over
plain LDA with BayesNet, and 8% over the public link features
with C4.5. The addition of this method to a log-odds
based combination of strong link and content baseline classifiers
results in a 3% improvement in AUC. Our method
even slightly improves over the best Web Spam Challenge
2008 result.

Social Spam

Social Spam Detection (slides)

Benjamin Markines, Ciro Cattuto and Filippo Menczer

The popularity of social bookmarking sites has made them prime
targets for spammers. Many of these systems require an administrator’s
time and energy to manually filter or remove spam. Here
we discuss the motivations of social spam, and present a study
of automatic detection of spammers in a social tagging system.
We identify and analyze six distinct features that address various
properties of social spam, finding that each of these features provides
for a helpful signal to discriminate spammers from legitimate
users. These features are then used in various machine learning
algorithms for classification, achieving over 98% accuracy in detecting
social spammers with 2% false positives. These promising
results provide a new baseline for future efforts on social spam. We
make our dataset publicly available to the research community.

Tag Spam Creates Large Non-Giant Connected Components (slides)

Nicolas Neubauer, Robert Wetzker and Klaus Obermayer

Spammers in social bookmarking systems try to mimick
bookmarking behaviour of real users to gain the attention
of other users or search engines. Several methods have been
proposed for the detection of such spam, including domain specific
features (like URL terms) or similarity of users to
previously identified spammers. However, as shown in our
previous work, it is possible to identify a large fraction of
spam users based on purely structural features. The hypergraph
connecting documents, users, and tags can be decomposed
into connected components, and any large, but non-giant
components turned out to be almost entirely inhabited
by spam users in the examined dataset. Here, we test
to what degree the decomposition of the complete hypergraph
is really necessary, examining the component structure
of the induced user/document and user/tag graphs.
While the user/tag graph’s connectivity does not help in
classifying spammers, the user/document graph’s connectivity
is already highly informative. It can however be augmented
with connectivity information from the hypergraph.
In our view, spam detection based on structural features, like
the one proposed here, requires complex adaptation strategies
from spammers and may complement other, more traditional
detection approaches.

Spam Research Collections

Nullification Test Collections for Web Spam and SEO (slides)

Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell

Research in the area of adversarial information retrieval has
been facilitated by the availability of the UK-2006/UK-2007
collections, comprising crawl data, link graph, and spam labels.
However, research into nullifying the negative effect
of spam or excessive search engine optimisation (SEO) on
the ranking of non-spam pages is not well supported by
these resources. Nor is the study of cloaking techniques
or of click spam. Finally, the domain-restricted nature of a
.uk crawl means that only parts of link-farm icebergs may
be visible in these crawls. We introduce the term nullification
which we define as “preventing problem pages from
negatively affecting search results”. We show some important
differences between properties of current .uk-restricted
crawls and those previously reported for the Web as a whole.
We identify a need for an adversarial IR collection which is
not domain-restricted and which is supported by a set of
appropriate query sets and (optimistically) user-behaviour
data. The billion-page unrestricted crawl being conducted
by CMU (web09-bst) and which will be used in the 2009
TREC Web Track is assessed as a possible basis for a new
AIR test collection. We discuss the pros and cons of its scale,
and the feasibility of adding resources such as query lists to
enhance the utility of the collection for AIR research.

Web Spam Challenge Proposal for Filtering in Archives (slides)

András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi

In this paper we propose new tasks for a possible future Web Spam
Challenge motivated by the needs of the archival community. The
Web archival community consists of several relatively small institutions
that operate independently and possibly over different top
level domains (TLDs). Each of them may have a large set of historic
crawls. Efficient filtering would hence require (1) enhanced
use of the time series of domain snapshots and (2) collaboration by
transferring models across different TLDs. Corresponding Challenge
tasks could hence include the distribution of crawl snapshot
data for feature generation as well as classification of unlabeled
new crawls of the same or even different TLDs.

Hackers Hit Pentagon

April 22, 2009

It happened again: Thanks to Web vulnerabilities, hackers were able to hit the Pentagon. 

According to CNN (http://www.cnn.com/2009/US/04/21/pentagon.hacked/),

Thousands of confidential files on the U.S. military’s most technologically advanced fighter aircraft have been compromised by unknown computer hackers over the past two years, according to senior defense officials.

The Internet intruders were able to gain access to data related to the design and electronics systems of the Joint Strike Fighter through computers of Pentagon contractors in charge of designing and building the aircraft, according to the officials, who did not want to be identified because of the sensitivity of the issue.

In addition to files relating to the aircraft, hackers gained entry into the Air Force’s air traffic control systems, according to the officials. Once they got in, the Internet hackers were able to see such information as the locations of U.S. military aircraft in flight.

This news is quite relevant to my Fall 2009 Web Vulnerability graduate course (http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf)

BTW, the Associate Director of the CS Department at PUPR.edu, also a colleague and friend, Dr. Alfredo Cruz, called me two days ago with some great news: The department has been accredited for 2009-2014 as a National Center of Academic Excellence in Information Assurance Education. Soon they will be listed with members of this exclusive “club” on the National Security Agency web site (http://www.nsa.gov/ia/academic_outreach/nat_cae/institutions.shtml)

An official press release and a formal presentation before the pertinent authorities are being coordinated for the next few weeks.

The next issue of IR Watch – The Newsletter provides additional coverage of this exciting news.

I have tied these two news items together in a single post to underscore the need for courses at the intersection of IR/data mining and Information Security, which is precisely the mission statement of IRW, now reaching more than 300 investigators/research centers.

McAfee Report: Email Spam and the Environment

April 16, 2009

According to a McAfee report,

Until now, spam’s impact has been measured in time, money, and aggravation. It turns out there is a massive environmental impact as well. McAfee recently commissioned climate-change consultant ICF International and spam expert Richi Jennings to calculate the environmental impact of spam. The results that came back were startling: The energy consumed in transmitting and deleting spam is equivalent to the electricity used in 2.4 million U.S. homes, with greenhouse gas (GHG) emissions equivalent to 3.1 million passenger cars (http://resources.mcafee.com/content/NACarbonFootprintSpam)

I first learned about these findings through ABC. Essentially,

Anything powered by electricity also emits greenhouse gases. McAfee researchers say each junk e-mail emits 0.3 grams of the greenhouse gas carbon dioxide (CO2). That may not sound like much, but when you consider the volume of global annual spam, it all adds up. (http://abcnews.go.com/Technology/GlobalWarming/story?id=7343518&page=1).

Following that reasoning, spamdexing search engines and any other adversarial information retrieval (AIR) practice also add insult to injury, as do many other things that come to mind.

I will tell that to students of my Fall 2009 AIRWeb Course.

Hmm, shocking: AIR vs. Environment.

I never thought about such an obvious connection.  :)

When newspapers stick to spam marketing

February 12, 2009

That’s what local newspapers in Puerto Rico are doing: insisting on old spam marketing tactics. Wake up, local webmasters. It is not 1995. Redirections, splash page ads, and keyword spamming in meta keyword tags not only do not work with search engines but also annoy users to no end.

This is exactly what some newspapers like the local El Nuevo Dia newspaper are doing.

Today I typed http://endi.com and was redirected to a full-size splash page ad. Then I had to opt out to be redirected to a content page. Thank you for slapping those annoying ads in my face. Why chase away readers?

Adding insult to injury, when I looked at the source code of their horrible content page (http://www.elnuevodia.com/noticias), it was clear that those who designed the page insist on keyword spamming through meta keyword tags. How many keywords can you count in the following sample?

<meta name=”Keywords” content=”ENDI El Nuevo Dia, periodico, Puerto, Rico, internet, noticias, boricua, Clima, Horoscopo, Coqui, Sapo,Concho,Telefonica,Coqui net,RonNueva York ,Horoscopo,Coqui,Sapo Concho,Telefonica,Coqui net,Ron,Bacardi,Estatus,Sila Maria Calderon,Menudo,Carlos Romero Barcelo,Pedro Rossello,Anibal Acevedo Vila,Daddy Yankee,Tego lderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Marcony,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne,Educacion,Cine,Entretenimiento,Ejercicio,Bienestar,Recetas,Recetarios,Musica,Boricuas,Carlos Arroyo, Islanders,TUTV,Motoristas,Internautas ,Medicos,Historiadores,Venezuela,Santo Domingo,Republica Dominicana,Cuba ,Queens,Manhattan,Bronx,Nueva York ,Espiritualidad,Maestros,Televicentro,Telemundo,Univision,Bellas Artes,Telenovela,Comunidades,Isla,Culebrita,St. Thomas,Isla Nena,Culebra,Isla Mona,Vieques,Filiberto Ojeda,Capitolio,Turismo,Periodico,Periodismo,Veteranos,Marina,Soldados,Senado,Energia Electrica,Plena,Bomba,Reggaeton,Wisin y Yandel,Casinos,Salsa,Construccion,Vacaciones,Jazz,Tito Trinidad,Oscar de la Hoya,Millie Corretjer,Roberto Clemente,San Juan,Miguel Cotto,Becas,Tercera Edad,Museos,Clasificados,Clasificados online,Clasificados en linea,Boletines,Servicios de noticias,Luis A. Ferre,Ferre Rangel,Hepatitis,Dengue,Diabetes,Obituarios,Obesidad,Librerias,Libros,Viejo San Juan,El Morro,Ballaja,Museo de Arte de Ponce,Ponce,Mayaguez,Parque Indigena,Observatorio de Arecibo,Tainos,El Yunque,El tunel Guajataca,Zoologico,RUM,UPR,Sagrado Corazon,Interamericana,Universidades,Colegios,Escuelas,Fajardo,Bahia,Luquillo,Bioluminiscente,La Parguera,Carlos Delgado,Igor Gonzalez,Olga Tanon,Piculin Ortiz,Primera Hora,Zonai,Virtual,Beisbol,Mundial ,Raul Papaleo,Sondeos,Justas,Pavas,Palmas,Munoz Marin,Lyann Puig,Elaine Lopez,Javier Lopez,Remi,Poesia ,Olga Nolla,Rosario Ferrer,Mayra Montero”>

How dumb is that? It is clear these folks have no clue about search engine marketing or how search engines work. Otherwise, why insist on old 1995 practices proven to be a waste today? I question whether they have a clue about optimizing news stories and press releases for search engines.
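If you would rather not count by hand, a short script can do it. The sketch below assumes the tag's curly quotes have been normalized to straight quotes so an HTML parser can read the attributes. It runs on a shortened excerpt of the sample rather than the full tag, and the names `MetaKeywordsParser` and `keyword_stats` are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collect the comma-separated entries of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("name", "").lower() == "keywords":
            entries = attr_map.get("content", "").split(",")
            self.keywords += [e.strip() for e in entries if e.strip()]


def keyword_stats(html):
    """Return (number of keyword entries, sorted list of repeated entries)."""
    parser = MetaKeywordsParser()
    parser.feed(html)
    counts = Counter(k.lower() for k in parser.keywords)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return len(parser.keywords), duplicates


# Shortened excerpt of the tag above, with straight quotes.
excerpt = ('<meta name="Keywords" content="ENDI El Nuevo Dia, periodico, Puerto, '
           'Rico, Horoscopo, Coqui, Sapo,Concho,Horoscopo,Coqui">')
total, repeated = keyword_stats(excerpt)
```

Even this ten-entry excerpt repeats two of its keywords; feeding the function the full tag gives the complete count.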

OK, SEO firms, pitch them, but don’t try to sell them snake oil like keyword density, SEO LSI, rare terms crap, synonym stuffing, etc.

Spammers from Forums.SearchEngineWatch.com?

January 12, 2009

Good question to ask.

Once in a while I receive email spam, but in the last few months I have been getting it, apparently and allegedly, from forums.searchenginewatch.com as private email. I am not sure how big the problem is at this IncisiveMedia property, but lately it is becoming a pain in you know where. Here is the header section of the latest email. Emphasis added in bold text.

Received: from unknown (HELO web-2.rpm.incbase.net) ([62.140.213.243])
by mx10.prw.net with ESMTP; 11 Jan 2009 03:45:05 -0400
Received: from web-2.rpm.incbase.net (localhost.localdomain [127.0.0.1])
by web-2.rpm.incbase.net (8.13.1/8.13.1) with ESMTP id n0B75Lp9028420
for ; Sun, 11 Jan 2009 07:05:21 GMT
Received: (from apache@localhost)
by web-2.rpm.incbase.net (8.13.1/8.13.1/Submit) id n0B75Kfu028419;
Sun, 11 Jan 2009 07:05:20 GMT
Date: Sun, 11 Jan 2009 07:05:20 GMT
To: admin@miislita.com
Subject: New Private Message at Search Engine Watch Forums
From: "Search Engine Watch Forums" <webmaster@forums.searchenginewatch.com>
Auto-Submitted: auto-generated
Message-ID: <200901110719.5efc68459520@forums.searchenginewatch.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
X-Priority: 3
X-Mailer: vBulletin Mail via PHP

The From header claims the message is from their webmaster, but the first Received header points to incbase.net, and its HELO gives an IP to follow.
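Pulling the HELO name and IP out of a Received line like the first one above is easy to script. This is only a sketch: the regular expression covers the `(HELO host) ([ip])` layout seen in this post, with or without the square brackets, and will miss other MTA formats. The name `helo_and_ip` is made up for this example.

```python
import re

# Matches "(HELO host) ([1.2.3.4])" as well as "(HELO host) (1.2.3.4)".
HELO_PATTERN = re.compile(
    r"\(HELO\s+(?P<helo>[^)\s]+)\)\s*\(\[?(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]?\)"
)


def helo_and_ip(received_line):
    """Return (helo_name, ip) from a Received line, or None if the layout differs."""
    match = HELO_PATTERN.search(received_line)
    if match is None:
        return None
    return match.group("helo"), match.group("ip")


first_received = (
    "Received: from unknown (HELO web-2.rpm.incbase.net) ([62.140.213.243]) "
    "by mx10.prw.net with ESMTP; 11 Jan 2009 03:45:05 -0400"
)
helo, ip = helo_and_ip(first_received)
```

The HELO name is whatever the connecting host claimed, while the IP is what the receiving server recorded, so the IP is the safer value to look up.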

A WhoIs using http://whois.domaintools.com/incbase.net  gives:

Registrant:
Incisive Media
28-29 Haymarket House
Haymarket
London, London SW1Y 4RX
GB

Registrar: 000DOM
Domain Name: INCBASE.NET
Created on: 16-APR-08
Expires on: 16-APR-10
Last Updated on: 16-APR-08

Administrative, Technical Contact:
Bartlett, Chris
Incisive Media plc
28-29 Haymarket House
Haymarket
London, London SW1Y 4RX
GB
44.2074849860

Domain servers in listed order:
L3DNS1.VNU.CO.UK
L4DNS1.VNU.CO.UK

And a WhoIs using http://whois.domaintools.com/62.140.213.243  gives

inetnum: 62.140.213.0 – 62.140.213.255
netname: VNU
descr: London office 1st assignment
remarks: all abuse complaints to
remarks: all abuse complaints to
country: GB
admin-c: RD4902-RIPE
tech-c: BJ441-RIPE
status: Assigned PA
mnt-by: MNET-MNTNER
source: RIPE # Filtered

person: Ron Doobay
address: 32-34 Broadwick Street
address:
address: London W1A 2HG
phone: +44 020 7316 9677
fax-no: +44 020 7316 9695
e-mail:
nic-hdl: RD4902-RIPE
source: RIPE # Filtered

person: Byron Jones
address: 32-34 Broadwick Street
London
W1A 2HG
phone: +44 20 7816 9650
e-mail:
nic-hdl: BJ441-RIPE
source: RIPE # Filtered

route: 62.140.192.0/19
descr: Aldgate LONDON POP
origin: AS24867
remarks: Abuse reports to
remarks: Peering contact is
mnt-by: MNET-MNTNER
source: RIPE # Filtered
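As a quick sanity check on the lookup, the HELO IP should fall inside both the inetnum range and the route reported above. A minimal sketch with Python's standard `ipaddress` module follows; the helper names `in_route` and `in_range` are made up for this example.

```python
import ipaddress


def in_route(ip, cidr):
    """True if the address belongs to the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


def in_range(ip, first, last):
    """True if the address lies between first and last, inclusive."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last)


helo_ip = "62.140.213.243"
inside_inetnum = in_range(helo_ip, "62.140.213.0", "62.140.213.255")
inside_route = in_route(helo_ip, "62.140.192.0/19")
```

Both checks come back true for this address, so the IP is consistent with the VNU assignment in the RIPE record.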

It appears spammers got Incisive Media’s and SEW’s number!

AIRWeb2009 Call for Papers

January 8, 2009

Last night I received the following email from the organizing committee of AIRWeb2009 asking me to publicize the event:

Dear Edel,

Thank you again for agreeing to serve on the AIRWeb program committee. We have attached the AIRWeb CFP to this message and would appreciate your assistance in publicizing the workshop. The CFP is also available from the AIRWeb website – http://airweb.cse.lehigh.edu/2009/ .

Best regards,
Dennis Fetterly and Zoltan Gyongyi

This is my third year as a PC Member of AIRWeb. It is a lot of fun reviewing manuscripts to be presented at the event, months before the new anti-spamdexing and anti-adversarial IR practices are disseminated to the general public. Some spammers like to wait and follow what comes out of AIRWeb to then try workarounds. This is a continuous arms race and cat-and-mouse chase.

So, this post goes as follows:

AIRWeb is a series of international workshops focusing on Adversarial Information Retrieval on the Web that brings together both researchers and industry practitioners, to present and discuss advances in the state of the art.

AIRWeb’09 will be co-located with the WWW2009 conference in Madrid, Spain. The workshop proceedings will be made available in the ACM Digital Library.

Important Dates

6 February 2009: Deadline (optional, but helpful) for abstract submissions

13 February 2009: Deadline for paper submissions

20 or 21 April 2009: Date of the workshop

Incidentally, I am observing another new wave of spammers, marketers, and Johnny-come-latelies talking in “IR tongues” to gain some credibility from easy-to-impress folks. Ironically, their audience mostly consists of their peer SEOs. I guess the fight against spammers disguised as marketers never ends.

I wish I could beat the crap out of all these self-proclaimed SEO “experts” every single day through this blog. Fortunately, I have better things to do like conducting research, advising students, writing IRW The Newsletter, preparing a paper for SIDIM 2009, peer reviewing IR manuscripts, and (reality checks-to-pay-bills) taking on enterprise projects.

Data Mining Email Headers Part II

January 6, 2009

This is a follow-up to yesterday’s post. The following trick, discussed in the IRW newsletter, helps you mine email headers, even those from automatic responders and failed-delivery emails.

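The trick is to read email headers from the bottom up, since each server that handles a message adds its own “Received” line on top of the existing ones. The last “Received” line you reach, the topmost one, was written by your own mail server and is more trustworthy than the others, which are forgeable. It records the host that actually handed the message to your server.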
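Here is a small sketch of the bottom-up reading using Python's standard `email` package. The raw message is entirely hypothetical (example.net and example.org hosts, documentation IP ranges), and `received_hops` is a made-up helper name.

```python
from email import message_from_string

# Hypothetical message; continuation lines of a folded header start with a space.
RAW = """\
Received: from relay.example.net (HELO relay.example.net) (192.0.2.10)
 by mx.example.org with SMTP; 6 Jan 2009 05:04:59 -0000
Received: from workstation-7 (unknown [198.51.100.23])
 by relay.example.net with ESMTP id abc123;
 Tue, 6 Jan 2009 05:04:55 +0000
From: someone@example.net
Subject: test

body
"""


def received_hops(raw):
    """Return the Received headers bottom-up, i.e. in the order the mail traveled."""
    message = message_from_string(raw)
    hops = message.get_all("Received", [])
    # Collapse folded whitespace so each hop reads as a single line.
    return [" ".join(hop.split()) for hop in reversed(hops)]


for number, hop in enumerate(received_hops(RAW), start=1):
    print(number, hop)
```

The final entry printed is the one added by the receiving server, which is the line to trust most.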

Here is a technique discussed in IRW that you can use to identify which headers are inserted by your ISP. Send an email to yourself through your ISP account and compare the headers of the sent and received copies. If your ISP is not using an SMTP proxy masquerade, chances are it leaks the name of the workstation used to create the email in the HELO command, along with your ISP name, IP, and possibly other interesting information.

Armed with this information, analyze the headers of emails you receive from automatic responders and “failed to deliver” email messages. Now you know which headers are from your ISP and which are inserted as the email traveled from server to server before reaching your inbox.
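The send-to-yourself comparison can be sketched in a few lines. This assumes you have saved both copies as raw text, one as it left your client and one as it arrived. The messages, the `X-Antivirus` header, and the helper names below are all hypothetical.

```python
from email import message_from_string


def header_names(raw):
    """Lowercased set of header names present in a raw message."""
    return {name.lower() for name in message_from_string(raw).keys()}


def isp_added_headers(sent_copy, received_copy):
    """Headers found in the received copy but not in the copy as sent."""
    return sorted(header_names(received_copy) - header_names(sent_copy))


# Hypothetical probe message as sent, and the same message as it arrived.
SENT = "From: me@example.org\nTo: me@example.org\nSubject: probe\n\nhello\n"
RECEIVED = (
    "Received: from workstation-7 (unknown [198.51.100.23])\n"
    " by smtp.example.org with ESMTP; Tue, 6 Jan 2009 05:04:55 +0000\n"
    "X-Antivirus: ExampleScanner 1.0\n"
    "From: me@example.org\nTo: me@example.org\nSubject: probe\n\nhello\n"
)

added = isp_added_headers(SENT, RECEIVED)
```

Whatever ends up in `added` was inserted somewhere between your client and your inbox, which is the baseline to subtract when reading other people's headers.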

For instance, I recently got an automatic response wherein the HELO says

Received: from unknown (HELO UKMAIL.sportex.com) (213.86.197.130)
by server-3.tower-157.messagelabs.com with SMTP; 6 Jan 2009 05:04:59 -0000

An IP lookup reveals additional information. The rest of the headers also leak interesting stuff.

Data Mining Email Headers

January 5, 2009

The featured article of IRW explains how to access, read, and interpret email headers. Several techniques for tracking down spammers are also disclosed.

We show whether your ISP or email client might be adding headers that unnecessarily disclose important information like the name of the machine used to send an email, your ISP name and IP, your email vendor, which antivirus software your ISP might be using, etc.

For instance, this morning I received the following unsolicited email asking for a link exchange:

Hi,

My name is David Stern, and I am contacting you on behalf of our client ***

*** is London’s most exclusive personal training and therapy centre.

I have visited your site and see that your site is sufficiently related to their domain. It would be great if we can have website *** linked to yours. In lieu of this link, we will provide a link back from one of our best directories and from same Google PageRank page.

The email headers identify in the HELO command the sender’s local machine. I’m disabling the link using asterisks.

Received: from [122.162.66.40] (helo=smtp.net4india.com)
by smtp.net4india.com with smtp (Exim 4.66) <*a href=*mailto:linkmanager@business-onlinedirectory.com”>linkmanager@business-onlinedirectory.com<*/a>)

If HELO is not present, there are plenty of data mining techniques to use.

IRW Sneak Preview: Fraudulent Web Analytics

October 31, 2008

Fraudulent Web Analytics

This post is the monthly sneak preview of the next issue of IR Watch Newsletter, now in its new format.

In this issue, the featured article aims to raise awareness of some of the schemes used to defraud those who make business decisions based on Web Analytics. If you are an advertiser or investor, you must read it. Don’t be gamed by unethical marketers and spammers.

The article exposes how some marketers/spammers engineer the fraud by gaming the wisdom of crowds. We expose how traffic fraud, click-through injections, and form injections are used within viral networks to produce bogus Web Analytics advertisers might be paying for or using to make critical decisions.

The Question of the Month column is dedicated to precision vs. recall.

In the Who is Who in IR section, the late Karen Sparck Jones is featured.

In the Top CS Departments, the CS Dept of Stanford University is featured.

We have a new column dedicated to historical notes on computers, search engines, and IR. In the current column, Hewlett-Packard’s origins are highlighted.

Last, but not least, more IR blogs and graduate theses are listed.

Now, some great news! Please keep reading.

We are currently in negotiations with a local university to co-launch an interesting start-up at the intersection of IR, search engines, and business research.

The way we see it, a bad economy presents opportunities. The time is right for such a unique project.

Getting Ready for AIRWeb2009

October 13, 2008

For the last few years I have served as a PC member of AIRWeb. I just received and accepted an invitation to be a PC member for AIRWeb 2009.

For those of you not familiar with it, the International Workshop on Adversarial Information Retrieval on the Web (AIRWeb) http://airweb.cse.lehigh.edu/ has been held four times, in conjunction with WWW’05, SIGIR’06, WWW’07, and WWW’08.

Topics discussed at the workshops include all forms of search engine spamming and hacking practices. SEO spamming practices are exposed and countermeasures are tested. It is a lot of fun examining in advance manuscripts describing these malicious practices, months before the accepted papers hit mainstream.

Incidentally, the next issue of the IR Watch newsletter features Fraudulent Web Analytics, an article on adversarial techniques. We expose several practices spammers and hackers use to produce fake analytics and to defraud advertisers.

Direct-to-Consumer Drug Advertising: A Waste?

September 2, 2008

According to these news articles,

http://www.washingtonpost.com/wp-dyn/content/article/2008/08/29/AR2008082902646.html

http://afp.google.com/article/ALeqM5iQtDzzee581Ixc7sEbUjWFvcMwjg

direct-to-consumer advertising (DTC ads) of drugs through TV, the Internet, and other media channels is a roughly $5B waste of money. DTC advertising, the study finds, is based on scant data and has only a small effect on drug sales.

Prescription drugs are not like selling aspirin, cereal, or popcorn.

Stephen Soumerai, head of the research group that worked on the study thinks DTC advertising has failed because the process of buying prescription drugs is not like buying over-the-counter medication.

According to Soumerai, a person has to see an ad, get motivated, contact their doctor, show up for an appointment, communicate both the condition and the drug, convince the doctor to prescribe it, and then actually fill the prescription, which is also likely to carry an out-of-pocket cost.

It is certainly not a waste for modern snakeoil sellers (marketers, spammers, and scammers) that eat from the $5B pie. Same pattern as usual: theories made out of thin air and scant data.

I will not be surprised if they soon send a paid researcher or someone with vested interests to write a rebuttal.

On Online Hackers, Marketers, and Criminals

August 19, 2008

Hackers that market themselves are fully getting into the crime scene.

We have seen marketers getting into hacking and vice versa: hackers getting into marketing. Designing web pages that rank high in the search engines for the sole purpose of using these to spread malicious resources and tools is one example. We call them hacketers = hackers + marketers.

Now hackers are getting physical.

Back in March 2008, it was reported that hackers were causing harm to people suffering from epilepsy. Some usability and accessibility marketers are using those incidents to better promote their own services a la your-problem-is-my-opportunity.

Other marketers are creating reputation management problems and then ‘going back through the kitchen’ to market “reputation management” solutions. This scam is no different from the click fraud scam promoted by marketers who are part of a mob organization. Hah, Hah.

Now, we have the news of a hacker allegedly kidnapping and torturing another alleged hacker.

These probably are the first cases of hackers physically hurting others.

What is next? Google worse than ISP Snooping? –as AT&T claims.

Sometimes controlling information is worse than physically controlling others.

Ah, the many faces of opportunism.

Claps and Slaps, the LSI Way

August 4, 2008

Claps

We are happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at the Indian Institute of Technology in Madras, India is using our SVD LSI tutorial as lecture material for his course: CS625, Memory Based Reasoning in AI. http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf 

Another investigator, this time from the cancer research field, congratulated us for the LSI tutorials. Jaime Fernandez Vera from Structural Biology and Biocomputing, Centro Nacional de Investigaciones Oncologicas, Madrid, Spain wrote (contact info removed):

Dear Dr. García:

Thank you very much for making your magnificent practical guides available to the community, especially the one on LSI, which is the one I followed.

Best regards,

Jaime Fernández Vera

Structural Biology and Biocomputing
Centro Nacional de Investigaciones Oncológicas

Our LSI/SVD tutorials are also listed in http://www-timc.imag.fr/Benoit.Lemaire/lsa.html, a huge repository of LSI research resources.

For additional IR resources quoting our tutorials, check the following link at http://www.miislita.com:

http://www.miislita.com/searchito/educational-links.html

Slaps

Talking about LSI…

Spammers who disguise themselves as ethical SEOs and promote LSI crap are now hiding. There is less talk in the blogosphere about “SEO LSI” and “LSI-friendly SEO Optimization” myths. As we always say, these crooks are a black eye to the ethical sector of the search marketing industry.

Their signature seems to be the promotion of crap tools and services like Keyword Density tools, Markov Chain generators (if you believe that crap), TFIDF rarity calculators, “semantic page strength” estimators, lookup lists based on “LSI operators”, etc. What will be their next effort at misleading the public? Latent Dirichlet Allocation (LDA) tools?

However, in an effort to save face, the usual suspects are still doing verbal gymnastics. They are desperate. It is clear that our efforts at exposing these crook marketers through IR knowledge are working.

Many are learning why they should stay away from the incorrect knowledge promoted by marketers that occasionally use IR jargon to pretend they know what they are talking about. These marketers often attempt IR-like talk to promote their image as “experts” before either naive or ignorant followers. We still cannot tell who is dumber, the snakeoil sellers or their groupies. They even game each other.

When we expose SEO myths from their competitors they praise us as long as the debunking works for them, but when their own myths are exposed they get angry at us. Ha, Ha.

 

Posts somewhat related to this post

http://irthoughts.wordpress.com/2007/12/11/perpetuating-lsi-misconceptions/ 

http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/

http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/#comments

http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/

http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/

http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/

http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/

Gaming the Gamers: The SEOs Exposed Story

July 25, 2008

Edward Lewis, from SEO Consultants, is writing an interesting piece on how apparently SEOs are gaming each other. The Sphinn Expose article is available at http://www.seoconsultants.com/sphinn/expose/

We don’t take a position on why he took the challenge or how he gathered the data, but we need to respect Lewis for the time and effort he has put into opening this apparent “can of worms”.

If true, his findings show how SEOs allegedly abuse electronic outlets to promote themselves and/or their peers.

Of course, there are two sides to every story. It will be interesting to hear the version of the alleged gamers.

For links about this SEO “soap opera”, check http://www.seroundtable.com/archives/017750.html

Again, if true, Lewis’ findings are a blow to the credibility of an industry already plagued with reputation problems and spammers disguised as pseudo experts. 

If true, more than a sad story, it is a disgrace for the ethical sector of the industry.

A Spammer on the Loose

July 24, 2008

According to http://www.networkworld.com/community/node/30231 , notorious Spam King Eddie Davidson, who earned millions through his company, Power Promoters, has escaped from a federal prison camp and is now on the loose. See http://denver.fbi.gov/dojpressrel/2008/spamking072208.htm

Davidson made a fortune by providing marketing services for companies by sending huge amounts of email spam. The news reads that “the spamming was designed to promote the visibility and sale of products offered by various companies. Davidson utilized the services and assistance of other individuals who he hired as “sub-contractors” to provide spamming at his direction on behalf of his client companies, the DOJ said”.

I wonder who these companies and “sub-contractors” really are and when they are going to be held accountable. I also wonder if similar crooks are still behind the scene doing the usual activities (email spam, click through fraud, spamdexing, misrepresentation of products and services, etc).

I am not pointing at anyone in particular, but whether you are a traditional marketer, a “bloggeter” (blogger + marketer), or a search engine optimization marketer, think about the consequences of becoming a spam marketer. All of your web trails are under constant watch by the IRS, DOJ, and a few others…

Prospective clients: resist the temptation of doing business with this kind of people. You can also be held accountable for being an accessory.

SEOs and their Exhaustivity Search Myths

July 21, 2008

Some SEOs, in an effort to sell something, gain credibility, or save face, will come up with all sorts of theories made out of thin air about search engines. When not citing themselves, they cite each other’s hearsay, often through their link farms. When caught with their pants down, they will lie or edit qualifiers in their posts. Can you guess who, according to Mike Duz, wrote this?

“Some of those well in the know attribute this to latent semantic indexing, which Google has been using for a while, but recently increased its weighting”. (From the Internet Archive)

According to Duz, this guy later changed his categorical assertion into this:

“Even if they are not using LSI, Google has likely been using other word relationship technologies for a while, but recently increased its weighting”.

Note that in this case changing the qualifier (“has” to “if”) also changes the categorically asserted facts, which is not a minor thing since it flies against credibility. Thanks, Mike.

Answer: (Aaron Wall) 
http://www.internetbusiness.co.uk/09072008/different-google-algos-for-different-keywords/

What a face-saving effort!

Edits of this kind are not new across the Web.

We roast these folks simply because they sell search engine snake oil and lies often to promote themselves, their peers, or some kind of crap tool or service. We do this through IR knowledge. One of our goals is to warn the ethical sector of the search marketing industry about such pseudo experts.

We will hammer their myths any day of the year, which takes us to another persistent myth about how search engines work: the search exhaustivity myth.

SEOs have this idea that when a user submits a query, the system does an exhaustive search through the entire document collection or index to compute term weights and rank documents according to a particular similarity measure. Evidently these folks do not know how an inverted index works. One of the reasons (there are many) for using inverted indexes is to avoid searching through all the documents in a collection. “Jumping” and “intersecting” posting lists is one of the reasons why search engines return results so fast.
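To make the point concrete, here is a minimal sketch in Python of a toy inverted index answering an AND query by intersecting posting lists. The function names, the four sample documents, and the whitespace tokenizer are assumptions made for illustration; this is not how any particular search engine is implemented, and the “jumping” part (skip pointers) is left out to keep the sketch short.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Walk two sorted posting lists in step, keeping ids found in both."""
    i = j = 0
    common = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            common.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return common


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # start from the shortest posting list
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result


# Hypothetical toy collection, for illustration only.
docs = {
    1: "latent semantic indexing tutorial",
    2: "keyword density myths",
    3: "semantic indexing and keyword myths",
    4: "inverted index tutorial",
}
index = build_inverted_index(docs)
print(search(index, "semantic indexing"))  # [1, 3]
print(search(index, "keyword myths"))      # [2, 3]
```

Note that answering the query only touches the posting lists of the query terms. Documents that contain none of those terms are never visited, which is the opposite of an exhaustive search.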

BTW, when we understand how positional inverted indexes work, the benefits of document linearization, a topic we have written on before, become clear.

How an inverted index works is a good topic for IR Watch – The Newsletter.

On the Stability of Global Weights in the Presence of Spam

June 30, 2008

In A Comparison of Document, Sentence, and Term Event Spaces, Blake analyses the stability of various global weight models (IDF, ISF, and ITF) with respect to document and journal corpuses. She uses stratified samples collected based on term frequency information. Abstracts and section partitions of full-text scientific articles were studied.

We are conducting studies along similar lines, but using web collections. Such studies are extremely important in a dynamic environment like the web, which is different from abstracts and scientific journal collections. Web collections often consist of documents of general interest or noisy content. Documents can be designed using dubious practices like keyword repetition techniques. Such techniques, commonly known as keyword spam, can introduce biased term frequency data.

It should be underscored that keywords relevant to products and services are also intentionally repeated across the web. In addition, not all document content found in web collections is valid. Some are near duplicates, have been artificially generated, or are the result of vested alliances like link exchange programs. Thus, the stability of IDF, ISF, and ITF with respect to web collections or subsets of these in such a noisy environment is still an open question.

Any pointers or references to research on this topic are greatly appreciated.

SEOs and their IDF Myths

June 17, 2008

Now that the semester is over we can take on other projects. After a little break from the blog, it is good to be back. We are putting the final touches on this month’s issue of IR Watch – The Newsletter. During the break dozens of new subscribers signed up.

The piece takes on several IDF myths and misconceptions promoted by SEOs and on what IDF is/is not. Here is an excerpt:

One recurrent misconception found across online media channels (search marketing blogs, forums, etc) is the assertion that IDF can be used to assess how important or relevant a term might be to the content of a document. This claim has no basis.

 

It should be stressed that as a measure of term specificity over N, IDF is not a local, but a global measure. IDF evaluates the discriminating power of a term within a collection of documents. A term ti might be relevant or important to the content of a document. However, if this document is part of a collection wherein all documents repeat ti, the term loses its discriminating power since N = ni and IDFi = log(N/ni) = 0.
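The argument above can be checked with a few lines of Python. This is only a sketch: the three-document collection, the whitespace tokenizer, and the choice of base 10 for the logarithm are assumptions made for illustration (the zero result does not depend on the base).

```python
import math


def idf(term, collection):
    """IDF = log(N / n): N documents in total, n of them containing the term."""
    n_docs = len(collection)
    doc_freq = sum(1 for doc in collection if term in doc.lower().split())
    if doc_freq == 0:
        raise ValueError("term does not occur in the collection")
    return math.log10(n_docs / doc_freq)


# Hypothetical collection in which every document repeats the term "cheap".
collection = [
    "cheap flights to madrid",
    "cheap hotels in madrid",
    "cheap car rental",
]

for term in ("cheap", "madrid", "rental"):
    print(term, round(idf(term, collection), 3))
```

Here “cheap” may well be important to each individual document, yet its IDF is zero because it appears in all of them, while the rarer “rental” gets the highest value. IDF says something about the collection, not about any single document.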

Somehow, these marketers are mistaking IDF for the RSJ model or who knows what, possibly, as is often the case, to promote themselves or whatever they sell.

SEOs Scams: LSI, KW, and Markov Chains

June 3, 2008

I’m happy to learn that Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at Indian Institute of Technology Madras, India is using my LSI and Term Vector tutorials for his graduate courses:

http://aidb.cs.iitm.ernet.in/cs625/11.SVD-LSI.pdf

http://aidb.cs.iitm.ernet.in/cs625/10.VectorSpace-model.pdf

It is great to see that more and more IR researchers and graduate students are realizing how certain SEOs have misled the public and their clients; that is, by selling their snakeoil in the form of “LSI optimization” and keyword density services. The most recent scam comes in the form of “markov chain” services. As if they really knew about matrix algebra and Markov chain processes. Same old tricks…

It is not surprising to hear colleagues referring to these SEOs as vulgar crooks and scammers.

Cell Phone Spam

May 19, 2008

Cell phone spam: Hum. Nothing new, but it is more prevalent than ever.

Yesterday a local newspaper (El Nuevo Dia) featured the article “Los spams ahora atacan a los celulares” (“Spam now attacks cell phones”), in which a few local sources were asked about the subject.

Unfortunately they all seem to miss the point.

Telephone companies are undoubtedly making money from spam, and quite a lot. So, why kill the money making machine? Duh!

Don’t just take my word. Look around for a second opinion like this one:

Verizon Won’t Help You Filter Out SMS Spam Because It Makes Them Money

If that is not enough, then check why

Angry Customers Sue T-Mobile Over Texting Charges.

Indeed, cell phone spam “is the perfect storm of annoying attributes. It audibly interrupts your life like telemarketing”.

Pink Keywords: Optimization of Resumes and Job Applications

May 5, 2008

The current slump in the US and PR economy, with so many local employers giving pink slips, makes me think of the importance of pink keywords.

These are keywords one would use to optimize resumes and job applications.

Now more than ever, recruiters, middle management, and HR departments need to look through zillions of resumes, looking for specific clues in the form of pinky keywords. This means that resumes and job applications must be optimized for such terms.

http://career-advice.monster.com/resume-writing-basics/Keyword-Challenge/home.aspx

The best way of finding good pinky keywords consists of selling employers their own crappy ads and job offers back to them; that is, scanning employment ads, job offerings, and classifieds relevant to the target position and then using the target terms in your own resume. Another thing one can do is expand these with related or contextual terms; of course, using only those that match your own experience and skills.

I see here an opportunity for ethical SEO companies to provide a valuable and noble service: Pinky Optimization. At the same time I see an opportunity for crook SEOs and spammers to prey on other people’s misfortune. Since many in the seosphere have been disposed of by fat cats and sold(soul)-outs, these folks are also job searching. Life’s ironies.

For SEO Spammers: AIRWeb 2008 Presentations

April 29, 2008

To facilitate mainstream dissemination of the manuscripts presented at AIRWeb 2008 here are the papers as listed over at http://airweb.cse.lehigh.edu/2008/program.html

SEO spammers, whether your life gravitates around a “social network circus” or “link building” or not, it is time to revisit your drawing board.

8:30 – 10:00

10:30 – 12:00

13:30 – 15:00

15:30 – 17:00

  • Web Spam Challenge
    • (5 min.) Description of the challenge
    • (12 min.) Data Analysis School, Moscow slides
      Konstantin Bauman, Alexey Brodskiy, Sergey Kacher, Elmira Kalimulina, Ruslan Kovalev, Mikhail Lebedev, Dmitry Orlov, Pavel Sushin, Pavel Zryumov, Dmitry Leshchiner and Ilya Muchnik
    • (12 min.) Computer and Automation Research Institute, Hungarian Academy of Sciences slides
      David Siklosi, Andras Benczur
    • (12 min.) Institute of Automation, Chinese Academy of Sciences, Beijing slides
      Guanggang Geng, Xiaobo Jin and Chunheng Wang
    • (5 min.) Announcement of results
  • Panel
    • (45 min.): The Future of Adversarial IR on the Web
      Amit Aggarwal, Zoltán Gyöngyi, Alexandros Ntoulas, Erik Selberg, and Andrew Tomkins

SEOs – Desperately Seeking Clients

April 24, 2008

From time to time I receive unsolicited emails from SEOs offering me their services to list my site in the major search engines and directories. They often send template-like automatic messages (“Dear website owner”) and appear not even to bother to check whether recipients need the service.

These SEOs often look desperate and sound like snakeoil sellers and crooks. They even claim to be better than other SEOs.

They often pitch the same crap:

  • “I recently visited your site” (Really? Why then send this crap?).
  • “you are not listed in the top search engines and directories” (Really? How do they know?).
  • “we can increase your traffic by X astronomical amount” (Really? Could you double X for me, please?).
  • “we can help you get top rankings in Google” (Really? For which keywords?).
  • “our link building program” (Really? Read here link exchange and link spam).
  • “we have proprietary crap, blah, blah, …” (Really? Sell it or get a patent!).

I received one such email last night, even though my site is known in the IR/SEO spheres, has been listed for many years in the top search engines and directories, and ranks well.

Dear website owner,

I visited your website and noticed that you are not listed in many of the major search engines and directories. If our company can increase your traffic up to 500% by getting you top ranking results on the search engines such as Google would you be interested? We specialize in link building content writing and programming. We have proprietary techniques that work better and are less expensive than any other SEO firm.

Please let me send you a proposal and show you how we can make your website profitable.

Sincerely,

Christian Frank

2060 AVENIDA DE LOS ARBOLES, STE D
THOUSAND OAKS,
CA 91362-1361 – USA

This is the type of company that gives a black eye to the SEO industry. If SEOs send you this type of crap, I feel your pain. Stay away from their businesses or whatever they claim or seem to offer.

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the consensus is that it is not useful.

Two issues strike me, as they suggest a lack of understanding of how search engines work. They correspond to the following questions:

1. Could KD be used by search engines or users to check for keyword spam?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for keyword spam?

Word repetition that search engines determine to be keyword spam should be of more concern than what web page creators or a KD tool tag as keyword spam. After all, search engines, not designers of web pages, are the ones that assign a rank to documents. This ties in with the user-machine relevance perception mismatch and the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users count for nothing, as they are the ones that pay the bills. And even if they don’t, why rank a page high just to see users go somewhere else after visiting it because it is not suitable for human consumption? So, rather than using a KD tool, just write as naturally and usefully for your prospective clients and readers as you can.

Regarding the use of KD tools to check for spam, this claim reminds me of certain SEO books, marketers, and community forums that insist on such nonsense just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked these and similar SEO myths on an almost routine basis. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening the effect of keyword repetition. Two for the price of one!

This is more cost-effective at neutralizing keyword repetition than computing and comparing against a whole new and extra ratio, KD. Best of all, it does not require the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scale of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.
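A short Python sketch shows the attenuation at work. Two things here are assumptions, not statements from the course material: keyword density is taken as the term count divided by the total word count, as a percentage, and LOGA is taken to mean 1 + log10(f) for a term frequency f > 0, a reading that is consistent with the answer given for exam question #16 later in this post. The 90 “filler” words are made up for illustration.

```python
import math


def keyword_density(term, words):
    """Assumed KD definition: share of the words that are the term, in percent."""
    return 100.0 * words.count(term) / len(words)


def freq_weight(f):
    """FREQ: the raw term frequency, with no attenuation."""
    return float(f)


def loga_weight(f):
    """LOGA, assumed here to be 1 + log10(f) for f > 0 and 0 otherwise."""
    return 1.0 + math.log10(f) if f > 0 else 0.0


# Hypothetical page: 90 other words plus a growing number of repetitions.
filler = ["filler"] * 90
print("repeats  KD%    FREQ    LOGA")
for repeats in (1, 10, 100, 1000):
    words = ["spam"] * repeats + filler
    f = words.count("spam")
    print(repeats, round(keyword_density("spam", words), 1),
          freq_weight(f), round(loga_weight(f), 2))
```

Under these assumptions, going from 1 to 1000 repetitions multiplies the raw frequency by 1000 but only moves the LOGA weight from 1 to 4. The repetition is dampened by the local weight itself, with no separate density ratio to compute or compare against.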

To be sure students understood, I included the following three questions in the multiple-choice section of the Final Exam. (The problem-solving section of the test is even more interesting, but is too long to include here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times should this term be repeated in d2 to triple its d1 weight? Assume log base 10.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is nonsense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, this Spring I’ll be teaching the graduate course Search Engines Architecture.

I will blog the syllabus tomorrow, but it is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some components of their own authorship.

On and on, SEOs still have no clue about what a search engine can or cannot do.

AIRWeb-2008 Last Call for Papers & Extension Deadline

February 25, 2008

AIRWEB organizers have instructed me to disseminate the following Final Call For Papers and deadline extension. Let’s fight the spammers and those disguised as SEOs.

We have extended the submission deadline until March 2, 2008. We would appreciate any assistance in disseminating this extension. The text version of the final call for papers is below and the pdf version is attached.

Best regards.
Carlos Castillo
Kumar Chellapilla
Dennis Fetterly

FINAL CALL FOR PAPERS and 9 day extension
Fourth International Workshop on
Adversarial Information Retrieval on the Web
http://airweb.cse.lehigh.edu/2008/

IMPORTANT DATES

02/March/2008 : Deadline for research articles
31/March/2008 : Deadline for challenge submissions
22/April/2008 : Workshop at the WWW 2008 conference in Beijing, China

Contents:

1. AIRWeb’08 Topics
2. Web Spam Challenge
3. Timeline
4. Organizers and Program Committee

1. AIRWEB’08 TOPICS

Adversarial Information Retrieval addresses tasks such as gathering,
indexing, filtering, retrieving and ranking information from
collections wherein a subset has been manipulated maliciously. On the
Web, the predominant form of such manipulation is “search engine
spamming” or spamdexing, i.e., malicious attempts to influence the
outcome of ranking algorithms, aimed at getting an undeserved high
ranking for some items in the collection.

We solicit both full and short papers on any aspect of adversarial
information retrieval on the Web. Particular areas of interest
include, but are not limited to:

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging
* Ping spam

Proceedings of the workshop will be included in the ACM Digital
Library. Full papers are limited to 8 pages; work-in progress will be
permitted 4 pages. Papers should be formatted using the WWW2008
proceedings style and submitted via
http://www.easychair.org/conferences/?conf=airweb2008

For more information, see http://airweb.cse.lehigh.edu/2008/

2. WEB SPAM CHALLENGE

Last year we introduced a novel element at the workshop: a Web Spam
Challenge for testing web spam detection systems. We will be holding
the Web Spam Challenge again this year, using the WEBSPAM-UK2007
collection for Web Spam Detection http://www.yr-bcn.es/webspam

The collection includes a large set of web pages, a web graph, and
human-provided labels for a set of hosts. We will also provide a set
of features extracted from the contents and links in the collection,
which may be used by the participant teams in addition to any
automatic technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions
(normal/spam) for all unlabeled hosts in the collection. Predictions
will be evaluated and results will be announced at the AIRWeb 2008
workshop.

For more information, see

3. TIMELINE

- 15 February 2008: E-mail intention to submit a workshop paper
  (optional, but helpful)
- 02 March 2008: Deadline for workshop paper submissions (all day
- 24 March 2008: Notification of acceptance of workshop papers
- 31 March 2008: Challenge submissions due
- 07 April 2008: Camera-ready copy due
- 22 April 2008: Date of workshop

4. ORGANIZERS AND PROGRAM COMMITTEE

Organizers

- Carlos Castillo, Yahoo! Research
- Kumar Chellapilla, Microsoft Live Labs
- Dennis Fetterly, Microsoft Research

Program Committee

- Einat Amitay, IBM
- Andras Benczur, Hungarian Academy of Sciences
- Paul-Alexandru Chirita, Uni Hannover
- James Caverlee, Texas A&M University
- Gordon Cormack, University of Waterloo
- Nick Craswell, Microsoft Research
- Matt Cutts, Google
- Brian Davison, Lehigh University
- Ludovic Denoyer, University Paris 6
- Aaron D’Souza, Google
- Edel Garcia, Mi Islita.com
- Natalie Glance, Nielsen BuzzMetrics
- Antonio Gulli, Ask.com
- Zoltan Gyongyi, Stanford University
- Monika Henzinger, Google
- Pranam Kolari, Yahoo! Applied Research
- Mark Manasse, Microsoft Research
- Marc Najork, Microsoft Research
- Alexandros Ntoulas, Microsoft Search Labs
- Jan Pedersen, Yahoo! Search
- Erik Selberg, Amazon
- Torsten Suel, Polytechnic University
- Mike Thelwall, University of Wolverhampton
- Baoning Wu, Snap
- Tao Yang, Ask.com

Search Engines for Penetration Testing

February 21, 2008

Well, I’m getting ready for my talk this afternoon at University of Turabo. I’ve organized the talk in three parts:

 Part 1: Spam and Fraud through Search Engines

Part 2: Gathering Intelligence through Search Engines

Part 3: Identity Theft through Search Engines

A disclaimer will be necessary to indicate that the information to be presented is for educational purposes, only.

This is going to be a nice one. I hope to see old friends.

Web Mining, Search Engines, and Information Security

February 15, 2008

This Thursday, the 21st, I’ll be presenting the following talk before the faculty of University of Turabo, Gurabo, PR:

Web Mining, Search Engines, and Information Security

I hope to see old friends there. Here is the abstract of my talk:

Web Mining is a research area of Data Mining wherein the Web is the “database” and search engines are the “user’s interface”. End-users can resort to search engines for all sorts of things. For instance, marketers can use search engines to gain traffic by ranking Web pages high for specific queries, hence enhancing the online presence of businesses, products, and services (search engine optimization, SEO). Spammers can inundate search engine indexes to deceive searchers (spamdexing). Hackers can attempt to rank high documents that lead to security risks (hacketers, hacketering) or use all forms of injections (links, forms, scripts, redirections, etc). Terrorists and criminals can use search engines for all sorts of crime-enabling activities, for instance, stealing private information like SSNs, passwords, and student and user IDs, gaining access to “private” documentation, stalking people, etc.

This talk covers these and other aspects of search engines: the Good, the Bad, and the Ugly. The speaker will then talk about his own research projects in the area of Web Mining, Search Engines, and Intelligence. A disclaimer will be necessary to indicate that the information to be presented is for educational purposes only.