Random notes prior to 4th July weekend

July 3, 2009 by E. Garcia

As the 4th of July weekend approaches, here are some notes before heading off to planet oblivion.

1. Yesterday we had an interesting business entrepreneur meeting with the CIO of the Government of Puerto Rico at El Palacio Rojo, Fortaleza.

2. IRW should be out by Monday. Main article: Data Mining Texting.

3. Only monkeys still believe in KD Myths. Ha, Ha.

Official: MIC Puerto Rico

June 23, 2009 by E. Garcia

Back in April, I mentioned that Microsoft would be co-launching the Microsoft Innovation Center (MIC) of Puerto Rico with the Interamerican University of Puerto Rico, Metropolitan Campus.

Well, tomorrow is the official inauguration. The university has generously provided me with lab and office space to start an interesting research project within the MIC building. This is exciting news. I cannot comment much about the project, except to say that it is at the interface of search engines, social networks, and information security.

It looks like I will have my hands full between working at two universities, blogging, and doing consulting work.

IR Videos in Spanish

June 22, 2009 by E. Garcia

I normally do not put my lecture notes (ppt, pdf, videos) online. However, event organizers taped two of my public conferences. Both last over one hour and are in Spanish, but with slides in English. Here are the links. The quality of the videos is so-so.

Since the videos were made available a few months after the events, they are not properly dated. I have included the actual dates of the events below. If you don’t know Spanish, you are out of luck.

1. Understanding Search Engines (Entendiendo a los Buscadores), University of Puerto Rico, Bayamon, 4-23-2008

http://video.google.com/videoplay?docid=-653964730907023811

This one lasts for about two hours. The audience consisted of grad students and researchers. Unfortunately, the video has an audio-visual mismatch of about one slide. If you can cope with this, I hope you like it.

2. Demystifying LSI (Desmitificando LSI), OJOBuscador Congress, Madrid, Spain, 3-09-2007

http://www.ojotube.com/videos/congreso-ojobuscador-2007-ponencia-desmitificando-lsi-de-dr-e-garcia/

This one lasts for over one hour. Since it was for a non-scientific audience (mostly Spanish SEOs), I tried to talk very slowly.

What is a Similarity Matrix?

June 16, 2009 by E. Garcia

Sooner or later, CS students, particularly those in IR, will need to deal with similarity matrices.

In simple terms, any matrix M that exhibits the following five characteristics is a similarity matrix.

Squaredness = M must have the same number of rows and columns.
Non-Negativity = all elements of M must be real, non-negative numbers.
Boundedness = all elements of M must adopt values between 0 and 1.
Reflexivity = all diagonal elements of M (i.e., from top left to bottom right) must be equal to 1.
Symmetry = each ij element must be identical to the corresponding ji element.

A matrix that fails to exhibit even one of these characteristics is not a similarity matrix.
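The five characteristics can be turned into a quick check. The following is only a minimal sketch in plain Python; the function name is_similarity_matrix, the tolerance tol, and the two sample matrices are assumptions made for illustration and are not part of the original quiz.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def is_similarity_matrix(m: Matrix, tol: float = 1e-9) -> bool:
    """Return True if m shows all five characteristics listed above."""
    n = len(m)
    # Squaredness: same number of rows and columns.
    if n == 0 or any(len(row) != n for row in m):
        return False
    for i in range(n):
        # Reflexivity: every diagonal element equals 1 (within tol).
        if abs(m[i][i] - 1.0) > tol:
            return False
        for j in range(n):
            # Non-negativity and boundedness: 0 <= m[i][j] <= 1 (within tol).
            if not (-tol <= m[i][j] <= 1.0 + tol):
                return False
            # Symmetry: element ij equals element ji (within tol).
            if abs(m[i][j] - m[j][i]) > tol:
                return False
    return True


# Hypothetical examples, not taken from any real data set.
symmetric_bounded = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.5],
    [0.7, 0.5, 1.0],
]
not_symmetric = [
    [1.0, 0.3],
    [0.4, 1.0],
]

print(is_similarity_matrix(symmetric_bounded))  # True
print(is_similarity_matrix(not_symmetric))      # False
```

The tolerance is there only because computed similarities are floating-point numbers; with exact values it can be set to 0.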

Accordingly, some matrices found in the LSI literature, whose elements have been referred to as similarities, are not similarity matrices, since they do not conform to the above definition.

Note. This information will help those that took the IR Quiz on Matrices to realize how well they did.

Computing Co-Occurrence Matrices with Excel

June 5, 2009 by E. Garcia

The QA column of the current issue of IR Watch – The Newsletter features the following question:

Question: In Excel, how do you convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?

Answer:

Let A be a matrix populated with term occurrences (frequencies).
Let A^T be its transpose.

Then, T = AA^T is a term-term co-occurrence matrix, and D = A^TA is a document-document co-occurrence matrix.

The following table emulates an Excel spreadsheet.

 

     A           B    C    D
1    A =         d1   d2   d3
2    t1          0    1    0
3    t2          0    0    1
4    t3          1    1    1
5
6    T = AA^T    t1   t2   t3
7    t1          1    0    1
8    t2          0    1    1
9    t3          1    1    3
10
11   D = A^TA    d1   d2   d3
12   d1          1    1    1
13   d2          1    2    1
14   d3          1    1    2

In the table, T was computed by selecting a destination array (B7:D9), entering in its first empty cell (B7) the formula =MMULT(B2:D4,TRANSPOSE(B2:D4)), pressing the F2 key and then the Ctrl+Shift+Enter keys.

Similarly, D was computed by selecting a destination array (B12:D14), entering in its first empty cell (B12) the formula =MMULT(TRANSPOSE(B2:D4),B2:D4), pressing the F2 key and then the Ctrl+Shift+Enter keys.

That was easy!

Note that neither of these is a similarity matrix. Can you tell why?
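For readers who want to cross-check the spreadsheet outside Excel, here is a minimal sketch in plain Python that repeats the two products on the same matrix A. The helper names transpose and matmul are assumptions introduced for this illustration; they play the role of Excel's TRANSPOSE and MMULT.

```python
from typing import List

Matrix = List[List[int]]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns (the role of Excel's TRANSPOSE)."""
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product (the role of Excel's MMULT)."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Term-document occurrence matrix from the table: rows t1..t3, columns d1..d3.
A = [
    [0, 1, 0],  # t1
    [0, 0, 1],  # t2
    [1, 1, 1],  # t3
]

T = matmul(A, transpose(A))  # term-term co-occurrence matrix
D = matmul(transpose(A), A)  # document-document co-occurrence matrix

print(T)  # [[1, 0, 1], [0, 1, 1], [1, 1, 3]]
print(D)  # [[1, 1, 1], [1, 2, 1], [1, 1, 2]]
```

Both results agree with rows 7-9 and 12-14 of the table.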

IRW-2009-6: Hackers: Taxonomy & Writing Styles

June 1, 2009 by E. Garcia


The current issue of IRW should reach subscribers’ inboxes during the day or, at the latest, tomorrow.

In this issue:

  • Featured article: Hackers: Taxonomy and Writing Styles
    Due to the increasing interest in developing Information Retrieval and Data Mining courses that intersect with Information Security, this issue of the newsletter covers a brief taxonomy of hackers and their writing styles.
  • QA: Excel Matrix Multiplications: How to convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?
  • Vacuum Tubes & Transistors Historical
  • Who is Who in IR: Thomas K. Landauer
  • Top CS Departments: Dartmouth College
  • Outstanding Graduate Theses
  • Calls and Events
  • IR Blogs
  • and more…

On Term Repetition and Local Models

May 27, 2009 by E. Garcia

I’m putting together a piece on several local term weight models. It should be ready in a few weeks.

It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.

The article touches on some aspects of the problem of trusting models that lack attenuation. Here is one snippet on the subject:

<last nail in KD coffin  style=”intensity:100%;”>

“It should be stressed that term repetition does not necessarily satisfy users’ queries, nor is it evidence of:

 Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.

Aboutness (A); e.g., that the document is x times more about the term.

Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.

Relevance (R); i.e., that a document repeating a term x times is x times more relevant.

Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks attenuation.”

</last nail in KD coffin>
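To make the idea of attenuation concrete, here is a small sketch in Python. The snippet above does not name a specific model, so the logarithmic weight 1 + ln(f) is only one commonly used attenuated form, chosen here as an assumption; the function names are hypothetical too.

```python
import math


def raw_weight(f: int) -> float:
    """No attenuation: the weight grows linearly with the term frequency f."""
    return float(f)


def log_weight(f: int) -> float:
    """Assumed attenuated model: 1 + ln(f) for f > 0, and 0 for an absent term."""
    return 1.0 + math.log(f) if f > 0 else 0.0


# Compare how both weights react to heavier and heavier repetition.
for f in (1, 2, 10, 100):
    print(f, raw_weight(f), round(log_weight(f), 2))
```

Under the raw model, a term repeated 100 times weighs 100 times more than a term that occurs once. Under the attenuated one it weighs less than 6 times more, so repetition alone cannot dominate the score.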

Defining Data Mining and Database

May 25, 2009 by E. Garcia

What is the (^H^H^H) best definition for data mining and database? It depends on who you ask and in which context.

According to Section 126 of the USA Patriot Act,

(1) DATA-MINING- The term `data-mining’ means a query or search or other analysis of one or more electronic databases, where

(A) at least one of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement;

(B) the search does not use personal identifiers of a specific individual or does not utilize inputs that appear on their face to identify or be associated with a specified individual to acquire information; and

(C) a department or agency of the Federal Government is conducting the query or search or other analysis to find a pattern indicating terrorist or other criminal activity.

(2) DATABASE- The term `database’ does not include telephone directories, information publicly available via the Internet or available by any other means to any member of the public, any databases maintained, operated, or controlled by a State, local, or tribal government (such as a State motor vehicle database), or databases of judicial and administrative opinions.

Asking the government or a KDDM researcher the question and using LSI to cluster the results for the above question can be a futile exercise.

It is like asking President Obama and Vice President Cheney to agree on: “What is Torture?”

When Noise is a Good Thing.

May 22, 2009 by E. Garcia

Today, a reader (name removed to protect confidentiality) asked me:

My name is **** ****. I working as a junior research fellow in a project in India. I red the SVD techniques from the web page http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html#right-eigenvectors. I found it is quite satisfactory for me. Now I can understand how SVD works. But I have a query as follows.

query:

As mentioned in this tutorial that we have arrange these eigen-values in descending order. Cold you please tell me if I put these values in ascending order or arbitrary what will be wrong with the SVD.

Looking forward your early kind response.

Thanking you.

With best regards.

*******

My answer follows.

It depends on what you are trying to address.

SVD is used to identify singular values, which are interpreted as dimensions. When SVD is used as a dimensionality reduction technique, the largest N singular values are normally retained, so retaining the smaller singular values is meaningless. The largest singular values capture most of the information in the original data set, so retaining them is a noise minimization approach.

If the retention criterion is reversed (the smaller singular values are retained), the noisier dimensions are kept, so the reconstructed matrix will be a matrix of the hidden (latent) data noise. This is a noise maximization approach.

If the retention criterion is based on a random selection, the resultant reconstructed matrix might be one representing a data structure with randomized noise.

These scenarios depend on the original data under examination. 
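The retention scenarios above can be sketched with a toy example in plain Python. Everything in it is an assumption made for illustration: the two singular triplets are hypothetical (chosen so that the left vectors are orthonormal and the right vectors are orthonormal), and the helper names are not from the tutorial. The sketch rebuilds a matrix from retained triplets and measures how far the result is from the full reconstruction.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[float, Vector, Vector]  # (singular value, left vector, right vector)


def reconstruct(triplets: Sequence[Triplet], rows: int, cols: int) -> List[List[float]]:
    """Sum sigma * u * v^T over the retained triplets."""
    m = [[0.0] * cols for _ in range(rows)]
    for sigma, u, v in triplets:
        for i in range(rows):
            for j in range(cols):
                m[i][j] += sigma * u[i] * v[j]
    return m


def frobenius_distance(a: List[List[float]], b: List[List[float]]) -> float:
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Hypothetical decomposition, listed in descending order of singular value.
triplets = [
    (4.0, (0.6, 0.8), (0.8, 0.6)),
    (1.0, (-0.8, 0.6), (-0.6, 0.8)),
]

full = reconstruct(triplets, 2, 2)
keep_largest = reconstruct(triplets[:1], 2, 2)    # usual reduction
keep_smallest = reconstruct(triplets[-1:], 2, 2)  # reversed criterion

print(round(frobenius_distance(full, keep_largest), 6))   # 1.0
print(round(frobenius_distance(full, keep_smallest), 6))  # 4.0
```

In this toy case, dropping the smallest singular value moves the reconstruction away from the full matrix by exactly that value, while dropping the largest moves it away by the largest. The full sum itself does not depend on the order in which the triplets are listed, as long as each singular value stays paired with its own vectors. It is the retention step that makes the ordering matter.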

In Image Compression, these approaches have already been explored. If the goal is a stability study and not just SVD dimensionality reduction, “the ratio between the highest singular value and the lowest singular value of the Jacobian matrix quantifies the spread of the Jacobian’s singular values, which in practice, reflects the extent of the solution’s instability with respect to small changes in the observation” (Horesh’s thesis).

Having said all that, we should not regard noise in a data set as something that must be discarded at all costs.

This is intimately linked with the so-called Inverse Problem. Incorporating noise and a priori SVD information can provide the complete information in a linear sense. Qianqian Fang has a beautiful PPT presentation “Look Closer to Inverse Problem” on the subject. If you want to visualize the MATRIX Problem, this presentation is for you.

I’m thinking of putting together a tutorial on the Singular Value Expansion algorithm (SVE), if I ever find the time.

I hope this helps.

Ethical Hacking: An Oxymoron, a Misnomer, or Both?

May 18, 2009 by E. Garcia

According to a report from the British Computer Society (BCS) covering a Security Panel Strategic Forum, “ethical hacking” is an oxymoron.

The report highlights dos and don’ts when it comes to defining terms like “hacker”, “ethical hacking”, “penetration tester”, “white/black hats”, and derivative terms. These labels are frequently used in the IT industry. The report also underscores which terms should not be used by schools offering IT courses.

The problem with defining and redefining such labels is that there will always be others disagreeing with/circumventing said definitions.

For instance, in the December 1986 issue of MicroTimes, Bob Bickford wrote:

“A Hacker is any person who derives joy from discovering ways to circumvent limitations.”

If we accept this definition, then a person who doesn’t derive any joy from discovering ways to circumvent limitations is not a hacker. Similarly, a spouse cheater, an SEO, a spammer, a politician, a mobster, or a kid trying to get some candy from mom is a hacker.

I am taking this extreme, off-topic interpretation to illustrate the problem of semantics when it comes to defining things.

Whether you agree or disagree, partially or totally, with the report, it is a good read. For sure it will be a good piece for students planning to take my AIRWeb graduate course.

Google Accused of Conversion-Inflation Syndication Fraud

May 15, 2009 by E. Garcia

According to Ben Edelman, Google is engaged in a conversion-inflation syndication fraud.

These tactics are nothing new.

In the featured article of the November 2008 issue of IR Watch, “Fraudulent Web Analytics – Engineering the Fraud”, we covered how in-the-middle mechanisms are part of Web Analytic Frauds and Business Collusion Schemes.

As in man-in-the-middle attacks found in information security settings, the underlying goal is the same: the crafting of deceiving intermediary events.

Expect a PR damage-control campaign from the useful idiots/moles soon.

What is next? A class action lawsuit?

Still, I have a little taste of satisfaction in my mouth when crooks disguised as advertisers/search marketers are gamed. Gaming the gamers: life’s ironies!

IR Quiz: Matrices

May 13, 2009 by E. Garcia

Explain and give an example of each of the following matrices used in IR:

1. Term-document occurrence matrix.

2. Term-term co-occurrence matrix.

3. Term-term correlation matrix.

4. Term-term similarity matrix.

5. Term-term coweights matrix.

6. Term-term distance matrix (*).

7. Covariance matrix (*).

 

(*) PS. I forgot to list these other matrices.

Vector Normalization with Excel – Part II

May 7, 2009 by E. Garcia

Back in March, we explained how to normalize column vectors with Excel. But what about normalizing row vectors? This question is addressed in the current QA column of IRW. I think it might be useful to share the answer with readers, since many of them are students struggling with similar questions. So, here we go.

The following table emulates an Excel array consisting of three columns (A, B, and C) and six rows (1-6).

  A B C
1 1 2 3
2 4 5 6
3 7 8 9
4 0.27 0.53 0.80
5 0.46 0.57 0.68
6 0.50 0.57 0.65

Rows 1, 2, and 3 are row vectors. Rows 4, 5, and 6 are the corresponding normalized vectors, also known as unit vectors because their length is 1. To compute these, do as follows:

1. In cell A4, enter the formula =A1/(SQRT(SUMSQ($A1:$C1))). The result should be as given in this cell.

2. Copy this formula, select cells A5 and A6, and paste the formula into them.

3. Finally, copy cells A4 through A6 at once, select the remaining empty cells of the array, i.e., cells B4 through C6, and paste the formulas into them.
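The same steps can be sketched outside Excel. The short Python example below is only an illustration; the function name normalize_rows is an assumption, and it assumes that no row consists entirely of zeros (such a row would have length 0 and cannot be normalized).

```python
import math
from typing import List


def normalize_rows(rows: List[List[float]]) -> List[List[float]]:
    """Divide each row by its Euclidean length, like =A1/(SQRT(SUMSQ($A1:$C1)))."""
    result = []
    for row in rows:
        length = math.sqrt(sum(x * x for x in row))  # same role as SQRT(SUMSQ(...))
        result.append([x / length for x in row])     # assumes length > 0
    return result


# Rows 1-3 of the table above.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

for unit in normalize_rows(data):
    print([round(x, 2) for x in unit])
```

Rounded to two decimals, the printed rows agree with rows 4-6 of the table (Python drops trailing zeros, so 0.80 is shown as 0.8).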

NSA/DHS Designates PUPR as a CAE

May 5, 2009 by E. Garcia

As blogged yesterday, the current issue of IRW should reach subscribers’ inboxes today. The Top CS Departments column features Polytechnic University of Puerto Rico, where I teach graduate courses. As mentioned a few days ago, PUPR has been designated a CAE. This is great news that is making a splash across academic centers within the U.S., the Caribbean Region, and Latin America whose mission is research relevant to homeland security.

Associate Director for Computer Science, Dr. Alfredo Cruz, sent me an official announcement, which I am reproducing.

Polytechnic University of Puerto Rico (PUPR) is Designated National Center of Academic Excellence in Information Assurance Education by NSA and DHS. PUPR was recently designated as a National Center of Academic Excellence in Information Assurance Education (CAE/IAE) by the National Security Agency (NSA) and the Department of Homeland Security (DHS) on April 22, 2009. The goal of these centers is to reduce the vulnerability of the national information infrastructure by promoting higher education and research in Information Assurance (IA) and Security through the development of a growing number of professionals with IA expertise in various related disciplines. PUPR will be recognized as the first institution in Puerto Rico to be designated as a CAE/IAE on June 3, 2009 in Seattle, Washington. Dr. Alfredo Cruz from the Department of Electrical & Computer Engineering and Computer Science will be present to receive the designation. He is the Director of the Center of Information Assurance for Research and Education (CIARE) at PUPR. Dr. Cruz is the person responsible for this designation. PUPR is one of the very few Hispanic-serving institutions (HSIs) in the Nation to receive this designation, and to become one of the first 100 institutions nationwide; this is a very special recognition. This designation requires that the President of the United States send the Governor of Puerto Rico a certification that should be handed to the president of PUPR designating the Institution as a CAE/IAE at a National level. The Congress and all the respective Congressional Committees are also notified.

Some of the benefits of the CAE/IAE designation are:
• PUPR will receive formal recognition from the U.S. Government as well as opportunities for prestige and publicity for our role in securing the Nation’s information systems.
• This designation increases collaboration opportunities between designated and aspiring institutions at local and national levels. This includes internships, faculty and student exchange, research, and publications, among other activities.
• With this designation as a CAE/IAE PUPR can obtain scholarships that can help outstanding students to pursue graduate studies in IA, enabling them to work with the Federal Government or other federal institutions and agencies.
• PUPR can compete and benefit from proposal calls (RFP) that are specifically for designated CAE/IAE institutions. These proposals offer millions of dollars from the DoD, NSF, NSA and “Homeland Security”, among others, for research and infrastructure.
• Student scholarships offered under the NSF’s Scholarship for Service (SFS) program. The SFS scholarship offers the following:
–2-year scholarship, includes 8K stipend (12K for graduate students), plus tuition and nominal room and board expenses.
–Paid summer internship in a federal agency.
–Placement in federal government at the end of the scholarship period.

IRW: RIA Vulnerabilities

May 4, 2009 by E. Garcia

The current issue of IRW should reach subscribers’ inboxes tomorrow.

In this issue:

Featured article: RIA Vulnerabilities

This issue of the newsletter discusses how hackers might be exploiting Web vulnerabilities found in Rich Internet Applications (RIAs). As mentioned in our previous issue, some RIAs are based on Adobe’s technologies like Flash, Flex, or AIR. Some are designed to be run online or offline. Their rising popularity has attracted developers and marketers, and -as expected- hackers and spammers.

QA: Excel Vector Normalization: How do I convert a row vector into a unit vector?
Who is Who in IR: C.J. van Rijsbergen
Top CS Departments: Polytechnic University of Puerto Rico
Historical Notes: ENIAC Computer
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…

No-Caching is Spammers’ Best Friend

April 30, 2009 by E. Garcia

Today I feel like giving a piece of advice to spammers, which should force both sides to raise the bar in the “us versus them” of the Spam War. Think of this as a love-hate relationship.

C’mon spammers, I know you can do better. Don’t make our IR lives easy when it comes to neutralizing your tactics. He, He.

At the recent AIRWeb Workshops, Brian Davison presented the paper Looking into the Past to Better Classify Web Spam, which received high reviews from referees and the audience.

Wannabe spammers, if you are really committed to spamdexing, at least know the how-tos. Don’t leave a temporal fingerprint of your web presence. Try this:

1. Prevent online resources from caching your web pages, like the Wayback Machine and commercial search engines.

2. Use No-Cache and No-Archive.

3. Switch hosts whenever you can.

4. Constantly mutate your link structure.

5. Don’t profile yourself with easy to detect/predictable honeypots, link swapping, strongly-connected component structures, etc.

Why give this advice? Check the current AIRWeb “gems”.

Microsoft, Inter-Metro to Co-Launch a MIC

April 29, 2009 by E. Garcia

This afternoon, Microsoft in partnership with The Interamerican University of Puerto Rico, Metropolitan Campus (Inter-Metro) will announce that they are officially co-launching the Microsoft Innovation Center (MIC) of Puerto Rico.

This will be the first MIC in the region. A two-story building has been fitted out within the Inter-Metro campus for this project. As a member of the MIC steering committee, I have been invited to the presentation by President Manuel J. Fernos.

They have also provided me with office and lab space in the MIC building to put together the Internet Business Development Center (IBDC). The objective of the MIC is the development and commercialization of ecommerce-related software tools. Emphasis will be given to egovernment and ebusiness solutions.

It looks like I will split my schedule between being the IBDC principal investigator, MIC meetings, doing research at Inter-Metro, teaching at PUPR, and writing IRWs. This is exciting news. Let’s see how things go, especially with the other great news that PUPR’s ECE&CS department has been accredited by NSA as a CAE.

AIRWeb 2009 Proceedings

April 28, 2009 by E. Garcia

Here are the proceedings papers of AIRWeb 2009, available at http://airweb.cse.lehigh.edu/2009/proceedings.html

OK, SEOs, Spammers, and Hackers: start your engines and let the fun begin.

If you are a PUPR graduate student and are planning to take my AIR course, it might be a good idea to start browsing through these “gems”. Check also previous proceedings of AIRWeb.

Invited Talks

The Potential for Research and Development in Adversarial Information Retrieval — slides

Brian D. Davison

Web Spam Challenges: Looking Backward and Forward — slides

Carlos Castillo

Temporal Analysis

Looking into the Past to Better Classify Web Spam (slides)

Na Dai, Brian D. Davison and Xiaoguang Qi

Web spamming techniques aim to achieve undeserved rankings in
search results. Research has been widely conducted on identifying
such spam and neutralizing its influence. However, existing spam
detection work only considers current information. We argue that
historical web page information may also be important in spam
classification. In this paper, we use content features from historical
versions of web pages to improve spam classification. We use
supervised learning techniques to combine classifiers based on
current page content with classifiers based on temporal features.
Experiments on the WEBSPAM-UK2007 dataset show that our
approach improves spam classification F-measure performance by
30% compared to a baseline classifier which only considers current
page content.

A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshots (slides)

Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa

In this paper, we study the overall link-based spam structure
and its evolution which would be helpful for the development
of robust analysis tools and research for Web spamming as a
social activity in the cyber space. First, we use strongly connected
component (SCC) decomposition to separate many
link farms from the largest SCC, so called the core. We
show that denser link farms in the core can be extracted by
node filtering and recursive application of SCC decomposition
to the core. Surprisingly, we can find new large link
farms during each iteration and this trend continues until at
least 10 iterations. In addition, we measure the spamicity
of such link farms. Next, the evolution of link farms is examined
over two years. Results show that almost all large
link farms do not grow anymore while some of them shrink,
and many large link farms are created in one year.

Web Spam Filtering in Internet Archives (slides)

Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi

While Web spam is targeted for the high commercial value of top-ranked
search-engine results, Web archives observe quality deterioration
and resource waste as a side effect. So far Web spam filtering
technologies are rarely used by Web archivists but planned in the
future as indicated in a survey with responses from more than 20
institutions worldwide. These archives typically operate on a modest
level of budget that prohibits the operation of standalone Web
spam filtering but collaborative efforts could lead to a high quality
solution for them.
In this paper we illustrate spam filtering needs, opportunities and
blockers for Internet archives via analyzing several crawl snapshots
and the difficulty of migrating filter models across different
crawls via the example of the 13 .uk snapshots performed
by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.

Content Analysis

Web Spam Identification Through Language Model Analysis (slides)

Juan Martinez-Romo and Lourdes Araujo

This paper applies a language model approach to different
sources of information extracted from a Web page, in order
to provide high quality indicators in the detection of
Web Spam. Two pages linked by a hyperlink should be
topically related, even though this were a weak contextual
relation. For this reason we have analysed different sources
of information of a Web page that belongs to the context of
a link and we have applied Kullback-Leibler divergence on
them for characterising the relationship between two linked
pages. Moreover, we combine some of these sources of information
in order to obtain richer language models. Given
the different nature of internal and external links, in our
study we also distinguished these types of links getting a
significant improvement in classification tasks. The result
is a system that improves the detection of Web Spam on
two large and public datasets such as WEBSPAM-UK2006 and
WEBSPAM-UK2007.

An Empirical Study on Selective Sampling in Active Learning for Splog Detection (slides)

Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara

This paper studies how to reduce the amount of human supervision
for identifying splogs / authentic blogs in the context
of continuously updating splog data sets year by year.
Following the previous works on active learning, against the
task of splog / authentic blog detection, this paper empirically
examines several strategies for selective sampling in
active learning by Support Vector Machines (SVMs). As a
confidence measure of SVMs learning, we employ the distance
from the separating hyperplane to each test instance,
which have been well studied in active learning for text classification.
Unlike those results of applying active learning
to text classification tasks, in the task of splog / authentic
blog detection of this paper, it is not the case that adding
least confident samples performs best.

Linked Latent Dirichlet Allocation in Web Spam Filtering (slides)

István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003)
is a fully generative statistical language model on the content
and topics of a corpus of documents. In this paper
we apply an extension of LDA for web spam classification.
Our linked LDA technique takes also linkage into account:
topics are propagated along links in such a way that the
linked document directly influences the words in the linking
document. The inferred LDA model can be applied for
classification as dimensionality reduction similarly to latent
semantic indexing. We test linked LDA on the WEBSPAM-UK2007
corpus. By using BayesNet classifier, in terms of
the AUC of classification, we achieve 3% improvement over
plain LDA with BayesNet, and 8% over the public link features
with C4.5. The addition of this method to a log-odds
based combination of strong link and content baseline classifiers
results in a 3% improvement in AUC. Our method
even slightly improves over the best Web Spam Challenge
2008 result.

Social Spam

Social Spam Detection (slides)

Benjamin Markines, Ciro Cattuto and Filippo Menczer

The popularity of social bookmarking sites has made them prime
targets for spammers. Many of these systems require an administrator’s
time and energy to manually filter or remove spam. Here
we discuss the motivations of social spam, and present a study
of automatic detection of spammers in a social tagging system.
We identify and analyze six distinct features that address various
properties of social spam, finding that each of these features provides
for a helpful signal to discriminate spammers from legitimate
users. These features are then used in various machine learning
algorithms for classification, achieving over 98% accuracy in detecting
social spammers with 2% false positives. These promising
results provide a new baseline for future efforts on social spam. We
make our dataset publicly available to the research community.

Tag Spam Creates Large Non-Giant Connected Components (slides)

Nicolas Neubauer, Robert Wetzker and Klaus Obermayer

Spammers in social bookmarking systems try to mimick
bookmarking behaviour of real users to gain the attention
of other users or search engines. Several methods have been
proposed for the detection of such spam, including domain-specific
features (like URL terms) or similarity of users to
previously identified spammers. However, as shown in our
previous work, it is possible to identify a large fraction of
spam users based on purely structural features. The hypergraph
connecting documents, users, and tags can be decomposed
into connected components, and any large, but non-giant
components turned out to be almost entirely inhabited
by spam users in the examined dataset. Here, we test
to what degree the decomposition of the complete hypergraph
is really necessary, examining the component structure
of the induced user/document and user/tag graphs.
While the user/tag graph’s connectivity does not help in
classifying spammers, the user/document graph’s connectivity
is already highly informative. It can however be augmented
with connectivity information from the hypergraph.
In our view, spam detection based on structural features, like
the one proposed here, requires complex adaptation strategies
from spammers and may complement other, more traditional
detection approaches.

Spam Research Collections

Nullification Test Collections for Web Spam and SEO (slides)

Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell

Research in the area of adversarial information retrieval has
been facilitated by the availability of the UK-2006/UK-2007
collections, comprising crawl data, link graph, and spam labels.
However, research into nullifying the negative effect
of spam or excessive search engine optimisation (SEO) on
the ranking of non-spam pages is not well supported by
these resources. Nor is the study of cloaking techniques
or of click spam. Finally, the domain-restricted nature of a
.uk crawl means that only parts of link-farm icebergs may
be visible in these crawls. We introduce the term nullification
which we define as “preventing problem pages from
negatively affecting search results”. We show some important
differences between properties of current .uk-restricted
crawls and those previously reported for the Web as a whole.
We identify a need for an adversarial IR collection which is
not domain-restricted and which is supported by a set of
appropriate query sets and (optimistically) user-behaviour
data. The billion-page unrestricted crawl being conducted
by CMU (web09-bst) and which will be used in the 2009
TREC Web Track is assessed as a possible basis for a new
AIR test collection. We discuss the pros and cons of its scale,
and the feasibility of adding resources such as query lists to
enhance the utility of the collection for AIR research.

Web Spam Challenge Proposal for Filtering in Archives

András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi

In this paper we propose new tasks for a possible future Web Spam
Challenge motivated by the needs of the archival community. The
Web archival community consists of several relatively small institutions
that operate independently and possibly over different top
level domains (TLDs). Each of them may have a large set of historic
crawls. Efficient filtering would hence require (1) enhanced
use of the time series of domain snapshots and (2) collaboration by
transferring models across different TLDs. Corresponding Challenge
tasks could hence include the distribution of crawl snapshot
data for feature generation as well as classification of unlabeled
new crawls of the same or even different TLDs.

Marketing Professor Kills Three, Hurts Two

April 26, 2009 by E. Garcia

George M. Zinkhan III, of the Terry College of Business at the University of Georgia, allegedly went on a killing rampage, killing his ex-wife and two others and injuring two more.

According to his university page (accessible at the time of writing), Zinkhan is a Coca-Cola Company Professor in the Department of Marketing and Distribution. He is well known in academic marketing research circles, having served as editor of the Journal of the Academy of Marketing Science.

His 40-page CV reveals he conducted extensive research on Marketing and Net Advertising.

In 2008 he was part of an American Marketing Association committee that redefined marketing. The new definition reads:

“Marketing is the activity, set of institutions, and processes for creating, communicating, delivering, and exchanging offerings that have value for customers, clients, partners, and society at large.”

According to the AMA committee,

“Marketing is no longer a function — it is an educational process.”.

Zinkhan published extensively with Yue Pan, associate professor of marketing, University of Dayton. He published on the concept of Netvertising (“Netvertising Characteristics, Opportunities and Challenges: A Research Agenda,” International Journal of Internet Marketing & Advertising, 1(3), 283-299). According to their abstract:

“Netvertising, or “advertising on the internet”, is attracting much attention from advertising and marketing researchers. However, surprisingly little is known about its new features as compared to other forms of advertising and the implications of the new medium for advertisers. Here, we focus on the following issues: the opportunities and challenges associated with internet advertising; the differences of netvertising from other forms of communication; banner ads – the most popular type of netvertising. Applying this framing perspective, we propose a research agenda for the study of netvertising.”

Netvertising is something search marketers do under various out-of-thin-air theories and naming conventions.

Read more about the Netvertising Image Communication Model (NICM)

That was then. Today Zinkhan’s name is associated with a Negative Image on the Net. It is only a matter of time before others disassociate themselves from such an image. Life’s ironies!

He didn’t seem to fit the academic stereotype.

Unfortunately, as in any profession, some people cannot cope with their personal misfortunes and end up doing bad things.

Hackers Hit Pentagon

April 22, 2009 by E. Garcia

It happened again: Thanks to Web vulnerabilities, hackers were able to hit the Pentagon. 

According to CNN (http://www.cnn.com/2009/US/04/21/pentagon.hacked/),

Thousands of confidential files on the U.S. military’s most technologically advanced fighter aircraft have been compromised by unknown computer hackers over the past two years, according to senior defense officials.

The Internet intruders were able to gain access to data related to the design and electronics systems of the Joint Strike Fighter through computers of Pentagon contractors in charge of designing and building the aircraft, according to the officials, who did not want to be identified because of the sensitivity of the issue.

In addition to files relating to the aircraft, hackers gained entry into the Air Force’s air traffic control systems, according to the officials. Once they got in, the Internet hackers were able to see such information as the locations of U.S. military aircraft in flight.

This news is quite relevant to my Fall 2009 Web Vulnerability graduate course (http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf)

BTW, Dr. Alfredo Cruz, Associate Director of the CS Department at PUPR.edu and also a colleague and friend, called me two days ago with some great news: The department has been accredited for 2009-2014 as a National Center of Academic Excellence in Information Assurance Education. Soon they will be listed with members of this exclusive “club” on the National Security Agency web site (http://www.nsa.gov/ia/academic_outreach/nat_cae/institutions.shtml).

An official press release and a formal presentation before the pertinent authorities are being coordinated for the next few weeks.

The next issue of IR Watch – The Newsletter provides additional coverage of this exciting news.

I have tied these two news items together in a single post to underscore the need for IR/data mining courses at the intersection with Information Security, which is precisely the mission of IRW, now reaching more than 300 investigators/research centers.

McAfee Report: Email Spam and the Environment

April 16, 2009 by E. Garcia

According to a McAfee report,

Until now, spam’s impact has been measured in time, money, and aggravation. It turns out there is a massive environmental impact as well. McAfee recently commissioned climate-change consultant ICF International and spam expert Richi Jennings to calculate the environmental impact of spam. The results that came back were startling: The energy consumed in transmitting and deleting spam is equivalent to the electricity used in 2.4 million U.S. homes, with greenhouse gas (GHG) emissions equivalent to 3.1 million passenger cars(http://resources.mcafee.com/content/NACarbonFootprintSpam)

I first learned about these findings through ABC. Essentially,

Anything powered by electricity also emits greenhouse gases. McAfee researchers say each junk e-mail emits 0.3 grams of the greenhouse gas carbon dioxide (CO2). That may not sound like much, but when you consider the volume of global annual spam, it all adds up. (http://abcnews.go.com/Technology/GlobalWarming/story?id=7343518&page=1).

Following that reasoning, spamdexing search engines, like any other adversarial information retrieval (AIR) practice, also adds insult to injury, as do many other things that come to mind.

I will tell that to students of my Fall 2009 AIRWeb Course.

Hmm, shocking: AIR vs. Environment.

I never thought about such an obvious connection.  :)

Why IDF is Expressed Using Logs

April 15, 2009 by E. Garcia

Recently a well-known SEO (name withheld) asked me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.

I am partially reproducing and editing my responses, as they might help other SEOs with similar questions.

Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.

1) Why is a log function used for calculating IDF?
3) Would it be accurate to describe IDF as “the ratio of documents in a collection to documents in that collection with a given term”? I’m guessing your answer would be, IDF is the [LOG of " the ratio of documents in a collection to documents in that collection with a given term"]? Which brings us back to question, I guess? hehe

These are recurring questions that students have asked me before. Logs are used because of two assumptions frequently made in most IR models; i.e.,

I. that scoring functions are additive.
II. that terms are independent.

While in some models II might not be present, both (I and II) play well with logs, since logs turn products into sums.

These functions, and the reason for using logs, are explained in the recent RSJ-PM Tutorial http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf

Document Frequency (DF) is defined as d/D, where d is the number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).

But since d cannot exceed D, and usually D > d, log(d/D) is zero or negative. To get rid of the negative sign, we simply invert the ratio inside the log expression: log(D/d) = -log(d/D). Taking logs also compresses the scale of values so that very large and very small quantities can be smoothly compared. Now log(D/d) is conveniently called Inverse Document Frequency.

Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it follows that for any two events, A and B
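As a quick illustration of the sign flip described above, here is a minimal Python sketch. The collection size and the document counts are made-up numbers, and base-10 logs are an arbitrary choice (any base works, it only rescales the values).

```python
import math

def document_frequency(d, D):
    """Fraction of the D documents in the collection that contain the term."""
    return d / D

def idf(d, D):
    """Inverse document frequency, log(D/d). Base 10 is an arbitrary choice."""
    return math.log10(D / d)

D = 1_000_000                      # hypothetical collection size
for d in (10, 1_000, 100_000):     # hypothetical counts for three terms
    df = document_frequency(d, D)
    # log(d/D) is negative; inverting the ratio gives the same magnitude, positive
    print(d, df, math.log10(df), idf(d, D))
```

The rarer the term (small d), the larger its IDF, while a term present in every document (d = D) gets an IDF of zero.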

p(AB) = p(A)p(B).

Taking logs we can write

log[p(AB)] = log[p(A)]+ log[p(B)]

It is easy to show that for two terms

log(d12/D) = log(d1/D) + log(d2/D)

Inverting and using the definition of IDF we end up with

IDF12 = IDF1 + IDF2

validating assumption I; that IDF as a scoring function is additive.

That is, the IDF of a two-term query is the sum of the individual IDF values. However, this is only valid if terms are independent from one another. If terms are not independent we would have two possibilities; i.e.,

p(AB) > p(A)p(B)

or

p(AB) < p(A)p(B)

and we cannot say that the IDF of a two-term query (e.g., a phrase) is the sum of the individual IDF values. Assuming otherwise, as many SEOs do in order to promote some dumb keyword research tools, is plain snakeoil.
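A small numeric check of this point, as a sketch with made-up counts: when the joint count d12 equals what independence predicts (d1*d2/D), the IDF of the pair equals the sum of the individual IDFs; when the two terms co-occur more often than chance, it does not.

```python
import math

def idf(d, D):
    """log(D/d), using natural logs (the base does not affect additivity)."""
    return math.log(D / d)

D = 10_000          # hypothetical collection size
d1, d2 = 100, 500   # hypothetical document counts for terms 1 and 2

# Independent terms: p(AB) = p(A)p(B), so d12/D = (d1/D)(d2/D)
d12_independent = d1 * d2 / D
# Dependent terms (e.g., a common phrase): they co-occur more often than chance
d12_dependent = 80

sum_of_idfs = idf(d1, D) + idf(d2, D)
print(idf(d12_independent, D), sum_of_idfs)   # these two agree
print(idf(d12_dependent, D), sum_of_idfs)     # these two do not
```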

2) What do you mean by ‘discriminatory power’ in the phrase “IDF is a measure of the discriminatory power of a term in a collection”?

This is a legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms used too frequently are not good discriminators between documents. If a term is used in too many documents, its ability to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.

The RSJ-PM Tutorial mentioned above was written to kill for good some misconceptions regarding IDF. In it we explain why IDF is considered by Robertson and Sparck Jones a particular RSJ weight in the absence of relevance information.

In a nutshell, IDF is a collection-wide estimate, and as such the information on whether documents containing the terms being queried are relevant to these is unknown. Similarly, the information on whether documents not containing the query terms are relevant or not is unknown and often remains unscrambled when we just look at the d/D and d/(D – d) collection-wide ratios. All we can say is that relevant documents might have a higher probability of containing query terms in comparison with other documents from the collection as a whole. But we could make such an assertion without resorting to IDF as well.

In the case of Web documents, often these are about multiple topics. Many documents aggregate content from dissimilar sources (news headlines, rss, blogs, etc) and said document content might change in time. The mere mention of a term (regardless of repetition) is not a proof of its relevancy or of its importance with respect to the topics discussed in a document.

Thus, the idea that we can assess if terms are relevant to a document by simply comparing IDF values is missing the whole point and defeats the purpose for which the RSJ-PM model and many of its variants (e.g., BM25) were developed.

I hope this helps to clear up some SEO misconceptions on the topic.

Finally SEOs are getting the LSI Myth!

April 9, 2009 by E. Garcia

If you search this blog (IRThoughts) for LSI or visit its Latent Semantic Indexing category you will find many posts wherein SEO LSI Myths are debunked. Prior to this WordPress blog I used to maintain a personal blog wherein SEO myths regarding LSI were also debunked.

Over the years, many realized they were taken in by the usual agents of misinformation, at least when it comes to “SEO LSI” and “LSI-Friendly” documents.

Recently, I found traffic coming from a blog discussion about a video (http://www.stomperblog.com/warning-advanced-seo-technique-does-not-work/) wherein LSI in relation to Google is debunked.

The video also discusses one flavor of LSI; i.e. one wherein weights are tf-IDF weights. This flavor does not incorporate relevance information or entropy information, like other LSI variants.

The video does a good job at debunking LSI Myths. However, it has at least one factually incorrect argument in relation to how the SVD algorithm works.

The video gives an example implying that SVD works by reducing a large set of words to a few words, such that, for example, thousands of words are reduced to, let's say, 300 words. This is incorrect and certainly is not a trivial flaw.

SVD does not work by reducing a vocabulary, but by reducing dimensions, and there are as many dimensions as singular values. This is why it is called a dimensionality-reduction and not a vocabulary-reduction algorithm. I should stress that an LSI Space is not like a Term Space wherein each term is a dimension such that there is a 1:1 correspondence.

In LSI, the SVD algorithm is used to reduce the dimensions of a matrix; that is, the number of singular values retained.

For instance in our SVD and LSI Tutorial series at

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html

we present an LSI problem example consisting of many words and few initial dimensions such that for the initial matrix

#words >> # initial dimensions

More specifically, we used 11 words and 3 dimensions.

After truncation, we ended up with 11 words and 2 dimensions.
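To make the point concrete, here is a minimal NumPy sketch. The three toy documents below are an assumption chosen only because they yield 11 distinct words and 3 documents; they are not necessarily the data used in the tutorial, and raw term counts are used as weights for simplicity.

```python
import numpy as np

# Hypothetical toy corpus: 3 documents, 11 distinct words
docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix of raw counts: one row per word, one column per document
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# SVD: there are as many singular values as initial dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep the two largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("words:", len(vocab), "initial dimensions:", len(s))
print("term coordinates after truncation:", U[:, :k].shape)
print("rank of truncated matrix:", np.linalg.matrix_rank(A_k))
```

Every word still has a row after truncation; only the number of dimensions drops from 3 to 2.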

Other than this, the video is fun to watch, but it ends up as an introductory promotion for another SEO proposal.

 PS.

After reviewing the video several times, unfortunately I found that it contains another incorrect argument.

When arguing that Google might not use LSI, the video claims that LSI would have to return the same results when word variants such as plurals and tenses are used. This might be the case if stemming is heavily used in an LSI implementation, but stemming is not a requirement for implementing LSI at all.

When stemming is not implemented, the SVD reduction will surely return different results, since the variants are entered as different tokens in the original term-doc matrix that undergoes decomposition.
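A tiny sketch of why this happens. The documents are made up, and naive_stem is a toy normalizer invented for this illustration (it only strips a trailing "s"), not a real stemmer.

```python
def build_vocabulary(docs, normalize=None):
    """Return the sorted list of distinct tokens, i.e. the rows of a term-doc matrix."""
    tokens = set()
    for doc in docs:
        for word in doc.lower().split():
            tokens.add(normalize(word) if normalize else word)
    return sorted(tokens)

def naive_stem(word):
    """Toy normalizer (an assumption for illustration): strip a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

docs = ["cheap car", "cheap cars", "car dealers"]   # hypothetical documents

print(build_vocabulary(docs))              # 'car' and 'cars' are separate rows
print(build_vocabulary(docs, naive_stem))  # variants collapse into one row
```

Without normalization the matrix to be decomposed has one more row, so the decomposition, and anything derived from it, starts from different input.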

The video also misses where the power of LSI comes from: higher-order co-occurrence connectivity paths hidden (latent) in the original matrix. Terms need not be synonyms, related terms, or even derivative forms for these hidden paths to be observed in LSI.

Terms need not be related terms either to end up clustered with LSI. It is the hidden co-occurrence patterns that are behind the clustering. For example, in our SVD and LSI tutorial above, we intentionally used stopwords and zero synonyms/related terms and these ended up in their corresponding clusters, without being necessarily semantically related. This simple example shows that in LSI the SVD algorithm produces an output based on crunching numbers, not on making sense of meaning or intelligence, and contradicts the widespread opinion that LSI works at the level of meaning.

I have to conclude that while the video is intended to debunk LSI SEO myths (a noble effort), it uses incorrect arguments and hearsay lines from around the Web. Debunking hearsay with more hearsay: What a shame.

 

IRW Newsletter: Web & Data Mining with RIAs

April 8, 2009 by E. Garcia

The current issue of IRW should be in subscribers’ inboxes today or tomorrow, at the latest.

In this issue of the newsletter we cover Rich Internet Applications (RIAs) and how these can be used for Web/Data Mining. A RIA is a browser-independent application that can be compiled and run from the desktop.

In this issue:

Featured article: Web & Data Mining with RIAs
QA: Recommended RIAs
Who is Who in IR: Bruce Croft
Top CS Departments: UMass, Amherst
Historical Notes: John von Neumann and Bugs
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…

IRW currently reaches a fine audience of university and government researchers and their labs. If you are a graduate student or IR practitioner and want to be known within this exclusive circle, submit a short article (2-3 pages, IRW format, free from marketing and sales pitches) for consideration.

Vector Space, Probabilistic LSI, and LDA

April 3, 2009 by E. Garcia

[Image: LDA]
source: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

There is some buzz about Probabilistic Latent Semantic Indexing, so here goes this post.

From VSM to LSI

Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space, in which a query is treated as just another document. In this term space, it is not possible to assign a position to terms, simply because the terms are the dimensions of the space. The coordinate values assigned to document and query vectors are term weights computed using a particular weighting scheme.
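As a rough sketch of this representation: the documents and query below are made up, raw term counts stand in for the weighting scheme (an assumption; tf*IDF or any other scheme could be plugged in), and cosine similarity is used to compare vectors.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Coordinates of a text in term space; raw counts are used as term weights."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical documents and query
docs = {
    "d1": "latent semantic indexing uses svd",
    "d2": "vector space model uses term weights",
}
query = "vector space"

# Each distinct term is one dimension of the space
vocabulary = sorted({w for text in docs.values() for w in text.split()})

query_vector = to_vector(query, vocabulary)   # the query is just another document
for name, text in docs.items():
    print(name, cosine(query_vector, to_vector(text, vocabulary)))
```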

VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymy and (b) polysemy.

In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan’s 1965 SVD algorithm to “documents” exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).

LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The “documents” were not HTML Web documents (there were no Web documents back then), but just abstracts and memos from specific knowledge domains (HCI, scientific, med). As expected these consisted of synonyms and related terms used in these domains. Thus, clusters of these were obtained.

It was immediately claimed that LSI could be used to model aspects of basic linguistics, like synonymy and polysemy, and how the human mind associates words to concepts and concepts to meaning.

Fast-forward twenty years: SEOs misread such outdated research and the synonym-stuffing myth was born.

There is now a crew of SEOs claiming that they can design “LSI-friendly” documents by making these rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing LSI Myths out of thin air in order to better market whatever they sell or promote (often their own image as “experts”). Same goes for those who claim “PLSI-SEO” strategies.

Research findings suggest that what makes LSI work is the first- and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for the how and why of the redistribution of term weights in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why “LSI-friendly” documents are plain SEO snakeoil. Again, the same goes for those who claim “PLSI-SEO” strategies. Keep reading.

Enter the Probabilistic Latent Semantic Indexing (PLSI) Model

In 1998 LSI was called into question. Given a generative model of text, why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to data?

In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.

Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected “reduced description” associated with the document.
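The mixture idea can be sketched in a few lines. The topic names, word probabilities, and mixing proportions below are invented for illustration, and the fitting step (typically EM) is not shown; this only evaluates p(word | document) as the sum over topics of p(word | topic) times p(topic | document).

```python
# Hypothetical topics: each one is a multinomial distribution over words
topics = {
    "search": {"query": 0.50, "index": 0.40, "goal": 0.10},
    "sports": {"query": 0.05, "index": 0.05, "goal": 0.90},
}

# Hypothetical document: its "reduced description" is a list of mixing proportions
doc_mixture = {"search": 0.7, "sports": 0.3}

def word_probability(word, mixture, topics):
    """p(word | doc) = sum over topics z of p(word | z) * p(z | doc)."""
    return sum(weight * topics[z].get(word, 0.0) for z, weight in mixture.items())

vocabulary = sorted(topics["search"])
for word in vocabulary:
    print(word, word_probability(word, doc_mixture, topics))
```

Note that doc_mixture is just a list of numbers attached to one training document, which is exactly the property the next section takes issue with.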

But there is a problem.

Enter the Latent Dirichlet Allocation (LDA) Model

By 2003 Hofmann’s PLSI model was called into question, this time by David Blei, Andrew Ng and Michael Jordan, who proposed that year the Latent Dirichlet Allocation (LDA) model. As noted by Blei, et al. (and I quote), PLSI “is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers.”

Blei and co-workers then stated that this leads to two problems:

1. the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting

2. it is not clear how to assign probability to a document outside of the training set.

Thus, it is not true that PLSI is the preferred model to work with in IR, as some have claimed. In addition, the model has non-trivial theoretical flaws and limitations.

In Salton’s Term Vector Model, as in the LSI and PLSI models, word order does not matter. Documents are simply considered “bags of words”. However, common sense dictates that this is not a valid assumption, since word semantics is sensitive to word ordering. This explains why searches in Google for college junior or junior college produce far different results.

To underscore the importance of word ordering consider this: applying a similarity measure like a Jaccard Coefficient computed from a term-term matrix to the above two queries produces identical results, but again the computed similarity scores are disconnected from word semantics.
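Here is a simplified sketch of that observation. It computes the Jaccard Coefficient directly on sets of terms rather than from a term-term matrix (a simplification), and the document is made up.

```python
def jaccard(a, b):
    """Jaccard coefficient between two collections of terms, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

q1 = "college junior".split()
q2 = "junior college".split()
doc = "a junior college offers two year degrees".split()   # hypothetical document

print(jaccard(q1, q2))                     # the two queries look identical
print(jaccard(q1, doc), jaccard(q2, doc))  # and score the same against a document
print(q1 == q2)                            # yet the word sequences differ
```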

Blei and co-workers have argued that if we want to consider exchangeable representations (ordering) for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.

In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.
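A rough sketch of that generative story, under stated assumptions: the two topics and their word probabilities are invented and held fixed, the Dirichlet parameters are arbitrary, and inference (recovering topics from real text) is not shown. The Dirichlet draw uses the standard trick of normalizing independent gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw mixing proportions from a Dirichlet by normalizing gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(n_words, alpha, topic_word, rng):
    """Generate one document: a random topic mixture, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)   # per-document mixture over latent topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]   # pick a topic
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # pick a word from it
    return theta, words

# Hypothetical topics: each is a distribution over words
topic_word = [
    {"query": 0.50, "index": 0.40, "goal": 0.10},
    {"query": 0.05, "index": 0.05, "goal": 0.90},
]

rng = random.Random(0)   # seeded only so that reruns give the same sample
theta, words = generate_document(8, [0.5, 0.5], topic_word, rng)
print(theta)
print(words)
```

Unlike the PLSI sketch, the mixing proportions here are themselves drawn from a distribution, so a document outside the training set can be assigned a probability.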

I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere “bags of words”. The former is IR’s Original Sin and the latter is its copycat.

The image above gives me a flashback to research work I conducted in the late ’80s on sequential simplex optimization methods.

AIRWeb Course Announcement

April 2, 2009 by E. Garcia

During the Fall of 2009, I will be teaching 

 Adversarial Information Retrieval on the Web:  A Graduate Course on Web Spam and Internet Vulnerabilities

This is a new full-semester graduate course to be offered at Polytechnic University of Puerto Rico. It is based on the material presented at the annual AIRWeb Workshops. KDDM graduate students are encouraged to enroll. An early announcement and preliminary syllabus are available at

http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf

BTW, on November 5, 2008, PUPR became the first academic institution in the Caribbean to be certified by the Committee on National Security Systems (CNSS). Additional information is available at http://www.pupr.edu/ias.html

Their goal is to become a Center of Academic Excellence in Information Assurance Education (CAE/IAE). This is great news. Nationwide, how many universities do you know that are in such an exclusive “club”?

RSJ-PM: Probabilistic Model Tutorial

March 30, 2009 by E. Garcia

As promised, I am pleased to announce the publication of the Robertson-Sparck Jones Probabilistic Model Tutorial.

It is available at Mi Islita.com in the Tutorials section. A link is provided on the index page.

The tutorial guides you through the intricacies of RSJ-PM. It is a great start for CS students and teachers interested in probabilistic models in information retrieval.

Enjoy it.

Due to the time spent on it, the April issue of the IR Watch newsletter will be a bit delayed.

W3C 2009 Conference

March 26, 2009 by E. Garcia

Here is the final list for the 18th International Conference of the W3C, WWW2009, of which AIRWeb2009 is a workshop.

http://www.webshine.org/2009reg.html

A lot of good stuff to please IRs, CS students, spammers/SEOs, and hackers.

SEOs and Their IDF Myths: Part 3

March 20, 2009 by E. Garcia

In SEOs and their IDF Myths, we covered how many are mistaking the measure of term specificity known as Inverse Document Frequency (IDF).

In SEOs and their IDF Myths: Part 2, we exposed some of these folks.

In Understanding TFIDF, we wrote a rebuttal.

We are still seeing many bloggers mistaking IDF for something it is not. We have to conclude these pseudo-teachers either are just trying to sell something or don’t really understand what term specificity stands for. They should know that IDF is just a small pixel in the bigger picture of the Robertson-Sparck Jones Probabilistic Model for information retrieval.

Thus, we are writing a tutorial on RSJ-PM to kill for good their intentionally misleading efforts. Hopefully, the tutorial will be ready before the month ends. It will be a great way of putting to rest all the false information flying around from the usual agents of misinformation (mostly SEOs). CS students interested in knowing about the pros and cons of probability models in IR will find it useful.

A CBR Sharing Search Engine System

March 17, 2009 by E. Garcia

I’m reading with great interest the paper Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, and Ning Xiong, presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management (COMADEM 2007), Faro, Portugal, pp. 305-314.

I’m happy to read they referenced our Tutorial on Cosine Similarity Measures. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.

Their abstract follows:

ABSTRACT
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.