Archive for February, 2008

SEOs and University Education

February 29, 2008

Posters at SEOmoz are debating why the Internet is not taught at schools.

One poster claims: “I think all Universities are quite a ways off from this.” Others simply think this will never happen or that if it does, it will not be worth it.

These opinions are understandable, especially when universities have offered courses with “ecommerce”, “web marketing”, “ebusiness” and similar terms in their course titles when most of these are soooo outdated. Many are limited to explaining what is a cookie, bayesian and game theorems, and few other topics that are not really that useful in the real world.

Here is a first hand story. Back in 2002 I was hired by Graduate School of Business of University of Turabo in PR to teach the graduate course ECommerce Technology. It was the first time the course was offered as a core course for students pursuing a master degree. The problem: the syllabus and textbooks were sooo outdated, with case studies of companies that no longer existed. I was forced to redesign the entire course and material.  

Here is another first hand story. Many students that took data mining at Polytechnic University of Puerto Rico (PUPR) before I was hired by the university were complaining that they did not learn anything useful because lectures emphasized theory and no practice. This is something I tried to avoid when back in October I started to teach the Web Mining graduate course. It is the same approach I use for my other courses.

As for studying the Internet as SEOmoz posters argue, it is not possible to study “The Internet”. When they say “Internet” probably they refer to studying search marketing, SEO, or Web Analytics. It looks like an opportunity for other marketers to make some money out of their peers’s ignorance.

I know there are some seos already trying to squeeze money from their peers by offering college-type courses dictated by “experts”. Don’t be gamed by these folks. Their “colleges” and “institutes” are not certified by any higher education body, like The Middle States Association, or by research funding organizations like NSF. These mostly look like scams and their diplomas are not worth even the ink these hold.

As for the above claim of teaching SEO in colleges, there is a list of traditional schools already teaching updated web marketing, design, usability, and even accessibility courses. In fact, more and more grad schools are developing Web Mining and Web Marketing courses.

At PUPR I’m in the curriculum development arena, developing and teaching the following hands-on courses, all at the graduate level:

Search Engines Architecture (Spring - classes start next week; lectures and lab)
Web Mining (Winter - semester just ended; lectures only) *
Search Engines for Penetration Testing and Intelligence (Fall - next fall; lectures and lab) **

* This was a course on Web Mining, Business Intelligence, and Search Engines. Agenda and syllabus is available online.

**Just asked by the head of EE&EC and CS Dept to teach this one.

These ARE NOT paperless, online courses. The class meets in the computer lab building. We have plenty of computers and software to play with. I offer all lectures using powerpoint and smartboards. We study which Web business models work and which one don’t. We check case studies from the Web. We dissect SEO myths. We teach why and how search engine algorithms and web analytics work, etc, etc.I have grad students conducting projects or theses supported by grants from gov agencies like DoD, etc. Some of these projects interface with SEO, Web Analytics, Business Intelligence, and Homeland Security.

In addition, we have an upcoming conference on these topics (October). I’m also pushing for a 2-year certificate on Web Marketing & Analytics with a local college.

And how about AIRWeb wherein , as scholars and researchers, we dissect and test search engine spam strategies and find new ways to neutralize, minimize, or “kill” these techniques–many promoted by some among you?

SEOS: Definitely, we are not oblivious to Web marketing and your “world”.

Google-IBM-NSF Partnership in Data-Intensive Computing

February 28, 2008

We received few days ago the following Call for Proposals:

Dear Colleague:

We have some exciting news to share with all of you: NSF is partnering with Google and IBM to explore data-intensive computing. Through NSF’s reach, Google and IBM are providing software and services running on a large cluster to the broad academic community to explore innovative research and education ideas in data-intensive computing. Google and IBM launched the Academic Cluster Computing Initiative [1] last October with instructional programs at six pilot universities, and the NSF will be joining this initiative as the first research-oriented pilot partner. We are calling the NSF program to provide access to these types of resources the Cluster Exploratory (CluE).

Please read more about data-intensive computing, NSF’s partnership with Google and IBM, and the upcoming CISE solicitation for CluE.

Jeannette M. Wing, Assistant Director, Computer and Information Science and Engineering Directorate Dan Atkins, Director, Office of Cyberinfrastructure

Data-Intensive Computing

Data-intensive computing is a computational paradigm in which the sheer volume of data is the dominant performance parameter. Storage and computation are co-located, enabling large-scale parallelism over terabytes of data. For example, Google runs an average of 100,000 MapReduce jobs per day on its clusters, processing over 20 petabytes daily [2]. This scale of computing effectively supports applications specified in high-level programming primitives, where the run-time system manages parallelism and data access. The architecture is extremely fault-tolerant and exhibits high degrees of reliability and availability.

Data intensive computing raises important research challenges:

O For science
  - What are the fundamental capabilities and limitations of this paradigm?
  - What new programming abstractions (including models, languages, algorithms) does this computational model suggest?

O For technology
  - How can we automatically manage the hardware and software of these systems?
  - How can we reduce their power consumption?
O For society
  - What (new) applications can best exploit this computing paradigm?

Data-intensive computing is at the forefront of ultra-large-scale commercial data processing. A July 2006 New York Times article [3] notes that “Google, Microsoft and Yahoo are spending vast sums of capital to build out their computing capabilities.” Not only is there an increasing need for advances in data-intensive computing systems software and hardware, but also an increasing demand for a trained workforce to operate and use these systems. To date, however, the academic community has had limited access to such systems.

Enter Google and IBM

On October 8, 2007, Google and IBM announced they had teamed to provide six universities access to a large-scale computing cluster together with the software and services to use it effectively [4]. After several months of discussions, the NSF will be joining this initiative and will be partnering with Google and IBM to broaden the reach of this powerful computing resource to foster more innovation than might be possible in the initial pilot.

Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do fundamental, disruptive research in data-intensive computing and to explore powerful new applications. This facility can also serve as a tool for educating the next generation of scientists and engineers. This partnership is an excellent example of an academic-industry-government relationship that is a win-win-win situation for all.

System Description

The Google-IBM cluster contains well over a thousand processors connected to terabytes of memory and hundreds of terabytes of storage with internal networking as well as a substantial external network connection. The system will be configured with open source software to include Linux and Apache Hadoop [5] - a large-scale distributed computing platform inspired by Google’s MapReduce [6] and the Google File System [7]. IBM’s Tivoli [8] software will also be used for management, monitoring and dynamic resource provisioning of the cluster.

The system will provide a powerful resource for large-scale data analysis, mining and visualization in addition to support for Internet-scale computing applications. Tutorial information describing the programming environment of the Google-IBM academic cluster available via the CluE program can be found on the Google Code for Educators website [9]. Much of this material was developed in collaboration with the University of Washington, and all of it is available under permissive licenses such as the Creative Commons Attribution License.

Upcoming Solicitation

CISE is currently developing a program solicitation that will invite researchers to submit proposals requesting allocations of the Google-IBM cluster for any new, innovative use of the system and to probe the possibilities and fundamental limits of this new computing paradigm. The emphasis of the program will be to develop new approaches and applications that are outside the typical high-performance computing applications running on today’s supercomputers.

The challenge to the academic community is three-fold: to use existing tools and to develop new programming abstractions for such a “computer” to solve problems unsolvable any other way; to solve old problems in simpler or more efficient ways; and to enable new applications. This resource will also provide an opportunity to teach students how to build, use and manage data-intensive computing systems systems that are already being used widely in industry, but are largely unavailable to students and faculty today.

Please look for the new solicitation which will be posted on the CISE web site. CISE looks forward to your bold, creative proposals for CluE!

[1] Official Google Blog: http://googleblog.blogspot.com/2007/10/let-thousand-servers-bloom.html

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. of the ACM, 51(1), January 2008, pp. 107-113.

[3] http://www.nytimes.com/2006/06/14/technology/14search.html?_r=1&pagewanted=2&oref=slogin

[4] See http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html or http://www-03.ibm.com/press/us/en/pressrelease/22414.wss for the text of the press release.

[5] http://hadoop.apache.org/

[6] http://labs.google.com/papers/mapreduce.html

[7] http://labs.google.com/papers/gfs.html

[8] http://www.ibm.com/software/tivoli/

[9] http://code.google.com/edu/content/parallel.html

Search Engines Architecture Graduate Course

February 27, 2008

Search Engines Architecture - Lecturer: Dr. Edel Garcia
Code: CECS 7804/31 Special Topics in C&E
Time: Saturday 8:00 A.M. – 12:00 N, Spring 2008
Room: Software Testing Lab, L-210
Academic Calendar: http://www.pupr.edu/academiccalendar/ac-wi05.pdf
Final Examination Date: May 24, 2008

Description: This is a hands-on, one-full semester course on search engines architecture and their algorithms. Each class consists of lecture and lab sessions. Students are expected to build and test their own search engines and related components on a server dedicated for this purpose.

Course Communication: All communications, upgrades/changes to this syllabus to fulfill class needs, answer to questions, clarifications, etc will be made available through this blog. Posts will be indexed in the Search Engine Architecture Course category. Thus, students must read this category on a regular basis. To access the category, just click on the link listed in the Categories section at the right of this blog home page.

Classroom Policies: Taping is not allowed. Lecture Notes are not published online. Students are required to take notes in the traditional way. All lecture material is copyrighted.

Important Note: Students registered in this course automatically receive an invitation to present their projects developed during this course at the Search Engines and Information Security Conference to be held at Polytechnic University, San Juan Campus during October 3 and 4, 2008. Contact Dr. Garcia at admin@miislita.com for additional details or questions regarding the conference.

Target: Students in Business, Engineering, and Computer Sciences and from other disciplines are encouraged to register for this special course.

Requirements: Permission from advisor or department and knowledge of matrix algebra.

Grading: Weekly Lab Reports, Project Presentations, and a Final Exam. The following scoring system will be used:

Course Grade = Ave*(1 - w) + Fin*w

where

Ave = average of all lab reports and group projects. The lowest 2 partial grades are eliminated.

Fin = final exam score
w = an adjustable weight

The letter grade scale is as follows:

A = 100 - 90; B = 89 - 80; C = 79 - 65; D = 64 - 55; F = 54 - 0

Topics: Although not necessarily in this order, some of the topics to be covered, include, but are not limited to the followings:

Linear Algebra Fast Track Tutorial: Brief tutorial on matrix operations with emphasis on vector theory and Singular Value Decomposition

Parser Building: Use of regular expressions to build, test, and use a parser.

Crawler Building: Use of AJAX to build a client-side crawler and a dedicated server-side crawler.

Look-Up Directories: Implementing a look-up directory and pseudo site search tool.

Search Interfaces: Developing and testing search interfaces, their search modes, and advanced search features.

Index and Database Building: Data fragmentation and storing.

Sanitizing and Ranking Answer Sets: Filtering, De-duplicating, and ranking answer set results.

Textbook: There is no official textbook. Open source components will be used or developed by the students. However, the following reference books are recommended for research. Additional references and extended syllabus will be provided in class.

Modern Information Retrieval (Baeza-Yates and Ribeiro-Neto; Addison Wesley).
Information Retrieval – Algorithms and Heuristics (Grossman and Frieder; Springer).

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that is not useful.

Two issues that strikes me as these suggest a lack of understanding of how search engines work accomodate to the following questions:

1. Could KD be used by search engines or users to check for spam keyword?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for spam keyword?

Word repetition determined by search engines as spam keyword should be of more concern than to what web page creators or a KD tool tag as spam keyword. After all search engines and not designers of web pages are the one that assign a rank to the documents. This goes with the user-machine relevance perception mismatch and the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users are a zero to the left, as they are the one that pay the bills. And even if they don’t, why rank high a page just to see users going to some place else after visiting it because is not suitable for human consumption? So, rather than using a KD tool, just write as natural and useful to your prospective clients and readers as you can.

Regarding the use of KD tools for checking for spam, this allegation reminds me of certain seo books, marketers, and community forums that insist in such non sense, just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked almost on a rutinary basis these and similar SEO myths. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening down the effect of keyword repetition. Two for the price of one!

This is more cost effective at neutralizing keyword repetition than computing (and comparing against) a whole new ratio, KD. Best of all, it does not require of the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scales of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.

To be sure students understood, I included the following three questions in the Final Exam section that consisted of multiple choices. (The problem-solving section of the test is even more interesting, but is too long to include it here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times this term should be repeated in d2 to triplicate its d1 weight? Assume Log 10 base.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is non sense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, I’ll be teaching this Spring the graduate course Search Engines Architecture.

I will blog the syllabus tomorrow, but is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own authorship.

On and on, SEOs still have no clue about what a search engine can or cannot do.

AIRWeb-2008 Last Call for Papers & Extension Deadline

February 25, 2008

AIRWEB organizers have instructed me to disseminate the following Final Call For Papers and deadline extension. Let’s fight the spammers and those disguised as SEOs.

We have extended the submission deadline until March 2, 2008. We would appreciate any assistance in disseminating this extension. The text version of the final call for papers is below and the pdf version is attached.

Best regards.
Carlos Castillo
Kumar Chellapilla
Dennis Fetterly

FINAL CALL FOR PAPERS and 9 day extension
Fourth International Workshop on
Adversarial Information Retrieval on the Web
http://airweb.cse.lehigh.edu/2008/

IMPORTANT DATES

02/March/2008 : Deadline for research articles
31/March/2008 : Deadline for challenge submissions
22/April/2008 : Workshop at the WWW 2008 conference in Beijing, China

Contents:

1. AIRWeb’08 Topics
2. Web Spam Challenge
3. Timeline
4. Organizers and Program Committee

1. AIRWEB’08 TOPICS

Adversarial Information Retrieval addresses tasks such as gathering,
indexing, filtering, retrieving and ranking information from
collections wherein a subset has been manipulated maliciously. On the
Web, the predominant form of such manipulation is “search engine
spamming” or spamdexing, i.e., malicious attempts to influence the
outcome of ranking algorithms, aimed at getting an undeserved high
ranking for some items in the collection.

We solicit both full and short papers on any aspect of adversarial
information retrieval on the Web. Particular areas of interest
include, but are not limited to:

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging
* Ping spam

Proceedings of the workshop will be included in the ACM Digital
Library. Full papers are limited to 8 pages; work-in progress will be
permitted 4 pages. Papers should be formatted using the WWW2008
proceedings style and submitted via
http://www.easychair.org/conferences/?conf=airweb2008

For more information, see http://airweb.cse.lehigh.edu/2008/

2. WEB SPAM CHALLENGE

Last year we introduced a novel element at the workshop: a Web Spam
Challenge for testing web spam detection systems. We will be holding
the Web Spam Challenge again this year, using the WEBSPAM-UK2007
collection for Web Spam Detection http://www.yr-bcn.es/webspam

The collection includes large set of web pages, a web graph, and
human-provided labels for a set of hosts. We will also provide a set
of features extracted from the contents and links in the collection,
which may be used by the participant teams in addition to any
automatic technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions
(normal/spam) for all unlabeled hosts in the collection. Predictions
will be evaluated and results will be announced at the AIRWeb 2008
workshop.

For more information, see

3. TIMELINE

- 15 February 2008: E-mail intention to submit a workshop paper
  (optional, but helpful)
- 02 March 2008: Deadline for workshop paper submissions (all day
- 24 March 2008: Notification of acceptance of workshop papers
- 31 March 2008: Challenge submissions due
- 07 April 2008: Camera-ready copy due
- 22 April 2008: Date of workshop

4. ORGANIZERS AND PROGRAM COMMITTEE

Organizers

- Carlos Castillo, Yahoo! Research
- Kumar Chellapilla, Microsoft Live Labs
- Dennis Fetterly, Microsoft Research

Program Committee

- Einat Amitay, IBM
- Andras Benczar, Hungarian Academy of Sciences
- Paul-Alexandru Chiri, Uni Hannover
- James Caverlee, Texas A&M University
- Gordon Cormack, University of Waterloo
- Nick Craswell, Microsoft Research
- Matt Cutts, Google
- Brian Davison, Lehigh University
- Ludovic Denoyer, University Paris 6
- Aaron D’Souza, Google
- Edel Garcia, Mi Islita.com
- Natalie Glance, Nielsen BuzzMetrics
- Antonio Gulli, Ask.com
- Zoltan Gyongyi, Stanford University
- Monika Henzinger, Google
- Pranam Kolari, Yahoo! Applied Research
- Mark Manasse, Microsoft Research
- Marc Najork, Microsoft Research
- Alexandros Ntoulas, Microsoft Search Labs
- Jan Pedersen, Yahoo! Search
- Erik Selberg, Amazon
- Torsten Suel, Polytechnic University
- Mike Thelwall, University of Wolverhampton
- Baoning Wu, Snap
- Tao Yang, Ask.com

Search Engines for Penetration Testing

February 21, 2008

Well, I’m getting ready for my talk this afternoon at University of Turabo. I’ve organized the talk in three parts:

 Part 1: Spam and Fraud through Search Engines

Part 2: Gathering Intelligence through Search Engines

Part 3: Identity Theft through Search Engines

A disclaimer will be necessary to indicate that the information to be presented is for educational purposes, only.

This gonna be a nice one. I hope to see old friends.

IRW-2008-2 Snake Preview

February 18, 2008

Scalar Clusters

Masking Effects in Similarity Matrices:
Topics with strong similarities can mask other topics,
making these invisible to the clustering process.

Here is the snake preview of the current issue of IR Watch - The Newsletter. Due to academic duties, it is running late. It should be in subscribers’s inboxes during the day. The topic is Scalar Clusters and Back Mapping and is based on lecture material covered in the Web Mining course. The following topics are covered:

Introduction
On Association and Scalar Clusters
The Neighborhood Similarity Matrix, Mn
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Term Clusters to Documents
Masking Effects in Similarity Matrices
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Not a subscribers? What are you waiting for?

Web Mining, Search Engines, and Information Security

February 15, 2008

This thursday the 21st I’ll be presenting before the faculty of University of Turabo, Gurabo, PR the talk:

Web Mining, Search Engines, and Information Security

I hope to see old friends there. Here is the abstract of my talk:

Web Mining is a research area of Data Mining wherein the Web is the “database” and search engines are the “user’s interface”. End-users can resource to search engines for all sorts of things. For instance, marketers can use search engines to gain traffic derived from ranking high Web pages for specific queries, hence enhancing the online presence of businesses, products, and services (search engine optimization, SEO). Spammers can inundate search engine indexes to deceive searchers (spamdexing). Hackers can attempt to rank high documents that lead to security risks (hacketers, hacketering) or use all form of injections (links, forms, scripts, redirections, etc). Terrorists and criminals can use search engines to commit all sort of crime-enabling activities, for instance, by stealing private information like SSNs, passwords, students and users’s IDs, gaining access to “private” documentation, stalking people, etc.

This talk covers these and other aspects of search engines: the Good, the Bad, and the Ugly. The speaker will then talk about his own research projects in the area of Web Mining, Search Engines, and Intelligence. A disclaimer will be necessary to indicate that the information to be presented is for educational purposes only.

Predating Link Models and PageRank

February 14, 2008

Did you know that before PageRank and current link models, and even before CLEVER and HITS, there were some actually working on hyper search engines and hyper text?

In

The Quest for Correct Information on the Web: Hyper Search Engines

Massimo Marchiori described just that: components and elements now in use by many current link models, including PageRank. It is clear that there is nothing new under the Sun, but is nice to have a flashback and read about search engines that no longer are among us.

Marchiori’s abstract follows:

Finding the right information on the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW contains is growing at an incredible rate. In this paper, we present a novel method for extracting from a Web object its “hyper” informative content, in contrast with current search engines, which only deal with the “textual” informative content. This method is not only valuable per se, but it is shown to be able to considerably increase the precision of current search engines, Moreover, it integrates smoothly with existing search engine technology since it can be implemented on top of every search engine, acting as a post-processor, thus automatically transforming a search engine into its corresponding “hyper” version. We also show how, interestingly, the hyper information can be usefully employed to face the search engine persuasion problem.

Some highlights from the WWW6 paper follows (emphasis added). Even when some statements in it are no longer valid and other are, I prefer readers to dissect these in the light of the current state of the art:

The problem is that visibility says nothing about the informative content of a Web object. The misleading assumption is that if a Web object has high visibility, then this is a sign of importance and consideration, and so de facto its informative content must be more valuable than other Web objects that have fewer links pointing to them….In a nutshell, visibility is likely to be a synonym of popularity, which is something completely different than quality, and thus using it to gain higher score from search engines is a rather poor choice.

What is really missing

As said, what is really missing in the evaluation of the score of a Web object is its hyper part, that is the dynamic information content which is provided by hyperlinks (henceforth, simply links).

We call this kind of information hyper information: this information should be added to the textual information of the Web object, giving its (overall) information to the World Wide Web. We indicate these three kinds of information as HYPERINFO, TEXTINFO and INFORMATION, respectively. So, for every Web object A we have that INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A) (note that in general these information functions depend on a specific query, that is to say they measure the informative content of a Web object with respect to a certain query: in the sequel; we will always consider such a query to be understood).

The presence of links in a Web object clearly augments the informative content with the information contained in the pointed Web objects (although we have to establish to what extent).

Recursively, links present in the pointed Web objects further contribute, and so on. Thus, in principle, the analysis of the informative content of a Web object A should involve all the Web objects that are reachable from it via hyperlinks (i.e., “navigating” in the World Wide Web).

Regarding SEP (now called SEO)

A big problem that search engines have to face is the phenomenon of so-called sep (search engines persuasion). Indeed, search engines have become so important in the advertising market that it has become essential for companies to have their pages listed in top positions of search engines, in order to get a significant Web-based promotion. Starting with the already pioneering work of Rhodes ([5]), this phenomenon is now boosting at such a rate as to have provoked serious problems for search engines, and has revolutioned the Web design companies, which are now specifically asked not only to design good Web sites, but also to make them rank high in search engines. A vast number of new companies was born just to make customer Web pages as visible as possible. More and more companies, like Exploit, Allwilk, Northern Webs, Ryley & Associates, etc., explicitly study ways to rank a page highly in search engines. OpenText arrived even to sell “preferred listings;” i.e., assuring that a particular entry will stay in the top ten for some time, a policy that has provoked some controversies (cf. [9]).

Interesting pictures

Regarding information depth:

Regarding the cost of non determinism:

Regarding back links:

12 years later, no much has changed.

LSI: How Many Dimensions to Keep?

February 13, 2008

In How to Populate a Matrix for SVD I referred readers to Igvita’s great blog posts on SVD. A recent visit to the blog shows it is still very much alive and equally interesting. The issues been discussed are not really new, though.

When we lecture on SVD an issue that soon or later arises is how many dimensions k to keep. A recent visitor of the aforementioned blog finally raised the same question.

Can you pls give me a clue as to how we decide how many dimensions to project our data onto when using SVD?

How many dimenisions to keep is the so-called Rank k Approximation that often leads to the dreaded dimensionality reduction curse in which performance can be compromised.

In the Latest SEO Incoherences (LSI) post we mentioned that this issue was already addressed by Dr. Susan Dumais, many times, and througout her first papers and talks on LSI. In that post we referred readers to Dumais’s talk Transcription of the Application presentation by Susan Dumais, Bellcore (now at Microsoft). That talk is now a classic in the history of LSI.

In those days Dumais approach was simply “by seat of the pants“:

Let me end, as my time is running out, with some of the statistical issues that we have encountered and that I hope you have some hints about. The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it doesn’t work. You know when you have too few dimensions. We would like some better methods for doing this, things like the scree test don’t seem to correspond very well to behavioral data that we have.

Later during the QA session participants revisited this issue. Let us reproduce participants-Dumais QA:

PARTICIPANT: Thank you, Susan. Questions from the floor?

PARTICIPANT: I’m a little nervous that if someone was browsing the Web and we hoped to put some of this material in the Web, that we’re in trouble. We’re talking about seat of the pants and underwear models, that people are going to get the wrong context for why we’re here. But that is part of the big problem that Susan is talking about.

PARTICIPANT: I thought I would just mention an entirely different approach to this problem, with Joe (word lost) at EDS. What we’re doing is –

PARTICIPANT: Can you get to a mike?

PARTICIPANT: We are using a poisson model for the word counts. Then we’re interested in finding maximum likelihood estimates for the clustering, and we found various combinations of simulated annealing and markup chain Monte Carlo to work very well with funding these things.

One of the nice things in a model based approach is that you get natural measures of association rather than just SVD types of things, although it could be slower.

PARTICIPANT: I think one thing we will try to ask everyone after the conference is to send us electronically two or three references of relevant work that we can disseminate in this way, because we do hope to learn about new approaches and new methods and related work. So keep that in mind as the discussion progresses. We will send out E-mail requesting those in electronic form.

PARTICIPANT: (Comments off mike.)

DR. DUMAIS: It is first of all not clear that the 300 or 400 dimensions we have used for the trek databases is optimal. We find that performance is still increasing up to 400 dimensions; it may well increase beyond that.

In fact, I should mention that if you plot performance as a function of number of dimensions, what you get is an inverted U function that is heavily skewed. That is, performance increases dramatically as you go from 20 or 30 up to several hundred dimensions, and then it tails off gradually through the level of performance that you see with raw key word matching, which is the full dimensional solution.

We don’t know that we have reached the peak. In problems where we know what the optimal number of dimensions is, we have found that the peak is not so sharp.

Twenty years later (first LSI papers saw the light in 1988, not in 1990 as some SEOs have incorrectly claimed) a lot of research advances on SVD in relation to LSI have been published. Old IR ideas regarding LSI have been dropped and new ones have been adopted. That is what research is all about.

Still, the issue of how many dimensions to keep is still an open issue and a “by seat of the pants” one. All kind of things and guidelines have been tried. But at the end we need to test and retest the system under examination.

I even have tested my own guideline: keep the top k singular values that amount to more than X percent of the trace of the S matrix; where

S is the matrix of singular values.
X is a threshold value, usually 80-85%

But, again, some would ask: why 80%? Why not 90%, 70%, 60%, etc?

While the above guideline works for many systems, I have trepidated on some systems in which the above threshold is not good. So I always come to “find X experimentally or by seat of the pants”.

We could inspect this as an optimization problem and use Nelder-Mead Multivariative Sequential Simplex Optimization, but I haven’t tried this yet. I’m not sure if this is the way to go either, but might be worth to test.

Another idea is to iteratively update-test-update-test the matrix using any of the current SVD updating methods for several X values. I need to spare some time on this one to see what comes out.

 I’m also open to suggestions.

For those interested, a 1.0 Mb download of Dumais’s 1995 presentation is available. If you have problems downloading it, let me know. I can send you a zip file.

April Kontostathis, from Ursinus College, in Essential Dimensions of Latent Semantic Indexing (LSI), proposes an interesting approach to address aspects of this problem. She illustrates her approach with a model wherein term weights are computed using a well known base-2 LOG model for local weights combined with the ENTROPY model for global weights.

More work is still needed along these lines.

ECIR Tutorial on Terrier

February 12, 2008

Craig Macdonald, University of Glasgow, sent me the Call for Participation given below. Looking at the key speakers, I see that the great Amit Singhal, from Google will be there. These days when many marketers are busy trying to manipulate PageRank through nofollow gymnastics, they all miss the bigger picture and the work of the faces behind it.

I wish I can be at ECIR.

ECIR 2008 Tutorial - Researching and building IR applications using Terrier

Description:
This tutorial introduces the main design of a large-scale IR system, and uses the Terrier platform as an example of how one should be built. We detail the architecture and data structures of Terrier, as well as the weighting models included, and describe, with examples, how Terrier can be used to perform experiments and extended to facilitate new research and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

For more details, see http://ecir2008.dcs.gla.ac.uk/tutorial_rb.html.

Organisers:
Craig Macdonald, University of Glasgow, UK
Ben He, University of Glasgow, UK

Registration:
http://ecir2008.dcs.gla.ac.uk/registration.html

Web Mining Week 11

February 11, 2008

Week 11 Agenda

Web Mining Course Final Examination

Please be on time and bring with you:

1. a #2 pencil and eraser. Pens are not allowed.
2. a scientific calculator. PCs, laptops, software are not allowed.

Notes, books, or additional material are not allowed.

The test is based on material covered in the lectures, take home tests, and reading material provided. It consists of two parts and as follows:

Part 1 consists of multiple choice exercises (40%).
Part 2 consists of problem solving exercises (60%).

Please review how-to linear algebra calculations.

LSI data will be provided so you won’t need to use an SVD calculator during the test.

Good luck and thank you for taking this graduate course.

Cheers

Dr. E. Garcia

Keyword Density, SEOs, and the Deception War

February 7, 2008

I’m happy that at this Sphinnessed post: http://www.searchenginepeople.com/blog/how-search-really-works-the-keyword-density-myth.html , several SEOs are finally waking up and getting the Keyword Density Myth.

Great to see that they are realizing what IR grad students already know (http://irthoughts.wordpress.com/2008/01/25/the-power-of-document-linearization/ ):

That KD is not even a cost-effective tool for detecting spam, as search engines can use local term weight models that in addition to scoring terms can attenuate word frequencies. More on this here http://irthoughts.wordpress.com/2007/05/09/keyword-density-the-devils-advocate/  and here http://irthoughts.wordpress.com/2007/05/07/keyword-density-kd-revisiting-an-seo-myth/   

Such models effectively minimize spurious effects/advantages derived from keyword repetition. Some of these are LOGA, LOGN, ATF1, and SQRT. One could even use the idea of global ENTROPY and propose a local ENTROPY model to neutralize any attempt to misrepeating terms. All these have been discussed in my Web Mining course.

Based on the aforementioned links, it is clear that the Search Engine War never ends, especially when in addition to their spam tactics, marketers are proposing soooo many theories made out of thin air. What is worse is that from time to time they induce their peers and cheerleaders to buy into their Latest SEO Incoherences (”LSI”).

The recent round of nonsense surprisingly comes from alleged “SEO experts”. From claims about sculping PageRank (http://sphinn.com/story/26410 ) to the usual LSI non-sense (http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) to Keyword Density (http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/ ), the Deception War never ends.

I’m glad at AIRWEB we fight these folks (http://airweb.cse.lehigh.edu/2008/cfp.html ). Irronically, what Google’s Cutts calls “proper user” of nofollow link attributes and “sculping” is just another way of saying end user’s human manipulation attempts. Great pr exercises.

ECIR 2008 2nd Call for Papers

February 6, 2008

Dr. Ellen Voorhees, from TREC over at NIST sent me this Call for Papers:

Subject: ECIR 2008 Call For Participation

2nd CALL FOR PARTICIPATION

*Full Programme and Accommodation Details Now Online*
** Early Bird Registration Deadline: 29th February 2008 **

30th European Conference on Information Retrieval (ECIR 200 8) 30th March - 3rd April University of Glasgow
http://ecir2008.dcs.gla.ac.uk/ 

Organized by: University of Glasgow
In cooperation with: BCS-IRSG, ACM SIGIR

* Conference programme (31st March - 2nd April) The technical program will include 33 full research papers, 19 short papers, and 28 posters. Full programme is available at
http://ecir2008.dcs.gla.ac.uk/programme.html     

Tutorials and workshops (30th March)
http://ecir2008.dcs.gla.ac.uk/tutorials_workshops.html    
ECIR 2008 will host three half-day tutorials and three full-day workshops
Tutorial: Advanced language modeling approaches (case study: expert
search) -  Djoerd Hiemstra, University of Twente, The Netherlands
Tutorial: Search and discovery in user-generate text content - Maarten de Rijke and Wouter Weerkamp, ISLA, University of Amsterdam, The Netherlands
Tutorial: Research and building IR applications using Terrier - Craig Macdonald and Ben He, University of Glasgow, UK
Workshop: Workshop on novel methodologies for evaluation in Information Retrieval - organized by Mark Sanderson, University of Sheffield, UK, Martin Braschler, Zurich University of Applied Sciences, Switzerland, Nicola Ferro, University of Padova, Italy, and Julio Gonzalo, UNED, Spain
Workshop: Efficiency issues in Information Retrieval Workshop - organized by Roi Blanco, Universidade da Coruña, Spain and Fabrizio Silvestri, ISTI CNR, Italy
Workshop: Exploiting semantic annotations in Information Retrieval - Omar Alonso, A9.com, USA and Hugo Zaragoza, Yahoo! Research Barcelona, Spain

*Industry day (3rd April)
http://ecir2008.dcs.gla.ac.uk/industry.html     
A day of presentations and discussion dedicated to the interests and needs of Information Retrieval practitioners.

* Keynote speakers
ECIR 2008 is pleased to announce the following three keynote speakers:
Professor Nicholas J. Belkin Rutgers, University, USA - Some(what) Grand Challenges for Information Retrieval Professor Bettina Berendt, K.U. Leuven, Belgium - You are a document
too: Web mining and IR for next-generation information literacy Dr Amit Singhal, Google Research Fellow, Google, USA - Web Search:
Challenges and Directions

* Paper awards
Best paper award, best student paper award and best poster award, all sponsored by Yahoo! Research

* Social events
A social event every day, including a Banquet reception on 31st March at the Kelvingrove Museum and a civic reception at the City Chambers on 1st April. Further details will be posted on the ECIR 2008 Web site.

* Registration
For registration and other information, please visit the ECIR 2008
home page at http://ecir2008.dcs.gla.ac.uk/registration.html   

* Main sponsors
Google  http://www.google.com     
Microsoft Research  http://research.microsoft.com    
Yahoo! Research  http://research.yahoo.com     
MatrixWare Information Services  http://www.matrixware.com    

We are looking forward to welcoming you in Glasgow,

The ECIR 2008 Chairs
(Ian Ruthven, Vassilis Plachouras, Ryen White and Iadh Ounis)

Web Mining Week 10

February 4, 2008

Week 10 Agenda 

Intelligence through JavaScript Spam Techniques: From Doorways to Click Throughs Fraud (PPT presentation) - Covers several scripts for injecting redirection mechanisms to the Web surfing experience of end users. Includes also click fraud scripts and scripts for faking Web traffic.

Co-Occurrence Theory: From C-Indices to EF-Ratios (PPT presentation) - Covers the latest on co-occurrence theory in relation to c-indices and EF-Ratios and how to use these metrics for site optimization and keyword-brand association campaigns.

Review of Test 3 and Bonus.

Required Reading Material

1. http://airweb.cse.lehigh.edu/2007/papers/paper_115.pdf

2. http://www.miislita.com/semantics/c-index-2.html

Searchmageddon: Microsoft to Buy Yahoo!

February 1, 2008

As we mentioned few days ago,

http://irthoughts.wordpress.com/2008/01/23/microsofts-black-cloud-on-yahoos-seo-tag-clouds/ ,

Microsoft is finally buying Yahoo! Check here:

http://online.wsj.com/article/SB120186786283735047.html?mod=hpp_us_whats_news

As Jeremy Sawodny, I predict many old folks will walk rather than working for Bill Gates while duplicated positions will be eliminated. With already thousand being terminated at Yahoo!, more will soon follow.

The Final Search Battle is coming and its name is SEARCHMAGEDDON.

“And I saw from the mouth of the operating system, and the mouth of the database     and the mouth of the paid advertisers, three unclean spirits… the spirits of marketers, spammers, and hackers working signs, and they go forth to the kings of the whole network, to gather them to battle against the great day of the Almighty GoOGLE.  And Stanford shall gather them together into a place which in digital business is called SEARCHMAGEDDON.” –Internet 2:1, 2008.