IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: December 2007

Until Next Year

28 Friday Dec 2007

Posted by egarcia in Miscellaneous, Newsletters

Well, this was an incredible year.

I participated in several international conferences, changed ISPs, and went back to teaching at the graduate school and to conducting academic research; I also gained new friends from all over the world.

Next year I have several conferences and activities to take care of, a new graduate course titled Search Engines Architecture to teach next Spring, and a few consulting projects to handle.

The IRW Newsletter should arrive in subscribers’ inboxes early: today.

I’m taking a few days off. Until Next Year.

Cheers,

Dr. E. Garcia

Preview of IRW-2008-1: Association Clusters

27 Thursday Dec 2007

Posted by egarcia in Newsletters

Here is a preview of IRW-2008-1.

In this issue of the IRW newsletter we discuss association clusters in the context of keyword discovery.

Introduction
What are Clusters?
A “Reaction” Equation Approach
The Term-Document Matrix
The Document-Term Matrix
The Term-Term Co-Weight Matrix
What is a Similarity Matrix?
A Nomenclature Convention
Computing the Similarity Matrix
Identifying Association Clusters
Association Clusters: Some Applications
News, Research, and Events
Terms of Use and Copyright

The material covered is an adaptation of one of my Web Mining Course lectures.

Kendall’s Test and PageRank

26 Wednesday Dec 2007

Posted by egarcia in Machine Learning, SEO Myths

For years, marketers have called PageRank-based models into question via “sandbox” theories and similar tales.

This older paper might help address the question of whether these models are biased: Paradoxical Effects in PageRank Incremental Computations.

What interests me most about the paper is how the authors put Kendall’s τ to use. The paper was written by Paolo Boldi, Massimo Santini, and Sebastiano Vigna; its abstract states:

“Abstract. Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using Kendall’s τ. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.”
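To make the comparison step in the abstract concrete, here is a minimal sketch of Kendall’s τ between two PageRank score assignments. It assumes the simplest variant (often called tau-a), restricted to the nodes present in both rankings, with tied pairs counted as neither concordant nor discordant; the paper may treat ties differently. The function name kendall_tau and the sample scores are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts (node -> score).

    Only nodes present in both dicts are compared. A pair of nodes is
    concordant when both rankings order it the same way and discordant
    when they disagree. Tied pairs are ignored (an assumption of this sketch).
    """
    nodes = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for u, v in combinations(nodes, 2):
        sign = (scores_a[u] - scores_a[v]) * (scores_b[u] - scores_b[v])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(nodes) * (len(nodes) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Hypothetical scores: PageRank on the full graph vs. on a partial crawl.
full_graph = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
partial_crawl = {"a": 0.5, "b": 0.2, "c": 0.3}
print(kendall_tau(full_graph, partial_crawl))
```

A value of 1 means the partial crawl orders the visited nodes exactly as the full graph does, and -1 means the order is completely reversed.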

The authors also state:

“The most classical visit strategies are the following:

• Depth-first order: the crawler chooses the next page as the last that was added to the frontier; in other words, the visit proceeds in a LIFO fashion.
• Random order: the crawler chooses randomly the next page from the frontier.
• Breadth-first order: the crawler chooses the next page as the first that was added to the frontier; in other words, the visit proceeds in a FIFO fashion.
• Omniscient order (or quality-first order): the crawler uses a queue prioritised by PageRank values [Cho et al. 98]; in other words, it chooses to visit the page with highest quality among the ones in the frontier. This visit is meaningless unless a previous PageRank computation of the entire graph has been performed before the visit, but it is useful for comparisons. A variant of this strategy may also be adopted if we have already performed a crawl and thus we have the (old) PageRank values of (at least some of the) pages.”

“Both common sense and experiments (see, in particular, [Boldi et al. 05]) suggest that the visits listed above accumulate PageRank in an increasingly quicker way. This is to be expected, as the omniscient visit will point immediately to pages of high quality. The fact that the breadth-first visit yields high-quality pages was noted in [Najork and Wiener 01].”

“There is, however, a different and also quite relevant problem that has been previously overlooked in the literature: if we assume that the crawler has no previous knowledge of the web region it has to crawl, it is natural that it will try to detect page quality during the crawl itself, by computing PageRank on the region it has just seen. We would like to know whether a crawler doing so will obtain reasonable results or not.”
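The four visit strategies quoted above differ only in how the next page is picked from the frontier. The following is a minimal sketch under stated assumptions: the web graph is a small in-memory dict of page -> outgoing links, the quality dict stands in for precomputed PageRank values, and the function name crawl_order is hypothetical. It uses a plain list as the frontier for readability, which is fine for a toy graph but not for a real crawler.

```python
import random


def crawl_order(graph, seed, strategy, quality=None, rng=None):
    """Return the order in which pages are visited, starting from seed.

    strategy: "depth" (LIFO), "breadth" (FIFO), "random",
              or "omniscient" (highest quality first).
    quality:  page -> score, only used by the omniscient strategy.
    """
    rng = rng or random.Random(0)
    quality = quality or {}
    frontier = [seed]  # pages discovered but not yet visited, in insertion order
    seen = {seed}
    order = []
    while frontier:
        if strategy == "depth":
            page = frontier.pop()       # last page added
        elif strategy == "breadth":
            page = frontier.pop(0)      # first page added
        elif strategy == "random":
            page = frontier.pop(rng.randrange(len(frontier)))
        elif strategy == "omniscient":
            best = max(range(len(frontier)),
                       key=lambda i: quality.get(frontier[i], 0.0))
            page = frontier.pop(best)
        else:
            raise ValueError("unknown strategy: " + strategy)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Toy graph and made-up quality scores, for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
toy_quality = {"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.5, "e": 0.1}
for name in ("depth", "breadth", "random", "omniscient"):
    print(name, crawl_order(toy_graph, "a", name, quality=toy_quality))
```

Stopping each visit early and computing PageRank on the pages seen so far would give the partial rankings that the authors compare against the full-graph ranking.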

Local Newspapers Spamming Search Engines

23 Sunday Dec 2007

Posted by egarcia in Spam

Back in the ’90s, spammers used to stuff keywords into web page meta tags to induce search engines to believe that the pages were relevant to those terms. Search engines then found ways to neutralize this type of keyword stuffing.

I was wondering which of our local newspapers are sticking to the idea of stuffing keywords in meta tags. So, I conducted a little exercise.

For this report, I used an experimental crawler I’m working on. The crawler was instructed to capture meta tags, only.
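For readers who want to reproduce this kind of report, here is a minimal sketch of a meta tag extractor built on Python's standard html.parser module. It is not the experimental crawler mentioned above; the names MetaTagCollector, meta_report, and keyword_count are hypothetical, and fetching the page over HTTP is left out so the example works on an HTML string.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects the attributes of every <meta> tag in a document."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            self.metas.append(dict(attrs))


def meta_report(html):
    """Return a list of dicts, one per <meta> tag, in document order."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.metas


def keyword_count(metas):
    """Count comma-separated entries in the first keywords meta tag."""
    for meta in metas:
        if (meta.get("name") or "").lower() == "keywords":
            content = meta.get("content") or ""
            return len([k for k in content.split(",") if k.strip()])
    return 0


# A shortened, made-up page head for illustration.
sample_html = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    '<meta name="Description" content="VOCERO">'
    '<meta name="Keywords" content="El Vocero, periodico, Puerto, Rico">'
    '</head><body><p>hi</p></body></html>'
)
report = meta_report(sample_html)
print(len(report), "results found.")
print(keyword_count(report), "keywords")
```

A very large keyword count relative to the visible content of the page is a quick signal of the stuffing discussed in this post.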

Here is a comparison as of 12-23-07. The first and the last newspapers redirect users from their index page to a secondary page. It is clear from these results that local web designers in Puerto Rico are sticking to old-fashioned spamming techniques like keyword stuffing, which, by the way, no longer work with search engines.

Newspaper: El Nuevo Dia ( http://www.elnuevodia.com/noticias )

Meta Report

4 results found.

1. meta http-equiv=”Content-Type” content=”text/html; charset=iso-8859-1″
2. meta content=”600″ http-equiv=”refresh”
3. meta name=”Description”
4. meta name=”Keywords” content=”ENDI El Nuevo Dia, periodico, Puerto, Rico, internet, noticias, boricua, Clima, Horoscopo, Coqui, Sapo,Concho,Telefonica,Coqui net,RonNueva York ,Horoscopo,Coqui,Sapo Concho,Telefonica,Coqui net,Ron,Bacardi,Estatus,Sila Maria Calderon,Menudo,Carlos Romero Barcelo,Pedro Rossello,Anibal Acevedo Vila,Daddy Yankee,Tego lderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Marcony,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne,Educacion,Cine,Entretenimiento,Ejercicio,Bienestar,Recetas,

Recetarios,Musica,Boricuas,Carlos Arroyo, Islanders,TUTV,Motoristas,Internautas ,Medicos,Historiadores,Venezuela,Santo Domingo,Republica Dominicana,Cuba ,Queens,Manhattan,Bronx,Nueva York ,Espiritualidad,Maestros,Televicentro,Telemundo,Univision,Bellas Artes,Telenovela,Comunidades,Isla,Culebrita,St. Thomas,Isla Nena,Culebra,Isla Mona,Vieques,Filiberto Ojeda,Capitolio,Turismo,Periodico,Periodismo,Veteranos,Marina,Soldados,

Senado,Energia Electrica,Plena,Bomba,Reggaeton,Wisin y Yandel,Casinos,Salsa,Construccion,Vacaciones,Jazz,Tito Trinidad,Oscar de la Hoya,Millie Corretjer,Roberto Clemente,San Juan,Miguel Cotto,Becas,Tercera Edad,Museos,Clasificados,Clasificados online,Clasificados en linea,Boletines,Servicios de noticias,Luis A. Ferre,Ferre Rangel,Hepatitis,Dengue,Diabetes,Obituarios,Obesidad,Librerias,Libros,Viejo San Juan,El Morro,Ballaja,Museo de Arte de Ponce,Ponce,Mayaguez,Parque Indigena,Observatorio de Arecibo,Tainos,El Yunque,El tunel Guajataca,Zoologico,RUM,UPR,Sagrado Corazon,Interamericana,Universidades,Colegios,Escuelas,Fajardo,Bahia,Luquillo,

Bioluminiscente,La Parguera,Carlos Delgado,Igor Gonzalez,Olga Tanon,Piculin Ortiz,Primera Hora,Zonai,Virtual,Beisbol,Mundial ,Raul Papaleo,Sondeos,Justas,Pavas,Palmas,Munoz Marin,Lyann Puig,Elaine Lopez,Javier Lopez,Remi,Poesia ,Olga Nolla,Rosario Ferrer,Mayra Montero”

Newspaper: El Vocero ( http://www.vocero.com )

Meta Report

3 results found.

1. meta http-equiv=”Content-Type” content=”text/html; charset=iso-8859-1″
2. meta name=”Description” content=”VOCERO”
3. meta name=”Keywords” content=”El Vocero, periodico, Puerto, Rico, internet, noticias, locales, Negocios, Deportes, Escenario, Extra, Triunfo, Habitat, Suplementos, Editorial, Comentarios, Juegos, Esquelas, Suscripcion, Galeria, Titulares, Cucubano, Humor, Cheo, Tiempo, Horoscopo, Coqui, Sila Calderon, Menudo,Daddy Yankee,Tego Calderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne, Educacion, Cine, Entretenimiento, Telenovela, Comunidades, Isla, Culebrita, St. Thomas, Isla Nena, Culebra, Isla Mona, Vieques, Filiberto Ojeda, Capitolio, Turismo, Periodico, Periodismo, Soldados, Senado, Energia Electrica, Plena, Bomba, Reggaeton, Casinos, Salsa, Construccion, Vacaciones, Jazz, Tito Trinidad, Oscar de la Hoya, Millie Corretjer, Roberto Clemente, San Juan, Miguel Cotto, Becas, Museos, Clasificados, Clasificados online, Clasificados en linea, Boletines, Servicios de noticias, Luis A. Ferre, Hepatitis, Dengue, Diabetes, Viejo San Juan, El Morro, Ballaja, Museo de Arte de Ponce, Roca, Ponce, Mayaguez, Parque Indigena, Observatorio, Arecibo, Tainos,El Yunque, Guajataca, Zoologico, RUM, UPR, Sagrado Corazon, Interamericana, Universidades, Colegios, Escuelas, Fajardo, Bahia, Luquillo, Bioluminiscente, La Parguera, Carlos Delgado, Igor Gonzalez, Olga Tanon, Piculin Ortiz, Beisbol, Mundial ,Raul Papaleo, Sondeo, Justas, Pavas, Palmas, Muñoz Marin, Coquito, Coco”

Newspaper: PrimeraHora ( http://www.primerahora.com/home )

Meta Report

3 results found.

1. meta http-equiv=”Content-Type” content=”text/html; charset=iso-8859-1″
2. meta name=”Description” content=”Primera Hora”
3. meta name=”Keywords” content=”Periódicos, Periódico Primera Hora, Puerto Rico, Isla, Caribe, San Juan, Viejo San Juan, Vieques, Regueton, Reggaeton, Daddy Yankee, Wisin y Yandel, Don Omar, Ricky Martin, Chayanne, Calle 13, JLo, Jennifer Lopez, Lucha Libre, Sexo, Boricua, Borinquen, Boriken, Noelia, Salsa, El Gran Combo, Playa, Isla Nena, Política, Gobernador, Estado Libre Asociado, ELA, Bombón, Bikinis, Modelos, Béisbol, Baloncesto, Volibol, Hipismo, Boxeo, Tito Trinidad, Miguel Cotto, Iván Calderón, De la Hoya, Héctor Lavoe, Marc Anthony, Cortometrajes, Web Concerts, Conciertos, Videos musicales, Video juegos, RBD, Miss Universe, Construcción, Casas, Apartamentos, Autos, Noticias, Asesinatos, Violencia, Shakira, Mujer, Salud, Niños, Mascotas, Restaurantes, Cine, Moda, Horóscopo, Paris Hilton, Britney Spears, Blogs, Tecnología, E Cards, Coqui, Clima, Recetas, Chats, Cielito Rosado, Rukmini, Juanes, El Nuevo Día, Grupo Ferre Rangel”

From Keyword Density to William Tutte’s Legacy

20 Thursday Dec 2007

Posted by egarcia in Latent Semantic Indexing, SEO Myths, Spam

From Keyword Density to Keyword Distribution

Finally we have the Christmas Break from graduate school.

In my last Web Mining Course lecture before the Christmas Break, I tried to explain to students the importance of incorporating word spacing in information retrieval algorithms and in document relevance assessments. I explained why ideas like SEOs’ keyword density (KD), the traditional local term weight model known as FREQ (Term Count) and used in early papers on Vector Space and LSI models, and the like are poor estimators of document relevance.

Among other theoretical reasons, we discussed that a term mentioned X times is not necessarily X times more important than other terms. In addition, KD and the term count model cannot attenuate frequencies. We then discussed several frequency attenuation models (keyword spam filters) that also work as term weight scoring models. These can dampen the effect of abnormal repetition of terms, raise a spam flag, and do not require any reference to KD “tales”.
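To illustrate the attenuation argument, here is a minimal sketch that compares keyword density and raw term count with two dampened local weights. The logarithmic and square-root forms below are common textbook choices used here as assumptions; they are not necessarily the exact models covered in the lecture, and the function names are hypothetical.

```python
import math


def keyword_density(count, total_words):
    """Fraction of the document taken up by the term."""
    return count / total_words if total_words else 0.0


def freq_weight(count):
    """Raw term count: 100 mentions score 100 times one mention."""
    return float(count)


def log_weight(count):
    """Logarithmic dampening: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def sqrt_weight(count):
    """Square-root dampening of the raw count."""
    return math.sqrt(count) if count > 0 else 0.0


for count in (1, 10, 100):
    print(count, freq_weight(count),
          round(log_weight(count), 3), round(sqrt_weight(count), 3))
```

Under the raw count, repeating a term 100 times multiplies its weight by 100; under the logarithmic form the same repetition raises the weight from 1 to roughly 5.6, which is the kind of attenuation that blunts keyword stuffing.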

We also discussed several scenarios in which one could use word distributions and co-occurrence to analyze textual information, with far better results than with the aforementioned “crapstimators”. For instance, word spacing can be used in encryption/steganographic algorithms to uncover hidden messages, and to profile writing styles or people, attribute authorship of text, and assess plagiarism, fraud, etc.
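As a small illustration of why distribution carries information that density does not, the sketch below records where a term occurs and the gaps between successive occurrences. It assumes a naive tokenizer (lowercased runs of word characters); the function names and sample texts are made up for this example.

```python
import re


def term_positions(text, term):
    """Token positions (0-based) at which term occurs in text."""
    tokens = re.findall(r"\w+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == term.lower()]


def term_gaps(text, term):
    """Distances, in tokens, between successive occurrences of term."""
    pos = term_positions(text, term)
    return [b - a for a, b in zip(pos, pos[1:])]


# Two made-up texts with the same length and the same count of "cheap".
stuffed = "cheap cheap cheap flights to rome and hotels in rome today"
spread = "cheap flights to rome and cheap hotels in rome are cheap"

print(term_gaps(stuffed, "cheap"))
print(term_gaps(spread, "cheap"))
```

Both texts have 11 tokens and three occurrences of the term, so their keyword density is identical, yet the gap lists differ: the first shows back-to-back repetition and the second shows evenly spaced use.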

I’m happy that not all SEOs are buying into the keyword density nonsense and similar “crapstimates”, as I can see from these SEOmoz posts.

From Keyword Distribution to William Tutte’s Legacy

This morning I came across a nice biography of one of those venerable giants: the late William Tutte. Beautifully written by Dan Younger, the biography is a tribute to Tutte’s greatness. Worth pointing out in relation to word spacing theory is this portion of Younger’s writing (emphasis added):

“Tutte’s great contribution was to uncover, from samples of the messages alone, the structure of the machines which generated these codes. This came about as follows. In August 1941, a German operator sent a Fish-enciphered teleprinter message of some 4000 letters from Athens to Berlin. For some reason, the message was not received properly and so it was resent. Against all guidelines, it was sent with the same setting. It was identical in content, but it differed slightly, in word spacing and punctuation. John Tiltman of Bletchley was able to use this blunder to find both the message and the obscuring string that was added to make up the enciphered message. But that seemed to be all that could be found, when Tutte was presented with the case in October.”

“Tutte began by observing the machine generated obscuring string carefully. Splitting it up into various lengths, he noticed signs of periodicity. For the first of the five teleprinter tape positions, the regularity he supposed arose from a wheel of 41 sprockets. And then at the last position, one of 23 sprockets. Over the next months, Tutte and colleagues worked out the complete internal structure, that it had twelve wheels, two for each of the five teleprinter positions, and two with an executive function. They determined the number of sprockets on each wheel, and how the advancement of the wheels was interrelated. They had completely recreated the machine without ever having seen one. Tony Sale, who first described this work in a 1997 article in New Scientist, characterized it as the “greatest intellectual feat of the whole war.”

“Knowing the structure of the enciphering machine is a necessity for code-breaking, but it is only the first step. Tutte then put himself to creating an algorithm to find from the enciphered messages the initial settings of the machine wheels. The algorithm that he created, the “Statistical Method”, looked for certain types of resonances, but it had to consider far too many possibilities to be carried out by hand. So it was that, in 1943, the electronic computer COLOSSUS was designed and built by the British Post Office. It was to run the algorithms that Tutte, and his collaborators Max Newman and Ralph Tester, developed, that COLOSSUS was created. This man-machine combination was used to break Fish codes on a regular basis throughout the remainder of the War.”

I hope you understand now the title of this post.

 In today’s Web the enciphering machines are search engines, but the underlying principles driving the Search Engines War are the same.

Emphasized words should make sense to students of the Web Mining course.

TREC 2008 Call for Papers

19 Wednesday Dec 2007

Posted by egarcia in Conferences

Ellen Voorhees, TREC Project Manager and Group Manager over at NIST.gov, sent me the TREC 2008 Call for Papers. To facilitate dissemination and promote the event, I’m reproducing the Call below.

CALL FOR PARTICIPATION

TEXT RETRIEVAL CONFERENCE (TREC)

February 2008 – November 2008

Conducted by:
  National Institute of Standards and Technology (NIST)

With support from:
  Intelligence Advanced Research Projects Activity (IARPA)

The Text Retrieval Conference (TREC) workshop series encourages
research in information retrieval and related applications by
providing a large test collection, uniform scoring procedures,
and a forum for organizations interested in comparing their
results. Now in its seventeenth year, the conference has become
the major experimental effort in the field. Participants in
the previous TREC conferences have examined a wide variety
of retrieval techniques and retrieval environments,
including cross-language retrieval, retrieval of web documents,
multimedia retrieval, and question answering. Details about TREC
can be found at the TREC web site, http://trec.nist.gov .

You are invited to participate in TREC 2008. TREC 2008 will
consist of a set of tasks known as “tracks”. Each track focuses
on a particular subproblem or variant of the retrieval task as
described below. Organizations may choose to participate in any or
all of the tracks. For most tracks, training and test materials are
available from NIST; a few tracks will use special collections that
are available from other organizations for a fee.

Dissemination of TREC work and results other than in the (publicly
available) conference proceedings is welcomed, but the conditions of
participation specifically preclude any advertising claims based
on TREC results. All retrieval results submitted to NIST are
published in the Proceedings and are archived on the TREC web site.
The workshop in November is open only to participating groups that
submit retrieval results for at least one track and to selected
government personnel from sponsoring agencies.

Schedule:
---------

By February 21 — submit application described below
  to NIST. Returning an application will add you to
  the active participants’ mailing list. On Feb 25,
  NIST will announce a new password for the “active
  participants” portion of the TREC web site. Included
  in this portion of the web site is information regarding
  the permission forms needed to obtain the TREC document
  disks.

Beginning March 1 — document disks used in some existing
  TREC collections distributed to participants who have
  returned the required forms. Please note that no disks
  will be shipped before March 1.

mid-July–mid-August — results submission deadline for most tracks
  (Results deadline may need to be even earlier for
  some tracks depending on assessor availability.
  Specific deadlines for each track will be included in
  the track guidelines, which will be finalized in the spring.)

September 9 (estimated) — speaker proposals due at NIST.

September 30 (estimated) — relevance judgments and individual
  evaluation scores due back to participants.

Nov 18-21 — TREC 2008 conference at NIST in Gaithersburg, Md. USA

Task Description:
-----------------
Below is a brief summary of the tasks. Complete descriptions of
tasks performed in previous years are included in the Overview
papers in each of the TREC proceedings (in the Publications section
of the web site).

The exact definition of the tasks to be performed in each track for
TREC 2008 is still being formulated. Track discussion takes place
on the track mailing list. To be added to a track mailing list,
follow the instructions for contacting that mailing list as
given below. For questions about the track, send mail to the
track coordinator (or post the question to the track mailing list
once you join).

TREC 2008 will have one new track and four continuing tracks.
The new track is the relevance feedback track, a track
that will systematically explore the factors that affect
relevance feedback behavior. The blog, enterprise, legal, and
million query tracks will continue in TREC 2008, though the
specific tasks in a track may differ from year to year.
(Note that the QA track has been moved to the new Text Analysis
Conference (TAC); the call for participation in TAC
will be sent to the TREC friends mailing list.)

Blog Track — The purpose of the blog track is to explore information
  seeking behavior in the blogosphere.

Track coordinator: Iadh Ounis, ounis@cs.gla.ac.uk
  Mailing list: send a mail message to listproc@nist.gov
  such that the body consists of the line
  subscribe trec-blog

Enterprise Track — The purpose of the enterprise track is
  to study enterprise search: satisfying a user who is
  searching the data of an organization to complete some task.

Track coordinators: Nick Craswell, nickcr@microsoft.com
  Ian Soboroff, ian.soboroff@nist.gov
  Arjen de Vries, arjen@acm.org
  Track web page: http://www.ins.cwi.nl/projects/trec-ent
  Mailing list: send a mail message to listproc@nist.gov
  such that the body consists of the line
  subscribe trec-ent

Legal Track — The goal of the legal track is to develop search technology
  that meets the needs of lawyers to engage in effective discovery
  in digital document collections.

Track coordinators: Jason Baron, jason.baron@nara.gov
  Doug Oard, oard@umd.edu
  Track web page: http://trec-legal.umiacs.umd.edu
  Mailing list: contact oard@umd.edu to be added to the list.

Million Query Track — The goal of the “million query” track is to test
  the hypothesis that a test collection built from very many very
  incompletely judged topics is a better tool than a collection built
  using traditional TREC pooling.

Track web page: http://ciir.cs.umass.edu/research/million
  Mailing list: follow the instructions given on the track web page
  to join the email list ( million@cs.umass.edu )

Relevance Feedback Track — The goal of the relevance feedback track
  is to provide a framework for exploring the effects of different
  factors on the success of relevance feedback.

Track coordinators: Chris Buckley, cabuckley@sabir.com
  Steve Robertson, ser@microsoft.com
  Track web page: http://groups.google.com/group/trec-relfeed
  Mailing list: follow the instructions given on the track web page
  to join the email list

Conference Format
-----------------

The conference itself will be used as a forum both for presentation
of results (including failure analyses and system comparisons),
and for more lengthy system presentations describing retrieval
techniques used, experiments run using the data, and other issues
of interest to researchers in information retrieval. As there
is a limited amount of time for these presentations, the TREC
program committee will determine which groups are asked to speak
and which groups will present in a poster session. Groups that
are interested in having a speaking slot during the workshop
should submit a 200-300 word abstract in September describing
the experiments they performed. The program committee will use
these abstracts to select speakers.

As some organizations may not wish to describe their proprietary
algorithms, TREC defines two categories of participation.

Category A: Full participation
  Participants will be expected to present full details of system
  algorithms and various experiments run using the data, either in
  a talk or in a poster session.

Category C: Evaluation only
  Participants in this category will be expected to submit results
  for common scoring and tabulation, and present their results in
  a poster session. They will not be expected to describe their
  systems in minute detail, but will be expected to describe the
  general approach and report on time and effort statistics.

Data
----
Many of the existing TREC English collections (documents, topics,
and relevance judgments) are available for training purposes and
may also be used in some of the tracks. Parts of the training
collection (Disks 1-3) were assembled from Linguistic Data
Consortium (LDC) text, and a signed User Agreement will be required
from all participants. The documents are an assorted collection
of newspapers, newswires, journals, and technical abstracts.
The LDC has collected a more recent set of newswire material called
the AQUAINT collection (Disks 6&7); this collection will also
be available to TREC participants but is covered by a separate
User Agreement. A third Agreement is needed for the remaining
disks (4-5).

All documents are typical of those seen in a real-world situation
(i.e. there will not be arcane vocabulary, but there may be
missing pieces of text or typographical errors). For most tracks,
the relevance judgments against which each system’s output will be
scored will be made by experienced relevance assessors based on the
output of all TREC participants using a pooled relevance methodology.
See the Overview paper in the TREC-8 proceedings (on the TREC
web site) for a detailed discussion of pooling.

Response format and submission details:
---------------------------------------

Organizations wishing to participate in TREC 2008 should respond
to this call for participation by submitting an application.
IMPORTANT NOTE: Participants in previous TRECs who will participate
in TREC 2008 must submit a new application.

An application consists of the following five parts:
1. Contact information:
  * The full name of the main contact person from your organization.

* The full name of your organization. If you are not
  participating as a member of an organization,
  please specify “self”. If you know there is another group
  from your organization that will also participate in TREC 2008
  (for example, two groups from the same university),
  please give enough qualification in the organization name
  to distinguish the different groups.

* An organization/team name (up to 20 characters) used as a
  unique identifier for your group. You will need to use this
  identifier on correspondence to NIST (when requesting
  data or sending permission forms, for example), so
  remember it. This identifier will also be used to tag your
  runs when you submit them.

* Complete organization physical mail address—sufficient
  such that mail sent to that address will be accepted
  by the post office.

* Fully qualified phone and fax numbers for the main contact person.

* Fully qualified, valid email address for the main contact person.

* Exactly ONE fully qualified, valid email address to use
  for the TREC 2008 participants’ mailing list.
  Because of the overhead involved in maintaining the mailing list
  of active participants, only one email address per participating
  group will be added to the TREC 2008 participants’ list.
  We strongly encourage the use of a local mailing list at
  your institution that distributes TREC mail internally to
  project participants so that all involved see the mail sent
  to this list. TREC is run solely by email, so it is
  important that this address be valid and that mail
  sent to the address is read in a timely fashion.
  You may use the email address of the main contact person
  as the address for the mailing list, but please give
  it twice in the application so we are sure of your intentions.

2. Whether you have participated in TREC before. If so, please
  indicate the years you participated, otherwise indicate that you
  are new to TREC.

3. A one-paragraph description of your retrieval approach.

4. Whether you will participate as a Category A or a
  Category C group.

5. A list of tracks that you are likely to participate in.
  This is non-binding, but is helpful to know for planning.

There is no application form as such; a simple text file consisting
of this information by number is the application.
Please respond using only simple text (i.e., no pdf, word,
rtf, postscript, etc. files). We will not process your application
to participate in TREC 2008 unless it is complete.

All responses should be mailed to Lori Buckland, lori.buckland@nist.gov .
Any questions about conference participation, response format, etc.
should be sent to the general TREC email address, trec@nist.gov  .

AIRWeb 2008 Call for Papers

18 Tuesday Dec 2007

Posted by egarcia in Conferences

Search Engine Spammers, beware:

Here is the Call for Papers for The Fourth International Workshop on Adversarial Information Retrieval on the Web, to be held on April 22, 2008, in Beijing, China:

http://airweb.cse.lehigh.edu:80/2008/cfp.html

As in AIRWeb 2007, there will be a Web Spam Challenge. Let’s call it “Ethical Spamming”, a la “Ethical Hacking”. Indeed, to understand the spammer/hacker mentality you need to either act like one under controlled conditions or have been one in a previous life, so to speak.

Once again, I’ve been appointed a member of the Program Committee. To help promote the event, I’m reproducing their Call for Papers below.

Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is “search engine spamming” or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection.

We solicit both full and short papers on any aspect of adversarial information retrieval on the Web. Particular areas of interest include, but are not limited to:

Link spam
Content spam
Cloaking
Comment spam
Spam-oriented blogging
Click fraud detection
Reverse engineering of ranking algorithms
Web content filtering
Advertisement blocking
Stealth crawling
Malicious tagging

Proceedings of the workshop will be included in the ACM Digital Library. Full papers are limited to 8 pages; work-in-progress papers will be permitted 4 pages.

Web Spam Challenge
Last year we introduced a novel element at the workshop: a Web Spam Challenge for testing web spam detection systems. We will be holding the Web Spam Challenge again this year, using the WEBSPAM-UK2007 collection for Web Spam Detection which we anticipate being released in early January, 2008.

The collection includes a large set of web pages, a web graph, and human-provided labels for a set of hosts. We will also provide a set of features extracted from the contents and links in the collection, which may be used by the participant teams in addition to any automatic technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions (normal/spam) for all unlabeled hosts in the collection. Predictions will be evaluated and results will be announced at the AIRWeb 2008 workshop.

More information will be posted to http://webspam.lip6.fr/  

Timeline
15 February 2008: E-mail intention to submit a workshop paper (optional, but helpful)
22 February 2008: Deadline for workshop paper submissions
14 March 2008: Notification of acceptance of workshop papers
31 March 2008: Camera-ready copy due
31 March 2008: Challenge submissions due
22 April 2008: Date of workshop

Organizers and Program Committee
Organizers:

Carlos Castillo, Yahoo! Research
Kumar Chellapilla, Microsoft Live Labs
Dennis Fetterly, Microsoft Research

Program Committee:

Einat Amitay, IBM
András Benczúr, Hungarian Academy of Sciences
Paul-Alexandru Chiri, Uni Hannover
James Caverlee, Texas A&M University
Gordon Cormack, University of Waterloo
Nick Craswell, Microsoft Research
Matt Cutts, Google
Brian Davison, Lehigh University
Ludovic Denoyer, University Paris 6
Aaron D’Souza, Google
Edel García, Mi Islita.com
Natalie Glance, Nielsen BuzzMetrics
Antonio Gulli, Ask.com
Zoltán Gyöngyi, Stanford University
Monika Henzinger, Google
Pranam Kolari, Yahoo! Applied Research
Mark Manasse, Microsoft Research
Marc Najork, Microsoft Research
Alexandros Ntoulas, Microsoft Search Labs
Jan Pedersen, Yahoo! Search
Erik Selberg, Amazon
Torsten Suel, Polytechnic University
Mike Thelwall, University of Wolverhampton
Baoning Wu, Snap
Tao Yang, Ask.com
E-mail: airweb2008@cse.lehigh.edu

Web Mining Week 6

17 Monday Dec 2007

Posted by egarcia in Web Mining Course

Week 6 Agenda

Revisiting Understanding Ranking Algorithms (PPT Presentation)
 

Local Term Weight Models (PPT Presentation)
  Discussion of FREQ, BNRY, LOGN, LOGA, ATF1, SQRT, and other models.
 

Global Term Weight Models (PPT Presentation)
  Discussion of IDF, Prob IDF, Entropic, and other models.
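As a companion to the global weight models named in the agenda, here is a minimal sketch of three of them in forms that are widely used in the literature: IDF as ln(N/n), probabilistic IDF as ln((N - n)/n), and an entropy weight of 1 + sum(p ln p)/ln N. These forms are assumptions for illustration; the lecture slides and the required reading may define them with different logarithm bases or edge-case handling, and the function names are hypothetical.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency: ln(N / n). Returns 0.0 if n is 0."""
    return math.log(n_docs / doc_freq) if doc_freq else 0.0


def prob_idf(n_docs, doc_freq):
    """Probabilistic IDF: ln((N - n) / n).

    Undefined when n is 0 or n equals N; this sketch returns 0.0 there.
    """
    if doc_freq <= 0 or doc_freq >= n_docs:
        return 0.0
    return math.log((n_docs - doc_freq) / doc_freq)


def entropy_weight(counts_per_doc):
    """Entropy weight of one term, given its count in each of N documents.

    1.0 when the term is concentrated in a single document,
    0.0 when it is spread evenly over all documents.
    """
    n_docs = len(counts_per_doc)
    total = sum(counts_per_doc)
    if n_docs < 2 or total == 0:
        return 0.0
    acc = 0.0
    for count in counts_per_doc:
        if count > 0:
            p = count / total
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n_docs)


print(round(idf(1000, 10), 3), round(prob_idf(1000, 10), 3))
print(entropy_weight([5, 0, 0, 0]), entropy_weight([2, 2, 2, 2]))
```

A final term weight is typically the product of a local weight for the term in the document and a global weight such as one of these.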

Required Reading Material

http://csmr.ca.sandia.gov/~tgkolda/pubs/ornl-tm-13756.pdf  
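
For readers who want to experiment before class, the local and global weight models named in the agenda can be sketched in a few lines of Python. This is only a minimal sketch, not the course material: the formulas are written as I understand them from the term-weighting literature, so check them against the required reading before relying on them. The function names, the keys "PROB_IDF" and "ENTROPY", and the use of natural logarithms are assumptions made for illustration.

```python
import math


def local_weight(f, doc_freqs, model):
    """Local weight of one term in one document.

    f         -- raw frequency of the term in the document
    doc_freqs -- raw frequencies (all > 0) of the terms present in that document
    model     -- one of the labels from the agenda
    """
    if f <= 0:
        return 0.0                       # an absent term gets zero weight
    if model == "FREQ":                  # raw within-document frequency
        return float(f)
    if model == "BNRY":                  # binary: presence only
        return 1.0
    if model == "LOGA":                  # log-dampened frequency
        return 1.0 + math.log(f)
    if model == "LOGN":                  # log, normalized by the average frequency
        avg = sum(doc_freqs) / len(doc_freqs)
        return (1.0 + math.log(f)) / (1.0 + math.log(avg))
    if model == "ATF1":                  # augmented, normalized by the max frequency
        return 0.5 + 0.5 * f / max(doc_freqs)
    if model == "SQRT":                  # square-root dampening
        return math.sqrt(f - 0.5) + 1.0
    raise ValueError("unknown local model: " + model)


def global_weight(term_freqs, model):
    """Global weight of one term across a collection.

    term_freqs -- raw frequency of the term in each of the N documents
                  (zeros included); the term must occur at least once
    """
    N = len(term_freqs)
    n = sum(1 for f in term_freqs if f > 0)   # documents containing the term
    if n == 0:
        raise ValueError("term does not occur in the collection")
    if model == "IDF":                   # inverse document frequency
        return math.log(N / n)
    if model == "PROB_IDF":              # probabilistic IDF, undefined when n == N
        if n == N:
            raise ValueError("PROB_IDF is undefined for a term in every document")
        return math.log((N - n) / n)
    if model == "ENTROPY":               # 1 + sum(p * log p) / log N, needs N > 1
        total = sum(term_freqs)
        s = sum((f / total) * math.log(f / total) for f in term_freqs if f > 0)
        return 1.0 + s / math.log(N)
    raise ValueError("unknown global model: " + model)


# Toy data, invented for illustration only.
doc = [4, 1, 1]             # frequencies of the three terms in one document
term = [2, 0, 1, 0]         # frequency of one term in four documents
weight = local_weight(4, doc, "LOGA") * global_weight(term, "IDF")
print(weight)
```

A term weight is then the product of a local weight and a global weight, usually followed by a document length normalization step that this sketch leaves out.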

Resources on the Dark Web

14 Friday Dec 2007

Posted by egarcia in Homeland Security

A few days ago I reported on the Dark Web Project.

There is one section of that paper that reads (emphasis added):

“IV. Presentations in Seminars or Conferences (PowerPoint) – Password protected; please send request via email and provide a brief explanation of your interest.”

Clicking on the links that follow that statement triggers a history.go(-1) JavaScript call, which sends the browser back one page in its history. The source of the document shows a script that asks for the password (which is given as “ailab”) and the following partial paths to the documents:

publications/conf/WriteprintsandInkBlots.pdf
publications/conf/data%20mining%20webometric%20analysis%203aug05.pdf
publications/conf/SeminarGroupAuthorship.pdf
publications/conf/comparative_03_25_05.pdf
publications/conf/Dark%20Web%20200502.pdf
publications/conf/AASlidesMod.pdf
publications/conf/WebForum0712_2007.ppt
publications/conf/SpecializedContent_2007.ppt
publications/conf/ClearGuidance_2006.ppt

Other than sending end users back through their browser history, I’m not sure why they added this “password protected” feature, since simply prepending http://ai.arizona.edu/research/terror/ to the above paths allows one to access and download the documents anyway.
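
The point about the paths can be illustrated with a short Python sketch. It only resolves two of the partial paths listed above against the base URL quoted in this post and makes no network request. The names BASE_URL, PARTIAL_PATHS, and resolve_all are made up for the example.

```python
from urllib.parse import urljoin

# Base URL and partial paths as quoted in the post; nothing is fetched.
BASE_URL = "http://ai.arizona.edu/research/terror/"

PARTIAL_PATHS = [
    "publications/conf/WriteprintsandInkBlots.pdf",
    "publications/conf/Dark%20Web%20200502.pdf",
]


def resolve_all(base, paths):
    """Resolve each relative path against the base URL, as a browser would."""
    return [urljoin(base, path) for path in paths]


for url in resolve_all(BASE_URL, PARTIAL_PATHS):
    print(url)
```

Because the relative paths sit in the page source, a password check that runs only in the browser does not restrict access to the files themselves. That restriction would have to be enforced on the server.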

The article also points to the following great resources:

Reid, E. and Chen, H., “Mapping the Contemporary Terrorism Research Domain.” International Journal of Human-Computer Studies, 65, Pages 42-56, 2007.

Qin, J., Zhou, Y., Reid, E., Lai, G., Chen, H., “Analyzing Terror Campaigns on the Internet: Technical Sophistication, Content Richness, and Web Interactivity,” International Journal of Human-Computer Studies, 65, Pages 71-84, 2007.

H. Chen and F. Wang, “Artificial Intelligence for Homeland Security,” IEEE Intelligent Systems, Special Issue on Artificial Intelligence for National and Homeland Security, pp. 12-16, September/October 2005.

A. Abbasi and H. Chen, “Applying Authorship Analysis to Extremist-Group Web Forum Messages,” IEEE Intelligent Systems, Special Issue on Artificial Intelligence for National and Homeland Security, pp. 67-75, September/October 2005.

Zhou, Y., Reid, E., Qin, J., Lai, G., Chen, H., “U.S. Domestic Extremist Groups on the Web: Link and Content Analysis,” IEEE Intelligent Systems, Special Issue on Artificial Intelligence for National and Homeland Security, pp. 44-51, September/October 2005.

A. Abbasi and H. Chen, “Visualizing Authorship for Identification,” In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.

J. Wang, T. Fu, H. Lin, and H. Chen, “A Framework for Exploring Gray Web Forums: Analysis of Forum-Based Communities in Taiwan,” In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.

Y. Zhou, J. Qin, G. Lai, E. Reid, and H. Chen, “Exploring the Dark Side of the Web: Collection and Analysis of U.S. Extremist Online Forums,” In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.

A. Salem, E. Reid, and H. Chen, “Content Analysis of Jihadi Extremist Groups’ Videos,” In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.

J. Xu, H. Chen, Y. Zhou, and J. Qin, “On the Topology of the Dark Web of Terrorist Groups,” In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.

Zhou, Y., Qin, J., Lai, G., Reid, E. and Chen, H., “Building Knowledge Management System for Researching Terrorist Groups on the Web,” Proceedings of the AIS Americas Conference on Information Systems (AMCIS 2005), Omaha, NE, USA, August 11-14, 2005.

“Mapping the Contemporary Terrorism Research Domain: Researchers, Publications, and Institutions Analysis,” ISI Conference 2005, Atlanta, GA, May, 2005.

Reid, E., Qin, J., Zhou, Y., Lai, G., Sageman, M., Weimann, G., and Chen, H., “Collecting and Analyzing the Presence of Terrorists on the Web: A Case Study of Jihad Websites,” IEEE International Conference on Intelligence and Security (ISI 2005), Atlanta, Georgia, 2005.

Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., Bonillas, A. and Sageman, M., “The Dark Web Portal: Collecting and Analyzing the Presence of Domestic and International Terrorist Groups on the Web,” Proceedings of the 7th International Conference on Intelligent Transportation Systems (ITSC), Washington D.C., October 3-6, 2004.

Perpetuating LSI Misconceptions

11 Tuesday Dec 2007

Posted by egarcia in Latent Semantic Indexing

Mr. Nick Yorchak of Fusionbox, an alleged SEO “expert”, has written this Sitepronews.com article about LSI. It perpetuates myths and incorrect statements about LSI, similar to those made by Mr. Aaron Wall in this SearchEngineJournal article and by Valerie DiCarlo in this unfortunate article.

Mike Duz has written a quick rebuttal to Yorchak.

Wishful Thinking: Let us hope that in 2008 SEOs learn how SVD works so they stop spreading misinformation about what LSI is.

To learn about SEO misconceptions regarding LSI, check my tutorial series on the topic, starting with

Tutorial 1: Understanding SVD and LSI

Fortunately, more and more SEOs, like Andy Beal here (MarketingPilgrim.com) and Melissa Fach here (SEOAware.com), are realizing what LSI is not.

BTW, here is an “invitation” issued by Mike Grehan and me back in July, 2007: A Call to SEOs Claiming to Sell LSI.
