Archive for the ‘Spam’ Category

Pink Keywords: Optimization of Resumes and Job Applications

May 5, 2008

The current slump in the US and PR economies, with so many local employers handing out pink slips, makes me think about the importance of pink keywords.

These are keywords one would use to optimize resumes and job applications.

Now more than ever, recruiters, middle managers, and HR departments need to look through zillions of resumes for specific clues in the form of pinky keywords. This means that resumes and job applications must be optimized for such terms.

http://career-advice.monster.com/resume-writing-basics/Keyword-Challenge/home.aspx

The best way to find good pinky keywords is to sell employers their own crappy ads and job offers; that is, scan the employment ads, job offerings, and classifieds relevant to the position you are targeting and then use their terms in your own resume. You can also expand these with related or contextual terms; of course, use only those that match your own experience and skills.
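As a rough illustration of that scanning step, here is a minimal Python sketch. The stop list, the sample ads, the sample resume, and the function names are all made up for the example; a real tool would need a proper tokenizer and a much longer stop list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (an assumption for this example).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "or", "on"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def ad_keywords(ads, min_ads=2):
    """Return the terms that appear in at least min_ads different ads."""
    doc_freq = Counter()
    for ad in ads:
        doc_freq.update(set(tokenize(ad)))  # count each term once per ad
    return {term for term, n in doc_freq.items() if n >= min_ads}

def missing_keywords(ads, resume, min_ads=2):
    """Terms that employers repeat across ads but the resume never mentions."""
    return sorted(ad_keywords(ads, min_ads) - set(tokenize(resume)))

ads = [
    "Seeking Java developer with SQL and Linux experience",
    "Java developer needed; SQL required, Linux a plus",
]
resume = "Software developer with five years of Java and Oracle experience"
print(missing_keywords(ads, resume))  # ['linux', 'sql']
```

The output is only a list of candidates. Whether a missing term belongs in the resume still depends on whether it matches your actual experience and skills.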

I see here an opportunity for ethical SEO companies to provide a valuable and noble service: Pinky Optimization. At the same time, I see an opportunity for crooked SEOs and spammers to prey on other people’s misfortune. Since many in the seosphere have been disposed of by fat cats and sold(soul)-outs, these folks are also job searching. Life’s ironies.

For SEO Spammers: AIRWeb 2008 Presentations

April 29, 2008

To facilitate mainstream dissemination of the manuscripts presented at AIRWeb 2008, here are the papers as listed over at http://airweb.cse.lehigh.edu/2008/program.html

SEO spammers, whether or not your life revolves around a “social network circus” or “link building”, it is time to go back to the drawing board.

8:30 - 10:00

10:30 - 12:00

13:30 - 15:00

15:30 - 17:00

  • Web Spam Challenge
    • (5 min.) Description of the challenge
    • (12 min.) Data Analysis School, Moscow (slides)
      Konstantin Bauman, Alexey Brodskiy, Sergey Kacher, Elmira Kalimulina, Ruslan Kovalev, Mikhail Lebedev, Dmitry Orlov, Pavel Sushin, Pavel Zryumov, Dmitry Leshchiner and Ilya Muchnik
    • (12 min.) Computer and Automation Research Institute, Hungarian Academy of Sciences (slides)
      David Siklosi, Andras Benczur
    • (12 min.) Institute of Automation, Chinese Academy of Sciences, Beijing (slides)
      Guanggang Geng, Xiaobo Jin and Chunheng Wang
    • (5 min.) Announcement of results
  • Panel
    • (45 min.): The Future of Adversarial IR on the Web
      Amit Aggarwal, Zoltán Gyöngyi, Alexandros Ntoulas, Erik Selberg, and Andrew Tomkins

SEOs - Desperately Seeking Clients

April 24, 2008

From time to time I receive unsolicited emails from SEOs offering their services to list my site in the major search engines and directories. They often send template-like automated messages (“Dear website owner”) and apparently do not even bother to check whether recipients need the service.

These SEOs often look desperate and sound like snake-oil sellers and crooks. They even claim to be better than other SEOs.

They often pitch the same crap:

  • “I recently visited your site” (Really? Why then send this crap?).
  • “you are not listed in the top search engines and directories” (Really? How do they know?).
  • “we can increase your traffic by X astronomical amount” (Really? Could you double X for me, please?).
  • “we can help you get top rankings in Google” (Really? For which keywords?).
  • “our link building program” (Really? Read: link exchange and link spam).
  • “we have proprietary crap, blah, blah, …” (Really? Sell it or get a patent!).

I received one such email just last night, even though my site is known in the IR/SEO spheres, has been listed for many years in the top search engines and directories, and ranks well.

Dear website owner,

I visited your website and noticed that you are not listed in many of the major search engines and directories. If our company can increase your traffic up to 500% by getting you top ranking results on the search engines such as Google would you be interested? We specialize in link building content writing and programming. We have proprietary techniques that work better and are less expensive than any other SEO firm.

Please let me send you a proposal and show you how we can make your website profitable.

Sincerely,

Christian Frank

2060 AVENIDA DE LOS ARBOLES, STE D
THOUSAND OAKS,
CA 91362-1361 - USA

These are the kind of companies that give the SEO industry a black eye. If SEOs send you this type of crap, I feel your pain. Stay away from their businesses and from whatever they claim or seem to offer.

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the consensus is that it is not useful.

Two issues strike me because they suggest a lack of understanding of how search engines work. They boil down to the following questions:

1. Could KD be used by search engines or users to check for keyword spam?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for keyword spam?

Word repetition that search engines flag as keyword spam should be of more concern than whatever web page creators or a KD tool tag as keyword spam. After all, search engines, not web page designers, are the ones that assign a rank to documents. This ties in with the user-machine relevance perception mismatch and with the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users count for nothing, as they are the ones who pay the bills. And even if they don’t, why rank a page high just to see users go somewhere else after visiting it because it is not fit for human consumption? So, rather than using a KD tool, just write as naturally and as usefully for your prospective clients and readers as you can.

As for using KD tools to check for spam, this claim reminds me of certain SEO books, marketers, and community forums that insist on such nonsense just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked these and similar SEO myths almost routinely. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening the effect of keyword repetition. Two for the price of one!

This neutralizes keyword repetition more cost-effectively than computing (and comparing against) a whole new ratio, KD. Best of all, it does not require the two extra loops one would need to compute KD (one over every term i in a doc and another over every doc j in the collection). Thus, whatever percentage a KD tool computes, it will be compressed/attenuated within the scale of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.
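To make the attenuation point concrete, here is a small Python sketch that compares raw counts with two attenuating local weights. The post does not spell out the formulas, so the forms used here are assumptions based on the usual textbook definitions: LOGA = 1 + log10(f) and SQRT = sqrt(f - 0.5) + 1. The 1000-word document length is also made up for the example.

```python
import math

def keyword_density(term_count, total_words):
    """KD as SEO tools usually report it: occurrences over total words, in percent."""
    return 100.0 * term_count / total_words

def freq(f):
    """FREQ: raw term count, no attenuation."""
    return float(f)

def loga(f):
    """LOGA (assumed form): 1 + log10(f) for f > 0, else 0."""
    return 1.0 + math.log10(f) if f > 0 else 0.0

def sqrt_weight(f):
    """SQRT (assumed form): sqrt(f - 0.5) + 1 for f > 0, else 0."""
    return math.sqrt(f - 0.5) + 1.0 if f > 0 else 0.0

# Repeat a term 1, 10, and 100 times in a hypothetical 1000-word document.
for f in (1, 10, 100):
    print(f, keyword_density(f, 1000), freq(f), round(loga(f), 2), round(sqrt_weight(f), 2))
```

Going from 1 to 100 repetitions multiplies both KD and FREQ by 100, while under these assumed forms the LOGA weight only goes from 1 to 3. The repetition is dampened as a side effect of scoring, with no separate ratio to compute.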

To make sure students understood, I included the following three questions in the multiple-choice section of the Final Exam. (The problem-solving section of the test is even more interesting, but it is too long to include here.)

#10. Which of the following is a false statement?

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times should this term be repeated in d2 to triple its d1 weight? Assume log base 10.

a. 3 times
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).
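For readers who want to check the answer to #16, here is the arithmetic as a short Python sketch. It assumes LOGA means 1 + log10(f), which the post does not state explicitly but which agrees with the answer given; the helper names are made up for the example.

```python
import math

def loga(f):
    """LOGA local weight (assumed form): 1 + log10(f) for f > 0, else 0."""
    return 1.0 + math.log10(f) if f > 0 else 0.0

def count_for_weight(target):
    """Invert the assumed LOGA form: the count f at which 1 + log10(f) equals target."""
    return round(10 ** (target - 1.0))

w1 = loga(1)                        # weight of the term in d1: 1 + log10(1) = 1
needed = count_for_weight(3 * w1)   # solve 1 + log10(f) = 3, so f = 10^2
print(w1, needed)                   # 1.0 100
```

Under this assumption, tripling the weight takes 100 repetitions rather than 3, which is exactly the attenuation that question #15 is getting at.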

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is nonsense. Vector Space models are used on a regular basis to score and rank documents. Implementation across large collections is not that hard if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, this spring I’ll be teaching the graduate course Search Engines Architecture.
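To give a flavor of vector space scoring, here is a toy Python sketch. The weighting scheme (term count times log10 IDF), the cosine measure, and the three sample documents are choices made for this example only; nothing here claims to reproduce what any production search engine does. Computing the IDF table once and reusing it for every query is a small-scale stand-in for the precaching idea.

```python
import math
from collections import Counter

def build_vectors(docs):
    """Return one {term: weight} vector per document plus the shared IDF table."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency per term
    idf = {t: math.log10(n / d) for t, d in df.items()}
    vectors = [{t: f * idf[t] for t, f in c.items()} for c in counts]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Score every document against the query and sort by descending similarity."""
    vectors, idf = build_vectors(docs)
    q_counts = Counter(query.lower().split())
    q_vec = {t: f * idf.get(t, 0.0) for t, f in q_counts.items()}
    return sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)

docs = [
    "search engines rank documents",
    "spam pages repeat keywords keywords keywords",
    "vector space models rank documents by similarity",
]
for score, doc_id in rank("rank documents by similarity", docs):
    print(doc_id, round(score, 3))
```

With these toy documents the third one comes first, the first one second, and the keyword-stuffed one last with a score of zero, since it shares no terms with the query.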

I will blog the syllabus tomorrow, but it is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own making.

Time and again, SEOs show they still have no clue about what a search engine can or cannot do.

AIRWeb-2008 Last Call for Papers & Extension Deadline

February 25, 2008

AIRWEB organizers have instructed me to disseminate the following Final Call For Papers and deadline extension. Let’s fight the spammers and those disguised as SEOs.

We have extended the submission deadline until March 2, 2008. We would appreciate any assistance in disseminating this extension. The text version of the final call for papers is below and the pdf version is attached.

Best regards.
Carlos Castillo
Kumar Chellapilla
Dennis Fetterly

FINAL CALL FOR PAPERS and 9 day extension
Fourth International Workshop on
Adversarial Information Retrieval on the Web
http://airweb.cse.lehigh.edu/2008/

IMPORTANT DATES

02/March/2008 : Deadline for research articles
31/March/2008 : Deadline for challenge submissions
22/April/2008 : Workshop at the WWW 2008 conference in Beijing, China

Contents:

1. AIRWeb’08 Topics
2. Web Spam Challenge
3. Timeline
4. Organizers and Program Committee

1. AIRWEB’08 TOPICS

Adversarial Information Retrieval addresses tasks such as gathering,
indexing, filtering, retrieving and ranking information from
collections wherein a subset has been manipulated maliciously. On the
Web, the predominant form of such manipulation is “search engine
spamming” or spamdexing, i.e., malicious attempts to influence the
outcome of ranking algorithms, aimed at getting an undeserved high
ranking for some items in the collection.

We solicit both full and short papers on any aspect of adversarial
information retrieval on the Web. Particular areas of interest
include, but are not limited to:

* Link spam
* Content spam
* Cloaking
* Comment spam
* Spam-oriented blogging
* Click fraud detection
* Reverse engineering of ranking algorithms
* Web content filtering
* Advertisement blocking
* Stealth crawling
* Malicious tagging
* Ping spam

Proceedings of the workshop will be included in the ACM Digital
Library. Full papers are limited to 8 pages; work-in-progress will be
permitted 4 pages. Papers should be formatted using the WWW2008
proceedings style and submitted via
http://www.easychair.org/conferences/?conf=airweb2008

For more information, see http://airweb.cse.lehigh.edu/2008/

2. WEB SPAM CHALLENGE

Last year we introduced a novel element at the workshop: a Web Spam
Challenge for testing web spam detection systems. We will be holding
the Web Spam Challenge again this year, using the WEBSPAM-UK2007
collection for Web Spam Detection http://www.yr-bcn.es/webspam

The collection includes a large set of web pages, a web graph, and
human-provided labels for a set of hosts. We will also provide a set
of features extracted from the contents and links in the collection,
which may be used by the participant teams in addition to any
automatic technique they choose to use.

We ask that participants of the Web Spam Challenge submit predictions
(normal/spam) for all unlabeled hosts in the collection. Predictions
will be evaluated and results will be announced at the AIRWeb 2008
workshop.

For more information, see

3. TIMELINE

- 15 February 2008: E-mail intention to submit a workshop paper
  (optional, but helpful)
- 02 March 2008: Deadline for workshop paper submissions (all day
- 24 March 2008: Notification of acceptance of workshop papers
- 31 March 2008: Challenge submissions due
- 07 April 2008: Camera-ready copy due
- 22 April 2008: Date of workshop

4. ORGANIZERS AND PROGRAM COMMITTEE

Organizers

- Carlos Castillo, Yahoo! Research
- Kumar Chellapilla, Microsoft Live Labs
- Dennis Fetterly, Microsoft Research

Program Committee

- Einat Amitay, IBM
- Andras Benczur, Hungarian Academy of Sciences
- Paul-Alexandru Chirita, Uni Hannover
- James Caverlee, Texas A&M University
- Gordon Cormack, University of Waterloo
- Nick Craswell, Microsoft Research
- Matt Cutts, Google
- Brian Davison, Lehigh University
- Ludovic Denoyer, University Paris 6
- Aaron D’Souza, Google
- Edel Garcia, Mi Islita.com
- Natalie Glance, Nielsen BuzzMetrics
- Antonio Gulli, Ask.com
- Zoltan Gyongyi, Stanford University
- Monika Henzinger, Google
- Pranam Kolari, Yahoo! Applied Research
- Mark Manasse, Microsoft Research
- Marc Najork, Microsoft Research
- Alexandros Ntoulas, Microsoft Search Labs
- Jan Pedersen, Yahoo! Search
- Erik Selberg, Amazon
- Torsten Suel, Polytechnic University
- Mike Thelwall, University of Wolverhampton
- Baoning Wu, Snap
- Tao Yang, Ask.com

Search Engines for Penetration Testing

February 21, 2008

Well, I’m getting ready for my talk this afternoon at the University of Turabo. I’ve organized the talk in three parts:

Part 1: Spam and Fraud through Search Engines

Part 2: Gathering Intelligence through Search Engines

Part 3: Identity Theft through Search Engines

A disclaimer will be necessary to indicate that the information presented is for educational purposes only.

This is going to be a nice one. I hope to see old friends.

Web Mining, Search Engines, and Information Security

February 15, 2008

This Thursday the 21st I’ll be presenting the following talk before the faculty of the University of Turabo, Gurabo, PR:

Web Mining, Search Engines, and Information Security

I hope to see old friends there. Here is the abstract of my talk:

Web Mining is a research area of Data Mining wherein the Web is the “database” and search engines are the “user interface”. End users can resort to search engines for all sorts of things. For instance, marketers can use search engines to gain traffic by ranking Web pages high for specific queries, hence enhancing the online presence of businesses, products, and services (search engine optimization, SEO). Spammers can flood search engine indexes to deceive searchers (spamdexing). Hackers can attempt to rank high documents that lead to security risks (hacketers, hacketering) or use all forms of injection (links, forms, scripts, redirections, etc.). Terrorists and criminals can use search engines for all sorts of crime-enabling activities, for instance, stealing private information like SSNs, passwords, and student and user IDs, gaining access to “private” documentation, stalking people, etc.

This talk covers these and other aspects of search engines: the Good, the Bad, and the Ugly. The speaker will then talk about his own research projects in the area of Web Mining, Search Engines, and Intelligence. A disclaimer will be necessary to indicate that the information to be presented is for educational purposes only.

Keyword Density, SEOs, and the Deception War

February 7, 2008

I’m happy that at this Sphinnessed post (http://www.searchenginepeople.com/blog/how-search-really-works-the-keyword-density-myth.html ), several SEOs are finally waking up and getting the Keyword Density Myth.

Great to see that they are realizing what IR grad students already know (http://irthoughts.wordpress.com/2008/01/25/the-power-of-document-linearization/ ):

That KD is not even a cost-effective tool for detecting spam, as search engines can use local term weight models that, in addition to scoring terms, can attenuate word frequencies. More on this here http://irthoughts.wordpress.com/2007/05/09/keyword-density-the-devils-advocate/ and here http://irthoughts.wordpress.com/2007/05/07/keyword-density-kd-revisiting-an-seo-myth/

Such models effectively minimize the spurious effects/advantages derived from keyword repetition. Some of these are LOGA, LOGN, ATF1, and SQRT. One could even take the idea of global ENTROPY and propose a local ENTROPY model to neutralize any attempt at abusive term repetition. All these have been discussed in my Web Mining course.
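For reference, here is a minimal Python sketch of the global entropy weight in the form commonly used in log-entropy weighting: 1 + sum_j(p_j * log p_j) / log n, where p_j is the share of the term's occurrences that fall in document j and n is the number of documents. The local ENTROPY variant is only an idea floated above and is not specified here, so the sketch does not attempt it. The function name and the sample counts are made up.

```python
import math

def entropy_weight(term_counts):
    """Global entropy weight of one term, given its count in each document.

    Returns a value between 0 and 1: 1 when all occurrences sit in a single
    document, 0 when they are spread evenly over every document.
    Returns 0.0 for degenerate input (fewer than two documents or no occurrences).
    """
    n = len(term_counts)
    total = sum(term_counts)
    if n < 2 or total == 0:
        return 0.0
    h = 0.0
    for f in term_counts:
        if f > 0:
            p = f / total          # share of the term's occurrences in this document
            h += p * math.log(p)   # non-positive contribution
    return 1.0 + h / math.log(n)

print(entropy_weight([5, 0, 0, 0]))             # concentrated in one document: 1.0
print(round(entropy_weight([2, 2, 2, 2]), 6))   # spread evenly: 0.0
print(round(entropy_weight([6, 1, 1, 0]), 3))   # something in between
```

The weight depends only on how a term is distributed, not on how often it occurs, which is the property that makes entropy interesting as a complement to the local models listed above.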

Based on the aforementioned links, it is clear that the Search Engine War never ends, especially when, in addition to their spam tactics, marketers keep proposing soooo many theories made out of thin air. What is worse, from time to time they induce their peers and cheerleaders to buy into their Latest SEO Incoherences (“LSI”).

The recent round of nonsense surprisingly comes from alleged “SEO experts”. From claims about sculpting PageRank (http://sphinn.com/story/26410 ) to the usual LSI nonsense (http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) to Keyword Density (http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/ ), the Deception War never ends.

I’m glad that at AIRWEB we fight these folks (http://airweb.cse.lehigh.edu/2008/cfp.html ). Ironically, what Google’s Cutts calls “proper use” of nofollow link attributes and “sculpting” is just another way of describing end users’ manipulation attempts. Great PR exercises.

Triangular Link Swapping

January 10, 2008

Over the past few days we received emails requesting what the senders call “triangular link swapping”. Here is the most recent template-type request (I’m removing the links):

Dear Info,

I am writing on behalf of __________________.

We are looking for triangular link swapping with some good qualitysites as  yours. You must already aware that triangular link swapping is much more popular and beneficial than a reciprocal link exchange. This way both of us will be benefited. I would request you to place my link at your site and in return I will have a link / exclusive page created for your site on our Directory.

Here are details of my site :

URL: __________________

An EXACT search in Google for “triangular link swapping” returns results with almost the same wording.

This is nothing new, just the building blocks of link farms and link islands. Nice try.

Local Newspapers Spamming Search Engines

December 23, 2007

Back in the ’90s spammers used to stuff keywords into web page meta tags, hoping to trick search engines into believing the pages were relevant. Then search engines found a way to neutralize this type of keyword stuffing/spam technique.

I was wondering which of our local newspapers are sticking to the idea of stuffing keywords in meta tags. So, I conducted a little exercise.

For this report, I used an experimental crawler I’m working on. The crawler was instructed to capture meta tags, only.
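The crawler itself is not shown here, but the meta-capturing step can be sketched with Python's standard html.parser module. This is an illustration under assumptions, not the experimental crawler: the class name, the function name, and the sample HTML are invented, and the sketch parses a string instead of fetching a live page.

```python
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collect the attributes of every <meta> tag found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; attribute values are kept as-is.
        if tag == "meta":
            self.metas.append(dict(attrs))

def meta_report(html):
    """Return a list with one attribute dict per <meta> tag, in document order."""
    parser = MetaTagCollector()
    parser.feed(html)
    return parser.metas

# Made-up sample page standing in for a fetched index page.
sample = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Keywords" content="periodico, Puerto, Rico, noticias">
<title>Sample</title></head><body></body></html>"""

report = meta_report(sample)
print(len(report), "results found.")
for i, attrs in enumerate(report, 1):
    print(i, attrs)
```

Counting the comma-separated entries in the Keywords content of each report entry is then enough to spot the stuffed tags.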

Here is a comparison as of 12-23-07. The first and the last newspapers redirect users from their index page to a secondary page. It is clear from these results that local web designers in Puerto Rico are sticking to old-fashioned spamming techniques like keyword stuffing, which BTW no longer work with search engines.

Newspaper: El Nuevo Dia ( http://www.elnuevodia.com/noticias )

Meta Report

4 results found.

1. meta http-equiv=”Content-Type” content=”text/html; charset=iso-8859-1″
2. meta content=”600″ http-equiv=”refresh”
3. meta name=”Description”
4. meta name=”Keywords” content=”ENDI El Nuevo Dia, periodico, Puerto, Rico, internet, noticias, boricua, Clima, Horoscopo, Coqui, Sapo,Concho,Telefonica,Coqui net,RonNueva York ,Horoscopo,Coqui,Sapo Concho,Telefonica,Coqui net,Ron,Bacardi,Estatus,Sila Maria Calderon,Menudo,Carlos Romero Barcelo,Pedro Rossello,Anibal Acevedo Vila,Daddy Yankee,Tego lderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Marcony,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne,Educacion,Cine,Entretenimiento,Ejercicio,Bienestar,Recetas,Recetarios,Musica,Boricuas,Carlos Arroyo, Islanders,TUTV,Motoristas,Internautas ,Medicos,Historiadores,Venezuela,Santo Domingo,Republica Dominicana,Cuba ,Queens,Manhattan,Bronx,Nueva York ,Espiritualidad,Maestros,Televicentro,Telemundo,Univision,Bellas Artes,Telenovela,Comunidades,Isla,Culebrita,St. Thomas,Isla Nena,Culebra,Isla Mona,Vieques,Filiberto Ojeda,Capitolio,Turismo,Periodico,Periodismo,Veteranos,Marina,Soldados,Senado,Energia Electrica,Plena,Bomba,Reggaeton,Wisin y Yandel,Casinos,Salsa,Construccion,Vacaciones,Jazz,Tito Trinidad,Oscar de la Hoya,Millie Corretjer,Roberto Clemente,San Juan,Miguel Cotto,Becas,Tercera Edad,Museos,Clasificados,Clasificados online,Clasificados en linea,Boletines,Servicios de noticias,Luis A. Ferre,Ferre Rangel,Hepatitis,Dengue,Diabetes,Obituarios,Obesidad,Librerias,Libros,Viejo San Juan,El Morro,Ballaja,Museo de Arte de Ponce,Ponce,Mayaguez,Parque Indigena,Observatorio de Arecibo,Tainos,El Yunque,El tunel Guajataca,Zoologico,RUM,UPR,Sagrado Corazon,Interamericana,Universidades,Colegios,Escuelas,Fajardo,Bahia,Luquillo,Bioluminiscente,La Parguera,Carlos Delgado,Igor Gonzalez,Olga Tanon,Piculin Ortiz,Primera Hora,Zonai,Virtual,Beisbol,Mundial ,Raul Papaleo,Sondeos,Justas,Pavas,Palmas,Munoz Marin,Lyann Puig,Elaine Lopez,Javier Lopez,Remi,Poesia ,Olga Nolla,Rosario Ferrer,Mayra Montero”

Newspaper: El Vocero ( http://www.vocero.com )

Meta Report

3 results found.

1. meta http-equiv=”Content-Type” content=”text/html; charset=iso-8859-1″
2. meta name=”Description” content=”VOCERO”
3. meta name=”Keywords” content=”El Vocero, periodico, Puerto, Rico, internet, noticias, locales, Negocios, Deportes, Escenario, Extra, Triunfo, Habitat, Suplementos, Editorial, Comentarios, Juegos, Esquelas, Suscripcion, Galeria, Titulares, Cucubano, Humor, Cheo, Tiempo, Horoscopo, Coqui, Sila Calderon, Menudo,Daddy Yankee,Tego Calderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne, Educacion, Cine, Entretenimiento, Telenovela, Comunidades, Isla, Culebrita, St. Thomas, Isla Nena, Culebra, Isla Mona, Vieques, Filiberto Ojeda, Capitolio, Turismo, Periodico, Periodismo, Soldados, Senado, Energia Electrica, Plena, Bomba, Reggaeton, Casinos, Salsa, Construccion, Vacaciones, Jazz, Tito Trinidad, Oscar de la Hoya, Millie Corretjer, Roberto Clemente, San Juan, Miguel Cotto, Becas, Museos, Clasificados, Clasificados online, Clasificados en linea, Boletines, Servicios de noticias, Luis A. Ferre, Hepatitis, Dengue, Diabetes, Viejo San Juan, El Morro, Ballaja, Museo de Arte de Ponce, Roca, Ponce, Mayaguez, Parque Indigena, Observatorio, Arecibo, Tainos,El Yunque, Guajataca, Zoologico, RUM, UPR, Sagrado Corazon, Interamericana, Universidades, Colegios, Escuelas, Fajardo, Bahia, Luquillo, Bioluminiscente, La Parguera, Carlos Delgado, Igor Gonzalez, Olga Tanon, Piculin Ortiz, Beisbol, Mundial ,Raul Papaleo, Sondeo, Justas, Pavas, Palmas, Muñoz Marin, Coquito, Coco”

Newspaper: PrimeraHora ( http://www.primerahora.com/home )

Meta Report

3 results found.

1. meta http-equiv=”Content-Type” content=”text/html; charset=iso-8859-1″
2. meta name=”Description” content=”Primera Hora”
3. meta name=”Keywords” content=”Periódicos, Periódico Primera Hora, Puerto Rico, Isla, Caribe, San Juan, Viejo San Juan, Vieques, Regueton, Reggaeton, Daddy Yankee, Wisin y Yandel, Don Omar, Ricky Martin, Chayanne, Calle 13, JLo, Jennifer Lopez, Lucha Libre, Sexo, Boricua, Borinquen, Boriken, Noelia, Salsa, El Gran Combo, Playa, Isla Nena, Política, Gobernador, Estado Libre Asociado, ELA, Bombón, Bikinis, Modelos, Béisbol, Baloncesto, Volibol, Hipismo, Boxeo, Tito Trinidad, Miguel Cotto, Iván Calderón, De la Hoya, Héctor Lavoe, Marc Anthony, Cortometrajes, Web Concerts, Conciertos, Videos musicales, Video juegos, RBD, Miss Universe, Construcción, Casas, Apartamentos, Autos, Noticias, Asesinatos, Violencia, Shakira, Mujer, Salud, Niños, Mascotas, Restaurantes, Cine, Moda, Horóscopo, Paris Hilton, Britney Spears, Blogs, Tecnología, E Cards, Coqui, Clima, Recetas, Chats, Cielito Rosado, Rukmini, Juanes, El Nuevo Día, Grupo Ferre Rangel”

From Keyword Density to William Tutte’s Legacy

December 20, 2007

From Keyword Density to Keyword Distribution

Finally we have the Christmas Break from graduate school.

In my last Web Mining Course lecture before the Christmas Break, I tried to explain to students the importance of incorporating word spacing in information retrieval algorithms and in document relevance assessments. I explained why ideas like SEOs’ keyword density (KD), the traditional local term weight model known as FREQ (Term Count) used in early papers on Vector Space and LSI models, and the like are poor estimators of document relevance.

Among other theoretical reasons, we discussed that a term mentioned X times is not necessarily X times more important than other terms. In addition, KD and the term count model cannot attenuate frequencies. We then discussed several frequency attenuation models (keyword spam filters) that also work as term weight scoring models. These can dampen the effect of abnormal term repetition, raise a spam flag, and do not require any reference to KD “tales”.

We also discussed several scenarios in which one could use word distributions and co-occurrence to analyze textual information, far better than with the aforementioned “crapstimators”. For instance, word spacing can be used in encryption/steganographic algorithms to uncover hidden messages, to profile writing styles or people, to attribute authorship of text, and to assess plagiarism, fraud, etc.
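Here is a tiny Python sketch of why spacing carries information that a ratio cannot. The two token lists are made up; each has 12 tokens and mentions “cheap” 4 times, so their keyword density is identical, yet the gaps between occurrences tell them apart. The function names are invented for the example.

```python
def positions(tokens, term):
    """Indexes at which term occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == term]

def gaps(tokens, term):
    """Distances, in tokens, between consecutive occurrences of term."""
    pos = positions(tokens, term)
    return [b - a for a, b in zip(pos, pos[1:])]

# Two hypothetical documents with the same length and the same term count.
stuffed = "cheap cheap cheap cheap flights to san juan with hotels and tours".split()
spread = "cheap flights to san juan cheap hotels and tours cheap cars cheap".split()

for name, tokens in (("stuffed", stuffed), ("spread", spread)):
    kd = 100.0 * tokens.count("cheap") / len(tokens)
    print(name, round(kd, 1), gaps(tokens, "cheap"))
```

Both documents report a KD of 33.3, but the gap lists are [1, 1, 1] and [5, 4, 2]. Any analysis based on distribution sees a difference that the density figure hides.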

I’m happy that not all SEOs are buying into the keyword density nonsense and similar “crapstimates”, as I can see from these SEOmoz posts.

From Keyword Distribution to William Tutte’s Legacy

This morning I came across a nice biography of one of those venerable giants: the late William Tutte. Beautifully written by Dan Younger, the biography is a tribute to Tutte’s greatness. Of interest in relation to word spacing theory is this portion of Younger’s writing (emphasis added):

“Tutte’s great contribution was to uncover, from samples of the messages alone, the structure of the machines which generated these codes. This came about as follows. In August 1941, a German operator sent a Fish-enciphered teleprinter message of some 4000 letters from Athens to Berlin. For some reason, the message was not received properly and so it was resent. Against all guidelines, it was sent with the same setting. It was identical in content, but it differed slightly, in word spacing and punctuation. John Tiltman of Bletchley was able to use this blunder to find both the message and the obscuring string that was added to make up the enciphered message. But that seemed to be all that could be found, when Tutte was presented with the case in October.”

“Tutte began by observing the machine generated obscuring string carefully. Splitting it up into various lengths, he noticed signs of periodicity. For the first of the five teleprinter tape positions, the regularity he supposed arose from a wheel of 41 sprockets. And then at the last position, one of 23 sprockets. Over the next months, Tutte and colleagues worked out the complete internal structure, that it had twelve wheels, two for each of the five teleprinter positions, and two with an executive function. They determined the number of sprockets on each wheel, and how the advancement of the wheels was interrelated. They had completely recreated the machine without ever having seen one. Tony Sale, who first described this work in a 1997 article in New Scientist, characterized it as the “greatest intellectual feat of the whole war.”

“Knowing the structure of the enciphering machine is a necessity for code-breaking, but it is only the first step. Tutte then put himself to creating an algorithm to find from the enciphered messages the initial settings of the machine wheels. The algorithm that he created, the “Statistical Method”, looked for certain types of resonances, but it had to consider far too many possibilities to be carried out by hand. So it was that, in 1943, the electronic computer COLOSSUS was designed and built by the British Post Office. It was to run the algorithms that Tutte; and his collaborators Max Newman and Ralph Tester; developed, that COLOSSUS was created. This man-machine combination was used to break Fish codes on a regular basis throughout the remainder of the War”.

I hope you understand now the title of this post.

 In today’s Web the enciphering machines are search engines, but the underlying principles driving the Search Engines War are the same.

Emphasized words should make sense to students of the Web Mining course.

IRSeek, Polymorphic JavaScript, and Hacketers

December 6, 2007

According to a DarkReading report, IRSeek is a start-up designed to target hackers and their anonymous IRC chat activities. Hacking the hackers?

The report states:

“Hackers favor IRC because it allows them to protect their identities and cover their tracks. But a new search engine startup called IRSeek is now calling those features into question…”

“This could all be bad news for hackers, who don’t want their conversations indexed or searchable by nickname. While they could partially beat the system by simply changing their nicknames frequently, hackers may eventually feel that IRSeek threatens their anonymity, and ultimately, their privacy.”

Here is more on the topic.

Well, this can be fun to watch/test for those who conduct Web Mining for security purposes.

Meanwhile, according to a CNN report, search engine-based hacking attacks are on the rise and becoming a preferred targeting method. These include link-based spam and polymorphic JavaScript scripts, also referred to as “Polyscripts”, sometimes combined with dark marketing practices. Here is a Top 10 List to watch.

1. Phishing
2. Malicious link injections through forums, blogs to rank high in search engines.
3. Attackers use Web’s ‘weakest links’ to launch attacks.
4. Compromised Web sites will surpass number of created malicious sites.
5. Cross-platform Web attacks.
6. Web 2.0-based attacks.
7. Polymorphic JavaScripts, designed to evade anti-virus scanners.
8. Data concealment methods.
9. Key hacker groups.
10. Vishing and voice spam.

Hackers + Spammers + Crooked marketers/SEOs = What A Killer Combination. Compromised sites ranking high means trapping more users in the mess. I wonder how many of the folks from the seosphere are involved and making a few bucks. The usual suspects?

Perhaps not all are real SEOs, but as we say in Spanish: “Ante la duda, saluda.”

Here is a nice one: Hacking Duke University to rank high via link injection

And, somewhat related, how about cracking passwords with Google?

Welcome to a new breed on the rise:

Hacketers = Hackers + Marketers

PS. I coined the name after noticing with the Levenshtein Edit Distance Calculator that it takes only two edits to go from hacketers to marketers.

http://www.miislita.com/searchito/levenshtein-edit-distance.html
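For those who prefer to check the edit count themselves, here is a textbook dynamic-programming version of the Levenshtein distance in Python. It is a generic sketch, not the code behind the calculator linked above.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("hacketers", "marketers"))  # 2
```

The two edits are the substitutions h to m and c to r.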

Heh, Heh. Apparently “peer” pressure forced IRSeek to shut down. Nevertheless, it is still a great concept: I wonder how many of these mole projects are in place all over the Web. Check the whole deadpool story here:

http://www.techcrunch.com/2007/12/03/fastest-deadpool-ever-irseek-shuts-down/#comment-1813205

http://www.irseek.com/blog/