Archive for the ‘SEO Myths’ Category

Vector Space Models and Search Engines

April 21, 2008

This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/

That said, today’s post is in reaction to the article at
http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

The title of that article suggests that search engines work by using tf*IDF. In addition, IDF itself is described incorrectly. It is not clear from the article how vector space models work or how search engines use them. The author then seems to agree with a few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.

Blockquoted passages and my comments are next.

Blockquoted Passage

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (”the reluctance to stop talking about a certain subject” …)would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document with.

Comments

At grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. Most likely, no current top search engine uses plain tf, IDF, or tf*IDF to rank docs. So how does IDF work?

Let N be the number of docs in a collection and n be the number of docs containing a given term. The probability of randomly choosing a doc containing that term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as

IDF = log(1/p) = log(N/n)

Logs at a given base (often 2 or 10) are used.

Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.
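
The definition above can be sketched in a few lines of Python. This is only an illustration: the collection size and the doc counts for "the" and "job" are made-up numbers, and base 10 is an arbitrary choice.

```python
import math

def idf(N, n, base=10):
    """Inverse document frequency: log(N/n) in the given base."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return math.log(N / n, base)

N = 1_000_000  # hypothetical collection size
for term, n in [("the", 990_000), ("job", 50_000), ("klebenleiben", 18)]:
    # p is the fraction of docs containing the term; IDF = log(1/p)
    print(f"{term:>14}  p = {n / N:.6f}  IDF = {idf(N, n):.3f}")
```

A term present in nearly every doc gets an IDF close to zero, while a rare term gets a large one. Nothing in the calculation looks at whether the term is relevant to any particular document.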

Frequently used terms have small discriminatory power, regardless of whether they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF), regardless of whether they are relevant to the topic of the documents mentioning them. Term relevancy and the discriminatory power of a term do not always run “side by side”. Sometimes they do, though.

The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.

For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” would probably be both relevant to the doc and discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the term “job” is still relevant and meaningful to the document, but will most likely lose its discriminatory power within the collection. And we haven’t yet considered how relevant “job”, or the documents containing the term, is to end users looking for jobs.

IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.

For instance, we can use IDF probabilistic (IDFP) by considering the odds (p/(1 - p)) instead of just p. Inverting and taking logs,

IDFP = log((1 - p)/p) = log((N - n)/n)

If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.

Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html  
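
Here is a minimal sketch of IDFP with optional rezeroing of negative terms. The function name and the `rezero` flag are my own for illustration; this is not how MySQL or Lucene implement it.

```python
import math

def idfp(N, n, base=10, rezero=True):
    """Probabilistic IDF: log((N - n)/n).

    With rezero=True, negative weights are clamped to zero so that
    terms found in more than half of the docs behave as stopwords.
    """
    if not 0 < n < N:
        raise ValueError("need 0 < n < N")
    w = math.log((N - n) / n, base)
    return max(w, 0.0) if rezero else w

N = 1000  # hypothetical collection size
for n in (10, 500, 800):
    print(n, round(idfp(N, n, rezero=False), 3), round(idfp(N, n), 3))
```

For n = 500 the weight is zero, and for n = 800 the raw weight is negative until it is clamped.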

There are ways of defining global weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.
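
As an illustration, here is one entropy-based global weight, written the way I understand the log-entropy scheme from the LSI literature: G = 1 + sum_j(p_j * log p_j) / log(N), with p_j = f_j / (total occurrences of the term). Treat the exact formula as an assumption; other entropy variants exist.

```python
import math

def entropy_weight(freqs):
    """Entropy-based global weight for one term.

    freqs: occurrences of the term in each of the N docs of the collection.
    Returns 1 when the term is concentrated in one doc and 0 when it is
    spread evenly over all docs.
    """
    N = len(freqs)
    total = sum(freqs)
    if N < 2 or total == 0:
        raise ValueError("need at least two docs and one occurrence")
    s = sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    return 1 + s / math.log(N)

# Both hypothetical terms occur in all four docs, so plain IDF = log(4/4) = 0
# for both. The entropy weight still tells them apart.
concentrated = [97, 1, 1, 1]
even = [25, 25, 25, 25]
print(entropy_weight(concentrated), entropy_weight(even))
```

The extra cost is visible here: the weight needs the per-document frequencies of the term across the whole collection, not just a count of the docs that contain it.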

Blockquoted Passage

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.

Comments

Like global weights, there are dozens of ways to define local weights, L. In the original vector space model, L = f was used, where f is the frequency (occurrence) of a term in a doc.

This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation should we use?

The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that attenuate frequencies are better choices.

L = f/fmax
L = 1 + log(f)
L = log(1 + f)

etc, etc, etc.

Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.
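
To see the attenuation, the local weight models listed above can be tabulated side by side. This sketch assumes base-10 logs and a hypothetical fmax of 100; the function names are mine.

```python
import math

def raw(f):
    """L = f: no attenuation, a 45-degree line."""
    return float(f)

def binary(f):
    """L = 1 if the term is present, otherwise 0: extreme attenuation."""
    return 1.0 if f > 0 else 0.0

def max_normalized(f, fmax):
    """L = f/fmax."""
    return f / fmax

def one_plus_log(f):
    """L = 1 + log(f), taken as 0 for absent terms."""
    return 1 + math.log10(f) if f > 0 else 0.0

def log_one_plus(f):
    """L = log(1 + f)."""
    return math.log10(1 + f)

fmax = 100  # hypothetical largest term frequency in the doc
print("f    raw  binary  f/fmax  1+log(f)  log(1+f)")
for f in (1, 2, 10, 50, 100):
    print(f"{f:<4} {raw(f):<4.0f} {binary(f):<7.0f} "
          f"{max_normalized(f, fmax):<7.2f} {one_plus_log(f):<9.2f} {log_one_plus(f):.2f}")
```

With the log models, repeating a term 100 times instead of once does not multiply its weight by 100, which is why word repetition buys so little.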

There are many ways of defining L and G, not to mention variants of document normalization weights N. These then give a product weight:

W = L*G*N

A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of large and short documents.

As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.
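
A minimal sketch of one W = L*G*N scheme follows. The choices are mine, for illustration only: L = 1 + log(f), G = plain IDF, and N = cosine normalization, which turns each doc into a unit vector. The three toy docs are made up.

```python
import math

# Toy collection, made up for illustration.
docs = {
    "d1": "job search tips for job seekers".split(),
    "d2": "vector space models for search".split(),
    "d3": "cooking tips".split(),
}

def weight_vectors(docs):
    n_docs = len(docs)
    # Document frequency: number of docs containing each term.
    df = {}
    for tokens in docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    vectors = {}
    for doc_id, tokens in docs.items():
        raw = {}
        for term in set(tokens):
            local = 1 + math.log10(tokens.count(term))  # L
            global_ = math.log10(n_docs / df[term])     # G (plain IDF)
            raw[term] = local * global_
        length = math.sqrt(sum(w * w for w in raw.values()))
        norm = 1 / length if length else 0.0            # N (cosine normalization)
        vectors[doc_id] = {t: w * norm for t, w in raw.items()}
    return vectors

vectors = weight_vectors(docs)
for doc_id, vec in vectors.items():
    print(doc_id, {t: round(w, 3) for t, w in sorted(vec.items())})
```

Swapping in a different L, G, or N only changes one line each, which is the point of treating tf*IDF as one particular case.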

Blockquoted Passage

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

Queries as Vectors

Every word in a query can also be shown as a vector.

By looking at documents that are “near” our query we can rank (sort) documents in our result set.

Comments

In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.

However, in LSI every word of a doc can be represented as a vector, but this is not what the referred article discusses.
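
The term space case can be sketched as follows: each unique term is one axis, each doc is a single vector, and the query is a single vector too. Raw term counts are used as weights only to keep the example short, and the docs and query are made up.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """One vector per text: one coordinate per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "search engines rank documents",
    "d2": "vector models for search",
    "d3": "cooking pasta at home",
}
query = "search engines"

vocabulary = sorted({t for text in docs.values() for t in text.lower().split()})
q = to_vector(query, vocabulary)
ranked = sorted(docs, key=lambda d: cosine(to_vector(docs[d], vocabulary), q),
                reverse=True)
print(ranked)
```

However many words a doc has, it contributes exactly one vector, and docs are ranked by the angle between that vector and the query vector.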

Blockquoted Passage (from one commenter)

It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.

Comments

Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.

There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to form an addressable “book” consisting of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. In the posting lists we can store different doc features, like f values, word positions, word spacing, in-title flags, etc.
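
A toy inverted index shows the idea. This is a sketch under simple assumptions: whitespace tokenization, and posting lists that store only word positions (from which f follows). The docs are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Return {term: {doc_id: [positions]}}.

    The keys form the vocabulary; each value is a posting list.
    The frequency f of a term in a doc is len(positions).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    "d1": "job search and job boards",
    "d2": "search engines and vector models",
}
index = build_inverted_index(docs)
print(index["job"])     # posting list for "job"
print(index["search"])  # posting list for "search"
```

Looking up a term is a dictionary access, so a query only touches the posting lists of its own terms rather than the whole collection.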

The index can be computed and already be in cache before any query. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. For each term, IDFs are precomputed in advance. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over the entire collection).

From matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note these can be a predetermined number of docs from the index. Thus, if a million docs are matched by the posting list(s), only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.
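
Here is a sketch of scoring at query time from posting lists only. The tiny index, its precomputed IDFs, the 1 + log(f) local weight, and the binary query weights are all assumptions for illustration; length normalization is left out to keep it short.

```python
import heapq
import math

# Hypothetical precomputed index for a 4-doc collection:
# term -> (IDF, {doc_id: f})
index = {
    "vector": (math.log10(4 / 2), {"d1": 3, "d4": 1}),
    "space":  (math.log10(4 / 1), {"d1": 1}),
    "search": (math.log10(4 / 3), {"d1": 1, "d2": 2, "d3": 1}),
}

def top_docs(query, index, k=2):
    """Score only docs found in the posting lists of the query terms."""
    scores = {}
    for term in query.lower().split():
        if term not in index:
            continue  # term not in the vocabulary
        idf, postings = index[term]
        for doc_id, f in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + (1 + math.log10(f)) * idf
    # Keep only the k best-scoring docs.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(top_docs("vector space", index))
```

Docs d2 and d3 are never visited for this query, since they do not appear in the posting lists of "vector" or "space".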

For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resort to precaching strategies, so a similar analysis can be done offline too (e.g., from a pool of frequently queried terms, the so-called suggestion lists). We can also use prebuilt thesauri to find similar docs, impacting precision and recall. Processes can be called by geolocation to serve a specific region or regional directory, etc.

For rather small collections, the term-doc matrix can be built from all terms in the collection stored on disk, or small term-doc matrices can be constructed in advance from each posting list. All of this is done offline and before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, and the results are postprocessed, ranked, and presented to the end user within a fraction of a second.

To boldly suggest that vector models are not used by top search engines for ranking docs is plain nonsense. Still, link weights alone or vector similarity scores alone are not enough. These scores can be combined with link weights or with other analytics to get a final score.

Combining those scores does not simplify computation; it adds another layer of complexity. Not combining them, however, can leave out meaningful docs and queries.

Note

Since SEOs love to quote each other and I love to quote IRs, let’s have a happy medium and quote both:

http://www.webpronews.com/topnews/2001/09/05/google-interview-by-fredrick-marckini

One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html  which incidentally mentions tf and IDF.

Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explained in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustively searching the collection.

Blockquoted Passage

Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.

Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

Comments
Why agree with SEO hearsay? See the previous comments.

Also, depending on how an index was constructed, a query does not need to “travel” an entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhaustive searches. That’s why it is known as an “addressable book” or “addressable tree”.

Furthermore, literature on vector models for phrases can be found on the Web. IDFs for phrases are certainly computable. To read up on the subject, or on inverted index strategies, index segmentation, index merging, etc., see http://lucene.sourceforge.net/talks/inktomi/  or just do a search for “phrase idf”.

Conclusion

To sum up, even if no current commercial search engine uses plain tf*IDF this does not mean they don’t use vector space models for classification, retrieval, and ranking.

Vector space models are often present in different flavors and at different levels of an IR architecture. Vector Theory itself is an ancillary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.

How Search Engines Do Not Work

April 17, 2008

If you are taking the Search Engines Architecture grad course, by now you should have learned what the main components of a search engine are and how to build a web crawler and a parser. You should know how to build an inverted index, how to use it to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work is either misconception, myth, or just untrue folklore. E.g., while some have an incorrect idea of how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply nonsense.

To illustrate, visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first one is about an SEO discussing “how search engines work” and how they use the Vector Space Model. The second is about the State of Washington suing a marketing company for mis-selling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

Demystifying LSI Video

April 7, 2008

Here is a video of my presentation, Demystifying LSI, at the OJOBuscador Congress 2.0, Madrid, Spain, 2007. One year later, nothing has changed. Many of the same crooked SEOs exposed during the congress are still deceiving the public about what LSI is.

Unfortunately, the video and lighting quality is not good enough to see the pdf slides, plus the presentation is in Spanish. Since attendees were not scientists, I talked very slowly for over an hour.

Want to get bored for the next hour? View the video.

Thanks to N. Valenzuela Alonso, Director of SEO and Search Engine Marketing of Media Bit, S.L. for the link (www.ithinksearch.com/2008/03/31/video-lsi-de-edel-garcia-desmitificando-lsi/).

Here is also the presentation of Carlos Castillo (Chato), from Yahoo! Research Spain:

Adversarial IR with Web Spam, parts 1 and 2 
(http://www.ojobuscador.com/2007/06/14/ir-con-adversario-y-webspam-videopost/).

I had a great time talking with Carlos, a former grad student of Ricardo Baeza-Yates.

Baeza-Yates, Andrei Broder, Gerard Salton, Keith van Rijsbergen, and a few others have helped to shape what is today known as Information Retrieval Research.

Talking about Andrei Broder (one of the main researchers behind the old mighty Altavista), here is also a great interview, thanks to the ojobuscador site:
http://www.ojobuscador.com/2006/05/20/entrevista-a-andrei-broder/

 

How Many SEO Myths In One Sentence?

March 12, 2008

Just when I thought I had read enough SEO myths, here is another SEO “expert” combining many of them in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that it is not useful.

Two issues strike me, since they suggest a lack of understanding of how search engines work. They come down to the following questions:

1. Could KD be used by search engines or users to check for keyword spam?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for keyword spam?

Word repetition that search engines flag as keyword spam should be of more concern than what web page creators or a KD tool tag as keyword spam. After all, search engines, not web page designers, are the ones that assign a rank to documents. This ties in with the user-machine relevance perception mismatch and with the concept of document linearization as part of gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, on this blog, and at Mi Islita.

However, this does not mean end users count for nothing, as they are the ones that pay the bills. And even if they don’t, why rank a page high just to see users go somewhere else after visiting it because it is not fit for human consumption? So, rather than using a KD tool, just write as naturally and usefully for your prospective clients and readers as you can.

Regarding the use of KD tools to check for spam, this claim reminds me of certain SEO books, marketers, and community forums that insist on such nonsense just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked these and similar SEO myths almost routinely. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening the effect of keyword repetition. Two for the price of one!

This is more cost effective at neutralizing keyword repetition than computing (and comparing against) a whole new ratio, KD. Best of all, it does not require the two extra loops one would need to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scales of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.
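
A quick illustration of the point, assuming KD is computed as 100 * (term count) / (total words) and that the engine uses a 1 + log(f) local weight with base-10 logs. The 100-word docs are synthetic.

```python
import math

def keyword_density(tokens, term):
    """KD as a percentage: 100 * occurrences / total words."""
    return 100.0 * tokens.count(term) / len(tokens)

def loga(f):
    """Local weight 1 + log10(f), taken as 0 for absent terms."""
    return 1 + math.log10(f) if f > 0 else 0.0

# Synthetic 100-word docs in which "flights" is repeated f times.
for f in (1, 5, 25, 50):
    tokens = ["flights"] * f + ["filler"] * (100 - f)
    print(f"f = {f:<3} KD = {keyword_density(tokens, 'flights'):5.1f}%  "
          f"local weight = {loga(f):.2f}")
```

KD grows linearly with repetition, while the weight the scoring model actually assigns barely moves. The attenuation comes for free with the scoring itself.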

To be sure students understood, I included the following three questions in the multiple-choice section of the Final Exam. (The problem-solving section of the test is even more interesting, but it is too long to include here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times should this term be repeated in d2 to triple its d1 weight? Assume base-10 logs.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).
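
The answer to #16 can be checked numerically. The sketch assumes LOGA means L = 1 + log10(f) and reads "repeats a term once" as f = 1 in d1.

```python
import math

def loga(f):
    """LOGA local weight, assumed here to be 1 + log10(f)."""
    return 1 + math.log10(f)

w1 = loga(1)             # weight of the term in d1 (f = 1)
target = 3 * w1          # triple the d1 weight
f2 = 10 ** (target - 1)  # solve 1 + log10(f2) = target for f2

print(w1, target, f2)
```

Tripling the weight takes a hundredfold increase in repetition, which is the whole point of attenuating frequencies.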

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is nonsense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, this Spring I’ll be teaching the graduate course Search Engines Architecture.

I will blog the syllabus tomorrow, but it is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own authorship.

On and on, SEOs still have no clue about what a search engine can or cannot do.

Keyword Density, SEOs, and the Deception War

February 7, 2008

I’m happy that at this Sphinnessed post: http://www.searchenginepeople.com/blog/how-search-really-works-the-keyword-density-myth.html , several SEOs are finally waking up and getting the Keyword Density Myth.

Great to see that they are realizing what IR grad students already know (http://irthoughts.wordpress.com/2008/01/25/the-power-of-document-linearization/ ):

That KD is not even a cost-effective tool for detecting spam, as search engines can use local term weight models that in addition to scoring terms can attenuate word frequencies. More on this here http://irthoughts.wordpress.com/2007/05/09/keyword-density-the-devils-advocate/  and here http://irthoughts.wordpress.com/2007/05/07/keyword-density-kd-revisiting-an-seo-myth/   

Such models effectively minimize spurious effects/advantages derived from keyword repetition. Some of these are LOGA, LOGN, ATF1, and SQRT. One could even use the idea of global ENTROPY and propose a local ENTROPY model to neutralize any attempt to misrepeating terms. All these have been discussed in my Web Mining course.

Based on the aforementioned links, it is clear that the Search Engine War never ends, especially when in addition to their spam tactics, marketers are proposing soooo many theories made out of thin air. What is worse is that from time to time they induce their peers and cheerleaders to buy into their Latest SEO Incoherences (“LSI”).

The recent round of nonsense surprisingly comes from alleged “SEO experts”. From claims about sculpting PageRank (http://sphinn.com/story/26410 ) to the usual LSI nonsense (http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ) to Keyword Density (http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/ ), the Deception War never ends.

I’m glad at AIRWEB we fight these folks (http://airweb.cse.lehigh.edu/2008/cfp.html ). Ironically, what Google’s Cutts calls “proper user” of nofollow link attributes and “sculpting” is just another way of saying end users’ manipulation attempts. Great PR exercises.

The Power of Document Linearization

January 25, 2008

In http://www.miislita.com/fractals/keyword-density-optimization.html  I explained to the SEO community the concept of document linearization as part of document GAP analysis. Marketers learned what IR graduate students already know: that document linearization (i.e., markup removal) is just one component of document indexing.

Keyword distribution, word distances, phrase matching, etc. are obtained from the text stream that results from linearization, not from the apparent position of text that is rendered by a browser and visually inspected by average end users. Document linearization debunks the common SEO Keyword Density Myth. One thing is the apparent distribution of words as perceived when end users visually scan a document and another thing is the actual word distribution as parsed by a search engine. The futility of computing KD values is quite obvious.
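
A rough sketch of linearization with Python's standard html.parser follows. Real indexers do much more than this; the sample markup is made up, and skipping script/style content is my own simplification.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Collect text in source order, ignoring markup, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    # Collapse whitespace into a single text stream.
    return " ".join(" ".join(parser.chunks).split())

html = ('<div style="float:right">Contact us</div>'
        '<h1>Cheap flights</h1><p>Book <b>cheap</b> flights today.</p>')
stream = linearize(html)
print(stream)
print(list(enumerate(stream.split())))  # word positions in the text stream
```

A browser would render "Contact us" off to the right of the heading, but in the text stream it comes first. Word positions and distances are taken from this stream, not from the rendered page.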

Here is a report of another recent SEO that discovered the power of document linearization:

http://seo-gw.blogspot.com/2008/01/fractal-semantics-linearization.html

The testimonial is worth reading.

The post http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/  is also relevant these days.

Search for posts on keyword density: http://irthoughts.wordpress.com/?s=keyword+density

Microsoft’s Black Cloud on Yahoo! & SEO Tag Clouds

January 23, 2008

From time to time rumors spread of the black cloud of Microsoft over Yahoo!; i.e., of Microsoft buying Yahoo!. This time things are less cloudy, especially now that Yahoo! is about to cut jobs.

Early this year, Jeremy Zawodny from Yahoo!, wrote:

“Sure, there would be cultural problems, integration challenges, and many people who’d likely walk. But at the end of the day, Microsoft would end up with a much larger set of online services, a better advertising network, and people who know how to build, brand, and market web stuff that people actually use.”

Talking about clouds:

A student asked me about some SEOs claiming that text tag clouds are a kind of LSI technology.

Pure nonsense coming from many SEOs, as usual.

These clouds are easy to construct. No LSI is needed:

1. Sort terms from a document or lookup list by frequencies.
2. Normalize frequencies to fall within the [0, 1] interval.
3. Use normalized frequencies as parameters to be passed as font sizes.

For pizzazz, store terms in an array to be sorted or randomized, and/or use some CSS.

We can do the same with hit counts assigned to blog categories, links, etc. No special technology is needed.
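
The three steps above fit in a few lines. In this sketch the min-max normalization, the pixel range, and the inline-style output are arbitrary choices of mine.

```python
from collections import Counter

def tag_cloud(text, min_px=10, max_px=32):
    counts = Counter(text.lower().split())
    if not counts:
        return ""
    lo, hi = min(counts.values()), max(counts.values())
    spans = []
    # Step 1: sort terms by frequency.
    for term, f in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Step 2: normalize frequencies to the [0, 1] interval.
        norm = (f - lo) / (hi - lo) if hi > lo else 1.0
        # Step 3: pass the normalized frequency on as a font size.
        size = round(min_px + norm * (max_px - min_px))
        spans.append(f'<span style="font-size:{size}px">{term}</span>')
    return " ".join(spans)

print(tag_cloud("seo seo seo myth myth lsi"))
```

Counting, scaling, and a bit of markup: no SVD and no term-doc matrix anywhere in sight.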

Finding Topic-Specific Posts

January 18, 2008

Web Mining Week 7

January 14, 2008

Week 7 Agenda

Review of Association and Scalar Clusters
Review of Vector Space Models
LSI & SVD: Demystifying LSI SEO Myths (OJOBuscador Congress, Madrid; PDF Presentation)
LSI & Keyword Research (PDF Presentation)
SVD Noise Filtering: Principal Component Analysis (PCA)

Required Reading Material

Tutorial Series
This is part one of a five-part tutorial series:
http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

Fast Tracks
These are quick tutorials, with to-the-point calculations:
http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Blog Posts
These are IR blog posts designed to fight back against misinformation promoted by unethical SEOs and Spammers:
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/page/1/?s=lsi
http://irthoughts.wordpress.com/page/2/?s=lsi

Blog Category
This is a blog category pointing to a collage of posts that demystify SEO nonsense about LSI. Some are about topics that overlap with LSI:
http://irthoughts.wordpress.com/category/latent-semantic-indexing/   

Kendall’s Test and PageRank

December 26, 2007

For years marketers have put PageRank-based models into question, via “sandbox” theories and similar tales.

This old paper might help to address the issue on the biased/unbiased nature of these models: Paradoxical Effects in PageRank Incremental Computations.

What interests me the most about the paper is how the authors put Kendall’s τ to use. Written by Paolo Boldi, Massimo Santini, and Sebastiano Vigna, the abstract states:

“Abstract. Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using Kendall’s τ. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.”
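
For readers unfamiliar with Kendall’s τ, here is a minimal sketch of the simplest variant (τ-a, no correction for ties). The two score tables are invented numbers, not data from the paper.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two scorings of the same nodes.

    rank_a, rank_b: dicts mapping node -> score.
    +1 means the two orders agree on every pair, -1 means they
    disagree on every pair. Ties count as neither.
    """
    nodes = list(rank_a)
    concordant = discordant = 0
    for u, v in combinations(nodes, 2):
        s = (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(nodes) * (len(nodes) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical PageRank scores on the full graph and on a partial crawl.
full = {"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.10}
partial = {"a": 0.35, "b": 0.15, "c": 0.30, "d": 0.20}
print(kendall_tau(full, partial))
```

The measure looks only at the order of the nodes, not at the PageRank values themselves, which is what matters when comparing rankings.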

The authors also state:

“The most classical visit strategies are the following:

Depth-first order: the crawler chooses the next page as the last that was added to the frontier; in other words, the visit proceeds in a LIFO fashion.
Random order: the crawler chooses randomly the next page from the frontier.
Breadth-first order: the crawler chooses the next page as the first that was added to the frontier; in other words, the visit proceeds in a FIFO fashion.
Omniscient order (or quality-first order): the crawler uses a queue prioritised by PageRank values [Cho et al. 98]; in other words, it chooses to visit the page with highest quality among the ones in the frontier. This visit is meaningless unless a previous PageRank computation of the entire graph has been performed before the visit, but it is useful for comparisons. A variant of this strategy may also be adopted if we have already performed a crawl and thus we have the (old) PageRank values of (at least some of the) pages.”

“Both common sense and experiments (see, in particular, [Boldi et al. 05]) suggest that the visits listed above accumulate PageRank in an increasingly quicker way. This is to be expected, as the omniscient visit will point immediately to pages of high quality. The fact that the breadth-first visit yields high-quality pages was noted in [Najork and Wiener 01].”

“There is, however, a different and also quite relevant problem that has been previously overlooked in the literature: if we assume that the crawler has no previous knowledge of the web region it has to crawl, it is natural that it will try to detect page quality during the crawl itself, by computing PageRank on the region it has just seen. We would like to know whether a crawler doing so will obtain reasonable results or not.”
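
The first three visit strategies differ only in which page is taken from the frontier next. Here is a toy sketch over a made-up five-page graph; the omniscient order is left out because it needs precomputed PageRank values.

```python
import random
from collections import deque

def crawl(graph, seed, strategy="bfs", rng=None):
    """Return the order in which pages are visited.

    graph: dict mapping page -> list of outlinks (a toy stand-in for the web).
    strategy: "bfs" (FIFO), "dfs" (LIFO), or "random".
    """
    rng = rng or random.Random(0)
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        if strategy == "bfs":
            page = frontier.popleft()   # first added, first visited
        elif strategy == "dfs":
            page = frontier.pop()       # last added, first visited
        elif strategy == "random":
            i = rng.randrange(len(frontier))
            page = frontier[i]
            del frontier[i]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
for strategy in ("bfs", "dfs", "random"):
    print(strategy, crawl(graph, "a", strategy))
```

Stopping any of these visits early gives a different subgraph, and hence a different PageRank estimate, which is the effect the authors measure.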

From Keyword Density to William Tutte’s Legacy

December 20, 2007

From Keyword Density to Keyword Distribution

Finally we have the Christmas Break from graduate school.

In my last Web Mining Course lecture before the Christmas Break, I tried to explain to students the importance of incorporating word spacing in information retrieval algorithms and in document relevance assessments. I explained why ideas like the SEOs’ keyword density (KD), the traditional local term weight model known as FREQ (Term Count) used in early papers on Vector Space and LSI models, and the like are poor estimators of document relevance.

Among other theoretical reasons, we discussed that a term mentioned X times is not necessarily X times more important than other terms. In addition, KD and the term count model cannot attenuate frequencies. We then discussed several frequency attenuation models (keyword spam filters) that also work as term weight scoring models. These can dampen the effect of abnormal repetition of terms, raise a spam flag, and do not require any reference to KD “tales”.

We also discussed several scenarios in which one could use word distributions and co-occurrence to analyze textual information far better than with the aforementioned “crapstimators”. For instance, word spacing can be used in encryption/steganographic algorithms to uncover hidden messages, to profile writing styles and people, to attribute authorship of text, and to assess plagiarism, fraud, etc.
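
As a small illustration of why distribution carries information that KD misses, this sketch measures the gaps between successive occurrences of a term. The two token lists are synthetic.

```python
def term_gaps(tokens, term):
    """Distances (in words) between successive occurrences of a term."""
    positions = [i for i, t in enumerate(tokens) if t == term]
    return [b - a for a, b in zip(positions, positions[1:])]

# Two synthetic 12-word texts with the same count of "seo" (same KD),
# but very different word spacing.
stuffed = "seo seo seo seo x x x x x x x x".split()
spread = "seo x x seo x x seo x x seo x x".split()

print(term_gaps(stuffed, "seo"))
print(term_gaps(spread, "seo"))
```

Both texts have the same keyword density, yet the gap lists differ: one term is bunched at the start and the other is spread through the text.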

I’m happy that not all SEOs are buying into the keyword density of non-sense and similar “crapstimates”, as I can see from these SEOmoz posts.

From Keyword Distribution to William Tutte’s Legacy

This morning I came across a nice biography of one of those venerable giants: the late William Tutte. Beautifully written by Dan Younger, the biography is a tribute to Tutte’s greatness. Interesting to point out in relation to word spacing theory is this portion of Younger’s writing (emphasis added):

“Tutte’s great contribution was to uncover, from samples of the messages alone, the structure of the machines which generated these codes. This came about as follows. In August 1941, a German operator sent a Fish-enciphered teleprinter message of some 4000 letters from Athens to Berlin. For some reason, the message was not received properly and so it was resent. Against all guidelines, it was sent with the same setting. It was identical in content, but it differed slightly, in word spacing and punctuation. John Tiltman of Bletchley was able to use this blunder to find both the message and the obscuring string that was added to make up the enciphered message. But that seemed to be all that could be found, when Tutte was presented with the case in October.”

“Tutte began by observing the machine generated obscuring string carefully. Splitting it up into various lengths, he noticed signs of periodicity. For the first of the five teleprinter tape positions, the regularity he supposed arose from a wheel of 41 sprockets. And then at the last position, one of 23 sprockets. Over the next months, Tutte and colleagues worked out the complete internal structure, that it had twelve wheels, two for each of the five teleprinter positions, and two with an executive function. They determined the number of sprockets on each wheel, and how the advancement of the wheels was interrelated. They had completely recreated the machine without ever having seen one. Tony Sale, who first described this work in a 1997 article in New Scientist, characterized it as the “greatest intellectual feat of the whole war.”

“Knowing the structure of the enciphering machine is a necessity for code-breaking, but it is only the first step. Tutte then put himself to creating an algorithm to find from the enciphered messages the initial settings of the machine wheels. The algorithm that he created, the “Statistical Method”, looked for certain types of resonances, but it had to consider far too many possibilities to be carried out by hand. So it was that, in 1943, the electronic computer COLOSSUS was designed and built by the British Post Office. It was to run the algorithms that Tutte; and his collaborators Max Newman and Ralph Tester; developed, that COLOSSUS was created. This man-machine combination was used to break Fish codes on a regular basis throughout the remainder of the War”.

I hope you understand now the title of this post.

In today’s Web the enciphering machines are search engines, but the underlying principles driving the Search Engines War are the same.

Emphasized words should make sense to students of the Web Mining course.

A Call to Expose SEO Liars

August 29, 2007

Since A Call to SEOs Claiming to Sell LSI, many are finally realizing they were taken/gamed by crooked SEOs selling snake oil in the form of spurious LSI arguments. It is now time to issue a call to expose all these sinister marketers that are giving a black eye to the search marketing industry. So, you are welcome to join great guys like Mike Duz, David Petar, and Mike Grehan and expose these people.

If you prefer, do like Dan Thies and blog about their myths. In Lies, Damn Lies, Thies has exposed another old SEO myth: keyword density. Here are additional reasons against this myth that many marketers are still hanging on to:

2007/05/09 Keyword Density Myth - The Devil’s Advocate

2007/05/07 Keyword Density (KD): Revisiting an SEO Myth

On the Evolution of SEO Myths

The evolution of KD myths and KD tools within SEO circles is anecdotal. It is quite similar to the evolution of LSI-based SEO myths promoted by almost the same marketers. There is a clear pattern of deception:

Repeat hearsay many times, spin it, play with words, convince cheerleaders to parrot your hearsay and then repeat everything again until many cheerleaders, peers, and “experts” repeat your nonsense in blogs and SEO books. Invent formulas out of thin air and tools that support these, etc, etc, etc.

If you prefer, misquote or copy/dump IR papers and patents in your blog to give the impression you know about information retrieval. Then, stretch these IR papers or patents to your heart’s content or to fit whatever you are trying to sell or promote. That can be your own image or other crooks’ services.

Two wings of the same bird

That’s how the KD and LSI SEO myths have survived all these years. These are two wings of the same bird. Unfortunately the very same marketers go to fancy search marketing conferences, blogs, forums and a few other channels to spread the same misinformation or to lead others into error. No wonder Mike Grehan has called these ‘hot air’.

Take, for instance, those marketers that have preached about LSI for years or have been selling “LSI-like services” without even having a clue about how SVD actually works. They do so either to build an image as “experts” or to intentionally deceive their peers and clients, because of vested interests.

When caught with their pants down they often have two choices:

1. recanting.
2. recoiling.

A few raise the royal “we” and “honest” flags and then resort to throwing dirt rather than proving their case regarding their LSI claims. As far as I’m concerned they can throw dirt or scream like babies all they want. They deserve to have their heads hammered any day of the week.

These are the very same folks that give a black eye to the damn search marketing industry, by deceiving the public and prospective clients while posing as honest business guys. No wonder so many IR folks perceive SEOs just as vulgar spammers.

As I always say to peer IRs and graduate students, not all SEOs are deceivers. Some are indeed ethical and quite honest. However, the bad apples are easy to spot.

Most likely, the more vocal “SEO experts” are, the less they know about information retrieval and search engines. To be on the safe side, stay away from those that peer marketers call “SEO experts”. As we say in Spanish: ‘Ante la duda, saluda’.

Many of these have been exposed many times and in different places. Here are some references for your perusal:

SVD and LSI Tutorial 1: Understanding SVD and LSI

SEOs and their LSI Misconceptions

LSI Blog Posts and SEOs

When SEOs are caught in Lies

SEOs and Still Their LSI Misconceptions

July 19, 2007

I just came across this article

http://seo-and-google.blogspot.com/2007/07/5-tips-to-effective-seo-keyword.html

by Valerie DiCarlo and honestly don’t know where these marketers learn all these misconceptions regarding LSI. Perhaps she has been misled as well by the usual suspects and is just trying to make some honest comments she truly believes. Unfortunately most of her claims are incorrect. I’m commenting on her lines, one by one.

(more…)

Random Notes

July 10, 2007

I’m putting the final touches to this month’s issue of IRW, which is running late –for reasons all subscribers know by now. It should be out tomorrow.

Amazing how many are still perpetuating so many misconceptions about “LSI tools”. Here is another example, forwarded to me by Melissa Fach, one of several SEOs that are discovering how many “LSI-based” SEO lies are out there thanks to the usual suspects:

http://courtneytuttle.com/2007/07/05/taking-seo-to-the-next-level-lsi/

(more…)

A Call to SEOs Claiming to Sell LSI

July 9, 2007

Mike Grehan finished his great ClickZ column of June 11, 2007, SEO Is Dead. Long Live, er, the Other SEO, as follows:

“I’ve run out of space again. I’ll come back to the stupidity of the latent semantic indexing issue in my next column.”

(more…)

Sneak Preview of IR Watch

July 6, 2007

As mentioned, the July issue of IR Watch is running late due to the backend changes we made last week to our main site (http://www.miislita.com). If you are a subscriber, IRW should arrive in your inbox in a few days. This issue is dedicated to Market Basket Analysis and Keyword Research. Some portions are adaptations from Tan, Steinbach, and Kumar’s book “Introduction to Data Mining”.

(more…)

Is SEO Dead?

June 12, 2007

Under the catchy title SEO Is Dead. Long Live, er, the Other SEO, my friend Mike Grehan once again has a great ClickZ column wherein he comments on Google’s and ASK’s new approaches to satisfying users’ information needs.

He ends the article as follows:

(more…)

Subsumptions vs Synonyms - Conceptual Indexing Revisited

June 11, 2007

Back in 1997, William Woods, Principal Scientist and Distinguished Engineer at Sun Microsystems Labs, wrote Conceptual Indexing: A Better Way to Organize Knowledge. Although the notion of conceptual indexing turned out to be a complex thing, his paper is still relevant these days, when many SEOs make incorrect claims about how search engines use Latent Semantic Indexing (LSI) and others are paying attention to synonymy and phrase processing patents. This post is based in part on Woods’s manuscript.

(more…)

LSI Blog Posts and SEOs

June 6, 2007

I’m still trying to understand why so many SEOs have LSI backward and why others insist on promoting or explaining something that is not LSI as LSI. Some even repeat previous fallacies they have heard across the Web or from contaminated pools of knowledge like Wikipedia.

To top it off, I have emails from SEOs who are mad about being misled by other SEO “experts” regarding claims about what LSI is or how it works.

(more…)

Zoom in this Theme: The LSI Myth

May 11, 2007

A few days ago, Michael Duz had a great post, The LSI Myth, wherein he describes the nonsense promoted by snakeoil SEO marketers. He has a list of common taglines used by these people:

(more…)

Keyword Density Myth - The Devil’s Advocate

May 9, 2007

Those that promote keyword density (KW) myths are now claiming that search engines use KW as an inexpensive spam detection mechanism; i.e.,

KW = fij/lj
if (KW > upper threshold value) { // raise the spam red flag }
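For concreteness, the claimed filter can be sketched as follows. This assumes fij is the number of times term i occurs in document j and lj is the total number of terms in document j. The function names and the threshold value are invented purely for illustration; nothing here implies that any search engine works this way.

```python
def keyword_density(term, tokens):
    """KW = fij / lj: occurrences of the term divided by document length."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


def claimed_spam_flag(term, tokens, upper_threshold=0.2):
    """The alleged filter: flag the document when KW exceeds a threshold.

    The default threshold is a made-up value, not a documented one.
    """
    return keyword_density(term, tokens) > upper_threshold


# Hypothetical six-token document.
tokens = "buy pills buy pills buy now".split()
print(keyword_density("buy", tokens), claimed_spam_flag("buy", tokens))
```

Note that the whole check reduces to a single ratio per term, with no regard for where the term occurs or how its occurrences are distributed.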

(more…)

Keyword Density (KD): Revisiting an SEO Myth

May 7, 2007

Back in March of 2005 I wrote the article The Keyword Density of Non Sense for Mike Grehan’s newsletter. An expanded and improved version was also published at Mi Islita.com. After these articles, many SEOs saw the light.

However, in an attempt at perpetuating KD myths, a few SEOs tried to reformulate the alleged importance or usefulness of keyword density by presenting KD as a spam detection filter used by search engines. Good try, but this is still non sense and another SEO myth.

(more…)

SEOs Blogging LSI Non Sense

May 6, 2007

At this SEOMoz.org blog, posters are discussing search engines’ semantic capabilities, including LSI.

I stopped by to clarify several things since many of these posters present their hearsay as valid statements.

(more…)

Two SEO Blogonomies

May 3, 2007

As I mentioned in a ClickZ column written by Mike Grehan, The Myths and Maths of SEOs, a blogonomy is the dissemination of false knowledge through electronic forums, especially through blogs. Today I want to comment on two LSI blogonomies promoted by several SEO firms.

(more…)

SEO Blogonomies: The Search Engine Markov Chain

May 3, 2007

Note: I added this post content to the Stochastic Matrix tutorial.

The spreading of incorrect knowledge, or at best inaccurate representations of concepts, is prevalent in circles associated with search engine optimization (SEO). This is a social phenomenon most noticeable in the blogosphere and in public forums (sites and discussion forums). Because of this, we call instances of the phenomenon “blogonomies”.

(more…)