Keyword Density (KD): Revisiting an SEO Myth

Back in March of 2005 I wrote the article The Keyword Density of Non Sense for Mike Grehan’s newsletter. An expanded and improved version was also published at Mi Islita.com. After these articles, many SEOs saw the light.

However, in an attempt to perpetuate KD myths, a few SEOs tried to reformulate the alleged importance or usefulness of keyword density by presenting KD as a spam detection filter used by search engines. Good try, but this is still nonsense and another SEO myth.

Ask these folks for proof of these claims and see if they can provide any. If you read between the lines, what they are trying to do is keep their KD tools relevant and alive. They just want to insist on the dumb ideas and theories they nurtured around keyword density. A “saving face” effort can be recognized by the many twists involved.

This simply reinforces my notion that these folks either don’t get it or don’t have a background in IR.

I’m working on a paper on local term weights, called Understanding Local Weights. It is clear from the several models examined that some scales can be used as math red flags for detecting spam. Here is a sneak preview:

Let Lij be the local weight of term i in doc j. We can define Lij in many different ways as a function of fij, where fij is the occurrence (frequency) of term i in doc j.

Two candidate scales to think about are FREQ and LOGA.

FREQ Scale

If term i is in doc j, then Lij = fij; else, Lij = 0.

This is the most common model SEOs know about, wherein local weights are given as raw frequencies. KD lovers simply divide an fij value by the total number of words in the doc and call that KD.
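To make the two definitions concrete, here is a minimal sketch of the FREQ scale and of the KD ratio described above. The function names, the sample text, and the whitespace tokenization are assumptions made for illustration; they are not taken from any search engine or KD tool.

```python
from collections import Counter

def freq_weights(tokens):
    """FREQ scale: Lij = fij for terms present in the doc, 0 otherwise."""
    # A Counter returns 0 for missing keys, which matches the "else Lij = 0" rule.
    return Counter(tokens)

def keyword_density(tokens, term):
    """KD as described in the text: fij divided by the total number of words."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

# Hypothetical sample doc, split on whitespace (an assumed tokenization).
doc = "cheap widgets and more cheap widgets for cheap".split()

weights = freq_weights(doc)
print(weights["cheap"])               # raw frequency of a term that occurs
print(weights["gadget"])              # a term that does not occur
print(keyword_density(doc, "cheap"))  # fij / total words
```

Note that the KD ratio is nothing more than the FREQ weight rescaled by document length, so it inherits every limitation of raw frequencies.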

This primitive way of defining local weights (as Lij = fij) is used in the old IR literature and in a few simplistic models. For instance, early LSI papers used local weight-only scores. These are the same papers often misquoted by SEOs. The fact is that we normally use Lij = fij to teach students the basics. Then they can move on and learn about other local weighting schemes and improved LSI models.

FREQ has several drawbacks and limitations. For example, it assumes that a term repeated x times weighs x times more, which we know is not necessarily true.

So, if doc1 uses the term “crap” once and doc2 uses this term ten times, then

L(crap, doc1) = f = 1

L(crap, doc2) = f = 10

which assumes that crap is ten times more important in doc2 than in doc1, or that doc2 is ten times more pertinent to crap than doc1.

LOGA Scale

If term i is in doc j, then Lij = 1 + log fij; else, Lij = 0.

This model is a bit better than FREQ. It is a logarithm-augmented scale at a given base. Base 2 or base 10 is often used, but other bases are possible.

If I use base 10 logarithms in the example, then

L(crap, doc1) = 1 + log(1) = 1 + 0 = 1

L(crap, doc2) = 1 + log(10) = 1 + 1 = 2

The term weight just doubles, even though the term frequency grew tenfold. Term repetitions don’t drastically increase local weights. Now, if I want a third document, doc3, to be three times heavier in “crap” than doc1, then

L(crap, doc3) = 1 + log(100) = 1 + 2 = 3

So, I would need to repeat the term 100 times. That’s a lot of crap!
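The arithmetic above can be checked with a short sketch of the LOGA scale and its inverse. The function names are hypothetical and base 10 is assumed as the default, as in the example.

```python
import math

def loga_weight(f, base=10):
    """LOGA scale: Lij = 1 + log(fij) if the term occurs, else 0."""
    if f <= 0:
        return 0.0
    return 1.0 + math.log(f, base)

def repeats_needed(target_weight, base=10):
    """Invert LOGA: the frequency fij required to reach a given Lij."""
    return base ** (target_weight - 1)

# Weights for the three docs in the example (1, 10 and 100 occurrences).
for f in (1, 10, 100):
    print(f, round(loga_weight(f), 6))

# Occurrences needed for a weight three times that of doc1.
print(repeats_needed(3))
```

Because the inverse is exponential, each extra unit of weight at base 10 costs ten times more repetitions than the previous one.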

Most likely, even valid keywords repeated 100 or 1,000 times in a single web page will raise a spam red flag. Thus LOGA can work as both a weighting scheme and an obvious spam detector.

No doubt both FREQ and LOGA have some drawbacks, and there are better local scoring models (log normalized, squared, local entropy, etc.). All these models are designed to address specific extreme cases. Still, KD is nowhere to be found in any of these models.

How do I know if Google or Yahoo is using LOGA or other published local weight models? I don’t know, so I am not going to speculate.

I am sure about one thing: how natural would it be to repeat a term 100 or 1,000 times in a single web page? Exactly.

So, if I use LOGA at base 10 in my own experimental search engine, I would be inclined to tag as spam any doc with Lij around 3. I would probably set the upper bound around Lij = 2 or less.
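As a rough sketch of that idea, the snippet below flags any term whose base-10 LOGA weight reaches a cutoff. Everything here is hypothetical: the function names, the sample doc, the whitespace-level token list, and the cutoff of 3, which simply mirrors the "around 3" figure mentioned above. It says nothing about how any real search engine works.

```python
import math
from collections import Counter

SPAM_THRESHOLD = 3.0  # assumed cutoff, mirroring "Lij around 3" in the text

def loga_weight(f, base=10):
    """LOGA scale: Lij = 1 + log(fij) if the term occurs, else 0."""
    if f <= 0:
        return 0.0
    return 1.0 + math.log(f, base)

def flag_spam_terms(tokens, threshold=SPAM_THRESHOLD, base=10):
    """Return, in sorted order, the terms whose LOGA weight reaches the threshold."""
    counts = Counter(tokens)
    eps = 1e-9  # small tolerance for floating-point rounding in the logarithm
    return sorted(
        term for term, f in counts.items()
        if loga_weight(f, base) >= threshold - eps
    )

# Hypothetical doc: one term repeated 100 times, another 10 times, one once.
doc = ["crap"] * 100 + ["useful"] * 10 + ["words"]

print(flag_spam_terms(doc))                 # terms at or above a weight of 3
print(flag_spam_terms(doc, threshold=2.0))  # stricter cutoff
```

Lowering the threshold to 2 would flag any term repeated ten or more times, which is why the cutoff is left as a parameter rather than hard-coded.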

There are better local term weight scales. Wait for my article and learn why.

So, the next time you hear or read about the virtues of KD, get a gas mask or think about… toilet paper.


Published by egarcia
