Keyword Density Myth – The Devil’s Advocate

Those who promote keyword density (KW) myths are now claiming that search engines use KW as an inexpensive spam detection mechanism; i.e.,

if (KW = fij/lj > upper threshold value) { // raise the spam red flag }

Here terms are defined as follows:

fij = frequency of term i in doc j
lj = length of doc j

It is often argued that search engines compute lj as the number of words, including stopwords. Others argue that lj is computed without counting stopwords. Some even claim that search engines define lj as the number of unique terms. Obsessed KW chasers claim that this ratio is computed both for specific portions of a document and for the entire document.
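To see how arbitrary this is, here is a minimal sketch (Python, with a toy document and a made-up stopword list; nothing here is meant to mirror any actual engine) showing that the very same document yields three different "densities" for the same term, depending on which definition of lj one picks:

# Illustrative only: a toy document and a toy stopword list.
doc = "the quick brown fox jumps over the lazy dog the fox".split()
stopwords = {"the", "over"}

fij = doc.count("fox")                 # frequency of term i ("fox") in doc j

lj_all = len(doc)                                         # lj = all words, stopwords included
lj_nostop = len([w for w in doc if w not in stopwords])   # lj = stopwords excluded
lj_unique = len(set(doc))                                 # lj = number of unique terms

for name, lj in [("all words", lj_all), ("no stopwords", lj_nostop), ("unique terms", lj_unique)]:
    print(name, round(fij / lj, 3))    # three different "densities" for the same doc

Same document, same term, three different KW values. Which one is the "real" keyword density?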

The idea of computing KW for a given term i in a given doc j, they suggest, is to compare this ratio against an upper threshold value, tv. It is assumed that:

1. tv is the same for all terms and all docs present in a collection, so tv is unique.

2. if KW > tv, the doc can be flagged as suffering from keyword spam in term i.

There are several problems with this reasoning: 

Note that this would require two loops: one to count terms in a given doc and another to go over docs and their lengths.
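In code, the brute-force computation being proposed would look something like the following sketch. This is Python with illustrative names only; it is not anyone's actual implementation:

from collections import Counter

def flag_keyword_spam(collection, tv):
    """Brute-force KW check: one loop over docs and their lengths,
    another loop over every term in every doc to test the ratio."""
    flags = []
    for j, doc in enumerate(collection):          # loop 1: docs and their lengths
        lj = len(doc)
        counts = Counter(doc)                     # fij for every term i in doc j
        for term, fij in counts.items():          # loop 2: every term in the doc
            kw = fij / lj
            if kw > tv:                           # raise the spam red flag
                flags.append((j, term, kw))
    return flags

# toy usage with a hypothetical threshold value
collection = [["buy", "cheap", "pills", "pills", "pills"], ["a", "normal", "document"]]
print(flag_keyword_spam(collection, tv=0.5))      # -> [(0, 'pills', 0.6)]

That is a lot of redundant work for every single term and every single document, as discussed below.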

In addition, the uniqueness of tv, i.e., that there is a single threshold value for all terms in a collection, is taken as a given, regardless of the nature of the terms.

KW fans have claimed all sorts of threshold values (tv): 1%, 5%, and so forth. Often they guess tv by simply looking at the top N ranked results from a given query. The implication is that tv somehow correlates with relevance judgments and ranking results.

Of course, there are at least two problems with this reasoning. First, many factors determine the rank a search engine assigns to a document for a query. Second, a simple local ratio of words like KW tells us nothing about term relevancy in a document, nor about word usage, distribution of terms, co-occurrence between terms, pertinence of terms to a topic, and so forth.

Some might claim that using KW is less expensive than other mathy methods for detecting keyword repetition (keyword spam).

Of course, KW fans cannot prove any of their claims. Meanwhile, this works well for those promoting their own KW tools. Can you spell conflict of interest and self-promotion?

What do I think about all this?

This is my take:

From an operational standpoint, this way of flagging docs as spam is not only cost-ineffective, but actually unnecessary.

It is easier and less expensive for a system to reuse values that are already computed and in memory than to reinvent the wheel. Right?

Consider this.

The Devil’s Advocate

Search systems already compute local weights (Lij), wherein Lij is the weight of term i in doc j. Such weights are computed for many reasons. One is to compute overall weights aij, wherein

aij = Lij Gi Nj.

Here, Gi is the weight of term i in the collection and Nj is the normalization weight of a given document j. Lij, Gi, and Nj are not universal; search engines define these in many different ways.

In the post Keyword Density (KD): Revisiting an SEO Myth I explained two definitions of local weights, Lij. Indeed, dozens of definitions for Lij, Gi, and Nj have been published.
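For illustration only, here is a sketch that composes aij = Lij Gi Nj using two classic, published local weight schemes (raw frequency and logarithmic tf) and a textbook idf as the global weight. These are literature examples; no claim is made that any search engine uses these exact formulas:

import math

def local_raw(fij):
    return fij                                    # Lij = fij (the "primitive" scheme)

def local_log(fij):
    return 1 + math.log(fij) if fij > 0 else 0    # Lij = 1 + log(fij)

def global_idf(ni, n_docs):
    return math.log(n_docs / ni)                  # Gi = log(N/ni), a common global weight

# toy numbers: term i appears 4 times in doc j; 10 of 1000 docs contain it
fij, ni, n_docs = 4, 10, 1000
Nj = 1.0                                          # pretend the normalization weight is 1 here

for L in (local_raw, local_log):
    Lij = L(fij)
    aij = Lij * global_idf(ni, n_docs) * Nj
    print(L.__name__, round(Lij, 3), round(aij, 3))

Swap in a different Lij, Gi, or Nj definition and the overall weight aij changes, which is precisely why guessing at "density" thresholds from the outside is futile.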

The point is that Lij is already computed. It would be easier and less expensive to assign a threshold value, L*, and just evaluate the conditional

if (Lij > L*) do this or that…

than to invoke the two loops above to compute a whole new ratio (KW) for every single term and every single document in a collection.

Such “this or that” could be as simple as resetting Lij to another value, X.

X can be set to a value from 0 to Y, with Y <= L*. Here X = 0 acts as a full penalty; i.e., no weight is assigned to the term. [Some search engines are so politically correct (or hypocritical) that they don’t like any reference to “penalties”. Works for me. A monkey is a monkey no matter what, anyway.]

X can be set to a single value for the entire collection or allowed to vary. True, setting X in a variable fashion adds another operational layer, but using a variable upper threshold allows for topic importance discrimination and distilling.
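As a sketch, and nothing more, here is how such a conditional reset could be folded into a weighting routine. The function name, threshold, and values are hypothetical:

def cap_local_weight(Lij, L_star, X=0.0):
    """If a term's local weight exceeds the threshold L*, reset it to X.
    X = 0 amounts to a full penalty (call it what you want); 0 < X <= Y <= L*
    merely demotes the term instead of zeroing it out."""
    return X if Lij > L_star else Lij

# toy usage: a repeated term blows past L*, so its weight is reset
print(cap_local_weight(Lij=7.2, L_star=3.0, X=0.0))   # -> 0.0 (full penalty)
print(cap_local_weight(Lij=7.2, L_star=3.0, X=1.5))   # -> 1.5 (demotion)
print(cap_local_weight(Lij=2.4, L_star=3.0))          # -> 2.4 (untouched)

No extra passes over the collection, no new ratios; just a check against a value the system already has in hand.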

Wait a second, one might ask: why not compute a KW value at this point and as part of the routine?

Well, in principle, this could be done if search engines defined local weights as merely Lij = fij.

But there is a problem…

1. Dozens of local term weight scales have already been published.
2. Lij = fij is just a primitive scheme with many drawbacks.
3. More likely, no one outside a search engine knows for sure which of these local weight scales are used (unless, of course, the information has been made public or an insider has breached his/her non-disclosure agreement).

Chances are that search engines use their own unpublished scales.

I can understand why: it is in the best interest of search engines to keep outsiders in the dark, and confused. Why disclose how they actually detect spam? Why tip off spammers?

I understand why they might prefer to send someone to compound the confusion among SEOs/spammers.

Whatever you read in a research paper or patent, or hear from insiders and even outsiders, why assume that what you have read or heard reflects a current implementation in a production environment already plagued with scalability issues?

At least, knowing about all these local weight scales might be of value to SEO copywriting or to those who try to construct automatic copy evaluators and editors. This is one reason for reading the upcoming article (not a blog post):

Understanding Local Weights

If you are a copy strategist, human evaluator, or programmer, you might find something useful in it.