Davidson made a fortune providing marketing services to companies by sending huge amounts of email spam. According to the news, “the spamming was designed to promote the visibility and sale of products offered by various companies. Davidson utilized the services and assistance of other individuals who he hired as ‘sub-contractors’ to provide spamming at his direction on behalf of his client companies,” the DOJ said.
I wonder who these companies and “sub-contractors” really are and when they are going to be held accountable. I also wonder whether similar crooks are still behind the scenes doing the usual activities (email spam, click-through fraud, spamdexing, misrepresentation of products and services, etc.).
Not to insinuate anything about anyone in particular, but whether you are a traditional marketer, a “bloggeter” (blogger + marketer), or a search engine optimization marketer, think about the consequences of becoming a spam marketer. All of your Web trails are under constant watch by the IRS, the DOJ, and a few others…
Prospective clients: resist the temptation of doing business with such people. You can also be held accountable for being an accessory.
Some SEOs, in an effort to sell something, gain credibility, or save face, will come up with all sorts of theories about search engines made up out of thin air. When not citing themselves, they cite each other’s hearsay, often through their link farms. When caught with their pants down, they will lie or edit the qualifiers in their posts. Can you guess who, according to Mike Duz, wrote this?
“Some of those well in the know attribute this to latent semantic indexing, which Google has been using for a while, but recently increased its weighting”. (From the Internet Archive)
According to Duz, this guy later changed his categorical assertion into this:
“Even if they are not using LSI, Google has likely been using other word relationship technologies for a while, but recently increased its weighting”.
Note that in this case changing the qualifier (“has” to “if”) also changes the categorically asserted facts, which is not a minor thing since it flies against credibility. Thanks, Mike.
Instances of such kind of edits are not new across the Web.
We roast these folks simply because they sell search engine snake oil and lies, often to promote themselves, their peers, or some kind of crap tool or service. We do this through IR knowledge. One of our goals is to warn the ethical sector of the search marketing industry about such pseudo experts.
We will hammer their myths any day of the year, which takes us to another persistent myth about how search engines work: the search exhaustivity myth.
SEOs have this idea that when a user submits a query, the system does an exhaustive search through the entire document collection or index to compute term weights and rank documents according to a particular similarity measure. Evidently these folks do not know how an inverted index works. One of the reasons (there are many) for using inverted indexes is precisely to avoid searching through all the documents present in a collection. “Jumping” through and intersecting posting lists is one of the reasons why search engines return results so fast.
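To make the point concrete, here is a toy Python sketch of posting-list intersection over an inverted index. The index and document IDs are made up; real engines add skip pointers, compression, and many other tricks, but the principle is the same: an AND query only touches the posting lists of the query terms, never the whole collection.

```python
# A toy inverted index: term -> sorted posting list of document IDs.
# All terms and IDs are hypothetical.
index = {
    "inverted": [1, 4, 7, 9],
    "index":    [2, 4, 5, 7, 9, 12],
    "search":   [4, 6, 7, 11],
}

def intersect(p1, p2):
    """Merge-intersect two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(terms):
    # Intersect the shortest lists first to keep intermediate results small.
    lists = sorted((index[t] for t in terms), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

print(and_query(["inverted", "index", "search"]))  # [4, 7]
```

Notice that no document outside the three posting lists is ever examined, which is the whole point of the structure.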
BTW, when we understand how positional inverted indexes work, the benefits of document linearization, a topic we have written on before, become clear.
How an inverted index works is a good topic for IR Watch – The Newsletter.
I’m interested in developing a Porter Stemmer for the Irish language.
Would it be possible to send me your lecture notes for Porter Stemmer
development from your graduate course?
I am doing an MSc thesis on developing a search engine for TEI marked up
multilingual texts and hope to use Apache Lucene as a basis.
Thanks for any help,
UCC, Cork, Ireland.
Thank you for reading this blog and for emailing me, but I normally don’t release lecture notes. However, the lecture was based on Martin Porter’s site, which can be accessed by visiting the following link:
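For readers curious about how such a stemmer looks in code, here is a minimal Python sketch of step 1a only (the plural-stripping rules) as described in Martin Porter's published algorithm. The full algorithm has five steps plus measure conditions on the stem, so treat this strictly as an illustration:

```python
def porter_step_1a(word):
    """Step 1a of Porter's stemming algorithm (plural suffixes only).
    This is a sketch of one step; the complete algorithm has five steps
    and additional conditions. See Martin Porter's site for the full spec."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni, ties -> ti
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Porting the algorithm to Irish would mean replacing these suffix rules (and the measure conditions of the later steps) with ones appropriate to Irish morphology, which is exactly the kind of work the Snowball framework was designed for.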
Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ), who conducts research in LSI and a few other great areas at the intersection of IR, NLP, and AI, wrote me to mention that he is close to finishing his grad thesis. Thanks, David, for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.
There is something funny about SEOs that sell snake oil ( https://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ). They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, Markov chains, LSI, and keyword density are a few examples. Ha, Ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.
The following links provide additional information about AIRWeb and EMNLP
The expression

aij = Lij*Gi*Nj

defines term weights, where aij is the score assigned to term i in document j. Lij is the local weight of term i in document j. Gi is a collection-wide global weight. Nj is a normalization factor over document j, often set to 1. Other values and models for Nj are possible.
There are many definitions for Lij and Gi. One of these consists of setting
Lij = tfij, where tfij is the frequency of term i in document j
Gi = IDFi, where IDFi = log(N/ni) is the inverse document frequency of term i over a collection of N documents and ni is the number of documents containing term i.
Nj = 1
Term weighting then reduces to evaluating
aij = tfij*IDFi = tfij*log(N/ni)
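As an illustration, here is a minimal Python sketch of the classic scheme over a made-up three-document collection (all term frequencies are hypothetical):

```python
import math

# Toy collection: document -> raw term frequencies (made-up numbers).
docs = {
    "d1": {"retrieval": 3, "model": 1},
    "d2": {"retrieval": 1, "ranking": 2},
    "d3": {"model": 2, "ranking": 1},
}
N = len(docs)  # number of documents in the collection

def idf(term):
    # n_i = number of documents containing term i
    n_i = sum(1 for tf in docs.values() if term in tf)
    return math.log(N / n_i)

def tfidf(term, doc):
    # a_ij = tf_ij * log(N / n_i), i.e. L_ij = tf_ij, G_i = IDF_i, N_j = 1
    return docs[doc].get(term, 0) * idf(term)

print(round(tfidf("retrieval", "d1"), 3))  # 3 * log(3/2) ~ 1.216
```

Note that a term absent from a document scores zero regardless of its IDF, since the local weight tfij vanishes.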
This is the so-called classic TFIDF term weight scheme, one of several examined by Salton et al. in 1975.
TFIDF is frequently used to construct a term vector space model. In this model, terms define the dimensions of a vector space called “the term space”. Dimension units are aij scores. Vectors representing documents and queries are embedded in this space and often converted into unit vectors to simplify further matrix analyses.
Terms do not possess positions in the term vector space simply because they are the dimensions of the space. Since dimensions are assumed to be orthogonal, term independence is assumed. Documents and queries are then treated as “bags of words” and represented as vectors in this space.
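A quick Python sketch of the unit-vector conversion mentioned above, using hypothetical aij scores over a three-term space:

```python
import math

vocabulary = ["retrieval", "model", "ranking"]  # the axes of the term space

# A document as a bag of words: only coordinates (a_ij scores), no term order.
doc_vector = [1.2, 0.4, 0.0]  # hypothetical TFIDF scores on each axis

# Convert to a unit vector by dividing by the Euclidean norm.
norm = math.sqrt(sum(x * x for x in doc_vector))
unit_vector = [x / norm for x in doc_vector]

print([round(x, 3) for x in unit_vector])
```

After normalization the vector has length 1, so subsequent dot products between documents reduce directly to cosine similarities.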
What TFIDF Scores Are Not
The TFIDF product, aij = tfij*IDFi, combines two different event spaces: the event space of terms over TF and the event space of documents over IDF. This product can hardly be defined as a term importance score or term relevance judgment, for several reasons. Still, this is what many online outlets like Wikipedia and many search marketing blogs have claimed.
Firstly, let’s consider the contribution of tf to aij. For a given IDF value, a plot of aij vs. tfij presumes proportionality. However, a term i repeated x times in document j is not necessarily x times more pertinent or relevant. Term repetition is not precisely a good indicator of term importance.
Secondly, let’s consider the contribution of IDF to aij. IDF weights do not incorporate relevance information. According to Robertson (1), “we can regard IDF as a simple version of the RSJ weight, applicable when we have no relevance information.”
IDF is a measure of the discriminative power of a term over a collection of documents. Sparck Jones called this “term specificity”. Numerically, it is a log estimate of the inverse probability that a randomly selected document from a collection of N documents contains term i. Neither IDF nor TFIDF is a new buzzword in IR; both have been around since the ’70s.
On Term Importance
Neither IDF nor the TFIDF product, aij = tfij*IDFi, estimates term importance. The importance of a term, a string, a passage, a message, etc. is linked to many things, like its meaning (semantics) and the amount of information it carries (entropy). A TFIDF product evaluates neither.
So far we have considered two event spaces, but we could include other event spaces that introduce other variables affecting term importance; for example, the query space.
For the purpose of keyword-driven services on the Web, other considerations can be added to the “term importance” label. One of these is search volume. Terms with a high search volume or with a good conversion ratio are frequently deemed “important” to keyword-driven services, regardless of whether these are highly discriminatory or “rare”.
Average users rarely search for very rare terms, simply because they are rare. The odds are that very rare terms will exhibit low search volume. Thus, a highly discriminatory term is not necessarily important in this case, and the query is a third, different event space to consider before deeming a term “important”.
It is true that in the TFIDF model, we can take the cosine angle between any two vectors as a similarity measure and rank whatever the vectors represent. However, ranking by similarity in such a way can hardly incorporate semantics or information entropy, despite what some have written about the topic.
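For completeness, here is a minimal Python sketch of cosine-based ranking over hypothetical TFIDF vectors; note that nothing in the computation looks at meaning, only at coordinates:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical TFIDF vectors over a three-term space.
query = [1.0, 0.0, 0.5]
vectors = {
    "doc_a": [0.8, 0.1, 0.4],
    "doc_b": [0.0, 0.9, 0.1],
}

# Rank documents by similarity to the query: pure geometry over
# bag-of-words coordinates, with no semantics or entropy involved.
ranked = sorted(vectors, key=lambda d: -cosine(query, vectors[d]))
print(ranked)  # ['doc_a', 'doc_b']
```

The ranking depends only on the angles between vectors, which is exactly why the measure by itself cannot capture meaning or information content.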
A partial solution to this consists of replacing IDF in the aij expression with an entropy expression. Such models do exist, but they are computationally expensive, and certainly prohibitively so for large-scale Web search engines.
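As one example of such a replacement, the sketch below assumes the log-entropy global weighting often cited in the LSI literature; it is an illustration of the idea, not necessarily what any particular engine uses, and the frequencies are made up:

```python
import math

def log_entropy_weight(tfs):
    """Global entropy weight for one term, given its tf across n documents:
    g_i = 1 + sum_j p_ij * log(p_ij) / log(n), with p_ij = tf_ij / gf_i,
    where gf_i is the term's total frequency over the collection.
    A term concentrated in one document gets weight 1; a term spread
    evenly over all documents gets weight 0."""
    n = len(tfs)
    gf = sum(tfs)
    s = 0.0
    for tf in tfs:
        if tf > 0:
            p = tf / gf
            s += p * math.log(p) / math.log(n)
    return 1 + s

print(round(log_entropy_weight([5, 0, 0, 0]), 3))  # 1.0: concentrated term
print(round(log_entropy_weight([2, 2, 2, 2]), 3))  # 0.0: evenly spread term
```

Computing this weight requires the full distribution of each term over all documents, which hints at why entropy-based schemes are costlier than a simple document count.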
On IDF and Information Entropy
Over the years, several authors including Robertson himself have tried to link IDF to entropy. According to Robertson (1), “We might be tempted to link this interpretation to the IDF formula, by regarding IDF as measuring the amount of information carried by the term. Indeed, something along these lines has been done several times, including by the present author (Robertson, 1974). Some problems with this view of IDF are discussed below.”
We are not going to go over those problems here, but the reader can go to the original source:
Robertson (1) contended:
“The essence of the problem is that it is hard to identify a single event space and probability measure within which all the relevant random variables can be defined. Without such a unifying event space, any process that involves comparing and/or mixing different measures (even a process as simple as adding IDFs for different terms) may well be invalid in terms of a Shannon-like formalism.”
Robertson then expanded on efforts along those lines (1):
“But there is a more serious problem. When we search using weighted terms, we typically take the query terms and assign weights to them, but ignore all the other terms in the vocabulary. It is hard to see how such a practice could make sense in a Shannon-like model: every term in the document must be assumed to carry information as well as any other. That is, the presence of a term ti would carry −log P(ti) amount of information irrespective of whether or not it is in the query. There is nothing to link the amount of information to the specific query. So we would have no justification for leaving it out of the calculation. Nevertheless, the similarity of the usual IDF formulation to a component of entropy has stimulated other researchers to try to make connections, sometimes somewhat differently from the link suggested above.”
In the rest of the paper Robertson mentioned by name authors that proposed such efforts.
1. Robertson, S. Understanding Inverse Document Frequency: On theoretical arguments for IDF. Journal of Documentation, Volume 60, Number 5, 2004, pp. 503–520.
This post is a continuation of a previous one on the topic of SEO nonsense in relation to inverse document frequency (IDF). IDF has had a long-standing presence in IR since its introduction back in 1972 by the late Karen Sparck Jones. Since then the model has been thoroughly researched and incorporated into IR models (Salton’s tf-idf, RSJ, and BM25 models).
SEO myths and misconceptions in connection with IDF are featured in the current issue of the IRW newsletter.
It appears that repeating SEO crap across the blogosphere makes some believe they have become experts on the subject. These pseudo teachers should have researched the topic before making dumb claims, repeating their peers’ hearsay, or coming up with definitions made out of thin air. Nothing new coming from SEOs. Such practices are almost their trademark.
For instance, Aaron Wall has incorrectly defined IDF as follows:
“Inverse Document Frequency is a term used to help determine the position of a term in a vector space model.”
As usual, other SEOs have repeated such misinformation like parrots. It is not the first time. It reminds me of Wall’s claims about LSI, a topic he wrote extensively on until his ignorance of the topic was exposed in several blogs that discuss IR. He was not alone. Andy Beal, Mike Marshall, and a few other vocal SEOs have claimed to know about LSI or to have used LSI in SEO work. Really?
But this post is not about LSI, but IDF which is a topic equally misunderstood by SEOs. So, let us debunk their claims. As stated in IRW-2008-06:
Salton et al. proposed the vector space model in 1975 in the paper A Vector Space Model for Automatic Indexing (15). In that paper, several schemes for scoring term weights were proposed. One of these consisted of combining term frequency (tf) with IDF. Over the years, a family of tf-IDF models has been proposed. Obviously, these are predated by the IDF model of 1972.
In Salton’s vector space model documents are represented as vectors. A query is represented just as another document. Vectors are projected in a vector space whose dimensions are terms. The units of those dimensions are weights. The coordinates associated with a point (or a vector) in that space are computed according to a scoring model. Terms cannot have positions in this vector space because they are the dimensions of the space. It is that simple.
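A tiny Python sketch may help drive the point home: terms serve only as axis indices, while documents and the query are the vectors. All weights here are hypothetical.

```python
# Terms are the axes of the space, not points in it: each term is just
# a coordinate index. Documents (and the query) are vectors of weights.
vocabulary = ["retrieval", "model", "ranking"]
axis = {term: i for i, term in enumerate(vocabulary)}

def to_vector(bag):
    """Map a bag of (term, weight) pairs onto the fixed axes of the space."""
    v = [0.0] * len(vocabulary)
    for term, weight in bag.items():
        v[axis[term]] = weight
    return v

# The query is represented just as another document.
doc = to_vector({"retrieval": 1.2, "ranking": 0.4})
query = to_vector({"retrieval": 1.0})

print(doc)    # [1.2, 0.0, 0.4]
print(query)  # [1.0, 0.0, 0.0]
```

Asking for the “position of a term” in this space is a category error: a term is an axis, and only documents and queries have coordinates along it.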
Despite the fact that IDF has been around since 1972 and tf-IDF since 1975, some search marketers, like those repeating Andy Edmonds’s claims, are saying that IDF or tf-IDF is a “new” buzzword in the IR field. WOW! IDF and tf*IDF are “new” buzzwords in IR circles. Really?