This post continues a previous one on SEO nonsense about inverse document frequency (IDF). IDF has had a long-standing presence in IR since its introduction back in 1972 by the late Karen Sparck Jones. Since then, the model has been thoroughly researched and incorporated into IR models (Salton’s tf-idf, RSJ, and BM25 models).
SEO myths and misconceptions in connection with IDF are featured in the current issue of the IRW newsletter.
It appears that repeating SEO crap across the blogosphere makes some become experts on the subject. These pseudo teachers should have researched the topic before making dumb claims, repeating their peers’ hearsay, or coming up with definitions made out of thin air. Nothing new coming from SEOs. Such practices are almost their trademarks.
For instance, Aaron Wall has incorrectly defined IDF as follows:
“Inverse Document Frequency is a term used to help determine the position of a term in a vector space model.”
As usual, other SEOs have repeated such misinformation like parrots. It is not the first time. It reminds me of Wall’s claims about LSI, a topic he wrote extensively on until his ignorance about the topic was exposed in several blogs that discuss IR. He was not alone. Andy Beal, Mike Marshall, and a few other vocal SEOs have claimed to know about LSI or to have used LSI in SEO work. Really?
This post is not about LSI but about IDF, which is a topic equally misunderstood by SEOs. So, let us debunk their claims. As stated in IRW-2008-06:
Salton et al. proposed the vector space model in 1975 in the paper A Vector Space Model for Automatic Indexing (15). In that paper several schemes for scoring term weights were proposed. One of these consisted in combining term frequency (tf) with IDF. Over the years, a family of tf-IDF models has been proposed. Obviously, these are predated by the IDF model of 1972.
In Salton’s vector space model, documents are represented as vectors. A query is represented as just another document. Vectors are projected in a vector space whose dimensions are terms. The units of those dimensions are weights. Coordinates associated with a point (or a vector) in that space are computed according to a scoring model. Terms cannot have positions in this vector space because they are the dimensions of the space. It is that simple.
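To make the point above concrete, here is a minimal sketch of documents and a query represented as vectors whose dimensions are terms. The toy texts, the `to_vector` helper, and the use of raw term counts as weights are assumptions for illustration only; any scoring model could supply the weights instead.

```python
from collections import Counter

def to_vector(text, dimensions):
    """Return one coordinate (here, a raw term count) per term dimension."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in dimensions]

# Hypothetical toy collection and query.
docs = {
    "d1": "cheap flights cheap hotels",
    "d2": "hotels near the beach",
}
query = "cheap hotels"

# The terms are the dimensions (axes) of the space, not points in it.
all_texts = list(docs.values()) + [query]
dimensions = sorted({term for text in all_texts for term in text.lower().split()})

# Documents and the query are the points; the query is just another document.
doc_vectors = {name: to_vector(text, dimensions) for name, text in docs.items()}
query_vector = to_vector(query, dimensions)

print(dimensions)
for name, vector in doc_vectors.items():
    print(name, vector)
print("query", query_vector)
```

Note that only documents and the query receive coordinates. A term such as "cheap" is an axis, so asking for its "position" in this space has no meaning.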
Despite the fact that IDF has been around since 1972 and tf-IDF since 1975, some search marketers like those that repeat Andy Edmonds’s claims are saying that IDF or tf-IDF is a “new” buzzword in the IR field. WOW! IDF and tf*IDF is a “new” buzzword in IR circles. Really?
Others have claimed that it is not possible to evaluate the IDF of a phrase. Even some who plan to teach IR have claimed that calling log(N/n) “inverse document frequency” is an “insult to students”. Before making fools of themselves, they should read the legacy papers of Robertson and Sparck Jones on the topic.
Sorry to sound harsh, but I wonder what kind of crap all these pseudo teachers are lecturing while sitting in the dark of their empty classrooms and forums.
Did search engines use IDF? Yes, absolutely.
Do all search engines use IDF? No, absolutely not.
Do I think X search engine currently uses IDF? I cannot speculate what X is doing simply because I don’t work at X.
Do I use IDF? Yes, in the experimental search engine my students are building and researching.
What are the drawbacks of IDF? Several. Its stability as N gets larger is an ongoing research topic.
Before commenting on IDF, SEOs please don’t lose credibility and do your homework. Start here.
New Research on the topic: http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/
Dr. Garcia,
First of all, thanks for your work communicating about the challenges of search in a very detailed way.
Since you first referenced my post on TF-IDF, I’ve read your writings on it, and realized that I did not speak especially clearly in my post. It’s been on my list to revise it, while preserving the original, to make my message more clear.
I tried to provide direct feedback through several channels — but only received email bounces. It’s a tough email world with spam, I get it.
I’ll address your original post once I’ve rewritten it to make my points more clear, but it’s obvious you didn’t do your homework on me.
Worse than that, your reference to me saying TF-IDF is new is entirely inaccurate. In no way did I suggest this is a new concept. My quote was “The buzzword in IR”, not the new buzzword. I was reviewing a new website doing some interesting word frequency analyses.
Inaccurate and shallow reading like this makes me doubt the quality of all your writings.
Best Regards,
Andy Edmonds
I hope you do.
IDF and tf-idf are still not buzzwords in IR. Good try.
Sorry to hear that. I might have to reciprocate your feelings about all your writings as well.
It appears that some are getting their knowledge from Wikipedia, which does not always give accurate definitions.
For instance, Wikipedia (http://en.wikipedia.org/wiki/Tf-idf) incorrectly describes tf-idf as a measure used to evaluate how important a word is to a document in a collection.
“The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document.”
It is not true that tf*idf is used to evaluate term importance. A term weight score does not equate to term importance. Personally, I don’t know of any current IR colleague who claims such a thing.
The other problem with Wikipedia’s definition is that it assumes a function f of the form f(term importance) = tf*idf and therefore that “the importance increases proportionally to the number of times a word appears in the document.”
Obviously, this is an incorrect definition. The importance of a term does not necessarily increase proportionally to the number of times a word appears in a document. A term repeated x times is not x times more important. Similarly, a document repeating a term x times is not x times more pertinent to the term.
In addition, taking the very same document and placing it in different collections, might change its tf*idf weight in the collection, but the importance of a term to a document remains the same, contradicting the proportionality assumption between term importance and the number of times a word appears in the document.
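The collection effect described above can be sketched in a few lines. The two toy collections, the helper names, and the base-10 logarithm are assumptions made for illustration; the document itself is identical in both collections.

```python
import math

def idf(term, collection):
    """log10(N / n), where n is the number of documents containing the term."""
    N = len(collection)
    n = sum(1 for doc in collection if term in doc)
    return math.log10(N / n)

def tf_idf(term, doc, collection):
    """Raw term count in the document times the collection-level IDF."""
    return doc.count(term) * idf(term, collection)

# The very same document, tokenized (hypothetical content).
doc = ["idf", "is", "old", "idf"]

# Collection A: "idf" is rare. Collection B: every document mentions "idf".
collection_a = [doc, ["seo", "myths"], ["vector", "space"], ["term", "weights"]]
collection_b = [doc, ["idf", "myths"], ["idf", "space"], ["idf", "weights"]]

weight_a = tf_idf("idf", doc, collection_a)
weight_b = tf_idf("idf", doc, collection_b)

print(weight_a)  # positive: the term discriminates in collection A
print(weight_b)  # zero: the term appears in every document of collection B
```

The term count inside the document never changed, yet the weight went from positive to zero. Whatever the weight measures, it is not a property of the document alone.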
If that crap is what SEOs are teaching to their peers, they are messengers of misinformation.
I hate to be disagreeable, Dr. Garcia, but a “buzzword” is exactly what tf-IDF is. Here’s a definition offered by Google when I searched for buzzword.
“A word or phrase connected with a specialized field or group that usually sounds important or technical and is used primarily to impress laypersons.”
Isn’t that exactly what you’ve been talking about? And isn’t it, in fact, quite accurate?
Hi, Dan:
Thank you for stopping by. The last time we exchanged a few words was back at SES NY 2005, I believe. I hope you are doing well.
Dan, you are more than welcome to disagree.
Sorry, Dan. I understand where you are coming from, but I have to disagree with you as well. Referring to IDF as a “buzzword” should be avoided. Why?
Buzzword is often used in reference to something that is new and obscure. He might not have literally used the qualifier, but by using “buzzword” the message that came across was one of presenting tf-idf as something new and obscure.
Dan, since you want to go with the “definition defense”, a definition is only as good and accurate as whoever proposes it, Google in this case.
The Merriam-Webster dictionary (http://www.merriam-webster.com/dictionary/buzzword) lists this for “buzzword” (emphasis added to stress intention):
1: an important-sounding usually technical word or phrase often of little meaning used chiefly to impress laymen
2: a voguish word or phrase —called also buzz phrase
tf-idf stands for term frequency*inverse document frequency and has a clear meaning. It is not of little meaning used chiefly to impress, nor is it a voguish word or phrase. The scoring model it defines is quite simple:
tf = number of times a term appears in a document
idf = log(N/ni)
where N is the collection size and ni is the number of documents mentioning query term i. It is not possible to compute IDF or tf*idf without knowing the number of documents N in the collection. If term independence is assumed, any computed estimate of N, IDF, and tf*idf can be meaningless, especially with a dumb tool that does not use large-scale web collections.
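As a worked illustration of the two formulas above, the sketch below computes tf*idf for the same term counts under two different values of N. The numbers are hypothetical and the base-10 logarithm is an assumption, since the formula above does not fix a base.

```python
import math

def idf(N, n_i):
    """idf = log10(N / n_i); requires the collection size N to be known."""
    if n_i <= 0 or n_i > N:
        raise ValueError("need 0 < n_i <= N")
    return math.log10(N / n_i)

def tf_idf(tf, N, n_i):
    """Term frequency in the document times the inverse document frequency."""
    return tf * idf(N, n_i)

# Hypothetical counts: the term appears 3 times in the document
# and is mentioned by 10 documents in the collection.
tf, n_i = 3, 10

weight_small = tf_idf(tf, 1_000, n_i)      # collection of one thousand documents
weight_large = tf_idf(tf, 1_000_000, n_i)  # collection of one million documents

print(weight_small, weight_large)
```

With tf and ni held fixed, the weight depends entirely on the value supplied for N, which is why a guessed or badly estimated N makes the resulting scores untrustworthy.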
Wikipedia has this line about “buzzword” (emphasis added to stress intention)
Buzzwords differ from jargon in that they have the function of impressing or of obscuring meaning, while jargon (ideally) has a well-defined technical meaning, if only to specialists. However, the hype surrounding new technologies often turns technical terms into buzzwords.
Wikipedia also lists these “Reasons for using buzzwords” (emphasis added to stress intention)
Reasons for using buzzwords
With any stipulative neologism, such as “quark,” to describe new concepts, without the danger of over-simplification and confusion that can arise from using words and phrases with previously established, commonplace meanings.
To control thought by being intentionally vague. In management, stating organizational goals by using words with unclear meanings but positive connotations prevents anybody from questioning the directions and intentions of these decisions, especially if many such words are used.[2] (See also newspeak.)
To boost creativity among listeners by compelling them to think of the applications and particulars on their own.
To make something trivial seem to have greater import and stature.
To impress a judge or examiner by seeming familiar with a theory or principle by dint of mere name-dropping, as with “cognitive dissonance” or the “Heisenberg Uncertainty Principle.”
To provide a camouflage for saying nothing in particular.
Indeed, none of the above “definitions” applies to IDF or tf-idf. Quite honestly, “buzzword” is an insulting and bad descriptor when used by someone who pretends to be a teacher, which is why your apparent “definition defense” cannot be sustained.
Let’s discuss now the merits of the tf-idf tool reviewed.
For those interested, here is the link to Wikipedia: http://en.wikipedia.org/wiki/Buzzword
One more thing, for definition lovers and pseudo teachers out there. This by virtue of Wikipedia and the previous definitions:
The statement that “the buzzword in IR is TFIDF” is by all means false, inaccurate, and shallow. In IR tf-idf is not a buzzword or “the buzzword”, nor is it obscure, a new concept, trivial, a camouflage for saying nothing in particular, used to impress, etc. In IR we know exactly what tf-idf means, stands for, and what it does.
Having expended several hundred of your own words so far arguing the case for “slightly inaccurate word use” by Mr. Edmonds, you haven’t persuaded me that his comments are “false, inaccurate, and shallow.”
Some folks seem to have a need to be “right” all the time. You seem to be such a person, as I’ve watched you change your methods over the years from simply illuminating the subject to going after people.
This time, you’ve ignored the actual content of Andy Edmonds’ post, over the use of the word “buzzword.”
Which, I still contend, is exactly what tf-idf is when it’s “dropped” by SEOs who want to give the impression that they have some sort of secret sauce. LSI is also tossed out as a buzzword.
Frankly, I think you would agree with this, if you weren’t trying so hard to be “right.”
What you’re doing here is often interesting, often useful, often helpful. But you’re not always right. Should I expend a thousand words dissecting your “false, inaccurate, and shallow” characterization of TheRarestWords.com as a “tf-idf tool,” or stick to the subject?
If you’re going to go after people, Dr. Garcia, please don’t lose credibility and do your homework. Start here.
Hey, Dan; this is simple.
He has stated that TFIDF is a buzzword in IR, and I contend that it is not and explained why. So far neither he nor you have shown any evidence that it is a buzzword in IR.
In your previous post it was you who raised the “definition defense”, and I simply showed you why that defense cannot be sustained for tf-IDF based on the merits of those definitions. That is not going after anyone.
Now you are switching to TFIDF as a buzzword in SEO circles, which is a different topic altogether. I have to agree with you on that one. But again, that is a different scenario. If Edmonds wanted to say “the buzzword in SEO”, why not say it?
After that switching defense, you raise the “motivation defense”. Good try, and equally wrong about me.
Sorry you are now taking things personally. If I contend that some of the folks within the SEO sphere are incorrect in some assessments, or expose some of these guys for selling false claims, misrepresentation, and snake oil, and you don’t like it, I am sorry for you.
I am not always right, by the way, nor do I pretend to be always right. Take that accusation you know where…
It is up to you to expend that. You tell me.
As mentioned before, let’s discuss the merits of the tool. Shall we?
Doctor, there’s nothing to “defend” – your “nitpick attack” is without merit. I was just pointing out that it’s as easy to find fault with your own choice of words, and even easier to question your motivation. Readers will easily see that this is so without me taking it any further.
So, shall we discuss the merits of the tool? Or if you want to pick something apart, we could take a look at the “auto-SE-wordizer” for fun:
http://rarestblog.com/2008/05/auto-sewordizer-automatic-search-engines-words-optimizer/
In http://alwaysbetesting.com/abtest/index.cfm/2008/5/24/Term-Frequency-Inverse-Document-Frequency-TDIDF-Exploring-TheRarestWordscom
Edmonds wrote:
“The buzzword in IR is TFIDF, or term frequency inverse document frequency. This is a method for giving more importance to the less common words in a document that match the query. Mid-range frequency words get discounted, but they’re likely key terms, if the page is truly relevant, and often repeated.”
If you read his first post here, he did not clarify anything but reaffirmed what he said: “The buzzword in IR is TFIDF”.
There is a difference between buzzword and jargon, and between IR and SEO.
Then along you came here. I have seen you do this before to promote yourself or your associates.
First you dropped by with a definition defense/diversion. Good try.
That didn’t work. So you switched to a different scenario (SEO buzzwords), then to accusations…
Talking about nitpick attacks, I think the nitpicker is you after all. Get a 4th of July life!
Now on the substantive part:
Their rare parser looks like a wanderer/sampler matching certain words from documents, nothing of a novelty. How does TFIDF play into the picture, and what is the document collection size used to compute IDF, if it is used at all?
I have seen many of these tools coming, going, and eventually ignored, often because they add little or zero value to the bottom line of a business.
Still, I want to be fair. Please go ahead and start discussing the merits of the tool here.
Wow. I came back from my two-day Fourth of July vacation and my inbox has all sorts of dizzying SEO comments that add no value to the discussion.
For your information sir, IR Thoughts is a blog wherein we comment on IR and search engines and debunk search marketing myths like the many promoted by SEOs. We dissect these through IR knowledge, not hearsay.
We didn’t go after you, Dan. You came here voluntarily, with a spaghetti-like defense, throwing all sorts of false statements to see what sticks to the wall. Are you still bleeding from NY SES 2005?
Refluxing words just shows you are short of your very own. To waste your will over the beautiful Fourth of July weekend in such manner might suggest that you are a very lonely, depressive man. You need a vacation.
Since your comments add no value to the discussion, you lost your chance, so you and your friends are out of here. BTW, more entertaining than chasing rare terms most users don’t care to search about is reading Understanding TFIDF.
What an interesting thread.
I like your style Dr. Garcia.
When going through the series of posts these things struck me.
1). andyed Says:
July 3, 2008 at 7:33 pm
“Worse than that, your reference to me saying TF-IDF is new is entirely inaccurate. In no way did I suggest this is a new concept. My quote was “The buzzword in IR”, not the new buzzword.”
My eye was caught on the last sentence.
“I was reviewing a new website doing some interesting word frequency analyses.”
As a 12+ year practitioner of SEO, it never ceases to amaze me what some fertile minds can come up with to inflate their own value.
Let me put this in context as I interpreted it.
“I was reviewing a new website doing some interesting word frequency analyses”
And just HOW were you doing this analysis andyed? Using TF-IDF as you imply?
Are you privy to Google’s algorithms?
Using any other technology is simply a waste of the client’s money.
I am not saying KW analysis is a waste of time, but having to dissect each (new) client’s site and weigh the value of the current composition is ludicrous.
ANY practitioner of SEO worth their salt knows how to build the site properly from a results driven “template”. Stop trying to “game” the engines.
2). danthies Says:
July 4, 2008 at 1:21 am
I have to agree with danthies.
andyed is using TF-IDF as “A word or phrase connected with a specialized field or group that usually sounds important or technical and is used primarily to impress laypersons.” With emphasis on “usually sounds important or technical and is used primarily to impress laypersons”.
The more you can impress, the more you can charge. Build the BS mystique.
3). E. Garcia Says:
July 4, 2008 at 11:37 am
In IR we know exactly what tf-idf means, stands for, and what it does.
Of course, but the general population does not. And the article is for the general population.
4). E. Garcia Says:
July 4, 2008 at 2:15 pm
“He has stated that TFIDF is a buzzword in IR and I contend that is not and explained why.”
Of course he would. To have it so would imply more importance to his use of the term for the general market.
5). E. Garcia Says:
July 4, 2008 at 10:06 pm
“Still, I want to be fair. Please go ahead and start discussing the merits of the tool here.”
I used TheRarestWords.com on a couple of my sites and learned nothing useful.
Results were as expected.
6). Going back to your article on understanding TF-IDF I agree strongly that “IDF as the TFIDF product, aij = tfij*IDFi, does not estimate term importance either. The importance of a term, a string, a passage, a message, etc is linked to many things like its meaning (semantics) and amount of information carried (entropy). A TF-IDF product does not evaluate either one.”
The semantics are the key.
Even if TF-IDF were to be the end-all and be-all of evaluating keyword and keyword phrases over the scope of the database, it still means nothing without the exact vectors used by the search engine(s). Using it as a buzzword is simply hyperbole.
Writing for the search engines is not rocket science.
Writing the search engines is.
Reg Charie.
Hi, Reg Charie:
Thank you for stopping by.
Indeed. And to compute IDF, thus tf-IDF, one must know N, the size of the entire collection of a search engine.
Over the years some IRs have tried to estimate N by resorting to the term independence assumption. The results are estimated N values that vary so wildly that they cannot be trusted, at least not with Web search engines.
The term independence assumption is the source of all kinds of inconsistencies in IR scoring functions, including vector space models. When one thinks about it thoroughly, the notion of term specificity itself is inherently divorced from term independence.
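One way to see why such estimates wander is the sketch below. It assumes a particular estimator, N ≈ (n_a * n_b) / n_ab, which follows from treating two terms a and b as independent so that n_ab/N = (n_a/N)(n_b/N). That estimator, the toy collection, and the helper names are illustrative assumptions, not a description of any specific published method.

```python
def doc_count(docs, *terms):
    """Number of documents containing all of the given terms."""
    return sum(1 for doc in docs if all(term in doc for term in terms))

def estimate_n(docs, a, b):
    """Estimate collection size from two terms, assuming they are independent."""
    n_ab = doc_count(docs, a, b)
    if n_ab == 0:
        return None  # no co-occurrence, so the estimator is undefined
    return doc_count(docs, a) * doc_count(docs, b) / n_ab

# Hypothetical collection of 8 documents, each a set of terms.
# Terms are deliberately correlated, as they are in real text.
docs = [
    {"seo", "idf"},
    {"seo", "idf"},
    {"seo", "idf", "lsi"},
    {"seo"},
    {"lsi"},
    {"vector", "lsi"},
    {"vector"},
    {"vector", "seo"},
]

print("true N:", len(docs))
for a, b in [("seo", "idf"), ("seo", "lsi"), ("lsi", "vector")]:
    print(a, b, estimate_n(docs, a, b))
```

In this toy case the three term pairs yield estimates of 5, 15, and 9 for a collection whose true size is 8. Because real terms co-occur more or less often than independence predicts, the estimate depends on which pair is chosen.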
On other matters, here are some links to posts somewhat related to this thread.
http://irthoughts.wordpress.com/2008/07/21/seos-and-their-exhaustivity-search-myths/
http://irthoughts.wordpress.com/2008/07/14/claps-and-slaps/
http://irthoughts.wordpress.com/2008/07/07/understanding-tfidf/
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/2007/07/19/seos-and-still-their-lsi-misconceptions/
http://irthoughts.wordpress.com/2007/05/03/latest-seo-incoherences-lsi/
Thanks Dr. Garcia.
You have given me untold hours of interesting reading.
Reg