This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/
That said, today’s post is in reaction to the article at
The title of that article sends the message that search engines work by using tf*IDF. In addition, IDF itself is described incorrectly. It is not clear from the article how vector space models work or how search engines use them. The author then seems to agree with a few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.
Blockquoted passages and my comments are next.
Another way we can assess the relevance of a document is by term weighting.
From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.
But klebenleiben (“the reluctance to stop talking about a certain subject” …) would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.
The measure is called inverse document frequency.
This measure is our weight; it is what we use to judge the relevance of a document with.
In grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. Most likely, no current top search engine uses plain tf, IDF, or tf*IDF to rank docs. So how does IDF work?
Let N be the number of docs in a collection and n be the number of docs containing a given term. The probability of randomly choosing a doc containing a given term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as
IDF = log(1/p) = log(N/n)
Logs at a given base (often 2 or 10) are used.
Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.
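As a minimal sketch of the definition above, the following Python function computes IDF = log(N/n). The function name, the base-10 default, and the collection size of one million docs are illustrative assumptions; only the figure of 18 docs comes from the quoted passage.

```python
import math

def idf(total_docs: int, docs_with_term: int, base: float = 10.0) -> float:
    """IDF = log(N/n): N docs in the collection, n docs containing the term."""
    if not 0 < docs_with_term <= total_docs:
        raise ValueError("need 0 < n <= N")
    return math.log(total_docs / docs_with_term, base)

N = 1_000_000  # hypothetical collection size

# A term found in almost every doc has little discriminatory power.
print(idf(N, 950_000))

# A term found in only 18 docs has a much larger IDF.
print(idf(N, 18))
```

Changing the base only rescales every IDF value by the same constant factor, so the relative ordering of terms is unaffected.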
Frequently used terms have little discriminatory power, regardless of whether they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF), regardless of whether they are relevant to the topic of the document mentioning them. Term relevancy and the discriminatory power of a term do not always run “side by side”. Sometimes they do, though.
The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.
For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” probably would be relevant to the doc and be discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the “job” term is still relevant and meaningful to the document, but more likely will lose its discriminatory power within the collection. And we haven’t yet considered how relevant “job” or the documents containing the term are to end-users looking for jobs.
IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.
For instance, we can use probabilistic IDF (IDFP) by considering the odds (p/(1 – p)) instead of just p. Inverting and taking logs,
IDFP = log((N – n)/n)
If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.
Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html
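A minimal sketch of IDFP, with the optional re-zeroing of negative weights, might look like this. The function name and the `rezero` flag are hypothetical; this is not how MySQL or Lucene implement it, only an illustration of the formula discussed here.

```python
import math

def idfp(total_docs: int, docs_with_term: int,
         base: float = 10.0, rezero: bool = True) -> float:
    """IDFP = log((N - n)/n), optionally clamping negative weights to zero."""
    if not 0 < docs_with_term <= total_docs:
        raise ValueError("need 0 < n <= N")
    inverse_odds = (total_docs - docs_with_term) / docs_with_term
    if inverse_odds == 0:
        # The term is in every doc: log(0) is undefined, treat as -infinity.
        weight = float("-inf")
    else:
        weight = math.log(inverse_odds, base)
    # Re-zeroing makes "negative terms" behave like stopwords.
    return max(weight, 0.0) if rezero else weight

print(idfp(1000, 10))                  # rare term: positive weight
print(idfp(1000, 500))                 # n = N/2: zero weight
print(idfp(1000, 900, rezero=False))   # n > N/2: negative weight
print(idfp(1000, 900))                 # same term after re-zeroing
```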
There are ways of defining global weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.
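To make the entropy idea concrete, here is a sketch of one common formulation of an entropy-based global weight, G = 1 + Σ p_j log(p_j) / log(N), where p_j is the share of a term's total occurrences that falls in doc j. The choice of this particular formulation and the function name are assumptions; the text above does not specify which entropy weight is meant.

```python
import math

def entropy_weight(term_counts_per_doc):
    """Entropy-based global weight of one term.

    term_counts_per_doc holds the term's frequency in every doc of the
    collection (zeros included), so its length is N.
    """
    n_docs = len(term_counts_per_doc)
    total = sum(term_counts_per_doc)
    if n_docs < 2 or total == 0:
        raise ValueError("need at least 2 docs and 1 occurrence")
    acc = 0.0
    for f in term_counts_per_doc:
        if f > 0:
            p = f / total
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n_docs)

# Both terms occur in all four docs, so plain IDF gives both log(4/4) = 0.
evenly_spread = [25, 25, 25, 25]
concentrated = [97, 1, 1, 1]
print(entropy_weight(evenly_spread))
print(entropy_weight(concentrated))
```

The two hypothetical terms are indistinguishable under IDF, but the entropy weight gives more credit to the term whose occurrences are concentrated in one doc. The price is that per-doc frequencies are needed, not just doc counts.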
Term Frequency Times
We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.
We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.
In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.
Like global weights, there are dozens of ways we can define local weights, L. In the original vector space model, L = f was used, where f is the frequency (occurrence) of a term in a doc.
This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation to use?
The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that attenuate frequencies are better choices.
L = f/fmax
L = 1 + log(f)
L = log(1 + f)
etc, etc, etc.
Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.
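The local weight models listed above can be compared side by side with a short sketch. The function names are hypothetical, and the natural log is an arbitrary choice of base.

```python
import math

def local_raw(f):            # L = f, no attenuation
    return float(f)

def local_binary(f):         # L = 1 if present, else 0
    return 1.0 if f > 0 else 0.0

def local_max_norm(f, f_max):  # L = f/fmax
    return f / f_max if f_max > 0 else 0.0

def local_one_plus_log(f):   # L = 1 + log(f), defined as 0 when f = 0
    return 1.0 + math.log(f) if f > 0 else 0.0

def local_log_one_plus(f):   # L = log(1 + f)
    return math.log(1 + f)

# Repeating a word 100 times multiplies the raw weight by 100,
# but the log-based weights grow far more slowly.
for f in (1, 10, 100):
    print(f, local_raw(f), local_binary(f),
          round(local_one_plus_log(f), 3), round(local_log_one_plus(f), 3))
```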
There are many ways of defining L and G, not to mention variants of document normalization weights N. These then give a product weight:
W = L*G*N
A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of long and short documents.
As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.
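The following sketch shows one possible W = L*G*N scheme, assuming L = log(1 + f), G = log(n_docs/df), and a normalization factor that scales each doc vector to unit length. These choices, the whitespace tokenizer, and the sample docs are all illustrative assumptions, not a description of any particular engine.

```python
import math
from collections import Counter

def weighted_vectors(docs):
    """Return (vocabulary, unit-length weight vectors), one vector per doc."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Local weight times global weight for every term in the vocabulary.
        raw = [math.log(1 + counts[t]) * math.log(n_docs / df[t]) for t in vocab]
        length = math.sqrt(sum(w * w for w in raw))
        norm = 1.0 / length if length > 0 else 0.0  # normalization weight
        vectors.append([w * norm for w in raw])
    return vocab, vectors

vocab, vectors = weighted_vectors(
    ["job search tips", "job listings job board", "cooking tips"]
)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

Swapping in a different L or G only changes the line that computes `raw`, which is the sense in which tf*IDF is just one particular case.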
Documents as Vectors
For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.
Queries as Vectors
Every word in a query can also be shown as a vector.
By looking at documents that are “near” our query we can rank (sort) documents in our result set.
In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.
However, in LSI every word of a doc can be represented as a vector, but this is not what the referred article discusses.
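A small sketch may help with the "one vector per doc, one vector per query" point. Here the term space has three dimensions, each doc is a single vector in that space, and docs are ranked by cosine similarity to the single query vector. The terms and all weights are made-up values for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors in the same term space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# One dimension per term (hypothetical terms).
terms = ["job", "salary", "recipe"]

# One vector per doc, however many words the doc contains.
docs = {
    "d1": [0.9, 0.4, 0.0],
    "d2": [0.1, 0.0, 0.95],
    "d3": [0.5, 0.5, 0.1],
}

# One vector for the whole query "job salary".
query = [1.0, 1.0, 0.0]

ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranking)
```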
Blockquoted Passage (from one commenter)
It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.
Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.
There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to form an addressable “book” consisting of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. We can put different doc features in the posting lists, like f values, word positions, word spacing, in-title, etc.
The index can be computed and already be in cache before any query. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. For each term, IDFs are already precomputed in advance. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over the entire collection).
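The inverted index described above can be sketched as a dictionary that maps each term to its posting list, with IDFs computed once at indexing time. The data layout, function names, and sample docs are assumptions for illustration; real engines store far richer posting lists.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns (postings, idf).

    postings maps term -> {doc_id: frequency f of the term in that doc}.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    n_docs = len(docs)
    # IDFs are computed here, before any query arrives.
    idf = {term: math.log(n_docs / len(plist)) for term, plist in postings.items()}
    return dict(postings), idf

def candidates(query, postings):
    """Address only the posting lists of matched terms; no full scan."""
    matched = set()
    for term in query.lower().split():
        matched.update(postings.get(term, {}))
    return matched

docs = {1: "job search", 2: "job board job", 3: "cooking tips"}
postings, idf = build_inverted_index(docs)
print(postings["job"])
print(candidates("cooking", postings))
```

A query for "cooking" touches a single posting list and never looks at docs 1 and 2, which is the point of addressing rather than searching.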
From matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note these can be a predetermined number of docs from the index. Thus, if a million docs are matched by the posting list(s), only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.
For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resort to precaching strategies, so a similar analysis can be done offline too (e.g., from a pool of frequently queried terms, the so-called suggestion lists). We can also use prebuilt thesauri to find similar docs, impacting precision and recall. Processes can be called by geolocation to serve a specific region or regional directory, etc.
For rather small collections, the term-doc matrix can be built from all terms in the collection stored on disk, or small term-doc matrices can be constructed in advance from each posting list. All of this is done offline, before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, and the results are postprocessed, ranked, and presented to the end user in a fraction of a second.
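The query-time step can be sketched as follows. Posting lists, IDFs, and doc vector lengths are prepared offline; at query time only the posting lists of the query terms are read, dot products are accumulated per doc, and the top N docs are returned. The posting lists, the weighting scheme (log(1 + f) times IDF for docs, IDF alone for query terms), and all names are assumptions for illustration.

```python
import heapq
import math

# Hypothetical posting lists: term -> {doc_id: term frequency f}.
POSTINGS = {
    "job": {"d1": 1, "d2": 2},
    "board": {"d2": 1},
    "search": {"d1": 1},
    "cooking": {"d3": 1},
    "tips": {"d3": 1},
}
N_DOCS = 3

# Everything below this line is computed offline, before any query.
IDF = {term: math.log(N_DOCS / len(plist)) for term, plist in POSTINGS.items()}

def doc_weight(term, f):
    return math.log(1 + f) * IDF[term]

DOC_NORM = {}
for term, plist in POSTINGS.items():
    for doc_id, f in plist.items():
        DOC_NORM[doc_id] = DOC_NORM.get(doc_id, 0.0) + doc_weight(term, f) ** 2
DOC_NORM = {doc_id: math.sqrt(s) for doc_id, s in DOC_NORM.items()}

def top_n(query, n=10):
    """Score only docs found in the matched posting lists; return the best n."""
    scores = {}
    for term in set(query.lower().split()):
        for doc_id, f in POSTINGS.get(term, {}).items():
            # Accumulate query weight times doc weight (a dot product).
            scores[doc_id] = scores.get(doc_id, 0.0) + IDF[term] * doc_weight(term, f)
    # The query vector length is the same for every doc, so it is omitted.
    ranked = ((s / DOC_NORM[d], d) for d, s in scores.items() if DOC_NORM[d] > 0)
    return [d for _, d in heapq.nlargest(n, ranked)]

print(top_n("job board"))
```

Accumulating scores term by term is equivalent to multiplying the query vector against the columns of a term-doc matrix restricted to the matched docs; doc d3 is never scored for this query.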
To boldly suggest that vector models are not used by top search engines for ranking docs is plain nonsense. Still, neither link weights alone nor vector similarity scores alone are enough. Similarity scores can be combined with link weights or with other analytics to get a final score.
Combining those scores does not simplify computation; it adds another layer of complexity. Not combining them, however, can leave out meaningful docs and queries.
Since SEOs love to quote each other and I love to quote IRs, let’s have a happy medium and quote both:
One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html which incidentally mentions tf and IDF.
Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explained in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustively searching the collection.
Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.
Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.
Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.
Why agree with SEO hearsay? See my previous comments.
Also, depending on how an index was constructed, a query does not need to “travel” an entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhaustive searches. That’s why it is known as an “addressable book” or “addressable tree”.
Furthermore, the literature on vector models for phrases can be searched on the Web. IDFs for phrases are certainly computable. To read up on the subject or on inverted index strategies, index segmentation, index merging, etc., read http://lucene.sourceforge.net/talks/inktomi/ or just do a search for “phrase idf”.
To sum up, even if no current commercial search engine uses plain tf*IDF this does not mean they don’t use vector space models for classification, retrieval, and ranking.
Vector space models are often present in different flavors and at different levels of an IR architecture. Vector Theory itself is an ancillary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.