Archive for the ‘Vector Space Models’ Category

Vector Space Models and Search Engines

April 21, 2008

This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/

That said, today’s post is in reaction to the article at
http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

The title of that article sends the message that search engines work by using tf*IDF. In addition, IDF itself is mistaken. It is not clear from the article how vector space models work or are used by search engines. The author then seem to agree with few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.

Blockquoted passages and my comments are next.

Blockquoted Passage

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (”the reluctance to stop talking about a certain subject” …)would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document with.

Comments

At grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. More likely no current top search engine uses plain tf, IDF or tf*IDF to rank docs. How IDF works?

Let N be number of docs in a collection and n be number of docs containing a given term. The probability of randomly choosing a doc containing a given term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as

IDF = log(1/p) = log(N/n)

Logs at a given base (often 2 or 10) are used.

Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.

Frequently used terms have a small discriminatory power, regardless if they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF) regardless if they are relevant to the topic of the document mentioning these. Term relevancy and the discriminatory power of a term not always run “side by side”. Some times they do, though.

The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.

For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” probably would be relevant to the doc and be discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the “job” term is still relevant and meaningful to the document, but more likely will lose its discriminatory power within the collection. And we haven’t considering yet how relevant “job” or the documents containing the term is to end-users looking for jobs.

IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.

For instance, we can use IDF probabilistic (IDFP) by considering the odds (p/(1 - p)) instead of just p. Inverting and taking logs,

IDFP = ((N - n)/n)

If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.

Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html  

There are other ways of defining goblal weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.

Blockquoted Passage

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.

Comments

Like global weights, there are dozen of ways we can define local weights, L. In the original vector space model, L = f was used, where f is the frequency (occurrence) of a term in a doc.

This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation to use?

The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that atttenuate frequencies are better choices.

L = f/fmax
L = 1 + log(f)
L = log(1 + f)

etc, etc, etc.

Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.

There are many ways of defining L and G, not to mention variants of document normalization weights N. These then give a product weight:

W = L*G*N

A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of large and short documents.

As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.

Blockquoted Passage

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

Queries as Vectors

Every word in a query can also be shown as a vector.

By looking at documents that are “near” our query we can rank (sort) documents in our result set.

Comments

In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.

However, in LSI every word of a doc can be represented as a vector, but this is not what the referred article discusses.

Blockquoted Passage (from one commenter)

It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.

Comments

Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.

There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to conform an addressable “book” of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. We can put in the posting lists different doc features, like f values, word positions, word spacing, in-title, etc.

The index can be computed and already be in cache before any query. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. For each term, IDFs are already precomputed in advanced. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over entire collection).

From matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note these can be a predetermined number of docs from the index. Thus, if million docs are matched by the posting list(s) only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.

For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resource to precaching strategies, so a similar analysis can also be done offline, too, (e.g. from a pool of frequently queried terms –the so-called suggestion lists). We can also use prebuilded thesaurus to find similar docs, impacting precision and recall. Processes can be called by geolocation to please a specifc region or regional directory, etc.

For rather small collections, the term-doc matrix can be from all terms in the little collection that is stored on disk or small term-doc matrices can be constructed in advanced from each little posting list. All these, done off-line and before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, results postprocessed, ranked, and presented to the end user under fractions of a second.

To boldy suggest that vector models are not used by top search engines for ranking docs is plain non sense. Still, link weight only or vector similarity scores only are not enough. These scores can be combined with link weights or with other analytics, to get a final score.

Combining those scores simply does not simplifies computation, but adds another complexity layer while not doing it can leave out meaningful docs and queries.

Note

Since SEOS love to quote each other and I love to quote IRs, let’s have a happy medium and quote both:

http://www.webpronews.com/topnews/2001/09/05/google-interview-by-fredrick-marckini

One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html  which incidentally mentions tf and IDF.

Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explains in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustive searching the collection.

Blockquoted Passage

Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.

Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

Comments
Why agree to SEO hearsay? See previous comments

Also depending on how was an index constructed, a query no need to “travel” an entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhausting searches. That’s why it is known as an “addressable book” or “addressable tree”.

Furthermore, literature on vector models for phrases can be searched on the Web. IDF for phrases are certainly computable. To snoop at the subject or about inverted index strategies, index segmentation, index merging, etc. read http://lucene.sourceforge.net/talks/inktomi/  or just do a search for “phrase idf”.

Conclusion

To sum up, even if no current commercial search engine uses plain tf*IDF this does not mean they don’t use vector space models for classification, retrieval, and ranking.

Vector space models often are present in different flavors and within different levels of an IR architecture. Vector Theory itself is an ancilliary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.

Keyword Density Tools and SEOs

February 26, 2008

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that is not useful.

Two issues that strikes me as these suggest a lack of understanding of how search engines work accomodate to the following questions:

1. Could KD be used by search engines or users to check for spam keyword?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for spam keyword?

Word repetition determined by search engines as spam keyword should be of more concern than to what web page creators or a KD tool tag as spam keyword. After all search engines and not designers of web pages are the one that assign a rank to the documents. This goes with the user-machine relevance perception mismatch and the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users are a zero to the left, as they are the one that pay the bills. And even if they don’t, why rank high a page just to see users going to some place else after visiting it because is not suitable for human consumption? So, rather than using a KD tool, just write as natural and useful to your prospective clients and readers as you can.

Regarding the use of KD tools for checking for spam, this allegation reminds me of certain seo books, marketers, and community forums that insist in such non sense, just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked almost on a rutinary basis these and similar SEO myths. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening down the effect of keyword repetition. Two for the price of one!

This is more cost effective at neutralizing keyword repetition than computing (and comparing against) a whole new ratio, KD. Best of all, it does not require of the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scales of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.

To be sure students understood, I included the following three questions in the Final Exam section that consisted of multiple choices. (The problem-solving section of the test is even more interesting, but is too long to include it here.)

#10. It is a false statement:

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times this term should be repeated in d2 to triplicate its d1 weight? Assume Log 10 base.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is non sense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, I’ll be teaching this Spring the graduate course Search Engines Architecture.

I will blog the syllabus tomorrow, but is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own authorship.

On and on, SEOs still have no clue about what a search engine can or cannot do.

Global Term Weights based on Entropies

January 16, 2008

A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.

Global weights based on entropies are frequently combined with local and normalization weights into overall weights.  These are then used to populate a term-doc matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.

The following set of equations define the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:

Global Entropy Weights

Evidently,

G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.

Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms that appear fewer times in a small number of documents, while lowering the weights of terms that are frequently used across the collection.

Note that the convention is to default p log p values when a condition is met; e.g., p log p = 0 if p = 0 or 1.

A Novel Perspective of Similarity

September 18, 2007

The problem with so many definitions of similarity measures is that each of them is defined for a particular knowledge or model domain or tied to a specific problem or application. In addition, some are based on assumptions which are not clearly stated. Thus, transforming such similarity measures into distances simply compounds the problem.

Professor Dekang Lin, in An Information-Theoretic Definition of Similarity addresses these issues and provides an exciting perspective.

According to Lin,

A problem with previous similarity measures is that each of them is tied to a particular application or assumes a particular domain model. For example, distance-based measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in a network. If a collection of documents is not represented as a network, the distance-based measures do not apply. The Dice and cosine coefficients are applicable only when the objects are represented as numerical feature vectors.

Another problem with the previous similarity measures is that their underlying assumptions are often not explicitly stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.

Lin’s approach is quite interesting.

The similarity measure is not defined directly by a formula. Rather, it is derived from a set of assumptions about similarity. In other words, if the assumptions are deemed reasonable, the similarity measure necessarily follows.

Being Referenced Without Being Referenced

September 13, 2007

I have the honor of being “referenced” by Lau Patrick from ETH’s Databases and Information Systems Group in the manuscript Algorithms for Data Base Systems Report to the topic “Abusing Web Search” based on “A web based kernel function for measuring the similarity of short text snippets”.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/LauReport.pdf

Interesting project. I can even recognize some figures and examples of the manuscript taken from my document indexing tutorial,
http://www.miislita.com/information-retrieval-tutorial/indexing.html

In particular, Figure 1 of the tutorial, shown in the manuscript as Figure 2. No link to the tutorial was provided, but to a different document at Mi Islita. That’s fine, since I referenced that my Figure 1 is a modification from the one at  

http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

At least they should have given credit to Ruthven and Lalmas to avoid allegations of student plagiarism.

Memo to Thomas Hofmann (Google), Donald Kossmann, and Peter Widmayer from http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/:

Please! Better than talking about Web search abuse, similarity of text, and duplicated content, let’s talk about lack of originality and properly referenced work.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

THESUS: Semantic Subsets Based on WWW Links

August 2, 2007

I came across a 2003 paper on mining WWW text links, in which researchers introduced THESUS. Really interesting piece.

The abstract of THESUS: ORGANIZING WEB DOCUMENT COLLECTIONS BASED ON LINK SEMANTICS follows:

(more…)

Conceptual Similarity and Conceptual Mapping

July 26, 2007

I came across a relatively old paper authored by researchers at Indiana University and Institute for Human and Machine Cognition: Assessing Conceptual Similarity to Support Concept Mapping

Indeed, it seems like a great concept.

(more…)

What is a Similarity Thesaurus?

July 16, 2007

In my previous post I explained to a reader the difference between inverse term frequency (ITF) and inverse document frequency (IDF), but did not provide practical applications. This post is to explain what ITF is good for.Like IDF, ITF is a global weight measure; i.e., Gi = ITF. Combined with a local weight measure (Lij), it can be used to compute an overall weight.Local weights can be defined in many different ways. Here is one definition:

(more…)

What is Inverse Term Frequency?

July 13, 2007

Good question from a reader:

Hi, I have heard the expression inverse document frequency, but recently I came across a paper mentioning inverse term frequency. Can you clarify these?

Sure.

Consider a set, D, consisting of n documents:

D ={d1, d2…dn}

and a set, T, consisting of m unique terms extracted from D:

T = {t1, t2…tm}

(more…)

Research Channel, LRA, Microsoft, and more

June 26, 2007

I forget to mention that I’m attending ICANN this week, so most will be legacy posts –straight from the conference.

The ResearchChannel is a research consortium dedicated to serve as an online channel for the dissemination of cutting edge technologies. If you want to learn the real stuff under the hood of search engines, just do it through the ResearchChannel. Want to learn the difference between LSA(LSI) and LRA (Latent Relational Analysis)?

(more…)

Closeness, Proximity, Similarity, and Distance

June 20, 2007

Often a distinction between the terms given in the title of this post is not clear in the literature.

Closeness is a generic notion that can be expressed in terms of proximity, similarity or distance.

(more…)

On n-Grams and IR Theses

June 15, 2007

A grad student asked me about n-grams in IR as a thesis topic.

What are n-grams

Well, most of the modern work on n-grams is due to the work by D’Amore and Mah (1985) ONE-TIME COMPLETE INDEXING OF TEXT: THEORY AND PRACTICE

Page 116 of Grossman and Frieder (Information Retrieval: Algorithms and Heuristics) has great introductory material.

(more…)

Our Tutorials, Required Readings at University of Maryland

May 18, 2007

Yan Qu over at the College of Information Studies, University of Maryland taughts the graduate course

LBSC 670 Information Structure

For the course Qu selected as required readings our tutorials:

(more…)

Thesis: Information Retrieval with Genetic Programming

May 10, 2007

Here is the 2002 master thesis of Nir Oren, University of the Witwatersand, Johannnesburg:

Improving the effectiveness of information retrieval with genetic programming

where he proposes an interesting approach to IR using genetic algorithms. Part of his abstract states:

(more…)

Representing Documents, Terms, and Queries in the Same Space

May 6, 2007

A reader asked me an interesting question: Without using LSI, how do you represent documents, terms, and queries in the same space?

(more…)