Archive for the ‘IR Tutorials’ Category

Thesaurus as a Complex Network

August 6, 2009

I came across Thesaurus as a complex network, a fascinating 2003 paper written by Adriano de Jesus Holanda, Ivan Torres Pisa, Osame Kinouchi, Alexandre Souto Martinez and Evandro Eduardo Seron Ruiz from Universidade Sao Paulo, Brazil in which they model thesauri using graph theory. The abstracts reads:

“A thesaurus is one, out of many, possible representations of term (or word) connectivity. The terms of a thesaurus are seen as the nodes and their relationship as the links of a directed graph. The directionality of the links retains all the thesaurus information and allows the measurement of several quantities. This has lead to a new term classification according to the characteristics of the nodes, for example, nodes with no links in, no links out, etc. Using an electronic available thesaurus we have obtained the incoming and outgoing link distributions. While the incoming link distribution follows a stretched exponential function, the lower bound for the outgoing link distribution has the same envelope of the scientific paper citation distribution proposed by Albuquerque and Tsallis [1]. However, a better fit is obtained by simpler function which is the solution of Ricatti’s differential equation. We conjecture that this differential equation is the continuous limit of a stochastic growth model of the thesaurus network. We also propose a new manner to arrange a thesaurus using the “inversion method”.”

The study is important because it provides an interesting look at word relationships. They have identified an underlying power law, which in my opinion might be worth to be investigated as to whether it is at core of semantic relationships.

They briefly mentioned the limitations of LSA.:

“However, LSA has been criticized as a poor approach for predicting semantic neighborhood”.

Indeed, LSA (or LSI) not necessarily describes or predicts semantics, as originally thought. In my view, LSA/LSI itself is a misnomer. Research references can be provided to support this view.

I do have one additional comment on the paper. In it, LSA is described as a PCA technique. The authors write:

“Another interesting way to treat data is the Latent Semantic Analysis (LSA) [5] which deals with word covariance in a corpus. LSA is a principal component analysis (PCA) technique , i.e., the covariance matrix is diagonalized and from the most important eigenvalues (around 300) the eigenvectors are considered to span an Euclidean vector space.”

This might not be entirely accurate. Let see why:

1. PCA was invented by Karl Pearson in 1901 so is more than half a century  older than Golub and Kahan’s SVD algorithm which was published in 1965. See G. Golub and W. Kahan, J. SIAM, Numer. Anal. SEr. B, Vol 2, No. 2 (1965). 

2. In 1988 Dumais, et al applied Golub’s SVD to text and called that LSA (LSI). See Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286, Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. & Harshman, R. (1988). See also, Improving information retrieval using Latent Semantic Indexing. Proceedings of the 1988 annual meeting of the American Society for Information Science. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Beck, L. (1988).

3. In LSA (LSI) the SVD algorithm can be applied to matrices that not necessarily are populated with covariance values.

4. It was only later realized that SVD can be applied to a covariance matrix to obtain the PCA components.

5. See the PCA & SPCA Tutorial

6. PCA is not LSI. See http://irthoughts.wordpress.com/2007/05/05/pca-is-not-lsi/

IR Videos in Spanish

June 22, 2009

I normally do not put online my lecture notes (ppt, pdf, videos). However, there are two public conferences that event organizers taped. Both last over 1 hour and are in Spanish, but with slides in English. Here are the links. The quality of the videos is so-so.

Since the videos were made available few months later after the events, these are not properly dated. I have included below the actual date of the events. If you don’t know Spanish, you are out of luck.

1. Understanding Search Engines (Entendiendo a los Buscadores), University of Puerto Rico, Bayamon, 4-23-2008

http://video.google.com/videoplay?docid=-653964730907023811

This one last for about two hours. The audience consisted of grad students and researchers. Unfortunately, the video has an audio-visual mismatch of about one slide. If you can coupe with this, I hope you like it.

2. Demystifying LSI (Desmitificando LSI)- OJOBuscador Congress, Madrid, Spain, 3-09-2007.

http://www.ojotube.com/videos/congreso-ojobuscador-2007-ponencia-desmitificando-lsi-de-dr-e-garcia/

This one last for over one hour. Since it was for a non-scientific audience  (most Spanish SEOs)  I tried to talk very slow.

Vector Normalization with Excel – Part II

May 7, 2009

Back in March, we explained how to normalize column vectors with Excel. But, what about normalizing row vectors? This question is addressed in the current QA column of IRW. I think it might be useful sharing the answer with readers since many of these are students struggling with similar questions. So, here we go.

The following table emulates an Excel array consisting of three columns (A, B, and C) and six rows (1-6).

  A B C
1 1 2 3
2 4 5 6
3 7 8 9
4 0.27 0.53 0.80
5 0.46 0.57 0.68
6 0.50 0.57 0.65

Rows 1, 2, and 3 are row vectors. Rows 4, 5, and 6 are the corresponding normalized vectors, also known as unit vectors because their length is 1. To compute these, do as follows:

1. In cell A4, enter the formula =A1/(SQRT(SUMSQ($A1:$C1))). The result should be as given in this cell.

2. Copy this formula, select cells A5 and A6 and paste the formula in these.

 3. Finally, copy at once cells A4 through A6, select the remaining empty cells of the array, i.e., cells B4 through C6 and paste the formulas in these.

Why IDF is Expressed Using Logs

April 15, 2009

Recently a known SEO (name reserved) inquired me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.

I am partially reproducing/editing my responses, so it might help other SEOs with similar questions.

Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.

1) Why is a log function used for calculating IDF?
3) Would it be accurate to describe IDF as “the ratio of documents in a collection to documents in that collection with a given term”? I’m guessing your answer would be, IDF is the [LOG of " the ratio of documents in a collection to documents in that collection with a given term"]? Which brings us back to question, I guess? hehe

These are recurrent questions students asked me before. The reason for using logs is due to two assumptions frequently made in most IR models; i.e.

I. that scoring functions are additive.
II. that terms are independent.

While in some models II might not be present, both (I and II) play well with logs since these also are additive.

These functions and why the use of logs is explained in the recent RSJ-PM Tutorial http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf

Document Frequency (DF) is defined as d/D, where d is number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).

But since often D > d the log of d/D, that is log(d/D) gives a negative value. To get rid off the negative sign, we simply invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared. Now log(D/d) is conveniently called Inverse Document Frequency.

Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it must follows that for any two events, A and B

p(AB) = p(A)p(B).

Taking logs we can write

log[p(AB)] = log[p(A)]+ log[p(B)]

It is easy to show that for two terms

log(d12/D) = log(d1/D) + log(d2/D)

Inverting and using the definition of IDF we end up with

IDF12 = IDF1 + IDF2

validating assumption I; that IDF as a scoring function is additive.

That is the IDF of a two term query is the sum of individual IDF values. However, this is only valid if terms are independent from one another. If terms are not independent we would have two possibilities; i.e.,

p(AB) > p(A) + p(B)

or

p(AB) < p(A) + p(B)

and we cannot say that the IDF of a two term query (e.g, a phrase) is the sum of individual IDF values. Assuming the contrary as many SEOs think in order to promote some dumb keyword research tools is plain snakeoil.

2) What do you mean by ‘discriminatory power’ in the phrase “IDF is a measure of the discriminatory power of a term in a
collection.”

This is legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms too frequently used are not good discriminators between documents. If a a term is used in too many documents its use to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.

The RSJ-PM Tutorial mentioned above was written to kill for good some misconceptions regarding IDF. In it we explain why IDF is considered by Robertson and Jones a particular RSJ weight in the absence of relevance information.

In a nutshell, IDF is a collection wide estimate and as such the information on whether documents containing the terms being queried are relevant to these is unknown. Similarly, the information on whether documents not containing the query terms are relevant or not is unknown and often remains unscrambled when we just look at the d/D and d/(D – d) collection-wide ratios. All we can say is that relevant documents might have a higher probability of containing query terms in comparison with other documents from the collection as a whole. But we could make such assertion without resourcing to IDF as well.

In the case of Web documents, often these are about multiple topics. Many documents aggregate content from dissimilar sources (news headlines, rss, blogs, etc) and said document content might change in time. The mere mention of a term (regardless of repetition) is not a proof of its relevancy or of its importance with respect to the topics discussed in a document.

Thus, the idea that we can assess if terms are relevant to a document by simply comparing IDF values is missing the whole point and defeats the purpose for which the RSJ-PM model and many of its variants (e.g., BM25) were developed.

I hope this helps to clear up some SEO misconceptions on the topic.

RSJ-PM: Probabilistic Model Tutorial

March 30, 2009

As promised, I am pleased to announce the publication of the Robertson-Sparck Jones Probabilistic Model Tutorial.

It is available in Mi Islita.com in the Tutorials Section. A link is provided in the index page.

The tutorial guides you through the intricasies of RSJ-PM. It is a great start for CS students and teachers interested in probabilistic models in information retrieval.

Enjoy it.

Due to the time spent on it, the April issue of the IR Watch newsletter will be a bit delayed.

SEOs and Their IDF Myths: Part 3

March 20, 2009

In SEOs and their IDF Myths, we covered how many are mistaking the measure of term specificity known as Inverse Document Frequency (IDF).

In SEOs and their IDF Myths: Part 2, we exposed some of these folks.

In Understanding TFIDF, we wrote a rebuttal.

We are still seeing so many bloggers mistaking IDF for something that is not. We have to conclude these pseudo-teachers either are just trying to sell something or they don’t really understand what term specificity stands for. They should know that IDF is a small pixel section within the bigger picture of the Robertson-Sparck Jones Probabilistic Model for information retrieval.

Thus, we are writing a tutorial on RSJ-PM to kill for good their intentionally misleading efforts. Hopefully, the tutorial will be ready before the month ends. It will be a great way of putting to rest all the false information flying around from the usual agents of misinformation (mostly SEOs). CS students interested in knowing about the pros and cons of probability models in IR will find it useful.

IDF and Vector Space Models

March 11, 2009

I’m back from SIDIM XXIV. It was a great conference in honor of Professor Oscar Moreno, from the Gauss Research Laboratory and NIC.PR (responsible for the .pr Internet domains.).

Dr. Moreno is a legend in the area of pure and applied mathematics. I have the privilege of meeting with him.

The conference plenary speakers were equally three legends:

Elwyn Berlekamp, University of California at Berkeley
Solomon W. Golomb, University of Southern California
Guang Gong, University of Waterloo

The event was a success, although some speakers read straight from their notes. As an interdisciplinary conference on pure and applied mathematics, all kind of topics were covered.

I got the chance to present research work on a new global weight algorithm we are testing called scaled inverse document frequency (SIDF), a variant of the well-known IDF scheme.

For those unfamiliar with IDF and its implementation with ranking algorithms, Dr. Deepak Khemani from the Artificial Intelligence & Database Research Group at Indian Institute of Technology Madras has published a very useful tutorial presentation on Vector Space Models.

The tutorial is based on our series of articles on the subject and provides a better understanding of the theory. We could not have done it better.

Vector Normalization with Excel

March 4, 2009

Unit vectors are frequently used in information retrieval and data mining studies because simplify further calculations and analyses.

In the current issue of IR Watch, we show how easy is to convert column vectors into unit vectors with Excel. It is assumed you know how to define spreadsheet arrays in Excel and how to enter formulas in it.

Say we have two vectors in columns A and B each with four elements. To convert these into unit vectors, do this:

1. In cell C1, enter the formula =A1/(SQRT(SUMSQ(A$1:A$4)))

2. Paste content of C1 into cell D1. This creates a modified instance of this formula.

3. Paste content of C1 and  D1 cells into remaining empty cells of these columns by selecting these at once. This also creates modified instances of these formulas.

C and D columns represent the unit vectors.

A figure with a step-by-step example is given in IRW (free subscription)

Below is another example, but with the final results.

A B C D
1 8 0.13 0.36
2 10 0.26 0.45
4 12 0.53 0.53
6 14 0.79 0.62

That was easy!

If you use the first row to label columns, as in this example, be sure to readjust the formulas so these start at cell 2 and run up to cell 5.

If you still have questions on how to do this, email me or subscribe to IRW.

Answers to IR Quiz

June 27, 2008

Answers to the IR Quiz are given below:

Term Independence Assumption:

If k1 and k2 are statistically independent they should occur by chance, co-occurring in only

(100)(200)/500 = 40 documents.

Thus, if they occur by chance, the number of documents mentioning the k1 k2 sequence should be unknown, but certainly no greater than 40.

Term Dependence Assumption:

If terms actually co-occur in 70 documents, they are co-occuring more often than by chance (70 > 40). So, terms are statistically dependent and positively correlated. It is a given that the k1 k2 terms sequence is present in 25 out of the 70 documents wherein terms co-occur.

Detailed Results:

Results are given below, rounded off to two decimal places. First/second results respectively are for terms independence/dependence assumptions. You should be able to double check these results.

1. k1 NOT k2: 60, 30

2. k2 NOT k1: 160, 130

3. k1 OR k2 (unconditional OR): 260, 230

4. k1 OR k2 (conditional OR): 220, 160

5. NOT k1: 400, 400

6. NOT k2: 300, 300

7. NOT (k1 AND k2): 460, 430

8. k1 AND k2 NOT (k1 k2): NC, 45

9. EF-Ratio of the k1 k2 terms sequence: NC, 0.36

10. c12-index of the k1 k2 terms sequence: NC, 0.11

11. c12-index of k1 AND k2: 0.15, 0.30

12. IDF of k1: 0.70, 0.70

13. IDF of k2: 0.40, 0.40

14. IDF of k1 AND k2: 1.10, 0.85

15. IDF of k1 k2 terms sequence: NC, 1.30

Additional exercises open to discussion:

Calculate the associated odds, odd ratios, and logits.

 

IR Quiz

June 18, 2008

Here is a question I included during the final examination of the Search Engines Architecture course. I am modifying the question. It might serve as a little quiz for non IR readers:

A collection consists of 500 documents. Some documents mention k1 and/or k2 keywords. If 100 mention k1, 200 mention k2, 70 mention k1 and k2, and 25 mention the k1 k2 terms sequence. Calculate the number of results for the following queries first, assuming terms independence and second assuming terms dependence. If the calculation is not possible from the provided data, write NC, ‘Not Computable’.

1. k1 NOT k2

2. k2 NOT k1

3. k1 OR k2 (unconditional OR)

4. k1 OR k2 (conditional OR)

5. NOT k1

6. NOT k2

7. NOT (k1 AND k2)

8. k1 AND k2 NOT (k1 k2)

9. EF-Ratio of the k1 k2 terms sequence

10. c12-index of the k1 k2 terms sequence

11. c12-index of k1 AND k2

12. IDF of k1

13. IDF of k2

14. IDF of k1 AND k2

15. IDF of k1 k2 terms sequence

Total Possible Scores: 15 points for terms independence and 15 points for terms dependence correct results.

Grading Yourself: A (100 – 90), B (89 – 80), C (79 – 70), D (69 -60), F(59 – 0)

Correct answers will be given during the week.

 

PCA and SPCA Tutorial

March 25, 2008

As promised, the PCA and SPCA Tutorial is now available at Mi Islita.

I wrote this tutorial for two reasons:

1. to help graduate students taking the Search Engines Architecture course with a general overview and review of linear algebra concepts.
2. because most tutorials discuss PCA, but ignore SPCA.

If you are enrolled in the course and have any question, please feel free to post.

Feel also free to comment or make suggestions so we can improve the tutorial.

Search Engines Architecture Week 2

March 14, 2008

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
First-Breadth and Deep-First Web Crawlers
The Terrier Desktop Searches Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of ATA and the trace of this matrix. Do the same for AkTAk. Compare results and draw some conclusions. What important property is confirmed?

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize VkT and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from AkTAk. Explain your observations.

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286,
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith Lindsay (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations
Boldi, P., S. Massimo, V. Sebastiano Vigna (2007)
PDF

IRW-2008-2 Sneak Preview

February 18, 2008

Scalar Clusters

Masking Effects in Similarity Matrices:
Topics with strong similarities can mask other topics,
making these invisible to the clustering process.

Here is the sneak preview of the current issue of IR Watch – The Newsletter. Due to academic duties, it is running late. It should be in subscribers’s inboxes during the day. The topic is Scalar Clusters and Back Mapping and is based on lecture material covered in the Web Mining course. The following topics are covered:

Introduction
On Association and Scalar Clusters
The Neighborhood Similarity Matrix, Mn
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Term Clusters to Documents
Masking Effects in Similarity Matrices
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Not a subscribers? What are you waiting for?

Back Mapping for the Masses

January 31, 2008

In a recent tutorial on association and scalar clusters, http://www.miislita.com/information-retrieval-tutorial/association-scalar-clusters-tutorial-1.pdf, I introduced a back mapping technique wherein once features conforming clusters are extracted from objects, the clusters are mapped back to objects.

The technique works well with clusters of terms extracted from documents. The reverse case is also possible: given a cluster of documents extracted from terms, it is possible to map these back to terms.

What do we gain from such two-way manipulations? A lot. Consider the first scenario: Mapping term clusters back to documents; a tutorial on the second scenario will be available soon.

Back Mapping Term Clusters to Documents

A document is just a distribution over topics while topics are distributions over words. Thus, across a collection of documents there are topics hidden (latent) and waiting to be uncovered. Back mapping allows us to recover these, precisely.

Combinations of terms that do not amount to topics across the collection are discovered as well. Reasonably, one would expect these to be least relevant across other documents than those distributed across the collection. In addition, one would expect documents traced back to clusters to be the most relevant documents, from the collection and with respect to the topics.

The implications of this for search engine optimization and keyword bidding are quite obvious. Implementation is straightforward. To learn more about it, read Part 1 of the tutorial.

A Case-Based Experience Sharing Search Engine

November 28, 2007

Mobyen Ahmed, Erik Olsson, Peter Funk, Ning Xiong from Department of Computer Science and Electronics, Malardalen University, Sweden have published the paper “Efficient Condition Monitoring and Diagnosis Using a Case-Based Experiene Sharing System”
(http://www.mrtc.mdh.se/publications/1269.pdf)  at a workshop at the Swedish Artificial Intelligence Society, p 70-80, in May, 2007.

This is a case-based reasoning search engine system that could be used in an industrial environment. That’s quite interesting.

In their paper, the authors kindly referenced me. That’s an honor. It appears that more CS researchers are happy with my Cosine Similarity Tutorial (http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html) and with Mi Islita’s content (http://www.miislita.com/searchito/educational-links.html). If they are happy, I’m happy.

Being Referenced Without Being Referenced

September 13, 2007

I have the honor of being “referenced” by Lau Patrick from ETH’s Databases and Information Systems Group in the manuscript Algorithms for Data Base Systems Report to the topic “Abusing Web Search” based on “A web based kernel function for measuring the similarity of short text snippets”.
http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/LauReport.pdf

Interesting project. I can even recognize some figures and examples of the manuscript taken from my document indexing tutorial,
http://www.miislita.com/information-retrieval-tutorial/indexing.html

In particular, Figure 1 of the tutorial, shown in the manuscript as Figure 2. No link to the tutorial was provided, but to a different document at Mi Islita. That’s fine, since I referenced that my Figure 1 is a modification from the one at  

http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

At least they should have given credit to Ruthven and Lalmas to avoid allegations of student plagiarism.

Memo to Thomas Hofmann (Google), Donald Kossmann, and Peter Widmayer from http://www.dbis.ethz.ch/education/ss2007/07_dbs_algodbs/:

Please! Better than talking about Web search abuse, similarity of text, and duplicated content, let’s talk about lack of originality and properly referenced work.

Levenshtein Edit-Distance Based Tool

August 20, 2007

As announced, the Levenshtein Edit-Distance Based Tool is now available at Mi Islita.com site.

The tool is meant to be for demonstration purposes; e.g., as in a classroom setting or as part of a hands-on tutorial on edit distances.

Some suggested conversions are:

Democrats –> Republicans
Google –> Yahoo!
Good –> Evil
password –> userID
Jesus –> Satan
Britney –> Spears
Lotto No. –> Quick Pick No.

Enjoy it!

Upcoming Tool on Edit Distances

August 17, 2007

I’m working on a tool for computing edit distances (number of insertions, deletions, and substitutions) in a text stream.

It will be up and running this Monday. It is great for a hands-on tutorial.

Did you know that to change Democrats into Republicans and vice versa requires of just 8 edits? :)

(more…)

Being Quoted at University of Campinas, Brazil

May 24, 2007

More universities are quoting our tutorials. I’m happy to learn that these are read even in the huge and great nation that is Brazil.

Today we found out that Prof. Wu, Shin – Ting from Department of Computer Engineeering and Industrial Automation, School of Electrical and Computer Engineering State University of Campinas, Sao Paulo, Brazil and who teaches EA978 Graphic Information Systems is referencing our Matrix Tutorial 3: Eigenvalues and Eigenvectors.

(more…)

Our Tutorials, Required Readings at University of Maryland

May 18, 2007

Yan Qu over at the College of Information Studies, University of Maryland taughts the graduate course

LBSC 670 Information Structure

For the course Qu selected as required readings our tutorials:

(more…)