IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Latent Semantic Indexing

My IPAM Lost Pictures

20 Friday Jan 2012

Posted by egarcia in Data Mining, Latent Semantic Indexing, Machine Learning, Statistics and Mathematics

On January 23-27, 2006, I was at the Institute for Pure and Applied Mathematics (IPAM), UCLA, California, attending a now infamous Document Space Workshop. I took some pictures, but did not find them until now.

I’ve posted them on my Facebook page. In them I am posing with the then IPAM director and with world-recognized LSI expert Dr. Michael Berry and his former students. To learn more about the workshop and the speakers, follow this link: http://www.miislita.com/ipam/ipam-document-space-workshop.pdf

Matrix Algebra for Search Marketing

23 Monday Aug 2010

Posted by egarcia in IR Quizzes, Latent Semantic Indexing, Search Engines Architecture Course

Today I feel like giving away a little quiz material on applied linear algebra. The topic is relevant these days, when some misleading SEOs are playing the we-do-“science” game (quack “science”, after all).

The following is taken from the Search Engines Architecture grad course I lectured back in 2008. I’m providing only one exercise with multiple parts. The quiz with answers might be a great topic for an IRW issue.

1.1 A search engine has three types of revenue channels: pay-per-click (PPC), pay-per-placement (PPP), and pay-for-conversion (PFC). In quarter 1, the respective revenues, in millions of dollars, were 20, 4, and 9. In quarter 2, PPC revenues were 20% lower, PPP revenues doubled, and PFC revenues remained constant.

1.1.1 Write a matrix M1 expressing the revenue and quarter vectors for the first two quarters.

1.1.2 If the goal in quarter 3 is to increase by 20% all revenues earned in quarter 2, update M1 so it reflects such a goal as a new matrix M2.

1.1.3 If the goal in quarter 4 is to meet, for each channel, the average of its revenues over the previous quarters, update M2 so that it reflects that goal as a new matrix M3.

1.1.4 Express the above quarters as column unit vectors. Inspecting either rows or columns, construct a nearest-neighbor similarity matrix Mnn and construct scalar clusters of quarters. Ignore cosine similarity deviations of 0.02 units or less. How similar are the quarters?

Have fun.
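
For readers who want to check their arithmetic, here is a minimal sketch of one possible reading of the exercise, not an official answer key. It assumes that the quarter 4 goal is the per-channel mean of quarters 1 to 3 and that similarity is measured between quarters (the columns of the matrix). The names q1 to q4, unit, dot, and similarity are mine, not part of the quiz.

```python
import math

# Quarter 1 revenues in millions of dollars, ordered PPC, PPP, PFC.
q1 = [20.0, 4.0, 9.0]
# Quarter 2: PPC 20% lower, PPP doubled, PFC unchanged.
q2 = [q1[0] * 0.8, q1[1] * 2.0, q1[2]]
# Quarter 3 goal: every quarter 2 revenue increased by 20%.
q3 = [1.2 * r for r in q2]
# Quarter 4 goal (assumed reading): per-channel mean of quarters 1 to 3.
q4 = [sum(channel) / 3.0 for channel in zip(q1, q2, q3)]

quarters = [q1, q2, q3, q4]

def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

units = [unit(q) for q in quarters]
# Cosine similarity matrix: entry (i, j) compares quarter i+1 with quarter j+1.
similarity = [[dot(a, b) for b in units] for a in units]

for row in similarity:
    print(" ".join(f"{s:.3f}" for s in row))
```

Since the quarter 3 vector is just the quarter 2 vector scaled by 1.2, those two columns point in the same direction, which is a useful sanity check before clustering.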

A Web-Browser Approach to the Construction of Fractals and Multifractals

28 Friday May 2010

Posted by egarcia in Fractal Geometry, Latent Semantic Indexing, Programming

We have published a new article titled:

A Web-Browser Approach to the Construction of Fractals and Multifractals

The traditional way of constructing fractal patterns is with mathematical algorithms. Some of these are based on pixel-by-pixel drawing techniques wherein the output of a recursive function is evaluated against a predefined condition. This is a slow process that requires a large number of iterations.
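
To make the pixel-by-pixel idea concrete, here is a minimal sketch of the traditional approach, not of the method described in our article. As an assumption for illustration it uses the well-known escape-time test for the Mandelbrot set, with character cells standing in for pixels. The function names escapes and render and the parameter defaults are hypothetical.

```python
def escapes(c, max_iter=50, bound=2.0):
    """Return True if z -> z*z + c leaves the disc |z| <= bound within max_iter steps."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c          # output of the recursive function
        if abs(z) > bound:     # predefined condition
            return True
    return False

def render(width=60, height=24):
    """Evaluate every character cell ("pixel") and return the rows as strings."""
    rows = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 2.6 * i / (width - 1)
            row += " " if escapes(complex(x, y)) else "#"
        rows.append(row)
    return rows

print("\n".join(render()))
```

The cost grows as width times height times max_iter, which is why this style of drawing becomes slow at useful resolutions.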

Other strategies, or combinations of strategies, have been proposed; for instance, using HTML tables, image files, VML, SVG, or the canvas tag introduced in HTML5. In most cases, implementing these strategies and techniques is not straightforward, involves a learning curve, or unnecessarily consumes Web server resources.

None of this is necessary with our approach. We are also providing the corresponding source code at Mi Islita.com, so others can reproduce or improve our results. The files are compressed with WinRAR. An uncompressed, live example for the Sierpinski Gasket is provided.

Mann Iteration Method

11 Tuesday May 2010

Posted by egarcia in Fractal Geometry, Latent Semantic Indexing, Programming

Here are some great papers on the Mann iteration method.

“What does it have to do with IR, CSS, or fractals?”, you might ask. Well, I’m using it in an upcoming article.

Simpler is also better in approximating fixed points

The equivalence of Mann and Ishikawa iteration methods

The equivalence between the convergence of Mann and Ishikawa iteration methods
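
For readers who have not met the method, the Mann iteration approximates a fixed point of a map T with the averaged update x_{n+1} = (1 - a_n) x_n + a_n T(x_n), where each a_n lies between 0 and 1. Below is a minimal sketch. The choice of T = cos, the starting point, and the constant step of 0.5 are illustrative assumptions of mine and are not taken from the papers above.

```python
import math

def mann_iteration(T, x0, alpha, steps=100):
    """Iterate x_{n+1} = (1 - alpha(n)) * x_n + alpha(n) * T(x_n)."""
    x = x0
    for n in range(steps):
        a = alpha(n)                     # step size for iteration n, in [0, 1]
        x = (1.0 - a) * x + a * T(x)     # weighted average of x and T(x)
    return x

# Illustrative choice: approximate the fixed point of cos with a constant step.
fixed_point = mann_iteration(math.cos, x0=1.0, alpha=lambda n: 0.5)
print(fixed_point, math.cos(fixed_point))
```

Setting alpha to the constant 1 recovers plain repeated application of T, so the step sequence is the only thing that distinguishes the two schemes in this sketch.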

Myths about Correlation Coefficients

06 Thursday May 2010

Posted by egarcia in Latent Semantic Indexing, SEO Myths

In a recent post (http://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/) we warned readers against SEO statistical studies. For those who want a second opinion, here is a collection of correlation coefficient myths, taken from a reviewed manuscript written by top researchers from Sandia National Labs, Stony Brook University, Iowa State University, Lewis and Clark College, and Applied Biomathematics. Enjoy it.

http://www.ramas.com/wttreprints/Myths.pdf

The authors provide applications to risk analysis while debunking the following widespread myths:

1. All variables are mutually independent.
2. If X and Y are independent and Y and Z are independent, then X and Z are too.
3. Variables X and Y are independent if and only if they are uncorrelated.
4. Zero correlation between X and Y means there’s no relationship between X and Y.
5. Small correlations imply weak dependence.
6. Small correlations can be “safely ignored” in risk assessments.
7. Different correlation coefficients are similar.
8. A correlation coefficient specifies the dependence between two random variables.
9. Correlation coefficients vary between −1 and +1.
10. Any correlation can be specified between inputs.
11. Perfect dependencies between X and Y and between X and Z imply perfect dependence between Y and Z.
12. Monte Carlo simulations can account for dependencies between variables.
13. Varying correlation coefficients constitutes a sensitivity analysis for uncertainty about dependence.
14. A model should be expressed in terms of independent variables only.
15. You have to know the dependence to model it.
16. The notion of independence generalizes to imprecise probabilities.

Read the article to understand the context in which these are listed before agreeing or disagreeing with them.

They also provide reference material on many correlation coefficients:

“There are many different measures of correlation that are in common use and many more that have been proposed. The most commonly used measures are Pearson’s product-moment correlation and Spearman’s (1904) rank correlation, but there are a host of other measures that also arise in various engineering contexts, including Kendall’s rank correlation, concordance of various kinds (e.g., Hoeffding 1947; Lehmann 1966; Scarsini 1984), Blomqvist’s (1950) coefficient, Gini’s coefficient (Nelsen 1999), etc. Hutchinson and Lai (1990) review many of these.”
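
Two of the myths listed above (4 and 7) are easy to check numerically. The sketch below is my own toy illustration, not an example from the manuscript. It computes Pearson's coefficient directly and Spearman's coefficient as the Pearson correlation of the ranks. The data sets and function names are assumptions, and the rank helper assumes there are no ties.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Myth 4: Y is completely determined by X, yet the correlation is zero.
xs_sym = [-3, -2, -1, 0, 1, 2, 3]
ys_sq = [x * x for x in xs_sym]
r_zero = pearson(xs_sym, ys_sq)

# Myth 7: a monotonic but strongly nonlinear relationship.
xs_mono = [1, 2, 3, 4, 5, 6]
ys_exp = [math.exp(x) for x in xs_mono]
r_pearson = pearson(xs_mono, ys_exp)
r_spearman = spearman(xs_mono, ys_exp)

print(r_zero, r_pearson, r_spearman)
```

In the first data set the dependence is perfect but not linear, so Pearson's coefficient misses it. In the second, the rank-based coefficient reports a perfect monotonic relationship while Pearson's coefficient stays clearly below 1.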

Fractal Art and Three-Column Iterated Layouts

03 Monday May 2010

Posted by egarcia in Fractal Geometry, Latent Semantic Indexing, Programming

I’ve uploaded a new article on fractals:

Fractal Art and Three-Column Iterated Layouts

Enjoy it.

In an upcoming article I will demonstrate how to create classic fractals as found in the literature, right in the user’s browser, with no HTML5 canvas tag and without high-level mathematical algorithms. We only need to use CSS and HTML. Markup code will be provided for those interested in reproducing the results.

A great ppt presentation on Latent Semantic Indexing

01 Thursday Apr 2010

Posted by egarcia in Latent Semantic Indexing

Here is a nice ppt presentation from the lectures of Prabhakar Raghavan, Christopher Manning, and Thomas Hofmann on Latent Semantic Indexing. I was happy to see in the notes of slide 25 that my SVD and LSI Tutorial was referenced.

For those that were gamed by some unethical SEOs using LSI verbose crap, check the following oldie, but still relevant, post:

http://irthoughts.wordpress.com/2007/05/01/irwatch-may-issue-demystifying-lsi/

Preferred Hackers Attack: Database Query Injections

22 Monday Feb 2010

Posted by egarcia in Latent Semantic Indexing

According to a report (http://www.infoworld.com/d/security-central/webs-greatest-security-threats-revealed-949), query injection is the preferred hacking method, since it enables other forms of attack. Great.
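
For readers unfamiliar with the attack, here is a minimal sketch of how a query injection works and how a parameterized query blocks it. It is my own toy illustration using Python's built-in sqlite3 module, not an example from the report. The table, rows, and hostile input are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

user_input = "nobody' OR '1'='1"   # hostile input (hypothetical)

# Vulnerable: the input is pasted into the SQL text and rewrites the WHERE clause.
unsafe_sql = "SELECT name FROM users WHERE name = '" + user_input + "'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safer: a placeholder keeps the input as data, never as SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(leaked), len(safe))
conn.close()
```

The concatenated query matches every row because the injected OR clause is always true, while the parameterized query looks for a user literally named like the hostile string and finds none.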

Vector Notation

10 Monday Aug 2009

Posted by egarcia in Latent Semantic Indexing

I’ve been asked what the standard notation for vectors is. I normally use loose notation, unless I need to write or review a formal piece, in which case I follow the APS style. See also here.

A vector should be represented by a letter, in boldface or with a right arrow on top.

A caret should be used to indicate a unit vector.

An inner product should be indicated by placing a dot between two letters representing vectors.
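
As a quick illustration, the three conventions above can be typeset as follows. This is only a sketch of how I would write them in LaTeX, assuming the amsmath package for \lVert and \rVert. The letters a and b are arbitrary.

```latex
\[
\mathbf{a} \quad \text{or} \quad \vec{a}, \qquad
\hat{\mathbf{a}} = \frac{\mathbf{a}}{\lVert \mathbf{a} \rVert}, \qquad
\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i
\]
```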

Note that Dirac Notation is a different animal.

The APS Style Guide has additional guidelines.

Thesaurus as a Complex Network

06 Thursday Aug 2009

Posted by egarcia in Latent Semantic Indexing, IR Tutorials

I came across Thesaurus as a complex network, a fascinating 2003 paper written by Adriano de Jesus Holanda, Ivan Torres Pisa, Osame Kinouchi, Alexandre Souto Martinez, and Evandro Eduardo Seron Ruiz from Universidade de São Paulo, Brazil, in which they model thesauri using graph theory. The abstract reads:

“A thesaurus is one, out of many, possible representations of term (or word) connectivity. The terms of a thesaurus are seen as the nodes and their relationship as the links of a directed graph. The directionality of the links retains all the thesaurus information and allows the measurement of several quantities. This has lead to a new term classification according to the characteristics of the nodes, for example, nodes with no links in, no links out, etc. Using an electronic available thesaurus we have obtained the incoming and outgoing link distributions. While the incoming link distribution follows a stretched exponential function, the lower bound for the outgoing link distribution has the same envelope of the scientific paper citation distribution proposed by Albuquerque and Tsallis [1]. However, a better fit is obtained by simpler function which is the solution of Ricatti’s differential equation. We conjecture that this differential equation is the continuous limit of a stochastic growth model of the thesaurus network. We also propose a new manner to arrange a thesaurus using the “inversion method”.”

The study is important because it provides an interesting look at word relationships. They have identified an underlying power law, which in my opinion might be worth investigating to see whether it is at the core of semantic relationships.

They briefly mentioned the limitations of LSA:
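
To make the graph model in the abstract concrete, here is a minimal sketch that treats a thesaurus as a directed graph, counts incoming and outgoing links, and flags nodes with no links in or no links out. The toy thesaurus and all variable names are hypothetical and are not data from the paper.

```python
from collections import Counter

# Hypothetical toy thesaurus: each entry points to the terms listed under it.
thesaurus = {
    "big": ["large", "huge"],
    "large": ["big", "huge"],
    "huge": ["enormous"],
    "tiny": ["small"],
}

# Outgoing links: how many terms an entry points to.
out_degree = {term: len(targets) for term, targets in thesaurus.items()}
# Incoming links: how many entries point to a term.
in_degree = Counter(t for targets in thesaurus.values() for t in targets)

nodes = set(thesaurus) | set(in_degree)
no_links_in = sorted(n for n in nodes if in_degree[n] == 0)
no_links_out = sorted(n for n in nodes if out_degree.get(n, 0) == 0)

# Degree distributions: how many nodes have each in-degree and out-degree.
in_distribution = Counter(in_degree[n] for n in nodes)
out_distribution = Counter(out_degree.get(n, 0) for n in nodes)

print(no_links_in, no_links_out)
print(dict(in_distribution), dict(out_distribution))
```

With a real electronic thesaurus in place of the toy dictionary, the two distributions are the quantities one would try to fit with the functions mentioned in the abstract.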

“However, LSA has been criticized as a poor approach for predicting semantic neighborhood”.

Indeed, LSA (or LSI) does not necessarily describe or predict semantics, as originally thought. In my view, LSA/LSI itself is a misnomer. Research references can be provided to support this view.

I do have one additional comment on the paper. In it, LSA is described as a PCA technique. The authors write:

“Another interesting way to treat data is the Latent Semantic Analysis (LSA) [5] which deals with word covariance in a corpus. LSA is a principal component analysis (PCA) technique , i.e., the covariance matrix is diagonalized and from the most important eigenvalues (around 300) the eigenvectors are considered to span an Euclidean vector space.”

This might not be entirely accurate. Let’s see why:

1. PCA was invented by Karl Pearson in 1901, so it is more than half a century older than Golub and Kahan’s SVD algorithm, which was published in 1965. See G. Golub and W. Kahan, J. SIAM Numer. Anal. Ser. B, Vol. 2, No. 2 (1965).

2. In 1988 Dumais et al. applied Golub’s SVD to text and called that LSA (LSI). See Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. & Harshman, R. (1988), Proceedings of the Conference on Human Factors in Computing Systems, CHI, 281-286. See also Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Beck, L. (1988), Improving information retrieval using Latent Semantic Indexing, Proceedings of the 1988 annual meeting of the American Society for Information Science.

3. In LSA (LSI) the SVD algorithm can be applied to matrices that are not necessarily populated with covariance values.

4. It was only later realized that SVD can be applied to a covariance matrix to obtain the PCA components.

5. See the PCA & SPCA Tutorial

6. PCA is not LSI. See http://irthoughts.wordpress.com/2007/05/05/pca-is-not-lsi/
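
The distinction drawn in points 3 and 4 can be written out explicitly. This is a sketch under my own notation, which does not come from the paper: X is an n-by-p data matrix whose column means have been removed, with thin SVD factors U, Σ, and V; C is its covariance matrix; and A is a raw term-document matrix with its own rank-k truncated factors T_k, S_k, and D_k.

```latex
\begin{align*}
X &= U \Sigma V^{T}, \\
C &= \frac{1}{n-1} X^{T} X = V \left( \frac{\Sigma^{2}}{n-1} \right) V^{T}, \\
A &\approx A_k = T_k S_k D_k^{T}.
\end{align*}
```

The first two lines say that the principal components are the right singular vectors of the centered data, with variances given by the squared singular values divided by n - 1. The last line is the LSI step: the SVD is applied to A directly, and nothing requires A to be centered or to hold covariance values.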
