• About IR Thoughts

IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

IR Thoughts

Monthly Archives: March 2008

Search Engines Architecture Week 3

28 Friday Mar 2008

Posted by egarcia in Search Engines Architecture Course

≈ 6 Comments

Week 3 Agenda

Lecture Session

Document Indexing
Web Crawling Techniques

Lab Session

For this lab, students should have already signed to download Terrier from http://ir.dcs.gla.ac.uk/terrier. We can use the Desktop API as is, but for development we need JAVA in the local machine.

Lab instructions for using the API will be provided in class. Please read in advanced Terrier documentation. Bring with you a directory (folder) full of documents from the pupr.edu site or your favorite site to play with. This will be analyzed during the lab.

This lab report is due next week.

PCA and SPCA Tutorial

25 Tuesday Mar 2008

Posted by egarcia in IR Tutorials, Search Engines Architecture Course

≈ Leave a Comment

As promised, the PCA and SPCA Tutorial is now available at Mi Islita.

I wrote this tutorial for two reasons:

1. to help graduate students taking the Search Engines Architecture course with a general overview and review of linear algebra concepts.
2. because most tutorials discuss PCA, but ignore SPCA.

If you are enrolled in the course and have any question, please feel free to post.

Feel also free to comment or make suggestions so we can improve the tutorial.

Verizon, FCC, and the C Block Competition

24 Monday Mar 2008

Posted by egarcia in Hacking, Homeland Security

≈ Leave a Comment

Now that the B and C blocks of the spectrum has been allotted by the FCC things are set-ready-go to open mobile broadband U.S. networks, broadband IR, and, yes, to a whole new hacking space. It’s a matter of time. The C Block hacking competition is coming. Never ignore what can be done with such new playground.

I wonder how the FCC is going to enforce regulations on the 22-MHz portion of the spectrum, already handled to Verizon. http://www.pcworld.com/article/id,143705-c,industrynews/article.html

Meanwhile, IR research centered around open broadband networks are needed, so as search engines.

Important Terrier Upgrade and Call for Participation

20 Thursday Mar 2008

Posted by egarcia in Data Mining, Machine Learning, Search Engines Architecture Course

≈ Leave a Comment

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 – 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Principal Component Analysis Tutorial and Other Algos

18 Tuesday Mar 2008

Posted by egarcia in Data Mining, Machine Learning

≈ 1 Comment

I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is the equivalent of applying Singular Value Decomposition (SVD) on the covariance matrix of a data set. Mistake not PCA for Latent Semantic Indexing (LSI) as these are different animals.

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select a point on the dominant eigenvectors as centroid and apply K-Means.

Depending on the clusters being inspected this simplifies a lot of things. It is not a silver bullet, but just a handy tool in the researcher’s toolbox. It can be combined with or used to support other exploratory analyses, though.

PCA, SVD, Kendall’s Test, Vector Space Theory, and few other statistical algorithms have been applied for years in Chemometrics. Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels, et. al.

Here is a list of some clustering algorithms used by chemists all over the World:

Pattern Recognition Methods

Unsupervised

Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kahonen networks
Rule induction
Probabilistic methods (eg Autoclass).

Supervised

Discriminant analysis
Neural Networks (Back propagation)
SIMCA
K-Nearest Neighbour (KNN)
Probabilistic methods (eg Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back and in the limelight. There is nothing new under the Sun, sort of speak.

I’m glad I also have a background in Chemistry.

Some times it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find soooo hilarious SEO tales about how SVD and LSI algorithms work.

Search Engines Architecture Week 2

14 Friday Mar 2008

Posted by egarcia in IR Tutorials, Latent Semantic Indexing, Search Engines Architecture Course

≈ 11 Comments

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
First-Breadth and Deep-First Web Crawlers
The Terrier Desktop Searches Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of ATA and the trace of this matrix. Do the same for AkTAk. Compare results and draw some conclusions. What important property is confirmed?

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize VkT and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from AkTAk. Explain your observations.

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286,
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith Lindsay (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations
Boldi, P., S. Massimo, V. Sebastiano Vigna (2007)
PDF

Back Mapping for SEOs

13 Thursday Mar 2008

Posted by egarcia in Data Mining, Machine Learning

≈ Leave a Comment

A reader asked me how he could apply back mapping (BM) to SEO work.

Essentially, any algorithm like BM that works at the level of collections can be scaled down and used at a page level. One just needs to decompose a page into passages (chunks of text or text windows) and treat each passage as a pseudo document.

First, convert a document into a text stream; i.e., linearize, tokenize, and filtrate the document. You should end with a linearized piece of text.

 If it turns out that the linearized document is almost identical to the original document,  you might want to go back and partition the original document every x paragraphs or sentences. If not, partition the text stream so you end with segments or text windows. You need to decide if you want to do this very x number of words or characters.

Next, treat each segment or chunk of text as a “document” and construct a term-”document” matrix “A”.

If you want to narrow this down to the level of links, construct a term-link matrix.

 In general, an algorithm that applies to a true term-document matrix A can be applied to “A”.

Welcome to scaling.

How Many SEO Myths In One Sentence?

12 Wednesday Mar 2008

Posted by egarcia in Latent Semantic Indexing, SEO Myths

≈ Leave a Comment

When I thought I have read enough SEO myths here is another SEO “expert” combining many of these in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

Back Mapping Document Clusters to Terms

11 Tuesday Mar 2008

Posted by egarcia in Newsletters

≈ Leave a Comment

Back Mapping

Back Mapping is a simple technique that
elegantly identifies local and global topics.

Here is a sneak preview of the current issue of IR Watch – The Newsletter. It should be in subscribers inbox during the day.

IR Watch 2008-03: Back Mapping Documents to Terms

Keywords: back mapping, documents, terms, association clusters, scalar clusters

In this issue:

Introduction
Revisiting Association and Scalar Clusters
On Back Mapping
Back Mapping Example
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Document Clusters to Terms
Identifying Local and Global Topics
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Search Engines for Penetration Testing Course

07 Friday Mar 2008

Posted by egarcia in Graduate Courses

≈ 3 Comments

More on the topic of college education and search engines:

In the Caribbean we don’t have seasons, just early spring breaks –compared to back in the states. After a 2-week spring break vacation, the new trimester just started. I am teaching tomorrow the Search Engines Architecture Course. This is going to be fun.

BTW. I am putting together a new graduate course for the next fall semester: Search Engines for Penetration Testing. I look forward to any suggestion from colleagues and from former  or prospective students. This is going to be a nirvana for ethical hackers/spammers.

← Older posts
March 2008
M T W T F S S
« Feb   Apr »
 12
3456789
10111213141516
17181920212223
24252627282930
31  

Favorite Sites

  • Mi Islita

Pages

  • About IR Thoughts

Categories

  • AIRWeb Course
  • Conferences
  • Data Mining
  • Dynamics
  • Fractal Geometry
  • Graduate Courses
  • Hacking
  • Homeland Security
  • Human-Computer Interaction
  • Image Compression
  • Internet Engineering
  • IR Quizzes
  • IR Tools
  • IR Tutorials
  • Latent Semantic Indexing
  • Legacy Posts
  • Machine Learning
  • Marketing Research
  • Miscellaneous
  • News
  • Newsletters
  • Programming
  • Quack Science
  • Queries
  • Scripts
  • Search Engines Architecture Course
  • SEO Myths
  • Software
  • Spam
  • Statistics and Mathematics
  • Theses
  • Vector Space Models
  • Web Mining Course

Recent Posts

  • “Powered by” in Spanish
  • Some nice features added to the Image Crawler
  • The Images Crawler
  • A nice service for my locals
  • An update to the Web Crawler
  • New similarity measures
  • The Web Crawler is Back!
  • Tracking Users: An Email Crawler on Steroids
  • The Email Crawler: A Tool for Gathering Emails
  • The Binary Distance Calculator – a tool for comparing binary sets
  • Fractalettes: A Fractal Design Strategy to Color Mining and Learning through Discovery
  • AZZOO and WAZZOO: New Similarity Measures for the 21st Century
  • The Binary Similarity Calculator
  • From Harlem Shake to Link Shake: The Qualified Links Shake
  • Web Vulnerabilities and Search Engines

Archives

  • May 2013
  • April 2013
  • March 2013
  • February 2013
  • January 2013
  • December 2012
  • November 2012
  • October 2012
  • September 2012
  • August 2012
  • July 2012
  • June 2012
  • May 2012
  • April 2012
  • March 2012
  • February 2012
  • January 2012
  • December 2011
  • November 2011
  • October 2011
  • September 2011
  • August 2011
  • July 2011
  • June 2011
  • May 2011
  • April 2011
  • February 2011
  • January 2011
  • December 2010
  • November 2010
  • October 2010
  • September 2010
  • August 2010
  • July 2010
  • June 2010
  • May 2010
  • April 2010
  • March 2010
  • February 2010
  • January 2010
  • December 2009
  • November 2009
  • October 2009
  • September 2009
  • August 2009
  • July 2009
  • June 2009
  • May 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • September 2008
  • August 2008
  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007

AIRWeb Course Conferences Data Mining Fractal Geometry Graduate Courses Hacking Homeland Security Human-Computer Interaction Internet Engineering IR Quizzes IR Tools IR Tutorials Latent Semantic Indexing Legacy Posts Machine Learning Marketing Research Miscellaneous Newsletters Programming Quack Science Queries Scripts Search Engines Architecture Course SEO Myths Software Spam Statistics and Mathematics Theses Vector Space Models Web Mining Course

Blog at WordPress.com. Theme: Chateau by Ignacio Ricci.