IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Theses

Outstanding IR Theses

18 Tuesday Nov 2008

Posted by egarcia in Newsletters, Theses

Thank you to all new researchers who have signed up to receive IRW, now in its new format. The following theses are listed in the Outstanding Graduate Theses column:

DNIDS: A Dependable Network Intrusion Detection System Using the CSI-KNN Algorithm
PDF

A Hybrid Knowledge-based/Content-based Recommender System in the Bluejay Genome Browser
PDF

Exploitation of Redundant Inverse Term Frequency for Answer Extraction
PDF

Improving the effectiveness of information retrieval with genetic programming
PDF

BTW, our next issue is closed and will go out by the first of December. The featured article is about identity theft through search engines.

Zeeker: A Topic-Based Search Engine

07 Tuesday Oct 2008

Posted by egarcia in Theses

I came across a 2007 graduate thesis that describes the Zeeker Search Engine. It is accessible at

http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/5502/pdf/imm5502.pdf

I am happy to learn that it quotes my article The Keyword Density of Non Sense and even republishes two of my images from that article on pages 9 and 10. While it mentions this author and the article’s title, no reference was included in the body or reference sections of the thesis, so readers would need to search the Web to find the article.

Fortunately, the thesis mentions the importance of document linearization, a topic we discussed in the ‘KD Non-Sense’ article and one that many SEOs still don’t get.

The thesis is worth a read and explains the Spherical K-Means and NMF algorithms in easy terms. I believe it deserves to be nominated for the next issue of IRW. Read the thesis and let me know what you think.

A temporary holding page is available at

http://fredmus.com/

The music version of Zeeker is available at

http://muzeeker.com/Index.aspx

Outstanding Graduate Theses

06 Monday Oct 2008

Posted by egarcia in Theses

The following have been nominated in IRW as Outstanding Graduate Theses

Identification of Saudi Arabian License Plates
http://library.kfupm.edu.sa/lib-downloads/A1Y685.pdf

A Language-Based Approach to Categorical Analysis
http://alumni.media.mit.edu/~cameron/cv/pubs/01-thesis.pdf

Just in time Information Retrieval
http://www.bradleyrhodes.com/Papers/rhodes-phd-JITIR.pdf

Sneak Preview of IRW: Graduate Research

01 Friday Aug 2008

Posted by egarcia in Graduate Courses, Machine Learning, Marketing Research, Theses

The current issue of IRW, Graduate Students Research, is out. It consists of short abstracts of research conducted by graduate students.

In this issue:

Introduction
Genetic Algorithms, K-Means, and Fuzzy C-Means
Word Association Patterns
U-Site Search Engine Interface
Enhancement of a U-Site Search Engine Interface
News, Research, and Events
Terms of Use and Copyright

The next issue will go back to its how-to mode.

The Porter Stemmer

18 Friday Jul 2008

Posted by egarcia in Machine Learning, Programming, Theses

A grad student asks (name omitted):

Dear Dr. Garcia,

I’m interested in developing a Porter Stemmer for the Irish language.

Would it be possible to send me your lecture notes for Porter Stemmer
development from your graduate course?

I am doing an MSc thesis on developing a search engine for TEI marked up
multilingual texts and hope to use Apache Lucene as a basis.

Thanks for any help,

******
MSc student,
UCC, Cork, Ireland.

Thank you for reading this blog and for emailing me, but I normally don’t release lecture notes. However, the lecture was based on Martin Porter’s site, which can be accessed by visiting the following link:

http://tartarus.org/~martin/PorterStemmer/

You might also want to check the Porter2 Stemmer:
http://snowball.tartarus.org/algorithms/english/stemmer.html

For stemmers in other languages, check the Snowball site:
http://snowball.tartarus.org/

The great thing about the Porter stemmer is that it has been implemented in many programming languages.

Claps and Slaps

14 Monday Jul 2008

Posted by egarcia in Latent Semantic Indexing, Machine Learning, SEO Myths, Theses

Claps

Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ), who conducts research in LSI and a few other great areas at the intersection of IR, NLP, and AI, wrote to tell me that he is close to finishing his graduate thesis. Thanks, David, for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.

Slaps

There is something funny about SEOs that sell snake oil ( http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ). They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, Markov chains, LSI, and keyword density are a few examples. Ha, ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.

The following links provide additional information about AIRWeb and EMNLP

AIRWeb
http://irthoughts.wordpress.com/2008/04/29/for-seo-spammers-airweb-2008-presentations/  

EMNLP
http://conferences.inf.ed.ac.uk/emnlp08/

Thesis: DNIDS Using the CSI-KNN Algorithm

04 Friday Jan 2008

Posted by egarcia in Homeland Security, Theses

Here is a great 2007 MS thesis by Liwei (Vivian) Kuang of the School of Computing, Queen’s University, Kingston, Ontario, Canada: DNIDS: A Dependable Network Intrusion Detection System Using the CSI-KNN Algorithm

I’m happy she quoted my Cosine Similarity Tutorial.

Part of the abstract states: “In this thesis, we propose a Dependable Network Intrusion Detection System (DNIDS) based on the Combined Strangeness and Isolation measure K-Nearest Neighbor (CSI-KNN) algorithm. The DNIDS can effectively detect network intrusions while providing continued service even under attacks. The intrusion detection algorithm analyzes different characteristics of network data by employing two measures: strangeness and isolation. Based on these measures, a correlation unit raises intrusion alerts with associated confidence estimates. In the DNIDS, multiple CSI-KNN classifiers work in parallel to deal with different types of network traffic. An intrusion-tolerant mechanism monitors the classifiers and the hosts on which the classifiers reside and enables the IDS to survive component failure due to intrusions. As soon as a failed IDS component is discovered, a copy of the component is installed to replace it and the detection service continues.”

“We evaluate our detection approach over the KDD’99 benchmark dataset. The experimental results show that the performance of our approach is better than the best result of the KDD’99 contest winner. In addition, the intrusion alerts generated by our algorithm provide graded confidence that offers some insight into the reliability of the intrusion detection. To verify the survivability of the DNIDS, we test the prototype in simulated attack scenarios. In addition, we evaluate the performance of the intrusion-tolerant mechanism and analyze the system reliability.”

Digression to Numerical Dynamics and Chaos

27 Thursday Sep 2007

Posted by egarcia in Programming, Theses

I recently reviewed a thesis project wherein the student used recursion to minimize a function. The student used the number of iterations as a stopping criterion. When I reviewed the manuscript, it gave me a flashback to 1991.

Back then I was immersed in Numerical Dynamics and Chaos Theory conferences. I wrote a paper wherein one little component consisted of stopping a routine if it reached 10,000 iterations. The piece landed in the hands of a picky reviewer. It was rejected on the grounds that the number of iterations was not a good stopping criterion.

The reviewer was absolutely right. I missed the point back then, though.

Why use 1,000, 10,000, or 1,000,000 iterations? The point is that no matter how many iterations one uses there will always be someone out there asking: Why use n number of iterations? Why not more?

Indeed, using the number of iterations as a stopping criterion is not recommended, especially if the goal is optimization. It is also a subjective approach that requires initializing an extra parameter, and that parameter might be sensitive to other variables or initial conditions.

Instead of the number of iterations, a better approach consists of using a comparative function F. F can be defined as the absolute error or the absolute relative error between two consecutive results.

If F is less than a threshold value, we stop the iterative procedure; otherwise, we continue with it.

Such functions are given below:

F = | X(n) – X(n+1) | < threshold value
F = | (X(n) – X(n+1)) / X(n+1) | < threshold value

and so forth.

For example, if F is less than, let’s say, 10^-6, we stop the recursion.

In this case, F is based on an objective, statistical criterion: the relative error between two consecutive results. Note that:

1. the analyst states the threshold value that best suits his/her precision needs.

2. F is independent of any initial condition or parameter.

Of course, for this to work the recursion should converge toward what we call in Chaos Theory an attractive fixed point. Methods for evaluating whether this is the case do exist in Numerical Dynamics (e.g., the Fixed Point Theorem).
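
As a minimal sketch of the stopping rule described above, assuming Python and the map x -> cos(x) (a stock example with an attractive fixed point, not the student's problem), the loop below stops when the absolute relative error between two consecutive results drops below the threshold. The names `relative_error`, `iterate_until_converged`, `step`, and `max_iter` are hypothetical. The iteration cap is kept only as a guard against a map that never settles, and hitting it is reported as a failure rather than treated as a result.

```python
import math


def relative_error(x_n, x_next):
    """Absolute relative error between two consecutive results.

    Assumes x_next is not zero; use the absolute error instead
    when the sequence may pass through zero.
    """
    return abs((x_n - x_next) / x_next)


def iterate_until_converged(step, x0, threshold=1e-6, max_iter=100_000):
    """Apply x -> step(x) until the relative error falls below threshold.

    max_iter is only a safety net; reaching it raises an error
    instead of silently returning the last value.
    """
    x_n = x0
    for n in range(1, max_iter + 1):
        x_next = step(x_n)
        if relative_error(x_n, x_next) < threshold:
            return x_next, n
        x_n = x_next
    raise RuntimeError("no convergence: is the fixed point attractive?")


# x = cos(x) has an attractive fixed point near 0.739
root, iterations = iterate_until_converged(math.cos, x0=1.0)
print(root, iterations)
```

Tightening or loosening `threshold` changes how many iterations are run, which is the point: the precision requirement drives the iteration count, not the other way around.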

Fortunately, the student’s problem was not of this kind and he was not dealing with a dynamical system at all.

Co-Weight or Co-Occurrence Matrices?

05 Wednesday Sep 2007

Posted by egarcia in Data Mining, Latent Semantic Indexing, Machine Learning, Theses, Vector Space Models

A few months ago I reviewed a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let A^T be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*A^T

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.
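
The definitions above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the weights in `A` are made up, the function names are hypothetical, and the Jaccard restatement is assumed to be Cn(i,k) = Cu(i,k) / (Cu(i,i) + Cu(k,k) – Cu(i,k)), since the post does not spell out the formula.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    b_cols = transpose(b)  # rows of b_cols are the columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def co_weight(a):
    """Unnormalized co-weight matrix Cu = A * A^T (terms x terms)."""
    return matmul(a, transpose(a))


def jaccard_normalize(cu):
    """Assumed normalization: Cu[i][k] / (Cu[i][i] + Cu[k][k] - Cu[i][k])."""
    n = len(cu)
    cn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            denom = cu[i][i] + cu[k][k] - cu[i][k]
            cn[i][k] = cu[i][k] / denom if denom else 0.0
    return cn


# Hypothetical 3-term x 2-document matrix of weights aij = Lij*Gi*Nj
A = [
    [0.5, 0.0],
    [0.2, 0.4],
    [0.0, 0.3],
]
Cu = co_weight(A)
Cn = jaccard_normalize(Cu)
for row in Cn:
    print([round(v, 3) for v in row])
```

Under this assumed normalization, each term has similarity 1 with itself, and two terms that never share a document have similarity 0.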

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij, where fij is the number of occurrences of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless of whether terms occur many times in documents. Thus, binary co-occurrence (and therefore, binary co-weight) matrices are particular cases.
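
To make the special cases concrete, here is a small sketch in plain Python, assuming made-up raw counts fij and hypothetical helper names. With Nj = 1, Gi = 1, and Lij = fij, the same product A*A^T yields an unnormalized co-occurrence matrix, and binarizing the counts first yields the binary version.

```python
def co_matrix(a):
    """A * A^T for a term-document matrix given as a list of term rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in a] for r1 in a]


def binarize(a):
    """Replace each weight with 1 (term present) or 0 (term absent)."""
    return [[1 if v > 0 else 0 for v in row] for row in a]


# Hypothetical raw counts fij: 3 terms x 3 documents
F = [
    [2, 0, 1],
    [1, 3, 0],
    [0, 1, 1],
]

co_occurrence = co_matrix(F)                   # aij = fij
binary_co_occurrence = co_matrix(binarize(F))  # aij = 1 or 0

print(co_occurrence)
print(binary_co_occurrence)
```

In the binary case, each off-diagonal entry counts the documents in which both terms appear, and each diagonal entry is the number of documents containing that term.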

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fij. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling something a “co-occurrence” matrix when it is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advise graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

Reviewing Papers: How-To

16 Thursday Aug 2007

Posted by egarcia in Conferences, Legacy Posts, Theses

As a reviewer of journal manuscripts and conference papers, I normally look to see if the piece before me answers the following questions:

1. WHAT-WHY: What is the scientific problem at hand and why is it important?
2. WHO-WHAT-WHY: Who proposed what previous solutions and why are these inadequate or incomplete?
3. WHAT-YOUR-WHY: What is your proposed solution and why is it better?
4. HOW-WHAT: How is the solution implemented and what are the benefits or practical applications?
5. PROS-CONS-WHAT: What are the possible pros and cons of your solution and what are the next areas of research?
