Archive for the ‘Theses’ Category

Thesis: DNIDS Using the CSI-KNN Algorithm

January 4, 2008

Here is a great 2007 MS Thesis from Liwei (Vivian) Kuang from School of Computing, Queen’s University, Kingston, Ontario, Canada. DNIDS: A Dependable Network Intrusion Detection System Using the CSI-KNN Algorithm

I’m happy she quoted my Cosine Similarity Tutorial.

Part of the abstract states: “In this thesis, we propose a Dependable Network Intrusion Detection System(DNIDS) based on the Combined Strangeness and Isolation measure K-Nearest Neighbor(CSI-KNN) algorithm. The DNIDS can effectively detect network intrusionswhile providing continued service even under attacks. The intrusion detection algorithmanalyzes different characteristics of network data by employing two measures:strangeness and isolation. Based on these measures, a correlation unit raises intrusionalerts with associated confidence estimates. In the DNIDS, multiple CSI-KNNclassifiers work in parallel to deal with different types of network traffic. An intrusiontolerantmechanism monitors the classifiers and the hosts on which the classifiers resideand enables the IDS to survive component failure due to intrusions. As soon asa failed IDS component is discovered, a copy of the component is installed to replaceit and the detection service continues.”

“We evaluate our detection approach over the KDD’99 benchmark dataset. Theexperimental results show that the performance of our approach is better than the bestresult of the KDD’99 contest winner. In addition, the intrusion alerts generated byour algorithm provide graded confidence that offers some insight into the reliabilityof the intrusion detection. To verify the survivability of the DNIDS, we test theprototype in simulated attack scenarios. In addition, we evaluate the performanceof the intrusion-tolerant mechanism and analyze the system reliability.”

Disgression to Numerical Dynamics and Chaos

September 27, 2007

I recently reviewed a thesis project wherein the student used recursion to minimize a function. The student used number of iterations as a stopping criterion. When I revised the manuscript, it gave me a flashback to 1991.

Back then I was inmersed into Numerical Dynamics and Chaos Theory conferences. I wrote a paper wherein one little component consisted in stopping a routine if it reaches 10,000 iterations. The piece landed in the hands of a picky reviewer. It was rejected on the grounds that number of iterations was not a good stopping criterion.

The reviewer was absolutely right. I missed the point back then, though.

Why use 1,000, 10,000, or 1,000,000 iterations? The point is that no matter how many iterations one uses there will always be someone out there asking: Why use n number of iterations? Why not more?

Indeed, number of iterations as a stopping criterion is not recommended, especially if the goal is optimization. Not to mention that this is a subjective approach which requires initializing an extra parameter. The parameter might be sensitive to other variables or initial conditions.

Instead of number of iterations, a better approach consists in using a comparative function F. F can be defined as the absolute error or absolute relative error between a sequence of results.

If F is less than a threshold value, we stop the iterative procedure; otherwise, we continue with it.

Such functions are given below:

F = | (X(n) - X(n+1)) | * 100 < threshold value
F = | (X(n) - X(n+1))/X(n+1) | * 1000 < threshold value

and so forth.

For example if F is less than, let say, 10^-6, we stop the recursions.

In this case, F is based on an objective, statistical criterion: the relative error between two consecutive results. Note that:

1. the analyst states the threshold value that best sweets his/her precision needs.

2. F is independent of any initial condition or parameter.

Of course, for this to work the recursion should be toward what we call in Chaos Theory, an attractive fixed point. Methods for evaluating if this is the case do exist in Numerical Dynamics (e.g., the Fixed Point Theorem).

Fortunately, the student’s problem was not of this kind and he was not dealing with a dynamical system at all.

Co-Weight or Co-Occurrence Matrices?

September 5, 2007

I reviewed few months ago a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless if terms occur many times in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fi. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advice graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.

Reviewing Papers: How-To

August 16, 2007

As reviewer of journal manuscripts and conference papers I normally look to see if the piece before me answers the following questions:

1. WHAT-WHY: What is the scientific problem at hand and why is important?
2. WHO-WHAT-WHY: Who proposed what previous solutions and why are these inadequate or incomplete?
3. WHAT-YOUR-WHY: What is your proposed solution and why is better?
4. HOW-WHAT: How is the solution implemented and what are the benefits or practical applications?
5. PROS-CONS-WHAT: What are the possible pros and cons of your solution and what are the next areas of research?

(more…)

Thesis on Redundant ITF

July 17, 2007

Thomas Richard Lynam, has researched extensively a variant of ITF called Redundant ITF (RITF). His 2002 master thesis, “Exploitation of Redundant Inverse Term Frequency”, is a must-read for anyone interested in the topic. His thesis is available as a PDF and Postscript.

The justification for using RITF is as follows.

(more…)

Thesis: A Hybrid Knowledge-based/Content-based Recommender

May 10, 2007

Here are some great news:

1. I am getting ready for my presentation at the Intektel International Conference and Expo. I am presenting the second day of the conference on “The Impact of Search Engines in the Internet”.

2. Next week we have the ARIN Conference (American Registry of Internet Numbers) in Puerto Rico, and in June we have also in San Juan, PR the 29th ICANN Conference. WOW!

3. Taschuk Morgan has written an excellent Honour Thesis in which kindly references our tutorial on Cosine Similarity and Term Weights. Morgan writes:

(more…)

Thesis: Understanding LSI via the Term-Term Truncated Matrix

May 10, 2007

As we mentioned in IR Watch - The Newsletter (got a free subscription?), although LSI (LSA) itself is not first-order co-occurrence (see Prof. Tom Landauer: Introduction to Latent Semantic Analysis), a recent thesis from Regis Newo shows that high-order co-occurrence might be at the heart of LSI and is what makes the technique works. This 2005 thesis abstract on Understanding LSI via the Truncated Term-Term Matrix states:

(more…)

Thesis: Information Retrieval with Genetic Programming

May 10, 2007

Here is the 2002 master thesis of Nir Oren, University of the Witwatersand, Johannnesburg:

Improving the effectiveness of information retrieval with genetic programming

where he proposes an interesting approach to IR using genetic algorithms. Part of his abstract states:

(more…)

Thesis: A Language-Based Approach to Categorical Analysis

May 10, 2007

I am finishing reading the 2001 Master Thesis of Cameron Alexander Marlow, from MIT:
A Language-Based Approach to Categorical Analysis

where he proposed the use of Synchronic Imprints (SI) combined with LSI. Great thesis. Essentially, SI incorporates a spring model in which term frequencies are inversely proportional to their distances.

(more…)