IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Theses

What is the most effective way of writing scientific research articles?

15 Friday Jun 2012

Posted by egarcia in Data Mining, Graduate Courses, Miscellaneous, Newsletters, Theses


Over the years, I’ve been asked about the most effective way of writing peer-reviewed articles for scientific journals.

My response is always the same: Think like a referee/editor. Here is a list of items that they want to see accomplished:

Referees/editors like to see that the content and format of the title, abstract, document body, tables, images, graphics, appendices, and references follow their journal guidelines.

In general, referees/editors like to see on the first page of the printed version of an article:

1. Statement of the problem – what is the problem to be solved.
2. Purpose of the article – how the present research solves the problem.
3. Organization – how the article is organized and what is covered in each section.

This is a general practice across scientific journals. So, whenever possible, I try to accomplish 1 – 3 in the first three paragraphs of the first page of the printed article. To do this, you need to avoid lengthy introductions and wordiness. Be concise and ‘get to the point’.

Referees/editors also like to see the article as a whole semantic unit. So they like to see:

1. Transitional statements; i.e., sections ending as an introduction to the next section.
2. One paragraph, one idea; i.e., each paragraph discussing one main idea.
3. Short paragraphs; i.e., each paragraph of about five sentences or fewer, where sentences are of appropriate length. This provides a natural stop to the reading. In general, short paragraphs and sentences are easier to read than long ones. Use compound sentences with caution.
4. Facts supported by pertinent references.
5. Opinions written as opinions, not as facts.

Of course, there are other tips to think about, but in my opinion, the above can make a difference… well, in my opinion :)

Outstanding IR Theses

18 Tuesday Nov 2008

Posted by egarcia in Newsletters, Theses


Thank you to all the new researchers who have signed up to receive IRW, now in its new format. The following theses are listed in the Outstanding Graduate Theses column:

DNIDS: A Dependable Network Intrusion Detection System Using the CSI-KNN Algorithm
PDF

A Hybrid Knowledge-based/Content-based Recommender System in the Bluejay Genome Browser
PDF

Exploitation of Redundant Inverse Term Frequency for Answer Extraction
PDF

Improving the effectiveness of information retrieval with genetic programming
PDF

BTW, our next issue is closed and will go out by the first of December. The featured article is about identity theft through search engines.

Zeeker: A Topic-Based Search Engine

07 Tuesday Oct 2008

Posted by egarcia in Theses


I came across a 2007 graduate thesis that describes the Zeeker Search Engine. It is accessible at

http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/5502/pdf/imm5502.pdf

I am happy to learn that it quotes my article The Keyword Density of Non Sense and even republishes two of my images from that article on pages 9 and 10. While it mentions this author and the article’s title, no reference was included in the body or reference sections of the thesis, so readers would need to search the Web to find the article.

Fortunately, the thesis mentions the importance of document linearization, a topic we discussed in the ‘KD Non-Sense’ article and one that many SEOs still don’t get.

The thesis is worth a read and explains the Spherical K-Means and NMF algorithms in easy terms. I believe it deserves to be nominated for the next issue of IRW. Read the thesis and let me know what you think.

A temporary holding page is available at

http://fredmus.com/

The music version of Zeeker is available at

http://muzeeker.com/Index.aspx

Outstanding Graduate Theses

06 Monday Oct 2008

Posted by egarcia in Theses


The following have been nominated in IRW as Outstanding Graduate Theses:

Identification of Saudi Arabian License Plates
http://library.kfupm.edu.sa/lib-downloads/A1Y685.pdf

A Language-Based Approach to Categorical Analysis
http://alumni.media.mit.edu/~cameron/cv/pubs/01-thesis.pdf

Just in time Information Retrieval
http://www.bradleyrhodes.com/Papers/rhodes-phd-JITIR.pdf

Sneak Preview of IRW: Graduate Research

01 Friday Aug 2008

Posted by egarcia in Graduate Courses, Machine Learning, Marketing Research, Theses


The current issue of IRW, Graduate Students Research, is out. It consists of short abstracts of research conducted by graduate students.

In this issue:

Introduction
Genetic Algorithms, K-Means, and Fuzzy C-Means
Word Association Patterns
U-Site Search Engine Interface
Enhancement of a U-Site Search Engine Interface
News, Research, and Events
Terms of Use and Copyright

The next issue will go back to its how-to mode.

The Porter Stemmer

18 Friday Jul 2008

Posted by egarcia in Machine Learning, Programming, Theses


A grad student asks (name omitted):

Dear Dr. Garcia,

I’m interested in developing a Porter Stemmer for the Irish language.

Would it be possible to send me your lecture notes for Porter Stemmer
development from your graduate course?

I am doing an MSc thesis on developing a search engine for TEI marked up
multilingual texts and hope to use Apache Lucene as a basis.

Thanks for any help,

******
MSc student,
UCC, Cork, Ireland.

Thank you for reading this blog and for emailing me, but I normally don’t release lecture notes. However, the lecture was based on Martin Porter’s site, which can be accessed by visiting the following link:

http://tartarus.org/~martin/PorterStemmer/

You might also want to check the Porter2 Stemmer:
http://snowball.tartarus.org/algorithms/english/stemmer.html

For stemmers in other languages, check the Snowball site:
http://snowball.tartarus.org/

The great thing about the Porter stemmer is that it has been written in many programming flavors and languages.

Claps and Slaps

14 Monday Jul 2008

Posted by egarcia in Latent Semantic Indexing, Machine Learning, SEO Myths, Theses


Claps

Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ), who conducts research in LSI and a few other great areas at the intersection of IR, NLP, and AI, wrote to tell me that he is close to finishing his grad thesis. Thanks, David, for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.

Slaps

There is something funny about SEOs that sell snake oil ( http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ). They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, Markov chains, LSI, and keyword density are a few examples. Ha, Ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.

The following links provide additional information about AIRWeb and EMNLP:

AIRWeb
http://irthoughts.wordpress.com/2008/04/29/for-seo-spammers-airweb-2008-presentations/  

EMNLP
http://conferences.inf.ed.ac.uk/emnlp08/

Thesis: DNIDS Using the CSI-KNN Algorithm

04 Friday Jan 2008

Posted by egarcia in Homeland Security, Theses


Here is a great 2007 MS Thesis by Liwei (Vivian) Kuang of the School of Computing, Queen’s University, Kingston, Ontario, Canada: DNIDS: A Dependable Network Intrusion Detection System Using the CSI-KNN Algorithm.

I’m happy she quoted my Cosine Similarity Tutorial.

Part of the abstract states: “In this thesis, we propose a Dependable Network Intrusion Detection System (DNIDS) based on the Combined Strangeness and Isolation measure K-Nearest Neighbor (CSI-KNN) algorithm. The DNIDS can effectively detect network intrusions while providing continued service even under attacks. The intrusion detection algorithm analyzes different characteristics of network data by employing two measures: strangeness and isolation. Based on these measures, a correlation unit raises intrusion alerts with associated confidence estimates. In the DNIDS, multiple CSI-KNN classifiers work in parallel to deal with different types of network traffic. An intrusion-tolerant mechanism monitors the classifiers and the hosts on which the classifiers reside and enables the IDS to survive component failure due to intrusions. As soon as a failed IDS component is discovered, a copy of the component is installed to replace it and the detection service continues.”

“We evaluate our detection approach over the KDD’99 benchmark dataset. The experimental results show that the performance of our approach is better than the best result of the KDD’99 contest winner. In addition, the intrusion alerts generated by our algorithm provide graded confidence that offers some insight into the reliability of the intrusion detection. To verify the survivability of the DNIDS, we test the prototype in simulated attack scenarios. In addition, we evaluate the performance of the intrusion-tolerant mechanism and analyze the system reliability.”

Digression to Numerical Dynamics and Chaos

27 Thursday Sep 2007

Posted by egarcia in Programming, Theses


I recently reviewed a thesis project wherein the student used recursion to minimize a function. The student used the number of iterations as a stopping criterion. When I reviewed the manuscript, it gave me a flashback to 1991.

Back then I was immersed in Numerical Dynamics and Chaos Theory conferences. I wrote a paper wherein one little component consisted of stopping a routine once it reached 10,000 iterations. The piece landed in the hands of a picky reviewer. It was rejected on the grounds that the number of iterations was not a good stopping criterion.

The reviewer was absolutely right. I missed the point back then, though.

Why use 1,000, 10,000, or 1,000,000 iterations? The point is that no matter how many iterations one uses there will always be someone out there asking: Why use n number of iterations? Why not more?

Indeed, the number of iterations as a stopping criterion is not recommended, especially if the goal is optimization. Not to mention that this is a subjective approach which requires initializing an extra parameter. The parameter might be sensitive to other variables or initial conditions.

Instead of the number of iterations, a better approach consists of using a comparative function F. F can be defined as the absolute error or the absolute relative error between two consecutive results.

If F is less than a threshold value, we stop the iterative procedure; otherwise, we continue with it.

Such functions are given below:

F = | (X(n) – X(n+1)) | * 100 < threshold value
F = | (X(n) – X(n+1))/X(n+1) | * 1000 < threshold value

and so forth.

For example, if F is less than, let’s say, 10^-6, we stop the recursion.

In this case, F is based on an objective, statistical criterion: the relative error between two consecutive results. Note that:

1. the analyst states the threshold value that best suits his/her precision needs.

2. F is independent of any initial condition or parameter.

Of course, for this to work the recursion should converge toward what we call in Chaos Theory an attractive fixed point. Methods for evaluating whether this is the case do exist in Numerical Dynamics (e.g., the Fixed Point Theorem).
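To make the idea concrete, here is a minimal Python sketch of a recursion that stops on the relative error between two consecutive results. The function name, the cosine map, and the starting value are my own illustrative assumptions, not taken from the student's thesis. I leave out the constant scale factors, since they only rescale the threshold. The `safety_cap` argument is also an assumption: it is a guard that raises an error if the recursion never settles, not a stopping criterion.

```python
import math

def iterate_until_stable(g, x0, threshold=1e-6, safety_cap=10**7):
    """Apply x(n+1) = g(x(n)) until the relative error F drops below threshold.

    Hypothetical helper for illustration. It presumes the recursion heads
    toward an attractive fixed point; safety_cap only guards against a
    runaway loop and raises instead of returning a result.
    """
    x_n = x0
    for _ in range(safety_cap):
        x_next = g(x_n)
        diff = abs(x_n - x_next)
        # F = |(X(n) - X(n+1)) / X(n+1)|; fall back to the absolute
        # error when X(n+1) is exactly zero to avoid dividing by zero.
        F = diff / abs(x_next) if x_next != 0 else diff
        if F < threshold:
            return x_next
        x_n = x_next
    raise RuntimeError("F never fell below the threshold; check for an attractive fixed point")

# Example: x = cos(x) has an attractive fixed point near 0.739.
fixed_point = iterate_until_stable(math.cos, 1.0)
print(fixed_point)
```

The only number the analyst chooses here is the threshold, which expresses the precision needed rather than an arbitrary amount of work.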

Fortunately, the student’s problem was not of this kind and he was not dealing with a dynamical system at all.

Co-Weight or Co-Occurrence Matrices?

05 Wednesday Sep 2007

Posted by egarcia in Data Mining, Latent Semantic Indexing, Machine Learning, Theses, Vector Space Models


A few months ago I reviewed a research manuscript and a thesis wherein the same author indiscriminately used the expression “a co-occurrence matrix”. The author, a graduate student and friend, allowed me to post this, since we think it may be of benefit to other graduate students.

Co-Weight Matrices

Let A be a term-document matrix populated with term weights, aij, where aij is the weight of term i in document j, and defined as follows:

aij = Lij*Gi*Nj

Lij = a local weight
Gi = a global weight
Nj = a normalization weight

Let AT be the transpose of A. Consequently, an unnormalized co-weight matrix, Cu, is defined as

Cu = A*AT

Cu can be normalized by restating its elements as Jaccard’s Coefficients, in which case a normalized co-weight matrix, Cn, is obtained. If Jaccard’s Coefficients are taken for similarity measures, then Cn is a normalized similarity matrix.
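The definitions above can be sketched in plain Python. The helper names and the toy weights are hypothetical, chosen only for illustration. The Jaccard form used here, Cn_ik = Cu_ik / (Cu_ii + Cu_kk - Cu_ik), is an assumption on my part, since the exact normalization formula is not spelled out above.

```python
def term_weights(L, G, N):
    """Build A with a_ij = L_ij * G_i * N_j (rows are terms, columns are documents)."""
    return [[L[i][j] * G[i] * N[j] for j in range(len(N))] for i in range(len(G))]

def co_weight(A):
    """Unnormalized co-weight matrix Cu = A * A^T (terms by terms)."""
    return [[sum(a * b for a, b in zip(row_i, row_k)) for row_k in A] for row_i in A]

def jaccard_normalize(Cu):
    """Normalized matrix Cn, assuming Cn_ik = Cu_ik / (Cu_ii + Cu_kk - Cu_ik)."""
    n = len(Cu)
    Cn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            denom = Cu[i][i] + Cu[k][k] - Cu[i][k]
            Cn[i][k] = Cu[i][k] / denom if denom else 0.0
    return Cn

# Toy data: 3 terms, 3 documents (all values are made up).
L = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]   # local weights L_ij
G = [1.0, 0.5, 2.0]                     # global weights G_i
N = [1.0, 1.0, 1.0]                     # normalization weights N_j

A = term_weights(L, G, N)
Cu = co_weight(A)
Cn = jaccard_normalize(Cu)
print(Cu)
print(Cn)
```

Note that Cu is symmetric and that, under this assumed normalization, every diagonal element of Cn equals 1.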

Co-Occurrence Matrices

An unnormalized and a normalized co-occurrence matrix are respectively obtained from Cu and Cn. This is accomplished by initially setting Nj = 1, Gi = 1, and Lij = fij; where fij is the occurrence of term i in document j.

This means that term weights are defined as mere local weights and based on raw word occurrences in documents:

aij = fij

All these matrices can be transformed into binary matrices by setting aij values to 1 or 0. These values indicate the presence (1) or absence (0) of term i in document j, regardless of how many times terms occur in documents. Thus, binary co-occurrence -and therefore, binary co-weight- matrices are particular cases.

To conclude, a co-occurrence matrix, normalized or not, or binary or not, is just a particular case of a co-weight matrix.
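As a small illustration of this point, the following Python sketch builds an unnormalized co-occurrence matrix and its binary version from the same product A*AT, simply by changing how aij is defined. The function name and the toy counts are hypothetical.

```python
def co_weight(A):
    """Cu = A * A^T for a terms-by-documents matrix A."""
    return [[sum(a * b for a, b in zip(row_i, row_k)) for row_k in A] for row_i in A]

# Raw occurrences f_ij of 3 terms in 3 documents (made-up counts).
f = [[2, 0, 1],
     [1, 1, 0],
     [0, 3, 1]]

# Co-occurrence case: Nj = 1, Gi = 1, Lij = fij, so aij = fij.
co_occurrence = co_weight(f)

# Binary case: aij = 1 if term i is present in document j, else 0.
binary = [[1 if count > 0 else 0 for count in row] for row in f]
binary_co_occurrence = co_weight(binary)

print(co_occurrence)
print(binary_co_occurrence)
```

In the binary case, each off-diagonal element counts the documents in which both terms are present, and each diagonal element counts the documents containing that term.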

The indiscriminate use of the term “co-occurrence matrix” should be avoided, since the expression implies that term weights are defined as occurrences, aij = fij. This is not always the case, though.

All co-occurrence matrices are co-weight matrices, but the reverse is not necessarily true; not all co-weight matrices are co-occurrence matrices. Calling “co-occurrence” something that is not is risky.

Unfortunately, we frequently read research papers, including LSI papers, wherein authors and reviewers fail to recognize this generalization.

I advise graduate students and readers (i.e., SEOs, IR friends, colleagues) to avoid such generalizations.
