Here is the 2002 master's thesis of Nir Oren, University of the Witwatersrand, Johannesburg:

Improving the effectiveness of information retrieval with genetic programming

where he proposes an interesting approach to IR using genetic programming. Part of his abstract states:

“This research seeks to find alternative vector schemes to tf.idf. This is achieved by using a technique from machine learning known as genetic programming. Genetic programming attempts to search a program space for “good” solutions in a stochastic directed manner. This search is done in a manner motivated by evolution, in that the good programs are more likely to be combined to form new programs, while poor solutions die off.

Within this research, evolved programs consisted of a subset of possible classifiers, and one program was deemed better than another if it better classified documents as relevant or irrelevant to a user query.

The contribution of this research is an evaluation of the effectiveness of using genetic programming to create classifiers in a number of IR settings. A number of findings were made: It was discovered that the approach proposed here is often able to outperform the basic tf.idf method: on the CISI and CF datasets, improvements of just under five percent were observed. Furthermore, the form of the evolved programs indicates that classifiers with a structure different to tf.idf may have a useful role to play in information retrieval methods. Lastly, this research proposes a number of additional areas of investigation that may further enhance the effectiveness of this technique.”
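To make the idea in the abstract more concrete, here is a minimal Python sketch of evolving a term-weighting formula and comparing it with tf.idf. It is not the thesis's implementation. The toy corpus (DOCS, QUERY, RELEVANT), the function set (add, mul, div, log), the terminals (tf, df, n), the use of average precision as the fitness measure, and every function name are assumptions made only for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical toy data, invented for this sketch (not from the thesis).
DOCS = ["genetic programming evolves programs",
        "information retrieval ranks documents",
        "genetic algorithms search program space",
        "cooking recipes for pasta"]
QUERY = "genetic programming"
RELEVANT = {0, 2}  # indices of the documents judged relevant to QUERY

def safe_div(a, b):
    return a / b if abs(b) > 1e-9 else 1.0  # protected division

def safe_log(a):
    return math.log(abs(a)) if abs(a) > 1e-9 else 0.0  # protected log

# Assumed function set: name -> (arity, implementation).
OPS = {"add": (2, lambda a, b: a + b), "mul": (2, lambda a, b: a * b),
       "div": (2, safe_div), "log": (1, safe_log)}
TERMINALS = ("tf", "df", "n")
# tf.idf written in the same tree form: tf * log(n / df).
TFIDF_TREE = ("mul", "tf", ("log", ("div", "n", "df")))

def term_stats(docs):
    tfs = [Counter(d.split()) for d in docs]       # term frequency per document
    df = Counter(t for tf in tfs for t in tf)      # document frequency per term
    return tfs, df

def evaluate(tree, tf, df, n):
    """Evaluate a weighting expression for one term in one document."""
    if isinstance(tree, str):
        return float({"tf": tf, "df": df, "n": n}[tree])
    op, *args = tree
    return OPS[op][1](*(evaluate(a, tf, df, n) for a in args))

def score(tree, query, tfs, df):
    """Score each document by summing the weights of matching query terms."""
    n = len(tfs)
    return [sum(evaluate(tree, tf[t], df[t], n) for t in query.split() if t in tf)
            for tf in tfs]

def average_precision(scores, relevant):
    ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def fitness(tree, tfs, df):
    return average_precision(score(tree, QUERY, tfs, df), RELEVANT)

def random_tree(rng, depth):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(OPS))
    return (op, *(random_tree(rng, depth - 1) for _ in range(OPS[op][0])))

def subtrees(tree):
    yield tree
    if not isinstance(tree, str):
        for child in tree[1:]:
            yield from subtrees(child)

def crossover(rng, a, b):
    """Replace a random subtree of a with a random subtree of b."""
    if isinstance(a, str) or rng.random() < 0.3:
        return rng.choice(list(subtrees(b)))
    op, *args = a
    i = rng.randrange(len(args))
    args[i] = crossover(rng, args[i], b)
    return (op, *args)

def mutate(rng, tree, depth=2):
    """Replace a random subtree with a freshly generated one."""
    if isinstance(tree, str) or rng.random() < 0.3:
        return random_tree(rng, depth)
    op, *args = tree
    i = rng.randrange(len(args))
    args[i] = mutate(rng, args[i], depth)
    return (op, *args)

def evolve(seed=0, pop_size=20, generations=10):
    rng = random.Random(seed)
    tfs, df = term_stats(DOCS)
    pop = [random_tree(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, tfs, df), reverse=True)
        parents = pop[: pop_size // 2]  # the weaker half dies off
        children = [mutate(rng, crossover(rng, rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    best = max(pop, key=lambda t: fitness(t, tfs, df))
    return best, fitness(best, tfs, df)

if __name__ == "__main__":
    tfs, df = term_stats(DOCS)
    print("tf.idf fitness:", fitness(TFIDF_TREE, tfs, df))
    print("evolved:", evolve())
```

With a corpus this small, tf.idf already ranks both relevant documents first, so the sketch only shows the mechanics: weighting formulas are expression trees, fitter trees are recombined and mutated, and the rest are discarded. Which tree comes out of evolve depends on the seed.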

The thesis also gives a basic explanation of many standard retrieval models and techniques, and it includes good references.

This is a legacy post originally published on 8/22/2006