A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.
Global weights based on entropies are frequently combined with local and normalization weights into overall weights. These are then used to populate a term-doc matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.
The following set of equations defines the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:
G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.
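As a check on these two cases, here is a worked version under the commonly used log-entropy definition. Treat the exact form as an assumption on my part: the symbols tf_ij (frequency of term i in document j), gf_i (total frequency of term i in the collection), and p_ij are labels chosen for this sketch.

```latex
% Assumed log-entropy form of the global weight for term i
p_{ij} = \frac{tf_{ij}}{gf_i}, \qquad
G_i = 1 + \sum_{j=1}^{N} \frac{p_{ij}\,\log p_{ij}}{\log N}

% Case 1: term equally mentioned in all N = 3 documents, so p_{ij} = 1/3
G_i = 1 + \frac{3 \cdot \tfrac{1}{3}\log\tfrac{1}{3}}{\log 3}
    = 1 + \frac{-\log 3}{\log 3} = 0

% Case 2: term present in just one document, so p_{ij} = 1 there and 0 elsewhere
G_i = 1 + \frac{1 \cdot \log 1 + 0 + 0}{\log 3} = 1 + 0 = 1
```

The base of the logarithm cancels in the ratio, so any base gives the same G.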
Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms whose occurrences are concentrated in a small number of documents, while lowering the weights of terms that are spread evenly across the collection.
Note that, by convention, p log p is set to 0 when p = 0, where log p is undefined. When p = 1 the product is 0 anyway, because log 1 = 0.
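The calculation, including the p log p convention, can be sketched in a few lines of Python. This is a minimal illustration that assumes the common log-entropy form G = 1 + sum(p log p) / log N with p = tf / gf; the function name `global_entropy_weight` and its argument are hypothetical names chosen for the example.

```python
import math


def global_entropy_weight(term_freqs):
    """Global entropy weight of one term (assumed log-entropy form).

    term_freqs: raw frequencies of the term in each of the N documents.
    """
    n_docs = len(term_freqs)
    gf = sum(term_freqs)  # total frequency of the term in the collection
    if n_docs < 2 or gf == 0:
        raise ValueError("need at least 2 documents and 1 occurrence")

    total = 0.0
    for tf in term_freqs:
        if tf > 0:  # convention: p log p = 0 when p = 0, so skip it
            p = tf / gf
            total += p * math.log(p)

    # Dividing by log N scales the entropy term to the range [-1, 0]
    return 1.0 + total / math.log(n_docs)


if __name__ == "__main__":
    print(global_entropy_weight([2, 2, 2]))  # equally mentioned: about 0
    print(global_entropy_weight([5, 0, 0]))  # one document only: 1.0
    print(global_entropy_weight([3, 1, 0]))  # in between
```

Only the proportions matter here: [2, 2, 2] and [10, 10, 10] give the same weight, since G depends on how a term's occurrences are distributed across documents rather than on the raw counts.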