As described in the current issue of the IRW newsletter, meta information (structured or unstructured) can be extracted from Web documents with dashboard technology.

To illustrate this, we have incorporated a non-intrusive dashboard into the Fractal Resource Index sub-site of Mi Islita.com (http://www.miislita.com/fractals/index.html). One of its channels reads the content of the meta keywords tag and generates a keyword matrix of bigrams (two-term keywords).
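
As a rough illustration of what such a channel does, the following Python sketch reads a meta keywords tag and pairs every term with every other term. It is not the dashboard's actual code: the class and function names are hypothetical, and it assumes the tag holds comma-separated single terms and that the diagonal (a term paired with itself) is left empty.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collects the terms listed in <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Assumption: terms are separated by commas.
            self.keywords.extend(
                term.strip().lower() for term in content.split(",") if term.strip()
            )


def bigram_matrix(terms):
    """Return a square matrix whose cell (i, j) pairs term i with term j.

    Diagonal cells are None because a term is not paired with itself.
    """
    return [
        [None if i == j else f"{row} {col}" for j, col in enumerate(terms)]
        for i, row in enumerate(terms)
    ]


# Made-up page fragment, used only to exercise the sketch.
sample_html = '<head><meta name="keywords" content="fractals, chaos, geometry"></head>'
parser = MetaKeywordsParser()
parser.feed(sample_html)
for row in bigram_matrix(parser.keywords):
    print(row)
```

Keeping both "fractals chaos" and "chaos fractals" is a deliberate choice in this sketch, since search engines may rank the two word orders differently.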

Clicking on a bigram opens a new browser window and submits the bigram as a query to a search engine (in this case, Google). This allows one to estimate keyword co-occurrence statistics and to see which pairs of terms are relevant to the current document. In addition, if each column (or row) is treated as a topic vector, one can identify which topics are associated with a set of terms. Another advantage is that a second matrix, a relevance matrix for the current page, can be constructed from the search results. This is done as follows.
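
The click behavior can be approximated outside the browser. The sketch below is an assumption-laden stand-in, not the dashboard's implementation: `search_url` and `open_in_browser` are hypothetical helpers, and the query is passed in the `q` parameter of Google's public search URL.

```python
import webbrowser
from urllib.parse import urlencode


def search_url(bigram, base="https://www.google.com/search"):
    """Build a search URL that submits the bigram as the query string."""
    return f"{base}?{urlencode({'q': bigram})}"


def open_in_browser(bigram):
    """Open the search results for the bigram in a new browser window."""
    return webbrowser.open_new(search_url(bigram))


print(search_url("fractal geometry"))
```

Changing the `base` argument would point the same bigram at another search engine, provided that engine also accepts the query in a `q` parameter.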

First, click a cell bigram. If the current page is among the top N ranked documents for that query, color-code the cell black; otherwise, color it white. Alternatively, code each cell with the rank obtained. This simple exercise allows one to identify a cluster of terms that is semantically relevant to the current page.
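
The color-coding step can be sketched as follows. This assumes the ranks have already been noted by hand after each click and stored in a dictionary; the function name, the `top_n` default, and the sample ranks are all made up for illustration.

```python
def relevance_matrix(bigrams, ranks, top_n=10):
    """Code each bigram cell 'black' if the page ranked within top_n, else 'white'.

    `ranks` maps a bigram to the rank observed for the current page.
    A bigram missing from `ranks` is treated as not ranked.
    """
    coded = []
    for row in bigrams:
        coded_row = []
        for bigram in row:
            if bigram is None:  # empty diagonal cell
                coded_row.append(None)
                continue
            rank = ranks.get(bigram)
            in_top = rank is not None and rank <= top_n
            coded_row.append("black" if in_top else "white")
        coded.append(coded_row)
    return coded


# Hypothetical matrix and hand-recorded ranks, not real search results.
bigrams = [[None, "fractals chaos"], ["chaos fractals", None]]
observed_ranks = {"fractals chaos": 4, "chaos fractals": 27}
print(relevance_matrix(bigrams, observed_ranks, top_n=10))
```

To code cells with the rank itself, as the alternative above suggests, the `ranks` dictionary can be read directly instead of being reduced to two colors.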

Be advised that results can change over time or across search engines. Once relevant bigrams have been identified, trigrams (three-term keywords) containing one or two terms from the relevant bigrams can be constructed and tested. Indeed, this is how we construct trigrams for some of the sub-site pages.
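
One possible way to generate candidate trigrams is sketched below. It covers only the case where a trigram keeps both terms of a relevant bigram and adds a third term from the keyword list; the function name and the rule for skipping reordered duplicates are assumptions, not the procedure used on the sub-site.

```python
def candidate_trigrams(relevant_bigrams, vocabulary):
    """Extend each relevant bigram with one extra term from the vocabulary.

    Trigrams made of the same three terms are emitted only once.
    """
    seen = set()
    trigrams = []
    for bigram in relevant_bigrams:
        pair = bigram.split()
        for term in vocabulary:
            if term in pair:
                continue
            key = frozenset(pair + [term])
            if key in seen:
                continue
            seen.add(key)
            trigrams.append(" ".join(pair + [term]))
    return trigrams


# Hypothetical inputs for illustration.
print(candidate_trigrams(["fractals chaos"], ["fractals", "chaos", "geometry", "dimension"]))
```

Each candidate would then be tested against the search engine in the same way as the bigrams.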

This allows one to monitor, test, and, if necessary, change keywords. Originally, matrices were created out of anchor text and also out of the entire document. However, the resultant matrices were too big, and testing and maintaining them was a formidable task. Another alternative tested was to select the top N terms according to their term weights (vector space based), construct the matrix out of these, and then carry out the analysis as before.
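
The text does not say which weighting scheme was used, so the sketch below substitutes a common smoothed TF-IDF variant purely as an example of selecting the top N terms by weight. The tokenizer, the function names, and the sample corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(document, corpus, n=5):
    """Return the n terms of `document` with the highest TF-IDF weight.

    Weight = term frequency * log((1 + N) / (1 + df)), where N is the number
    of corpus documents and df the number of them containing the term.
    """
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    total = len(corpus)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_sets if term in terms)
        weights[term] = freq * math.log((1 + total) / (1 + df))
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Made-up texts used only to exercise the sketch.
page = "fractal fractal fractal chaos the the"
other_pages = ["the cat", "the dog", "the chaos"]
print(top_terms(page, other_pages, n=2))
```

The selected terms could then be fed into the same bigram matrix construction used for the meta keywords.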

Still, we settled for the meta keywords tag purely for convenience: terms can be selected based on their term weights anyway, stored in a single HTML element, and easily tested.
