We are currently building the Information Retrieval Collection (IRC) with the Minerazzi platform. URLs pointing to resources like articles from scholarly journals and teaching material from scholarly web pages are crawled and indexed.
So far we have found that a nontrivial number of these URLs point to teaching material (lecture notes, tutorials, etc.) in a markup format whose title tags or file names poorly describe the content, which makes indexing difficult. Some don’t even have any useful information in their HTML head section. Something similar occurs with the URLs of .pdf and .ps files: we often find no useful file names.
For journal articles, this is understandable, as it could be the result of editorial policies; not so for teaching materials, though.
Sadly, the impression is that either university webmasters or the scholars who wrote the resources seem to be sloppy or go by the “do as I teach, not as I do” rule.
In any case, the above documents have descriptors that are too short or too poorly written to help a human or a robot with the indexing. One can do better by relying on the anchor text displayed in the documents, but that workaround depends on the quality of that text.
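As a rough illustration of the fallback described above, here is a minimal Python sketch that prefers the title tag and falls back to anchor text when the title is too short or too generic. It is not how Minerazzi does it: the names `choose_descriptor`, `is_useful`, `GENERIC_TITLES`, and `MIN_WORDS` are hypothetical, the three-word threshold is an arbitrary assumption, and the anchor text is assumed to be supplied by the caller.

```python
from html.parser import HTMLParser

# Hypothetical heuristics: titles this generic or this short are treated as useless.
GENERIC_TITLES = {"", "untitled", "untitled document", "index", "home", "notes", "slides"}
MIN_WORDS = 3  # assumed threshold, not taken from any real indexing policy


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the whitespace-normalized title text, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


def is_useful(text):
    """A descriptor is 'useful' if it is not generic and has enough words."""
    normalized = " ".join(text.split()).lower()
    return normalized not in GENERIC_TITLES and len(normalized.split()) >= MIN_WORDS


def choose_descriptor(html, anchor_text=""):
    """Prefer the title tag; fall back to anchor text; otherwise give up."""
    title = extract_title(html)
    if is_useful(title):
        return title, "title"
    anchor = " ".join(anchor_text.split())
    if is_useful(anchor):
        return anchor, "anchor"
    return "", "none"
```

For a page titled only "Lecture 3", calling `choose_descriptor(page_html, "Introduction to vector space retrieval models")` would pick the anchor text. If the anchor text is just as poor, the function returns an empty descriptor, which is exactly the limitation noted above.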
We are working on a partial solution to the problem that uses temporary tagging instead of extracting summaries after full-text indexing. It is an interesting trick for dynamically building collections, but it is not a perfect solution either.