Did you know that before PageRank and current link models, and even before CLEVER and HITS, some researchers were already working on hyper search engines and hypertext?


The Quest for Correct Information on the Web: Hyper Search Engines

Massimo Marchiori described just that: components and elements now in use by many current link models, including PageRank. It is clear that there is nothing new under the Sun, but it is nice to have a flashback and read about search engines that are no longer among us.

Marchiori’s abstract follows:

Finding the right information on the World Wide Web is becoming a fundamental problem, since the amount of global information that the WWW contains is growing at an incredible rate. In this paper, we present a novel method for extracting from a Web object its “hyper” informative content, in contrast with current search engines, which only deal with the “textual” informative content. This method is not only valuable per se, but it is shown to be able to considerably increase the precision of current search engines. Moreover, it integrates smoothly with existing search engine technology since it can be implemented on top of every search engine, acting as a post-processor, thus automatically transforming a search engine into its corresponding “hyper” version. We also show how, interestingly, the hyper information can be usefully employed to face the search engine persuasion problem.

Some highlights from the WWW6 paper follow (emphasis added). Some statements in it are no longer valid while others still hold; I prefer to let readers dissect them in the light of the current state of the art:

The problem is that visibility says nothing about the informative content of a Web object. The misleading assumption is that if a Web object has high visibility, then this is a sign of importance and consideration, and so de facto its informative content must be more valuable than other Web objects that have fewer links pointing to them... In a nutshell, visibility is likely to be a synonym of popularity, which is something completely different than quality, and thus using it to gain higher score from search engines is a rather poor choice.

What is really missing

As said, what is really missing in the evaluation of the score of a Web object is its hyper part, that is the dynamic information content which is provided by hyperlinks (henceforth, simply links).

We call this kind of information hyper information: this information should be added to the textual information of the Web object, giving its (overall) information to the World Wide Web. We indicate these three kinds of information as HYPERINFO, TEXTINFO and INFORMATION, respectively. So, for every Web object A we have that INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A) (note that in general these information functions depend on a specific query, that is to say they measure the informative content of a Web object with respect to a certain query: in the sequel, we will always consider such a query to be understood).

The presence of links in a Web object clearly augments the informative content with the information contained in the pointed Web objects (although we have to establish to what extent).

Recursively, links present in the pointed Web objects further contribute, and so on. Thus, in principle, the analysis of the informative content of a Web object A should involve all the Web objects that are reachable from it via hyperlinks (i.e., “navigating” in the World Wide Web).
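To make the recursion concrete, here is a minimal Python sketch of INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A). It is only an illustration, not Marchiori's exact formula: the excerpts above leave open "to what extent" linked objects contribute, so the `fading` weight, the `depth` cutoff, the cycle handling, and the toy graph and scores are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List


def hyper_info(page: str,
               links: Dict[str, List[str]],
               text_info: Dict[str, float],
               depth: int,
               fading: float,
               seen: FrozenSet[str] = frozenset()) -> float:
    """Sum the textual scores of pages reachable from `page`.

    A page reached after k clicks contributes text_info * fading**k.
    `fading` and `depth` are illustrative assumptions, not values
    taken from the paper excerpts quoted above.
    """
    if depth == 0:
        return 0.0
    seen = seen | {page}  # pages already on the current navigation path
    total = 0.0
    for target in links.get(page, []):
        if target in seen:
            continue  # do not walk around a cycle
        deeper = hyper_info(target, links, text_info, depth - 1, fading, seen)
        total += fading * (text_info.get(target, 0.0) + deeper)
    return total


def information(page: str,
                links: Dict[str, List[str]],
                text_info: Dict[str, float],
                depth: int = 2,
                fading: float = 0.5) -> float:
    """INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A) for a fixed query."""
    return text_info.get(page, 0.0) + hyper_info(page, links, text_info, depth, fading)


# Hypothetical four-page Web; scores stand for per-query textual relevance.
links = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["A"]}
text_info = {"A": 0.2, "B": 0.4, "C": 0.8, "D": 1.0}

print(round(information("A", links, text_info, depth=2, fading=0.5), 2))
```

In this toy graph, page A has a low textual score on its own but gains from pointing at B, C and (two clicks away) D. Note that the score depends on the links going out of A, not on how many pages link to A, which is the distinction from visibility drawn earlier.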

Regarding SEP (now called SEO)

A big problem that search engines have to face is the phenomenon of so-called sep (search engines persuasion). Indeed, search engines have become so important in the advertising market that it has become essential for companies to have their pages listed in top positions of search engines, in order to get a significant Web-based promotion. Starting with the already pioneering work of Rhodes ([5]), this phenomenon is now boosting at such a rate as to have provoked serious problems for search engines, and has revolutioned the Web design companies, which are now specifically asked not only to design good Web sites, but also to make them rank high in search engines. A vast number of new companies was born just to make customer Web pages as visible as possible. More and more companies, like Exploit, Allwilk, Northern Webs, Ryley & Associates, etc., explicitly study ways to rank a page highly in search engines. OpenText arrived even to sell “preferred listings;” i.e., assuring that a particular entry will stay in the top ten for some time, a policy that has provoked some controversies (cf. [9]).

Interesting pictures

Regarding information depth:

Regarding the cost of non determinism:

Regarding back links:

Twelve years later, not much has changed.