In Web Spam Taxonomy, Gyongyi and Garcia-Molina describe several web spam techniques, one of which is the honey pot.

They describe honey pots as “a set of pages that provide some useful resource …, but that also have (hidden) links to the target spam page(s).”

To target non-human visitors (e.g., web crawlers), such links can be placed in HTML elements that are made invisible (e.g., div elements with a display:none CSS rule), with rel="dofollow" added to the anchor tags.
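As a rough illustration of how such hidden links might be spotted in a page's source, here is a minimal Python sketch using only the standard library. It is not how Minerazzi works internally. It assumes the hiding is done with an inline style attribute containing display:none, so links hidden through external stylesheets or scripts would be missed. The names HiddenLinkFinder and find_hidden_links, and the sample URLs, are made up for this example.

```python
from html.parser import HTMLParser
import re

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumption: only inline "display: none" declarations are considered.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenLinkFinder(HTMLParser):
    """Collects href values of anchors nested inside inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.stack = []         # (tag, is_hidden) for each currently open element
        self.hidden_links = []  # hrefs found inside hidden elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        hidden = bool(HIDDEN_STYLE.search(attrs.get("style") or ""))
        inside_hidden = hidden or any(h for _, h in self.stack)
        if tag == "a" and inside_hidden and attrs.get("href"):
            self.hidden_links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_hidden_links(html):
    """Return the hrefs of all anchors that sit inside an inline-hidden element."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_links


# Hypothetical page fragment: one visible link and one hidden link.
sample = """
<p><a href="http://example.com/visible">Heap sort tutorial</a></p>
<div style="display:none">
  <a href="http://spam.example/target" rel="dofollow">hidden</a>
</div>
"""
print(find_hidden_links(sample))  # ['http://spam.example/target']
```

Only the link inside the hidden div is reported; the visible link in the paragraph is ignored.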

These types of tricks can be easily unveiled with the recrawling feature of Minerazzi.

For instance, searching for [ heap sort ] in http://www.minerazzi.com/dsac retrieves several records, one being http://www.aihorizon.com/resources/sourcecode/trees/heap_h.htm.

Clicking the “Recrawl it” link retrieves several URLs, one being http://www.aihorizon.com/index.htm.

Recrawling that link quickly reveals that the index page has several hidden links to porn sites. Looking at the source code of that page shows that the adversarial technique described above was used.