Here is a Python-based search engine with an implementation inspired by one of our papers at the old Mi Islita.com site, now a search engine on Puerto Rico.
This module replicates the miislita vector spaces from
"A Linear Algebra Approach to the Vector Space Model -- A Fast Track Tutorial"
by Dr. E. Garcia...
Great and positive accomplishment!
That tutorial is no longer at miislita.com, but was long ago moved to minerazzi.com. Find it here:
For other resources do a search for python in our IR miner at
For inquiries about that implementation, contact its author.
For other inquiries, applications, suggestions, drop me a line.
PS. Please note that Nullege.com itself is a search engine for finding Python code. Here is a good example: http://nullege.com/codes/search/wx.calendar.CalendarCtrl
Effective immediately, Minerazzi (http://www.minerazzi.com) allows users to recursively recrawl search results.
Why is recrawling so important?
The purpose of allowing users to recrawl URLs is to expose them to new content and to involve them in learning through discovery; in short, to turn their searches into a mining activity. This makes more sense than limiting their search experience to inspecting zillions of cached records from a search engine index. The problem with the latter is that those records are frequently either outdated or irrelevant, not to mention that in that scenario the users are simply passive spectators.
Allowing users to recrawl search results has many advantages and possibilities. For instance, users can use the discovered URLs to build curated collections, self-guide investigative work, or gather link intelligence from sites, directories, blogs, forums, or social networks. In general, recrawling allows users to discover hidden paths to fresh, new, or rich content.
Considering that the total number of primary and secondary URLs defines the reach of a microindex, in theory recrawling should result in an endless reach.
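To make the idea of reach concrete, here is a minimal sketch in Python. It is not how Minerazzi computes anything. It assumes a hypothetical in-memory link graph, `LINKS`, in place of live pages, and a hypothetical helper, `reach`, that counts the distinct URLs known after a given number of recrawl rounds.

```python
# Hypothetical link graph standing in for live pages: URL -> URLs found on it.
LINKS = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["c.example/"],
    "b.example/": ["a.example/", "d.example/"],
    "c.example/": [],
    "d.example/": ["e.example/"],
}

def reach(primary_urls, rounds):
    """Count the distinct URLs known after `rounds` rounds of recrawling."""
    seen = set(primary_urls)
    frontier = list(primary_urls)
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            # Recrawl each URL found in the previous round.
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return len(seen)
```

On this toy graph, each extra round adds URLs until the graph is exhausted. On the open web the frontier rarely empties, which is where the "endless in theory" reach comes from.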
At this time, we do not recrawl .css and .pdf files, but we recrawl the most common file formats (.php, .asp, .aspx, .html, .htm, .js, etc.). However, if the content of a file is dynamic, obfuscated, or poorly coded, it will most likely return garbage or nothing.
Having said that, we invite you to try the recrawling experience with our public miners.
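For readers who want to experiment on their own, the file-type rule described above can be sketched in Python. This is only an illustration built on the standard library, not our actual crawler. The names `SKIPPED_EXTENSIONS`, `LinkCollector`, `is_recrawlable`, and `recrawl_candidates` are hypothetical, and the sketch assumes links are plain `<a href="...">` tags in static HTML.

```python
from html.parser import HTMLParser
from posixpath import splitext
from urllib.parse import urljoin, urlparse

# Assumption: only the two file types named in the post are skipped.
SKIPPED_EXTENSIONS = {".css", ".pdf"}

class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))

def is_recrawlable(url):
    """Return True for http(s) URLs whose path does not end in a skipped extension."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    extension = splitext(parts.path)[1].lower()
    return extension not in SKIPPED_EXTENSIONS

def recrawl_candidates(html, base_url):
    """Return the unique recrawlable URLs found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    candidates = []
    for url in collector.links:
        if is_recrawlable(url) and url not in candidates:
            candidates.append(url)
    return candidates
```

Passing a page's HTML and its URL to `recrawl_candidates` yields the links worth recrawling. Dynamic or obfuscated pages, whose links are generated by scripts, would return little or nothing here, just as noted above.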