Search Engines Architecture Week 4

Week 4 Agenda

Lecture Session

Building an Inverted Index
Incorporating Query Modes
Building the Term-Doc Matrix from the Inverted Index

Lab Session

Programming the Parser
Programming the Inverted Index

2 Responses to “Search Engines Architecture Week 4”

  1. panzernieves Says:

    Hi Prof.:

    We’ve been working with the Perl spider for a while now, and while it seems to work (at least it runs!!!), it doesn’t seem to be working right. It never gets the right results (it doesn’t get any results, that is). I’d document what I do understand about the code, but there’s not much I can do for now.

    G. Nieves

  2. E. Garcia Says:

    Since this is a hands-on course, any roadblock is a learning lesson.

    Try other alternatives.

    1. Google for other Perl scripts (there are plenty).
    2. Try C code such as the old Lycos Scoutget code:
    http://robot-club.com/lti/lycos/scoutget.html
    3. Try any PHP or Java based spider.

    Here is some simple, bare-bones pseudocode. The crawler should do this:

    1. Get a web page via an HTTP request (it can be via AJAX) and save it to a directory.
    2. Scrape the links from that page.
    3. Pick a link from the page and go back to step 1.

    Modifications:

    In step 2, the program appends the scraped links to a list of links, and in step 3 it picks a link from that list instead of from the page.

    Once we have saved enough documents to the directory, we can index it with Terrier.
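
The crawl loop described above, including the modification that keeps a list of links, could be sketched in Python roughly as follows. This is only a minimal illustration, not the course's Perl spider: the names `LinkScraper`, `fetch`, `crawl`, `max_pages`, and the `docNNNN.html` file naming are all assumptions made for this example, and it skips things a real crawler needs, such as robots.txt handling and delays between requests.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkScraper(HTMLParser):
    """Step 2: collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Step 1 (first half): get a page via an HTTP request, as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, out_dir, max_pages=10, fetch_page=fetch):
    """Crawl from `seed`, saving up to `max_pages` pages into `out_dir`.

    `fetch_page` is a hypothetical hook so the loop can be tried
    without network access. Returns the URLs saved, in order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    frontier = deque([seed])  # the "list of links" from the modification
    seen = {seed}             # avoid queueing the same URL twice
    saved = []
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()  # step 3: pick a link from the list
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        # Step 1 (second half): send the page to the directory.
        (out / f"doc{len(saved):04d}.html").write_text(html, encoding="utf-8")
        saved.append(url)
        scraper = LinkScraper()
        scraper.feed(html)
        for href in scraper.links:
            # Resolve relative links and drop any #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved
```

Calling `crawl(seed_url, "pages", max_pages=50)` with a seed URL of your choice would fill the `pages` directory with numbered HTML files. Picking links from the front of the list gives a breadth-first crawl; picking from the end instead would make it depth-first.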
