Archive for May, 2008

Search Engines Architecture Week 10

May 16, 2008

Week 10 Agenda

Lecture Session

Other Inverted Index Architectures
Divide-and-Conquer Strategies for Fast Indexing and Searching

Lab Session

Lectures and Lab Review

Final Examination Notes

Next week we have the final examination. This is an open book exam, with theory and practice sections.

To take the test you need:

#2 pencil.
Calculator.
Working version of Terrier.
Tools developed during the course: parser, crawler, url and query normalizers, stemmer, etc.
Laptop (or a PC will be supplied to you).

IRW-2008-05:Search Engines Architecture Review Test

May 13, 2008

Search Engine Architecture

How much do you know about the architecture of a search engine?

The current issue of IR Watch - The Newsletter is out (finally!).

It consists of a Theory and Practice Test I prepared for PUPR.edu graduate students taking my Search Engines Architecture course. As such, it deviates from the format used in previous issues of the newsletter. I thought it might be used to assess how much readers know about the architecture of a search engine. If you are an IR student, the test will probably help you review basic concepts. In fact, the test is intentionally long, since it was designed to serve as a comprehensive review. The actual exam is slightly different in terms of length and content.

In This Issue:

Introduction
Early Architectures
First/Deep-Breadth Crawlers
Search Agents
Dispatcher
Forward Index
Inverted Index
Query Servers
Front/Back-End Servers
Lexicon/Thesaurus
Posting Lists
Search Results
Tokenization
Filtration
Stemming
Search Modes
News, Research, and Events
Terms of Use and Copyright

Note to Students: The final exam will be on the 24th and is a Theory and Practice examination. Bring with you the tools developed during the course (web crawler, parser, stemmer, Terrier, query normalizer, etc.). These will be used during the test. If you still owe me a lab, be sure to turn it in this Saturday.

PowerSet Semantic Searches in Wikipedia

May 12, 2008

According to Reuters,

Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.

Powerset’s technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.

What does Google have to say about the topic?

According to PCWorld:

In an interview in October with IDG News Service, Marissa Mayer, Google’s vice president of Search Products & User Experience, acknowledged that the company’s search engine should — and will — overcome its keyword dependence in time.

“People should be able to ask questions and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like ‘what is this about?’. A lot of people will turn to things like the semantic Web as a possible answer to that,” she said.

But she added that Google’s search engine acts smart thanks to the humongous amount of data it crunches. “With a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force,” she said. As examples, she cited a query like “GM,” which the engine interprets as “General Motors” but if the query is “GM foods,” it delivers results for “genetically-modified foods.” “Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart, like it achieved that semantic understanding, but it hasn’t really,” she said.

Hmm…

A search for GM goods and a search for GM in Powerset both return results relevant to General Motors, while Google does discriminate between these searches, possibly using brute force.

By contrast, both engines discriminate between a search for GM foods and a search for GM.

PowerSet, Google, and almost all search engines do not seem to discriminate between the following two semantically different searches, which counts against the aforementioned semantic analysis claims:

Who is the best college junior?

Who is the best junior college?

A simple change in word order affects the meaning and the information need behind the query. Semantic searches? There is still a long way to go. This is going to be a nice race to watch, from the architectural side.
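To see why word order is easy to lose, here is a minimal Python sketch. It is not how Powerset or Google process queries; it only assumes a toy engine that reduces a query to a bag of words, and compares that with an order-sensitive view built from adjacent word pairs. The function names `bag_of_words` and `bigrams` are mine, for illustration only.

```python
from collections import Counter


def bag_of_words(query):
    """Order-insensitive view: lowercase terms and their counts."""
    return Counter(query.lower().rstrip("?").split())


def bigrams(query):
    """Order-sensitive view: pairs of adjacent terms."""
    terms = query.lower().rstrip("?").split()
    return list(zip(terms, terms[1:]))


q1 = "Who is the best college junior?"
q2 = "Who is the best junior college?"

# The two queries contain exactly the same terms...
same_bag = bag_of_words(q1) == bag_of_words(q2)
# ...but their adjacent-term pairs differ.
same_bigrams = bigrams(q1) == bigrams(q2)

print(same_bag, same_bigrams)  # prints: True False
```

Under a pure bag-of-words view the two queries are indistinguishable; only a representation that keeps positions (here, adjacent pairs) can tell them apart.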

Talking about search engine architecture, the current issue of IRWatch - The Newsletter is the very same practice test I am giving to my grad students. Since they need to study for the finals, I thought I could kill two birds with one stone. It should reach subscribers' inboxes today or, at the latest, tomorrow.

Search Engines Architecture Week 9

May 9, 2008

Week 9 Agenda

Lecture Session

This lecture is an extension of our previous lecture. We present a detailed discussion of the architecture floor plans of the following search engines:

WebCrawler
Google

Main components to be discussed include:

crawler administration, indexing, forward index, inverted index, posting list intersection, fault tolerance, redundancy, query server load balancing, etc.
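As a small illustration of three of these components, here is a toy Python sketch: a forward index (document to text), an inverted index built from it (term to posting list), and a merge-style intersection of two sorted posting lists to answer a two-term AND query. This is a classroom-sized sketch under my own assumptions, not the implementation used by WebCrawler or Google; the sample documents and function names are made up.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Turn a forward index (doc ID -> text) into term -> sorted posting list."""
    index = defaultdict(set)
    for doc_id, text in forward_index.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


def intersect(p1, p2):
    """Intersect two sorted posting lists by walking both in step."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical forward index: doc ID -> document text.
docs = {
    1: "early web crawler",
    2: "web search engine",
    3: "search engine crawler",
    4: "inverted index for web search",
}

index = build_inverted_index(docs)
hits = intersect(index["web"], index["search"])
print(hits)  # prints: [2, 4]
```

Because both posting lists are kept sorted, the intersection needs a single pass over each list, which is one reason posting lists are usually stored in document ID order.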

If we don’t run out of time, we will get into Lucene and other architectures.

Lab Session

Complete previous labs.

Important Note

I am finishing the Practice Test; it should be in students' inboxes by Monday.

Search Engines Cache in the Times of Drug Busts

May 7, 2008

One nice thing about modern search engines is that they give users access to cached pages. These are old versions of pages that reside, often precompressed, in a specific section of the engine's architecture.

Unless the owner or administrator of a site instructs search engines (via metadata or a robots.txt file) not to cache a document, old versions will be available to end users via the cache command or via a cache link next to a search result.
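For the metadata route, here is a minimal Python sketch of how a crawler might check a page before caching it. It assumes the page uses a robots meta tag with the `noarchive` directive, which is the value commonly documented for this purpose; whether a particular engine honors it is up to that engine. The class and function names are hypothetical, and the robots.txt route is not covered here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def allows_caching(html):
    """Return False when the page asks not to be archived (cached)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


page = '<html><head><meta name="robots" content="noarchive"></head></html>'
print(allows_caching(page))  # prints: False
```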

This feature comes in handy for those who use search engines for intelligence purposes. A lot of useful information can be found by searching for cached documents. At the same time, once-glorious old pages can become unwanted.

Ask San Diego State University’s Marketing and Communication Department. Out of embarrassment, they just removed the document listed at http://advancement.sdsu.edu/marcomm/features/2006/compact.html, in which they featured a role model student (Kenneth Ciaccio) who was arrested yesterday on charges connected with an on-campus drug bust operation.

The page is still showing up in Google’s cache and reflects badly on SDSU and its Compact for Success program. To access it in Google, just do a search and click the cache link, or enter cache:url in the query box, where url is the address of the above document.

 

Pink Keywords: Optimization of Resumes and Job Applications

May 5, 2008

The current slump in the US and PR economy, with so many local employers giving out pink slips, has me thinking about the importance of pink keywords.

These are keywords one would use to optimize resumes and job applications.

Now more than ever, recruiters, middle management, and HR departments need to look through zillions of resumes, looking for specific clues in the form of pinky keywords. This means that resumes and job applications must be optimized for such terms.

http://career-advice.monster.com/resume-writing-basics/Keyword-Challenge/home.aspx

The best way of finding good pinky keywords consists of selling employers their own crappy ads and job offers; that is, scanning employment ads, job offerings, and classifieds relevant to the target position you are interested in and then using the target terms in your own resume. Another thing you can do is expand these with related or contextual terms; of course, using only those that match your own experience and skills.
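The scanning step can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: a tiny hand-made stopword list, plain term counts with no stemming, and made-up ad and resume text. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "we", "is", "are", "on"}


def terms(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def missing_keywords(ads, resume, top_n=5):
    """Most frequent ad terms that do not appear in the resume."""
    counts = Counter(t for ad in ads for t in terms(ad))
    resume_terms = set(terms(resume))
    ranked = [t for t, _ in counts.most_common() if t not in resume_terms]
    return ranked[:top_n]


# Made-up sample data.
ads = [
    "Seeking Java developer with SQL experience",
    "Java developer needed: SQL and Linux skills",
]
resume = "Experienced developer skilled in Python and Linux"

print(missing_keywords(ads, resume, top_n=2))  # prints: ['java', 'sql']
```

The output is only a list of candidates: a term belongs in the resume only if it matches real experience and skills. Running a stemmer first would also avoid treating "experience" and "experienced" as different terms.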

I see here an opportunity for ethical SEO companies to provide a valuable and noble service: Pinky Optimization. At the same time, I see an opportunity for crooked SEOs and spammers to prey on other people’s misfortune. Since many in the SEO sphere have been disposed of by fat cats and sold(soul)-outs, these folks are also job searching. Life's ironies.

Search Engines Architecture Week 8

May 2, 2008

Week 8 Agenda

Lecture Session

To understand the present we need to look at the past.

Thus, in this lecture we will take an in-depth look at early and current search engine architectures and their “floor plans”. We will look at some of the published papers that started what we have today. Some hard-to-find material will be used. These are actual pieces of history that explain a few “how-did-they-do-that” questions.

Since we cannot cover all the search engine architectures we would like to, I have selected a few of them. These are classified into three categories: early, old glory, and current. The last two listed, Lucene and Terrier, are open source projects.

I might add a few more. Either way, this lecture might cover two weeks.

Early:

Archie
ALIWEB
WWW Wanderer
WWW Worm
JumpStation
RBSE

Old Glory:

WebCrawler
Lycos

Current:

Google
Lucene
Terrier

Lab Session

Complete previous lab.