IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: May 2008

Search Engines Architecture Week 11

23 Friday May 2008

Posted by egarcia in Search Engines Architecture Course

Final Examination

The course final examination is almost here. This is a theory and practice test. If you are taking the test, bring with you a No. 2 pencil, eraser, calculator, laptop, and all tools developed during the course (parser, crawler, query/url normalizers, and a working copy of Terrier). You will need this material for the practicum.

Some of the questions to be faced involve discussion and good reasoning like the ones discussed during the review session. Consider this one:

Question. String noise can be generated during markup removal, tokenization, filtration, and stemming, especially if we blindly remove apostrophes, possessives, contractions, and stopwords. In which order should you remove these so that a minimum of noise is generated?

See answer at the end.

Thank you for taking this course. By now you probably understand how search engine architectures are designed and actually work. At least you got the basics.

I have been asked to teach a graduate course on Text Mining next fall, under the title

Adversarial IR: Web Spam and Search Engines for Penetration Testing

From now to the fall, anything can happen.

Answer to question:

Step 1. Remove markup and then tokenize according to rules.
Step 2. Remove contractions and then possessives.
Step 3. Remove apostrophes and then stopwords.
Step 4. If applying stemming, use a flavored version of Porter’s algorithm.

This strategy is only as good as your regular expressions and parsing rules, and it must be applied on a case-by-case basis (e.g., take care with rules for hyphenated tokens). It is not perfect, but it is workable.
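
The ordering above can be sketched in Python. This is a minimal illustration, not the parser built during the course: the contraction table, the stopword list, and the function name `normalize` are assumptions made up for the example, contractions are handled by expanding them (one reading of "removing" them), and stemming (Step 4) is left out.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "can not"}
STOPWORDS = {"the", "is", "a", "of", "do", "not", "it", "can"}

def normalize(html: str) -> list[str]:
    # Step 1: remove markup, then tokenize on whitespace.
    text = re.sub(r"<[^>]+>", " ", html).lower().replace("\u2019", "'")
    out = []
    for tok in text.split():
        tok = tok.strip(".,;:!?\"()")
        # Step 2a: expand contractions while the apostrophe is still present.
        for part in CONTRACTIONS.get(tok, tok).split():
            # Step 2b: drop possessive endings ('s or a trailing apostrophe).
            part = re.sub(r"'s$|'$", "", part)
            # Step 3a: remove any remaining apostrophes.
            part = part.replace("'", "")
            # Step 3b: remove stopwords last.
            if part and part not in STOPWORDS:
                out.append(part)
    return out

print(normalize("<p>Don't touch the dog's bone.</p>"))  # ['touch', 'dog', 'bone']
```

If apostrophes were stripped first, "don't" would become "dont", which matches neither the contraction table nor the stopword list and survives as noise.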

Microsoft: Putting Dollar Value to Searches

22 Thursday May 2008

Posted by egarcia in Marketing Research, Miscellaneous

According to VNUNET.com,

Microsoft has introduced a service which offers ad-funded cash rebates to customers who search for and buy products.

The Live Search cashback portfolio includes more than 10 million product offers from more than 700 merchants.

Early adopters of the service include eBay, Barnes & Noble, Overstock.com, Sears and Zappos.com.

The cost-per-action (CPA) model is one of the best advertising models around, in which advertisers pay each time a click results in a sale. Combine this with a cash rebate plan for consumers and you have a win-win revenue model.

For those who were born to hate Microsoft, that giant of the software world: sure, they will find something wrong with this or any other move from Microsoft, simply because it comes from Bill Gates. I disagree with these folks, many of whom are eager to justify in their minds “the other Microsoft”; that is, the one of the search world: Google.

I remember when GoTo (later Overture) put a dollar value on searches. Many SEOs and average users swore not to use a search engine that allowed competitors and advertisers to buy their way to the top. Look around now. Google jumped in front of the parade “a la Microsoft” and a few lawsuits later we are where we are.

History repeats itself. Who will jump in front of the parade now?

Cell Phone Spam

19 Monday May 2008

Posted by egarcia in Marketing Research, Miscellaneous, Spam

Cell phone spam: Hum. Nothing new, but it is more prevalent than ever.

Yesterday a local newspaper (El Nuevo Dia) featured the article Los spams ahora atacan a los celulares (“Spam now attacks cell phones”), in which a few local sources were asked about the subject.

Unfortunately, they all seem to miss the point.

Telephone companies are undoubtedly making money from spam, and quite a lot. So, why kill the money-making machine? Duh!

Don’t just take my word for it. Look around for a second opinion like this one:

Verizon Won’t Help You Filter Out SMS Spam Because It Makes Them Money

If that is not enough, then check why

Angry Customers Sue T-Mobile Over Texting Charges.

Indeed, cell phone spam “is the perfect storm of annoying attributes. It audibly interrupts your life like telemarketing”.

Search Engines Architecture Week 10

16 Friday May 2008

Posted by egarcia in Graduate Courses, Search Engines Architecture Course

Week 10 Agenda

Lecture Session

Other Inverted Index Architectures
Divide-and-Conquer Strategies for Fast Indexing and Searching

Lab Session

Lectures and Lab Review

Final Examination Notes

Next week we have the final examination. This is an open book exam, with theory and practice sections.

To answer the test you need:

#2 pencil.
Calculator.
Working version of Terrier.
Tools developed during the course: parser, crawler, url and query normalizers, stemmer, etc.
Laptop (or a PC will be supplied to you).

IRW-2008-05:Search Engines Architecture Review Test

13 Tuesday May 2008

Posted by egarcia in Newsletters, Search Engines Architecture Course

Search Engine Architecture

How much do you know about the architecture of a search engine?

The current issue of IR Watch – The Newsletter is out (finally!).

It consists of a Theory and Practice Test I prepared for PUPR.edu graduate students taking my Search Engines Architecture course. As such, it deviates from the format used in previous issues of the newsletter. I thought it may be used to assess how much readers know about the architecture of a search engine. If you are an IR student, the test will probably help you to review basic concepts. In fact, the test is intentionally long since it was designed to serve as a comprehensive review. The actual exam is slightly different in terms of length and content.

In This Issue:

Introduction
Early Architectures
First/Deep-Breadth Crawlers
Search Agents
Dispatcher
Forward Index
Inverted Index
Query Servers
Front/Back-End Servers
Lexicon/Thesaurus
Posting Lists
Search Results
Tokenization
Filtration
Stemming
Search Modes
News, Research, and Events
Terms of Use and Copyright

Note to Students: The final exam will be on the 24th and is a Theory and Practice examination. Bring with you the tools developed during the course (web crawler, parser, stemmer, Terrier, query normalizer, etc.). These will be used during the test. If you still owe me a lab, be sure to turn it in this Saturday.

PowerSet Semantic Searches in Wikipedia

12 Monday May 2008

Posted by egarcia in Machine Learning, Search Engines Architecture Course

According to Reuters,

Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.

Powerset’s technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.

What does Google have to say about the topic?

According to PCWorld:

In an interview in October with IDG News Service, Marissa Mayer, Google’s vice president of Search Products & User Experience, acknowledged that the company’s search engine should — and will — overcome its keyword dependence in time.

“People should be able to ask questions and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like ‘what is this about?’. A lot of people will turn to things like the semantic Web as a possible answer to that,” she said.

But she added that Google’s search engine acts smart thanks to the humongous amount of data it crunches. “With a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force,” she said. As examples, she cited a query like “GM,” which the engine interprets as “General Motors” but if the query is “GM foods,” it delivers results for “genetically-modified foods.” “Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart, like it achieved that semantic understanding, but it hasn’t really,” she said.

Hmm…

Searches for GM goods and for GM in Powerset both return results relevant to General Motors, while Google does discriminate between these searches, possibly using brute force.

By contrast, searches for GM foods and for GM are discriminated by both engines.

PowerSet, Google, and almost all other search engines do not seem to discriminate between the following two semantically different searches, which counts against the aforementioned semantic analysis claims:

Who is the best college junior?

Who is the best junior college?

A simple change in word order affects the meaning and the information need behind the query. Semantic searches? There is still a long way to go. This is going to be a nice race to watch, from the architectural side.
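
To see why word order is invisible to a purely keyword-based matcher, compare the two queries as unordered term sets and as ordered word pairs (bigrams). This is a small sketch for illustration only; it makes no claim about how Powerset or Google actually process queries, and the helper names are made up.

```python
def terms(query: str) -> set[str]:
    # Bag-of-words view: lowercase, drop the question mark, ignore order.
    return set(query.lower().rstrip("?").split())

def bigrams(query: str) -> set[tuple[str, str]]:
    # Order-aware view: adjacent word pairs.
    words = query.lower().rstrip("?").split()
    return set(zip(words, words[1:]))

q1 = "Who is the best college junior?"
q2 = "Who is the best junior college?"

print(terms(q1) == terms(q2))      # True: identical as keyword sets
print(bigrams(q1) == bigrams(q2))  # False: the word pairs differ
```

An engine that scores only on the term set cannot tell these queries apart; any representation that keeps adjacency (phrases, bigrams, positions) at least has the information needed to do so.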

Talking about search engine architecture, the current issue of IRWatch – The Newsletter is the very same practice test I am giving to my grad students. Since they need to study for the finals, I thought I could kill two birds with one stone. It should reach subscribers’ inboxes today or, at the latest, tomorrow.

Search Engines Architecture Week 9

09 Friday May 2008

Posted by egarcia in Search Engines Architecture Course

Week 9 Agenda

Lecture Session

This lecture is an extension of our previous lecture. A detailed discussion of the search engine architecture floor plan of the following search engines is presented:

WebCrawler
Google

Main components to be discussed include:

crawler administration, indexing, forward index, inverted index, posting list intersection, fault tolerance, redundancy, query server load balancing, etc.
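
As a reminder of how a few of these components fit together, here is a minimal sketch of building an inverted index from a forward index and intersecting two posting lists. The toy documents and function names are assumptions for illustration; production engines add compression, skip pointers, and sharding, none of which are shown here.

```python
from collections import defaultdict

def build_inverted_index(forward_index: dict[int, list[str]]) -> dict[str, list[int]]:
    # Forward index: doc id -> terms. Inverted index: term -> sorted doc ids.
    inverted = defaultdict(set)
    for doc_id, doc_terms in forward_index.items():
        for term in doc_terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    # Linear merge of two sorted posting lists (an AND query).
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

forward = {1: ["web", "crawler"], 2: ["web", "index"], 3: ["crawler", "index", "web"]}
index = build_inverted_index(forward)
print(intersect(index["web"], index["crawler"]))  # [1, 3]
```

The merge visits each posting at most once, so the cost is proportional to the combined length of the two lists.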

If we don’t run out of time, we will get into Lucene and other architectures.

Lab Session

Complete previous labs.

Important Note

I am finishing the Practice Test; it should be in students’ inboxes by Monday.

Search Engines Cache in the Times of Drug Busts

07 Wednesday May 2008

Posted by egarcia in Miscellaneous

One nice thing about modern search engines is that they give users access to cached pages. These are older versions of pages that reside, often precompressed, in a specific section of the engine’s architecture.

Unless the owner or administrator of a site instructs search engines (via metadata or a robots.txt file) not to cache a document, old versions will be available to end users via the cache command or via a cache link next to a search result.
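
One widely documented form of such metadata is a robots meta tag carrying the `noarchive` directive. The sketch below checks a page's HTML for that directive using Python's standard library; the class name, function name, and sample markup are invented for the example, and it does not cover the equivalent `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser

class NoArchiveDetector(HTMLParser):
    """Flags pages whose robots meta tag includes 'noarchive'."""

    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        # The content attribute is a comma-separated list of directives.
        directives = {d.strip() for d in content.split(",")}
        if name == "robots" and "noarchive" in directives:
            self.noarchive = True

def asks_not_to_cache(html: str) -> bool:
    detector = NoArchiveDetector()
    detector.feed(html)
    return detector.noarchive

page = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'
print(asks_not_to_cache(page))  # True
```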

This feature comes in handy for those who use search engines for intelligence purposes. A lot of useful information can be found by searching for cached documents. At the same time, once-glorious old pages can become unwanted.

Ask San Diego State University’s Marketing and Communication Department. Out of embarrassment, they just removed the document listed at http://advancement.sdsu.edu/marcomm/features/2006/compact.html, in which they featured a role-model student (Kenneth Ciaccio) who was arrested yesterday on charges in connection with an on-campus drug bust operation.

The page is still showing up in Google’s cache and reflects badly on SDSU and its Compact for Success program. To access it in Google, just do a search and click the cache link, or enter cache:url in the query box, where url is the address of the above document.

 

Pink Keywords: Optimization of Resumes and Job Applications

05 Monday May 2008

Posted by egarcia in Miscellaneous, Spam

The current slump in the US and PR economies, with so many local employers handing out pink slips, leads me to think about the importance of pink keywords.

These are keywords one would use to optimize resumes and job applications.

Now more than ever, recruiters, middle management, and HR departments need to look through zillions of resumes, looking for specific clues in the form of pinky keywords. This means that resumes and job applications must be optimized for such terms.

http://career-advice.monster.com/resume-writing-basics/Keyword-Challenge/home.aspx

The best way of finding good pinky keywords consists of selling employers their own crappy ads and job offers; that is, scanning employment ads, job offerings, and classifieds relevant to the target position and then using the target terms in your own resume. Another thing one can do is expand these with related or contextual terms; of course, using only those that match your own experience and skills.
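
A rough sketch of that scan: count the terms that recur across a set of job ads and list the frequent ones missing from a resume. The sample texts, the stopword list, and the `min_ads` threshold are invented for illustration; a real tool would want better tokenization and phrase detection.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "for", "we", "is"}

def tokenize(text: str) -> set[str]:
    # Unique lowercase terms, minus stopwords.
    return {t for t in re.findall(r"[a-z0-9+#]+", text.lower()) if t not in STOPWORDS}

def missing_keywords(ads: list[str], resume: str, min_ads: int = 2) -> list[str]:
    # Document frequency: the number of ads in which each term appears.
    df = Counter(term for ad in ads for term in tokenize(ad))
    have = tokenize(resume)
    return sorted(t for t, n in df.items() if n >= min_ads and t not in have)

ads = [
    "We need a Java developer with SQL and Lucene experience",
    "Java engineer for search team, Lucene and Python required",
    "Python developer with SQL skills",
]
resume = "Experienced Java developer, strong SQL background"
print(missing_keywords(ads, resume))  # ['lucene', 'python']
```

The output is only a list of candidates; a term belongs in the resume only if it matches real experience and skills.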

I see here an opportunity for ethical SEO companies to provide a valuable and noble service: Pinky Optimization. At the same time, I see an opportunity for crooked SEOs and spammers to prey on other people’s misfortune. Since many in the SEO sphere have been disposed of by fat cats and sold(soul)-outs, these folks are also job searching. Life’s ironies.

Search Engines Architecture Week 8

02 Friday May 2008

Posted by egarcia in Search Engines Architecture Course

Week 8 Agenda

Lecture Session

To understand the present we need to look at the past.

Thus, in this lecture we will take an in-depth look at early and current search engine architectures and their “floor plans”. We will look at some of the published papers that started what we have today. Some hard-to-find material will be used. These are actual pieces of history that explain a few “how-did-they-do-that” questions.

Since we cannot cover all the search engine architectures we would like to, I have selected a few of them. These are classified into three categories: early, old glory, and current. The last two are open source projects.

I might add a few more. Either way, this lecture might cover two weeks.

Early:

Archie
ALIWEB
WWW Wanderer
WWW Worm
JumpStation
RBSE

Old Glory:

WebCrawler
Lycos

Current:

Google
Lucene
Terrier

Lab Session

Complete previous lab.
