IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Search Engines Architecture Course

The Web Crawler is Back!

26 Tuesday Mar 2013

Posted by egarcia in Data Mining, IR Tools, Programming, Search Engines Architecture Course, Software, Web Mining Course

Our popular tool, The Web Crawler, is back! This new iteration of the tool is much faster because it is based on a different strategy: extracting sets of HREFs and then refining them to get URLs that qualify for status checks. As a result, the tool also works as a link checker.

Another advantage of the above strategy is this:

A set of HREFs may contain information about absolute and relative URLs, visible and hidden links, internal and external file paths, email addresses, CSS files, local JavaScript calls, and anchors (#). A subset of HREFs can also be used as pointers to anchor text information. So, a set of HREFs can be more informative than a mere set of links or URLs, as it subsumes both.
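
The two-stage strategy described above (collect the raw HREF set, then refine it into URLs that qualify for status checks) can be illustrated with a minimal Python sketch. This is not the tool's actual code: the names HrefCollector, extract_hrefs, and qualify are hypothetical, and the qualification rule (keep only de-duplicated absolute http/https URLs) is an assumption made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urldefrag


class HrefCollector(HTMLParser):
    """Collects every href attribute value found in a page, unfiltered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value.strip())


def extract_hrefs(html):
    """Stage 1: return the raw HREF set (links, mailto:, anchors, CSS, etc.)."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


def qualify(hrefs, base_url):
    """Stage 2 (assumed rule): keep unique absolute http(s) URLs only."""
    qualified = []
    seen = set()
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # empty value or in-page anchor
        # Resolve relative paths against the page URL and drop any fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # mailto:, javascript:, and similar non-fetchable values
        if absolute not in seen:
            seen.add(absolute)
            qualified.append(absolute)
    return qualified
```

The raw list returned by extract_hrefs keeps everything, which is the point made above: email addresses and anchors remain available for other analyses, while qualify yields only the subset worth sending to a status checker.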

IRW:2010-9: Inverted Index Architectures Part Two

30 Thursday Sep 2010

Posted by egarcia in IR Tools, Newsletters, Search Engines Architecture Course

The current issue of IRW is out!

This is Part Two of the series on inverted index architectures, a 3-part series organized as follows:

Part One: Inverted Index Types
Part Two: Fast Indexing Techniques
Part Three: Fast Intersecting and Sharding

Tasks related to indexing, searching, and processing are also discussed.

The QA section features short JavaScript code snippets aimed at helping readers understand what tokenization is and how it is implemented.

Although not described in the newsletter, it is possible to construct these types of components with scripting languages. As a matter of fact, we have built an entire forward index and inverted index written entirely in JavaScript. Once computed, the inverted index can be kept in memory. This works for small collections. For large collections, we read and write it to a text file using ActiveX, and its posting lists are then intersected in the usual way. However, for really large collections this is not effective and a database solution is recommended. The point is that constructing a JavaScript-based search engine at the client with real components, not a mere oversized look-up “site search tool”, is possible. Since ActiveX is Microsoft’s territory, it is not a universal solution. As a quick enterprise solution for small collections, it is OK, I guess.
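
To make the components mentioned above concrete, here is a minimal in-memory sketch of a forward index, an inverted index, and posting list intersection. It is written in Python rather than JavaScript and is not the implementation referred to in the post. The tokenizer (lowercase, split on non-alphanumeric characters) and all function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters (assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_forward_index(docs):
    """Forward index: document id -> list of tokens in that document."""
    return {doc_id: tokenize(text) for doc_id, text in docs.items()}


def build_inverted_index(forward_index):
    """Inverted index: term -> sorted posting list of document ids."""
    postings = defaultdict(set)
    for doc_id, tokens in forward_index.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def search(inverted_index, query):
    """AND query: intersect the posting lists of all query terms."""
    terms = tokenize(query)
    if not terms:
        return []
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((inverted_index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result
```

Keeping the posting lists sorted is what lets the intersection run in a single linear pass over both lists, which is the usual reason for storing them that way.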

Matrix Algebra for Search Marketing

23 Monday Aug 2010

Posted by egarcia in IR Quizzes, Latent Semantic Indexing, Search Engines Architecture Course

Today I feel like giving away a little quiz material on applied linear algebra. The topic is relevant these days, when some misleading SEOs are playing the we-do-“science” game (quack “science”, after all).

The following is taken from the Search Engines Architecture grad course I taught back in 2008. I’m providing only one exercise with multiple parts. The quiz with answers might be a great topic for an IRW issue.

1.1 A search engine has three types of revenue channels: pay-per-click (PPC), pay-per-placement (PPP), and pay-for-conversion (PFC). In quarter 1, the million-dollar revenues respectively were: 20, 4, and 9. In quarter 2, PPC revenues were 20% less, PPP revenues doubled, and PFC revenues remained constant.

1.1.1 Write a matrix M1 expressing the revenue and quarter vectors for the first two quarters.

1.1.2 If the goal in quarter 3 is to increase by 20% all revenues earned in quarter 2, update M1 so it reflects such a goal as a new matrix M2.

1.1.3 If the goal in quarter 4 is to meet the average revenues of each of the previous channels in quarter 4, update M2 such that it reflects that goal as a new matrix M3.

1.1.4 Express the above quarters as column unit vectors. Inspecting either rows or columns, construct a nearest neighbor similarity matrix Mnn and construct scalar clusters of quarters. Ignore cosine similarity deviations of 0.02 units or less. How similar are the quarters?
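
For readers who want to check their arithmetic, the following Python sketch sets up the data for parts 1.1.1 and 1.1.2 and provides a cosine similarity helper that can be used for part 1.1.4. It is one possible reading of the exercise, not an official answer key: the layout (rows as channels, columns as quarters) and the helper names are assumptions. Part 1.1.3 and the clustering step are left to the reader.

```python
import math

# Assumed layout: rows are channels (PPC, PPP, PFC), columns are quarters.
q1 = [20.0, 4.0, 9.0]                   # quarter 1 revenues, in millions
q2 = [q1[0] * 0.8, q1[1] * 2, q1[2]]    # PPC down 20%, PPP doubled, PFC unchanged
q3 = [x * 1.2 for x in q2]              # goal: 20% above every quarter 2 revenue


def columns_to_matrix(*columns):
    """Place each quarter vector as a column of a row-major matrix."""
    return [list(row) for row in zip(*columns)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


M1 = columns_to_matrix(q1, q2)          # 3 x 2 matrix for part 1.1.1
M2 = columns_to_matrix(q1, q2, q3)      # 3 x 3 matrix for part 1.1.2
```

Since quarter 3 is a scalar multiple of quarter 2, the two columns point in the same direction, so their cosine similarity is 1 up to rounding. That is a useful sanity check on whatever similarity matrix you build.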

Have fun.

Search Engines Architecture Week 11

23 Friday May 2008

Posted by egarcia in Search Engines Architecture Course

Final Examination

The course final examination is almost here. This is a theory and practice test. If you are taking the test, bring with you a No. 2 pencil, eraser, calculator, laptop, and all tools developed during the course (parser, crawler, query/url normalizers, and a working copy of Terrier). You will need this material for the practicum.

Some of the questions to be faced involve discussion and good reasoning like the ones discussed during the review session. Consider this one:

Question. String noise can be generated during markup removal, tokenization, filtration, and stemming, especially if we blindly remove apostrophes, possessives, contractions, and stopwords. In which order should you remove these so that a minimum of noise is generated?

See answer at the end.

Thank you for taking this course. By now you probably understand how search engine architectures are designed and actually work. At least you got the basics.

I have been asked to teach a graduate course on Text Mining next fall under the title

Adversarial IR: Web Spam and Search Engines for Penetration Testing

From now to the fall, anything can happen.

Answer to question:

Step 1. Remove markup and then tokenize according to rules.
Step 2. Remove contractions and then possessives.
Step 3. Remove apostrophes and then stopwords.
Step 4. If applying stemming, do so with a variant of Porter’s stemmer.

This strategy is only as good as your regular expressions and parsing rules, and can only be applied on a case-by-case basis (e.g., take care with rules for hyphenated tokens). It is not perfect, but it is workable.
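
Steps 1 to 3 above can be sketched in Python as follows. This is an illustration only, not the parser developed in the course. The regular expressions, the tiny stopword list, and the reading of “remove contractions” as stripping contraction suffixes are all assumptions. Stemming (Step 4) is omitted.

```python
import re

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"a", "an", "and", "at", "does", "is", "of", "the"}

# Words may carry internal apostrophes and one trailing apostrophe.
TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z]+)*'?")
# Contraction suffixes to strip (assumed list).
CONTRACTION = re.compile(r"(?:n't|'re|'ve|'ll|'d|'m)$")
# Singular possessive ('s) or plural possessive (s').
POSSESSIVE = re.compile(r"'s$|(?<=s)'$")


def preprocess(html):
    # Step 1: remove markup (naive tag stripping), then tokenize.
    text = re.sub(r"<[^>]+>", " ", html)
    text = text.replace("\u2019", "'").lower()   # normalize curly apostrophes
    tokens = TOKEN.findall(text)
    # Step 2: remove contractions, then possessives.
    tokens = [CONTRACTION.sub("", t) for t in tokens]
    tokens = [POSSESSIVE.sub("", t) for t in tokens]
    # Step 3: remove remaining apostrophes, then stopwords.
    tokens = [t.replace("'", "") for t in tokens]
    return [t for t in tokens if t and t not in STOPWORDS]
```

The order matters in this sketch. If apostrophes were stripped first, a token such as “doesn't” would become “doesnt”, which no longer matches either the contraction rule or the stopword list and survives as noise.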

Search Engines Architecture Week 10

16 Friday May 2008

Posted by egarcia in Graduate Courses, Search Engines Architecture Course

Week 10 Agenda

Lecture Session

Other Inverted Index Architectures
Divide-and-Conquer Strategies for Fast Indexing and Searching

Lab Session

Lectures and Lab Review

Final Examination Notes

Next week we have the final examination. This is an open book exam, with theory and practice sections.

To answer the test you need:

#2 pencil.
Calculator.
Working version of Terrier.
Tools developed during the course: parser, crawler, url and query normalizers, stemmer, etc.
Laptop (or a PC will be supplied to you).

IRW-2008-05:Search Engines Architecture Review Test

13 Tuesday May 2008

Posted by egarcia in Newsletters, Search Engines Architecture Course

How much do you know about the architecture of a search engine?

The current issue of IR Watch – The Newsletter is out (finally!).

It consists of a Theory and Practice Test I prepared for PUPR.edu graduate students taking my Search Engines Architecture course. As such, it deviates from the format used in previous issues of the newsletter. I thought it might be used to assess how much readers know about the architecture of a search engine. If you are an IR student, the test will probably help you review basic concepts. In fact, the test is intentionally long since it was designed to serve as a comprehensive review. The actual exam is slightly different in terms of length and content.

In This Issue:

Introduction
Early Architectures
First/Deep-Breadth Crawlers
Search Agents
Dispatcher
Forward Index
Inverted Index
Query Servers
Front/Back-End Servers
Lexicon/Thesaurus
Posting Lists
Search Results
Tokenization
Filtration
Stemming
Search Modes
News, Research, and Events
Terms of Use and Copyright

Note to Students: The final exam will be on the 24th and is a Theory and Practice examination. Bring with you the tools developed during the course (web crawler, parser, stemmer, Terrier, query normalizer, etc.). These will be used during the test. If you still owe me a lab, be sure to turn it in this Saturday.

PowerSet Semantic Searches in Wikipedia

12 Monday May 2008

Posted by egarcia in Machine Learning, Search Engines Architecture Course

According to Reuters,

Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.

Powerset’s technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.

What does Google have to say about the topic?

According to PCWorld:

In an interview in October with IDG News Service, Marissa Mayer, Google’s vice president of Search Products & User Experience, acknowledged that the company’s search engine should — and will — overcome its keyword dependence in time.

“People should be able to ask questions and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like ‘what is this about?’. A lot of people will turn to things like the semantic Web as a possible answer to that,” she said.

But she added that Google’s search engine acts smart thanks to the humongous amount of data it crunches. “With a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force,” she said. As examples, she cited a query like “GM,” which the engine interprets as “General Motors” but if the query is “GM foods,” it delivers results for “genetically-modified foods.” “Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart, like it achieved that semantic understanding, but it hasn’t really,” she said.

Hmm…

In Powerset, searches for GM goods and for GM both return results relevant to General Motors, while Google does discriminate between these searches, possibly using brute force.

By contrast, searches for GM foods and for GM are discriminated by both engines.

PowerSet, Google, and almost all search engines do not seem to discriminate between the following two semantically different searches, which counts against the aforementioned semantic analysis claims:

Who is the best college junior?

Who is the best junior college?

A simple change in word order affects the meaning and the information need being expressed. Semantic searches? There is still a long way to go. This is going to be a nice race to watch, from the architectural side.
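
Why word order can be lost is easy to illustrate. The Python sketch below is not a description of how Powerset or Google actually process queries. It only shows that any order-insensitive term model (a “bag of words”) sees the two queries as identical, while even a simple order-aware representation such as word bigrams tells them apart. The function names are hypothetical.

```python
from collections import Counter


def terms(query):
    """Lowercase, drop the question mark, and split on whitespace."""
    return query.lower().strip("?").split()


def bag_of_words(query):
    """Order-insensitive representation: term -> count."""
    return Counter(terms(query))


def bigrams(query):
    """Order-aware representation: list of adjacent term pairs."""
    t = terms(query)
    return list(zip(t, t[1:]))


query_a = "Who is the best college junior?"
query_b = "Who is the best junior college?"

same_bag = bag_of_words(query_a) == bag_of_words(query_b)
same_bigrams = bigrams(query_a) == bigrams(query_b)
```

Here same_bag is True and same_bigrams is False: the term counts match exactly, but the pair (college, junior) appears only in the first query and (junior, college) only in the second.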

Speaking of search engine architecture, the current issue of IRWatch – The Newsletter is the very same practice test I am giving to my grad students. Since they need to study for the finals, I thought I could kill two birds with one stone. It should reach subscribers’ inboxes today or, at the latest, tomorrow.

Search Engines Architecture Week 9

09 Friday May 2008

Posted by egarcia in Search Engines Architecture Course

Week 9 Agenda

Lecture Session

This lecture is an extension of our previous lecture. A detailed discussion of the search engine architecture floor plan of the following search engines is presented:

WebCrawler
Google

Main components to be discussed include:

crawler administration, indexing, forward index, inverted index, posting list intersection, fault tolerance, redundancy, query server load balancing, etc.

If we don’t run out of time, we will get into Lucene and other architectures.

Lab Session

Complete previous labs.

Important Note

I am finishing the Practice Test, and it should be in students’ inboxes by Monday.

Search Engines Architecture Week 8

02 Friday May 2008

Posted by egarcia in Search Engines Architecture Course

Week 8 Agenda

Lecture Session

To understand the present we need to look at the past.

Thus, in this lecture we will take an in-depth look at early and current search engine architectures and their “floor plans”. We will look at some of the published papers that started what we have today. Some hard-to-find material will be used. These are actual pieces of history that explain a few “how-did-they-do-that” questions.

Since we cannot cover all the search engine architectures we would like to, I have selected a few of them. These are classified into three categories: early, old glory, and current. The last two are open source projects.

I might add a few more. Either way, this lecture might cover two weeks.

Early:

Archie
ALIWEB
WWW Wanderer
WWW Worm
JumpStation
RBSE

Old Glory:

WebCrawler
Lycos

Current:

Google
Lucene
Terrier

Lab Session

Complete previous lab.

Search Engines Architecture Week 7

25 Friday Apr 2008

Posted by egarcia in Programming, Search Engines Architecture Course

Week 7 Agenda

Lecture Session

Review of a typical Search Engine Architecture
Regular Expressions for Building Porter’s Stemmer

Lab Session

Finishing Lab 3 and 4
