Stay ahead here, with a new search experience:
Our popular tool, The Web Crawler, is back! This new iteration is much faster because it is based on a different strategy: extracting sets of HREFs and then refining these into URLs that qualify for status checks. Thanks to this, the tool also works as a link checker.
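A minimal sketch of that extract-then-refine strategy, using only the Python standard library; the class and function names here are my own illustration, not the tool's actual code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

class HrefExtractor(HTMLParser):
    # Pass 1: collect the raw set of HREF values found on a page.
    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.add(value)

def refine(hrefs, base):
    # Pass 2: resolve relative links against the page URL and keep
    # only http(s) URLs, i.e., those that qualify for status checks.
    urls = set()
    for href in hrefs:
        url = urljoin(base, href)
        if urlparse(url).scheme in ("http", "https"):
            urls.add(url)
    return urls

def status(url, timeout=10):
    # Pass 3: the link-checker step; returns the HTTP status code,
    # or None if the request fails.
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None
```

Refining the raw HREF set before issuing any requests is what makes this cheaper than checking every link as it is found.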
Another advantage of the above strategy is this:
The current issue of IRW is out!
This is Part Two of the series on inverted index architectures, a 3-part series organized as follows:
Part One: Inverted Index Types
Part Two: Fast Indexing Techniques
Part Three: Fast Intersecting and Sharding
Tasks related to indexing, searching, and processing are also discussed.
Today I feel like giving away a little quiz material on applied linear algebra. The topic is relevant these days, when some misleading SEOs are playing the we-do-“science” game (quack “science,” after all).
The following is taken from the Search Engines Architecture grad course I lectured back in 2008. I’m providing only one exercise with multiple parts. The quiz with answers might be a great topic for an IRW issue.
1.1 A search engine has three types of revenue channels: pay-per-click (PPC), pay-per-placement (PPP), and pay-for-conversion (PFC). In quarter 1, the revenues, in millions of dollars, were 20, 4, and 9, respectively. In quarter 2, PPC revenues were 20% lower, PPP revenues doubled, and PFC revenues remained constant.
1.1.1 Write a matrix M1 expressing the revenue and quarter vectors for the first two quarters.
1.1.2 If the goal in quarter 3 is to increase by 20% all revenues earned in quarter 2, update M1 so it reflects such a goal as a new matrix M2.
1.1.3 If the goal in quarter 4 is for each channel to meet its average revenue over the previous quarters, update M2 so that it reflects that goal as a new matrix M3.
1.1.4 Express the above quarters as column unit vectors. Inspecting either rows or columns, construct a nearest-neighbor similarity matrix Mnn and construct scalar clusters of quarters. Ignore cosine similarity deviations of 0.02 units or less. How similar are the quarters?
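As a setup sketch only (not the answer key), the given figures for 1.1.1 can be arranged with quarters as columns, and the cosine similarity needed in 1.1.4 computed directly. The helper names below are mine:

```python
import math

# Revenues in millions: rows are channels (PPC, PPP, PFC), columns are quarters.
# Q1 is given; Q2 applies the stated changes (PPC -20%, PPP doubled, PFC constant).
M1 = [
    [20, 20 * 0.8],  # PPC
    [4, 4 * 2],      # PPP
    [9, 9],          # PFC
]

def column(m, j):
    # Extract quarter j as a column vector.
    return [row[j] for row in m]

def cosine(u, v):
    # Cosine similarity between two quarter vectors, as used in 1.1.4.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))
```

The remaining parts (M2, M3, the unit vectors, and the clustering) follow the same layout and are left as the exercise intends.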
The course final examination is almost here. It is a theory and practice test. If you are taking it, bring a No. 2 pencil, an eraser, a calculator, a laptop, and all the tools developed during the course (parser, crawler, query/URL normalizers, and a working copy of Terrier). You will need this material for the practicum.
Some of the questions involve discussion and careful reasoning, like the ones covered during the review session. Consider this one:
Question. String noise can be generated during markup removal, tokenization, filtration, and stemming, especially if we blindly remove apostrophes, possessives, contractions, and stopwords. In which order should you remove these so that a minimum of noise is generated?
See answer at the end.
Thank you for taking this course. By now you probably understand how search engine architectures are designed and actually work. At least you got the basics.
I have been asked to teach a graduate course on Text Mining next fall, under the title
Adversarial IR: Web Spam and Search Engines for Penetration Testing
From now to the fall, anything can happen.
Answer to question:
Step 1. Remove markup and then tokenize according to rules.
Step 2. Remove contractions and then possessives.
Step 3. Remove apostrophes and then stopwords.
Step 4. If applying stemming, do so according to a flavored version of Porter’s.
This strategy is only as good as your regular expressions and parsing rules, and it can only be applied on a per-case basis (e.g., use caution with rules for hyphenated tokens). It is not perfect, but it is workable.
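The four steps above can be sketched as a small pipeline. This is a toy illustration under my own assumptions (the stopword list and contraction map are deliberately tiny, and the regexes are simplistic), not a production cleaner:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "it", "of", "and"}   # toy list
CONTRACTIONS = {"it's": "it is", "don't": "do not"}       # toy map

def clean(text):
    # Step 1: remove markup, then tokenize according to rules.
    text = re.sub(r"<[^>]+>", " ", text)
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    # Step 2: expand contractions, then strip possessives.
    expanded = []
    for t in tokens:
        expanded.extend(CONTRACTIONS.get(t, t).split())
    tokens = [re.sub(r"'s$", "", t) for t in expanded]
    # Step 3: remove remaining apostrophes, then stopwords.
    tokens = [t.replace("'", "") for t in tokens]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    # Step 4: stemming (e.g., a Porter variant) would go here.
    return tokens
```

Note how reversing steps 2 and 3 would destroy the contractions and possessives before they could be handled, which is exactly the noise the ordering avoids.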
Week 10 Agenda
Other Inverted Index Architectures
Divide-and-Conquer Strategies for Fast Indexing and Searching
Lectures and Lab Review
Final Examination Notes
Next week we have the final examination. This is an open book exam, with theory and practice sections.
To take the test you will need:
A working version of Terrier.
The tools developed during the course: parser, crawler, URL and query normalizers, stemmer, etc.
A laptop (or a PC will be supplied to you).
How much do you know about the architecture of a search engine?
The current issue of IR Watch – The Newsletter is out (finally!).
It consists of a Theory and Practice Test I prepared for PUPR.edu graduate students taking my Search Engines Architecture course. As such, it deviates from the format used in previous issues of the newsletter. I thought it might be used to assess how much readers know about the architecture of a search engine. If you are an IR student, the test will probably help you review basic concepts. In fact, the test is intentionally long, since it was designed to serve as a comprehensive review. The actual exam is slightly different in length and content.
In This Issue:
News, Research, and Events
Note to Students: The final exam will be on the 24th and is a Theory and Practice examination. Bring with you the tools developed during the course (web crawler, parser, stemmer, Terrier, query normalizer, etc.). These will be used during the test. If you still owe me a lab, be sure to turn it in this Saturday.
According to Reuters,
Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.
Powerset’s technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.
What does Google have to say about the topic?
According to PCWorld:
In an interview in October with IDG News Service, Marissa Mayer, Google’s vice president of Search Products & User Experience, acknowledged that the company’s search engine should — and will — overcome its keyword dependence in time.
“People should be able to ask questions and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like ‘what is this about?’. A lot of people will turn to things like the semantic Web as a possible answer to that,” she said.
But she added that Google’s search engine acts smart thanks to the humongous amount of data it crunches. “With a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force,” she said. As examples, she cited a query like “GM,” which the engine interprets as “General Motors” but if the query is “GM foods,” it delivers results for “genetically-modified foods.” “Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart, like it achieved that semantic understanding, but it hasn’t really,” she said.
A search for GM goods and for GM in Powerset returns results relevant to General Motors, while Google does discriminate between these searches, possibly using brute force.
By contrast, searches for GM foods and for GM are discriminated by both.
PowerSet, Google, and almost all search engines do not seem to discriminate between the following two semantically different searches, which argues against the aforementioned semantic-analysis claims:
Who is the best college junior?
Who is the best junior college?
A simple change in word order affects the meaning and the information needs sought. Semantic searches? There is still a long way to go. This is going to be a nice race to watch, from the architectural side.
Speaking of search engine architecture, the current issue of IRWatch – The Newsletter is the very same practice test I am giving to my grad students. Since they need to study for the finals, I thought I could kill two birds with one stone. It should reach subscribers’ inboxes today or, at the latest, tomorrow.
Week 9 Agenda
This lecture is an extension of our previous lecture. A detailed discussion of the architectural floor plan of the following search engines is presented:
Main components to be discussed include:
crawler administration, indexing, the forward index, the inverted index, posting-list intersection, fault tolerance, redundancy, query-server load balancing, etc.
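Of the components above, posting-list intersection is the one most easily shown in a few lines. A textbook merge-style sketch (my own illustration; real engines add skip pointers and compression on top of this):

```python
def intersect(p, q):
    # Merge-style intersection of two sorted posting lists of doc IDs.
    # Runs in O(len(p) + len(q)) by advancing the pointer behind the
    # smaller current doc ID.
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i])
            i += 1
            j += 1
        elif p[i] < q[j]:
            i += 1
        else:
            j += 1
    return out
```

This linear merge is why engines keep posting lists sorted by doc ID in the first place.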
If we don’t run out of time, we will get into Lucene and other architectures.
Complete previous labs.
I am finishing the Practice Test; it should be in students’ inboxes by Monday.
Week 8 Agenda
To understand the present we need to look at the past.
Thus, in this lecture we will take an in-depth look at early and current search engine architectures and their “floor plans”. We will look at some published papers that started what we have today. Some hard-to-find material will be used. These are actual pieces of history that explain a few of the “how did they do that?” questions.
Since we cannot cover all of the search engine architectures we would like to, I have selected a few of these, classified in three categories: early, old glory, and current. The last two are open source projects.
I might add a few more. Either way, this lecture might cover two weeks.
Complete previous lab.