Search Engines Architecture Week 9

May 9, 2008 by E. Garcia

Week 9 Agenda

Lecture Session

This lecture is an extension of our previous lecture. A detailed discussion of the search engine architecture floor plan of the following search engines is presented:

WebCrawler
Google

Main components to be discussed include:

crawlers administration, indexing, forward index, inverted index, posting lists intersection, faulty tolerance, redundancy, query servers load balance, etc.

If we don’t run out of time, we will get into Lucene and other architectures.

Lab Session

Complete previous labs.

Important Note

I am finishing the Practice Test and should be in students inbox by Monday.

Search Engines Cache in the Times of Drug Busts

May 7, 2008 by E. Garcia

One nice thing about modern search engines is that these allow users access to cached pages. These are old version pages that reside -often precompressed- in a specific section of their architecture. 

Unless the owner or administrator of a site instructs search engines (via metadata or a robot text file) not to cache a document(s) old versions will be available to the end users via the cache command or via a cache link next to a search result.

This feature comes handy for those that use search engines for intelligence purposes. A lot of useful information can be found by searching for cached documents. At the same times old glorious pages can be become unwanted.

Ask San Diego State University’s Marketing and Communication Department. Out of embarrassment, they just removed the document listed at http://advancement.sdsu.edu/marcomm/features/2006/compact.html in which they feature a role model student (Kenneth Ciaccio), which yesterday was arrested on charges in connection with an on campus drug bust operation.

The page is still showing up in Google’s cache and reflects bad on SDSU and its Compact for Success program. To access this in Google just do a search and click the cache link or enter in the query box cache:url where url is the address of the above document.

 

Pink Keywords: Optimization of Resumes and Job Applications

May 5, 2008 by E. Garcia

The current slump in the US and PR economy and so many local employers giving pink slips induces me to think of the importance of pink keywords.

These are keywords one would use to optimize resumes and job applications.

Now than ever recruiters, middle management, and HR departments need to look through zillion of resumes, looking for specific clues in the form of pinky keywords. This means that resumes and job applications must be optimized for such terms.

http://career-advice.monster.com/resume-writing-basics/Keyword-Challenge/home.aspx

The best way of finding good pinky keywords consists in selling to employers their own crappy ads and job offers; that is, by scanning employment ads, job offerings, and classifieds relevant to the target position one is interested in and then using the target terms in your own resume. Another thing one can do is to expand these with related or contextual terms; of couse, using those that match your own experience and skills.

I see here an opportunity for ethical SEO companies to provide a valuable and noble service: Pinky Optimization. At the same time I see an opportunity for crook SEOs and spammers to prey on other people’s misfortune. Since many in the seophere have being disposed by fat cats and sold(soul)-outs, these folks are also job searching. Life ironies.

Search Engines Architecture Week 8

May 2, 2008 by E. Garcia

Week 8 Agenda

Lecture Session

To understand the present we need to look at the past.

Thus, in this lecture we will take an in-depth look at early and current search engine architectures and their “floor plan”. We look at some published papers that started what we have today. Some hard-to-find material will be used. These are actual pieces of history that explain few “how-did-they-do-that”..

Since we cannot cover all the search engine architectures as we would like to, I have selected few of these. These were classified in three category:early, old glory, and currrent. The last two are open source projects.

I might add few more. Either way, this lecture might cover two weeks.

Early:

Archie
ALIWEB
WWW Wanderer
WWW Worm
JumpStation
RBSE

Old Glory:

WebCrawler
Lycos

Current:

Google
Lucene
Terrier

Lab Session

Complete previous lab.

For SEO Spammers: AIRWeb 2008 Presentations

April 29, 2008 by E. Garcia

To facilitate mainstream dissemination of the manuscripts presented at AIRWeb 2008 here are the papers as listed over at http://airweb.cse.lehigh.edu/2008/program.html

SEO spammers, whether your life gravitates around a “social network circus” or ”link building” or not, it is time to revisit your drawing board.

8:30 - 10:00

10:30 - 12:00

13:30 - 15:00

15:30 - 17:00

  • Web Spam Challenge
    • (5 min.) Description of the challenge
    • (12 min.) Data Analysis School, Moscow slides
      Konstantin Bauman, Alexey Brodskiy, Sergey Kacher, Elmira Kalimulina, Ruslan Kovalev, Mikhail Lebedev, Dmitry Orlov, Pavel Sushin, Pavel Zryumov, Dmitry Leshchiner and Ilya Muchnik
    • (12 min.) Computer and Automation Research Institute, Hungarian Academy of Sciences slides
      David Siklosi, Andras Benczur
    • (12 min.) Institute of Automation, Chinese Academy of Sciences, Beijing slides
      Guanggang Geng, Xiaobo Jin and Chunheng Wang
    • (5 min.) Announcement of results
  • Panel
    • (45 min.): The Future of Adversarial IR on the Web
      Amit Aggarwal, Zoltán Gyöngyi, Alexandros Ntoulas, Erik Selberg, and Andrew Tomkins

Search Engines Architecture Week 7

April 25, 2008 by E. Garcia

Week 7 Agenda

Lecture Session

Review of a typical
Search Engine Architecture
Regular Expressions for Building Porter’s Stemmer

Lab Session

Finishing Lab 3 and 4

SEOs - Desperate Seeking Clients

April 24, 2008 by E. Garcia

From time to time I receive unsolicited emails from SEOs offering me their services, to list my site in the major search engines and directories. They often send templates-like automatic messages (”Dear website owner”) and appear not to even bother to check if recipients need the service. 

These SEOs often look desperate and sound like snakeoil sellers and crooks. They even claim to be better than other SEOs.

They often pitch the same crap:

  • “I recently visited your site” (Really? Why then send this crap?).
  • “you are not listed in the top search engines and directories” (Really? How do they know?).
  • “we can increase your traffic by X astronomical amount” (Really? Could you double X for me, please?).
  • “we can help you get top rankings in Google” (Really? For which keywords?).
  • “our link building program” (Really? Read here link exchange and link spam).
  • “we have proprietary crap, blah, blah, …” (Really? Sell it or get a patent!).

I just received one of such emails last night, even when my site is known in the IR/SEO spheres and has been listed for many years in the top search engines and directories, and ranking well.

Dear website owner,

I visited your website and noticed that you are not listed in many of the major search engines and directories. If our company can increase your traffic up to 500% by getting you top ranking results on the search engines such as Google would you be interested? We specialize in link building content writing and programming. We have proprietary techniques that work better and are less expensive than any other SEO firm.

Please let me send you a proposal and show you how we can make your website profitable.

Sincerely,

Christian Frank

2060 AVENIDA DE LOS ARBOLES, STE D
THOUSAND OAKS,
CA 91362-1361 - USA

These are the type of companies that give a black eye to the SEO industry. If SEOs send you this type of crap, I feel your pain. Stay away from their businesses or whatever they claim or seem to offer.

Building the Porter Stemmer

April 22, 2008 by E. Garcia

This post is for grad students taking the Search Engines Architecture course.

You should have in your email inbox a pdf of Porter’s original article from 1980 and a revised version of Lab 5 to follow the nomenclature utilized by Porter, along with the expected results to check against. Please disregard previous lab version.

The Stemmer is easy to build in any language. You can take a look at some versions on the Web to get some ideas, but you cannot copy these. Your tool should only do what is required in the experiment.

Deadline will be negotiated next time we meet. If you have any questions, feel free to blog these, whether these are about programming or content.

Important quotes or notes from Porter’s original article

1. “A consonant is a word other than a, e, i, o, u or a letter other than y preceded by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of itself does not make it ambiguous.) So in toy the consonants are t and y, in syzygy they are s, z, and g. If a letter is not a consonant it is a vowel.”–Porter.

2. Any word has a measure m, defined as the VC frequency (‘shift’) or number of times the VC pair is repeated, where V is a vowel sequence and C is a consonant sequence; e.g.,

m = 0 tr, ee, y, by
m = 1 trouble, oats, trees, ivy
m = 2 troubles, private, oaten, orrery

Vector Space Models and Search Engines

April 21, 2008 by E. Garcia

This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/

That said, today’s post is in reaction to the article at
http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

The title of that article sends the message that search engines work by using tf*IDF. In addition, IDF itself is mistaken. It is not clear from the article how vector space models work or are used by search engines. The author then seem to agree with few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.

Blockquoted passages and my comments are next.

Blockquoted Passage

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (”the reluctance to stop talking about a certain subject” …)would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document with.

Comments

At grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. More likely no current top search engine uses plain tf, IDF or tf*IDF to rank docs. How IDF works?

Let N be number of docs in a collection and n be number of docs containing a given term. The probability of randomly choosing a doc containing a given term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as

IDF = log(1/p) = log(N/n)

Logs at a given base (often 2 or 10) are used.

Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.

Frequently used terms have a small discriminatory power, regardless if they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF) regardless if they are relevant to the topic of the document mentioning these. Term relevancy and the discriminatory power of a term not always run “side by side”. Some times they do, though.

The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.

For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” probably would be relevant to the doc and be discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the “job” term is still relevant and meaningful to the document, but more likely will lose its discriminatory power within the collection. And we haven’t considering yet how relevant “job” or the documents containing the term is to end-users looking for jobs.

IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.

For instance, we can use IDF probabilistic (IDFP) by considering the odds (p/(1 - p)) instead of just p. Inverting and taking logs,

IDFP = ((N - n)/n)

If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.

Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html  

There are other ways of defining goblal weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.

Blockquoted Passage

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.

Comments

Like global weights, there are dozen of ways we can define local weights, L. In the original vector space model, L = f was used, where f is the frequency (occurrence) of a term in a doc.

This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation to use?

The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that atttenuate frequencies are better choices.

L = f/fmax
L = 1 + log(f)
L = log(1 + f)

etc, etc, etc.

Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.

There are many ways of defining L and G, not to mention variants of document normalization weights N. These then give a product weight:

W = L*G*N

A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of large and short documents.

As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.

Blockquoted Passage

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

Queries as Vectors

Every word in a query can also be shown as a vector.

By looking at documents that are “near” our query we can rank (sort) documents in our result set.

Comments

In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.

However, in LSI every word of a doc can be represented as a vector, but this is not what the referred article discusses.

Blockquoted Passage (from one commenter)

It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.

Comments

Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.

There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to conform an addressable “book” of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. We can put in the posting lists different doc features, like f values, word positions, word spacing, in-title, etc.

The index can be computed and already be in cache before any query. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. For each term, IDFs are already precomputed in advanced. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over entire collection).

From matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note these can be a predetermined number of docs from the index. Thus, if million docs are matched by the posting list(s) only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.

For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resource to precaching strategies, so a similar analysis can also be done offline, too, (e.g. from a pool of frequently queried terms –the so-called suggestion lists). We can also use prebuilded thesaurus to find similar docs, impacting precision and recall. Processes can be called by geolocation to please a specifc region or regional directory, etc.

For rather small collections, the term-doc matrix can be from all terms in the little collection that is stored on disk or small term-doc matrices can be constructed in advanced from each little posting list. All these, done off-line and before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, results postprocessed, ranked, and presented to the end user under fractions of a second.

To boldy suggest that vector models are not used by top search engines for ranking docs is plain non sense. Still, link weight only or vector similarity scores only are not enough. These scores can be combined with link weights or with other analytics, to get a final score.

Combining those scores simply does not simplifies computation, but adds another complexity layer while not doing it can leave out meaningful docs and queries.

Note

Since SEOS love to quote each other and I love to quote IRs, let’s have a happy medium and quote both:

http://www.webpronews.com/topnews/2001/09/05/google-interview-by-fredrick-marckini

One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html  which incidentally mentions tf and IDF.

Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explains in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustive searching the collection.

Blockquoted Passage

Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.

Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

Comments
Why agree to SEO hearsay? See previous comments

Also depending on how was an index constructed, a query no need to “travel” an entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhausting searches. That’s why it is known as an “addressable book” or “addressable tree”.

Furthermore, literature on vector models for phrases can be searched on the Web. IDF for phrases are certainly computable. To snoop at the subject or about inverted index strategies, index segmentation, index merging, etc. read http://lucene.sourceforge.net/talks/inktomi/  or just do a search for “phrase idf”.

Conclusion

To sum up, even if no current commercial search engine uses plain tf*IDF this does not mean they don’t use vector space models for classification, retrieval, and ranking.

Vector space models often are present in different flavors and within different levels of an IR architecture. Vector Theory itself is an ancilliary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.

Search Engines Architecture Week 6

April 18, 2008 by E. Garcia

Week 6 Agenda

Lecture Session

Review on Regular Expressions
Lovins, Krovetz, and Porter Stemmers
Xu & Croft Stemmer (Word Co-occurrence-based Stemmers)
The Parallel Stemmer (LSI/Vector Space-based Stemmer)

Lab Session

Building a Stemmer

Comments: Lab 3 is due next week, but if you prefer to turning in earlier is ok. Lab 4 and this lab (Lab 5) are similar in nature. We can negotiate tomorrow in class their deadlines. A reminder that all labs are due in hard and electronic format to get full credit.

How Search Engines Do Not Work

April 17, 2008 by E. Garcia

If you are taking the Search Engines Architecture grad course, by now you should have learned what are the main components of a search engine and how to build a web crawler and a parser. You should know how to build an inverted index, how to use this to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work are either misconception, myths, or just untrue folklore. Eg., While some have an incorrect idea on how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply non sense.

To illustrate visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first one is about an SEO discussing “how search engines work” and use the Vector Space Model. The second is about the State of Washington suing a marketing company for misselling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

IRW 2008-04: Principal Component Analysis (PCA)

April 16, 2008 by E. Garcia

PCA

Visualizing the two principal components of a data set.

 

The current issue of IR Watch - The Newsletter should be in your inbox during the day.

It is on Principal Component Analysis and covers the followings:

Introduction
What is PCA
A Reaction Equation Approach
Computing PCA with SVD
A Practical Example
Applying SVD to the Covariance Matrix
Improving Results with SPCA
Beyond the Covariance Matrix
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Experiment in Parsing Techniques

April 14, 2008 by E. Garcia

Here are some useful notes for those taking the Search Engines Architecture grad course.

In Lab 4: Experiment in Parsing Techniques, you are building a Query Normalizer, a Document Parser, and a URL Normalizer.

In section 1, be sure the tool removes all kind of HTML instructions from the search interface.

In section 2, the document parser should does document linearization, tokenization, and filtration, but no stemming. The final output should be a list of tuples consisting of unique terms and occurrences.

Regarding section 3.1, Building a URL Normalizer, I’m adding new restrictions to this part.

This part challenges you to use regular expressions, only. No arrays, no scripting loops, no conditionals (if-then), no lookup libraries, no DNS lookups, just regexps. It should work for all valid urls available on the Web. It can be done.

The parser for the URL normalizer tool should remove from a URL all kind of:

protocols (http, https, ftp, etc)
www prefixes
top-level domains (TLD)
ports (:80, :8000, etc)
file extensions (.html, .php, etc)
parameters (name=value pairs)
named anchors or fragments (#)
URL-forbidden characters, international characters, or script lines

Be sure you understand the difference between top-level domains and subdomains. For example, .com and .pr are TLDs, but .com.pr defines a subdomain.

Let say we have the http://www.xxx.com.pr site. If we remove http://www. and .pr from this we end with xxx.com, but .com is not the TLD in this case. The TLD still is .pr, though. Redirection mechanisms are “another twenty bucks” (”otros veinte pesos”), which might confuse the concept.

You should also know that a TLD can be a generic top-level-domain (gTLD), a country-code top-level domain (ccTLD), international TLDs (iTLDs), or US legacy TLDs (usTLDs). When combined, these define a subdomain. For example,

.edu.pr is a subdomain with .pr as TLD
.co.uk is a subdomain with .uk as TLD

That is, for a url on the Web, only one string sequence works as and defines the top-level domain.

Thus, for:

http://www.google.com.pr
https://www.google.com/adsense/login/en_US/?gsessionid=wfx3oxHhgDU
ftp://www.gobierno.pr/index.html
http://search.music.yahoo.com/search/?m=all&p=britney
http://www.telegraph.co.uk
http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s
http://www.lis.ntu.edu.tw/~mctang/index.htm

your tool should return:

google.com
google
gobierno
search.music.yahoo
telegraph.co
video.google.co
lis.ntu.edu

Recommended Lecture Material:

http://en.wikipedia.org/wiki/.pr
http://www.icann.org/meetings/saopaulo/presentation-dns-conrad-07dec06.pdf
http://www.mattcutts.com/blog/seo-glossary-url-definitions/  

Search Engines Architecture Week 5

April 11, 2008 by E. Garcia

Week 5 Agenda

Lecture Session

Building a Parser
Parsing Techniques and Implementations

Lab Session

Building a Query Normalizer
Building a Document Parser
Building a URL Normalizer

Comments

A reminder that hard and digital copies are required for all lab reports to get full credit. I need to evaluate whether the programs actually work as intended.

Report deadlines: To be announced to accommodate to class needs and previous lab needs.

Few Rants: Microsoft, a Conference, and a Database Site

April 11, 2008 by E. Garcia

I normally don’t rant at this blog about trivial stuff in life since this blog is about IR and search engine research. Today I feel like I want to make an exception. So let see how I can tie few rants about silly every-day things to search engines.

Rant 1

I bought the Home and Student version of Windows Office ($122, through Costco). The learning curve started. I tried to open its case by just pulling off the red tab as suggested. The red tab was detached from the case and still there was no “open Sesame”. I then tried different thing until decided to slice the clear seal at the top of the case with a knife and voila! Nothing like a puertorican solution for a “Made in Puerto Rico” Windows Vista product! Duh!

So the recipe is: (1) get a knife, (2) slice seal, and (3) pull with your fingers the case identations toward your right. The inside case should open.

Out of curiousity I wanted to know if others out there struggled with the design of the case. I ended up googling for how to open windows office case and found this site which discussed the very same problem and the very same solution. I realize I was not alone.

There are now dozen of sites like this one that show users this dumb “how-to”. Many are complaining about the “brilliant” design of the box, which is just an usability and accessibility nightmare.

Read what others at the aforementioned site are commenting. Some there commented that ended up searching for:

open office 2007 box
open vista box
Office Packaging “how to open”
open microsoft office box
how to open MS office 2007 box

Something from the product design side is wrong when soooo many have to Google for just how to open the damn case of a Microsoft product, or of any product for that matter. Some thing is wrong when Microsoft lab rats have to explain online how to open the annoying case.

Rant 2

There is a local conference on information security I was invited to. Down the organization pipeline, something is wrong with a conference when their organizers have to chase for potential presenters one week before the event. I pass and wish them good luck.

Rant 3

There is a local company that created a database-driven site for the upcoming Elections. The problem: how to get politicians and average users to know how to use the technology. Also, the site already needs to be redesigned so it can rank high and gain traffic from search engine users.

All these, kind of belong to the Land of Duh.

Demystifying LSI Video

April 7, 2008 by E. Garcia

Here is a video of my presentation, Demystifying LSI, at the OJOBuscador Congress 2.0, Madrid, Spain, 2007. One year later, nothing has changed. Many of the same crook SEOs exposed during the congress are still deceiving the public about what is LSI.

Unfortunately, the quality of the video and lights are not good enough to see the pdf slides, plus the presentation is in Spanish. Since attendees were not scientists, I talked very slow for over an hour.

Want to get bored for the next hour? View the video.

Thanks to N. Valenzuela Alonso, Director of SEO and Search Engine Marketing of Media Bit, S.L. for the link (www.ithinksearch.com/2008/03/31/video-lsi-de-edel-garcia-desmitificando-lsi/).

Here is also the presentation of Carlos Castillo (Chato), from Yahoo! Research Spain:

Adversarial IR with Web Spam, parts 1 and 2 
(http://www.ojobuscador.com/2007/06/14/ir-con-adversario-y-webspam-videopost/).

I spent great time talking with Carlos, a former grad student of Ricardo Baeza-Yates.

Baeza-Yates, Andrei Broder, Gerald Salton, and Keith van Rijsbergen and few others have helped to shape what is today known as Information Retrieval Research

Talking about Andrei Broder (one of the main researchers behind the old mighty Altavista), here is also a great interview, thanks to ojobuscador site: 
http://www.ojobuscador.com/2006/05/20/entrevista-a-andrei-broder/

 

Search Engines Architecture Week 4

April 4, 2008 by E. Garcia

Week 4 Agenda

Lecture Session

Building an Inverted Index
Incorporating Query Modes
Building the Term-Doc Matrix from the Inverted Index

Lab Session

Programming the Parser
Programming the Inverted Index

Understanding Search Engines

April 3, 2008 by E. Garcia

On April 23rd I’ll be presenting a seminar lecture at University of Puerto Rico, Bayamon (http://www.uprb.edu).

Topic, time, abstract, level, and requirements follows. 

Topic: Understanding Search Engines

Time: 12 Noon

Abstract: What are search engines? How do they work? What are their main components? How do they analyze document relevancy? What it takes to rank a web page in the top 10 search results of Google, Yahoo, and other search engines? These questions will be addressed in this conference.

Level: Beginners.
Requirements: None.

If you are an uprb student, a faculty, or staff and happen to be an SEO or webmaster, don’t miss this rare opportunity to learn answers to these and similar questions.

I will also use that opportunity to promote the upcoming conference we are co-organizing at Polytechnic University:

Search Engines and Information Security

This will be held in October 3 & 4, 2008. Additional information will be provided soon at PUPR.edu and Mi Islita.com sites.

Adressing Some LSI Questions

April 2, 2008 by E. Garcia

At the last Search Engines Architecture lecture we discussed LSI and Terrier. Great questions were raised. Some of these follows:

Q: How many dimensions to keep?
A: This is done by trial and error. I have a research project on the topic. None of the current ways of addressing this problem convince me.

Q: How do we compute a truncated version of the initial matrix, A?
A: After SVDing A, truncate U, S, and V by retaining the first k columns of U and V (rows of V transpose) and the first k diagonal elements of S. Multiply these as discussed in class to get A truncated.

Q: To compute the query vector in the reduced space, do we need to compute A truncated for each query?
A: No. The new coordinates of this vectors are defined as
q = qTUkSk-1
This means that A can be called from the cache. See the fast track tutorial
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf
over at Mi Islita.com site.

Q: Do I need to compute A truncated each time a new document is added or previous are modified?
A: For small matrices the answer is YES. However, for huge matrices we can resource to updating/appending techniques. Some of these add doc vectors without recomputing the previous matrix. There is a point wherein this can compromise orthogonality, though.

Q: How do I use Desktop Terrier?
A: Follow the instructions provided in the updated version of Lab Report 2.

Search Engines Architecture Week 3

March 28, 2008 by E. Garcia

Week 3 Agenda

Lecture Session

Document Indexing
Web Crawling Techniques

Lab Session

For this lab, students should have already signed to download Terrier from http://ir.dcs.gla.ac.uk/terrier. We can use the Desktop API as is, but for development we need JAVA in the local machine.

Lab instructions for using the API will be provided in class. Please read in advanced Terrier documentation. Bring with you a directory (folder) full of documents from the pupr.edu site or your favorite site to play with. This will be analyzed during the lab.

This lab report is due next week.

PCA and SPCA Tutorial

March 25, 2008 by E. Garcia

As promised, the PCA and SPCA Tutorial is now available at Mi Islita.

I wrote this tutorial for two reasons:

1. to help graduate students taking the Search Engines Architecture course with a general overview and review of linear algebra concepts.
2. because most tutorials discuss PCA, but ignore SPCA.

If you are enrolled in the course and have any question, please feel free to post.

Feel also free to comment or make suggestions so we can improve the tutorial.

Verizon, FCC, and the C Block Competition

March 24, 2008 by E. Garcia

Now that the B and C blocks of the spectrum has been allotted by the FCC things are set-ready-go to open mobile broadband U.S. networks, broadband IR, and, yes, to a whole new hacking space. It’s a matter of time. The C Block hacking competition is coming. Never ignore what can be done with such new playground.

I wonder how the FCC is going to enforce regulations on the 22-MHz portion of the spectrum, already handled to Verizon. http://www.pcworld.com/article/id,143705-c,industrynews/article.html

Meanwhile, IR research centered around open broadband networks are needed, so as search engines.

Important Terrier Upgrade and Call for Participation

March 20, 2008 by E. Garcia

Craig Macdonald from University of Glasgow and ECIR, sent me this upgrade and Call for Participation concerning the Terrier platform.

BTW, if you are taking the Search Engines Architecture grad course, after thinking thoroughly I believe is better to go with Terrier (JAVA) as the base architecture for the course main project. I don’t think we have enough time to build and deal with a C++/PHP/Perl Frankenstein, with all associated incompatibility issues, nor is necessary to reinvent the wheel for the next 9 weeks. I wish we all have more time in this life.

So, students, please start downloading Terrier now. We don’t have classes this Saturday, so we have plenty of time to read and catch up with the release. Next time we meet I will explain how to use their Desktop Search component. We will build/tweak as much as we can of the architecture.

The great thing about Terrier is that other researchers at the ECE&CS department can then use it for other experiments on data mining.

Here is Macdonald’s announcement:

Terrier, IR Platform v 2.1 - 19/03/2008

http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.1, the next version of the open source IR platform from the
University of Glasgow (Scotland) has been released.

This is a minor update, which contains some bug fixes, and some minor
improvements. Support for indexing various test collection has been
improved (CLEF, TREC Legal track), and the flexibility of the settings
of some applications such as the Desktop search and the Interactive
Terrier have been enhanced.. Moreover, Terrier 2.1 includes a file
system abstraction layer, which allows various types of files to be
accessed through a uniform API. For example, indexing an HTTP Web page
is as straightforward as indexing a local document. Moreover, a notable
indexing bug affecting only the Windows platform was resolved.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL)
and is available from the Terrier website:

http://ir.dcs.gla.ac.uk/terrier/

Final Call for Participation:

ECIR 2008 Tutorial: Research and building IR applications using Terrier
30th March

This tutorial introduces the main design of a large-scale IR system, and
uses the Terrier platform as an example of how one should be built. We
detail the architecture and data structures of Terrier, as well as the
weighting models included, and describe, with examples, how Terrier can
be used to perform experiments and extended to facilitate new research
and applications.

Handouts containing slides, a Terrier “crib sheet”, and detailed
examples of implementations of common research problems will be
provided, in addition to a bibliography of informative related papers.

Presenters: Craig Macdonald and Ben He, University of Glasgow, UK
http://ecir2008.dcs.gla.ac.uk/tutorial_t3.html

https://mr1.dcs.gla.ac.uk/mailman/listinfo/terrier-announce

Principal Component Analysis Tutorial and Other Algos

March 18, 2008 by E. Garcia

I’m currently writing a principal component analysis (PCA) tutorial for students enrolled in the Search Engines Architecture course. It should be available online soon. If you are enrolled in the course, it will help you with Part 1 Section 4 of Lab 1.

For those not familiar with PCA, it is the equivalent of applying Singular Value Decomposition (SVD) on the covariance matrix of a data set. Mistake not PCA for Latent Semantic Indexing (LSI) as these are different animals.

These days PCA is used as an exploratory tool. For instance, if we want to do K-Means on a data set, we first run PCA and find the principal components. After that we can select a point on the dominant eigenvectors as centroid and apply K-Means.

Depending on the clusters being inspected this simplifies a lot of things. It is not a silver bullet, but just a handy tool in the researcher’s toolbox. It can be combined with or used to support other exploratory analyses, though.

PCA, SVD, Kendall’s Test, Vector Space Theory, and few other statistical algorithms have been applied for years in Chemometrics. Pharmacology, and Analytical Chemistry, even before the Internet and before commercial search engines were around.

For a sample, check Pattern Recognition and NMR Spectroscopy by Ebbels, et. al.

Here is a list of some clustering algorithms used by chemists all over the World:

Pattern Recognition Methods

Unsupervised

Principal Component Analysis (PCA)
Hierarchical cluster analysis (HCA)
Non-linear maps
Kahonen networks
Rule induction
Probabilistic methods (eg Autoclass).

Supervised

Discriminant analysis
Neural Networks (Back propagation)
SIMCA
K-Nearest Neighbour (KNN)
Probabilistic methods (eg Multi-dimensional Gaussian Class Modelling)
Rule induction
Regression techniques: MLR, PLS, PCR.

With the popularization of personal computers and search engines, these are back and in the limelight. There is nothing new under the Sun, sort of speak.

I’m glad I also have a background in Chemistry.

Some times it pays to see the big picture, far away from search engines and SEOs.

Perhaps now you may understand why I find soooo hilarious SEO tales about how SVD and LSI algorithms work.

Search Engines Architecture Week 2

March 14, 2008 by E. Garcia

Week 2 Agenda

Lecture Session

Visualizing Matrix Operations
SVD and PCA Review
If we have time, I will start with:
Overview of Document Indexing and Ranking Algorithms
First-Breadth and Deep-First Web Crawlers
The Terrier Desktop Searches Platform (Java)

Lab Session

Complete Lab 1. Please add the following instructions to the lab.

In Part 3, section 3.1.3, add the following task:

Compute the sum of the eigenvalues of ATA and the trace of this matrix. Do the same for AkTAk. Compare results and draw some conclusions. What important property is confirmed?

In Part 3, section 3.1.4, add the following task:

Finally, column-normalize VkT and construct a similarity matrix from it. Extract scalar clusters from it. Compare with the clusters extracted from AkTAk. Explain your observations.

In Part 4, section 4.1.1, add the following task:

Using EXCEL, reproduce the PCA example given by Smith in reference 4. Show all calculations.

Teaser: Consider the following lecture material list. Which trick is being used to reduce link juice (importance)? How would you add link juice?

Lecture Material

1. Using latent semantic analysis to improve access to textual information; Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286,
PDF

2. Indexing by Latent Semantic Analysis; Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
PDF

3. Association and Scalar Clusters Tutorial; Garcia, E. (2008).
PDF

4. A tutorial on Principal Components Analysis; Smith Lindsay (2002).
PDF

5. A tutorial on Principal Component Analysis; Jon Shlens (2003).
PDF

6. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations
Boldi, P., S. Massimo, V. Sebastiano Vigna (2007)
PDF

Back Mapping for SEOs

March 13, 2008 by E. Garcia

A reader asked me how he could apply back mapping (BM) to SEO work.

Essentially, any algorithm like BM that works at the level of collections can be scaled down and used at a page level. One just needs to decompose a page into passages (chunks of text or text windows) and treat each passage as a pseudo document.

First, convert a document into a text stream; i.e., linearize, tokenize, and filtrate the document. You should end with a linearized piece of text.

 If it turns out that the linearized document is almost identical to the original document,  you might want to go back and partition the original document every x paragraphs or sentences. If not, partition the text stream so you end with segments or text windows. You need to decide if you want to do this very x number of words or characters.

Next, treat each segment or chunk of text as a “document” and construct a term-”document” matrix “A”.

If you want to narrow this down to the level of links, construct a term-link matrix.

 In general, an algorithm that applies to a true term-document matrix A can be applied to “A”.

Welcome to scaling.

How Many SEO Myths In One Sentence?

March 12, 2008 by E. Garcia

When I thought I have read enough SEO myths here is another SEO “expert” combining many of these in one: SEO LSI hearsay + Keyword Density + how a crawler works.

In http://www.trafficvillage.com/Article/website-optimisation/54346  Andy Burrows writes this piece of nonsense:

Google was the first search engine to implement a technology called LSI (Latent Semantic Indexing) to generate search results. LSI requires a Googlebot to take note of the keyword density of specific words on a webpage in addition to caching a page.

WOW…

Quiz: How many incorrect ideas can you spot in this single sentence?

Back Mapping Document Clusters to Terms

March 11, 2008 by E. Garcia
Back Mapping

Back Mapping is a simple technique that
elegantly identifies local and global topics.

Here is a snake preview of the current issue of IR Watch - The Newsletter. It should be in subscribers inbox during the day.

IR Watch 2008-03: Back Mapping Documents to Terms

Keywords: back mapping, documents, terms, association clusters, scalar clusters

In this issue:

Introduction
Revisiting Association and Scalar Clusters
On Back Mapping
Back Mapping Example
Extracting Association and Scalar Clusters
Examining Neighborhood-Induced Similarity
Back Mapping Document Clusters to Terms
Identifying Local and Global Topics
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Search Engines for Penetration Testing Course

March 7, 2008 by E. Garcia

More on the topic of college education and search engines:

In the Caribbean we don’t have seasons, just early spring breaks –compared to back in the states. After a 2-week spring break vacation, the new trimester just started. I am teaching tomorrow the Search Engines Architecture Course. This is going to be fun.

BTW. I am putting together a new graduate course for the next fall semester: Search Engines for Penetration Testing. I look forward to any suggestion from colleagues and from former  or prospective students. This is going to be a nirvana for ethical hackers/spammers.

Search Engines Architecture Week 1

March 6, 2008 by E. Garcia

Week 1 Agenda

Lecture Session 

Course overview and description
Review of linear algebra, including vector analysis
Discussion of SVD and PCA

Lab Session

Lab 1. Linear Algebra with EXCEL
     Part 1. Vectors and Search Engine Marketing (SEM)
     Part 2. Vectors and Crawlers
     Part 3. Singular Value Decomposition (SVD) with LSI
     Part 4. Principal Component Analysis (PCA) with SVD

Lecture Material

http://www.miislita.com/information-retrieval-tutorial/information-retrieval-tutorials.html