IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: April 2008

For SEO Spammers: AIRWeb 2008 Presentations

29 Tuesday Apr 2008

Posted by egarcia in Conferences, Spam

≈ 3 Comments

To facilitate mainstream dissemination of the manuscripts presented at AIRWeb 2008, here are the papers as listed at http://airweb.cse.lehigh.edu/2008/program.html

SEO spammers, whether or not your life revolves around a “social network circus” or “link building”, it is time to go back to the drawing board.

8:30 – 10:00

  • (10 min.) Introduction
  • Usage Analysis
    • (25 min.) A Large-scale Study of Automated Web Search Traffic slides
      Greg Buehrer, Jack Stokes and Kumar Chellapilla
    • (25 min.) Identifying Web Spam with User Behavior Analysis slides
      Yiqun Liu, Rongwei Cen, Min Zhang, Liyun Ru and Shaoping Ma
    • (15 min.) Query-log mining for detecting spam slides
      Carlos Castillo, Claudio Corsi, Debora Donato, Paolo Ferragina and Aristides Gionis

10:30 – 12:00

  • Text Analysis
    • (15 min.) Cleaning Search Results using Term Distance Features slides
      Josh Attenberg and Torsten Suel
    • (15 min.) Exploring Linguistic Features for Web Spam Detection: A Preliminary Study slides
      Jakub Piskorski, Marcin Sydow and Dawid Weiss
    • (15 min.) Latent Dirichlet Allocation in Web Spam Filtering slides
      Istvan Biro, Jacint Szabo and Andras Benczur
  • General
    • (25 min.) Analysing Features of Japanese Splogs and Characteristics of Keywords slides
      Yuuki Sato, Takehito Utsuro, Tomohiro Fukuhara, Yasuhide Kawada, Yoshiaki Murakami, Hiroshi Nakagawa and Noriko Kando
    • (15 min.) Webspam Identification Through Content and Hyperlinks slides
      Jacob Abernethy, Olivier Chapelle and Carlos Castillo

13:30 – 15:00

  • Social Networks
    • (25 min.) Identifying Video Spammers in Online Social Networks slides
      Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang and Keith Ross
    • (25 min.) A Few Bad Votes Too Many? Towards Robust Ranking in Social Media slides
      Eugene Agichtein, Jiang Bian, Yandong Liu, and Hongyuan Zha
    • (25 min.) The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems slides
      Beate Krause, Christoph Schmitz, Andreas Hotho and Gerd Stumme
  • Link Analysis
    • (25 min.) Robust PageRank and Locally Computable Spam Detection Features
      Vahab Mirrokni, Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Kamal Jain and Shang-Hua Teng

15:30 – 17:00

  • Web Spam Challenge
    • (5 min.) Description of the challenge
    • (12 min.) Data Analysis School, Moscow slides
      Konstantin Bauman, Alexey Brodskiy, Sergey Kacher, Elmira Kalimulina, Ruslan Kovalev, Mikhail Lebedev, Dmitry Orlov, Pavel Sushin, Pavel Zryumov, Dmitry Leshchiner and Ilya Muchnik
    • (12 min.) Computer and Automation Research Institute, Hungarian Academy of Sciences slides
      David Siklosi, Andras Benczur
    • (12 min.) Institute of Automation, Chinese Academy of Sciences, Beijing slides
      Guanggang Geng, Xiaobo Jin and Chunheng Wang
    • (5 min.) Announcement of results
  • Panel
    • (45 min.): The Future of Adversarial IR on the Web
      Amit Aggarwal, Zoltán Gyöngyi, Alexandros Ntoulas, Erik Selberg, and Andrew Tomkins

Search Engines Architecture Week 7

25 Friday Apr 2008

Posted by egarcia in Programming, Search Engines Architecture Course

≈ Leave a Comment

Week 7 Agenda

Lecture Session

Review of a typical Search Engine Architecture
Regular Expressions for Building Porter’s Stemmer

Lab Session

Finishing Labs 3 and 4

SEOs – Desperately Seeking Clients

24 Thursday Apr 2008

Posted by egarcia in Miscellaneous, Spam

≈ Leave a Comment

From time to time I receive unsolicited emails from SEOs offering me their services to list my site in the major search engines and directories. They often send template-like automated messages (“Dear website owner”) and apparently do not even bother to check whether recipients need the service.

These SEOs often look desperate and sound like snake-oil sellers and crooks. They even claim to be better than other SEOs.

They often pitch the same crap:

  • “I recently visited your site” (Really? Why then send this crap?).
  • “you are not listed in the top search engines and directories” (Really? How do they know?).
  • “we can increase your traffic by X astronomical amount” (Really? Could you double X for me, please?).
  • “we can help you get top rankings in Google” (Really? For which keywords?).
  • “our link building program” (Really? Read here link exchange and link spam).
  • “we have proprietary crap, blah, blah, …” (Really? Sell it or get a patent!).

I received one such email just last night, even though my site is known in the IR/SEO spheres, has been listed for many years in the top search engines and directories, and ranks well.

Dear website owner,

I visited your website and noticed that you are not listed in many of the major search engines and directories. If our company can increase your traffic up to 500% by getting you top ranking results on the search engines such as Google would you be interested? We specialize in link building content writing and programming. We have proprietary techniques that work better and are less expensive than any other SEO firm.

Please let me send you a proposal and show you how we can make your website profitable.

Sincerely,

Christian Frank

2060 AVENIDA DE LOS ARBOLES, STE D
THOUSAND OAKS,
CA 91362-1361 – USA

These are the types of companies that give the SEO industry a black eye. If SEOs send you this type of crap, I feel your pain. Stay away from their businesses and from whatever they claim or seem to offer.

Building the Porter Stemmer

22 Tuesday Apr 2008

Posted by egarcia in Search Engines Architecture Course

≈ 2 Comments

This post is for grad students taking the Search Engines Architecture course.

You should have in your email inbox a PDF of Porter’s original article from 1980 and a revised version of Lab 5 that follows the nomenclature used by Porter, along with the expected results to check against. Please disregard the previous lab version.

The Stemmer is easy to build in any language. You can take a look at some versions on the Web to get some ideas, but you cannot copy these. Your tool should only do what is required in the experiment.

The deadline will be negotiated the next time we meet. If you have any questions, feel free to blog them, whether they are about programming or content.

Important quotes or notes from Porter’s original article

1. “A consonant in a word is a letter other than a, e, i, o or u, and other than y preceded by a consonant. (The fact that the term ‘consonant’ is defined to some extent in terms of itself does not make it ambiguous.) So in toy the consonants are t and y, and in syzygy they are s, z and g. If a letter is not a consonant it is a vowel.” –Porter.

2. Any word can be written in the form [C](VC)^m[V], where C is a consonant sequence, V is a vowel sequence, and the bracketed parts are optional. The measure m of the word is the number of times the VC pair is repeated (the VC frequency, or ‘shift’); e.g.,

m = 0 tr, ee, y, by
m = 1 trouble, oats, trees, ivy
m = 2 troubles, private, oaten, orrery
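
Notes 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not the lab solution: the function names is_consonant and measure are made up here, and the sketch covers just the consonant rule and the measure, none of the stemming steps.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter's rule: a, e, i, o, u are vowels; y is a vowel only
    when the letter before it is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y at the start of a word, or after a vowel, is a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    word = word.lower()
    pattern = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    # Each vowel-to-consonant transition closes one VC pair.
    return sum(
        1 for prev, cur in zip(pattern, pattern[1:])
        if prev == "V" and cur == "C"
    )

for w in ["tr", "by", "trouble", "ivy", "troubles", "orrery"]:
    print(w, measure(w))
```

Running it on the words listed above should reproduce the m = 0, 1, 2 groups, which makes it a handy check before moving on to the suffix rules.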

Vector Space Models and Search Engines

21 Monday Apr 2008

Posted by egarcia in SEO Myths, Vector Space Models

≈ 2 Comments

This 23rd, I’ll be at UPRB.edu presenting the talk Understanding Search Engines. http://irthoughts.wordpress.com/2008/04/03/understanding-search-engines/

That said, today’s post is in reaction to the article at
http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

The title of that article sends the message that search engines work by using tf*IDF. In addition, IDF itself is described incorrectly. It is not clear from the article how vector space models work or how search engines use them. The author then seems to agree with a few SEOs that search engines do not use these models to rank documents. Thus, a mixed message is sent.

Blockquoted passages and my comments are next.

Blockquoted Passage

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection wide.
By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?
The word the would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (“the reluctance to stop talking about a certain subject” …)would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document with.

Comments

In grad school we teach tf*IDF models to introduce students to IR. Later on they are exposed to more realistic models. Most likely, no current top search engine uses plain tf, IDF, or tf*IDF to rank docs. So how does IDF work?

Let N be the number of docs in a collection and n be the number of docs containing a given term. The probability of randomly choosing a doc containing a given term is p = n/N. This is defined as document frequency (DF). Inverting DF and taking logs gives the so-called Inverse Document Frequency (IDF), defined as

IDF = log(1/p) = log(N/n)

Logs at a given base (often 2 or 10) are used.

Note that p is the fraction of docs containing a given term. Thus, IDF is sometimes obscurely described as the “popularity” of a term within a collection. IDF actually estimates how much discriminatory power a term has in a given collection; no more, no less.

Frequently used terms have little discriminatory power, regardless of whether they are relevant to a document. Terms rarely mentioned in a collection have more discriminatory power (large IDF), regardless of whether they are relevant to the topic of the documents mentioning them. Term relevancy and the discriminatory power of a term do not always run “side by side”. Sometimes they do, though.

The discriminatory power of a term, not its relevancy to a document, is determined by its environment. That environment is the collection wherein it resides. This is what IDF estimates.

For example, “job” mentioned in a document about jobs is relevant to the document. If this doc is indexed in a generic collection, “job” would probably be both relevant to the doc and discriminatory within the collection. If the same document is indexed in a collection about jobs, like Monster.com, the term “job” is still relevant and meaningful to the document, but will most likely lose its discriminatory power within the collection. And we haven’t yet considered how relevant “job”, or the documents containing the term, are to end users looking for jobs.
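
The “job” example can be put into numbers with a small Python sketch. The collection sizes and document counts below are hypothetical, chosen only to show the contrast, and base 10 is an arbitrary choice of log base.

```python
import math

def idf(total_docs, docs_with_term):
    """IDF = log(N/n) for a term found in n of N docs (base 10 assumed)."""
    return math.log10(total_docs / docs_with_term)

# Hypothetical counts: the same term in two different collections.
generic = idf(1_000_000, 1_000)      # "job" in a generic collection
jobs_site = idf(1_000_000, 950_000)  # "job" in a collection about jobs

print(round(generic, 3), round(jobs_site, 3))
```

With these made-up counts the term gets a weight of about 3 in the generic collection and close to zero in the jobs collection, even though its relevance to the document has not changed.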

IDF was used in the first vector space models of the ’70s-’90s to measure global weights across a collection. It is not the only way of measuring global weights, though.

For instance, we can use a probabilistic IDF (IDFP) by considering the odds, p/(1 – p), instead of just p. Inverting and taking logs,

IDFP = log((1 – p)/p) = log((N – n)/n)

If a term is now mentioned in 50% of the total docs (n = N/2), it has zero global weight (IDFP = log(1) = 0), effectively acting as a stopword. For n > N/2, IDFP weights are negative. These are the so-called “negative terms”. They often introduce retrieval complications.

Some reassign zero weights to negative terms, effectively forcing such terms to behave as stopwords. This probably would be the case of “job” in a collection about jobs that uses IDFP. Optimizing doc content for such terms often is a futile exercise. Open source versions of search engines, customized to use IDFP (MySQL, Lucene, etc), often rezero these terms, and for good reasons. This is thoroughly explained in http://www.miislita.com/term-vector/term-vector-5-mysql.html  
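
A minimal sketch of IDFP with the re-zeroing step, in Python. The function name idfp and the rezero flag are illustrative choices; this is not how MySQL or Lucene implement it.

```python
import math

def idfp(total_docs, docs_with_term, rezero=True):
    """Probabilistic IDF: log((N - n)/n), base 10 assumed.

    With rezero=True, negative weights are clamped to zero so that
    such terms behave like stopwords.
    """
    if docs_with_term >= total_docs:
        # log(0) is undefined; the term is in every doc.
        return 0.0 if rezero else float("-inf")
    weight = math.log10((total_docs - docs_with_term) / docs_with_term)
    return max(weight, 0.0) if rezero else weight

print(idfp(1000, 10))                 # rare term: positive weight
print(idfp(1000, 500))                # n = N/2: zero weight
print(idfp(1000, 900, rezero=False))  # n > N/2: negative weight
print(idfp(1000, 900))                # same term after re-zeroing
```

Clamping at zero is the simplest policy; it discards the information that a term is very common, which is usually acceptable because such terms contribute little to discrimination anyway.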

There are ways of defining global weights other than plain IDF. For instance, we can use entropies. Entropy captures a variety of cases not accounted for by IDF. It is often preferred if the associated computational cost is not an issue.
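
As an illustration of an entropy-based global weight, here is a Python sketch of the log-entropy form commonly seen in the LSI literature, G = 1 + Σ p·log(p) / log(N), where p is the share of a term's occurrences that fall in each doc. The post does not give a formula, so treat this particular form as an assumption.

```python
import math

def entropy_weight(term_counts):
    """Global entropy weight for one term.

    term_counts holds the term's frequency in every doc of the
    collection, zeros included.
    """
    n_docs = len(term_counts)
    total = sum(term_counts)
    if n_docs < 2 or total == 0:
        return 0.0
    acc = 0.0
    for f in term_counts:
        if f > 0:
            p = f / total          # share of the term's occurrences in this doc
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n_docs)

# Two hypothetical terms, each present in 2 of 4 docs (same IDF).
print(entropy_weight([5, 5, 0, 0]))  # evenly spread over its two docs
print(entropy_weight([9, 1, 0, 0]))  # concentrated in one doc
```

Both terms have the same IDF because each appears in two of four docs, yet the term concentrated in one doc gets the higher entropy weight. That is one of the cases IDF alone cannot distinguish.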

Blockquoted Passage

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = not so high score for that document, for that word.

Comments

Like global weights, there are dozens of ways to define local weights, L. In the original vector space model, L = f was used, where f is the frequency (number of occurrences) of a term in a doc.

This model is susceptible to keyword spam (word repetition) since it does not attenuate frequencies. A graph of L vs f is simply a 45-degree straight line. Models that attenuate frequencies are preferred. How much attenuation to use?

The extreme case would be a binary model. That is, L = 1 if the term is present in the doc, otherwise L = 0. Middle-ground models that attenuate frequencies are better choices.

L = f/fmax
L = 1 + log(f)
L = log(1 + f)

etc, etc, etc.

Some local weight models attenuate frequencies and can be used to flag spam. These models render the so-called keyword density tools useless.
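
The local weight options listed above can be compared side by side with a short Python sketch. Base-10 logs and the function name local_weights are assumptions made for illustration.

```python
import math

def local_weights(f, f_max):
    """Return several local weights for a term with frequency f in a doc
    whose most frequent term occurs f_max times."""
    if f == 0:
        return {"raw": 0, "binary": 0, "f/fmax": 0.0,
                "1+log(f)": 0.0, "log(1+f)": 0.0}
    return {
        "raw": f,                         # L = f, no attenuation
        "binary": 1,                      # L = 1 if present
        "f/fmax": f / f_max,              # L = f/fmax
        "1+log(f)": 1 + math.log10(f),    # L = 1 + log(f)
        "log(1+f)": math.log10(1 + f),    # L = log(1 + f)
    }

# A term mentioned once versus the same term repeated 100 times.
for f in (1, 100):
    print(f, local_weights(f, f_max=100))
```

Repeating a term 100 times multiplies the raw weight by 100, but only triples the 1 + log(f) weight. This is the attenuation that makes word repetition unrewarding.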

There are many ways of defining L and G, not to mention variants of the document normalization weight N (not to be confused with the collection size N used earlier). These then give a product weight:

W = L*G*N

A term-doc matrix populated with such weights can then undergo normalization so that it will consist of unit vectors. The use of unit vectors simplifies computations and allows for better comparison of large and short documents.

As we can see, W = tf*IDF is a simplistic way of computing term weights and just a particular case of a W = L*G*N scheme.
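
One possible instance of the W = L*G*N scheme is sketched below in Python, with L = log(1 + f), G = IDF, and N chosen as cosine normalization so that each doc becomes a unit vector. These three choices, the function name, and the sample counts are all assumptions made for illustration.

```python
import math

def weight_vector(freqs, doc_freqs, total_docs):
    """Return unit-length term weights for one doc.

    freqs:      term -> frequency in this doc
    doc_freqs:  term -> number of docs containing the term
    total_docs: number of docs in the collection
    """
    raw = {}
    for term, f in freqs.items():
        local = math.log10(1 + f)                           # L
        global_w = math.log10(total_docs / doc_freqs[term])  # G (IDF)
        raw[term] = local * global_w
    length = math.sqrt(sum(w * w for w in raw.values()))
    norm = 1.0 / length if length else 0.0                   # N (cosine)
    return {term: w * norm for term, w in raw.items()}

# Hypothetical doc and collection statistics.
vec = weight_vector(
    freqs={"porter": 3, "stemmer": 1},
    doc_freqs={"porter": 10, "stemmer": 100},
    total_docs=1000,
)
print(vec)
```

Swapping in a different L, G, or N only changes one line each, which is the practical appeal of treating tf*IDF as a particular case of the general scheme.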

Blockquoted Passage

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

Queries as Vectors

Every word in a query can also be shown as a vector.

By looking at documents that are “near” our query we can rank (sort) documents in our result set.

Comments

In a term space like the one discussed in the referred article, there is only one vector per doc to draw and there is only one vector per query to draw, regardless of how many words are present in the doc or the query.
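
The point can be seen in a toy term space. In the Python sketch below, each term of a hypothetical three-word vocabulary is an axis, and the whole doc and the whole query each become a single vector. Raw counts are used instead of tf*IDF weights to keep the example short.

```python
import math

VOCABULARY = ["engine", "search", "vector"]  # one axis per term

def to_vector(text):
    """Map a whole text to ONE vector with one coordinate per term."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in VOCABULARY]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc = to_vector("search engine search vector")
query = to_vector("search engine")
print(doc, query, round(cosine(doc, query), 3))
```

The doc has four words and the query has two, yet each is drawn as one vector. “Nearness” is then the angle between those two vectors.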

In LSI, however, every word of a doc can be represented as a vector, but that is not what the referred article discusses.

Blockquoted Passage (from one commenter)

It is important to mention that vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors are query time would be prohibitively expensive in times and resources.

Comments

Just the opposite. It is more practical than one might think, if we understand the architecture of a search engine and how it works.

There is a difference between an index and term-doc matrices (from which vectors can be computed). An index can be inverted to form an addressable “book” consisting of a dictionary (aka vocabulary) plus posting lists. We call this “addressable book or tree” an inverted index. In the posting lists we can store different doc features, like f values, word positions, word spacing, in-title flags, etc.

The index can be computed and already be in cache before any query arrives. When a query is submitted, search terms are matched against the vocabulary and posting lists are quickly accessed. IDFs for each term are precomputed. We only need to match search terms to terms in the inverted index and address the posting lists. The idea is to avoid exhaustive searches (searching over the entire collection).

From the matched posting lists we can construct, at query time, a query-dependent term-doc matrix and extract vectors from just those docs. Note that this can be a predetermined number of docs from the index. Thus, if a million docs are matched by the posting list(s), only the top N ranked are returned. This is only one way of tackling the “beast”: through addressing and divide-and-conquer techniques.
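
A toy version of this flow is sketched below in Python. The collection, the function names, and the scoring (a plain f*IDF sum with no length normalization) are assumptions chosen for brevity, not a description of any production engine.

```python
import math
from collections import defaultdict

DOCS = {  # hypothetical toy collection
    1: "porter stemmer for search engines",
    2: "vector space models for search",
    3: "link spam and link building",
}

def build_index(docs):
    """Offline step: vocabulary -> posting list {doc_id: f}, plus IDFs."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.split():
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    idf = {t: math.log10(len(docs) / len(p)) for t, p in postings.items()}
    return postings, idf

def search(query, postings, idf, top_n=10):
    """Query-time step: address only the posting lists of the query terms."""
    scores = defaultdict(float)
    for term in query.split():
        for doc_id, f in postings.get(term, {}).items():
            scores[doc_id] += f * idf[term]  # query term weight taken as 1
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

POSTINGS, IDF = build_index(DOCS)
print(search("vector search", POSTINGS, IDF))
```

Doc 3 is never touched for this query because none of its posting lists is addressed. The cost of a query depends on the length of the matched posting lists, not on the size of the collection.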

For huge collections, there are other divide-and-conquer techniques to speed up the process. We can also resort to precaching strategies, so a similar analysis can be done offline (e.g., from a pool of frequently queried terms, the so-called suggestion lists). We can also use prebuilt thesauri to find similar docs, which affects precision and recall. Processes can be called by geolocation to serve a specific region or regional directory, etc.

For rather small collections, the term-doc matrix can be built from all terms in the little collection stored on disk, or small term-doc matrices can be constructed in advance from each little posting list. All of this is done offline, before any query. Either way, the query vector is transposed and multiplied against a term-doc matrix, and the results are postprocessed, ranked, and presented to the end user within a fraction of a second.

To boldly suggest that vector models are not used by top search engines for ranking docs is plain nonsense. Still, link weights alone or vector similarity scores alone are not enough. These scores can be combined with link weights or with other analytics to get a final score.

Combining those scores does not simplify computation; it adds another layer of complexity. Not combining them, however, can leave out meaningful docs and queries.
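
As a deliberately simple illustration of combining scores, the Python sketch below blends a vector similarity score with a link weight through a linear mix. The mixing parameter alpha, the function name, and the sample scores are all hypothetical; real engines are not claimed to use this formula.

```python
def combined_score(similarity, link_weight, alpha=0.7):
    """Blend a query-dependent similarity with a query-independent
    link weight. alpha is a hypothetical mixing parameter in [0, 1]."""
    return alpha * similarity + (1.0 - alpha) * link_weight

# Hypothetical candidates: (doc id, vector similarity, link weight).
candidates = [("A", 0.90, 0.10), ("B", 0.70, 0.80), ("C", 0.20, 0.95)]
ranked = sorted(
    candidates,
    key=lambda c: combined_score(c[1], c[2]),
    reverse=True,
)
print([doc_id for doc_id, _, _ in ranked])
```

In this made-up case, doc A wins on similarity alone and doc C wins on links alone, but doc B ranks first once both signals are considered.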

Note

Since SEOs love to quote each other and I love to quote IRs, let’s find a happy medium and quote both:

http://www.webpronews.com/topnews/2001/09/05/google-interview-by-fredrick-marckini

One way modern search engines have combined link models with vector space models is described in the old patent: Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis (Patent 6112203) http://www.freepatentsonline.com/6112203.html  which incidentally mentions tf and IDF.

Back in 2005 at IPAM, Prabhakar Raghavan from Yahoo! Research explained in Vector Spaces Are Back! http://www.ipam.ucla.edu/publications/gss2005/gss2005_5542.pdf how these are used for ranking docs. The key: avoid exhaustively searching the collection.

Blockquoted Passage

Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost reducing procedures; working with top N documents or leader/follower samples.

Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a words is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

Comments

Why agree with SEO hearsay? See the previous comments.

Also, depending on how an index was constructed, a query need not “travel” the entire inverted index. Once search terms are matched in the inverted index, we can address the corresponding posting lists, avoiding exhaustive searches. That’s why it is known as an “addressable book” or “addressable tree”.

Furthermore, literature on vector models for phrases can be found on the Web. IDFs for phrases are certainly computable. To dig into the subject, or into inverted index strategies, index segmentation, index merging, etc., read http://lucene.sourceforge.net/talks/inktomi/ or just do a search for “phrase idf”.
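
To show that a phrase IDF is computable, here is a minimal Python sketch that treats a phrase as a unit: n becomes the number of docs in which the words occur consecutively, and the usual log(N/n) follows. The toy collection and function names are hypothetical, and a real engine would use word positions stored in the posting lists rather than rescanning the text.

```python
import math

DOCS = {  # hypothetical toy collection
    1: "vector space model for search",
    2: "space for a new vector model",
    3: "the vector space model in ir",
    4: "link analysis and spam",
}

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occur consecutively within tokens."""
    k = len(phrase_tokens)
    return any(
        tokens[i:i + k] == phrase_tokens
        for i in range(len(tokens) - k + 1)
    )

def phrase_idf(phrase, docs):
    """IDF = log(N/n), with n counted over docs containing the phrase."""
    phrase_tokens = phrase.split()
    n = sum(contains_phrase(text.split(), phrase_tokens)
            for text in docs.values())
    return math.log10(len(docs) / n) if n else None

print(phrase_idf("vector", DOCS))
print(phrase_idf("vector space", DOCS))
```

Doc 2 contains both words but not the phrase, so “vector space” ends up with a higher IDF than “vector” alone.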

Conclusion

To sum up, even if no current commercial search engine uses plain tf*IDF, this does not mean they don’t use vector space models for classification, retrieval, and ranking.

Vector space models are often present in different flavors and at different levels of an IR architecture. Vector theory itself is an ancillary theory used in IR that often shows up beautifully in LSI, co-occurrence, segmentation analysis, and other models.

Search Engines Architecture Week 6

18 Friday Apr 2008

Posted by egarcia in Graduate Courses, Search Engines Architecture Course

≈ Leave a Comment

Week 6 Agenda

Lecture Session

Review on Regular Expressions
Lovins, Krovetz, and Porter Stemmers
Xu & Croft Stemmer (Word Co-occurrence-based Stemmers)
The Parallel Stemmer (LSI/Vector Space-based Stemmer)

Lab Session

Building a Stemmer

Comments: Lab 3 is due next week, but if you prefer to turn it in earlier, that is OK. Lab 4 and this lab (Lab 5) are similar in nature. We can negotiate their deadlines tomorrow in class. A reminder that all labs are due in hard-copy and electronic format to get full credit.

How Search Engines Do Not Work

17 Thursday Apr 2008

Posted by egarcia in Machine Learning, Search Engines Architecture Course, SEO Myths

≈ Leave a Comment

If you are taking the Search Engines Architecture grad course, by now you should have learned what the main components of a search engine are and how to build a web crawler and a parser. You should know how to build an inverted index, how to use it to dynamically generate query-specific term-document matrices, and how to populate these with a variety of scoring models other than plain tf-IDF.

As the course progresses you will learn how to speed up document ranking through caching/updating  and divide-and-conquer multitiered strategies.

By now you should also have realized why most of the stuff published by SEOs about how search engines work is either misconception, myth, or just untrue folklore. E.g., while some have an incorrect idea of how vector space models are used, the bold idea that search engines do not use vector models to rank documents is simply nonsense.

To illustrate, visit the following two links:

http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html

http://www.atg.wa.gov/uploadedFiles/Home/News/Press_Releases/2004/Internet%20AdvancementComplaint.doc

The first is an SEO’s discussion of “how search engines work” and use the Vector Space Model. The second is about the State of Washington suing a marketing company for mis-selling “search engine optimization” services.

How many factually incorrect statements/assumptions can you spot from the author of the first article and its commenters?

How many impossible facts and untrue statements can you spot in the second by the defendants?

If you have problems visiting the second link, I have a pdf copy for your perusal.

IRW 2008-04: Principal Component Analysis (PCA)

16 Wednesday Apr 2008

Posted by egarcia in Data Mining, Newsletters

≈ Leave a Comment

[Figure: PCA. Visualizing the two principal components of a data set.]

The current issue of IR Watch – The Newsletter should be in your inbox during the day.

It is on Principal Component Analysis and covers the following:

Introduction
What is PCA
A Reaction Equation Approach
Computing PCA with SVD
A Practical Example
Applying SVD to the Covariance Matrix
Improving Results with SPCA
Beyond the Covariance Matrix
Conclusion
References
News, Research, and Events
Terms of Use and Copyright

Experiment in Parsing Techniques

14 Monday Apr 2008

Posted by egarcia in Machine Learning

≈ 1 Comment

Here are some useful notes for those taking the Search Engines Architecture grad course.

In Lab 4: Experiment in Parsing Techniques, you are building a Query Normalizer, a Document Parser, and a URL Normalizer.

In section 1, be sure the tool removes all kinds of HTML instructions from the search interface.

In section 2, the document parser should do document linearization, tokenization, and filtration, but no stemming. The final output should be a list of tuples consisting of unique terms and their occurrences.
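
To make the three stages concrete, here is a rough Python sketch of such a parser. It is not the lab solution: the stopword list is a tiny hypothetical one, HTML entities are not decoded, and the function name parse_document is made up.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # hypothetical

def parse_document(html):
    """Return a sorted list of (term, occurrences) tuples."""
    # Linearization: drop script/style blocks, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Tokenization: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Filtration: remove stopwords. No stemming is applied.
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return sorted(counts.items())

page = (
    "<html><title>The Stemmer</title>"
    "<script>var x = 1;</script>"
    "<p>Stemmer and stemming of the words</p></html>"
)
print(parse_document(page))
```

Because no stemming is applied, “stemmer” and “stemming” stay as separate terms in the output.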

Regarding section 3.1, Building a URL Normalizer, I’m adding new restrictions to this part.

This part challenges you to use regular expressions, only. No arrays, no scripting loops, no conditionals (if-then), no lookup libraries, no DNS lookups, just regexps. It should work for all valid urls available on the Web. It can be done.

The parser for the URL normalizer tool should remove from a URL all kinds of:

protocols (http, https, ftp, etc)
www prefixes
top-level domains (TLD)
ports (:80, :8000, etc)
file extensions (.html, .php, etc)
parameters (name=value pairs)
named anchors or fragments (#)
URL-forbidden characters, international characters, or script lines

Be sure you understand the difference between top-level domains and subdomains. For example, .com and .pr are TLDs, but .com.pr defines a subdomain.

Let’s say we have the http://www.xxx.com.pr site. If we remove http://www. and .pr from this we end up with xxx.com, but .com is not the TLD in this case; the TLD is still .pr. Redirection mechanisms are “another twenty bucks” (“otros veinte pesos”), which might confuse the concept.

You should also know that a TLD can be a generic top-level domain (gTLD), a country-code top-level domain (ccTLD), an international TLD (iTLD), or a US legacy TLD (usTLD). When combined, these define a subdomain. For example,

.edu.pr is a subdomain with .pr as TLD
.co.uk is a subdomain with .uk as TLD

That is, for a url on the Web, only one string sequence works as and defines the top-level domain.

Thus, for:

http://www.google.com.pr

https://www.google.com/adsense/login/en_US/?gsessionid=wfx3oxHhgDU

ftp://www.gobierno.pr/index.html

http://search.music.yahoo.com/search/?m=all&p=britney

http://www.telegraph.co.uk

http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s

http://www.lis.ntu.edu.tw/~mctang/index.htm

your tool should return:

google.com
google
gobierno
search.music.yahoo
telegraph.co
video.google.co
lis.ntu.edu
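
One way to get these outputs with a single regular expression is sketched below in Python. It is a starting point rather than a complete answer: the pattern and the names URL_PATTERN and normalize_url are made up, it has only been checked against the seven sample URLs, and it does not handle forbidden or international characters, user:password prefixes, or IP-address hosts.

```python
import re

URL_PATTERN = re.compile(
    r"^[a-z][a-z0-9+.-]*://"  # protocol (http, https, ftp, ...)
    r"(?:www\.)?"             # optional www prefix
    r"([^/:?#]+)"             # host labels to keep (captured)
    r"\.[^./:?#]+"            # last host label, i.e. the TLD
    r"(?:[:/?#].*)?$",        # port, path, parameters, fragment
    re.IGNORECASE,
)

def normalize_url(url):
    """Replace the whole URL with the captured host labels."""
    return URL_PATTERN.sub(r"\1", url)

SAMPLES = [
    "http://www.google.com.pr",
    "https://www.google.com/adsense/login/en_US/?gsessionid=wfx3oxHhgDU",
    "ftp://www.gobierno.pr/index.html",
    "http://search.music.yahoo.com/search/?m=all&p=britney",
    "http://www.telegraph.co.uk",
    "http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s",
    "http://www.lis.ntu.edu.tw/~mctang/index.htm",
]
for sample in SAMPLES:
    print(normalize_url(sample))
```

The key design choice is that only the last dot-separated label of the host is treated as the TLD. That is why .com survives in google.com.pr but is removed from www.google.com.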

Recommended Reading Material:

http://en.wikipedia.org/wiki/.pr
http://www.icann.org/meetings/saopaulo/presentation-dns-conrad-07dec06.pdf
http://www.mattcutts.com/blog/seo-glossary-url-definitions/  

Search Engines Architecture Week 5

11 Friday Apr 2008

Posted by egarcia in Latent Semantic Indexing, Search Engines Architecture Course

≈ Leave a Comment

Week 5 Agenda

Lecture Session

Building a Parser
Parsing Techniques and Implementations

Lab Session

Building a Query Normalizer
Building a Document Parser
Building a URL Normalizer

Comments

A reminder that hard and digital copies are required for all lab reports to get full credit. I need to evaluate whether the programs actually work as intended.

Report deadlines: To be announced, to accommodate class needs and previous lab needs.
