IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Machine Learning

A Simple Search Strategy to beat them all

01 Tuesday Dec 2009

Posted by egarcia in Data Mining, Machine Learning, Programming, Queries, Vector Space Models

≈ 6 Comments

Now that I’m out of school, I am doing what I love the most: programming and testing IR systems.

I’m currently testing a ranking algorithm for an IR system built over the last few years. The answer set is based on a simple matching (SM) search strategy.

Do not mistake a simple matching strategy for a simple or basic search approach, as it can evolve into the most complex one.

Unlike classic boolean searches (e.g., AND, OR, XOR), SM is suitable for constructing answer sets and subsets based on coordination levels. Add a supporting scoring function (tf-IDF derivatives, RSJ-PM, BM25, etc.) and… TA DA: a customizable clustering algorithm for retrieving and ranking search results.

Proper fine tuning allows presenting end-users with answer sets wherein AND results are accumulated at the top of the search results. As users move down the search results, they are presented with OR results and the search experience is perceived as if the system expands the answer set by switching query modes.
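To make the idea concrete, here is a minimal sketch of simple matching with coordination levels. It is not the ranking algorithm under test, which is not described in this post. The function name, the toy documents, and the plain tf-IDF sum used as the supporting scoring function are all assumptions made for illustration.

```python
import math

def rank_simple_matching(query, docs):
    """Rank docs by coordination level first, then by a tf-IDF style score.

    Hypothetical helper: the scoring function is a plain tf * log(N/df) sum,
    chosen only to illustrate the idea.
    """
    terms = set(query.lower().split())
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency of each query term.
    df = {t: sum(1 for toks in tokenized.values() if t in toks) for t in terms}

    ranked = []
    for doc_id, toks in tokenized.items():
        matched = [t for t in terms if t in toks]
        if not matched:
            continue  # simple matching: at least one query term must occur
        level = len(matched)  # coordination level = number of matched terms
        score = sum(toks.count(t) * math.log(n_docs / df[t]) for t in matched)
        ranked.append((doc_id, level, score))

    # Highest coordination level first, then highest score, then doc id.
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked

# Toy collection (made up for this example).
docs = {
    "d1": "brazilian wax salon prices",
    "d2": "car wax and polish",
    "d3": "brazilian coffee beans",
    "d4": "brazilian wax brazilian wax tips",
}
results = rank_simple_matching("brazilian wax", docs)
for doc_id, level, score in results:
    print(doc_id, level, round(score, 3))
```

Documents matching both terms (the AND results) come first, followed by documents matching only one term (the OR results), which is the effect described above.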

I’ve also added a query reduction mechanism for discovering related searches. Brazilian Wax, nice!

In preliminary tests, results compare favorably with answer sets from search engines that claim to do search expansion/reduction, query mode switching, or clustering.

The next step is to check whether, with a large corpus and a thesaurus, results compare favorably with results from search engines that claim to use semantics.

So far, mine is cost-effective and does not require extra libraries.

PS: I forgot to mention that my ranking algorithm is not based on computing vectors or cosine similarities, so any overhead from a Vector Space Model is avoided. That’s the icing on the cake!

A CBR Sharing Search Engine System

17 Tuesday Mar 2009

Posted by egarcia in Data Mining, Machine Learning, Vector Space Models

≈ Leave a Comment

I’m reading with great interest the paper “Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System”, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, and Ning Xiong, presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management (COMADEM 2007), Faro, Portugal, pp. 305-314.

I’m happy to read they referenced our Tutorial on Cosine Similarity Measures. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.

Their abstract follows:

ABSTRACT
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.

Yahoo! BOSS and Google SearchWiki

21 Friday Nov 2008

Posted by egarcia in Data Mining, Machine Learning

≈ Leave a Comment

Search engines are realizing they can earn more revenue by allowing users to manipulate their indexes. Yahoo! BOSS and Google SearchWiki are efforts in that direction.

Essentially, these two search platforms provide rehashes of whatever the search engines already have in their indexes, but with some features added for re-sorting, editing, and making annotations.

We can see some value in using these as aggregation layers for third-party search technologies and for searchers who are prone to personalization. Such searches might be useful for conducting annotated Web Intelligence.

Still, these are not platforms designed for searching the Deep Web.

Best Algorithm Combinations for Speech Processing

15 Wednesday Oct 2008

Posted by egarcia in Machine Learning

≈ Leave a Comment

I am happy to read that my Cosine Similarity tutorial is being referenced by Serguei Mokhov in his paper

“Study of best algorithm combinations for speech processing tasks in machine learning using median vs. mean clusters in MARF”

http://portal.acm.org/citation.cfm?id=1370262

The paper was published in the ACM International Conference Proceeding Series, Vol. 290 (Proceedings of the 2008 C3S2E conference), and is a great read.

Independence, Disjointness, and IR Flaws

11 Monday Aug 2008

Posted by egarcia in Machine Learning, Vector Space Models

≈ Leave a Comment

Often independent events are mistaken for exclusive (disjoint) events. These are two different animals.

Consider two events, A and B. Let p(A OR B) be their union probability and p(A AND B) their joint probability. The general addition law for probabilities states that for any two events A and B

p(A OR B) = p(A) + p(B) – p(A AND B)

If events are independent

p(A AND B) = p(A)p(B)

Thus,

p(A OR B) = p(A) + p(B) – p(A)p(B)

Whereas if these are exclusive

p(A AND B) = 0

Therefore,

p(A OR B) = p(A) + p(B)

Furthermore,

if p(A AND B) = p(A)p(B) events are independent, occurring by chance.
if p(A AND B) > p(A)p(B) events are positively correlated, occurring more often than by chance.
if p(A AND B) < p(A)p(B) events are negatively correlated, occurring less often than by chance.
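A small worked example may help check these relations. The sketch below uses one roll of a fair six-sided die; the events and the helper names are made up for illustration.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
OUTCOMES = frozenset(range(1, 7))

def prob(event):
    """Probability of an event, given as a set of die faces."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def describe(a, b):
    """Return p(A OR B) via the addition law and how A and B are related."""
    p_a, p_b, p_joint = prob(a), prob(b), prob(a & b)
    p_union = p_a + p_b - p_joint      # general addition law
    assert p_union == prob(a | b)      # agrees with counting the union directly
    expected = p_a * p_b               # joint probability under independence
    if p_joint == expected:
        relation = "independent"
    elif p_joint > expected:
        relation = "positively correlated"
    else:
        relation = "negatively correlated"
    return p_union, relation

independent_case = describe({2, 4, 6}, {1, 2})   # p(A AND B) = 1/6 = p(A)p(B)
exclusive_case = describe({1, 2}, {5, 6})        # p(A AND B) = 0
overlapping_case = describe({2, 4, 6}, {2, 4})   # p(A AND B) = 1/3 > 1/6

print(independent_case, exclusive_case, overlapping_case)
```

Note that the exclusive pair is reported as negatively correlated: with nonzero probabilities, p(A AND B) = 0 is always smaller than p(A)p(B), so exclusive events cannot be independent.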

Talking in “rice and beans” (Hablando en “arroz con habichuelas”):

Exclusive events do not have common outcomes, as the occurrence of one excludes the occurrence of the other. By contrast, independent events (with nonzero probabilities) do have common outcomes, but the occurrence of one does not influence the occurrence of the other.

Independence and disjointness are very different things.

In IR, assuming that the IDF of a combination of terms can be taken for the sum of individual term IDF values presumes that terms are independent regardless of the actual data.
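The link between additive IDF values and independence can be checked numerically. The sketch below assumes the common definition IDF = log(N/df) with natural logarithms; the collection size and document frequencies are hypothetical numbers chosen for illustration.

```python
import math

def idf(n_docs, df):
    """IDF as log(N/df); the natural logarithm is an assumption made here."""
    return math.log(n_docs / df)

# Hypothetical collection statistics.
N = 1000          # documents in the collection
DF_A = 100        # documents containing term a
DF_B = 50         # documents containing term b
DF_AB_OBSERVED = 20  # documents containing both terms

# Under independence, p(a AND b) = p(a)p(b), so the expected joint df is:
df_ab_expected = N * (DF_A / N) * (DF_B / N)   # 5.0

idf_sum = idf(N, DF_A) + idf(N, DF_B)
idf_joint_independent = idf(N, df_ab_expected)
idf_joint_observed = idf(N, DF_AB_OBSERVED)

print(round(idf_sum, 4), round(idf_joint_independent, 4), round(idf_joint_observed, 4))
```

The sum of the individual IDF values equals the IDF of the combination only when the joint document frequency is the one expected under independence. Here the terms co-occur more often than by chance, so the sum overstates the IDF computed from the observed data.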

Arbitrarily assuming event independence, ignoring the experimental evidence, is one of the main sources of inaccuracies/flaws in many IR models (Cooper, 1991). However, excluding independence altogether is also unreasonable (Sparck Jones, Walker, and Robertson, 1998).

References

Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In A. Bookstein, Y. Chiaramella, G. Salton, & V. V. Raghavan (Eds.), Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR ’91) (pp. 57-61). Chicago, Illinois: ACM.

Sparck Jones, K., Walker, S., & Robertson, S. E. (1998). A probabilistic model of information retrieval: development and status. TR 446, September. Computer Laboratory, University of Cambridge.

Sneak Preview of IRW: Graduate Research

01 Friday Aug 2008

Posted by egarcia in Graduate Courses, Machine Learning, Marketing Research, Theses

≈ Leave a Comment

The current issue of IRW, Graduate Students Research, is out. It consists of short abstracts of research conducted by graduate students.

In this issue:

Introduction
Genetic Algorithms, K-Means, and Fuzzy C-Means
Word Association Patterns
U-Site Search Engine Interface
Enhancement of a U-Site Search Engine Interface
News, Research, and Events
Terms of Use and Copyright

The next issue will go back to its how-to mode.

The Porter Stemmer

18 Friday Jul 2008

Posted by egarcia in Machine Learning, Programming, Theses

≈ Leave a Comment

A grad student asks (name omitted):

Dear Dr. Garcia,

I’m interested in developing a Porter Stemmer for the Irish language.

Would it be possible to send me your lecture notes for Porter Stemmer
development from your graduate course?

I am doing an MSc thesis on developing a search engine for TEI marked up
multilingual texts and hope to use Apache Lucene as a basis.

Thanks for any help,

******
MSc student,
UCC, Cork, Ireland.

Thank you for reading this blog and for emailing me, but I normally don’t release lecture notes. However, the lecture was based on Martin Porter’s site, which can be accessed by visiting the following link:

http://tartarus.org/~martin/PorterStemmer/

You might also want to check the Porter2 Stemmer:
http://snowball.tartarus.org/algorithms/english/stemmer.html

For stemmers in other languages, check the Snowball site:
http://snowball.tartarus.org/

The great thing about the Porter stemmer is that it has been written in many programming flavors and languages.

Claps and Slaps

14 Monday Jul 2008

Posted by egarcia in Latent Semantic Indexing, Machine Learning, SEO Myths, Theses

≈ 2 Comments

Claps

Graduate student David Petar Novakovic ( http://dpn.name/index.php/2007/06/04/seos-caught-out/ ), who conducts research in LSI and a few other great areas at the intersection of IR, NLP, and AI, wrote to tell me that he has almost finished his grad thesis. Thanks, David, for referencing my tutorials on LSI/SVD in the thesis. He also submitted a reduced version of the paper to EMNLP. Congrats, David. We are so happy for you.

Slaps

There is something funny about SEOs that sell snake oil ( http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/ ). They get angry to their bones when we expose their myths and lies through IR knowledge, but they seem to praise us when we debunk the myths and lies of other snake oil sellers that compete with them. TFIDF, Markov chains, LSI, and keyword density are a few examples. Ha, ha. I’m so glad efforts like AIRWeb, EMNLP, and others are here to stay.

The following links provide additional information about AIRWeb and EMNLP

AIRWeb
http://irthoughts.wordpress.com/2008/04/29/for-seo-spammers-airweb-2008-presentations/  

EMNLP
http://conferences.inf.ed.ac.uk/emnlp08/

IR Quiz

18 Wednesday Jun 2008

Posted by egarcia in Graduate Courses, IR Tutorials, Machine Learning

≈ Leave a Comment

Here is a question I included in the final examination of the Search Engines Architecture course. I am modifying the question. It might serve as a little quiz for non-IR readers:

A collection consists of 500 documents. Some documents mention the keywords k1 and/or k2: 100 mention k1, 200 mention k2, 70 mention both k1 and k2, and 25 mention the k1 k2 terms sequence. Calculate the number of results for the following queries, first assuming terms independence and second assuming terms dependence. If a calculation is not possible from the data provided, write NC (‘Not Computable’).

1. k1 NOT k2

2. k2 NOT k1

3. k1 OR k2 (unconditional OR)

4. k1 OR k2 (conditional OR)

5. NOT k1

6. NOT k2

7. NOT (k1 AND k2)

8. k1 AND k2 NOT (k1 k2)

9. EF-Ratio of the k1 k2 terms sequence

10. c12-index of the k1 k2 terms sequence

11. c12-index of k1 AND k2

12. IDF of k1

13. IDF of k2

14. IDF of k1 AND k2

15. IDF of k1 k2 terms sequence

Total Possible Scores: 15 points for terms independence and 15 points for terms dependence correct results.

Grading Yourself: A (100 – 90), B (89 – 80), C (79 – 70), D (69 – 60), F (59 – 0)

Correct answers will be given during the week.
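For readers who want to check their arithmetic on the purely set-based queries (1, 2, 5, 6, and 7), here is a minimal sketch under one possible reading of the question. It is not the official answer key. It assumes that "terms dependence" means using the observed count of 70 documents for k1 AND k2, and that "terms independence" means replacing that count with the one expected by chance, N × p(k1) × p(k2). The function and variable names are hypothetical.

```python
# Counts given in the question.
N, DF_K1, DF_K2, DF_BOTH = 500, 100, 200, 70

def set_query_counts(n, df1, df2, joint):
    """Result counts for the set-based queries, given a joint count for k1 AND k2."""
    return {
        "k1 NOT k2": df1 - joint,
        "k2 NOT k1": df2 - joint,
        "NOT k1": n - df1,           # does not depend on the joint count
        "NOT k2": n - df2,           # does not depend on the joint count
        "NOT (k1 AND k2)": n - joint,
    }

# Assumed reading of "dependence": use the observed joint count.
dependent = set_query_counts(N, DF_K1, DF_K2, DF_BOTH)

# Assumed reading of "independence": joint count expected by chance.
expected_joint = N * (DF_K1 / N) * (DF_K2 / N)
independent = set_query_counts(N, DF_K1, DF_K2, expected_joint)

for query in dependent:
    print(query, dependent[query], independent[query])
```

The remaining items (the two OR variants, the EF-Ratio, the c12-index, and the IDF values) depend on definitions that are not given in this post, so the sketch leaves them out.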

 

PowerSet Semantic Searches in Wikipedia

12 Monday May 2008

Posted by egarcia in Machine Learning, Search Engines Architecture Course

≈ Leave a Comment

According to Reuters,

Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.

Powerset’s technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.

What does Google have to say about the topic?

According to PCWorld:

In an interview in October with IDG News Service, Marissa Mayer, Google’s vice president of Search Products & User Experience, acknowledged that the company’s search engine should — and will — overcome its keyword dependence in time.

“People should be able to ask questions and we should understand their meaning, or they should be able to talk about things at a conceptual level. We see a lot of concept-based questions — not about what words will appear on the page but more like ‘what is this about?’. A lot of people will turn to things like the semantic Web as a possible answer to that,” she said.

But she added that Google’s search engine acts smart thanks to the humongous amount of data it crunches. “With a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force,” she said. As examples, she cited a query like “GM,” which the engine interprets as “General Motors” but if the query is “GM foods,” it delivers results for “genetically-modified foods.” “Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart, like it achieved that semantic understanding, but it hasn’t really,” she said.

Hmm…

Searches for GM goods and for GM in Powerset both return results relevant to General Motors, while Google does discriminate between these searches, possibly using brute force.

By contrast, searches for GM foods and for GM are discriminated by both engines.

PowerSet, Google, and almost all search engines do not seem to discriminate between the following two semantically different searches, which counts against the aforementioned semantic analysis claims:

Who is the best college junior?

Who is the best junior college?

A simple change in word order affects meaning and the information needs sought. Semantic searches? There is still a long way to go. This is going to be a nice race to watch, from the architectural side.

Talking about search engines architecture, the current issue of IRWatch – The Newsletter is the very same practice test I am giving to my grad students. Since they need to study for the finals, I thought I could kill two birds with one stone. It should reach subscribers’ inboxes today or, at the latest, tomorrow.
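The word-order problem is easy to reproduce. The sketch below is not a description of how PowerSet or Google process queries. It only shows that a plain bag-of-words representation cannot tell the two queries apart, while adjacent word pairs (bigrams) can. The helper names are made up for illustration.

```python
from collections import Counter

def tokens(query):
    """Lowercase the query, drop the trailing question mark, split on spaces."""
    return query.lower().strip("?").split()

def bag_of_words(query):
    """Term counts only; word order is discarded."""
    return Counter(tokens(query))

def bigrams(query):
    """Adjacent word pairs; word order is preserved."""
    toks = tokens(query)
    return list(zip(toks, toks[1:]))

q1 = "Who is the best college junior?"
q2 = "Who is the best junior college?"

same_bag = bag_of_words(q1) == bag_of_words(q2)
same_bigrams = bigrams(q1) == bigrams(q2)

print(same_bag)      # the two bags of words are identical
print(same_bigrams)  # the bigram sequences differ
print(bigrams(q1)[-1], bigrams(q2)[-1])
```

Any system that scores only on term counts will treat both queries as the same information need. Capturing the difference requires at least some sensitivity to term sequences.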
