IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: November 2007

Call for Papers for ACL:HLT Conference

30 Friday Nov 2007

Posted by egarcia in Conferences

Dr. Ellen Voorhees, over at TREC (NIST.GOV), sent me this Call for Papers:

I would like to call the IR research community’s attention to the “ACL 2008: HLT” conference.  This year ACL is making a concerted effort to attract excellent Information Retrieval research papers.  I am serving as PC co-chair for IR, and am helped by Noriko Kando, David Carmel, and Elizabeth Liddy, who are serving as “area chairs” for IR.  We are putting together a group of reviewers who have solid experience in IR, both to ensure that good IR research is recognized and to prevent poor IR research from slipping through.

Please consider submitting your IR research to ACL:HLT this year.  The ACL connection means that there will be some bias toward papers that touch on language technologies such as NLP, speech recognition, machine translation, discourse, and so on.  However, “general” IR papers are entirely within scope, and areas with IR roots or connections are also encouraged: text mining, filtering, recommendation systems, question answering, classification, clustering, sentiment analysis, etc.

The submission deadline for ACL:HLT is January 10, 2008.  ACL will be held June 15-20 near Ohio State University in Columbus, Ohio.  (The discount airline Skybus flies there from a large number of places around the US, and guarantees 10 seats for $10 each on every flight.  Of course, they’re probably already taken, but it’s nice to contemplate.)

If you’re considering SIGIR instead or in addition, its submission deadline is January 28th [abstract due a week earlier].  SIGIR will be held July 20-24 in Singapore.  SIGIR’s acceptance rate has been running slightly below 20% lately.

The complete call for papers as well as other useful information is available at http://acl2008.org.

A Case-Based Experience Sharing Search Engine

28 Wednesday Nov 2007

Posted by egarcia in IR Tutorials, Machine Learning

Mobyen Ahmed, Erik Olsson, Peter Funk, and Ning Xiong, from the Department of Computer Science and Electronics, Malardalen University, Sweden, have published the paper “Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System” (http://www.mrtc.mdh.se/publications/1269.pdf) at a workshop of the Swedish Artificial Intelligence Society, pp. 70-80, in May 2007.

This is a case-based reasoning search engine system that could be used in an industrial environment. That’s quite interesting.

In their paper, the authors kindly referenced me. That’s an honor. It appears that more CS researchers are happy with my Cosine Similarity Tutorial (http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html) and with Mi Islita’s content (http://www.miislita.com/searchito/educational-links.html). If they are happy, I’m happy.

Web Mining Week 3

26 Monday Nov 2007

Posted by egarcia in Web Mining Course

Week 3 Agenda

1. Introduction to Parsing: Building a Query Normalizer (PPT Presentation).
2. Understanding Search Engine Snippets via the SES 2005, San Jose presentation: Patents on Duplicated Content (PDF Presentation).
3. Demonstration of Snippy Software: A Snippet Generator and Constrain Searcher.
4. Introduction to Association and Scalar Clusters: Keyword Clustering through a Similarity Matrix (PDF Presentation).
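
As a rough illustration of agenda item 4, the sketch below builds a keyword similarity matrix from a tiny term-by-document frequency table. It assumes the textbook-style definitions: raw association c(u,v) as the sum over documents of f(u,j)·f(v,j), a normalized form c(u,v) / (c(u,u) + c(v,v) − c(u,v)), and scalar-cluster similarity as the cosine between rows of the association matrix. The function names, terms, and counts are invented for illustration; they are not the course's data or code.

```python
import math

def association_matrix(freqs):
    """freqs[u][j] is the frequency of term u in document j.
    Returns c where c[u][v] = sum_j freqs[u][j] * freqs[v][j]."""
    n = len(freqs)
    return [[sum(a * b for a, b in zip(freqs[u], freqs[v])) for v in range(n)]
            for u in range(n)]

def normalized_association(c):
    """Normalize raw association scores to the range [0, 1]."""
    n = len(c)
    s = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            denom = c[u][u] + c[v][v] - c[u][v]
            s[u][v] = c[u][v] / denom if denom else 0.0
    return s

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def scalar_matrix(c):
    """Scalar clusters: compare terms by the cosine of their association rows."""
    n = len(c)
    return [[cosine(c[u], c[v]) for v in range(n)] for u in range(n)]

# Hypothetical toy data: three terms counted in three documents.
terms = ["search", "engine", "fractal"]
freqs = [[2, 1, 0],
         [1, 1, 0],
         [0, 0, 3]]

c = association_matrix(freqs)
for name, row in zip(terms, normalized_association(c)):
    print(name, [round(x, 3) for x in row])
```

Terms that co-occur in the same documents ("search" and "engine" here) end up with high scores, while a term that never co-occurs with them ("fractal") scores zero against both.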

Required Reading Material


http://www.grammarbook.com/punctuation/apostro.asp


http://owl.english.purdue.edu/handouts/grammar/g_apost.html


http://tartarus.org/~martin/PorterStemmer/def.txt


http://www.json.org


http://www.json.org/js.html


http://www.crockford.com/javascript/javascript.html


http://www.miislita.com/search-engine-conferences/duplicated-content-patents.html

  Note: The Word and PDF versions of this talk are no longer available on the Web.

Bonus for Take-Home Work 1

This bonus is available to individual students, not to student groups. Each student must provide two separate working HTML scripts: (a) one containing the binary interface and (b) one containing the cruncher.

After learning this week how to build the search interface of a query normalizer, complete the following:

1. Modify this search interface so that it only accepts binary data (1’s or 0’s).
2. Modify the search interface so that it becomes an HTML cruncher application. Add customized prototype methods so that the cruncher removes tabs, carriage returns, newlines, and unnecessary white space found in a typical HTML document. In addition, it should be non-invasive; i.e., it should not affect the functionality of HTML documents containing scripts, CSS instructions, or comment lines. You might need to retest the cruncher thoroughly with real documents from the Web, preferably with spam documents and with the source code of emails. To grab the source of an Outlook Express email, open an email you have received and navigate to:

File > Properties > Details > Message Source

Then, right-click the message source and click Select All to highlight everything, and right-click again to click Copy. You can also press Ctrl+A to select all and Ctrl+C to copy. Paste the source into your HTML cruncher and crunch it.
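
To make the "non-invasive" requirement concrete, here is a minimal sketch of the crunching logic only. It is written in Python rather than as the JavaScript prototype methods the assignment asks for, so treat it as an outline of one possible approach, not a solution to hand in. The `crunch` function and the `PROTECTED` pattern are hypothetical names; the sketch assumes that script blocks, style blocks, preformatted text, and comments should be passed through untouched, and that whitespace runs elsewhere can be collapsed to a single space.

```python
import re

# Blocks whose whitespace must be left alone (assumption: scripts, CSS,
# preformatted text, and HTML comments).
PROTECTED = re.compile(
    r"(<script\b.*?</script\s*>|<style\b.*?</style\s*>"
    r"|<pre\b.*?</pre\s*>|<!--.*?-->)",
    re.IGNORECASE | re.DOTALL,
)

def crunch(html):
    """Collapse tabs, carriage returns, newlines, and repeated spaces
    outside protected blocks; protected blocks are copied verbatim."""
    parts = PROTECTED.split(html)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd positions hold the captured protected blocks.
            out.append(part)
        else:
            out.append(re.sub(r"\s+", " ", part))
    return "".join(out).strip()

sample = "<p>Hello\t\tworld</p>\r\n\r\n<script>\nvar a = 1;\n</script>\n<p>  bye  </p>\n"
print(crunch(sample))
```

A regular expression is not a full HTML parser, so real spam pages and email sources may still contain cases this misses (for example, comment-like text inside a script); that is exactly why the cruncher should be retested on real documents.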

A PageRank-Rank Correlation?

20 Tuesday Nov 2007

Posted by egarcia in Marketing Research

On 11-16, Stephane Labert sent me a copy of an article that attempts to correlate Google’s PageRank and the rank of a document in this search engine’s result pages (SERPs).

Although Labert apparently worked hard on the piece and deserves credit for that, I found the article disappointing in its sampling, its choice of regression model, and the statistical analysis employed.

I suggested to Labert a few tips and things to look at, since my perception was that the article was not ready for prime time. My intention: to spare Labert unnecessary “harm”.

I was too late. Apparently, by the time I received it, the piece had already been sent to many known SEOs and webmasters. These included some IRW readers, among them expert cloaker Ralph Tegtmeier, aka fantomaster.

On 11-17 Tegtmeier blogged about it. He and other SEOs promptly called the article’s statistical analysis into question. I am not going to go over their reactions since I pretty much agree with their critiques. Besides, the main issues argued by Labert and these SEOs are not new at all and have been revisited many times. For those interested, reactions to Labert’s article can be read at the following links:


http://fantomaster.com/fantomNews/archives/2007/11/17/pagerank-evolution-and-serp-rankings-analyzed-evaluating-a-statistical-study/


http://sphinn.com/story/14452#wholecomment18087


http://www.timnash.co.uk/11/2007/lies-damn-lies-and-pagerank-statistics/

Rather than echoing their comments, I prefer to discuss the experimental design of Labert’s article:

Firstly, the sampling:

There is no full disclosure of how the data was collected. To be honest, this goes against the article’s credibility. Which queries were used? How many terms were used per query: 2, 3, 4…? Which query modes were used: AND (FINDALL), ANY (OR), EXACT, constraining modes…? This is important since many variables, including the query, can influence SERPs. None of this was disclosed in the article.

As mentioned, many variables affect ranking results, and some have interactions. Ignoring these interactions and then isolating one variable and plotting this against an X axis does not provide an accurate picture.

Secondly, the regression model:

Why was the data fitted to a linear model when it actually tends to be nonlinear? Why were apparent outliers included in the least-squares analysis? Which error analysis with respect to the slope was used to justify the inclusion or rejection of these apparent outliers? None of this was explained or reported.

Thirdly, variable dependencies:

All graphs show a fitted regression line with a very small slope. This suggests that changes in the X-axis (Rank) provoke small changes in the Y-axis (PageRank), indicating that the variables are almost independent of one another, and that is despite the correlation coefficient allegedly reported as close to 1.

Indeed, a correlation coefficient close to 1 is not enough. To investigate whether any two variables depend on one another, or whether there is a significant correlation between them, we need to do more than just look at a bunch of correlation coefficients. As a matter of fact, an almost flat straight line, nearly orthogonal to the Y-axis, actually suggests orthogonality and variable independence.

To assess whether the correlation found is significant, one could conduct a two-tailed t-test with n – 2 degrees of freedom on the correlation coefficient at a defined confidence level. To do this, one states the null hypothesis that there is no correlation between X and Y and compares the experimental t-value against tabulated values from t-test tables. If t-experimental is greater than t-table, the null hypothesis is rejected; that is, we conclude in such a case that a significant correlation does exist. This test was not reported, either.
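
For readers who want to run this test themselves, here is a minimal sketch. It assumes the standard statistic for a Pearson correlation coefficient r computed from n pairs, t = r·sqrt(n − 2) / sqrt(1 − r²), and it expects the critical value to be looked up in a t-table by the caller. The function names and the sample values of r and n are hypothetical; they are not figures from Labert’s article.

```python
import math

def t_statistic(r, n):
    """Experimental t-value for a correlation coefficient r from n pairs."""
    if n < 3:
        raise ValueError("need at least 3 pairs for n - 2 degrees of freedom")
    if abs(r) >= 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def is_significant(r, n, t_table):
    """Two-tailed decision: reject the null hypothesis of no correlation
    when |t-experimental| exceeds the tabulated critical value."""
    return abs(t_statistic(r, n)) > t_table

# Tabulated two-tailed critical value at the 95% confidence level
# for n - 2 = 8 degrees of freedom.
T_TABLE_DF8_95 = 2.306

for r in (0.5, 0.8):  # hypothetical correlation coefficients, n = 10
    t = t_statistic(r, 10)
    print(r, round(t, 3), is_significant(r, 10, T_TABLE_DF8_95))
```

With only ten pairs, a coefficient of 0.5 does not pass the test while 0.8 does, which is why the sample size has to be reported along with the coefficient.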

Labert claims to have conducted more detailed research to support the article’s claims. I look forward to reading that.

Web Mining Week 2

19 Monday Nov 2007

Posted by egarcia in Web Mining Course

Week 2 Agenda:

1. The User-Machine Relevance Perception Gap (PPT presentation)
2. Introduction to Document Indexing (PPT presentation)
3. Linearization: markup removal
4. Tokenization: punctuation removal
5. Filtration: stopword removal
6. Stemming: suffix/prefix removal
7. Tools to approximate document linearization
8. Demonstration of Minerazzi software (early demo)
9. Take-Home Work 1: Document Gap Analysis
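
Agenda items 3 through 6 can be chained into a small indexing pipeline. The sketch below is a toy version under loud assumptions: the stopword list is a handful of words chosen for illustration, and the stemmer only strips a few suffixes (it is not the Porter stemmer and does no prefix removal). All function names are hypothetical and this is not the software demonstrated in class.

```python
import html
import re

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}
# Naive suffix list for illustration only; not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def linearize(markup):
    """Linearization: drop script/style blocks, then strip remaining tags."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", markup)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return html.unescape(text)

def tokenize(text):
    """Tokenization: lowercase and keep alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def filter_stopwords(tokens):
    """Filtration: remove stopwords."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Stemming: strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(markup):
    """Run the four text operations in order and return the index terms."""
    return [stem(t) for t in filter_stopwords(tokenize(linearize(markup)))]

page = "<html><body><h1>Indexing the Web</h1><p>Search engines are ranking pages.</p></body></html>"
print(index_terms(page))
```

The output makes the cost of naive stemming visible: "pages" is cut down to "pag", which is the kind of error a rule-based stemmer with proper conditions is designed to avoid.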

Required Reading Material

IR Watch Newsletter; 2007-6: The User-Machine Relevance Perception Gap - This is a free newsletter back issue, available only for students taking the course.

http://www.useit.com/alertbox/reading_pattern.html


http://psychology.wichita.edu/surl/usabilitynews/91/eyegaze.html


http://www.miislita.com/fractals/keyword-density-optimization.html


http://irthoughts.wordpress.com/2007/05/09/keyword-density-the-devils-advocate/


http://irthoughts.wordpress.com/2007/05/07/keyword-density-kd-revisiting-an-seo-myth/


https://www.google.com/adsense/support/bin/answer.py?answer=17954


http://www.miislita.com/information-retrieval-tutorial/indexing.html


http://www.dcs.qmul.ac.uk/~mounia/CV/Papers/ker_ruthven_lalmas.pdf

IRW-2007-11: The K-Means Algorithm

14 Wednesday Nov 2007

Posted by egarcia in Newsletters

By now, current subscribers should have received the November issue of IR Watch – The Newsletter. The following topics are covered:

Introduction

The K-Means Algorithm

Applications

Clustering by Features

K-Means Example

The Sum of Squared Error (SSE)

Selecting a Stopping Condition

Clustering by Cosine Similarities

The Initial Centroid Problem

Bisecting K-Means

Limitations of K-Means

From K-Means to K-Medoids

K-Means and Scaling

From Spherical K-Means to Fractal Clusters

Conclusion

News, Research, and Events

Terms of Use and Copyright

If you are working in this area, this issue will help you a lot. I’m currently advising a grad student working on a K-Means graduate project at the MS level and he found the issue really useful. If you are new to the topic, the material discussed also serves as a handy tutorial.
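
For newcomers, the basic loop behind several of the topics listed above (assignment to the nearest centroid, centroid update, a stopping condition, and the Sum of Squared Error) can be sketched in a few lines. This is a generic, minimal K-Means written for illustration, not the newsletter’s own example: the data points are invented, initial centroids are picked at random from the data, and the loop stops when assignments no longer change.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment, sse) for a basic K-Means run."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stopping condition: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, sse

# Hypothetical 2-D data with two visibly separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment, sse = kmeans(data, k=2)
print(assignment, round(sse, 4))
```

Because the result depends on the initial centroids, a common practice is to rerun with several seeds and keep the run with the lowest SSE.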

Dark Web Project and Web Mining

13 Tuesday Nov 2007

Posted by egarcia in Homeland Security

Prof. Chen, of the University of Arizona, has a fascinating project on Web Mining applied to Homeland Security called the Dark Web Project, over at http://ai.arizona.edu/research/terror/index.htm

The project is funded by NSF, DHS, CNRI, and Library of Congress. 

From their site:

“The AI Lab Dark Web project is a long-term scientific research program that aims to study and understand the international terrorism (Jihadist) phenomena via a computational, data-centric approach. We aim to collect “ALL” web content generated by international terrorist groups, including web sites, forums, chat rooms, blogs, social networking sites, videos, virtual world, etc. “

“We have developed various multilingual data mining, text mining, and web mining techniques to perform link analysis, content analysis,  web metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, and video analysis in our research.”

“The approaches and methods developed in this project contribute to advancing the field of Intelligence and Security Informatics (ISI). Such advances will help related stakeholders to perform terrorism research and facilitate international security and peace. “

“It is our belief that we (US and allies) are facing the dire danger of losing the “The War on Terror” in cyberspace (especially when many young people are being recruited, incited, infected, and radicalized on the web) and we would like to help in our small (computational) way.”

Web Mining Week 1

12 Monday Nov 2007

Posted by egarcia in Web Mining Course

Course Description 

The CECS 6824B/21 Special Topics in KDDM graduate course Web Mining: A First Course in Web Mining, Search Engines, and Business Intelligence (Department of Computer Engineering & Computer Sciences of Polytechnic University) starts today.

Syllabus: Available at
http://www.pupr.edu/pdf/Web-Mining.pdf

Time: Monday, 6:30PM – 10:30PM
Location: Turing Laboratory, Room 301
These are also office hours; those working on projects and theses with me can attend.

Grading System 

Grading: Take-Home Work and Final Exam

Three partial take-home tests. The lowest score is dropped.

The following scale is used to score a final grade G:

 G = (F)(w) + (ave P)(1 – w)

Where F is the score of the Final Exam and ave P is the average Partial score defined as follows:

ave P = (Bonus points + sum of two highest partial tests)/2

w = weight factor to curve scores.
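
As a worked example of the scale above, the sketch below computes G for one made-up student. The scores, the bonus, and the weight w = 0.4 are all hypothetical; the actual weight factor is not stated here.

```python
def final_grade(final_exam, partials, bonus=0.0, w=0.5):
    """G = F*w + (ave P)*(1 - w), with
    ave P = (bonus + sum of the two highest partial tests) / 2."""
    if len(partials) != 3:
        raise ValueError("expected three partial take-home scores")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    top_two = sorted(partials, reverse=True)[:2]  # lowest score is dropped
    ave_p = (bonus + sum(top_two)) / 2
    return final_exam * w + ave_p * (1 - w)

# Hypothetical student: F = 80, partials 70/90/60, 4 bonus points, w = 0.4.
# ave P = (4 + 90 + 70) / 2 = 82, so G = 80*0.4 + 82*0.6.
print(round(final_grade(80, [70, 90, 60], bonus=4, w=0.4), 1))
```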

General Instructions for working in groups:

1. After completion of a project, each group will conduct a 15-minute PPT or PDF presentation. You only need to explain your results and main findings.
2. The day of the presentation the group should submit a written 2-page max report in English or Spanish, in a Word or PDF format. Use a 10-point Arial font and a single space, 1”-margin format.
3. The 2-page report should consist of a centered title, followed by co-author names, and a 50-word max abstract, a one-paragraph Introduction, Procedure (referenced), Results, Conclusion, and a Reference.
4. Raw data, figures, tables, or codes, if any, should be referenced (e.g., “See Figure 1.”) and appended in separate pages as an Appendix section. Number each of these (e.g., Figure 1, Table 1, etc.) and add a descriptive caption to each one.

If you are a student, read this blog (http://irthoughts.wordpress.com) for announcements and updates under the Web Mining Course category (http://irthoughts.wordpress.com/category/web-mining-course/).

Web Mining is a hands-on course; thus, all weekly agendas are tentative, flexible, and can be extended or shortened according to class needs. Posts at this blog will reflect these changes.

Week 1 Agenda:

1. Overview of the course
2. Falling in Love with Web Mining:
  A Brief History of the Internet and Search Engines (PPT presentation).
3. Search Engines and Search Marketing (PDF presentation)

Required reading material


http://www.cienciapr.org/news_view.php?id=711


http://irthoughts.wordpress.com/2007/06/29/a-week-before-greatness/


http://www.computerhistory.org/internet_history/


http://www.searchenginehistory.com/


http://searchenginewatch.com/showPage.html?page=3422781

Optional reading material:

It might help later in the course if you can start reading a bit about building search, match, and replace applications with regular expressions. Feel free to use your favorite programming language. Any programming flavor is fine, as long as your applications can be interpreted by a browser (e.g., IE, Firefox, etc.). Keep in mind that this is a hands-on course.

Terrier IR 1.1.1 Update

09 Friday Nov 2007

Posted by egarcia in Machine Learning, Programming

Craig MacDonald, from the Terrier project at the University of Glasgow, sent me this update a few days ago.

“Terrier, IR Platform v 1.1.1 – 24/10/2007.
http://ir.dcs.gla.ac.uk/terrier/

Terrier, the open source IR platform from the University of Glasgow has been updated to version 1.1.1.

This is a minor update, which contains mostly bug fixes. Some minor code enhancements and a test harness are also included. Moreover, the Snowball stemmers were added to boost support for languages other than English.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

Terrier is open source software using the Mozilla Public License (MPL) and is available from the Terrier website: http://ir.dcs.gla.ac.uk/terrier”

Back to Business

08 Thursday Nov 2007

Posted by egarcia in Machine Learning

Back to business. IRW will soon be out. This month’s issue covers the popular K-Means Algorithm and its variants. A back issue on Genetic Algorithms will also be sent. This issue was supposed to be out last month. Sorry it took that long. As most of you know, I was away from the internet, attending to other duties.

My inbox is full, as expected. I will reply to all of your emails, … eventually. Stay with me.

Well, next week begins the Winter semester at Polytechnic University. I’ll be teaching the graduate course:

Web Mining: A First Course in Web Mining, Search Engines, and Business Intelligence.

Description: This is a hands-on, one-full semester course on Web Mining, search engines, and business intelligence. Students will learn by doing: (a) how search engines index and rank web documents, (b) how to conduct business intelligence from online resources, and (c) how to apply Web Mining strategies and algorithms in their research or workplace.

Target: Students in Business, Engineering, and Computer Sciences and from other disciplines are encouraged to register for this special course.

Requirements: Calculus II or Permission from advisor or department.

Grading: Take-home work and a final exam.

Topics: The following topics will be covered, not necessarily in this order:

  • Document Indexing: Indexing of Web sites and text operations used by search engines including document linearization, tokenization, stop word filtration, stemming, and parsing.
  • Search Engine Optimization: Mining search engine relevance algorithms to rank Web pages high.
  • Intelligence Searching: Covers undocumented (smart) searches in Google and other search engines; includes Hacking and Penetration through customized searches.
  • Keyword Research and Clustering: Discovery of word patterns and keywords for branding and marketing through Association, Scalar, and Metric Clusters.
  • Term Matching Algorithms: Vector Space Models used by search engines. Scoring of local, global, and entropy term weights.
  • Concept Matching Algorithms: Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) models for clustering and ranking.
  • Link Analysis Models: Google’s PageRank, Hubs & Authorities, and other link-based models.
  • Spam Intelligence: Tools and techniques for spamming search engines and web sites. Includes techniques based on scripts, cloaking, keyword spam techniques, link-bombs, email marketing, viral marketing, Web 2.0, and Web 3.0.
  • Introduction to Business Dashboards (BDs): Overview of dashboard technology, including open source, and customized add-on components.
  • Special Topics: On-Topic Analysis, Co-Occurrence Theory, and Latent Graphs. This is my own area of research.

 Textbook: Web Mining evolves on a daily basis; thus, there is no official textbook. However, the following reference books are recommended for research. Additional references will be provided in class. 

  1. Modern Information Retrieval (Baeza-Yates and Ribeiro-Neto; Addison Wesley).
  2. Information Retrieval – Algorithms and Heuristics (Grossman and Frieder; Springer).

This is going to be fun! 
