IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Web Mining Course

The Web Crawler is Back!

26 Tuesday Mar 2013

Posted by egarcia in Data Mining, IR Tools, Programming, Search Engines Architecture Course, Software, Web Mining Course

Our popular tool, The Web Crawler, is back! This new iteration of the tool is a lot faster because it is based on a different strategy: extraction of HREF sets, followed by refinement of these to get URLs that qualify for status checks. So the tool also works as a link checker.

Another advantage of the above strategy is this:

A set of HREFs may contain information about absolute and relative URLs, visible and hidden links, internal and external file paths, email addresses, css files, local javascript calls, and anchors (#). A subset of HREFs can also be used as pointers to anchor text information. So, a set of HREFs can be more informative than a mere set of links or URLs as it subsumes both.
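As a rough illustration of the idea (not the tool's actual code), here is a minimal Python sketch that pulls every HREF out of a page and sorts each one into a coarse bucket. The class name `HrefCollector`, the bucket names, and the sample page are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the raw value of every href attribute in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value.strip())


def classify(href, base_url):
    """Put one href into a coarse bucket (bucket names are illustrative)."""
    if href.startswith("#"):
        return "anchor"
    lowered = href.lower()
    if lowered.startswith("mailto:"):
        return "email"
    if lowered.startswith("javascript:"):
        return "javascript"
    # Resolve relative paths against the page URL before comparing hosts.
    absolute = urljoin(base_url, href)
    if urlparse(absolute).netloc == urlparse(base_url).netloc:
        return "internal"
    return "external"


def extract_hrefs(html, base_url):
    parser = HrefCollector()
    parser.feed(html)
    return [(h, classify(h, base_url)) for h in parser.hrefs]


# Hypothetical page fragment and base URL, for demonstration only.
SAMPLE = """
<link rel="stylesheet" href="/css/site.css">
<a href="#top">Top</a>
<a href="about.html">About</a>
<a href="http://other.example/page">Elsewhere</a>
<a href="mailto:someone@example.com">Mail</a>
"""
BASE = "http://www.example.com/index.html"

results = extract_hrefs(SAMPLE, BASE)
for href, kind in results:
    print(f"{kind:10s} {href}")
```

Only the "internal" and "external" buckets would qualify for a status check; the rest are still useful as information about the page.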

Fractalettes: A Fractal Design Strategy to Color Mining and Learning through Discovery

12 Tuesday Mar 2013

Posted by egarcia in Data Mining, Fractal Geometry, Software, Web Mining Course

We have just published this short article, based on The Color Miner tool:

Fractalettes: A Fractal Design Strategy to Color Mining and Learning through Discovery - Based on Fractal Geometry, fractalettes are color palettes within color palettes, where each cell contains color space information and relationships. These types of architectures engage end-users in data mining, critical thinking, and learning through discovery.

AZZOO and WAZZOO: New Similarity Measures for the 21st Century

09 Saturday Mar 2013

Posted by egarcia in Data Mining, IR Tools, Web Mining Course

Indeed, the AZZOO measure outperforms all conventional measures in the application of IRIS biometrics and handwritten character recognition.

At least that’s what is claimed.

http://www.jprr.org/index.php/jprr/article/viewFile/20/8

When Big Data leads to Big Errors

04 Monday Jun 2012

Posted by egarcia in Data Mining, Quack Science, SEO Myths, Statistics and Mathematics, Web Mining Course

The hilarious picture above shows how some SEOs look when playing at being scientists. This often occurs when interpreting big data.

A few specific scenarios:

1. Applying the statistical theory of small samples to extremely large samples, like …
2. …using large amounts of data to force very small correlation coefficients to become statistically significant.
3. Trying to arithmetically average ratios (like correlation coefficients, standard deviations, slopes, and cosine similarities).
4. Mistaking Cauchy Distributions for Normal Distributions.
5. Adding together intensive properties.
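To make scenario 2 concrete, here is a small Python sketch using the textbook t statistic for testing whether a correlation differs from zero. The value r = 0.01 and the sample sizes are made up, and 1.96 is used as an approximate two-sided 5% critical value for large samples.

```python
import math


def t_statistic(r, n):
    """t statistic for H0: rho = 0, given sample correlation r from n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r = 0.01  # a practically meaningless correlation (illustrative value)
for n in (100, 10_000, 1_000_000):
    t = t_statistic(r, n)
    verdict = "significant" if abs(t) > 1.96 else "not significant"
    # r squared is the share of variance explained; here it is 0.01%.
    print(f"n={n:>9,}  t={t:7.3f}  r^2={r * r:.4%}  {verdict}")
```

The correlation explains the same negligible share of the variance in all three cases. Only the sample size changed, yet at one million pairs it passes the significance test.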

Fortunately, I know of good folks who are doing a great job at educating their search marketing peers (Mike Grehan, Bruce Clay, Danny Sullivan, etc.) without playing at being scientists.

When and Why not to take arithmetic averages

26 Thursday Jan 2012

Posted by egarcia in Data Mining, Marketing Research, Statistics and Mathematics, Web Mining Course

Correlation coefficients, coefficients of variation, standard deviations, slopes, tangents, cosines, densities, temperatures, dissimilar ratios, and intensive properties in general are not additive. Therefore, arithmetic averages cannot be computed out of any of these.
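A quick way to see the problem is with densities, using made-up numbers. This Python sketch is only an illustration of non-additivity: it compares the arithmetic mean of two densities with the density of the two samples taken together.

```python
# Two hypothetical samples as (mass in g, volume in mL).
samples = [(10.0, 2.0), (10.0, 10.0)]

# Density of each sample on its own: 5.0 and 1.0 g/mL.
densities = [mass / volume for mass, volume in samples]

# Naive approach: arithmetic mean of the two ratios.
naive_mean = sum(densities) / len(densities)

# Density of the combined material: total mass over total volume.
total_mass = sum(mass for mass, _ in samples)
total_volume = sum(volume for _, volume in samples)
pooled = total_mass / total_volume

print(f"mean of densities: {naive_mean:.3f} g/mL")
print(f"pooled density:    {pooled:.3f} g/mL")
```

The two answers disagree (3.000 versus about 1.667 g/mL) because the ratios have different denominators, so adding them and dividing by two has no physical meaning.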

Still, from time to time some “experts” and pseudo “scientists” do that.

Want to know why this is not mathematically and statistically possible? This is the subject of a paper I wrote and that is about to be published in Communications in Statistics – Theory and Methods (by Taylor & Francis).

Incidentally, I will provide a preview of the topic to the search marketing community. Thanks to my dear friend, Mike Grehan, this will be the topic I’ll be speaking about at the March, 2012 SES, NY.

A New Weighting Strategy

27 Tuesday Dec 2011

Posted by egarcia in Data Mining, Human-Computer Interaction, Machine Learning, Marketing Research, Programming, Quack Science, Statistics and Mathematics, Web Mining Course

This morning I received confirmation from the editors of Communications in Statistics: Theory and Methods that they accepted and will be publishing my peer-reviewed paper on a new model for statistical analysis. It should be out in 2012.

Once published, you will understand the SEO (* SEOmoz, I should say) nonsense of computing arithmetic averages of correlation coefficients and why some meta-analysis studies published in the past (* Hunter-Schmidt; Hedges-Olkin) are flawed and invalid.

It took me several meals and research hours to figure it out. I hope that IRs, dataminers, and statistics colleagues find new applications for the model.

The model can be applied to many fields, including marketing, business, risk analysis, data mining, signal processing, engineering, clinical trials, and almost any field or knowledge domain that involves the calculation of weighted statistics. I look forward to discussing it online once it gets published.

Happy New Year.

PS. (*) I’ve edited this post to make these points obvious. So, the issue of arithmetically averaging correlations has been raised and killed for good before the scientific and statistical community.

PS. Just in: Last night (Jan-03-2012) I received news from one of the editors of the journal that the paper was assigned to issue 41 (8). Check for its title: The Self-Weighting Model (in Spanish it is something like “El Modelo de Autoponderacion”). I forgot to mention that this journal is published biweekly; so, things are moving fast. What a way of ending 2011 and starting 2012!!!

IRW 2011-4-2: n-Grams and Association Measures

18 Friday Feb 2011

Posted by egarcia in Data Mining, Newsletters, Web Mining Course

The current issue of IRW should reach subscribers' inboxes during the day.

This is Part Two of the series on statistical analysis of n-grams. This is a text mining analysis technique widely used in information retrieval and data mining in general. In this issue we cover the implementation of association measures derived from contingency tables.

The QA section explains how to conduct a Chi Square Test for tables with many items; i.e., beyond the usual 2 x 2 contingency tables.
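For readers who want to try this at home, here is a minimal Python sketch of the standard Pearson chi-square computation for a table of any size. It is the textbook procedure, not a copy of the newsletter's worked example, and the 2 x 3 table of counts is made up.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical co-occurrence counts: 2 rows by 3 columns.
counts = [
    [10, 20, 30],
    [20, 20, 20],
]
stat, dof = chi_square(counts)
print(f"chi-square = {stat:.4f} with {dof} degrees of freedom")
```

The statistic is then compared against the critical value for the given degrees of freedom. For 2 degrees of freedom at the 5% level that value is 5.991, so this made-up table would not show a significant association.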

Enjoy it.

A Tutorial on Standard Errors

25 Friday Jun 2010

Posted by egarcia in Data Mining, IR Tutorials, SEO Myths, Web Mining Course

Sooner or later, those conducting data mining studies will need to compute standard errors for several statistics.

Every statistic from a sample distribution has a standard error that is specific to that statistic. Using the incorrect definition for a standard error invalidates any research study.
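To illustrate that each statistic carries its own standard error, here is a short Python sketch with three common textbook formulas. These are the usual definitions found in statistics texts, not necessarily the ones covered in the tutorial, and the sample data are made up.

```python
import math
import statistics


def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)


def se_correlation(r, n):
    """Standard error of a correlation coefficient: sqrt((1 - r^2) / (n - 2))."""
    return math.sqrt((1.0 - r * r) / (n - 2))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample
print(f"SE of the mean:        {se_mean(data):.4f}")
print(f"SE of a proportion:    {se_proportion(0.5, 100):.4f}")
print(f"SE of a correlation:   {se_correlation(0.6, 27):.4f}")
```

Plugging a correlation coefficient into the formula for the mean, or the other way around, gives a number that looks plausible but is simply wrong.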

A tutorial on standard errors is now available from miislita.com.

New Terrier Update

29 Monday Dec 2008

Posted by egarcia in Data Mining, Web Mining Course

Early this year, students of my graduate course (Search Engines Architecture) used Terrier, an experimental search engine, in their lab lessons. I am still using Terrier for indexing and testing.

A few days ago Craig Macdonald from the University of Glasgow sent me this new Terrier update. It sounds great, although I haven’t tested it yet.

Terrier, IR Platform v 2.2 – 23/12/2008
http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.2, the next version of the open source IR platform from the University of Glasgow (Scotland) has been released.

This is a substantial update, which includes new support for Hadoop, primarily a Hadoop Map Reduce indexing system, allowing large collections of documents to be indexed in a highly distributed fashion. Also included are various minor improvements, including improved support for the IIT CDIP1 (TREC Legal track) collection, and various bug fixes. This is intended to be the ultimate release in the 2.x series.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html

This will be my new toy to play with in 2009.
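For those unfamiliar with the Map Reduce style of indexing mentioned in the release notes, the general idea can be pictured with a toy, single-machine Python sketch. This is not Terrier's or Hadoop's code or API. The function names and the two sample documents are invented, and a plain dictionary stands in for the distributed shuffle step.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per token in a document."""
    for term in text.lower().split():
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Merge all doc ids seen for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle/sort step
    for doc_id, text in docs.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Two hypothetical documents keyed by document id.
docs = {
    1: "terrier indexes documents",
    2: "hadoop distributes documents",
}
index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

In a real cluster the map calls run in parallel over chunks of the collection and the reduce calls run in parallel over groups of terms, which is what lets large collections be indexed in a distributed fashion.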

Keyword Density Tools and SEOs

26 Tuesday Feb 2008

Posted by egarcia in SEO Myths, Spam, Vector Space Models, Web Mining Course

SEOs are still debating whether keyword density is good for something. The most recent debate is at http://www.hobo-web.co.uk/seo-blog/index.php/keyword-density-seo-myth/

Overall, the agreement is that it is not useful.

Two issues strike me, as they suggest a lack of understanding of how search engines work. They come down to the following questions:

1. Could KD be used by search engines or users to check for keyword spam?
2. Is Vector Space currently in use by modern search engines?

Let me clarify these points.

Could KD be used by search engines or web page creators to check for keyword spam?

Word repetition that search engines determine to be keyword spam should be of more concern than what web page creators or a KD tool tag as keyword spam. After all, search engines, not the designers of web pages, are the ones that assign a rank to documents. This goes with the user-machine relevance perception mismatch and the concept of document linearization as a gap analysis. We have thoroughly discussed both in our IRWatch Newsletter, at this blog, and at Mi Islita.

However, this does not mean end users count for nothing, as they are the ones that pay the bills. And even if they don’t, why rank a page high just to see users go somewhere else after visiting it because it is not suitable for human consumption? So, rather than using a KD tool, just write as naturally and usefully for your prospective clients and readers as you can.

Regarding the use of KD tools for checking for spam, this allegation reminds me of certain SEO books, marketers, and community forums that insist on such nonsense, just to keep their KD tools relevant and alive.

During the Web Mining Course we debunked these and similar SEO myths on an almost routine basis. For instance, grad students learned about several local weight models that attenuate frequencies, hence serving the purpose of both scoring local weights and dampening down the effect of keyword repetition. Two for the price of one!

This is more cost-effective at neutralizing keyword repetition than computing and comparing against a whole new and extra ratio, KD. Best of all, it does not require the two extra loops one would have to use to compute KD (one for every term i in a doc and another for every doc j across a collection). Thus, whatever the % ratio computed by a KD tool, it will be compacted/attenuated within the corresponding scales of the local weight model used. So, from the search engine side, KD is not even a cost-effective tool for fighting spam.
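Here is a small Python sketch of that attenuation effect. It rests on two assumptions: KD is taken as the usual ratio of term occurrences to total words (as a percentage), and LOGA is taken as the commonly cited local weight 1 + log10(f) for f > 0. The 500-word document and the repetition counts are made up.

```python
import math


def keyword_density(term_count, total_words):
    """KD as a percentage: occurrences of the term over total words."""
    return 100.0 * term_count / total_words


def freq_weight(f):
    """FREQ local weight: the raw frequency, no attenuation."""
    return float(f)


def loga_weight(f):
    """LOGA local weight, assumed here to be 1 + log10(f) for f > 0."""
    return 1.0 + math.log10(f) if f > 0 else 0.0


total_words = 500  # hypothetical document length
print(" f      KD%    FREQ    LOGA")
for f in (1, 5, 25, 125):
    kd = keyword_density(f, total_words)
    print(f"{f:3d}  {kd:6.2f}  {freq_weight(f):6.1f}  {loga_weight(f):6.3f}")
```

Going from 1 to 125 repetitions multiplies KD and the raw frequency by 125, while the log-attenuated weight only rises from 1 to about 3.1. The stuffing is dampened as a side effect of scoring, with no separate KD check needed.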

To be sure students understood, I included the following three questions in the multiple-choice section of the Final Exam. (The problem-solving section of the test is even more interesting, but it is too long to include here.)

#10. Which of these is a false statement?

a. Distance is anti-similarity.
b. Keyword density estimates keyword relevance.
c. In Vector Space Theory, a document is a vector of terms.
d. In Vector Space Theory, a query is a vector of terms.

#15. Which model does not attenuate frequencies?

a. SQRT
b. FREQ
c. LOGA
d. LOGN

#16. Consider two documents d1 and d2 wherein local term weights are computed using the LOGA model. d1 repeats a term once. How many times should this term be repeated in d2 to triple its d1 weight? Assume log base 10.

a. 3 times.
b. 30 times
c. 100 times
d. 1000 times

Answers: 10. b, 15. b, and 16. c. (sorry I’ve made a typo).
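The answer to #16 can be worked out in three lines, assuming LOGA is the local weight 1 + log10(f) for a term occurring f times (the definition consistent with answer c):

```latex
\begin{align*}
w(f) &= 1 + \log_{10} f, \qquad f > 0 \\
w_{d_1} &= 1 + \log_{10} 1 = 1 \\
1 + \log_{10} f_{d_2} &= 3\,w_{d_1} = 3
  \;\Longrightarrow\; \log_{10} f_{d_2} = 2
  \;\Longrightarrow\; f_{d_2} = 10^{2} = 100
\end{align*}
```

So it takes 100 repetitions to triple the weight of a single occurrence, which shows how strongly a log model dampens repetition.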

Is Vector Space currently in use by modern search engines?

Suggesting the contrary is nonsense. Vector Space models are used on a regular basis to score and rank documents. Implementation is not that hard across large collections if you use the right scoring system with updating and precaching techniques on a term-doc matrix. In fact, I’ll be teaching the graduate course Search Engines Architecture this Spring.

I will blog the syllabus tomorrow, but it is already available from the Electrical & Computer Engineering and Computer Science Department of PUPR.edu. This is a lecture and lab session course. Students will build their own search engines, crawlers, parsers, stemmers, and vector space scoring systems using open source components and some of their own authorship.
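To give a taste of what a vector space scoring system does, here is a bare-bones Python sketch that ranks three documents against a query by cosine similarity. The three-term vocabulary, the raw term counts, and the query are made up. A real system would use proper local and global weights and an inverted index rather than dense vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical vocabulary; each vector holds one count per term, in this order.
terms = ["search", "engine", "spam"]
doc_vectors = {
    "d1": [2, 1, 0],
    "d2": [0, 1, 3],
    "d3": [1, 1, 1],
}
query = [1, 1, 0]  # the query "search engine" as a vector of terms

scores = {doc: cosine(query, vec) for doc, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for doc in ranking:
    print(f"{doc}  {scores[doc]:.4f}")
```

Both the documents and the query are vectors of terms, and ranking is just a matter of sorting by similarity. Here d1 comes first and d2, dominated by the off-query term, comes last.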

On and on, SEOs still have no clue about what a search engine can or cannot do.
