IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Category Archives: Machine Learning

Teleporting, Marketing, and Teleportnication

18 Sunday Nov 2012

Posted by egarcia in Machine Learning, Marketing Research

Here in Puerto Rico we usually run about 5 to 7 years behind the rest of the Nation in implementing technology advances and good ideas. Just this month, because of the elections, a local TV channel started experimenting with teleportation. That was tested in the USA a few years ago. It is a catchy marketing technology.

Back then my SEO and marketing friends immediately started to see the next “future” for marketing and, of course, porn: teleportation for meetings, events, and SEO Conferences.

Keep these terms in mind: Teleportnication, Teleportnography => Teleportnicación, Teleportnografía

PS: Mobile apps for that, too? There are probably some out there now, or there will be in the near future (with the right equipment in place). Portnography, anyone?

Electronic Drugs and Hackers

24 Tuesday Apr 2012

Posted by egarcia in Hacking, Human-Computer Interaction, Machine Learning

I Doser has been called an addictive electronic drug and is commonly hyped in social networks. Actually, it is nothing new, just a well-repackaged business.

You can get all kinds of e-drugs, from e-marijuana to e-…anything, just by using earphones. A dangerous mixture if you are driving a car!

Such e-drugs are based on binaural beats, discovered in 1839 by Dove. These are slow modulations that are perceived when tones of different frequency are presented to each ear. Such auditory beats in the brain can have unexpected results, altering consciousness: A virtual LSD?

In 1973 Oster discovered that humans can detect binaural beats when the carrier tones are below approximately 1000 Hz. According to Lane et al. (see references below):

WHEN two pure auditory signals of similar frequency are mixed together, the phase interference between their waveforms produces a composite signal with a frequency midway between the upper and lower frequencies and an amplitude modulation that occurs with a frequency equal to the difference between the two original frequencies. For example, mixing tones of 100 Hz and 110 Hz yields a signal with a perceived frequency of 105 Hz that rises and falls in amplitude with a frequency of 10 Hz. The amplitude-modulated composite signal is called an auditory beat.

A similar phenomenon occurs when auditory signals of similar frequency are presented separately to the left and right ear through stereo headphones. Although each ear hears only one of the frequencies, the listener perceives the middle frequency and the amplitude modulation, even though the auditory beat does not exist in physical space. This phenomenon, called a ‘‘binaural auditory beat,’’ and described more than 25 years ago (6), is created by the brain’s processing of the two separate auditory signals at the level of the olivary nuclei of the brainstem.
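The arithmetic in the quoted example (100 Hz and 110 Hz giving a 105 Hz tone that pulses at 10 Hz) follows from the sum-to-product identity for sines. Here is a minimal sketch of that arithmetic for two physically mixed tones of equal amplitude; the function names are mine, chosen only for illustration, and the sketch says nothing about how the brain produces the binaural version.

```python
import math

def beat_parameters(f1: float, f2: float) -> tuple[float, float]:
    """Return (perceived carrier frequency, beat frequency) in Hz."""
    return (f1 + f2) / 2.0, abs(f1 - f2)

def mixed_signal(f1: float, f2: float, t: float) -> float:
    """Sum of two unit-amplitude sine tones at time t (in seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def carrier_times_envelope(f1: float, f2: float, t: float) -> float:
    """The same signal rewritten as a carrier at the mean frequency
    multiplied by a slowly varying envelope."""
    carrier, _ = beat_parameters(f1, f2)
    envelope = 2 * math.cos(math.pi * (f1 - f2) * t)
    return envelope * math.sin(2 * math.pi * carrier * t)

carrier, beat = beat_parameters(100.0, 110.0)
print(carrier, beat)  # 105.0 10.0
```

The envelope is a cosine at half the difference frequency, but loudness follows its magnitude, which peaks twice per cycle. That is why the beat is heard at the full difference of 10 Hz.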

It was only a matter of time before someone looking to make quick cash put two and two together, hunger meeting necessity (“se juntó el hambre con la necesidad”). So now we see low-level forms of life looking to escape their reality through I Doser.

Hackers may soon be able to misuse these e-brain technologies to cause physical harm. A WMD in the making or an accident waiting to happen?

References

Binaural Auditory Beats Affect Vigilance Performance and Mood
Auditory Beats in the Brain
Inducing Altered States Auditory and Visual Stimulation
Entraining Tones and Binaural Beats
Research_Frequencies
Audio-Visual Entrainment

Hey, SEOs: On Information Gain, Keyword Wallop, and Relevance

13 Monday Feb 2012

Posted by egarcia in Human-Computer Interaction, Machine Learning, Marketing Research

Which words pack more wallop, are more emphatic, are more beefy or juicy? Whatever you want to call it, if you are an SEO or copywriter, you probably know what I mean.

Well, the answer to such a question depends on what you are trying to accomplish.
According to the family of BM25 algorithms,

http://irthoughts.wordpress.com/2011/08/04/bm25-and-bm25f-implications-to-seo-and-web-design/

a term has more information gain during its first occurrences, especially if these occur earlier in a document. This presumes some kind of relationship between information gain and the position and distribution of words in a document.
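To see the "first occurrences count most" effect in numbers, here is a minimal sketch of the term-frequency part of the standard Okapi BM25 formula. It assumes length normalization is switched off (b = 0) and uses k1 = 1.2, a commonly quoted default; both are my assumptions for illustration. The sketch covers only the diminishing gain per occurrence and does not model where in the document a term appears.

```python
def bm25_tf_component(tf: int, k1: float = 1.2) -> float:
    """Term-frequency part of Okapi BM25 with length normalization off (b = 0)."""
    return tf * (k1 + 1.0) / (tf + k1)

def marginal_gains(max_tf: int, k1: float = 1.2) -> list[float]:
    """Score added by the 1st, 2nd, ..., max_tf-th occurrence of a term."""
    return [bm25_tf_component(n, k1) - bm25_tf_component(n - 1, k1)
            for n in range(1, max_tf + 1)]

for n, gain in enumerate(marginal_gains(5), start=1):
    print(f"occurrence {n}: +{gain:.3f}")
```

Each additional occurrence adds less than the one before it, and the component never exceeds k1 + 1 no matter how often the term is repeated.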

Journalists and editors understand the concept. That’s why they like to answer the who, what, when, why, and how early in a copy, although not necessarily in that order.

And that’s why you see so many press release titles written in a ‘who-what’ way!

That strategy might work with search engines, but if you want to emphasize more specific keywords in a natural way you probably need a different keyword positioning strategy, at least if you write in English.

Says who? William Strunk, Jr., in his book The Elements of Style, as discussed by Jose Carillo, whom I quote:

http://josecarilloforum.com/forum/index.php?topic=496.0;prev_next=next

In his original 1918 edition of The Elements of Style (that was long before E. B. White came up with a chapter on style that made him a co-author of the book), William Strunk, Jr. came up with this perplexing prescription in his discussion of the principles of exposition:

“The proper place for the word, or group of words, which the writer desires to make most prominent is usually the end of the sentence…The word or group of words entitled to this position of prominence is usually the logical predicate, that is, the new element in the sentence…”

Strunk gave the following example to illustrate his point:

The modifying phrase at the tail-end of the sentence: “This steel is principally used for making razors, because of its hardness.”

The logical predicate at the tail-end of the sentence: “Because of its hardness, this steel is principally used in making razors.”

And here is the eye-opening point:

For his final words on the subject, however, Strunk made the following provocative—and as I already said, perplexing—prescription:

“The principle that the proper place for what is to be made most prominent is the end applies equally to the words of a sentence, to the sentences of a paragraph, and to the paragraphs of a composition.”

Carillo’s essay is an excellent one. He later wrote a follow-up post, from which I quote:
http://josecarilloforum.com/forum/index.php?topic=627.0

In spoken English, we can emphasize the ideas we want to emphasize by giving them a stronger stress, leveling off our voice when enunciating minor or neutral ones, and downplaying the points that simply don’t support our contention. In writing, however, the process is rarely that simple. We can achieve emphasis only with our choice of words and how we array them into word clusters, into clauses and phrases, and ultimately into sentences and paragraphs. Mechanical devices exist that help, of course, like underlining, boldface type, italics, headlines and subheadlines, and—in today’s savvy word-processing routines—even colors, clip-arts, and emoticons. But as the aspiring writer soon discovers, much of the emphasis we seek has to be built into the very contours of the individual words as they unfold on the page.

There are three basic word-positioning principles we must know for maximum emphasis in writing English sentences: first, the initial and terminal positions of sentences are by nature more emphatic than their middles; second, when we construct a complex sentence, the main clause gets more emphasis than subordinate clauses; and third, when everything is written and done, the last words of the sentence are normally the most emphatic of all. These are structurally inherent in the English language itself, as we will see more clearly when we study them in closer detail.

Carillo thus highlights three important concepts:
1. The initial and terminal positions of sentences are prime.
2. The main clause gets more emphasis than subordinate clauses.
3. The last words of the sentence are normally the most emphatic.

The takeaway

Clearly, all this shows that although interrelated, information gain, keyword wallop, and relevancy are not the same thing. Relevancy is more along the lines of “aboutness”, “eliteness”, and a few other semantic concepts.

The problem is that there is a relevance perception divide between machines and end-users, a topic that we have discussed before. See this link:

http://irthoughts.wordpress.com/2007/06/01/sneak-preview-of-ir-watch-2007-6-issue/

Still thinking about the keyword density/spamming crap?

My IPAM Lost Pictures

20 Friday Jan 2012

Posted by egarcia in Data Mining, Latent Semantic Indexing, Machine Learning, Statistics and Mathematics

On January 23-27, 2006 I was at the Institute for Pure and Applied Mathematics, UCLA, California, attending the now-famous Document Space Workshop. I took some pictures, but did not find them until now.

I’ve posted them on my Facebook page. They show me posing with the then IPAM director and with world-recognized LSI expert Dr. Michael Berry and his former students. To learn more about the workshop and the speakers, follow this link: http://www.miislita.com/ipam/ipam-document-space-workshop.pdf

A New Weighting Strategy

27 Tuesday Dec 2011

Posted by egarcia in Data Mining, Human-Computer Interaction, Machine Learning, Marketing Research, Programming, Quack Science, Statistics and Mathematics, Web Mining Course

This morning I received confirmation from the editors of Communications in Statistics: Theory and Methods that they have accepted and will publish my peer-reviewed paper on a new model for statistical analysis. It should be out in 2012.

Once published, you will understand the SEO (* SEOmoz, I should say) nonsense of computing arithmetic averages of correlation coefficients and why some meta-analysis studies published in the past (* Hunter-Schmidt; Hedges-Olkin) are flawed and invalid.

It took me several meals and research hours to figure it out. I hope that IR, data mining, and statistics colleagues find new applications for the model.

The model can be applied to many fields, including marketing, business, risk analysis, data mining, signal processing, engineering, clinical trials, and almost any field or knowledge domain that involves the calculation of weighted statistics. I look forward to discussing it online once it is published.

Happy New Year.

PS. (*) I’ve edited this post to make these points obvious. So, the issue of arithmetically averaging correlations has been raised and killed for good before the scientific and statistical community.

PS. Just in: Last night (Jan-03-2012) I received news from one of the editors of the journal that the paper was assigned to issue 41 (8). Check for its title: The Self-Weighting Model (in Spanish it is something like “El Modelo de Autoponderación”). I forgot to mention that this journal is published biweekly, so things are moving fast. What a way of ending 2011 and starting 2012!!!

Minerazzi Crawler and Whois Updates: Email Addresses, Reverse DNS, IPv4 Mapping, Navigation

11 Monday Jul 2011

Posted by egarcia in Data Mining, Homeland Security, IR Quizzes, Machine Learning, Programming, Software

We keep improving the Minerazzi site (http://www.minerazzi.com). We moved all pages to PHP. In addition, here is the recent changelog for the Web Crawler (http://www.minerazzi.com/labs/crawlinker.php):

07-05-11: Email address extraction, deduplication, and sorting capabilities added.
07-04-11: Design and copy changes.
07-03-11: Navigation menu restored and bug fixed.
07-03-11: Navigation menu removed to test bug.
07-02-11: Top-bottom quick navigation menu added.
07-02-11: Day/Time Stamp, Reverse DNS, and IPv4 List capabilities added.
07-02-11: Integration to Whois Tool.
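For readers curious what "email address extraction, deduplication, and sorting" can look like, here is a minimal sketch. It is not the crawler's actual code: the pattern, the function name, and the choice to lowercase addresses before deduplicating are all my assumptions for illustration.

```python
import re

# Deliberately simple pattern; real address syntax (RFC 5322) is far broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(page_text: str) -> list[str]:
    """Return the unique addresses found in page_text, lowercased and sorted."""
    # A set removes duplicates; sorted() returns them in alphabetical order.
    return sorted({match.lower() for match in EMAIL_RE.findall(page_text)})

sample = "Contact Bob@Example.com or bob@example.com; sales@example.org."
print(extract_emails(sample))  # ['bob@example.com', 'sales@example.org']
```

Lowercasing treats Bob@Example.com and bob@example.com as the same address, which is usually what you want for deduplication even though the local part is technically case-sensitive.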

The Whois Database Retriever (http://www.minerazzi.com/labs/whois.php) now features suffix/prefix stripping capabilities. This means that users only need to enter a candidate domain name without any alias or extension and the tool scans multiple registrar databases. We expect to add some additional features to this time-saving application.
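As a rough idea of what suffix/prefix stripping involves, here is a hypothetical sketch. The helper names and the short list of extensions are my assumptions; the tool itself may work quite differently, and this simple version does not handle multi-part extensions such as .co.uk.

```python
def bare_label(user_input: str) -> str:
    """Reduce whatever the user typed to a bare candidate name."""
    name = user_input.strip().lower()
    if "://" in name:                 # drop a scheme such as http://
        name = name.split("://", 1)[1]
    name = name.split("/", 1)[0]      # drop any path
    if name.startswith("www."):       # drop the www alias (prefix)
        name = name[4:]
    return name.split(".", 1)[0]      # drop the extension (suffix)

def candidate_domains(user_input: str, extensions=("com", "net", "org")) -> list[str]:
    """Names to look up, one per extension (the extension list is illustrative)."""
    label = bare_label(user_input)
    return [f"{label}.{ext}" for ext in extensions]

print(candidate_domains("http://www.minerazzi.com/labs/whois.php"))
```

Each candidate name would then be sent to the whois source that serves its extension.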

In the meantime, we keep beta testing the engine. Our staff of ‘miners’ is doing a great job.

An online Crawler for the masses

11 Monday Apr 2011

Posted by egarcia in Data Mining, Human-Computer Interaction, Machine Learning, Programming, Software

Since at this time we haven’t launched an official blog, this post goes…

We are excited to announce several updates to the minerazzi crawler. This is the online version of the indexing crawler used by the minerazzi search engine (beta).

The long-term goal is to turn this version into a multifunctional mining platform and a crawler for the IT masses; i.e., a crawler to be used by IR researchers, data miners, webmasters, developers, etc. That is, a crawler that even Web designers and the average public can use.

You’re welcome to give it a try. Keep in mind the tool is still in beta. While you are there, feel free to test the multiple whois domain name tool as well.

IRW-10-2010: Inverted Index Architectures Part Three

25 Monday Oct 2010

Posted by egarcia in Machine Learning, Newsletters, Programming

Subscribe to IRW

The current issue of IRW should arrive in subscribers’ inboxes today. It is full of meaty stuff. The featured article is Part Three of the series on inverted index architectures. We cover positional inverted indexes. A simple example shows how these indexes process advanced searches (AND, NEAR, and EXACT) in order to retrieve documents.
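For those who have not seen a positional inverted index before, here is a minimal sketch of one together with the three search types. It is not the newsletter's example: the toy documents, the function names, and the whitespace tokenizer are my assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map term -> {doc_id: [positions]} using lowercase whitespace tokens."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, a: str, b: str) -> set[int]:
    """Documents containing both terms, anywhere."""
    return set(index.get(a, {})) & set(index.get(b, {}))

def search_near(index, a: str, b: str, k: int) -> set[int]:
    """Documents where the terms occur within k positions, in either order."""
    return {d for d in search_and(index, a, b)
            if any(abs(pa - pb) <= k for pa in index[a][d] for pb in index[b][d])}

def search_exact(index, phrase: str) -> set[int]:
    """Documents containing the phrase terms at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    hits = set()
    for d, starts in index.get(terms[0], {}).items():
        if any(all(s + i in index.get(t, {}).get(d, []) for i, t in enumerate(terms))
               for s in starts):
            hits.add(d)
    return hits

docs = {1: "new home sales top forecasts", 2: "home sales rise in july",
        3: "increase in home sales in july", 4: "july new home sales rise"}
idx = build_positional_index(docs)
print(search_and(idx, "home", "july"))      # {2, 3, 4}
print(search_near(idx, "new", "sales", 2))  # {1, 4}
print(search_exact(idx, "home sales rise")) # {2, 4}
```

Storing positions is what makes NEAR and EXACT possible; an index that keeps only document ids can answer AND but nothing about word order or distance.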

The QA column covers hypothesis testing with correlation coefficients at a given sample size and confidence level.
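One common way to run such a test with nothing but the Python standard library is the Fisher z approximation, sketched below. This is not necessarily the procedure used in the QA column; the function name is mine, and the exact small-sample test uses Student's t distribution instead of the normal approximation shown here.

```python
import math
from statistics import NormalDist

def fisher_z_test(r: float, n: int, confidence: float = 0.95) -> tuple[float, float, bool]:
    """Two-sided test of H0: rho = 0 using the Fisher z approximation.

    Returns (z statistic, p-value, whether H0 is rejected)."""
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < (1.0 - confidence)

for n in (20, 100):
    z, p, reject = fisher_z_test(0.30, n)
    print(f"r=0.30, n={n}: z={z:.2f}, p={p:.4f}, reject H0: {reject}")
```

The same r = 0.30 is not significant at n = 20 but is at n = 100, which is why a correlation coefficient should never be read without its sample size.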

Enjoy it!

IRW-3-7-2010: Artificial Languages

28 Wednesday Jul 2010

Posted by egarcia in Human-Computer Interaction, Machine Learning, Newsletters

The current issue of IR Watch – The Newsletter should reach subscribers today or tomorrow at the latest. IRW reaches research centers from the academic and industry “world”. Centers can then forward the newsletter to their members, many of whom elect to have their own subscription. That’s a huge reach. And the best part is that IRW is free.

This issue of IRW covers some artificial language algorithms originally investigated by Claude Shannon in his famous work A Mathematical Theory of Communication.

In that work, frequencies associated with a pool of strings were used. In his tests, Shannon used the 26 letters of the English alphabet plus the space. He also used entire words.

Despite their simplicity, students often have problems understanding these algorithms. In this issue we show how teachers and students can reproduce Shannon’s algorithms. To adhere to his experiments, we reproduce his comments and findings.
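To give a flavor of these algorithms, here is a minimal sketch of Shannon's two simplest letter-level approximations over the 27-symbol alphabet. It is not the newsletter's code: the sample sentence and the function names are my assumptions, and a real experiment would take its frequencies from a much larger text.

```python
import random
import string

ALPHABET = string.ascii_lowercase + " "  # 26 letters plus the space

def zero_order(length: int, rng: random.Random) -> str:
    """Zero-order approximation: symbols independent and equiprobable."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def symbol_frequencies(sample: str) -> dict[str, int]:
    """Count how often each of the 27 symbols occurs in a sample text."""
    counts = {ch: 0 for ch in ALPHABET}
    for ch in sample.lower():
        if ch in counts:
            counts[ch] += 1
    return counts

def first_order(length: int, freqs: dict[str, int], rng: random.Random) -> str:
    """First-order approximation: symbols independent, drawn with the
    frequencies observed in the sample."""
    symbols = list(freqs)
    weights = [freqs[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=length))

rng = random.Random(42)  # fixed seed so that runs are repeatable
sample = "the quick brown fox jumps over the lazy dog and the cat sleeps"
print(zero_order(40, rng))
print(first_order(40, symbol_frequencies(sample), rng))
```

Higher-order approximations follow the same pattern, except that each symbol is drawn conditionally on the one or two symbols before it. The word-level versions replace letters with entire words.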

Enjoy it.

The Readability War

05 Monday Apr 2010

Posted by egarcia in Machine Learning, Programming

Yesterday we posted on Kincaid’s ARI. Although not mentioned, the main reason for blogging about it was that we are testing some readability index (RI) tools to be integrated into Mi Islita.com.

There has been a kind of endless readability “war” since the invention of the first tests. Thanks to the Internet this war is in full swing among academics. When it comes to computing such scores/indexes, should we:

include or exclude random samples from a given text? (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.5934&rep=rep1&type=pdf)
include or exclude syllables? (http://www.jstor.org/pss/27531033)
include or exclude punctuation?
include or exclude typos?
include or exclude spaces?
include or exclude non-dictionary terms?

Since punctuation can affect readability, semantics, coherence, etc. (http://www.aelfe.org/documents/text4-Sancho.pdf), why remove punctuation from the analysis? Unfortunately, current readability formulas only measure structural difficulty (http://www.iacis.org/iis/2008_iis/pdf/S2008_1071.pdf).

To do or not to do tokenization… that is the question.

Funny that some algorithms define “words” as a continuous sequence of nonspaces, regardless of whether these are punctuation characters or valid words. But then, word lists free from punctuation and spaces are used for assessing the performance of such algorithms. That smells… and not of lavender.
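To make the tokenization issue concrete, here is a minimal sketch that scores the same text with the Automated Readability Index formula under two definitions of a “word”. The formula is the usual published one; the two tokenizers, the naive sentence counter, and the sample text are my assumptions for illustration.

```python
import re

def ari(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index computed from raw counts."""
    return 4.71 * characters / words + 0.5 * words / sentences - 21.43

def count_sentences(text: str) -> int:
    """Count runs of '.', '!' or '?' as sentence ends (at least one sentence)."""
    return max(1, len(re.findall(r"[.!?]+", text)))

def counts_nonspace(text: str) -> tuple[int, int, int]:
    """Words are runs of nonspaces, so attached punctuation counts as characters."""
    tokens = text.split()
    return sum(len(t) for t in tokens), len(tokens), count_sentences(text)

def counts_alnum(text: str) -> tuple[int, int, int]:
    """Words are runs of letters or digits; punctuation is dropped."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return sum(len(t) for t in tokens), len(tokens), count_sentences(text)

text = ("Well, the answer depends on what you are trying to accomplish. "
        "It all depends on who is counting, how, and why.")
print(f"nonspace tokens: ARI = {ari(*counts_nonspace(text)):.2f}")
print(f"alnum tokens:    ARI = {ari(*counts_alnum(text)):.2f}")
```

The two scores differ for the same text purely because of how punctuation is counted, so results from tools that tokenize differently are not directly comparable.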

There is also the problem of scoring non-narrative text as found in most Web documents with formulas intended to be used for scoring narrative text. How futile is that?

And how futile is scoring Web documents with ever changing dynamic content?

Do readability formulas work? (http://blogs.wsj.com/numbersguy/do-readability-formulas-work-297/tab/article/)

And if so: which tool is better? It all depends on who is counting, how, and why.
