IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: July 2009

Data Mining Poetry

28 Tuesday Jul 2009

Posted by egarcia in Data Mining

I am intrigued by the subject of data mining poetry. This is an interesting topic for a grad student thesis since:

  • An exact-phrase search for “data mining poetry” in search engines returns a small answer set.

  • Unlike other types of content, all words, including those usually treated as stopwords, might matter; i.e., they must be counted because they might act as content-bearing terms. Thus, there is no such thing as a stopword in poetry.

  • Word statistics (e.g., word counts per line) and specific tokens matter, unless we talk about so-called free-style poetry.

  • Meter makes poetry suitable for building language-specific and writing-style-specific parsers.
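To make the word-statistics point concrete, here is a minimal Python sketch that counts words per line and token frequencies without removing any stopwords. The function names and the two-line sample are made up for illustration, and the tokenizer is a simple assumption (runs of word characters), not a real poetry parser.

```python
import re
from collections import Counter

# \w is Unicode-aware for Python 3 strings, so accented Spanish words stay whole.
TOKEN = re.compile(r"\w+")


def tokenize(line):
    """Lowercase a line and split it into word tokens; nothing is filtered out."""
    return TOKEN.findall(line.lower())


def words_per_line(poem):
    """Return the word count of each non-empty line."""
    return [len(tokenize(line)) for line in poem.splitlines() if line.strip()]


def token_counts(poem):
    """Count every token, including words usually discarded as stopwords."""
    return Counter(tok for line in poem.splitlines() for tok in tokenize(line))


# Made-up sample text, used only to exercise the functions.
sample = "the rose is red\nthe violet is blue"

print(words_per_line(sample))
print(token_counts(sample).most_common(3))
```

Per-line counts like these could serve as a first, rough signal of meter, while the unfiltered token counts keep words such as "the" and "is" in play.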

Any help will be greatly appreciated. Meanwhile, here are some relevant links:

poetry parser, anyone?

Poema Materotico – This is a blog-popular poem written in Spanish

poemas matematicos – Spanish resources

mathematical poems

mathematical poetry

Hacking the Cloud: Getting Google’s User Data by Hacking Twitter

17 Friday Jul 2009

Posted by egarcia in Hacking, Homeland Security

A day ago Michael Arrington’s TechCrunch published excerpts from “leaked” documents stolen from the Google Apps account of a Twitter employee, which included over 300 confidential files meant for “internal” Twitter consumption. “Hacker Croll” sent TechCrunch a zip file with 310 private files from inside Twitter.
(http://www.techtree.com/India/News/Leaked_Documents_Twitter_TechCrunch_Faceoff/551-104503-643.html).

It appears HC essentially used a cracker tool of some sort to brute-guess weak passwords. Once inside the first security ring, …

Cloud Programs: A Web Vulnerability Paradise for Hackers

Twitter relies heavily on cloud-based apps (Web-centric programs such as Google Docs or Web-based e-mail), and these services are becoming increasingly interconnected. Even social Web apps are beginning to share data: Facebook Connect and Google Friend Connect, for example, let you log in to multiple sites with a simple Facebook or Google account, raising the vulnerability of your entire online identity.
(http://www.switched.com/2009/07/17/twitter-employee-accounts-hacked-business-documents-leaked/)

The documents coming out of the hacker seem to be pretty significant. The “problem” is that if you have a Google Apps email account compromised, you also have shared calendar, Docs, Contacts, Wikis(Sites), etc.
(http://www.pcworld.com/article/168572/google_apps_security_questioned_after_twitter_leak.html)

This might be a good case study for students planning to take the AIR Web: Web Spam and Internet Vulnerability course.

Google Voice: A Call to Rule all Voice-based Services

16 Thursday Jul 2009

Posted by egarcia in Data Mining

Google yesterday announced Google Voice for Android and BlackBerry, a mobile phone service that brings voicemail transcriptions, the ability to call and text with your Voice number, and cheap international dialing to your mobile phone. It is like one number to rule them all!

According to Google Mobile Blog:

“The Google Voice app integrates seamlessly with your phone’s native address book, making it even easier to call or text with your Voice number. Voicemail transcriptions are now available, and the app will highlight individual words during playback just like your favorite karaoke song. It also lets you take advantage of Google Voice’s low-priced international call rates, starting at only $0.02/minute.”

The service is expected to eventually serve as glue for other Google services such as Gmail, Web search, YouTube, SMS, etc.

This is great news considering that, as mentioned in the current issue of IRW, texting is a new playground for data miners. Imagine then a similar mining playground involving voice!

http://googlemobile.blogspot.com/2009/07/google-voice-for-android-and-blackberry.html

See also

http://searchengineland.com/google-voice-for-mobile-one-number-to-rule-them-all-22417

The Most Influential Paper that Gerard Salton Never Wrote

15 Wednesday Jul 2009

Posted by egarcia in Vector Space Models

≈ 5 Comments

It is surprising how even serious information retrieval researchers and journals quote papers that were never written!

This is the thesis of David Dubin’s great 2004 article, The Most Influential Paper Gerard Salton Never Wrote.

Dubin wrote:

“In giving credit to Salton for the vector model, a number of authors cite an overview paper titled “A Vector Space Model for Information Retrieval,” which some show as published in the JASIS in 1975 and others as published in the Communications of the Association for Computing Machinery (CACM) in 1975. In fact, no such article was ever published, and citations to it usually represent a confusion of two 1975 articles (Salton, Wong, & Yang, 1975; Salton, Yang, & Yu, 1975), neither of which were overviews of the VSM as it is generally understood (see section 5 below). Some of Salton’s own colleagues have been guilty of this mistake: both Cardie et al. and Singhal cite the CACM version, for example (Singhal, 2001; Cardie, Ng, Pierce, & Buckley, 2000). The paper is even cited in a few of the very last articles on which Salton is listed as a coauthor (Singhal, Salton, Mitra, & Buckley, 1996; Singhal & Salton, 1995). These papers were published close to or shortly after the time of his death, and so the errors cannot be blamed on Salton (remembered by his colleagues as a very careful and meticulous writer).”

Somehow far too many IR researchers misquote Salton’s 1975 paper titled “A vector space model for automatic indexing”. This causes digital libraries to create a spurious record attached to many cross-referenced articles.

I searched Google for “a vector space model for information retrieval” + salton and indeed there are many reputable publications and researchers citing a paper that was never published! What a shame.

That says a lot about the researchers, editors, and reviewers who never bothered to check the accuracy of the references.

Centering Data With Excel

10 Friday Jul 2009

Posted by egarcia in Data Mining, Newsletters

The QA column of the current issue of IR Watch Newsletter has a great question that might help IR, CS, and stats students.

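Q: Centering Data with Excel. In Excel, how do you center a data set?

A: To center a data set, use the STANDARDIZE function, which converts x values into z-scores; i.e.,

z = (x - a)/s

where a and s are, respectively, the arithmetic mean and the standard deviation of the data. Note that z-scores are both centered on the mean and scaled by the standard deviation, and that the formula used here estimates s with STDEV, the sample standard deviation. The following table emulates an Excel spreadsheet.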

 

   | A         | B            | C
 1 | Age, x(A) | Weight, x(W) | Height, x(H)
 2 | 64        | 57           | 8
 3 | 71        | 59           | 10
 4 | 53        | 49           | 6
 5 | 67        | 62           | 11
 6 | 55        | 51           | 8
 7 | 58        | 50           | 7
 8 | 77        | 55           | 10
 9 | 57        | 48           | 9
10 | 56        | 42           | 10
11 | 51        | 42           | 6
12 | 76        | 61           | 12
13 | 68        | 57           | 9
14 |           |              |
15 | z(A)      | z(W)         | z(H)
16 | 0.14      | 0.62         | -0.44
17 | 0.92      | 0.92         | 0.61
18 | -1.09     | -0.55        | -1.49
19 | 0.47      | 1.36         | 1.14
20 | -0.86     | -0.26        | -0.44
21 | -0.53     | -0.40        | -0.97
22 | 1.59      | 0.33         | 0.61
23 | -0.64     | -0.70        | 0.09
24 | -0.75     | -1.58        | 0.61
25 | -1.31     | -1.58        | -1.49
26 | 1.47      | 1.21         | 1.67
27 | 0.58      | 0.62         | 0.09

Rows 2–13 contain the data set x(A), x(W), and x(H). In rows 16–27 the set was centered by typing the following formula in cell A16:

=STANDARDIZE(A2,AVERAGE(A$2:A$13),STDEV(A$2:A$13))

Copying this formula into cells A16 through C27 centers the data set. That was easy!
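As a cross-check outside Excel, here is a minimal Python sketch of the same calculation for the Age column. The function name standardize is an illustrative choice, not a library call; like the spreadsheet formula, it assumes the sample standard deviation (n - 1 denominator), which is what both Excel's STDEV and Python's statistics.stdev compute.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores, z = (x - a)/s, using the sample standard deviation."""
    a = mean(values)   # arithmetic mean, like AVERAGE
    s = stdev(values)  # sample standard deviation, like STDEV
    return [(x - a) / s for x in values]


# Age column, rows 2-13 of the spreadsheet above.
ages = [64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68]

z_ages = standardize(ages)
print([round(z, 2) for z in z_ages])
```

Rounded to two decimals, the printed values should agree with column A, rows 16–27. Swapping stdev for statistics.pstdev would give the population version instead, with slightly larger z-scores in absolute value.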

IRW-7-2009: Data Mining Texting

06 Monday Jul 2009

Posted by egarcia in Data Mining, Newsletters

The current issue of the IRW newsletter is out.

Featured Article:

Data Mining Texting
TTMD OMG MOS CU

“My parents send email, I text.” This illustrates the obvious: a digital divide between parents and teens. While parents are busy replying to email or, at most, blogging, their kids are probably busy developing their own language to alert their peers when mom or dad is trying to figure out what they are texting about. Did you know that MOS CU means ‘Mother over shoulder. See you’? And how about PW CUL (‘Parents watching. See you later’)?

Indeed… Texting is not just for teens:

Texting is not only revolutionizing the way business is conducted in 2009, but is also an emerging data mining playground. The number of behavioral patterns in connection with texting is on the rise at different diffusion fronts: from sexting and sextcasting (transmission of conversations, videos, photos with sexual content) to dealing (transmission of conversations in connection with illegal drug activities), to encoding conversations about Wall Street transactions, industrial espionage, and so forth.

Random notes prior to 4th July weekend

03 Friday Jul 2009

Posted by egarcia in Miscellaneous, Newsletters, SEO Myths

As the 4th of July weekend approaches, here are some notes before heading off to planet oblivious.

1. Yesterday we had an interesting business entrepreneur meeting with the CIO of the Government of Puerto Rico at El Palacio Rojo, Fortaleza.

2. IRW should be out by Monday. Main article: Data Mining Texting.

3. Only monkeys still believe in KD Myths. Ha, Ha.
