Global Terrorism Database

November 3, 2009 by E. Garcia

If you are into homeland security oriented data mining, this post is for you.

The University of Maryland has a Global Terrorism Database (GTD; http://www.start.umd.edu/gtd/) with information on over 80,000 terrorist attacks that intelligence researchers can tap into.

GTD is an open-source database including information on terrorist events around the world from 1970 through 2007 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 80,000 cases.

You can search by keywords or browse by region, country, perpetrator, weapon, attack, or target.

It also has advanced search capabilities. To perform an advanced search, select the options you want in each category. If you do not check any options in a category, your search will include all content from that category. For example, selecting Algeria from the “Country” list restricts your search to incidents in Algeria, while leaving it blank searches all countries.

Incident searches can be restricted to specific years using several pull-down menus.

I tested by querying [puerto rico] and indeed was able to obtain incident records related to Los Macheteros. The answer set, however, included results not relevant to the Island of Puerto Rico.

The database is pretty small, but can come in handy at times. I will definitely use it for one of my next graduate courses on search engine architectures.

2009-10-IRW: Network Connectivity

October 29, 2009 by E. Garcia

The October issue of the IRW newsletter is out, but late due to my academic duties. It should arrive in subscribers’ inboxes today (or at the latest tomorrow). Sorry for the inconvenience.

In the featured article, I included material from one of my grad exams. The QA column features IP-to-MAC conversions. The Who is Who section features Van Jacobson, creator of UDP-based traceroute and of many other tools.

Hopefully, the November issue will not be that delayed. BTW, it will feature the late Mike Muuss, inventor of Ping, plus a super-fast track on the art of subnetting.

 

FIRE: Forum for Information Retrieval Evaluation

October 14, 2009 by E. Garcia

Ellen Voorhees, Director of TREC at NIST.Gov, sent me this Call for Participation, reproduced below to facilitate its dissemination:

CALL FOR PARTICIPATION

FIRE
(Forum for Information Retrieval Evaluation)
Workshop
DAIICT, Gandhinagar, India
19-21 February 2010

http://www.isical.ac.in/~fire

The success of TREC, CLEF, and NTCIR has clearly established the importance
of building reusable, large-scale standard test collections in Information
Access research. The aim of FIRE is to encourage research in Indian language
Information Access by creating a similar platform for Indian languages that
provides the data and a common forum for comparing models and techniques.

The Tasks:
==========
1) Ad-hoc monolingual document retrieval in Bengali, Hindi and Marathi.

2) Ad-hoc cross-lingual document retrieval
- documents in Bengali, Hindi, Marathi, and English,
- queries in Bengali, Hindi, Marathi, Tamil, Telugu and English.
- Bengali and Hindi topics will also be transliterated and made available
in Roman script. Adhoc monolingual task participants are encouraged to
submit runs using these queries as well.

3) Retrieval and classification from mailing lists and forums.
This is a pilot task being offered by IBM India Research Lab.

4)  Ad-hoc Wikipedia-entity retrieval from news documents
- Entities mined from English Wikipedia
- Query documents from English news website
This is a pilot task being offered by Yahoo! Labs, Bangalore.

Important Dates:
================
Ad-hoc monolingual and cross-lingual document retrieval:
Training data release   Aug 15 ‘09
Test data release       Nov 01 ‘09
Adhoc run submission    Nov 25 ‘09
Results released        Feb 01 ‘10

Retrieval and classification from mailing lists and forums:
Training data release   Oct 16 ‘09
Test data release       Nov 01 ‘09
Run submission          Nov 25 ‘09
Results declared        Feb 01 ‘10

Ad-hoc Wikipedia-entity retrieval from news documents:
Training data release   Oct 15 ‘09
Test data release       Nov 01 ‘09
Run submission          Nov 25 ‘09
Results declared        Feb 01 ‘10

Task Co-ordinators:
===================
Ad-hoc retrieval:
Pushpak Bhattacharyya (pb@cse.iitb.ac.in)
IIT Bombay
Dipasree Pal (dipasree_t@isical.ac.in)
ISI Kolkata

Retrieval and classification from mailing lists and forums:
Debapriyo Majumdar (debapriyo@in.ibm.com)
IBM India Research Lab
Ayan Bandyopadhyay (ayan_t@isical.ac.in)
ISI Kolkata

Ad-hoc Wikipedia-entity retrieval from news documents:
Ashwin Tengli (ashwint@yahoo-inc.com)
Yahoo! Labs, Bangalore
Pabitra Mitra (pabitra@cse.iitkgp.ernet.in)
IIT Kharagpur

Overall co-ordinators:
Prasenjit Majumder (p_majumder@daiict.ac.in)
DAIICT, Gandhinagar
Mandar Mitra (mandar@isical.ac.in)
ISI Kolkata

International Advisory Committee for FIRE:
==========================================
Amit Singhal, Google Fellow, USA
Carol Peters, ISTI-CNR, Italy
Christian Fluhr, CEA, France
Donna Harman, National Institute of Standards and Technology, USA
Doug Oard, University of Maryland, USA
Ee Peng Lim, Nanyang Technological University, Singapore
Ellen Voorhees, National Institute of Standards and Technology, USA
Fabrizio Sebastiani, ISTI-CNR, Italy
Gareth Jones, Dublin City University, Ireland.
Hsin-Hsi Chen, National Taiwan University, Taipei, Taiwan
Hwee Tou Ng, National University of Singapore, Singapore
Iadh Ounis, University of Glasgow, UK
Ian Soboroff, National Institute of Standards and Technology, USA
Jacques Savoy, University of Neuchatel, Switzerland
James Allan, University of Massachusetts Amherst, USA
Krishna Kummamuru, IBM Research Lab, India
Mark Sanderson, University of Sheffield, UK
Mun Kew Leong, Institute for Infocomm Research, Singapore
Norbert Fuhr, University of Duisburg, Germany
Noriko Kando, National Institute of Informatics, Japan
Paul McNamee, Johns Hopkins University, USA
Prabhakar Raghavan, Yahoo! Research Labs, USA
Ricardo Baeza-Yates, Yahoo! Research Labs, Spain
Stephen Robertson, Microsoft Research, Cambridge, UK
Sung Hyon Myaeng, KAIST, South Korea
Tat-Seng Chua, National University of Singapore, Singapore
Tetsuya Sakai, Microsoft Research Asia, Beijing

DNS Intelligence

October 13, 2009 by E. Garcia

Today’s Internet Engineering Part 1 course lecture will be on DNS Intelligence and how we can use DNS records to understand virus and worm attacks as well as remote network topologies. Quite handy these days.

Please check Lecture 8

The Danger of Microsoft: Data Lost

October 11, 2009 by E. Garcia

According to a TechCrunch news item from 10-10-09, a crash at Microsoft’s Danger servers resulted in the loss of all users’ personal data, and they don’t have a backup!

The news says:

T-Mobile and Danger, the Microsoft-owned subsidiary that makes the Sidekick, has just announced that they’ve likely lost all user data that was being stored on Microsoft’s servers due to a server failure. That means that any contacts, photos, calendars, or to-do lists that haven’t been locally backed up are gone.

And there is no backup for the data. Really smart, Microsoft people. That says a lot!

This is gonna be in an information security textbook near you. How about in a textbook on Human-Computer No-Interaction?

All About Email Headers

October 6, 2009 by E. Garcia

If you are enrolled in the IE-Part 1 course, here is some reference material on Email Headers for today’s lecture:

Exposing email headers

http://www.abs-comptech.com/EmailHeaders.htm

Tracking the source of email spam

http://www.rahul.net/falk/mailtrack.html

How to read email headers

http://www.emailaddressmanager.com/tips/header.html

Reading the email header

http://antivirus.about.com/od/windowsbasics/a/emailheaders.htm

Reading email headers

http://www.tinhat.com/email/read_email_headers.html

Spamlinks: Reading email headers

http://spamlinks.net/track-trace-headers.htm

ACCC: Reading Email Headers

http://www.uic.edu/depts/accc/newsletter/adn29/headers.html

E-mail Headers and SMTP Commands

http://www.avolio.com/columns/E-mailheaders.html

All About Email Headers

http://www.stopspam.org/index.php?option=com_content&view=article&id=45&Itemid=56

Security Optimization Strategies in the Workplace

http://www.miislita.com/searchito/security-optimization-strategies.html

Email Protocols

October 5, 2009 by E. Garcia

If you are a student enrolled in the Internet Engineering I graduate course, check the Lecture 7 update.

We will be covering email protocols such as SMTP, POP3, and IMAP. The exercise section covers email headers intelligence and email crawlers.

DNS Configuration

September 28, 2009 by E. Garcia

If you are a student enrolled in the Internet Engineering I graduate course, check the Lecture 6 update.

I will be covering all about DNS configuration files. For the hands-on exercise section, we will be using nslookup commands to snoop at all relevant records of remote Web domains.

Use nslookup/? to access the options helper
Use nslookup followed by ? on a separate line to access the commands helper
To quit nslookup, press Ctrl+C or type quit or exit.

Migrating from IPv4 to IPv6: The Next Nightmare?

September 24, 2009 by E. Garcia

Two weeks ago, venerable Vinton Cerf urged the Internet community to migrate from IPv4 to IPv6. According to Cerf, co-designer of the TCP/IP protocols, IPv4 will run out of addresses next year or in early-2011.

However, there is a problem.

Back in March, an alleged Fatal Flaw for IPv6 was reported: It’s Not Backwards Compatible.

Both news items are equally intriguing.

IPv6 migration: Your Next Nightmare?

Internet Engineering I: Course Lectures

September 21, 2009 by E. Garcia

The following are the lecture and exercise topics covered in the PUPR.edu core graduate course Internet Engineering, Part I. Students enrolled in the course might want to revisit this post as it will be updated.

Lecture 0

History of the Internet & Search Engines

Internet Basics

Lecture 1

RFCs (Request for Comments)

Network Types

IP (Internet Protocol)

Exercise 1 – RFCs, Network types, IP calculations

Lecture 2

OSI Reference Model

ARP

ICMP

Exercise 2 – IP-MAC Mapping, Prompt Commands (arp, ipconfig, nslookup)

Lecture 3

Man-in-the-Middle ARP Attacks

IGMP

IP Packets

Exercise 3 – Broadcast & Multicast IPs, Prompt Commands (netstat, ping, tracert, ipconfig, arp, nslookup)

Lecture 4

Fragmentation Offset

FO Overlapping Attacks

FO Gap Attacks

Tiny FO Attacks

TCP Protocol & Buffers

Exercise 4 – TCP buffers, Congestion Windows, Advertised Windows

Lecture 5

PING

PING of Death

Smurfing

TRACEROUTE-based Intelligence

Exercise 5 – Prompt Commands (arp, ipconfig, nslookup, netstat, ping, tracert)

Lecture 6

BIND & WINDOWS DNS (Domain Name Server)

Internet backbone root servers

Configuration Files

DNS Configuration Errors

Forward Lookup (Zone) Files

Reverse Lookup Files

Exercise 6 – Prompt Commands (interactive/non-interactive nslookup modes)

Lecture 7

SMTP

POP3

IMAP

Email Headers

Exercise 7 – Email Intelligence.

Lecture 8

DNS Intelligence

Using DNS records to understand Virus & Worm Attacks

Network Topology Intelligence from DNS records

Exercise 8 – DNS Intelligence

Lecture 9

General Review

Practice Test

Lecture 10

Final Exam, Oct 27

Course Grading System

8 out of 9 hands-on exercises count (worst exercise grade dropped)
1st partial exam = average of first 4 exercises
2nd partial exam = average of last 4 exercises
These amount to 75% of total grade
Final Exam amounts to 25% of total grade and it will be curved.

After that, total letter grade will be curved.

Course Letter Grades
A (100-89%)
B (88-77%)
C (76-60%)
D (59-50%)
F (49-0%)

2009-9-IRW: TCP/IP Practice Exam

September 12, 2009 by E. Garcia

TCP/IP Review Test

The current issue of IR Watch is out. Sorry it was a bit delayed. The featured article is a practice exam on TCP/IP that I’m giving to students enrolled in my Internet Engineering I graduate course.

The test was designed to review what students have learned during the first five lectures. Students need to describe about 10 TCP/IP-related vulnerability/hacking practices. So the test is also a great jump start for those interested in such weaknesses.

I have included an Excel goodie for making IP conversions (IPv4/hexadecimal/decimal equivalent/binary) as well as some material from Tim Berners-Lee’s 1989 WWW proposal.
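For readers without Excel at hand, the same kind of IPv4 conversions can be sketched in a few lines of Python. This is only an illustration and not the spreadsheet itself; the function names ipv4_to_int and ipv4_forms are hypothetical.

```python
def ipv4_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit decimal equivalent."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # shift in one octet at a time
    return value


def ipv4_forms(address):
    """Return the dotted, decimal, hexadecimal, and binary forms of an address."""
    value = ipv4_to_int(address)
    return {
        "dotted": address,
        "decimal": value,
        "hex": f"{value:08X}",     # 8 hex digits, zero padded
        "binary": f"{value:032b}",  # 32 bits, zero padded
    }


if __name__ == "__main__":
    for name, form in ipv4_forms("192.168.1.1").items():
        print(f"{name:>8}: {form}")
```

Each octet contributes 8 bits, so the decimal equivalent is simply the four octets read as one base-256 number.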

Enjoy it.

New Graduate Courses

September 1, 2009 by E. Garcia

As PUPR students know by now, the AIRWeb and Internet Engineering courses have been consolidated into a single course called Internet Engineering I (IE-I), which is on Tuesdays.

This was a decision made strictly by the administration. Twelve graduate students are enrolled, a big number for a grad course. We are now in the fourth week of IE-I and I can tell that it is a lot of fun.

This coming Winter semester I’m scheduled to teach a new grad course called Advanced Search Engine Architecture (ASEA). Both IE-I and ASEA are hands-on. This means students need to get their hands and feet wet, not just learn the theory.

What we are trying to accomplish in IE-I is to understand how hackers and spammers use Internet architectures at the level of TCP/IP and Search Engines to game the system. I’ll open a special blog category for it during the week.

The first lecture (Lecture 1) was briefly summarized in the August 2009 issue of IR Watch. BTW, tonight’s lecture (Lecture 4) covers the following:

IP Protocol (MAC and IP Mapping)

ICMP Protocol

ARP Hacking Attacks

ICMP Hacking Attacks

Firewall’s Fragmentation Offset Attacks

Meanwhile, ASEA is an expanded version of the previous Search Engine Architecture (SEA) course I’ve taught before. Students interested in registering can search this blog for the SEA category and check what we have covered in the past. This will give them an idea of what to expect from the Advanced SEA course. One thing I’m planning to do differently is to build an inverted index from scratch using AJAX. The most recent version of Terrier will also be used for testing/benchmarking experiments.

Last but not least, the September issue of IRW will be a bit delayed.

TCP/IP: The paper that started it all

August 25, 2009 by E. Garcia

Here is the paper that started it all: A Protocol for Packet Network Intercommunication, by Vint Cerf and Bob Kahn.

2009-8-IRW: Internet & Search Engines: Early Days

August 14, 2009 by E. Garcia

Internet & Search Engines

The current issue of IRW is already out –a bit delayed due to reasons previously mentioned. Enjoy it.

Vector Notation

August 10, 2009 by E. Garcia

I’ve been asked what the standard notation for vectors is. I normally use loose notation, unless I need to write or review a formal piece, in which case I follow the APS style. See also here.

A vector should be represented by a letter, in boldface or with a right arrow on top.

A caret should be used to indicate a unit vector.

An inner product should be indicated by placing a dot between two letters representing vectors.
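In LaTeX, these three conventions might be typeset as follows. This is a minimal sketch; the symbols a and b are placeholders chosen for illustration and are not taken from the APS guide.

```latex
\mathbf{a} \quad \text{or} \quad \vec{a}   % a vector: boldface letter, or a right arrow on top
\hat{\mathbf{a}}                           % a unit vector: caret on top
\mathbf{a} \cdot \mathbf{b}                % an inner product: dot between two vectors
```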

Note that Dirac Notation is a different animal.

The APS Style Guide has additional guidelines.

Thesaurus as a Complex Network

August 6, 2009 by E. Garcia

I came across Thesaurus as a complex network, a fascinating 2003 paper written by Adriano de Jesus Holanda, Ivan Torres Pisa, Osame Kinouchi, Alexandre Souto Martinez and Evandro Eduardo Seron Ruiz from Universidade Sao Paulo, Brazil, in which they model thesauri using graph theory. The abstract reads:

“A thesaurus is one, out of many, possible representations of term (or word) connectivity. The terms of a thesaurus are seen as the nodes and their relationship as the links of a directed graph. The directionality of the links retains all the thesaurus information and allows the measurement of several quantities. This has lead to a new term classification according to the characteristics of the nodes, for example, nodes with no links in, no links out, etc. Using an electronic available thesaurus we have obtained the incoming and outgoing link distributions. While the incoming link distribution follows a stretched exponential function, the lower bound for the outgoing link distribution has the same envelope of the scientific paper citation distribution proposed by Albuquerque and Tsallis [1]. However, a better fit is obtained by simpler function which is the solution of Ricatti’s differential equation. We conjecture that this differential equation is the continuous limit of a stochastic growth model of the thesaurus network. We also propose a new manner to arrange a thesaurus using the “inversion method”.”

The study is important because it provides an interesting look at word relationships. They have identified an underlying power law, which in my opinion might be worth investigating to see whether it is at the core of semantic relationships.

They briefly mentioned the limitations of LSA:

“However, LSA has been criticized as a poor approach for predicting semantic neighborhood”.

Indeed, LSA (or LSI) does not necessarily describe or predict semantics, as originally thought. In my view, LSA/LSI itself is a misnomer. Research references can be provided to support this view.

I do have one additional comment on the paper. In it, LSA is described as a PCA technique. The authors write:

“Another interesting way to treat data is the Latent Semantic Analysis (LSA) [5] which deals with word covariance in a corpus. LSA is a principal component analysis (PCA) technique , i.e., the covariance matrix is diagonalized and from the most important eigenvalues (around 300) the eigenvectors are considered to span an Euclidean vector space.”

This might not be entirely accurate. Let’s see why:

1. PCA was invented by Karl Pearson in 1901, so it is more than half a century older than Golub and Kahan’s SVD algorithm, which was published in 1965. See G. Golub and W. Kahan, J. SIAM, Numer. Anal. Ser. B, Vol. 2, No. 2 (1965).

2. In 1988 Dumais et al. applied Golub’s SVD to text and called that LSA (LSI). See Proceedings of the Conference on Human Factors in Computing Systems, CHI. 281-286, Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. & Harshman, R. (1988). See also, Improving information retrieval using Latent Semantic Indexing. Proceedings of the 1988 annual meeting of the American Society for Information Science. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Beck, L. (1988).

3. In LSA (LSI) the SVD algorithm can be applied to matrices that are not necessarily populated with covariance values.

4. It was only later realized that SVD can be applied to a covariance matrix to obtain the PCA components.

5. See the PCA & SPCA Tutorial

6. PCA is not LSI. See http://irthoughts.wordpress.com/2007/05/05/pca-is-not-lsi/
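Points 3 and 4 can be illustrated numerically. The sketch below uses a made-up 4 x 2 data matrix X (an assumption, not data from any of the papers cited) and relies on the standard fact that the squared singular values of X are the eigenvalues of X^T X. It compares those with the eigenvalues obtained after column-centering X, which is what PCA diagonalizes up to a factor of 1/(n - 1).

```python
import math


def gram(rows):
    """Return the 2x2 matrix X^T X for a list of 2-column rows."""
    a = sum(x * x for x, _ in rows)
    b = sum(x * y for x, y in rows)
    d = sum(y * y for _, y in rows)
    return [[a, b], [b, d]]


def center(rows):
    """Subtract each column's mean from that column."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    return [(x - mean_x, y - mean_y) for x, y in rows]


def eigenvalues_sym2x2(m):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mid = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return (mid + radius, mid - radius)


X = [(1, 2), (2, 1), (3, 4), (4, 3)]  # hypothetical data, rows = observations

# Squared singular values of the raw matrix (what SVD on X itself "sees").
raw = eigenvalues_sym2x2(gram(X))

# Eigenvalues of the scatter matrix of the centered data.
scatter = eigenvalues_sym2x2(gram(center(X)))

# Dividing by (n - 1) gives the eigenvalues of the covariance matrix (PCA).
pca_variances = tuple(v / (len(X) - 1) for v in scatter)

print("SVD on raw X (squared singular values):", raw)
print("PCA on centered X (variances):         ", pca_variances)
```

The two sets of numbers differ, which is the practical content of the remark that applying SVD to a matrix is not the same thing as doing PCA unless the matrix has been centered (or is a covariance matrix) to begin with.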

Random Notes Before School Starts

August 3, 2009 by E. Garcia

1. The current issue of IR Watch will be out over the weekend–a bit delayed due to getting ready for school, preparing lessons and research projects. If things go as expected, my academic schedule will be a bit busy between teaching and research at two different universities.

2. I’m researching for a manuscript that deals with affine transformations applied to several IR problems. It expands on Vector Space Theory and allows one to think out of the “term-document” box. Great stuff.

3. Here is a great grad project in ppt format: Semantically Motivated Information Retrieval. I thank its author for referencing my  SVD Fast Track Tutorial.

4. Talking about semantically motivated, sentiment analysis, spam, etc… Funny how some folks in the SEO world like to damage the reputation of others without presenting any evidence. This time the trolls took on Kim Krause Berge ( http://cre8pc.com/archives/1489 ). I have always admired Kim’s work, consider her a usability icon, and had the privilege of meeting her back in 2005. I was surprised to see these folks having a field day at her expense at Rand’s site. Kim, I feel your pain. However, more than one SEO forum/blog has lost credibility by allowing these folks, most of whom think they can be socially “ranked” by attacking whoever is at the “top”. The fact is that most trolls are paper tigers that go into hiding at the first Cease & Desist or defamation lawsuit.

Data Mining Poetry

July 28, 2009 by E. Garcia

I am intrigued with the subject of data mining poetry. This is an interesting topic for a grad student thesis since:

Querying search engines in EXACT mode for “data mining poetry” returns a small answer set.

Unlike other types of content, all words, including those considered stopwords, might matter; that is, they must be counted because they might act as content-bearing terms. Thus, there is no such thing as a stopword in poetry.

Word statistics (e.g., word counts per line) and specific tokens matter, unless we talk about so-called free-style poetry.

Meter makes poetry suitable for building language-specific and writing style-specific parsers.
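As a starting point, the word statistics mentioned above could be collected with something like the following Python sketch. The function name line_statistics and the sample lines are hypothetical; the only design decision carried over from the list is that no stopword is discarded.

```python
import re
from collections import Counter


def line_statistics(poem):
    """Return (words_per_line, term_counts) without removing any stopwords."""
    words_per_line = []
    term_counts = Counter()
    for line in poem.splitlines():
        # Letters only (any alphabet), with an optional internal apostrophe.
        tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", line.lower())
        if not tokens:
            continue  # skip blank lines between stanzas
        words_per_line.append(len(tokens))
        term_counts.update(tokens)  # "the", "is", etc. are counted like any term
    return words_per_line, term_counts


# Made-up lines, used only to exercise the function.
sample = "The rose is red\nthe violet is blue\n\nand so are you"

counts_per_line, terms = line_statistics(sample)
print(counts_per_line)
print(terms.most_common(3))
```

A regular word count per line is one of the simplest signals that a text is metered; a real parser would need syllable counts and stress patterns, which are language specific.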

Any help will be kindly appreciated. Meanwhile, here are some relevant links:

poetry parser, anyone?

Poema Materotico – This is a blog-popular poem written in Spanish

poemas matematicos – Spanish resources

mathematical poems

mathematical poetry

Hacking the Cloud: Getting Google’s User Data by Hacking Twitter

July 17, 2009 by E. Garcia

A day ago Michael Arrington’s TechCrunch published excerpts from “leaked” documents stolen from the Google Apps account of a Twitter employee, which included over 300 confidential files meant for “internal” Twitter consumption. “Hacker Croll” sent TechCrunch a zip file with 310 private files from inside Twitter.
(http://www.techtree.com/India/News/Leaked_Documents_Twitter_TechCrunch_Faceoff/551-104503-643.html).

It appears HC essentially used a cracker tool of some sort to brute-guess weak passwords. Once inside the first security ring, …

Cloud Programs: A Web Vulnerability Paradise for Hackers

Twitter relies heavily on cloud-based apps (Web-centric programs such as Google Docs or Web-based e-mail), and these services are becoming increasingly interconnected. Even social Web apps are beginning to share data: Facebook Connect and Google Friend Connect, for example, let you log in to multiple sites with a simple Facebook or Google account, raising the vulnerability of your entire online identity.
(http://www.switched.com/2009/07/17/twitter-employee-accounts-hacked-business-documents-leaked/)

The documents coming out of the hacker seem to be pretty significant. The “problem” is that if you have a Google Apps email account compromised, you also have shared calendar, Docs, Contacts, Wikis(Sites), etc.
(http://www.pcworld.com/article/168572/google_apps_security_questioned_after_twitter_leak.html)

This might be a good case study for students planning to take the AIR Web: Web Spam and Internet Vulnerability course.

Google Voice: A Call to Rule all Voice-based Services

July 16, 2009 by E. Garcia

Google yesterday announced Google Voice for Android and BlackBerry, a mobile phone service that brings voicemail transcriptions, the ability to call and text with your Voice number, and cheap international dialing to your mobile phone. It is like one number to rule them all!

According to Google Mobile Blog:

“The Google Voice app integrates seamlessly with your phone’s native address book, making it even easier to call or text with your Voice number. Voicemail transcriptions are now available, and the app will highlight individual words during playback just like your favorite karaoke song. It also lets you take advantage of Google Voice’s low-priced international call rates, starting at only $0.02/minute.”

The service is expected to eventually serve as glue for other Google services like GMail, Web Searches, YouTube, SMS, etc.

This is great news, considering that, as mentioned in the current issue of IRW, texting is a new playground for data miners. Imagine then a similar mining playground involving voice!

http://googlemobile.blogspot.com/2009/07/google-voice-for-android-and-blackberry.html

See also

http://searchengineland.com/google-voice-for-mobile-one-number-to-rule-them-all-22417

The Most Influential Paper that Gerard Salton Never Wrote

July 15, 2009 by E. Garcia

It is surprising how even serious information retrieval researchers and journals quote papers that were never written!

This is the thesis of David Dubin’s great 2004 article
The Most Influential Paper Gerard Salton Never Wrote

Dubin wrote:

“In giving credit to Salton for the vector model, a number of authors cite an overview paper titled “A Vector Space Model for Information Retrieval,” which some show as published in the JASIS in 1975 and others as published in the Communications of the Association for Computing Machinery (CACM) in 1975. In fact, no such article was ever published, and citations to it usually represent a confusion of two 1975 articles (Salton, Wong, & Yang, 1975; Salton, Yang, & Yu, 1975), neither of which were overviews of the VSM as it is generally understood (see section 5 below). Some of Salton’s own colleagues have been guilty of this mistake: both Cardie et al. and Singhal cite the CACM version, for example (Singhal, 2001; Cardie, Ng, Pierce, & Buckley, 2000). The paper is even cited in a few of the very last articles on which Salton is listed as a coauthor (Singhal, Salton, Mitra, & Buckley, 1996; Singhal & Salton, 1995). These papers were published close to or shortly after the time of his death, and so the errors cannot be blamed on Salton (remembered by his colleagues as a very careful and meticulous writer).”

Somehow far too many IR researchers misquote Salton’s 1975 paper titled “A vector space model for automatic indexing”. This causes digital libraries to create a spurious record attached to many cross-referenced articles.

I searched Google for “a vector space model for information retrieval” + salton and indeed there are many reputable publications and researchers citing a paper that was never published! What a shame.

That says a lot about researchers, editors, and reviewers that were lazy enough to never bother about the accuracy of the references.

Centering Data With Excel

July 10, 2009 by E. Garcia

The QA column of the current issue of IR Watch Newsletter has a great question that might help IR, CS, and stats students.

Q: Centering Data with Excel- In Excel, how do you center a data set?

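A: To center a data set, use the STANDARDIZE function, which converts x values into z-scores; i.e.,

z = (x – a)/s

where a and s are, respectively, the arithmetic mean and the standard deviation of the data. Strictly speaking, this both centers the data (subtracting a) and scales it (dividing by s). Note that the STDEV function used in the formula given after the table computes the sample standard deviation. The following table emulates an Excel spreadsheet.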

 

      A            B               C
1     Age, x(A)    Weight, x(W)    Height, x(H)
2     64           57              8
3     71           59              10
4     53           49              6
5     67           62              11
6     55           51              8
7     58           50              7
8     77           55              10
9     57           48              9
10    56           42              10
11    51           42              6
12    76           61              12
13    68           57              9
14
15    z(A)         z(W)            z(H)
16    0.14         0.62            -0.44
17    0.92         0.92            0.61
18    -1.09        -0.55           -1.49
19    0.47         1.36            1.14
20    -0.86        -0.26           -0.44
21    -0.53        -0.40           -0.97
22    1.59         0.33            0.61
23    -0.64        -0.70           0.09
24    -0.75        -1.58           0.61
25    -1.31        -1.58           -1.49
26    1.47         1.21            1.67
27    0.58         0.62            0.09

Rows 2 – 13 contain the data set x(A), x(W), and x(H). In rows 16 – 27, the set was centered by typing the following formula in cell A16:

 =STANDARDIZE(A2,AVERAGE(A$2:A$13),STDEV(A$2:A$13))

 Pasting this formula in cells A16 through C27 centers the data set. That was easy!
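For those who prefer to cross-check the spreadsheet outside Excel, here is a minimal Python sketch of the same calculation. The function name standardize is hypothetical; statistics.stdev is used because, like Excel’s STDEV, it computes the sample standard deviation (n – 1 in the denominator).

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores using the mean and the sample standard deviation."""
    a = mean(values)
    s = stdev(values)  # sample standard deviation, as Excel's STDEV does
    return [(x - a) / s for x in values]


# Columns A, B, and C of the spreadsheet (rows 2 to 13).
age = [64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68]
weight = [57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57]
height = [8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9]

for name, column in (("z(A)", age), ("z(W)", weight), ("z(H)", height)):
    print(name, [round(z, 2) for z in standardize(column)])
```

After standardizing, each column has a mean of zero and a sample standard deviation of one, which is a quick sanity check on any centered data set.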

IRW-7-2009: Data Mining Texting

July 6, 2009 by E. Garcia

The current issue of the IRW newsletter is out.

Featured Article:

Data Mining Texting
TTMD OMG MOS CU

“My parents send email, I text.” This illustrates the obvious: a digital divide between parents and teens. While parents are busy replying to email or blogging at the most, their kids are probably busy developing their own language to alert their peers when mom or dad is trying to figure out what they are texting about. Did you know that MOS CU means ‘Mother over shoulder. See you’? And how about PW CUL? (‘Parents watching. See you later’).

Indeed… Texting is not just for teens:

Texting not only is revolutionizing the way businesses are being conducted in 2009, but is an emerging data mining playground. The number of behavioral patterns in connection with texting is on the rise at different diffusion fronts: from sexting and sextcasting (transmission of conversations, videos, photos with sexual content) to dealing (transmission of conversations in connection with illegal drug activities), to encoding conversations about Wall Street transactions, industrial espionage, and so forth.

Random notes prior to 4th July weekend

July 3, 2009 by E. Garcia

As the 4th of July weekend approaches, here are some notes before heading off to planet oblivion.

1. Yesterday we had an interesting business entrepreneur meeting with the CIO of the Government of Puerto Rico at El Palacio Rojo, Fortaleza.

2. IRW should be out by Monday. Main article: Data Mining Texting.

3. Only monkeys still believe in KD Myths. Ha, Ha.

Official: MIC Puerto Rico

June 23, 2009 by E. Garcia

Back in April, I mentioned that Microsoft will be co-launching with Interamerican University of Puerto Rico, Metropolitan Campus the Microsoft Innovation Center (MIC) of Puerto Rico.

Well, tomorrow is the official inauguration. The university has generously provided me with lab and office space to start an interesting research project within the MIC building. This is exciting news. I cannot comment much about the project, except to say that it is at the interface of search engines, social networks, and information security.

It looks like I will have my hands full between working at two universities, blogging, and doing consulting work.

IR Videos in Spanish

June 22, 2009 by E. Garcia

I normally do not put my lecture notes (ppt, pdf, videos) online. However, there are two public conferences that event organizers taped. Both last over 1 hour and are in Spanish, but with slides in English. Here are the links. The quality of the videos is so-so.

Since the videos were made available a few months after the events, these are not properly dated. I have included below the actual date of the events. If you don’t know Spanish, you are out of luck.

1. Understanding Search Engines (Entendiendo a los Buscadores), University of Puerto Rico, Bayamon, 4-23-2008

http://video.google.com/videoplay?docid=-653964730907023811

This one lasts for about two hours. The audience consisted of grad students and researchers. Unfortunately, the video has an audio-visual mismatch of about one slide. If you can cope with this, I hope you like it.

2. Demystifying LSI (Desmitificando LSI)- OJOBuscador Congress, Madrid, Spain, 3-09-2007.

http://www.ojotube.com/videos/congreso-ojobuscador-2007-ponencia-desmitificando-lsi-de-dr-e-garcia/

This one lasts for over one hour. Since it was for a non-scientific audience (mostly Spanish SEOs) I tried to talk very slowly.

What is a Similarity Matrix?

June 16, 2009 by E. Garcia

Sooner or later, CS students, particularly those in IR, will need to deal with similarity matrices.

In simple terms, any matrix M that exhibits the following five characteristics is a similarity matrix.

Squaredness = M must have the same number of rows and columns.
Non-Negativity = all elements of M must be real, non-negative numbers.
Boundedness = all elements of M must adopt values between 0 and 1.
Reflexivity = all diagonal elements of M (i.e., from top left to bottom right) must be filled with 1.
Symmetry = all ij elements must be identical to all ji elements.

A matrix that fails to exhibit even one of these characteristics is not a similarity matrix.

Accordingly, some matrices found in the LSI literature whose elements have been referred to as similarities are not similarity matrices, since they do not conform to the above definition.
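The five conditions translate almost line by line into a checker. The following Python sketch is only one way to write it; the function name is_similarity_matrix and the tolerance parameter tol are assumptions made for illustration, and the input is assumed to be a list of lists of real numbers.

```python
def is_similarity_matrix(m, tol=1e-9):
    """Check the five characteristics on a list-of-lists matrix of real numbers."""
    n = len(m)
    # Squaredness: same number of rows and columns.
    if n == 0 or any(len(row) != n for row in m):
        return False
    for i in range(n):
        for j in range(n):
            value = m[i][j]
            # Non-negativity and boundedness: every element lies in [0, 1].
            if not (0 <= value <= 1):
                return False
            # Symmetry: element ij equals element ji.
            if abs(value - m[j][i]) > tol:
                return False
        # Reflexivity: the main diagonal is filled with 1.
        if abs(m[i][i] - 1) > tol:
            return False
    return True


if __name__ == "__main__":
    print(is_similarity_matrix([[1, 0.5], [0.5, 1]]))  # passes all five checks
    print(is_similarity_matrix([[1, 0.5], [0.4, 1]]))  # not symmetric
    print(is_similarity_matrix([[1, 2], [2, 1]]))      # not bounded by 1
```

Boundedness already implies non-negativity, so a single range test covers both conditions.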

Note. This information will help those that took the IR Quiz on Matrices to realize how well they did.

Computing Co-Occurrence Matrices with Excel

June 5, 2009 by E. Garcia

The QA column of the current issue of IR Watch – The Newsletter features the following question:

Question: In Excel, how do you convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?

Answer:

Let A be a matrix populated with term occurrences (frequencies).
Let A^T be its transpose.

Then, T = AA^T is a term-term co-occurrence matrix, and D = A^TA is a document-document co-occurrence matrix.

The following table emulates an Excel spreadsheet.

 

      A           B     C     D
1     A =         d1    d2    d3
2     t1          0     1     0
3     t2          0     0     1
4     t3          1     1     1
5
6     T = AA^T    t1    t2    t3
7     t1          1     0     1
8     t2          0     1     1
9     t3          1     1     3
10
11    D = A^TA    d1    d2    d3
12    d1          1     1     1
13    d2          1     2     1
14    d3          1     1     2

In the table, T was computed by selecting a destination array (B7:D9), entering in its first empty cell (B7) the formula =MMULT(B2:D4,TRANSPOSE(B2:D4)), pressing the F2 key and then the Ctrl+Shift+Enter keys.

Similarly, D was computed by selecting a destination array (B12:D14), entering in its first empty cell (B12) the formula =MMULT(TRANSPOSE(B2:D4),B2:D4), pressing the F2 key and then the Ctrl+Shift+Enter keys.

That was easy!
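The same two products can be reproduced outside Excel. Here is a minimal pure-Python sketch using the occurrence matrix from the table; the helper names transpose and matmul are hypothetical stand-ins for Excel’s TRANSPOSE and MMULT.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]


# Term-document occurrence matrix: rows t1..t3, columns d1..d3.
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 1, 1]]

T = matmul(A, transpose(A))  # term-term co-occurrence matrix
D = matmul(transpose(A), A)  # document-document co-occurrence matrix

print("T =", T)
print("D =", D)
```

Both results are symmetric by construction, since the transpose of AA^T is AA^T itself, and likewise for A^TA.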

Note that none of these are similarity matrices. Can you tell why?

IRW-2009-6:Hackers: Taxonomy & Writing Styles

June 1, 2009 by E. Garcia

The current issue of IRW should reach subscribers’ inboxes during the day or, at the latest, tomorrow.

In this issue:

  • Featured article: Hackers: Taxonomy and Writing Styles
    Due to the increasing interest in developing Information Retrieval and Data Mining courses at the intersection of Information Security, this issue of the newsletter covers a brief taxonomy on hackers and their writing styles.
  • QA: Excel Matrix Multiplications: How to convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?
  • Vacuum Tubes & Transistors Historical
  • Who is Who in IR: Thomas K. Landauer
  • Top CS Departments: Dartmouth College
  • Outstanding Graduate Theses
  • Calls and Events
  • IR Blogs
  • and more…

On Term Repetition and Local Models

May 27, 2009 by E. Garcia

I’m putting together a piece on several local term weight models. It should be ready in a few weeks.

It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.

The article touches on some aspects of the problem of trusting models that lack attenuation. Here is one snippet on the subject:

<last nail in KD coffin  style=”intensity:100%;”>

“It should be stressed that term repetition not necessarily satisfies users’ queries nor is evidence of:

 Pertinence (P); e.g., that a term repeated x times is x times more pertinent to the document.

Aboutness (A); e.g., that the document is x times more about the term.

Importance (I); i.e., that there is a term-document relationship of pertinence and aboutness.

Relevance (R); i.e., that a document repeating a term x times is x times more relevant.

Accordingly, fulfilling such ‘PAIR criteria’ on a regular basis is hard to accomplish with any model that lacks of attenuation.”

</last nail in KD coffin>
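To make the idea of attenuation concrete, the Python sketch below contrasts a raw term count with a logarithmically damped local weight. The log form 1 + ln(f) is a common textbook choice used here as an assumption; it is not necessarily one of the models derived in the article, and the function names are hypothetical.

```python
import math


def raw_weight(frequency):
    """Local weight with no attenuation: repeating a term x times counts x times."""
    return float(frequency)


def log_weight(frequency):
    """Local weight with logarithmic attenuation: 1 + ln(f) for f > 0, else 0."""
    return 1.0 + math.log(frequency) if frequency > 0 else 0.0


if __name__ == "__main__":
    print(f"{'f':>5} {'raw':>8} {'attenuated':>11}")
    for frequency in (1, 2, 10, 100):
        print(f"{frequency:>5} {raw_weight(frequency):>8.2f} "
              f"{log_weight(frequency):>11.2f}")
```

With the raw model a term repeated 100 times weighs 100 times more than a single occurrence, while the damped model grows far more slowly, so sheer repetition buys much less.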