Archive for the ‘Data Mining’ Category
July 10, 2009
The QA column of the current issue of IR Watch Newsletter has a great question that might help IR, CS, and stats students.
Q: Centering Data with Excel- In Excel, how do you center a data set?
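A: To center a data set, use the STANDARDIZE function, which converts x values into z-scores; i.e.
z = (x - a)/s
where a and s are, respectively, the arithmetic mean and standard deviation of the data. The formula below uses STDEV, so s here is the sample standard deviation. Note that z-scores are both centered (mean 0) and scaled (standard deviation 1). The following table emulates an Excel spreadsheet.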
| | A | B | C |
| 1 | Age, x(A) | Weight, x(W) | Height, x(H) |
| 2 | 64 | 57 | 8 |
| 3 | 71 | 59 | 10 |
| 4 | 53 | 49 | 6 |
| 5 | 67 | 62 | 11 |
| 6 | 55 | 51 | 8 |
| 7 | 58 | 50 | 7 |
| 8 | 77 | 55 | 10 |
| 9 | 57 | 48 | 9 |
| 10 | 56 | 42 | 10 |
| 11 | 51 | 42 | 6 |
| 12 | 76 | 61 | 12 |
| 13 | 68 | 57 | 9 |
| 14 | | | |
| 15 | z(A) | z(W) | z(H) |
| 16 | 0.14 | 0.62 | -0.44 |
| 17 | 0.92 | 0.92 | 0.61 |
| 18 | -1.09 | -0.55 | -1.49 |
| 19 | 0.47 | 1.36 | 1.14 |
| 20 | -0.86 | -0.26 | -0.44 |
| 21 | -0.53 | -0.40 | -0.97 |
| 22 | 1.59 | 0.33 | 0.61 |
| 23 | -0.64 | -0.70 | 0.09 |
| 24 | -0.75 | -1.58 | 0.61 |
| 25 | -1.31 | -1.58 | -1.49 |
| 26 | 1.47 | 1.21 | 1.67 |
| 27 | 0.58 | 0.62 | 0.09 |
Rows 2 – 13 contain the data set x(A), x(W), and x(H). In rows 16 – 27 the set was centered by typing in cell A16 the formula
=STANDARDIZE(A2,AVERAGE(A$2:A$13),STDEV(A$2:A$13))
Pasting this formula in cells A16 through C27 centers the data set. That was easy!
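For readers who want to check the spreadsheet outside Excel, here is a minimal Python sketch of the same calculation for the Age column. The function name `standardize` is made up for this illustration; it assumes the sample standard deviation (n - 1 in the denominator), which is what Excel's STDEV computes.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores, using the sample standard deviation like Excel's STDEV."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(x - avg) / sd for x in values]


# Age column, spreadsheet rows 2 to 13
age = [64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68]
z_age = standardize(age)

# Print rounded to two decimals for comparison with rows 16 to 27
print([round(z, 2) for z in z_age])
```

The z-scores of a column should average to zero and have a sample standard deviation of one, which is a quick sanity check on any column.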
Posted in Data Mining, Newsletters | Leave a Comment »
July 6, 2009

The current issue of IRW, the newsletter, is out.
Featuring Article:
Data Mining Texting
TTMD OMG MOS CU
“My parents send email, I text.” This illustrates the obvious: a digital divide between parents and teens. While parents are busy replying to email or blogging at the most, their kids are probably busy developing their own language to alert their peers when mom or dad is trying to figure out what they are texting about. Did you know that MOS CU means ‘Mother over shoulder. See you’? And how about PW CUL? (‘Parents watching. See you later’).
Indeed… Texting is not just for teens:
Texting is not only revolutionizing the way business is conducted in 2009, but is also an emerging data mining playground. The number of behavioral patterns connected with texting is on the rise on several fronts: from sexting and sextcasting (transmission of conversations, videos, photos with sexual content) to dealing (transmission of conversations in connection with illegal drug activities), to encoding conversations about Wall Street transactions, industrial espionage, and so forth.
Posted in Data Mining, Newsletters | Leave a Comment »
June 23, 2009
Back in April, I mentioned that Microsoft will be co-launching with Interamerican University of Puerto Rico, Metropolitan Campus the Microsoft Innovation Center (MIC) of Puerto Rico.
Well, tomorrow is the official inauguration. The university has generously provided me with lab and office space to start an interesting research project within the MIC building. This is exciting news. I cannot comment much about the project, except to say that it is at the interface of search engines, social networks, and information security.
It looks like I will have my hands full between working at two universities, blogging, and doing consulting work.
Posted in Data Mining | Leave a Comment »
May 25, 2009
What is the (^H^H^H) best definition for data mining and database? It depends on who you ask and in which context.
According to Section 126 of the USA PATRIOT Improvement and Reauthorization Act of 2005,
(1) DATA-MINING- The term `data-mining’ means a query or search or other analysis of one or more electronic databases, where
(A) at least one of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement;
(B) the search does not use personal identifiers of a specific individual or does not utilize inputs that appear on their face to identify or be associated with a specified individual to acquire information; and
(C) a department or agency of the Federal Government is conducting the query or search or other analysis to find a pattern indicating terrorist or other criminal activity.
(2) DATABASE- The term `database’ does not include telephone directories, information publicly available via the Internet or available by any other means to any member of the public, any databases maintained, operated, or controlled by a State, local, or tribal government (such as a State motor vehicle database), or databases of judicial and administrative opinions.
Asking the government or a KDDM researcher the above question and using LSI to cluster the results can be a futile exercise.
It is like asking President Obama and Vice President Cheney to agree on: “What is Torture?”
Posted in Data Mining | Leave a Comment »
April 8, 2009

The current issue of IRW should be in subscribers’ inboxes today or tomorrow, at the latest.
In this issue of the newsletter we cover Rich Internet Applications (RIAs) and how these can be used for Web/Data Mining. A RIA is a browser-independent application that can be compiled and run from the desktop.
In this issue:
Featuring article: Web & Data Mining with RIAs
QA: Recommended RIAs
Who is Who in IR: Bruce Croft
Top CS Departments: UMass, Amherst
Historical Notes: John von Neumann and Bugs
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…
IRW currently reaches a fine audience of university and government researchers and their labs. If you are a graduate student or IR practitioner and want to be known within this exclusive circle, submit a short article (2 to 3 pages, IRW format, free from marketing and sales pitches) for consideration.
Posted in Data Mining, Newsletters | Leave a Comment »
March 17, 2009
I’m reading with great interest the paper
Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, and Ning Xiong, presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management, pp. 305-314, COMADEM 2007, Faro, Portugal.
I’m happy to read they referenced our Tutorial on Cosine Similarity Measures. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.
Their abstract follows:
ABSTRACT
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.
Posted in Data Mining, Machine Learning, Vector Space Models | Leave a Comment »
March 5, 2009
I am presenting at The Seminario Interuniversitario de Investigación en Ciencias Matemáticas (Interuniversity Seminar on Mathematical Sciences Research, SIDIM).
This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (http://sidim2009.uprr.pr/)
This year SIDIM will be held at the University of Puerto Rico, Rio Piedras, on March 6-7, 2009. The SIDIM program and book of abstracts are available at http://sidim.uprh.edu/libroSIDIM2009.pdf
I will be presenting new research work on IDF and a new model for the conditional specificity of terms. If you have followed previous posts on the topic of inverse document frequency, now you will understand why I have dissected the topic several times. Thank you all for your private comments and feedback on the topic.
My abstract follows:
Scaled Inverse Document Frequency: A Model for the Evaluation of the Conditional Specificity of Query Terms in Search Engine Collections
Edel Garcia, Internet Business Development Center, Interamerican University of Puerto Rico, Metropolitan Campus
Inverse document frequency (IDF) is a measure of the specificity of query terms over a collection of D number of documents that has been successfully incorporated into numerous vector space information retrieval models. Since these models assume term independence, the specificity of a given term, present in different queries, is assumed to be unique and independent from other query terms. To the best of our knowledge, there are no known models that condition the specificity of terms to the presence of other terms in a query.
This paper proposes a new measure called scaled inverse document frequency (SIDF) which evaluates the conditional specificity of query terms over a subset S of D and without making any assumption about term independence. S can be estimated from search results, OR searches, or computed from inverted index data. We have evaluated SIDF values from commercial search engines by submitting queries relevant to the financial investment domain. Results compare favorably across search engines and queries. Our approach has practical applications for `real-world’ scenarios like in Web Mining, Homeland Security, and keyword-driven marketing research scenarios. SIDF can be incorporated into a variety of information retrieval models as a global weight scoring system.
Keywords: inverse document frequency, conditional term specificity, web mining, search engines
Posted in Conferences, Data Mining, Homeland Security, Queries, Vector Space Models | 4 Comments »
March 4, 2009
Unit vectors are frequently used in information retrieval and data mining studies because they simplify further calculations and analyses.
In the current issue of IR Watch, we show how easy it is to convert column vectors into unit vectors with Excel. It is assumed you know how to define spreadsheet arrays in Excel and how to enter formulas in them.
Say we have two vectors in columns A and B each with four elements. To convert these into unit vectors, do this:
1. In cell C1, enter the formula =A1/(SQRT(SUMSQ(A$1:A$4)))
2. Copy the content of C1 into cell D1. Because the column reference is relative, this creates a modified instance of the formula that refers to column B.
3. Copy cells C1 and D1 and paste them into the remaining empty cells of these columns by selecting these at once. This also creates modified instances of these formulas, one per row.
C and D columns represent the unit vectors.
A figure with a step-by-step example is given in IRW (free subscription)
Below is another example, but with the final results.
| A | B | C | D |
| 1 | 8 | 0.13 | 0.36 |
| 2 | 10 | 0.26 | 0.45 |
| 4 | 12 | 0.53 | 0.53 |
| 6 | 14 | 0.79 | 0.62 |
That was easy!
If you use the first row to label columns, as in this example, be sure to readjust the formulas so that they start at row 2 and run through row 5 (for example, A$2:A$5).
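The same normalization can be sketched in a few lines of Python, which is handy for checking the spreadsheet results. The function name `to_unit_vector` is an illustrative choice; the two columns are the A and B values from the example table.

```python
import math


def to_unit_vector(column):
    """Divide each element by the Euclidean length, like A1/SQRT(SUMSQ(A$1:A$4))."""
    norm = math.sqrt(sum(x * x for x in column))
    if norm == 0:
        raise ValueError("the zero vector cannot be converted into a unit vector")
    return [x / norm for x in column]


col_a = [1, 2, 4, 6]
col_b = [8, 10, 12, 14]

unit_a = to_unit_vector(col_a)  # corresponds to column C
unit_b = to_unit_vector(col_b)  # corresponds to column D

# Print rounded to two decimals for comparison with the table
print([round(x, 2) for x in unit_a])
print([round(x, 2) for x in unit_b])
```

A quick check is that the squares of each resulting column sum to 1.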
If you still have questions on how to do this, email me or subscribe to IRW.
Posted in Data Mining, IR Tutorials, Newsletters, Vector Space Models | 1 Comment »
March 2, 2009

The current issue of the IRW newsletter is available now.
In this issue:
Featuring article: Data Mining Dates
QA: Excel Vector Normalization
Who is Who in IR: Stephen Robertson
Top CS Departments: School of Informatics, City University, London
Historical Notes: Mark and Colossus Computers
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…
The abstract of the featuring article is given below.
In this issue of the newsletter we examine the extraction of intelligence from dates. At first, a discussion on dates seems an unnecessary exercise. After all, many are inclined to take dates at face-value. But a date is more than a one-liner of information extracted from a calendar, headline, or footer. In the intelligence community, for example, dates provide a great amount of information about events, people, organized crime, terrorism, money laundering, unexpected situations, accidents, plots, chains of custody, validations, etc. Indeed, a date is a unique form of metadata, not to mention that these can be either relative or absolute. They can also be part of encryption schemes.
Posted in Data Mining, Newsletters | Leave a Comment »
February 25, 2009
I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as a measure of the discriminating power of a term is not enough. Gray writes:
According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word “round,” for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier “only.” He then showed that by placing the word “only” in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like “never,” “should,” “nothing,” and “usually” are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word “nothing” was a favorite word often used by Lewis Carroll.
Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *
This is where IDF as a term specificity measure breaks down. This problem is intimately linked to The Original Sin of IR models: The Term Independence Assumption.
* I have slightly modified this question to make the point clearer.
References
http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/
Hill, I.D., “Wouldn’t it be nice if we could write computer programs in ordinary English – or would it?” The Computer Bulletin, June 1972, pp. 306-312.
Neumann, Peter G., “Only his Only Grammarian Can Only Say What Only He Means,” ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.
Posted in Data Mining, Queries, Vector Space Models | Leave a Comment »
February 20, 2009
Please read Part I, Part II, and Part III before reading this post.
I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur’s book Discrete Dynamical Systems (Oxford, 1990).
1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?
2. Consider the model of glottochronology. Assume a language is given today.
(a) How long will it take for 1/4 of the words to change?
(b) How long will it take for 10 per cent of the words to change?
3. Suppose that person A knows 60 per cent of a list of 1000 words, person B knows 70 per cent of that list, and person C knows 30 per cent of that list.
(a) How many words do you expect all three people know?
(b) What per cent of the words is known by A and B but not by C?
Hints:
Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle to A, B, and C.
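As one possible way to set up the hints numerically, here is a short Python sketch for Problems 2 and 3. It assumes the single-language model A(k) = (0.805)^(0.001 k) from the earlier posts and the multiplication principle for independent events. The function names are invented for this sketch. Problem 1 follows the same pattern with the two-culture formula.

```python
import math

R_1000 = 0.805  # fraction of a word list retained after 1000 years


def years_until_fraction_retained(fraction, r_1000=R_1000):
    """Solve r_1000 ** (k / 1000) = fraction for k (single-language model)."""
    return 1000.0 * math.log(fraction) / math.log(r_1000)


def joint_fraction(*fractions):
    """Multiplication principle: product of the individual fractions."""
    product = 1.0
    for p in fractions:
        product *= p
    return product


# Problem 2: 1/4 changed means 0.75 retained; 10 per cent changed means 0.90 retained
quarter_changed = years_until_fraction_retained(0.75)
tenth_changed = years_until_fraction_retained(0.90)

# Problem 3(a): expected number of the 1000 words known by A, B, and C
all_three = joint_fraction(0.60, 0.70, 0.30) * 1000

# Problem 3(b): known by A and B but not by C (C does not know 1 - 0.30)
a_and_b_not_c = joint_fraction(0.60, 0.70, 1 - 0.30)

print(quarter_changed, tenth_changed, all_three, a_and_b_not_c)
```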
Posted in Data Mining, Queries, Vector Space Models | Leave a Comment »
February 19, 2009
Please read Part I and Part II before proceeding with this post.
Applications to cultures
Sandefur provides the following example:
Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, A and B. At time 0, they have the same language, so that for a given list of L words, A(0) = B(0) = 1 is the per cent of the words they both know (and have in common).
If we contact each of these cultures k years later, culture A will know A(k) = (0.805)^(0.001 k) per cent of the original list, and culture B will know B(k) = (0.805)^(0.001 k) per cent. Thus the per cent of the words that both cultures know is, by the multiplication principle,
Q = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k)
What the glottochronologist does now is to construct a list of words. From that list of words, the two cultures are studied and it is determined what per cent Q of this list of words is known by both cultures. Thus in the equation for Q above, Q is known, but k, the number of years since the two cultures separated, is unknown. Solving for k gives
k = 500 ln Q / ln 0.805
To understand the significance of this ratio we need to look at some examples.
Examples
Sandefur provides several examples.
Suppose that the natives of two islands have similar language. From a list of 300 words, 180 words are understood by both groups, that is Q = 180/300 = 0.6. Then
k = 500ln 0.6/ln 0.805 = 1177.5
We then conclude that the natives of these two islands came from a common ancestry, approximately 1200 years ago.
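The dating formula follows from taking logarithms of Q = (0.805)^(0.002k): ln Q = 0.002 k ln 0.805, so k = 500 ln Q / ln 0.805. A minimal Python sketch of this calculation, with a function name chosen only for illustration, reproduces the island example:

```python
import math


def years_since_split(shared_fraction, r_1000=0.805):
    """Solve Q = r_1000 ** (0.002 * k) for k, given the shared fraction Q."""
    if not 0.0 < shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in the interval (0, 1]")
    return 500.0 * math.log(shared_fraction) / math.log(r_1000)


# 180 of 300 words are understood by both groups
k = years_since_split(180 / 300)
print(round(k, 1))
```

Note how sensitive k is to Q: since Q enters through a logarithm, small disagreements over which words count as shared shift the estimate by many years.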
Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation n for pairs of tribes in each geographical region. It can be argued that the region with the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its geographical location, but the largest time separation will always be the two that remained in the original area.
Drawbacks and Pitfalls of Glottochronology
The model relies on independence assumptions (see the discussion of the multiplication principle); that is, events co-occur by chance. But we know that
If p(A1A2) = p(A1)p(A2) event co-occurrence is by chance.
If p(A1A2) > p(A1)p(A2) event co-occurrence is more than by chance.
If p(A1A2) < p(A1)p(A2) event co-occurrence is less than by chance.
One way terms deviate from independence is through their semantics (meaning). If the meanings of words change over time, how do we know whether all words in a word list change by the same amount?
As noted by Sandefur
…how do you determine if a word is the same for two cultures? If the spelling of a word or the pronunciation of a word changes ‘slightly’, we will still count it as being on the list. If the meaning of a word changes ‘significantly’, we will delete it from the list. Thus, there is some subjectivity in determining Q which could drastically change the results. Also some words are more likely to change than the others. But in the multiplication principle, we tacitly assumed that all words were equally likely to change. This can throw the results off.
The moral of this is that you need to be careful not to make more claims about your model than are justified.
Posted in Data Mining, Queries, Vector Space Models | 1 Comment »
February 18, 2009
Please read Glottochronology Part I before reading this post.
Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is
p(A1A2) = p(A1)p(A2)
Some authors like Sandefur call this the multiplication principle.
As Sandefur noted, and I quote:
Suppose two people each ‘know’ a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list L and Sue knows 80 per cent of list L, where L contains 100 words. Given any random sublist of words from list L, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.
Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank’s 70 words, that is, 56 of Frank’s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.
Multiplication principle: suppose person A knows P per cent of a list of L words and person B knows Q per cent of the same list of L words (where P and Q are given as decimals). Given no additional information, we would expect A and B to both know PQ per cent of the words.
In Part III we will provide some examples that apply this principle to the evolution of cultures.
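The Frank and Sue example can be written as a tiny Python sketch. The function name is hypothetical, and the calculation is only valid under the independence assumption discussed in this series.

```python
def expected_common_fraction(*fractions):
    """Multiplication principle: expected fraction of words known by everyone."""
    product = 1.0
    for p in fractions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("fractions must be given as decimals between 0 and 1")
        product *= p
    return product


frank, sue = 0.70, 0.80
list_size = 100

common = expected_common_fraction(frank, sue)
print(round(common * list_size))  # expected number of words both know
```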
Later in this series we will explain how the independence assumption affects some of the reasonings and claims behind language dating models.
In the meantime: How relevant is this model to IR? Well, assume that A and B are not Franks and Sues, but passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.
Posted in Data Mining, Queries, Vector Space Models | 2 Comments »
February 16, 2009
This week I will be posting on glottochronology, although not back-to-back.
Glottochronology is a combination of Greek terms which essentially means language dating.
Looking at some of my “old” collection of books on applied Chaos and Fractals from the ’90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated a few pages to the topic in his great book Discrete Dynamical Systems (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.
Sandefur wrote:
We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of L words (say L=250). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.
Let one unit of time be 1 year. Thus, time n will be n years. Let A(n) represent the per cent of the original list of words still in use n years later, given as a decimal. The basic assumption is that the percent A(n+1) of the original list of words in use at time n+1 is proportional to the per cent of the original list of words in use at time n, that is,
A(n+1) =rA(n),
where r is a positive constant less than one. At time 0, all of the original list of words is in use, so A(0)=1. Therefore, at time k, A(k)=r^k(1) = r^k is the percent of the original list of words still in use, as a decimal.
Since languages change slowly, r should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millennium ago, glottochronologists can estimate r^1000. This number r also depends on the particular language. But glottochronologists have found that the number r^1000 is usually close to 0.805. So for languages with no written history, that is, for languages in which we cannot estimate r, we will assume that
r^1000 = 0.805
Thus, the per cent of the original list of words that are still in use k years later is
r^k = (0.805)^(0.001 k).
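To get a feel for the numbers, the decay model can be sketched in Python. The names below are illustrative; the only input taken from the text is r^1000 = 0.805.

```python
R_1000 = 0.805  # fraction of the original word list still in use after 1000 years


def fraction_retained(years, r_1000=R_1000):
    """A(k) = r**k = r_1000 ** (0.001 * k), as a decimal."""
    return r_1000 ** (years / 1000.0)


# The implied yearly retention rate r is very close to 1,
# which is why it is hard to estimate from year-by-year comparisons.
annual_r = R_1000 ** (1.0 / 1000.0)

for years in (0, 100, 500, 1000, 2000):
    print(years, round(fraction_retained(years), 3))
```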
Glottochronology is one of those fields that were once popular but that many now doubt, due to questionable measurements and assumptions. One of those assumptions is term independence.
It seems that term independence is The Original Sin in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I’m working on a paper presentation on the subject.
Posted in Data Mining, Queries, Vector Space Models | 3 Comments »
February 9, 2009
As part of ongoing research, I’m building a search engine with on-demand query reduction capabilities. To our knowledge, none of the current commercial search engines provides such features.
Experimental machines that do this require the use of training sets, decision graphs and decision trees. For references on this topic, read
Query Expansion and Query Reduction in Document Retrieval
A Two-Step Approach for Tree-structured XPath Query Reduction
Unfortunately, these types of search engines are not popular, in part because they are not practical at the scale of the Web and because they require retraining of both the search engine and its users, not to mention that these types of search machines are not exactly user-friendly.
Think about this: In general, average users are lazy searchers. They are too busy to do either query expansion or query reduction as we do in IR, nor are they prone to consult lookup lists, thesauri, query logs, etc. to refine their searches while surfing across databases. At any given point in time of the year the mentality of non-IR searchers is: “Don’t make me think”.
Thus, building a search engine that does on-demand query reductions for the Web (and that users will use without being forced to think) is not that easy.
We would like to hear of others working on similar research as we believe we have found a promising solution, at least partially. Ours is different from the approaches given in the above two references.
Posted in Data Mining, Queries | 2 Comments »
February 4, 2009
The title of this post might be a bit confusing, but we couldn’t find a better choice of words. The point to be made is that definitions and associations of terms can be affected as events evolve in time.
Consider the key term [man].
Providing a meaning or perception for [man] during a good or a bad Economy is a good example.
[man] means something different depending on whether you ask an employed or an unemployed [man].
This can be illustrated by reading Why losing a job can hurt men more.
Although it might be quite a depressing article (especially for those with a pink slip on their foreheads), note the key words/phrases that define its topic and overall semantics. Incidentally, the article starts with
“Thomas Schuler is a man.”
and almost at the end says:
“A man is what he does.”
The key words/phrases of the article can be taken as a semantic state in time attached to the [man] key term.
Posted in Data Mining, Latent Semantic Indexing, Queries | Leave a Comment »
February 2, 2009

The current issue of IR Watch – The Newsletter will be available during the day. It consists of the following sections.
Featuring Article: Data Mining Credit Cards
In this issue of the newsletter we cover Luhn’s Algorithm, also known as the Modulus 10 or Mod-10 Test. This algorithm is used for data mining and validation of credit card numbers. Credit card fraud is a topic that never goes away.
QA: Types of Links
What is the difference between in-links, out-links, co-citation, and co-reference?
Historical Notes: The Whirlwind Project
Top CS: State University of New Jersey, Rutgers
Who is Who in IR: Tefko Saracevic
Graduate Theses
Data Mining Blogs
and more.
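For readers unfamiliar with the Mod-10 test mentioned above, here is a minimal Python sketch of the check as it is commonly described. It is not the newsletter's own code; the function name is made up, and the sample numbers are widely circulated test values, not real card numbers.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn (Mod-10) check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk the digits from right to left, doubling every second one.
    for position, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit

    return total % 10 == 0


print(luhn_is_valid("79927398713"))
print(luhn_is_valid("79927398710"))
```

Passing the check only means the number is well formed; it says nothing about whether an account exists, which is why the test is a filter for data mining and validation rather than a fraud detector on its own.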
Posted in Data Mining, Hacking, Newsletters | Leave a Comment »
January 30, 2009
We are currently doing some testing with a new experimental engine. The experiment consists of using OR as the default mode and IDF-only for scoring terms. IDF is precomputed straight from the inverted index which is also computed at query time. We are also trying to replace IDF with Entropy scores.
With large collections, the inverted index is written to a text file and read at query time.
Since local information (e.g., term freq) is ignored, keyword spam is not an issue.
Instead of a Vector Space Model, we use a cumulative sum of IDF scores, so that it is not necessary to compute cosine similarities (*).
So far, the result of the experiment is that with multi-term queries two extreme clusters are obtained:
1. the top N ranked documents almost behave as being queried in AND mode and as obeying the Cluster Hypothesis.
2. the M ranked documents at the bottom behave as being queried either in EXACT mode or with a single-term query. (**)
Between these extremes we have some noisy results.
If anyone has tried this before, we would love to hear about it. Contact us by email.
PS.
(*) In this way we don’t need to make independence assumptions.
(**) With few changes, M now behaves as being queried with single-term queries or few query terms, which is what we expected. The N set still is the more interesting. The middle cases are now quite noisy.
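To make the scoring idea concrete, here is a toy Python sketch of OR-mode retrieval with a cumulative IDF-only score. It is not the experimental engine itself: the tiny collection, the function names, and the choice of IDF = log(N/df) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of documents containing it (term frequency is ignored)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def idf(term, index, n_docs):
    """Assumed IDF form: log(N / df); terms absent from the index get 0."""
    df = len(index.get(term, ()))
    return math.log(n_docs / df) if df else 0.0


def or_search(query, index, n_docs):
    """OR mode: any document matching at least one query term is scored."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        weight = idf(term, index, n_docs)
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight  # cumulative sum of IDF scores
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical four-document collection; d2 repeats a keyword on purpose.
docs = {
    "d1": "credit card fraud",
    "d2": "credit credit credit card",
    "d3": "card games",
    "d4": "fraud report",
}
index = build_inverted_index(docs)
results = or_search("credit fraud", index, len(docs))
print(results)
```

In this toy collection the document matching both query terms ranks first, and repeating a keyword in d2 does not raise its score, since only the presence of a term counts.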
Posted in Data Mining, IR Tools, Queries | 1 Comment »
January 27, 2009
MP3 Confidentials: This morning I saw a technology news story on CNN about how military records, including the names, SSNs, phone numbers, etc. of soldiers, were discovered stored in an MP3 player. According to the story,
Chris Ogle of New Zealand was in Oklahoma about a year ago when he bought a used MP3 player from a thrift store for $9. A few weeks ago, he plugged it into his computer to download a song, and he instead discovered confidential U.S. military files.
“The more I look at it, the more I see, and the less I think I should be,” Ogle said with a nervous laugh in an interview with TVNZ.
The files included the home addresses, Social Security numbers and cell phone numbers of U.S. soldiers. The player also included what appeared to be mission briefings and lists of equipment deployed to hot spots in Afghanistan and Iraq.
Pentagon officials told CNN that they are aware of the MP3 player, but can’t talk about it until investigators confirm that the information came from the U.S. Department of Defense.
“The government isn’t doing a good job of protecting the information that it collects,” said Marc Rotenberg of the Electronic Privacy Information Center in Washington.
Despite government efforts to protect sensitive information, this is a growing problem, privacy experts say.
Two years ago, the Department of Veterans Affairs lost track of a laptop with the personal information of millions of soldiers. And computer hard drives with classified military information have been found for sale at street markets in Afghanistan.
“When you can identify American personnel, when you have their names, their home address, their cell phone numbers, you put people in a dangerous position,” Rotenberg said.
It might be time to cover data mining of MP3 Players.
Posted in Data Mining, Homeland Security | Leave a Comment »
January 22, 2009
We learned about this news from a business associate:
According to USAToday, Heartland Payment Systems (HPY) on Tuesday disclosed that intruders hacked into the computers it uses to process 100 million payment card transactions per month for 175,000 merchants.
In IRW – The Newsletter, we have covered data mining of VINs, SSNs, web analytic frauds, and email headers. It might be time to cover credit card mining so readers will understand the risks involved when servers, even test servers, are not properly secured or supervised.
Data Mining at the intersection of Information Retrieval, Business Intelligence, and Information Security is here to stay.
Posted in Data Mining, Hacking, Newsletters | Leave a Comment »
January 2, 2009

This is a sneak preview of the current issue of IRW The Newsletter.
In this Issue:
Featuring article: Data Mining Email Headers – and Senders and ISPs
Questions/Answers: EXCEL Matrix Multiplications
Who is Who in IR: William S. Cooper
Top CS Departments: UC, Berkeley
Historical Notes: ABC and Z3 Computers
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more…
Featuring article abstract:
In this issue of the newsletter we reproduce written material relevant to the mining of email headers, senders, and ISPs. Spammers, viruses, and hackers love to fake email headers as these provide setting information that might serve as entry points around which a strategy could be crafted. Intelligence data miners also love email headers. Since these can be faked, they can be used to encode hidden messages. Actually, email steganography is an exciting area.
Posted in Data Mining, Newsletters | Leave a Comment »
December 30, 2008
Although it is not a new architecture, UNESCO’s CDS/ISIS platform is a good choice for those interested in hands-on experience setting up and running an IR system.
According to the pdf documentation,
CDS/ISIS is a menu-driven generalized Information Storage and Retrieval system designed specifically for the computerized management of structured non-numerical data bases. One of the major advantages offered by the generalized design of the system is that CDS/ISIS is able to manipulate an unlimited number of data bases each of which may consist of completely different data elements. Although some features of CDS/ISIS require knowledge of and experience with computerized information systems, once an application has been designed the system may be used by persons having had little or no prior computer experience.
Another toy to play with during 2009.
PS. I changed the title to “free text searching” for accuracy.
Posted in Data Mining | Leave a Comment »
December 29, 2008
Earlier this year, students of my graduate course (Search Engines Architecture) used Terrier, an experimental search engine, in their lab lessons. I am still using Terrier for indexing and testing.
A few days ago Craig Macdonald from the University of Glasgow sent me this new Terrier update. It sounds great, although I haven’t tested it yet.
Terrier, IR Platform v 2.2 – 23/12/2008
http://ir.dcs.gla.ac.uk/terrier/
Terrier 2.2, the next version of the open source IR platform from the University of Glasgow (Scotland) has been released.
This is a substantial update, which includes new support for Hadoop, primarily a Hadoop Map Reduce indexing system, allowing large collections of documents to be indexed in a highly distributed fashion. Also included are various minor improvements, including improved support for the IIT CDIP1 (TREC Legal track) collection, and various bug fixes. This is intended to be the ultimate release in the 2.x series.
Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html
This will be my new toy to play with in 2009.
Posted in Data Mining, Web Mining Course | Leave a Comment »
December 17, 2008
Nobody believes Madoff acted alone in his $50 billion Ponzi fraud that is unfolding on Wall Street.
After reading this Marketwatch article on how the SEC plans to investigate the Madoff scandal, the perception is that SEC officials were aware of it and that some SEC names somehow were part of it or had conflicts of interest.
Many now believe the SEC should not be part of any investigation. As more cans of worms are opened, for sure big companies will be named as accessories.
It appears that Madoff eluded the system, in part because he did his workarounds in the conventional way, avoiding any questionable electronic paper trail. That might be true for Madoff as the primary source, but not for secondary sources.
I bet you that as the story and investigations move forward, email mining and document data mining will be put to work. This is a great opportunity for IR researchers to put new and experimental tools to the test.
Remember the Enron Scandal and how data mining became the smoking gun?
Posted in Data Mining | Leave a Comment »
December 12, 2008
Here is a great quiz for IR students (2 points each).
Explain and give an example of each of the following ways of searching:
1. horizontal
2. vertical
3. local
4. global
5. proximity
6. adjacency
7. smart
8. expanded
9. advanced
10. relevance feedback
11. remote
12. fusion-based
13. steepest ascent
14. steepest descent
15. binary
16. sequential
17. random
18. deterministic
19. recycled
20. agglomerative
21. constrained
22. contextual
23. regexp
24. boolean
25. exact
26. semantic-based
27. ontology-based
28. thesauri-based
29. log-based
30. popularity-based
31. genetic
32. cellular automata
33. markovian
34. diffusion-based
Posted in Data Mining, Queries | Leave a Comment »
December 4, 2008
The current issue of IRW also covers:
- Henry Freiser’s Pointer Function for visualizing all real roots of a polynomial.
- Vannevar Bush’s first computers and Bell’s CNC.
- Cyril Cleverdon: the IR Father of Precision and Recall.
- More graduate student CS/IR theses.
- MIT’s CS Department.
- More IR blogs.
- Call for Papers.
Posted in Data Mining, Newsletters | Leave a Comment »
December 3, 2008
In the current issue of IRW we explain why exposing social security numbers (SSNs) online is an enabling crime, one that is relevant to Homeland Security (1). We show that, ironically, government agencies and universities are the leading sources of SSNs exposed on the Web.
We examined how crafting smart queries in Google and other search engines allows users to find incidents wherein SSNs have been released for the entire world to see online. Although nothing new, it is a widespread problem across the Web. It is a shame when administrators at the two offenders above (government agencies and university units) ignore the problem or justify it in the name of what is practical.
We show why the common practice of disclosing the last four digits of an SSN is a very bad idea. With SSN Allocation tables, we can map the first three digits to the region wherein the SSN application was filed, by US State and territory. If the last four digits are known, only the middle two digits need to be guessed. Identity thieves and stalkers might be having a field day.
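To put a number on that reduction, here is a small arithmetic sketch in Python. It follows the simplified reasoning above and assumes the first three digits have been fully pinned down from the allocation table, which will not always hold in practice. The variable names are ours, for illustration only.

```python
# Naive search space: all nine digits unknown.
full_space = 10 ** 9

# Assumption: the first three digits (area) are inferred from the
# allocation table and the last four digits (serial) were disclosed.
# Only the two-digit middle group is left. Group 00 is not a valid
# group number, so the candidates are 01 through 99.
remaining = len(range(1, 100))

# How many times smaller the guessing space has become.
reduction = full_space // remaining

print(remaining, reduction)
```

Under these assumptions the guessing space shrinks from a billion combinations to fewer than a hundred, which is why even partial disclosure matters.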
There is still hope, though. We cover how Northern Michigan University (2) and Johns Hopkins University (3) are proactively becoming part of the solution and not part of the problem. In the case of NMU, they have published a one-year case study outlining the full eradication of SSNs as identifiers from the NMU campus.
References
1. The Homeland Security and Terrorism Threat: From Document Fraud, Identity Theft and Social Security Number Misuse
http://finance.senate.gov/hearings/testimony/2003test/091003pctest.pdf
2. Full Eradication of Social Security Number as an Identifier
http://net.educause.edu/ir/library/pdf/EDU04144.pdf
3. Policy on Social Security Number Protection and Use
http://education.jhu.edu/catalog/academic-policies/policy-on-ssn-protection-and-use/
Posted in Data Mining, Homeland Security, Newsletters | Leave a Comment »
December 1, 2008

This is a sneak preview of IR Watch. In this issue the main article, Identity Thefts through Search Engines, covers quite old and well-known incidents wherein social security numbers have been released for the entire world to see. These are accessible through search engines.
Although not a new problem, disclosing an SSN, even a portion of it, has been labeled an enabling crime. This is a must-read topic for those conducting data mining and web mining at the intersection of information assurance and homeland security.
Ironically, the biggest offenders are government agencies and universities.
During the week, we will blog on other sections of the newsletter.
Posted in Data Mining, Homeland Security, Newsletters | Leave a Comment »
November 21, 2008
Search engines are realizing they can earn more revenue by allowing users to manipulate their indexes. Yahoo! BOSS and Google SearchWiki are efforts in that direction.
Essentially these two search platforms provide search rehashes of whatever the search engines already have in their indexes, but with some features added for re-sorting, editing, and making annotations.
We can see some value in using these as aggregation layers for third-party search technologies and for certain types of searchers prone to personalization. Such searches might be useful for conducting annotated Web Intelligence.
Still these are not platforms designed for searching the Deep Web.
Posted in Data Mining, Machine Learning | Leave a Comment »
October 29, 2008
This post shows the connection between Pearson’s and Spearman’s Correlation Coefficients on one hand and Cosine Similarity and Dot Products on the other. As mentioned in previous posts:
Pearson’s Coefficient is equivalent to Spearman’s Coefficient for raw data that has been transformed into ranks.
http://irthoughts.wordpress.com/2008/08/28/spearman-and-pearson-correlation-coefficients/
Similarity is not Distance.
http://irthoughts.wordpress.com/2008/10/15/seos-and-their-semantic-distance-myths/
The ‘similarity distance’ expression is an oxymoron that should be avoided.
http://irthoughts.wordpress.com/2008/10/20/having-fun-with-oxymorons/
Arbitrary transformations between Distance and Similarity must be avoided.
http://irthoughts.wordpress.com/2007/09/17/binary-similarity-calculator/
It is important to know how to define Distance
http://irthoughts.wordpress.com/2008/10/23/why-defining-distance-is-important/
Transforming Pearson’s Coefficients into Cosine Similarities
A Pearson’s Correlation Coefficient is equivalent to the cosine of the angle between paired data represented as vectors, provided that the raw data is first transformed into z-scores:
z(xi) = (xi – mean_x)/s_x
z(yi) = (yi – mean_y)/s_y
where the mean is subtracted and the difference is normalized with respect to the standard deviation. Thus, we end up with mean-centered, deviation-normalized data. Z-score data is often called ‘centered data’.
An Example
To convince yourself, Pearson’s Correlation Coefficient for the following data is 0.8837…
If you center the data and calculate its cosine similarity, it should be the same.
If the centered data is converted into column unit vectors, the dot product of the vectors is also 0.8837… since for unit vectors cosine angle = dot product
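The same check can be sketched in a few lines of Python. The paired data below is hypothetical (made up for illustration, not the data set referred to above), and the helper names are ours. The sample standard deviation (n - 1) is assumed for the z-scores, although using n instead would not change the cosine.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def zscores(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def unit(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def pearson(x, y):
    """Pearson's r computed directly from the raw data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data, for illustration only.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 6.0, 6.0, 10.0]

zx, zy = zscores(x), zscores(y)
r = pearson(x, y)                    # from the raw data
cos_centered = cosine(zx, zy)        # cosine of the centered data
dot_unit = dot(unit(zx), unit(zy))   # dot product of unit vectors

print(r, cos_centered, dot_unit)
```

The three printed values should agree up to floating-point rounding. Computing the cosine of the raw (uncentered) data instead would generally give a different number, which is the reason for centering first.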
Now try this.
Rank the raw data. Next, center it, and then convert into unit vectors. Convince yourself that in this case
Spearman’s = Pearson’s = Cosine Angle = Dot Product = 0.8000…
Since Similarity is not Distance, these are not Distances.
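The rank-based exercise can be sketched the same way in Python. Again the data is hypothetical: we picked values with no ties whose ranks happen to give 0.8, so this is not the author's data set. The helper names are ours, and the division by the standard deviation is skipped because that scaling cancels once the vector is normalized to unit length.

```python
import math

def ranks(values):
    """Rank from 1 (smallest) upward. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def unit_centered(values):
    """Subtract the mean, then scale the result to unit length."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical paired data with no ties, for illustration only.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [12.0, 7.0, 33.0, 25.0, 60.0]

rx, ry = ranks(x), ranks(y)

# Dot product of the ranked, centered, unit-length vectors.
dot_ranks = dot(unit_centered(rx), unit_centered(ry))

# Classic Spearman formula for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1)).
n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
spearman = 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman, dot_ranks)
```

Both values agree, which illustrates that Spearman's coefficient is Pearson's coefficient applied to ranks, and that it too can be read as a cosine or as a dot product of unit vectors.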
Posted in Data Mining, SEO Myths, Vector Space Models | 2 Comments »