IR Thoughts

~ Thoughts on Information Retrieval, Data Mining, and Search Engines

Monthly Archives: December 2008

UNESCO Free Text Searching

30 Tuesday Dec 2008

Posted by egarcia in Data Mining

Although it is not a new architecture, for those interested in having a hands-on experience setting up and running an IR system, a good choice is UNESCO’s CDS/ISIS platform.

According to the pdf documentation,

CDS/ISIS is a menu-driven generalized Information Storage and Retrieval system designed specifically for the computerized management of structured non-numerical data bases. One of the major advantages offered by the generalized design of the system is that CDS/ISIS is able to manipulate an unlimited number of data bases each of which may consist of completely different data elements. Although some features of CDS/ISIS require knowledge of and experience with computerized information systems, once an application has been designed the system may be used by persons having had little or no prior computer experience.

Another toy to play with during 2009.

PS. I changed the title to “free text searching” for accuracy.

New Terrier Update

29 Monday Dec 2008

Posted by egarcia in Data Mining, Web Mining Course

Early this year, students of my graduate course (Search Engines Architecture) used Terrier, an experimental search engine, in their lab lessons. I am still using Terrier for indexing and testing.

A few days ago Craig Macdonald from the University of Glasgow sent me this new Terrier update. It sounds great, although I haven’t tested it yet.

Terrier, IR Platform v 2.2 – 23/12/2008
http://ir.dcs.gla.ac.uk/terrier/

Terrier 2.2, the next version of the open source IR platform from the University of Glasgow (Scotland) has been released.

This is a substantial update, which includes new support for Hadoop, primarily a Hadoop Map Reduce indexing system, allowing large collections of documents to be indexed in a highly distributed fashion. Also included are various minor improvements, including improved support for the IIT CDIP1 (TREC Legal track) collection, and various bug fixes. This is intended to be the ultimate release in the 2.x series.

Fuller change log at http://ir.dcs.gla.ac.uk/terrier/doc/whats_new.html
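
For readers new to the idea, here is a toy sketch of map/reduce-style indexing in plain Python. It is not Terrier's code and it does not use the Hadoop API; the function names and the two sample documents are made up for illustration. The map step emits (term, document id) pairs and the reduce step groups them into posting lists.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token in a document."""
    for term in text.lower().split():
        yield term, doc_id


def reduce_phase(pairs):
    """Reduce step: group document ids by term to form posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical two-document collection.
docs = {
    1: "terrier indexes documents",
    2: "hadoop indexes large collections",
}

pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["indexes"])
```

In a real distributed setting the map calls would run on different machines and the framework would route all pairs for the same term to the same reducer; here everything runs in one process.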

This will be my new toy to play with in 2009.

Data Mining and Madoff’s 50 Billion Ponzi Fraud

17 Wednesday Dec 2008

Posted by egarcia in Data Mining

Nobody believes Madoff acted alone in his $50 billion Ponzi fraud, which is still unfolding on Wall Street.

After reading this MarketWatch article on how the SEC plans to investigate the Madoff scandal, one is left with the perception that SEC officials were aware of it and that some SEC names were somehow part of it or had conflicts of interest.

Many now believe the SEC should not be part of any investigation. As more cans of worms are opened, big companies will surely be named as accessories.

It appears that Madoff eluded the system, in part because he did his workarounds in the conventional way, avoiding any questionable electronic paper trail. That might be true for Madoff as the primary source, but not for secondary sources.

I bet you that as the story and investigations move forward, email mining and document data mining will be put to work. This is a great opportunity for IR researchers to put new and experimental tools to the test.

Remember the Enron Scandal and how data mining became the smoking gun?

It is that time of the Year

16 Tuesday Dec 2008

Posted by egarcia in Miscellaneous

It is that time of the year:

Layoffs:

It is that time of the year when companies hand out pink slips. Yahoo layoffs started early this month. Sun Micro slashed 6,000, Sony 8,000, and CBS plans to send pink slips at TV.com, MP3.com, CNET, and Gamespot.com.

http://www.newsoxy.com/cbs/article11488.html

True colors showing:

It is that time of the year when companies show their true colors. According to the WSJ, search engines and ISPs are betraying network bandwidth neutrality.

http://online.wsj.com/article/SB122929270127905065.html

IR Quiz: Searching

12 Friday Dec 2008

Posted by egarcia in Data Mining, Queries

Here is a great quiz for IR students (2 points each).

Explain and give an example for each of the following ways of searching:

1. horizontal

2. vertical

3. local

4. global

5. proximity

6. adjacency

7. smart

8. expanded

9. advanced

10. relevance feedback

11. remote

12. fusion-based

13. steepest ascent

14. steepest descent

15. binary

16. sequential

17. random

18. deterministic

19. recycled

20. agglomerative

21. constrained

22. contextual

23. regexp

24. boolean

25. exact

26. semantic-based

27. ontology-based

28. thesauri-based

29. log-based

30. popularity-based

31. genetic

32. cellular automata

33. markovian

34. diffusion-based

More LSI Snakeoil

10 Wednesday Dec 2008

Posted by egarcia in Latent Semantic Indexing

Here is another SEO resource (http://www.billhartzer.com/pages/latent-symantec-indexing-lsi-is-the-key-to-great-search-engine-rankings/) that, à la Aaron Wall, is still promoting LSI SEO nonsense in connection with ranking high in search engines, as if these marketers really knew what LSI is or how it works. If they did, they would never publish such crap.

There is no such thing as “LSA/LSI sites”, nor can SEOs manipulate LSI to influence ranking results. It is this type of snake-oil marketing that gives the SEO industry a black eye.

SSN Myths

08 Monday Dec 2008

Posted by egarcia in Miscellaneous, SEO Myths

From time to time we hear of some urban legends and myths in connection with social security numbers (SSNs).

One myth has it that SSNs label citizens based on their race or origins. Another myth is that a number can be decoded to spell out names. Let’s debunk these myths.

Regarding the first myth, according to the Social Security Administration site (http://www.ssa.gov/history/ssnmyth.html):

“Apparently due to the fact that the middle digits of the SSN are referred to as the “group number,” some people have misconstrued this to mean that the “group number” refers to racial groupings. So a myth goes around from time-to-time that encoded in a person’s SSN is a key to their race. This simply is not true.”

“As should be clear from the explanation of the SSN numbering scheme, the “group number” refers only to the numerical groups 01-99. For filing purposes, the “area numbers” are broken down into these numerical subgroups. So, for example, for area numbers starting with 527 there would be 99 subgroups, one for every number starting with 527-01, and one for every number starting with 527-02, and so on. This was done back in 1936 because in that era there were no computers and all the records were stored in filing cabinets. The early program administrators needed some way to organize the filing cabinets into sub-groups, to make them more manageable, and this is the scheme they came up with.”

“So the “group number” has nothing whatever to do with race.”
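
The numbering scheme described in the quote (area number, group number, then the remaining four digits) can be sketched in a few lines of Python. This is only an illustration: the function name `split_ssn` is made up, and the sample string reuses the 527-01 prefix from the quote with the last four digits set to 0000, a serial that is never issued, so it is not a real number.

```python
def split_ssn(ssn):
    """Split an SSN-formatted string into its area, group, and serial parts."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected nine digits, optionally hyphenated")
    # First three digits: area number; middle two: group number; last four: serial.
    return digits[:3], digits[3:5], digits[5:]


# Illustrative, never-issued number built from the 527-01 prefix in the quote.
area, group, serial = split_ssn("527-01-0000")
print(area, group, serial)
```

Nothing in this split carries information about a person; the group number is just a two-digit filing subgroup.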

Still, some folks, like this Google user, have heard that the fifth digit of an SSN is odd for whites but even for African Americans and minorities. Not true.

Regarding the second myth, some have claimed that flipping an SSN might closely spell or encode a name, word, message, etc.

For instance, in February 2008 Google won the Dylan Stephen Jayne v. Google Founders lawsuit. Jayne claimed that his social security number upside down spelled ‘Google’. He was seeking $5 billion in compensation.

On appeal from the United States District Court for the Middle District of Pennsylvania, the United States Court of Appeals for the Third Circuit dismissed the case and ruled in favor of Google:

“As explained by the District Court, Google and its founders are not state actors, and Jayne’s allegation concerning his coded social security number does not constitute a violation of the Constitution or federal law. We also agree that any amendment of the complaint would be futile.”

I don’t know about you, but to me, based on pure speculation and font family, ‘Google’ flipped upside down resembles 216009.

But there is a problem: this sequence can appear anywhere in a candidate SSN (beginning, end, etc).

True, we can narrow down the possible sequences, since according to the SSA site the middle two digits of a valid SSN cannot be ‘00’. Even so, three more digits are needed to complete a 9-digit sequence. Can you guess how to obtain these?
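
The placement argument can be sketched in Python. This is just a guessing-game illustration: the helper names are made up, unknown digits are shown as '?', and the only validity rule applied is the one cited above (the middle two digits cannot be 00).

```python
SEQUENCE = "216009"
SSN_LENGTH = 9


def placements(sequence, length=SSN_LENGTH):
    """Yield every pattern in which the sequence fits, with '?' for unknown digits."""
    for offset in range(length - len(sequence) + 1):
        tail = length - offset - len(sequence)
        yield "?" * offset + sequence + "?" * tail


def group_is_valid(pattern):
    """Apply only the rule cited in the post: the middle two digits cannot be 00."""
    return pattern[3:5] != "00"


valid = [p for p in placements(SEQUENCE) if group_is_valid(p)]
for pattern in valid:
    print(pattern)
```

Of the four possible placements, the one that starts at the first digit puts ‘00’ in the middle and drops out. The other three each leave three unknown digits.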

Still, this guessing exercise does not amount to a case. When it comes to guessing/gaming, you have the right to guess/game all you want to guess/game.

Now, for those who believe in things like Numerology, Kabbalah, etc., 216009 can be reduced to 18 and then to 9. Upside down, 9 resembles a 6 or a G, which is the first letter in Game, Gaming, Google, and God.
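
For the curious, the reduction is just repeated digit summing. A minimal Python sketch (the function name is made up):

```python
def reduce_digits(n):
    """Sum the digits repeatedly until one digit remains; return every step."""
    steps = [n]
    while n >= 10:
        n = sum(int(d) for d in str(n))
        steps.append(n)
    return steps


steps = reduce_digits(216009)
print(steps)  # [216009, 18, 9]
```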

I have placed this post in the SEO Myth category just because the underlying nature of the above myths resembles the dumb nature of many of the myths promoted by SEOs.

More SSNs Compromised

05 Friday Dec 2008

Posted by egarcia in Hacking, Homeland Security

As mentioned in recent posts, the current issue of IRW features an article covering incidents where social security numbers (SSNs) have been leaked to the Web. Along the same line, a cardinal rule in Web security is to never provide a connection between an intranet and the Internet. Once such a connection is established (hardware-based or via links), chances are that you no longer have an intranet. So, why take the risks?

In addition, never place sensitive information on a test server with access to the Web. Unfortunately, the first offenders are frequently government agencies and universities. Stubborn IT administrators never get it!

For instance, adding insult to injury, here is a report from the Orlando Sentinel in which 250,000+ user accounts containing SSNs were compromised:

http://blogs.orlandosentinel.com/news_politics/2008/12/state-agency-pu.html

According to the report:

“The state Agency for Workforce Innovation blamed a “security breach” Wednesday for why it accidentally placed the names and Social Security numbers of 250,000 job-seekers on a “test server” that could have been accessed online.”

“The names and information were online for 19 days and removed in late October after the state Department of Revenue came across it during “routine work,” officials said. The only common denominator among the names placed online was that they all got services over the last six years from one of the 81 Florida “career centers” that provide job-training and resources around the state.”

The breach is giving bad publicity to the Agency for Workforce Innovation (AWI). According to http://infosecurity.us/?p=4041, the Liberty Coalition asked AWI the following questions:

  1. Why did the Agency for Workforce Innovation store sensitive Excel files on a server at all?
  2. Why was this website left open to the public for more than a month, undetected by AWI’s IT department?
  3. Why were the files on the server not behind a firewall, password protected or encrypted?
  4. How many other servers store sensitive personal information, and how many of those are available to the public right now?
  5. How many AWI employees have access to clients’ social security numbers, and do they all need access?
  6. How do you plan to train employees to appropriately handle sensitive personal information?
  7. Do you have a regular schedule of scanning your internal networks and external servers for personal information? If so, why was this breach not discovered?
  8. Does the Agency for Workforce Innovation intend to pay for identity theft protection services for the victims of this breach?
  9. Will the Agency notify victims by mail?

Infosecurity states that the Liberty Coalition has raised the following issues:

  1. AWI has not offered to protect victims with identity theft protection services.
  2. AWI relied on public search engines and a member of the public 800 miles away to discover the breach.
  3. The Agency should destroy the information, not just restrict access.
  4. How many other AWI servers are currently exposing personal information.
  5. Why the need for AWI to collect minors’ social security numbers.
  6. AWI has not indicated how many employees have access to clients’ social security numbers, and whether these employees require access to fulfil their job descriptions.
  7. AWI does not appear to regularly scan its networks for sensitive personal information.
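
The kind of routine scan raised in the questions above can be sketched in a few lines of Python. This is a minimal illustration, not what AWI or anyone else actually runs: the function name and the sample text are made up, it only looks for the hyphenated NNN-NN-NNNN layout, and the only validity rule it applies is that the group number cannot be 00. A real scanner would also have to handle unhyphenated numbers, spreadsheets, and binary formats.

```python
import re

# Hyphenated SSN layout: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def find_ssn_like(text):
    """Return SSN-formatted strings in text, skipping those with group number 00."""
    hits = []
    for match in SSN_PATTERN.finditer(text):
        group = match.group(2)
        if group == "00":
            continue  # cannot be a valid SSN
        hits.append(match.group(0))
    return hits


# Made-up sample; 527-01-0000 uses a never-issued serial.
sample = "id 527-01-0000 on file; ref 527-00-1234; phone 555-0100"
print(find_ssn_like(sample))
```

Run over the files on a server on a regular schedule, even something this crude would flag a spreadsheet full of SSN-formatted strings long before a search engine does.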

To play PR damage control after the fact, and after such gross incompetence, the FloridaJobs.org site published the following:

“The Agency for Workforce Innovation is continuing to take action to address a security breach that recently occurred on a test server. Upon discovery, the Agency immediately contacted the appropriate law enforcement agencies, began a thorough investigation and promptly coordinated with all major external search engine companies to ensure the information was no longer accessible to the public. The Agency has no reason to believe any personal information has been accessed for unlawful purposes.”
http://www.floridajobs.org/publications/news_rel/securityBreach.html

They have “no reason to believe any personal information has been accessed for unlawful purposes.” Nice PR try. How do they know that? After their comedy of errors, why would anyone want to submit resumes to their databases? The rest of their PR excuses are a smokescreen.

Note also how quickly they contacted the search engines, just in case these had indexed the documents. At least they are realizing the power of search engines. Chances are the search engines have cached copies of these documents.

More on IRW

04 Thursday Dec 2008

Posted by egarcia in Data Mining, Newsletters

The current issue of IRW also covers:

  1. Henry Freiser’s Pointer Function for visualizing all real roots of a polynomial.
  2. Vannevar Bush’s first computers and Bell’s CNC.
  3. Cyril Cleverdon: the IR Father of Precision and Recall.
  4. More graduate student CS/IR theses.
  5. MIT’s CS Department.
  6. More IR blogs.
  7. Call for Papers.

Search Engines and SSNs

03 Wednesday Dec 2008

Posted by egarcia in Data Mining, Homeland Security, Newsletters

In the current issue of IRW we explain why exposing social security numbers (SSNs) online is an enabling crime, one that is relevant to Homeland Security (1). We show that, ironically, government agencies and universities are the first to expose SSNs on the Web.

We examined how crafting smart queries in Google and other search engines allows users to find incidents in which SSNs have been released online for the entire world to see. Although this is nothing new, it is a widespread problem across the Web. It is a shame when administrators at these two offenders (government and university offices) ignore the problem or justify it in the name of what is practical.

We show why the common practice of disclosing the last four digits of an SSN is a very bad idea. With SSN allocation tables, we can map the first three digits to the region in which the SSN application was filed, by US state and territory. If the last four digits are known, only the middle two digits need to be guessed. Identity thieves and stalkers might be having a field day.
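
The arithmetic behind this warning is easy to check. Here is a back-of-the-envelope sketch in Python; the function and the `areas_for_region` parameter are made up for illustration, and the count assumes the area number can be pinned down to one or a few values from the allocation tables.

```python
ALL_NINE_DIGIT_STRINGS = 10 ** 9  # every possible 9-digit string
VALID_GROUPS = 99                 # group numbers run 01..99; 00 is not valid


def remaining_candidates(areas_for_region=1):
    """Candidates left when the last four digits are known and the area is narrowed down."""
    return areas_for_region * VALID_GROUPS


print(remaining_candidates())   # one known area number
print(remaining_candidates(5))  # a region covered by five area numbers
```

Going from a billion possible strings to a few hundred candidates is the whole problem in one line.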

There is still hope, though. We cover how Northern Michigan University (2) and Johns Hopkins University (3) are proactively becoming part of the solution and not part of the problem. In the case of NMU, they have published a one-year case study outlining the full eradication of SSNs as identifiers from the NMU campus.

References

1. The Homeland Security and Terrorism Threat: From Document Fraud, Identity Theft and Social Security Number Misuse
http://finance.senate.gov/hearings/testimony/2003test/091003pctest.pdf
2. Full Eradication of Social Security Number as an Identifier
http://net.educause.edu/ir/library/pdf/EDU04144.pdf
3. Policy on Social Security Number Protection and Use
http://education.jhu.edu/catalog/academic-policies/policy-on-ssn-protection-and-use/
