<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>IR Thoughts</title>
	<atom:link href="http://irthoughts.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://irthoughts.wordpress.com</link>
	<description>Thoughts on Information Retrieval &#38; Data Mining</description>
	<lastBuildDate>Fri, 10 Jul 2009 17:31:43 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<image>
		<url>http://www.gravatar.com/blavatar/b50f2f199631fcb269aa9a1b8b9bcda4?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>IR Thoughts</title>
		<link>http://irthoughts.wordpress.com</link>
	</image>
			<item>
		<title>Centering Data With Excel</title>
		<link>http://irthoughts.wordpress.com/2009/07/10/centering-data-with-excel/</link>
		<comments>http://irthoughts.wordpress.com/2009/07/10/centering-data-with-excel/#comments</comments>
		<pubDate>Fri, 10 Jul 2009 16:58:47 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=1020</guid>
		<description><![CDATA[The QA column of the current issue of IR Watch Newsletter has a great question that might help IR, CS, and stats students.
Q: Centering Data with Excel- In Excel, how do you center a data set?
 A: To center a data set and scale it to unit variance, use the STANDARDIZE function, which converts x values into z-scores; i.e. 
z = (x [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>The QA column of the current issue of IR Watch Newsletter has a great question that might help IR, CS, and stats students.</p>
<p><strong>Q:</strong> <strong>Centering Data with Excel</strong>- In Excel, how do you center a data set?</p>
<p> <strong>A:</strong> To center a data set and scale it to unit variance, use the STANDARDIZE function, which converts <strong><em>x</em></strong> values into <strong><em>z-scores</em></strong>; <em>i.e. </em></p>
<p><strong><em>z = (x – a)/s</em></strong></p>
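<p>where <strong><em>a</em></strong> and <strong><em>s</em></strong> are, respectively, the arithmetic mean and standard deviation of the data. Note that the formula below uses STDEV, which is the sample (not population) standard deviation. The following table emulates an Excel spreadsheet.</p>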
<table style="text-align:center;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top"> </td>
<td valign="top">
<p align="center"><strong>A</strong></p>
</td>
<td valign="top">
<p align="center"><strong>B</strong></p>
</td>
<td valign="top">
<p align="center"><strong>C</strong></p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>1</strong></p>
</td>
<td valign="top">
<p align="center"><strong><em>Age, x(A)</em></strong></p>
</td>
<td valign="top">
<p align="center"><strong><em>Weight, x(W)</em></strong></p>
</td>
<td valign="top">
<p align="center"><strong><em>Height, x(H)</em></strong></p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>2</strong></p>
</td>
<td valign="top">
<p align="center">64</p>
</td>
<td valign="top">
<p align="center">57</p>
</td>
<td valign="top">
<p align="center">8</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>3</strong></p>
</td>
<td valign="top">
<p align="center">71</p>
</td>
<td valign="top">
<p align="center">59</p>
</td>
<td valign="top">
<p align="center">10</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>4</strong></p>
</td>
<td valign="top">
<p align="center">53</p>
</td>
<td valign="top">
<p align="center">49</p>
</td>
<td valign="top">
<p align="center">6</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>5</strong></p>
</td>
<td valign="top">
<p align="center">67</p>
</td>
<td valign="top">
<p align="center">62</p>
</td>
<td valign="top">
<p align="center">11</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>6</strong></p>
</td>
<td valign="top">
<p align="center">55</p>
</td>
<td valign="top">
<p align="center">51</p>
</td>
<td valign="top">
<p align="center">8</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>7</strong></p>
</td>
<td valign="top">
<p align="center">58</p>
</td>
<td valign="top">
<p align="center">50</p>
</td>
<td valign="top">
<p align="center">7</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>8</strong></p>
</td>
<td valign="top">
<p align="center">77</p>
</td>
<td valign="top">
<p align="center">55</p>
</td>
<td valign="top">
<p align="center">10</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>9</strong></p>
</td>
<td valign="top">
<p align="center">57</p>
</td>
<td valign="top">
<p align="center">48</p>
</td>
<td valign="top">
<p align="center">9</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>10</strong></p>
</td>
<td valign="top">
<p align="center">56</p>
</td>
<td valign="top">
<p align="center">42</p>
</td>
<td valign="top">
<p align="center">10</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>11</strong></p>
</td>
<td valign="top">
<p align="center">51</p>
</td>
<td valign="top">
<p align="center">42</p>
</td>
<td valign="top">
<p align="center">6</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>12</strong></p>
</td>
<td valign="top">
<p align="center">76</p>
</td>
<td valign="top">
<p align="center">61</p>
</td>
<td valign="top">
<p align="center">12</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>13</strong></p>
</td>
<td valign="top">
<p align="center">68</p>
</td>
<td valign="top">
<p align="center">57</p>
</td>
<td valign="top">
<p align="center">9</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>14</strong></p>
</td>
<td valign="top"> </td>
<td valign="top"> </td>
<td valign="top"> </td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>15</strong></p>
</td>
<td valign="top">
<p align="center"><strong><em>z(A)</em></strong></p>
</td>
<td valign="top">
<p align="center"><strong><em>z(W)</em></strong></p>
</td>
<td valign="top">
<p align="center"><strong><em>z(H)</em></strong></p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>16</strong></p>
</td>
<td valign="top">
<p align="center">0.14</p>
</td>
<td valign="top">
<p align="center">0.62</p>
</td>
<td valign="top">
<p align="center">-0.44</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>17</strong></p>
</td>
<td valign="top">
<p align="center">0.92</p>
</td>
<td valign="top">
<p align="center">0.92</p>
</td>
<td valign="top">
<p align="center">0.61</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>18</strong></p>
</td>
<td valign="top">
<p align="center">-1.09</p>
</td>
<td valign="top">
<p align="center">-0.55</p>
</td>
<td valign="top">
<p align="center">-1.49</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>19</strong></p>
</td>
<td valign="top">
<p align="center">0.47</p>
</td>
<td valign="top">
<p align="center">1.36</p>
</td>
<td valign="top">
<p align="center">1.14</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>20</strong></p>
</td>
<td valign="top">
<p align="center">-0.86</p>
</td>
<td valign="top">
<p align="center">-0.26</p>
</td>
<td valign="top">
<p align="center">-0.44</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>21</strong></p>
</td>
<td valign="top">
<p align="center">-0.53</p>
</td>
<td valign="top">
<p align="center">-0.40</p>
</td>
<td valign="top">
<p align="center">-0.97</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>22</strong></p>
</td>
<td valign="top">
<p align="center">1.59</p>
</td>
<td valign="top">
<p align="center">0.33</p>
</td>
<td valign="top">
<p align="center">0.61</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>23</strong></p>
</td>
<td valign="top">
<p align="center">-0.64</p>
</td>
<td valign="top">
<p align="center">-0.70</p>
</td>
<td valign="top">
<p align="center">0.09</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>24</strong></p>
</td>
<td valign="top">
<p align="center">-0.75</p>
</td>
<td valign="top">
<p align="center">-1.58</p>
</td>
<td valign="top">
<p align="center">0.61</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>25</strong></p>
</td>
<td valign="top">
<p align="center">-1.31</p>
</td>
<td valign="top">
<p align="center">-1.58</p>
</td>
<td valign="top">
<p align="center">-1.49</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>26</strong></p>
</td>
<td valign="top">
<p align="center">1.47</p>
</td>
<td valign="top">
<p align="center">1.21</p>
</td>
<td valign="top">
<p align="center">1.67</p>
</td>
</tr>
<tr>
<td valign="top">
<p align="center"><strong>27</strong></p>
</td>
<td valign="top">
<p align="center">0.58</p>
</td>
<td valign="top">
<p align="center">0.62</p>
</td>
<td valign="top">
<p align="center">0.09</p>
</td>
</tr>
</tbody>
</table>
<p style="text-align:left;">Rows 2 – 13 contain the data set x(A), x(W), and x(H). In rows 16 – 27, the set was centered and scaled by typing the following formula in cell A16:</p>
<p style="text-align:left;"> =STANDARDIZE(A2,AVERAGE(A$2:A$13),STDEV(A$2:A$13))</p>
<p style="text-align:left;"> Copying this formula into the rest of cells A16 through C27 standardizes the whole data set. That was easy!</p>
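<p style="text-align:left;">For readers who want to check the table outside Excel, here is a minimal Python sketch of the same calculation. It assumes, as the formula above does, the sample standard deviation (Excel's STDEV, which divides by n - 1). The function names <em>center</em> and <em>standardize</em> are illustrative only and are not part of Excel. Strictly speaking, centering only subtracts the mean, while STANDARDIZE also divides by the standard deviation.</p>

```python
from statistics import mean, stdev

def center(values):
    """Subtract the arithmetic mean from every value (mean-centering only)."""
    a = mean(values)
    return [x - a for x in values]

def standardize(values):
    """Return z-scores, z = (x - a)/s, with s the sample standard deviation."""
    a = mean(values)
    s = stdev(values)  # divides by n - 1, like Excel's STDEV
    return [(x - a) / s for x in values]

# Column A of the table above (Age), rows 2 to 13.
ages = [64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68]

z_ages = [round(z, 2) for z in standardize(ages)]
print(z_ages)
```

<p style="text-align:left;">Rounded to two decimals, the printed values should match column A, rows 16 to 27. The same call can be applied to the weight and height columns.</p>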
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/07/10/centering-data-with-excel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IRW-7-2009: Data Mining Texting</title>
		<link>http://irthoughts.wordpress.com/2009/07/06/irw-7-2009-data-mining-texting/</link>
		<comments>http://irthoughts.wordpress.com/2009/07/06/irw-7-2009-data-mining-texting/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 16:17:11 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=1008</guid>
		<description><![CDATA[
The current issue of IRW, the newsletter, is out.
Featured Article:
Data Mining Texting
TTMD OMG MOS CU
&#8220;My parents send email, I text.” This illustrates the obvious: a digital divide between parents and teens. While parents are busy replying to email or blogging at the most, their kids probably are busy developing their own language to alert their [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:center;"><img class="aligncenter" src="http://www.miislita.com/irw/data-mining-texting.gif" alt="data mining texting" /></p>
<p>The current issue of IRW, the newsletter, is out.</p>
<p>Featured Article:</p>
<p><strong>Data Mining Texting</strong><br />
<em>TTMD OMG MOS CU</em></p>
<blockquote><p>&#8220;My parents send email, I text.” This illustrates the obvious: a digital divide between parents and teens. While parents are busy replying to email or blogging at the most, their kids probably are busy developing their own language to alert their peers when mom or dad is trying to figure out what they are texting about. Did you know that MOS CU means ‘Mother over shoulder. See you’? And how about PW CUL (‘Parents watching. See you later’)?</p></blockquote>
<p>Indeed&#8230; Texting is not just for teens:</p>
<blockquote><p>Texting is not only revolutionizing the way business is conducted in 2009, but is also an emerging data mining playground. The number of behavioral patterns in connection with texting is on the rise at different diffusion fronts: from sexting and sextcasting (transmission of conversations, videos, photos with sexual content) to dealing (transmission of conversations in connection with illegal drug activities), to encoding conversations about Wall Street transactions, industrial espionage, and so forth.</p></blockquote>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/07/06/irw-7-2009-data-mining-texting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://www.miislita.com/irw/data-mining-texting.gif" medium="image">
			<media:title type="html">data mining texting</media:title>
		</media:content>
	</item>
		<item>
		<title>Random notes prior to 4th July weekend</title>
		<link>http://irthoughts.wordpress.com/2009/07/03/random-notes-prior-to-4th-july-weekend/</link>
		<comments>http://irthoughts.wordpress.com/2009/07/03/random-notes-prior-to-4th-july-weekend/#comments</comments>
		<pubDate>Fri, 03 Jul 2009 13:21:37 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Newsletters]]></category>
		<category><![CDATA[SEO Myths]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=1006</guid>
		<description><![CDATA[As the 4th of July weekend approaches, here are some notes before heading off to planet Oblivious.
1. Yesterday we had an interesting business entrepreneur meeting with the CIO of the Government of Puerto Rico at El Palacio Rojo, Fortaleza.
2. IRW should be out by Monday. Main article: Data Mining Texting.
3. Only monkeys still believe in KD Myths. Ha, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As the 4th of July weekend approaches, here are some notes before heading off to planet Oblivious.</p>
<p>1. Yesterday we had an interesting business entrepreneur meeting with the CIO of the Government of Puerto Rico at El Palacio Rojo, Fortaleza.</p>
<p>2. IRW should be out by Monday. Main article: Data Mining Texting.</p>
<p>3. Only <a href="http://www.bloggeries.com/forum/seo-search-engine-optimization/13517-free-keyword-counter-articles.html">monkeys</a> still believe in KD Myths. Ha, Ha.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/07/03/random-notes-prior-to-4th-july-weekend/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Official: MIC Puerto Rico</title>
		<link>http://irthoughts.wordpress.com/2009/06/23/official-mic-puerto-rico/</link>
		<comments>http://irthoughts.wordpress.com/2009/06/23/official-mic-puerto-rico/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 15:32:13 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=1000</guid>
		<description><![CDATA[Back in April, I mentioned that Microsoft would be co-launching the Microsoft Innovation Center (MIC) of Puerto Rico with the Interamerican University of Puerto Rico, Metropolitan Campus.
Well, tomorrow is the official inauguration. The university has generously provided me with lab and office space to start an interesting research project within the MIC building. This is exciting news. I cannot [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Back in April, I mentioned that Microsoft would be co-launching the <a href="http://irthoughts.wordpress.com/2009/04/29/microsoft-inter-metro-to-co-launch-a-mic/">Microsoft Innovation Center (MIC)</a> of Puerto Rico with the Interamerican University of Puerto Rico, Metropolitan Campus.</p>
<p>Well, tomorrow is the official inauguration. The university has generously provided me with lab and office space to start an interesting research project within the MIC building. This is exciting news. I cannot comment much about the project, except to say that it is at the interface of search engines, social networks, and information security.</p>
<p>It looks like I will have my hands full between working at two universities, blogging, and doing consulting work.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/06/23/official-mic-puerto-rico/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IR Videos in Spanish</title>
		<link>http://irthoughts.wordpress.com/2009/06/22/ir-videos-in-spanish/</link>
		<comments>http://irthoughts.wordpress.com/2009/06/22/ir-videos-in-spanish/#comments</comments>
		<pubDate>Mon, 22 Jun 2009 05:00:55 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Conferences]]></category>
		<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[Latent Semantic Indexing]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=991</guid>
		<description><![CDATA[I normally do not put my lecture notes (ppt, pdf, videos) online. However, there are two public talks that the event organizers taped. Both last over an hour and are in Spanish, but with slides in English. Here are the links. The quality of the videos is so-so.
Since the videos were made available a few months after the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I normally do not put my lecture notes (ppt, pdf, videos) online. However, there are two public talks that the event organizers taped. Both last over an hour and are in Spanish, but with slides in English. Here are the links. The quality of the videos is so-so.</p>
<p>Since the videos were made available a few months after the events, they are not properly dated. I have included the actual dates of the events below. If you don&#8217;t know Spanish, you are out of luck.</p>
<p>1. Understanding Search Engines (Entendiendo a los Buscadores), University of Puerto Rico, Bayamon, 4-23-2008</p>
<p><a href="http://video.google.com/videoplay?docid=-653964730907023811">http://video.google.com/videoplay?docid=-653964730907023811</a></p>
<p>This one lasts about two hours. The audience consisted of grad students and researchers. Unfortunately, the audio and the slides in the video are out of sync by about one slide. If you can cope with this, I hope you like it.</p>
<p>2. Demystifying LSI (Desmitificando LSI)- OJOBuscador Congress, Madrid, Spain, 3-09-2007.</p>
<p><a href="http://www.ojotube.com/videos/congreso-ojobuscador-2007-ponencia-desmitificando-lsi-de-dr-e-garcia/">http://www.ojotube.com/videos/congreso-ojobuscador-2007-ponencia-desmitificando-lsi-de-dr-e-garcia/</a></p>
<p>This one lasts over an hour. Since it was for a non-scientific audience (mostly Spanish SEOs), I tried to talk very slowly.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/06/22/ir-videos-in-spanish/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>What is a Similarity Matrix?</title>
		<link>http://irthoughts.wordpress.com/2009/06/16/what-is-a-similarity-matrix/</link>
		<comments>http://irthoughts.wordpress.com/2009/06/16/what-is-a-similarity-matrix/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 14:23:57 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Quizzes]]></category>
		<category><![CDATA[Latent Semantic Indexing]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=986</guid>
		<description><![CDATA[Sooner or later, CS students, particularly those in IR, will need to deal with similarity matrices.
In simple terms, any matrix M that exhibits the following five characteristics is a similarity matrix.
Squareness = M must have the same number of rows and columns.
Non-Negativity = all elements of M must be real, non-negative numbers.
Boundedness = all elements [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Sooner or later, CS students, particularly those in IR, will need to deal with similarity matrices.</p>
<p>In simple terms, any matrix <strong>M</strong> that exhibits the following five characteristics is a similarity matrix.</p>
<p><strong>Squareness</strong> = <strong>M</strong> must have the same number of rows and columns.<br />
<strong>Non-Negativity</strong> = all elements of <strong>M</strong> must be real, non-negative numbers.<br />
<strong>Boundedness</strong> = all elements of <strong>M</strong> must take values between 0 and 1, inclusive.<br />
<strong>Reflexivity</strong> = all elements on the main diagonal of <strong>M</strong> (i.e. from top left to bottom right) must equal 1.<br />
<strong>Symmetry</strong> = every element ij must be identical to the corresponding element ji.</p>
<p>A matrix that lacks even one of these characteristics is not a similarity matrix.</p>
<p>Accordingly, some matrices found in the LSI literature, whose elements have been referred to as similarities, are not similarity matrices, since they do not conform to the above definition.</p>
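<p>As a rough illustration, the five checks can be sketched in Python. The function name <em>is_similarity_matrix</em> and the tolerance parameter <em>tol</em> are assumptions made for this example only, and a matrix is represented here as a plain list of rows.</p>

```python
import math

def is_similarity_matrix(m, tol=1e-9):
    """Check the five characteristics listed above for a list-of-rows matrix."""
    n = len(m)
    # Squareness: same number of rows and columns.
    if n == 0 or any(len(row) != n for row in m):
        return False
    for i in range(n):
        for j in range(n):
            x = m[i][j]
            # Non-negativity and boundedness: real values from 0 to 1.
            if not isinstance(x, (int, float)) or not (0.0 <= x <= 1.0):
                return False
            # Symmetry: element ij equals element ji.
            if not math.isclose(x, m[j][i], abs_tol=tol):
                return False
        # Reflexivity: every diagonal element equals 1.
        if not math.isclose(m[i][i], 1.0, abs_tol=tol):
            return False
    return True

good = [[1.0, 0.3, 0.0],
        [0.3, 1.0, 0.7],
        [0.0, 0.7, 1.0]]
not_symmetric = [[1.0, 0.3],
                 [0.4, 1.0]]
print(is_similarity_matrix(good))           # True
print(is_similarity_matrix(not_symmetric))  # False
```

<p>The tolerance only guards against floating-point round-off; with exact values it can be set to zero.</p>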
<p>Note: This information will help those who took the <a href="http://irthoughts.wordpress.com/2009/05/13/ir-quiz-matrices/">IR Quiz on Matrices</a> see how well they did.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/06/16/what-is-a-similarity-matrix/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Computing Co-Occurrence Matrices with Excel</title>
		<link>http://irthoughts.wordpress.com/2009/06/05/computing-co-occurrence-matrices-with-excel/</link>
		<comments>http://irthoughts.wordpress.com/2009/06/05/computing-co-occurrence-matrices-with-excel/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 14:37:46 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=976</guid>
		<description><![CDATA[The QA column of the current issue of IR Watch &#8211; The Newsletter features the following question:
Question: In Excel, how do you convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?
Answer:
Let A be a matrix populated with term occurrences (frequencies).
Let A^T be its transpose.
Then, T = AA^T is a term-term co-occurrence matrix, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>The QA column of the current issue of IR Watch &#8211; The Newsletter features the following question:</p>
<p>Question: In Excel, how do you convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?</p>
<p>Answer:</p>
<p>Let <strong>A</strong> be a term-document matrix (terms as rows, documents as columns) populated with term occurrences (frequencies).<br />
Let <strong>A<sup>T</sup> </strong>be its transpose.</p>
<p>Then, <strong>T = AA<sup>T</sup></strong> is a term-term co-occurrence matrix, and <strong>D = A<sup>T</sup>A </strong>is a document-document co-occurrence matrix.</p>
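<p>Outside Excel, the same two products can be sketched in a few lines of plain Python. The helper names <em>transpose</em> and <em>matmul</em> are assumptions made for this illustration, and the matrix is the small 3 x 3 example used in this post, with terms t1 to t3 as rows and documents d1 to d3 as columns.</p>

```python
def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two list-of-rows matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]

A = [[0, 1, 0],   # t1
     [0, 0, 1],   # t2
     [1, 1, 1]]   # t3
A_T = transpose(A)

T = matmul(A, A_T)   # term-term co-occurrence matrix
D = matmul(A_T, A)   # document-document co-occurrence matrix
print(T)  # [[1, 0, 1], [0, 1, 1], [1, 1, 3]]
print(D)  # [[1, 1, 1], [1, 2, 1], [1, 1, 2]]
```

<p>Both results are square and symmetric. Because the entries of this example are 0 or 1, the diagonal of T counts the documents that contain each term, and the diagonal of D counts the terms in each document.</p>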
<p>The following table emulates an Excel spreadsheet.</p>
<table style="text-align:center;" border="1" cellspacing="0" cellpadding="0" width="276">
<tbody>
<tr>
<td width="20" valign="bottom">
<p align="center"> </p>
</td>
<td width="64" valign="bottom">
<p align="center">A</p>
</td>
<td width="64" valign="bottom">
<p align="center">B</p>
</td>
<td width="64" valign="bottom">
<p align="center">C</p>
</td>
<td width="64" valign="bottom">
<p align="center">D</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">1</td>
<td width="64" valign="bottom"><strong> A =</strong></td>
<td width="64" valign="bottom">
<p align="center">d1</p>
</td>
<td width="64" valign="bottom">
<p align="center">d2</p>
</td>
<td width="64" valign="bottom">
<p align="center">d3</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">2</td>
<td width="64" valign="bottom">
<p align="right">t1</p>
</td>
<td width="64" valign="bottom">
<p align="center">0</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">0</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">3</td>
<td width="64" valign="bottom">
<p align="right">t2</p>
</td>
<td width="64" valign="bottom">
<p align="center">0</p>
</td>
<td width="64" valign="bottom">
<p align="center">0</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">4</td>
<td width="64" valign="bottom">
<p align="right">t3</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">5</td>
<td width="64" valign="bottom">
<p align="right"> </p>
</td>
<td width="64" valign="bottom">
<p align="center"> </p>
</td>
<td width="64" valign="bottom">
<p align="center"> </p>
</td>
<td width="64" valign="bottom">
<p align="center"> </p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">6</td>
<td width="64" valign="bottom">
<p align="center"><strong>T = AA<sup>T</sup></strong></p>
</td>
<td width="64" valign="bottom">
<p align="center">t1</p>
</td>
<td width="64" valign="bottom">
<p align="center">t2</p>
</td>
<td width="64" valign="bottom">
<p align="center">t3</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">7</td>
<td width="64" valign="bottom">
<p align="right">t1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">0</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">8</td>
<td width="64" valign="bottom">
<p align="right">t2</p>
</td>
<td width="64" valign="bottom">
<p align="center">0</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">9</td>
<td width="64" valign="bottom">
<p align="right">t3</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">3</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">10</td>
<td width="64" valign="bottom">
<p align="right"> </p>
</td>
<td width="64" valign="bottom">
<p align="center"> </p>
</td>
<td width="64" valign="bottom">
<p align="center"> </p>
</td>
<td width="64" valign="bottom">
<p align="center"> </p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">11</td>
<td width="64" valign="bottom">
<p align="center"><strong>D = A<sup>T</sup>A</strong></p>
</td>
<td width="64" valign="bottom">
<p align="center">d1</p>
</td>
<td width="64" valign="bottom">
<p align="center">d2</p>
</td>
<td width="64" valign="bottom">
<p align="center">d3</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">12</td>
<td width="64" valign="bottom">
<p align="right">d1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">13</td>
<td width="64" valign="bottom">
<p align="right">d2</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">2</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
</tr>
<tr>
<td width="20" valign="bottom">14</td>
<td width="64" valign="bottom">
<p align="right">d3</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">1</p>
</td>
<td width="64" valign="bottom">
<p align="center">2</p>
</td>
</tr>
</tbody>
</table>
<p>In the table, <strong>T</strong> was computed by selecting the destination array (<strong>B7:D9</strong>), entering in its first cell (<strong>B7</strong>) the formula <strong>=MMULT(B2:D4,TRANSPOSE(B2:D4))</strong>, pressing the <strong>F2</strong> key and then the <strong>Ctrl+Shift+Enter</strong> keys to commit it as an array formula.</p>
<p>Similarly, <strong>D</strong> was computed by selecting the destination array (<strong>B12:D14</strong>), entering in its first cell (<strong>B12</strong>) the formula <strong>=MMULT(TRANSPOSE(B2:D4),B2:D4)</strong>, pressing the <strong>F2</strong> key and then the <strong>Ctrl+Shift+Enter</strong> keys.</p>
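<p>For readers working outside Excel, the same two products can be sketched in plain Python. This is only an illustration of the table above, not part of the newsletter answer; the helper names <strong>transpose</strong> and <strong>matmul</strong> are assumptions chosen to mirror TRANSPOSE and MMULT.</p>

```python
# Term-document occurrence matrix A from the table
# (rows: t1..t3, columns: d1..d3).
A = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]


def transpose(m):
    """Swap rows and columns (the role played by Excel's TRANSPOSE)."""
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Plain matrix product (the role played by Excel's MMULT)."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]


T = matmul(A, transpose(A))  # term-term co-occurrence matrix
D = matmul(transpose(A), A)  # document-document co-occurrence matrix

print(T)
print(D)
```

<p>The printed rows of <strong>T</strong> and <strong>D</strong> reproduce spreadsheet rows 7 to 9 and 12 to 14 of the table.</p>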
<p>That was easy!</p>
<p>Note that neither of these is a similarity matrix. Can you tell why?</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/06/05/computing-co-occurrence-matrices-with-excel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IRW-2009-6:Hackers: Taxonomy &amp; Writing Styles</title>
		<link>http://irthoughts.wordpress.com/2009/06/01/irw-2009-6hackers-taxonomy-writing-styles/</link>
		<comments>http://irthoughts.wordpress.com/2009/06/01/irw-2009-6hackers-taxonomy-writing-styles/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 13:56:18 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Homeland Security]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=973</guid>
		<description><![CDATA[
The current issue of IRW should reach subscribers&#8217; inboxes during the day or, at the latest, tomorrow.
In this issue:

Featured article: Hackers: Taxonomy and Writing Styles
Due to the increasing interest in developing Information Retrieval and Data Mining courses that intersect with Information Security, this issue of the newsletter presents a brief taxonomy of hackers and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=973&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:center;"><img class="aligncenter" src="http://www.miislita.com/irw/hackers.gif" alt="hackers" /></p>
<p>The current issue of IRW should reach subscribers&#8217; inboxes during the day or, at the latest, tomorrow.</p>
<p>In this issue:</p>
<ul>
<li>Featured article: Hackers: Taxonomy and Writing Styles<br />
Due to the increasing interest in developing Information Retrieval and Data Mining courses that intersect with Information Security, this issue of the newsletter presents a brief taxonomy of hackers and their writing styles.</li>
<li>QA: Excel Matrix Multiplications: How to convert a term-document occurrence matrix into a term-term or document-document co-occurrence matrix?</li>
<li>Vacuum Tubes &amp; Transistors Historical</li>
<li>Who is Who in IR: Thomas K. Landauer</li>
<li>Top CS Departments: Dartmouth College</li>
<li>Outstanding Graduate Theses</li>
<li>Calls and Events</li>
<li>IR Blogs</li>
<li>and more&#8230;</li>
</ul>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/06/01/irw-2009-6hackers-taxonomy-writing-styles/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://www.miislita.com/irw/hackers.gif" medium="image">
			<media:title type="html">hackers</media:title>
		</media:content>
	</item>
		<item>
		<title>On Term Repetition and Local Models</title>
		<link>http://irthoughts.wordpress.com/2009/05/27/on-term-repetition-and-local-models/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/27/on-term-repetition-and-local-models/#comments</comments>
		<pubDate>Wed, 27 May 2009 17:08:34 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[SEO Myths]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=968</guid>
		<description><![CDATA[I&#8217;m putting together a piece on several local term weight models. It should be ready in a few weeks.
It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=968&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I&#8217;m putting together a piece on several local term weight models. It should be ready in a few weeks.</p>
<p>It is a research paper that can be used as a tutorial. It describes a systematic approach for the derivation of any kind of local term weighting model. Students can use it as a recipe for proposing their own candidate models.</p>
<p>The article touches on some aspects of the problem of trusting models that lack attenuation. Here is one snippet on the subject:</p>
<p>&lt;last nail in KD coffin  style=&#8221;intensity:100%;&#8221;&gt;</p>
<p>&#8220;It should be stressed that term repetition does not necessarily satisfy users’ queries, nor is it evidence of:</p>
<p><em><strong>Pertinence (P)</strong></em>; e.g., that a term repeated x times is x times more pertinent to the document.</p>
<p><strong><em>Aboutness (A)</em></strong>; e.g., that the document is x times more about the term.</p>
<p><strong><em>Importance (I)</em></strong>; i.e., that there is a term-document relationship of <em>pertinence</em> and <em>aboutness</em>.</p>
<p><strong><em>Relevance (R)</em></strong>; i.e., that a document repeating a term x times is x times more relevant.</p>
<p>Accordingly, fulfilling such <em>‘PAIR criteria’</em> on a regular basis is hard to accomplish with any model that lacks attenuation.&#8221;</p>
<p>&lt;/last nail in KD coffin&gt;</p>
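<p>To make the attenuation point concrete, here is a minimal Python sketch that compares a raw term count with a log-scaled local weight. The log-scaled form, 1 + ln(tf), is one commonly used attenuated weight and is an assumption here; it is not necessarily one of the models derived in the paper, and the function names are hypothetical.</p>

```python
import math


def raw_weight(tf):
    """Unattenuated local weight: grows linearly with term repetition."""
    return float(tf)


def log_weight(tf):
    """Log-attenuated local weight (assumed form): 1 + ln(tf), or 0 when tf is 0."""
    return 1.0 + math.log(tf) if tf else 0.0


# Compare how each weight reacts as a term is repeated more often.
for tf in (1, 2, 10, 100):
    print(tf, raw_weight(tf), round(log_weight(tf), 2))
```

<p>Under the raw count, a term repeated 100 times weighs 100 times more than a term used once. Under the log-scaled weight it weighs only about 5.6 times more, so repetition alone cannot dominate the score.</p>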
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/27/on-term-repetition-and-local-models/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Defining Data Mining and Database</title>
		<link>http://irthoughts.wordpress.com/2009/05/25/defining-data-mining-and-database/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/25/defining-data-mining-and-database/#comments</comments>
		<pubDate>Mon, 25 May 2009 16:00:58 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=959</guid>
		<description><![CDATA[What is the (^H^H^H) best definition for data mining and database? It depends on who you ask and in which context.
According to Section 126 of the USA Patriot Act,
(1) DATA-MINING- The term `data-mining&#8217; means a query or search or other analysis of one or more electronic databases, where
(A) at least one of the databases was [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=959&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>What is the (^H^H^H) best definition for data mining and database? It depends on who you ask and in which context.</p>
<p>According to <a href="http://thomas.loc.gov/cgi-bin/cpquery/?&amp;dbname=cp109&amp;sid=cp109QWRIU&amp;refer=&amp;r_n=hr333.109&amp;item=&amp;sel=TOC_124051&amp;">Section 126 of the USA Patriot Act</a>,</p>
<blockquote><p>(1) <strong>DATA-MINING</strong>- The term `data-mining&#8217; means a query or search or other analysis of one or more electronic databases, where</p>
<p>(A) at least one of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement;</p>
<p>(B) the search does not use personal identifiers of a specific individual or does not utilize inputs that appear on their face to identify or be associated with a specified individual to acquire information; and</p>
<p>(C) a department or agency of the Federal Government is conducting the query or search or other analysis to find a pattern indicating terrorist or other criminal activity.</p>
<p>(2) <strong>DATABASE-</strong> The term `database&#8217; does not include telephone directories, information publicly available via the Internet or available by any other means to any member of the public, any databases maintained, operated, or controlled by a State, local, or tribal government (such as a State motor vehicle database), or databases of judicial and administrative opinions.</p></blockquote>
<p>Asking the government and a KDDM researcher this question, and then using LSI to cluster their answers, can be a futile exercise.</p>
<p>It is like asking President Obama and former Vice President Cheney to agree on: &#8220;What is Torture?&#8221;</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/25/defining-data-mining-and-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>When Noise is a Good Thing.</title>
		<link>http://irthoughts.wordpress.com/2009/05/22/when-noise-is-a-good-thing/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/22/when-noise-is-a-good-thing/#comments</comments>
		<pubDate>Fri, 22 May 2009 14:03:07 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Latent Semantic Indexing]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=951</guid>
		<description><![CDATA[Today, a reader (name removed to protect confidentiality) asked me:
My name is **** ****. I working as a junior research fellow in a project in India. I red the SVD techniques from the web page http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html#right-eigenvectors. I found it is quite satisfactory for me. Now I can understand how SVD works. But I have a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=951&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Today, a reader (name removed to protect confidentiality) asked me:</p>
<blockquote><p>My name is **** ****. I working as a junior research fellow in a project in India. I red the SVD techniques from the web page <a href="http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html#right-eigenvectors">http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html#right-eigenvectors</a>. I found it is quite satisfactory for me. Now I can understand how SVD works. But I have a query as follows.</p>
<p>query:</p>
<p>As mentioned in this tutorial that we have arrange these eigen-values in descending order. Cold you please tell me if I put these values in ascending order or arbitrary what will be wrong with the SVD.</p>
<p>Looking forward your early kind response.</p>
<p>Thanking you.</p>
<p>With best regards.</p>
<p>*******</p></blockquote>
<p>My answer follows.</p>
<p>It depends on what you are trying to address.</p>
<p>SVD is used to identify singular values, which are interpreted as dimensions. When SVD is used as a dimensionality reduction technique, the N largest singular values are normally retained and the smaller ones are discarded, since retaining them serves no purpose. The largest singular values capture most of the information in the original data set, so retaining them is a noise minimization approach.</p>
<p>If the retention criterion is reversed (the smaller singular values are retained), the noisier dimensions are kept, and the reconstructed matrix will be a matrix of the hidden (latent) data noise. This is a noise maximization approach.</p>
<p>If the retention criterion is based on a random selection, the resultant reconstructed matrix might be one representing a data structure with randomized noise.</p>
<p>These scenarios depend on the original data under examination. </p>
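<p>The retention criteria above can be sketched on a toy matrix. The Python below is a minimal illustration, not the procedure from the tutorial: it estimates only the largest singular value by power iteration (an assumption made to avoid external libraries), keeps the matching rank-1 part, and leaves the rest as the residual carried by the smaller singular values. The matrix and all helper names are hypothetical.</p>

```python
import math


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(a, iterations=200):
    """Estimate the largest singular value and its vectors by power iteration on A^T A."""
    at = transpose(a)
    v = [1.0] * len(a[0])  # arbitrary starting vector (an assumption)
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))  # one step of (A^T A) v
        n = norm(w)
        v = [x / n for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return sigma, u, v


def split_by_largest(a):
    """Return the largest singular value, the rank-1 part it keeps, and the residual."""
    sigma, u, v = top_singular_triplet(a)
    kept = [[sigma * ui * vj for vj in v] for ui in u]
    residual = [[x - k for x, k in zip(row, krow)] for row, krow in zip(a, kept)]
    return sigma, kept, residual


# Hypothetical 2x2 data: a strong pattern (3) plus a weak one (1).
A = [[3.0, 0.0], [0.0, 1.0]]
sigma, kept, residual = split_by_largest(A)
print(sigma)
print(kept)
print(residual)
```

<p>Retaining <strong>kept</strong> corresponds to the usual noise minimization choice, while retaining <strong>residual</strong> corresponds to the reversed criterion. The sketch assumes a nonzero matrix and a starting vector that is not orthogonal to the dominant direction.</p>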
<p>In Image Compression, these approaches have already been explored. If the goal is a stability study and not just SVD dimensionality reduction, &#8220;the ratio between the highest singular value and the lowest singular value of the Jacobian matrix quantifies the spread of the Jacobian’s singular values, which in practice, reflects the extent of the solution’s instability with respect to small changes in the observation&#8221; (<a href="http://www.mathcs.emory.edu/~horesh/publications/thesis/thesis_all_in_one21.pdf">Horesh&#8217;s Thesis</a>).</p>
<p>Having said all that, we should not regard noise in a data set as something that must be discarded at all costs.</p>
<p>This is intimately linked with the so-called Inverse Problem. Incorporating noise and <em>a priori</em> SVD information can provide the complete information in a linear sense. Qianqian Fang has a beautiful PPT presentation, &#8220;<a href="http://bbs.dartmouth.edu/~fangq/Presentation/RIP2003/LookClosertoInverseProblem.ppt">Look Closer to Inverse Problem</a>&#8221;, on the subject. If you want to visualize the MATRIX Problem, this presentation is for you.</p>
<p>I&#8217;m thinking of putting together a tutorial on the Singular Value Expansion algorithm (SVE), if I ever find the time.</p>
<p>I hope this helps.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/22/when-noise-is-a-good-thing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Ethical Hacking: An Oxymoron, a Misnomer, or Both?</title>
		<link>http://irthoughts.wordpress.com/2009/05/18/ethical-hacking-an-oxymoron-a-misnomer-or-both/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/18/ethical-hacking-an-oxymoron-a-misnomer-or-both/#comments</comments>
		<pubDate>Mon, 18 May 2009 12:51:28 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=944</guid>
		<description><![CDATA[According to a report from the British Computer Society (BCS) covering a Security Panel Strategic Forum, &#8220;ethical hacking&#8221; is an oxymoron.
The report highlights do&#8217;s and don&#8217;ts when it comes to defining terms like &#8220;hacker&#8221;, &#8220;ethical hacking&#8221;, &#8220;penetration tester&#8221;, &#8220;white/black hats&#8221;, and derivative terms. These labels are frequently used in the IT industry. The report also [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=944&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>According to a report from the <a href="http://www.bcs.org/upload/pdf/ethical-hacking.pdf">British Computer Society</a> (BCS) covering a Security Panel Strategic Forum, &#8220;ethical hacking&#8221; is an oxymoron.</p>
<p>The report highlights do&#8217;s and don&#8217;ts when it comes to defining terms like &#8220;hacker&#8221;, &#8220;ethical hacking&#8221;, &#8220;penetration tester&#8221;, &#8220;white/black hats&#8221;, and derivative terms. These labels are frequently used in the IT industry. The report also underscores which terms should not be used by schools offering IT courses.</p>
<p>The problem with defining and redefining such labels is that there will always be others disagreeing with/circumventing said definitions.</p>
<p>For instance, in the December 1986 issue of MicroTimes, Bob Bickford wrote:</p>
<p>&#8220;A Hacker is any person who derives joy from discovering ways to circumvent limitations.&#8221;</p>
<p>If we accept this definition, then a person who <strong>doesn&#8217;t</strong> derive any joy from discovering ways to circumvent limitations <strong>is not</strong> a hacker. Similarly, a cheating spouse, an SEO, a spammer, a politician, a mobster, or a kid trying to get some candy from mom is a hacker.</p>
<p>I am taking this extreme, off-topic interpretation to illustrate the problem of semantics when it comes to defining things.</p>
<p>Whether you agree or disagree, partially or totally, with the report, it is a good read. It will certainly be a good piece for students planning to take my AIRWeb graduate course.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/18/ethical-hacking-an-oxymoron-a-misnomer-or-both/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Google Accused of Conversion-Inflation Syndication Fraud</title>
		<link>http://irthoughts.wordpress.com/2009/05/15/google-accused-of-conversion-inflation-syndication-fraud/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/15/google-accused-of-conversion-inflation-syndication-fraud/#comments</comments>
		<pubDate>Fri, 15 May 2009 14:35:24 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Marketing Research]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=939</guid>
		<description><![CDATA[According to Ben Edelman, Google is engaged in a conversion-inflation syndication fraud.
These tactics are nothing new.
In the featured article of the November 2008 issue of IR Watch, &#8220;Fraudulent Web Analytics &#8211; Engineering the Fraud&#8221;, we covered how in-the-middle mechanisms are part of Web Analytic Frauds and Business Collusion Schemes.
As in man-in-the-middle attacks found in information [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=939&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>According to Ben Edelman, Google is engaged in a <a href="http://www.benedelman.org/news/051309-1.html">conversion-inflation syndication fraud</a>.</p>
<p>These tactics are nothing new.</p>
<p>In the featured article of the November 2008 issue of IR Watch, <strong>&#8220;Fraudulent Web Analytics &#8211; Engineering the Fraud&#8221;</strong>, we covered how <em>in-the-middle mechanisms</em> are part of Web Analytic Frauds and Business Collusion Schemes.</p>
<p>As in man-in-the-middle attacks found in information security settings, the underlying goal is the same: the crafting of deceptive intermediary events.</p>
<p>Expect a PR damage-control campaign soon from the useful idiots/moles.</p>
<p>What is next? A class action lawsuit?</p>
<p>Still, I have a little taste of satisfaction in my mouth when crooks disguised as advertisers/search marketers are gamed. Gaming the gamers: Life&#8217;s ironies!</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/15/google-accused-of-conversion-inflation-syndication-fraud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IR Quiz: Matrices</title>
		<link>http://irthoughts.wordpress.com/2009/05/13/ir-quiz-matrices/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/13/ir-quiz-matrices/#comments</comments>
		<pubDate>Wed, 13 May 2009 12:15:59 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Quizzes]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=933</guid>
		<description><![CDATA[Explain and give an example of each of the following matrices used in IR:
1. Term-document occurrence matrix.
2. Term-term cooccurrence matrix.
3. Term-term correlation matrix.
4. Term-term similarity matrix.
5. Term-term coweights matrix.
6. Term-term distance matrix (*).
7. Covariance matrix (*).
 
(*) PS. I forgot to list these other matrices.
]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Explain and give an example of each of the following matrices used in IR:</p>
<p>1. Term-document occurrence matrix.</p>
<p>2. Term-term cooccurrence matrix.</p>
<p>3. Term-term correlation matrix.</p>
<p>4. Term-term similarity matrix.</p>
<p>5. Term-term coweights matrix.</p>
<p>6. Term-term distance matrix (*).</p>
<p>7. Covariance matrix (*).</p>
<p> </p>
<p>(*) PS. I forgot to list these other matrices.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/13/ir-quiz-matrices/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Vector Normalization with Excel &#8211; Part II</title>
		<link>http://irthoughts.wordpress.com/2009/05/07/vector-normalization-with-excel-part-ii/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/07/vector-normalization-with-excel-part-ii/#comments</comments>
		<pubDate>Thu, 07 May 2009 12:06:56 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=928</guid>
		<description><![CDATA[Back in March, we explained how to normalize column vectors with Excel. But what about normalizing row vectors? This question is addressed in the current QA column of IRW. I think the answer is worth sharing, since many readers are students struggling with similar questions. So, here we go.
The following table [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Back in March, we explained <a href="http://irthoughts.wordpress.com/2009/03/04/vector-normalization-with-excel/">how to normalize column vectors with Excel</a>. But what about normalizing row vectors? This question is addressed in the current QA column of IRW. I think the answer is worth sharing, since many readers are students struggling with similar questions. So, here we go.</p>
<p>The following table emulates an Excel array consisting of three columns (A, B, and C) and six rows (1-6).</p>
<table border="0" cellspacing="0" cellpadding="0" width="178">
<col span="1" width="37"></col>
<col span="1" width="47"></col>
<col span="1" width="48"></col>
<col span="1" width="46"></col>
<tbody>
<tr>
<td width="37" height="20"> </td>
<td width="47">A</td>
<td width="48">B</td>
<td width="46">C</td>
</tr>
<tr>
<td height="20">1</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td height="20">2</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<td height="20">3</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
<tr>
<td height="20">4</td>
<td>0.27</td>
<td>0.53</td>
<td>0.80</td>
</tr>
<tr>
<td height="20">5</td>
<td>0.46</td>
<td>0.57</td>
<td>0.68</td>
</tr>
<tr>
<td height="20">6</td>
<td>0.50</td>
<td>0.57</td>
<td>0.65</td>
</tr>
</tbody>
</table>
<p style="text-align:left;">Rows 1, 2, and 3 are row vectors. Rows 4, 5, and 6 are the corresponding normalized vectors, also known as unit vectors because their length is 1. To compute them, proceed as follows:</p>
<p>1. In cell A4, enter the formula =A1/(SQRT(SUMSQ($A1:$C1))). The result should match the value shown for this cell in the table (0.27). The $ signs fix columns A through C, so the denominator still covers the whole row when the formula is pasted into other columns, while the relative row number adjusts when it is pasted down.</p>
<p>2. Copy this formula, select cells A5 and A6, and paste it into them.</p>
<p>3. Finally, copy cells A4 through A6 together, select the remaining empty cells of the array (B4 through C6), and paste the formulas into them.</p>
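<p>As a cross-check outside Excel, the same computation can be sketched in Python. This is only a minimal sketch and not part of the IRW answer: the function name normalize_rows is hypothetical, and it assumes that no row consists entirely of zeros (otherwise the division fails).</p>

```python
import math


def normalize_rows(rows):
    """Divide every entry of a row by that row's Euclidean length.

    This mirrors the spreadsheet formula =A1/(SQRT(SUMSQ($A1:$C1))).
    Assumes no row is all zeros.
    """
    unit_rows = []
    for row in rows:
        # SUMSQ followed by SQRT: the length (L2 norm) of the row vector.
        length = math.sqrt(sum(x * x for x in row))
        unit_rows.append([x / length for x in row])
    return unit_rows


# Rows 1 to 3 of the table above.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for unit in normalize_rows(rows):
    print([round(x, 2) for x in unit])
```

<p>Rounded to two decimals, the three printed rows should agree with rows 4 through 6 of the table (Python prints 0.80 as 0.8).</p>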
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/07/vector-normalization-with-excel-part-ii/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>NSA/DHS Designates PUPR as a CAE</title>
		<link>http://irthoughts.wordpress.com/2009/05/05/nsadhs-designates-pupr-as-a-cae/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/05/nsadhs-designates-pupr-as-a-cae/#comments</comments>
		<pubDate>Tue, 05 May 2009 12:40:22 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Homeland Security]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=916</guid>
		<description><![CDATA[As blogged yesterday, the current issue of IRW should reach subscribers&#8217; inboxes today. The Top CS Departments column features Polytechnic University of Puerto Rico, where I teach graduate courses. As mentioned a few days ago, PUPR has been designated a CAE. This is great news that is making a splash across academic centers within the U.S., the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As blogged yesterday, the current issue of IRW should reach subscribers&#8217; inboxes today. The Top CS Departments column features Polytechnic University of Puerto Rico, where I teach graduate courses. As mentioned a few days ago, PUPR has been designated a CAE. This is great news that is making a splash across academic centers in the U.S., the Caribbean, and Latin America whose mission is research relevant to homeland security.</p>
<p>The Associate Director for Computer Science, Dr. Alfredo Cruz, sent me an official announcement, which I reproduce here.</p>
<blockquote><p>Polytechnic University of Puerto Rico (PUPR) is Designated National Center of Academic Excellence in Information Assurance Education by NSA and DHS. PUPR was recently designated as a National Center of Academic Excellence in Information Assurance Education (CAE/IAE) by the National Security Agency (NSA) and the Department of Homeland Security (DHS) on April 22, 2009. The goal of these centers is to reduce the vulnerability of the national information infrastructure by promoting higher education and research in Information Assurance (IA) and Security through the development of a growing number of professionals with IA expertise in various related disciplines. PUPR will be recognized as the first institution in Puerto Rico to be designated as a CAE/IAE on June 3, 2009 in Seattle, Washington. Dr. Alfredo Cruz from the Department of Electrical &amp; Computer Engineering and Computer Science will be present to receive the designation. He is the Director of the Center of Information Assurance for Research and Education (CIARE) at PUPR. Dr. Cruz is the person responsible for this designation. PUPR is one of the very few Hispanic-serving institutions (HSIs) in the Nation to receive this designation and to become one of the first 100 institutions nationwide; this is a very special recognition. This designation requires that the President of the United States send the Governor of Puerto Rico a certification that should be handed to the president of PUPR designating the Institution as a CAE/IAE at a National level. The Congress and all the respective Congressional Committees are also notified.</p>
<p>Some of the benefits of the CAE/IAE designation are:<br />
• PUPR will receive formal recognition from the U.S. Government as well as opportunities for prestige and publicity for our role in securing the Nation’s information systems.<br />
• This designation increases collaboration opportunities between designated and aspiring institutions at local and national levels. This includes internships, faculty and student exchange, research, and publications, among other activities.<br />
• With this designation as a CAE/IAE PUPR can obtain scholarships that can help outstanding students to pursue graduate studies in IA, enabling them to work with the Federal Government or other federal institutions and agencies.<br />
• PUPR can compete and benefit from proposal calls (RFP) that are specifically for designated CAE/IAE institutions. These proposals offer millions of dollars from the DoD, NSF, NSA and “Homeland Security”, among others, for research and infrastructure.<br />
• Student scholarships offered under the NSF&#8217;s Scholarship for Service (SFS) program. The SFS scholarship offers the following:<br />
&#8211;2-year scholarship, includes 8K stipend (12K for graduate students), plus tuition and nominal room and board expenses.<br />
&#8211;Paid summer internship in a federal agency.<br />
&#8211;Placement in federal government at the end of the scholarship period.</p></blockquote>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/05/nsadhs-designates-pupr-as-a-cae/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IRW: RIA Vulnerabilities</title>
		<link>http://irthoughts.wordpress.com/2009/05/04/irw-ria-vulnerabilities/</link>
		<comments>http://irthoughts.wordpress.com/2009/05/04/irw-ria-vulnerabilities/#comments</comments>
		<pubDate>Mon, 04 May 2009 14:09:50 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Newsletters]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=910</guid>
		<description><![CDATA[
The current issue of IRW should reach subscribers&#8217; inboxes tomorrow.
In this issue:
Feature article: RIA Vulnerabilities
This issue of the newsletter discusses how hackers might be exploiting Web vulnerabilities found in Rich Internet Applications (RIAs). As mentioned in our previous issue, some RIAs are based on Adobe’s technologies like Flash, Flex, or AIR. Some are designed to [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:center;"><img class="aligncenter" src="http://www.miislita.com/irw/ria-vulnerabilities.gif" alt="" /></p>
<p>The current issue of IRW should reach subscribers&#8217; inboxes tomorrow.</p>
<p>In this issue:</p>
<p>Feature article: RIA Vulnerabilities</p>
<blockquote><p>This issue of the newsletter discusses how hackers might be exploiting Web vulnerabilities found in Rich Internet Applications (RIAs). As mentioned in our previous issue, some RIAs are based on Adobe’s technologies like Flash, Flex, or AIR. Some are designed to be run online or offline. Their rising popularity has attracted developers and marketers, and -as expected- hackers and spammers.</p></blockquote>
<p>QA: Excel Vector Normalization: How do I convert a row vector into a unit vector?<br />
Who is Who in IR: C.J. van Rijsbergen<br />
Top CS Departments: Polytechnic University of Puerto Rico<br />
Historical Notes: ENIAC Computer<br />
Outstanding Graduate Theses<br />
Calls and Events<br />
Research Blogs<br />
and more&#8230;</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/05/04/irw-ria-vulnerabilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://www.miislita.com/irw/ria-vulnerabilities.gif" medium="image" />
	</item>
		<item>
		<title>No-Caching is Spammers Best Friend</title>
		<link>http://irthoughts.wordpress.com/2009/04/30/no-caching-is-spammers-best-friend/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/30/no-caching-is-spammers-best-friend/#comments</comments>
		<pubDate>Thu, 30 Apr 2009 14:10:44 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=905</guid>
		<description><![CDATA[Today I feel like giving a piece of advice to spammers, which should raise the bar in the &#8220;us versus them&#8221; Spam War. Think of this as a love-hate relationship.
C&#8217;mon spammers, I know you can do better. Don&#8217;t make it so easy for us in IR to neutralize your tactics. Heh, heh.
At the recent AIRWeb [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Today I feel like giving a piece of advice to spammers, which should raise the bar in the &#8220;us versus them&#8221; Spam War. Think of this as a love-hate relationship.</p>
<p>C&#8217;mon spammers, I know you can do better. Don&#8217;t make it so easy for us in IR to neutralize your tactics. Heh, heh.</p>
<p>At the recent AIRWeb workshop, Brian Davison presented the paper <a href="http://airweb.cse.lehigh.edu/2009/papers/p1-dai.pdf">Looking into the Past to Better Classify Web Spam</a>, which received strong reviews from referees and the audience.</p>
<p>Wannabe spammers, if you are really committed to spamdexing, at least know the how-tos. Don&#8217;t leave a temporal fingerprint of your web presence. Try this:</p>
<p>1. Prevent online resources, like the Wayback Machine and commercial search engines, from caching your web pages.</p>
<p>2. Use No-Cache and No-Archive.</p>
<p>3. Switch hosts whenever you can.</p>
<p>4. Constantly mutate your link structure.</p>
<p>5. Don&#8217;t profile yourself with easy-to-detect or predictable honeypots, link swapping, strongly connected component structures, etc.</p>
<p>Why give this advice? Check the current AIRWeb &#8220;gems&#8221;.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/30/no-caching-is-spammers-best-friend/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Microsoft, Inter-Metro to Co-Launch a MIC</title>
		<link>http://irthoughts.wordpress.com/2009/04/29/microsoft-inter-metro-to-co-launch-a-mic/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/29/microsoft-inter-metro-to-co-launch-a-mic/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 13:33:20 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Tools]]></category>
		<category><![CDATA[Marketing Research]]></category>
		<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=900</guid>
		<description><![CDATA[This afternoon, Microsoft in partnership with The Interamerican University of Puerto Rico, Metropolitan Campus (Inter-Metro) will announce that they are officially co-launching the Microsoft Innovation Center (MIC) of Puerto Rico.
This will be the first MIC in the region. A two-story building has been fitted out within the Inter-Metro campus for this project. As a member of the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>This afternoon, Microsoft in partnership with The Interamerican University of Puerto Rico, Metropolitan Campus (Inter-Metro) will announce that they are officially co-launching the Microsoft Innovation Center (MIC) of Puerto Rico.</p>
<p>This will be the first MIC in the region. A two-story building has been fitted out within the Inter-Metro campus for this project. As a member of the MIC steering committee, I have been invited to the presentation by President Manuel J. Fernos.</p>
<p>They have also provided me with office and lab space in the MIC building to put together the Internet Business Development Center (IBDC). The objective of the MIC is the development and commercialization of ecommerce-related software tools. Emphasis will be given to egovernment and ebusiness solutions.</p>
<p>It looks like I will split my schedule among being the IBDC principal investigator, attending MIC meetings, doing research at Inter-Metro, teaching at PUPR, and writing IRWs. This is exciting news. Let&#8217;s see how things go, especially with the other great news that PUPR&#8217;s ECE&amp;CS department has been accredited by NSA as a CAE.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/29/microsoft-inter-metro-to-co-launch-a-mic/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>AIRWeb 2009 Proceedings</title>
		<link>http://irthoughts.wordpress.com/2009/04/28/airweb-2009-proceedings/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/28/airweb-2009-proceedings/#comments</comments>
		<pubDate>Tue, 28 Apr 2009 15:13:00 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[AIRWeb Course]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=893</guid>
		<description><![CDATA[Here are the proceedings of AIRWeb 2009, available at http://airweb.cse.lehigh.edu/2009/proceedings.html
OK, SEOs, Spammers, and Hackers: start your engines and let the fun begin.
If you are a PUPR graduate student and are planning to take my AIR course, it might be a good idea to start browsing through these &#8220;gems&#8221;. Also check the previous AIRWeb proceedings.
Invited [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Here are the proceedings of AIRWeb 2009, available at <a href="http://airweb.cse.lehigh.edu/2009/proceedings.html">http://airweb.cse.lehigh.edu/2009/proceedings.html</a></p>
<p>OK, SEOs, Spammers, and Hackers: start your engines and let the fun begin.</p>
<p>If you are a PUPR graduate student and are planning to take my AIR course, it might be a good idea to start browsing through these &#8220;gems&#8221;. Also check the previous AIRWeb proceedings.</p>
<h4>Invited Talks</h4>
<p class="paper">The Potential for Research and Development in Adversarial Information Retrieval — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf">slides</a></p>
<p><span class="authors">Brian D. Davison </span></p>
<p class="paper">Web Spam Challenges: Looking Backward and Forward — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/castillo-challenges.pdf">slides</a></p>
<p><span class="authors">Carlos Castillo</span></p>
<h4>Temporal Analysis</h4>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p1-dai.pdf">Looking into the Past to Better Classify Web Spam</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/Dai- LookingintothePasttoBetterClassifyWeb.pdf">slides</a></p>
<p><span class="authors">Na Dai, Brian D. Davison and Xiaoguang Qi</span></p>
<div class="abstract">Web spamming techniques aim to achieve undeserved rankings in search results. Research has been widely conducted on identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.</div>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p9-chung.pdf">A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshots</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/airweb2009_chung.pdf">slides</a></p>
<p><span class="authors">Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa</span></p>
<div class="abstract">In this paper, we study the overall link-based spam structure and its evolution which would be helpful for the development of robust analysis tools and research for Web spamming as a social activity in the cyber space. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, so called the core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we can find new large link farms during each iteration and this trend continues until at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, the evolution of link farms is examined over two years. Results show that almost all large link farms do not grow anymore while some of them shrink, and many large link farms are created in one year.</div>
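<p>For students new to SCC decomposition, here is a minimal Python sketch of the idea. It is not the authors&#8217; implementation: it assumes Kosaraju&#8217;s two-pass algorithm, the function name strongly_connected_components is hypothetical, and the toy link list is made up for illustration.</p>

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Kosaraju's two-pass algorithm over a list of (source, target) links."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All outgoing links explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components


# Hypothetical toy web graph: pages a, b, c link in a cycle; d and e link to each other.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"), ("e", "d")]
components = strongly_connected_components(links)
core = max(components, key=len)  # the largest SCC, which the abstract calls the core
print(components)
print(core)
```

<p>In this toy graph the pages a, b, and c form the largest component and d and e form a second one. On a real Web snapshot, the components outside the largest one would be the candidates to inspect as link farms.</p>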
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p17-erdelyi.pdf">Web Spam Filtering in Internet Archives</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/erdelyi-timeline-spam- pres.pdf">slides</a></p>
<p><span class="authors">Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi</span></p>
<div class="abstract">While Web spam is targeted for the high commercial value of top-ranked search-engine results, Web archives observe quality deterioration and resource waste as a side effect. So far Web spam filtering technologies are rarely used by Web archivists but planned in the future as indicated in a survey with responses from more than 20 institutions worldwide. These archives typically operate on a modest level of budget that prohibits the operation of standalone Web spam filtering but collaborative efforts could lead to a high quality solution for them.<br />
In this paper we illustrate spam filtering needs, opportunities and blockers for Internet archives via analyzing several crawl snapshots and the difficulty of migrating filter models across different crawls via the example of the 13 .uk snapshots performed by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.</div>
<h4>Content Analysis</h4>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p21-martinez-romo.pdf">Web Spam Identification Through Language Model Analysis</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/juaner09airweb-pres.pdf">slides</a></p>
<p><span class="authors">Juan Martinez-Romo and Lourdes Araujo</span></p>
<div class="abstract">This paper applies a language model approach to different<br />
sources of information extracted from a Web page, in order<br />
to provide high quality indicators in the detection of<br />
Web Spam. Two pages linked by a hyperlink should be<br />
topically related, even though this were a weak contextual<br />
relation. For this reason we have analysed different sources<br />
of information of a Web page that belongs to the context of<br />
a link and we have applied Kullback-Leibler divergence on<br />
them for characterising the relationship between two linked<br />
pages. Moreover, we combine some of these sources of information<br />
in order to obtain richer language models. Given<br />
the different nature of internal and external links, in our<br />
study we also distinguished these types of links getting a<br />
significant improvement in classification tasks. The result<br />
is a system that improves the detection of Web Spam on<br />
two large and public datasets such as WEBSPAM-UK2006 and<br />
WEBSPAM-UK2007.</div>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p29-katayama.pdf">An Empirical Study on Selective Sampling in Active Learning for Splog Detection</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/Katayam-active_learning_blog_spam.pdf">slides</a></p>
<p><span class="authors">Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara</span></p>
<div class="abstract">This paper studies how to reduce the amount of human supervision<br />
for identifying splogs / authentic blogs in the context<br />
of continuously updating splog data sets year by year.<br />
Following the previous works on active learning, against the<br />
task of splog / authentic blog detection, this paper empirically<br />
examines several strategies for selective sampling in<br />
active learning by Support Vector Machines (SVMs). As a<br />
confidence measure of SVMs learning, we employ the distance<br />
from the separating hyperplane to each test instance,<br />
which have been well studied in active learning for text classification.<br />
Unlike those results of applying active learning<br />
to text classification tasks, in the task of splog / authentic<br />
blog detection of this paper, it is not the case that adding<br />
least confident samples performs best.</div>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p37-biro.pdf">Linked Latent Dirichlet Allocation in Web Spam Filtering</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/Siklosi- LinkedLDA.pdf">slides</a></p>
<p><span class="authors">István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr</span></p>
<div class="abstract">Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003)<br />
is a fully generative statistical language model on the content<br />
and topics of a corpus of documents. In this paper<br />
we apply an extension of LDA for web spam classification.<br />
Our linked LDA technique takes also linkage into account:<br />
topics are propagated along links in such a way that the<br />
linked document directly influences the words in the linking<br />
document. The inferred LDA model can be applied for<br />
classification as dimensionality reduction similarly to latent<br />
semantic indexing. We test linked LDA on the WEBSPAM-UK2007<br />
corpus. By using BayesNet classifier, in terms of<br />
the AUC of classification, we achieve 3% improvement over<br />
plain LDA with BayesNet, and 8% over the public link features<br />
with C4.5. The addition of this method to a log-odds<br />
based combination of strong link and content baseline classifiers<br />
results in a 3% improvement in AUC. Our method<br />
even slightly improves over the best Web Spam Challenge<br />
2008 result.</div>
<h4>Social Spam</h4>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf">Social Spam Detection</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/Markines-social_spam.pdf">slides</a></p>
<p><span class="authors">Benjamin Markines, Ciro Cattuto and Filippo Menczer</span></p>
<div class="abstract">The popularity of social bookmarking sites has made them prime<br />
targets for spammers. Many of these systems require an administrator’s<br />
time and energy to manually filter or remove spam. Here<br />
we discuss the motivations of social spam, and present a study<br />
of automatic detection of spammers in a social tagging system.<br />
We identify and analyze six distinct features that address various<br />
properties of social spam, finding that each of these features provides<br />
for a helpful signal to discriminate spammers from legitimate<br />
users. These features are then used in various machine learning<br />
algorithms for classification, achieving over 98% accuracy in detecting<br />
social spammers with 2% false positives. These promising<br />
results provide a new baseline for future efforts on social spam. We<br />
make our dataset publicly available to the research community.</div>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p49-neubauer.pdf">Tag Spam Creates Large Non-Giant Connected Components</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/airweb_neubauer.pdf">slides</a></p>
<p><span class="authors">Nicolas Neubauer, Robert Wetzker and Klaus Obermayer</span></p>
<div class="abstract">Spammers in social bookmarking systems try to mimic<br />
bookmarking behaviour of real users to gain the attention<br />
of other users or search engines. Several methods have been<br />
proposed for the detection of such spam, including domain-specific<br />
features (like URL terms) or similarity of users to<br />
previously identified spammers. However, as shown in our<br />
previous work, it is possible to identify a large fraction of<br />
spam users based on purely structural features. The hypergraph<br />
connecting documents, users, and tags can be decomposed<br />
into connected components, and any large, but non-giant<br />
components turned out to be almost entirely inhabited<br />
by spam users in the examined dataset. Here, we test<br />
to what degree the decomposition of the complete hypergraph<br />
is really necessary, examining the component structure<br />
of the induced user/document and user/tag graphs.<br />
While the user/tag graph&#8217;s connectivity does not help in<br />
classifying spammers, the user/document graph&#8217;s connectivity<br />
is already highly informative. It can however be augmented<br />
with connectivity information from the hypergraph.<br />
In our view, spam detection based on structural features, like<br />
the one proposed here, requires complex adaptation strategies<br />
from spammers and may complement other, more traditional<br />
detection approaches.</div>
<h4>Spam Research Collections</h4>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p53-jones.pdf">Nullification Test Collections for Web Spam and SEO</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/Jones- Nullification_test_collections_for_web_spam_an.pdf">slides</a></p>
<p><span class="authors">Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell</span></p>
<div class="abstract">Research in the area of adversarial information retrieval has<br />
been facilitated by the availability of the UK-2006/UK-2007<br />
collections, comprising crawl data, link graph, and spam labels.<br />
However, research into nullifying the negative effect<br />
of spam or excessive search engine optimisation (SEO) on<br />
the ranking of non-spam pages is not well supported by<br />
these resources. Nor is the study of cloaking techniques<br />
or of click spam. Finally, the domain-restricted nature of a<br />
.uk crawl means that only parts of link-farm icebergs may<br />
be visible in these crawls. We introduce the term nullification<br />
which we define as &#8220;preventing problem pages from<br />
negatively affecting search results&#8221;. We show some important<br />
differences between properties of current .uk-restricted<br />
crawls and those previously reported for the Web as a whole.<br />
We identify a need for an adversarial IR collection which is<br />
not domain-restricted and which is supported by a set of<br />
appropriate query sets and (optimistically) user-behaviour<br />
data. The billion-page unrestricted crawl being conducted<br />
by CMU (web09-bst) and which will be used in the 2009<br />
TREC Web Track is assessed as a possible basis for a new<br />
AIR test collection. We discuss the pros and cons of its scale,<br />
and the feasibility of adding resources such as query lists to<br />
enhance the utility of the collection for AIR research.</div>
</p>
<p class="paper"><a class="pdf" href="http://airweb.cse.lehigh.edu/2009/papers/p61-benczur.pdf">Web Spam Challenge Proposal for Filtering in Archives</a> — <a class="presc" href="http://airweb.cse.lehigh.edu/2009/slides/erdelyi- challenge-position-pres.pdf">slides</a></p>
<p><span class="authors">András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi</span></p>
<div class="abstract">In this paper we propose new tasks for a possible future Web Spam<br />
Challenge motivated by the needs of the archival community. The<br />
Web archival community consists of several relatively small institutions<br />
that operate independently and possibly over different top<br />
level domains (TLDs). Each of them may have a large set of historic<br />
crawls. Efficient filtering would hence require (1) enhanced<br />
use of the time series of domain snapshots and (2) collaboration by<br />
transferring models across different TLDs. Corresponding Challenge<br />
tasks could hence include the distribution of crawl snapshot<br />
data for feature generation as well as classification of unlabeled<br />
new crawls of the same or even different TLDs.</div>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/28/airweb-2009-proceedings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Marketing Professor Kills Three, Hurts Two</title>
		<link>http://irthoughts.wordpress.com/2009/04/26/marketing-professor-kills-three-hurts-two/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/26/marketing-professor-kills-three-hurts-two/#comments</comments>
		<pubDate>Sun, 26 Apr 2009 15:23:27 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Marketing Research]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=882</guid>
		<description><![CDATA[George M. Zinkhan III, from Terry College of Business at the University of Georgia, allegedly went on a killing rampage, killing his ex-wife and two others, and hurting two.
According to his university page (accessible at the time of writing), Zinkhan is a Coca-Cola Company Professor in the Department of Marketing and Distribution. Zinkhan is well known in [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://www.terry.uga.edu/profiles/?person_id=457">George M. Zinkhan III</a>, from Terry College of Business at the University of Georgia, allegedly went on a killing rampage, killing his ex-wife and two others, and hurting two.</p>
<p>According to his university page (accessible at the time of writing), Zinkhan is a Coca-Cola Company Professor in the Department of Marketing and Distribution. Zinkhan is well known in academic marketing research circles, having served as editor of the Journal of the Academy of Marketing Science.</p>
<p>His <a href="http://www.scribd.com/doc/14643257/zinkhanvitae">40-page CV</a> reveals he conducted extensive research on Marketing and Net Advertising.</p>
<p>In 2008 he was part of an <a href="http://www.newcommreview.com/?p=1104">American Marketing Association</a> committee that redefined marketing. The new definition reads:</p>
<blockquote><p>&#8220;Marketing is the activity, set of institutions, and processes for creating, communicating, delivering, and exchanging offerings that have value for customers, clients, partners, and society at large.&#8221;</p></blockquote>
<p>According to the AMA committee,</p>
<blockquote><p>&#8220;Marketing is no longer a function &#8212; it is an educational process.&#8221;</p></blockquote>
<p>Zinkhan published extensively with <a href="http://academic.udayton.edu/yuepan/resume.html">Yue Pan</a>, associate professor of marketing, University of Dayton. He published on the concept of Netvertising (&#8220;Netvertising Characteristics, Opportunities and Challenges: A Research Agenda,&#8221; International Journal of Internet Marketing &amp; Advertising, 1(3), 283-299). According to their <a href="http://inderscience.metapress.com/app/home/contribution.asp?referrer=parent&amp;backto=issue,4,6;journal,13,15;linkingpublicationresults,1:110872,1">abstract</a>:</p>
<blockquote><p>&#8220;Netvertising, or &#8220;advertising on the internet&#8221;, is attracting much attention from advertising and marketing researchers. However, surprisingly little is known about its new features as compared to other forms of advertising and the implications of the new medium for advertisers. Here, we focus on the following issues: the opportunities and challenges associated with internet advertising; the differences of netvertising from other forms of communication; banner ads – the most popular type of netvertising. Applying this framing perspective, we propose a research agenda for the study of netvertising.&#8221;</p></blockquote>
<p>Netvertising is something search marketers do using different out-of-thin-air theories/naming conventions.</p>
<p>Read more about the <a href="http://www.allbusiness.com/marketing-advertising/advertising-internet-advertising/330948-1.html">Netvertising Image Communication Model (NICM)</a></p>
<p>That was then. Today Zinkhan&#8217;s name is associated with a Negative Image on the Net. It is only a matter of time before others dissociate themselves from such an image. Life&#8217;s ironies!</p>
<p><a href="http://www.ajc.com/news/content/metro/stories/2009/04/25/zinkhan_professor_shoot.html">He didn&#8217;t seem to fit</a> the academic stereotype.</p>
<p>Unfortunately, as in any profession, some people cannot cope with their personal misfortunes and end up doing bad things.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/26/marketing-professor-kills-three-hurts-two/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Hackers Hit Pentagon</title>
		<link>http://irthoughts.wordpress.com/2009/04/22/hackers-hit-pentagon/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/22/hackers-hit-pentagon/#comments</comments>
		<pubDate>Wed, 22 Apr 2009 05:00:59 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[AIRWeb Course]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Newsletters]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=875</guid>
		<description><![CDATA[It happened again: Thanks to Web vulnerabilities, hackers were able to hit the Pentagon. 
According to CNN (http://www.cnn.com/2009/US/04/21/pentagon.hacked/),
Thousands of confidential files on the U.S. military&#8217;s most technologically advanced fighter aircraft have been compromised by unknown computer hackers over the past two years, according to senior defense officials.
The Internet intruders were able to gain access to data [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:left;">It happened again: Thanks to Web vulnerabilities, hackers were able to hit the Pentagon. </p>
<p style="text-align:left;">According to CNN <strong>(<a href="http://www.cnn.com/2009/US/04/21/pentagon.hacked/">http://www.cnn.com/2009/US/04/21/pentagon.hacked/</a>), </strong></p>
<blockquote><p>Thousands of confidential files on the U.S. military&#8217;s most technologically advanced fighter aircraft have been compromised by unknown computer hackers over the past two years, according to senior defense officials.</p>
<p>The Internet intruders were able to gain access to data related to the design and electronics systems of the Joint Strike Fighter through computers of Pentagon contractors in charge of designing and building the aircraft, according to the officials, who did not want to be identified because of the sensitivity of the issue.</p>
<p>In addition to files relating to the aircraft, hackers gained entry into the Air Force&#8217;s air traffic control systems, according to the officials. Once they got in, the Internet hackers were able to see such information as the locations of U.S. military aircraft in flight.</p></blockquote>
<p style="text-align:left;">This news is quite relevant to my Fall 2009 Web Vulnerability graduate course (<a href="http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf">http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf</a>)</p>
<p style="text-align:left;">BTW, Dr. Alfredo Cruz, Associate Director of the CS Department at PUPR.edu and also a colleague and friend, called me two days ago with some great news: The department has been accredited for 2009-2014 as a National Center of Academic Excellence in Information Assurance Education. Soon they will be listed with members of this exclusive &#8220;club&#8221; on the National Security Agency web site (<a href="http://www.nsa.gov/ia/academic_outreach/nat_cae/institutions.shtml">http://www.nsa.gov/ia/academic_outreach/nat_cae/institutions.shtml</a>).</p>
<p style="text-align:left;">An official press release and a formal presentation before the pertinent authorities are being coordinated for the next few weeks.</p>
<p style="text-align:left;">The next issue of IR Watch &#8211; The Newsletter provides additional coverage of this exciting news.</p>
<p style="text-align:left;">I have tied these two news items together in a single post to underscore the need for IR/data mining courses at the intersection with Information Security, which is precisely the mission statement of IRW, now reaching more than 300 investigators/research centers.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/22/hackers-hit-pentagon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>McAfee Report: Email Spam and the Environment</title>
		<link>http://irthoughts.wordpress.com/2009/04/16/mcafee-report-email-spam-and-the-environment/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/16/mcafee-report-email-spam-and-the-environment/#comments</comments>
		<pubDate>Thu, 16 Apr 2009 12:57:34 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[AIRWeb Course]]></category>
		<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=873</guid>
		<description><![CDATA[According to a McAfee report,
Until now, spam&#8217;s impact has been measured in time, money, and aggravation. It turns out there is a massive environmental impact as well. McAfee recently commissioned climate-change consultant ICF International and spam expert Richi Jennings to calculate the environmental impact of spam. The results that came back were startling: The energy [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>According to a McAfee report,</p>
<blockquote><p>Until now, spam&#8217;s impact has been measured in time, money, and aggravation. It turns out there is a massive environmental impact as well. McAfee recently commissioned climate-change consultant ICF International and spam expert Richi Jennings to calculate the environmental impact of spam. The results that came back were startling: The energy consumed in transmitting and deleting spam is equivalent to the electricity used in 2.4 million U.S. homes, with greenhouse gas (GHG) emissions equivalent to 3.1 million passenger cars (<a href="http://resources.mcafee.com/content/NACarbonFootprintSpam">http://resources.mcafee.com/content/NACarbonFootprintSpam</a>)</p></blockquote>
<p>I first learned about these findings through ABC. Essentially,</p>
<blockquote><p>Anything powered by electricity also emits greenhouse gases. McAfee researchers say each junk e-mail emits 0.3 grams of the greenhouse gas carbon dioxide (CO2). That may not sound like much, but when you consider the volume of global annual spam, it all adds up. (<a href="http://abcnews.go.com/Technology/GlobalWarming/story?id=7343518&amp;page=1">http://abcnews.go.com/Technology/GlobalWarming/story?id=7343518&amp;page=1</a>).</p></blockquote>
<p>Following that reasoning, spamdexing search engines and any other adversarial information retrieval (AIR) practice also add insult to injury, as do many other things that come to mind.</p>
<p>I will tell that to students of my Fall 2009 AIRWeb Course.</p>
<p>Hmm, shocking: AIR vs. Environment.</p>
<p>I never thought about such an obvious connection.  <img src='http://s.wordpress.com/wp-includes/images/smilies/face-smile.png' alt=':)' class='wp-smiley' /> </p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/16/mcafee-report-email-spam-and-the-environment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Why IDF is Expressed Using Logs</title>
		<link>http://irthoughts.wordpress.com/2009/04/15/why-idf-is-expressed-using-logs/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/15/why-idf-is-expressed-using-logs/#comments</comments>
		<pubDate>Wed, 15 Apr 2009 16:04:42 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[SEO Myths]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=865</guid>
		<description><![CDATA[Recently a known SEO (name reserved) asked me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.
I am partially reproducing/editing my responses, so that they might help other SEOs with similar questions.
Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.
1) Why [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:left;">Recently a known SEO (name reserved) asked me about some aspects of IDF (Inverse Document Frequency). Below are three of his questions.</p>
<p style="text-align:left;">I am partially reproducing/editing my responses, so that they might help other SEOs with similar questions.</p>
<blockquote><p>Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.</p>
<p>1) Why is a log function used for calculating IDF?<br />
3) Would it be accurate to describe IDF as &#8220;the ratio of documents in a collection to documents in that collection with a given term&#8221;? I&#8217;m guessing your answer would be, IDF is the [LOG of " the ratio of documents in a collection to documents in that collection with a given term"]? Which brings us back to question, I guess? hehe</p>
<p>These are recurring questions that students have asked me before. Logs are used because of two assumptions frequently made in most IR models; i.e.,</p>
<p>I. that scoring functions are additive.<br />
II. that terms are independent.</p>
<p>While in some models II might not be present, both (I and II) play well with logs, since the log of a product is a sum of logs.</p>
<p>These functions, and the reason for using logs, are explained in the recent RSJ-PM Tutorial <a href="http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf">http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf</a></p>
<p>Document Frequency (DF) is defined as d/D, where d is the number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).</p>
<p>But since D &gt; d in most cases, log(d/D) gives a negative value. To get rid of the negative sign, we simply invert the ratio inside the log expression. Essentially, taking logs compresses the scale of values so that very large or very small quantities can be compared smoothly. Now log(D/d) is conveniently called Inverse Document Frequency.</p>
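<p>As a rough numerical illustration of these definitions, the following Python sketch computes d/D and log(D/d) for a few made-up document counts. The collection size, the counts, and the choice of base-10 logs are assumptions for illustration only; the text above does not fix a log base.</p>

```python
import math

def document_frequency(d, D):
    """Return d/D: the fraction of the D documents that contain the term."""
    return d / D

def inverse_document_frequency(d, D):
    """Return log(D/d). Base 10 is an assumption; another base only rescales the values."""
    return math.log10(D / d)

# Hypothetical collection of one million documents (not from the original post).
D = 1_000_000
for d in (10, 1_000, 100_000):
    df = document_frequency(d, D)
    idf = inverse_document_frequency(d, D)
    # log(d/D) is negative here; log(D/d) is the same number with the sign flipped.
    print(f"d={d:7d}  d/D={df:.5f}  log(d/D)={math.log10(df):.2f}  IDF={idf:.2f}")
```

<p>The rarer the term (smaller d), the larger its IDF, and a spread of several orders of magnitude in d/D collapses into a handful of units on the log scale.</p>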
<p>Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it must follow that for any two events, A and B,</p>
<p>p(AB) = p(A)p(B).</p>
<p>Taking logs we can write</p>
<p>log[p(AB)] = log[p(A)]+ log[p(B)]</p>
<p>It is easy to show that for two terms</p>
<p>log(d12/D) = log(d1/D) + log(d2/D)</p>
<p>Inverting and using the definition of IDF we end up with</p>
<p>IDF12 = IDF1 + IDF2</p>
<p>validating assumption I; that IDF as a scoring function is additive.</p>
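<p>The additivity result can be checked numerically with a minimal sketch. The counts below are hypothetical, base-10 logs are assumed, and the co-occurrence count d12 is not observed but constructed so that the independence assumption holds exactly.</p>

```python
import math

def idf(d, D):
    """IDF as defined above: log(D/d). Base-10 logs are an assumption."""
    return math.log10(D / d)

# Hypothetical counts (not from the original post).
D = 10_000          # documents in the collection
d1, d2 = 500, 200   # documents containing term 1 and term 2

# Under independence, p(AB) = p(A)p(B), so the expected number of
# documents containing both terms is D * (d1/D) * (d2/D).
d12 = D * (d1 / D) * (d2 / D)

idf1, idf2, idf12 = idf(d1, D), idf(d2, D), idf(d12, D)
print(f"IDF1 + IDF2 = {idf1 + idf2:.4f}")
print(f"IDF12       = {idf12:.4f}")
```

<p>The two printed values agree up to floating-point rounding, but only because d12 was built from the product rule.</p>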
<p>That is, the IDF of a two-term query is the sum of the individual IDF values. However, this is only valid if terms are independent from one another. If terms are not independent we would have two possibilities; i.e.,</p>
<p>p(AB) &gt; p(A)p(B)</p>
<p>or</p>
<p>p(AB) &lt; p(A)p(B)</p>
<p>and we cannot say that the IDF of a two-term query (e.g., a phrase) is the sum of the individual IDF values. Assuming the contrary, as many SEOs do in order to promote some dumb keyword research tools, is plain snake oil.</p>
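<p>To see how far the sum can drift once p(AB) differs from p(A)p(B), here is a small sketch. The function name phrase_idf_gap and all counts are hypothetical, and base-10 logs are assumed; independence would predict a co-occurrence count of 10 for these numbers.</p>

```python
import math

def idf(d, D):
    """IDF as log(D/d); base-10 logs are an assumption."""
    return math.log10(D / d)

def phrase_idf_gap(d1, d2, d12, D):
    """Hypothetical helper: actual two-term IDF minus the additive estimate IDF1 + IDF2."""
    return idf(d12, D) - (idf(d1, D) + idf(d2, D))

# Hypothetical counts (not from the original post).
D = 10_000
d1, d2 = 500, 200

# Made-up observed co-occurrence counts for dependent terms.
for label, d12 in (("terms tend to co-occur", 150), ("terms tend to avoid each other", 2)):
    gap = phrase_idf_gap(d1, d2, d12, D)
    print(f"{label}: d12={d12}, IDF12={idf(d12, D):.3f}, "
          f"IDF1+IDF2={idf(d1, D) + idf(d2, D):.3f}, gap={gap:+.3f}")
```

<p>When the terms co-occur more often than independence predicts, the additive estimate overstates the IDF of the pair; when they co-occur less often, it understates it.</p>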
<p>2) What do you mean by &#8216;discriminatory power&#8217; in the phrase &#8220;IDF is a measure of the discriminatory power of a term in a collection&#8221;?</p>
<p>This is a legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms that are used too frequently are not good discriminators between documents. If a term is used in too many documents, its ability to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.</p></blockquote>
<p style="text-align:left;">The RSJ-PM Tutorial mentioned above was written to put to rest some misconceptions regarding IDF. In it we explain why Robertson and Sparck Jones consider IDF a particular RSJ weight in the absence of relevance information.</p>
<p style="text-align:left;">In a nutshell, IDF is a collection-wide estimate, and as such it carries no information on whether the documents containing the queried terms are relevant to them. Similarly, whether documents not containing the query terms are relevant is unknown, and this often remains unresolved when we just look at the d/D and d/(D &#8211; d) collection-wide ratios. All we can say is that relevant documents might have a higher probability of containing query terms than other documents from the collection as a whole. But we could make that assertion without resorting to IDF as well.</p>
<p style="text-align:left;">Web documents are often about multiple topics. Many documents aggregate content from dissimilar sources (news headlines, RSS, blogs, etc.) and their content might change over time. The mere mention of a term (regardless of repetition) is not proof of its relevance or of its importance with respect to the topics discussed in a document.</p>
<p style="text-align:left;">Thus, the idea that we can assess whether terms are relevant to a document by simply comparing IDF values misses the whole point and defeats the purpose for which the RSJ-PM model and many of its variants (e.g., BM25) were developed.</p>
<p style="text-align:left;">I hope this helps to clear up some SEO misconceptions on the topic.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/15/why-idf-is-expressed-using-logs/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Finally SEOs are getting the LSI Myth!</title>
		<link>http://irthoughts.wordpress.com/2009/04/09/finally-seos-are-getting-the-lsi-myth/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/09/finally-seos-are-getting-the-lsi-myth/#comments</comments>
		<pubDate>Thu, 09 Apr 2009 17:47:47 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Latent Semantic Indexing]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=854</guid>
		<description><![CDATA[If you search this blog (IRThoughts) for LSI or visit its Latent Semantic Indexing category you will find many posts wherein SEO LSI Myths are debunked. Prior to this wordpress blog I used to maintain a personal blog wherein SEO myths regarding LSI were also debunked.
Over the years, many realized they were taken in by the usual [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:left;">If you search this blog (IRThoughts) for LSI or visit its Latent Semantic Indexing category you will find many posts wherein SEO LSI Myths are debunked. Prior to this wordpress blog I used to maintain a personal blog wherein SEO myths regarding LSI were also debunked.</p>
<p style="text-align:left;">Over the years, many realized they were taken in by the usual agents of misinformation, at least when it comes to &#8220;SEO LSI&#8221; and &#8220;LSI-Friendly&#8221; documents.</p>
<p style="text-align:left;">Recently, I found traffic coming from a blog discussion about a video (<a href="http://www.stomperblog.com/warning-advanced-seo-technique-does-not-work/">http://www.stomperblog.com/warning-advanced-seo-technique-does-not-work/</a>) wherein LSI in relation to Google is debunked.</p>
<p style="text-align:left;">The video also discusses one flavor of LSI; i.e., one wherein the weights are tf-IDF weights. This flavor does not incorporate relevance information or entropy information, as other LSI variants do.</p>
<p style="text-align:left;">The video does a good job of debunking LSI Myths. However, it contains at least one factually incorrect argument about how the SVD algorithm works.</p>
<p style="text-align:left;">The video gives an example implying that SVD works by reducing a large set of words to a few words, such that, for example, thousands of words are reduced to, let&#8217;s say, 300 words. This is incorrect, and it is certainly not a trivial flaw.</p>
<p style="text-align:left;">SVD does not work by reducing a vocabulary, but by reducing dimensions, and there are as many dimensions as singular values. This is why it is called a dimensionality-reduction algorithm and not a vocabulary-reduction algorithm. I should stress that an LSI Space is not like a Term Space, wherein each term is a dimension such that there is a 1:1 correspondence.</p>
<p style="text-align:left;">In LSI, the SVD algorithm is used to reduce the dimensions of a matrix; that is, the number of singular values that are kept.</p>
<p style="text-align:left;">For instance in our SVD and LSI Tutorial series at</p>
<p style="text-align:left;"><a href="http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html">http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html</a></p>
<p style="text-align:left;">we present an LSI example consisting of many words and few initial dimensions, such that for the initial matrix</p>
<p style="text-align:left;">#words &gt;&gt; #initial dimensions</p>
<p style="text-align:left;">More specifically, we used 11 words and 3 dimensions.</p>
<p style="text-align:left;">After truncation, we ended up with 11 words and 2 dimensions.</p>
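<p style="text-align:left;">The same bookkeeping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 11 x 3 count matrix below is made up (it is not the matrix from the tutorial), and the singular values are obtained with a simple power iteration on the Gram matrix rather than with a production SVD routine. The point is that truncation drops a dimension while every term keeps its row.</p>

```python
import math

# Hypothetical 11-term x 3-document count matrix (rows are terms, columns are documents).
A = [
    [1, 0, 1], [1, 1, 0], [0, 1, 1], [2, 0, 0], [0, 2, 0], [0, 0, 2],
    [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [3, 1, 0],
]

def gram(a):
    """Return A^T A, whose eigenvalues are the squared singular values of A."""
    cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(cols)] for i in range(cols)]

def top_eigenpairs(sym, k, iters=500):
    """Power iteration with deflation on a symmetric positive semidefinite matrix."""
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

k = 2  # number of dimensions kept after truncation
pairs = top_eigenpairs(gram(A), k)
singular_values = [math.sqrt(lam) for lam, _ in pairs]

# Term coordinates in the reduced space: A V_k, which equals U_k S_k.
term_coords = [[sum(row[j] * v[j] for j in range(len(row))) for _, v in pairs] for row in A]

print(len(A), "terms x", len(A[0]), "dimensions before truncation")
print(len(term_coords), "terms x", len(term_coords[0]), "dimensions after truncation")
```

<p style="text-align:left;">All 11 terms are still there after truncation; only the number of coordinates per term went from 3 to 2.</p>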
<p style="text-align:left;">Other than this, the video is fun to watch, but it ends up being an introductory promotion for another SEO proposal.</p>
<p style="text-align:left;"> PS.</p>
<p style="text-align:left;">After reviewing the video several times, I unfortunately found that it contains another incorrect argument.</p>
<p style="text-align:left;">When arguing that Google might not use LSI, the video claims that LSI has to return the same results when word variants such as plurals and tenses are used. This might be the case if stemming is heavily used in an LSI implementation, but stemming is not a requirement for implementing LSI at all.</p>
<p style="text-align:left;">When stemming is not implemented, the SVD reduction will certainly return different results, since these variants enter the original term-doc matrix that undergoes decomposition as different tokens.</p>
<p style="text-align:left;">The video also misses where the power of LSI comes from: higher-order co-occurrence connectivity paths hidden (latent) in the original matrix. Terms do not have to be synonyms, related terms, or even derivative forms for these hidden paths to be observed in LSI.</p>
<p style="text-align:left;">Terms need not be related terms either to end up clustered with LSI. It is the hidden co-occurrence patterns that are behind the clustering. For example, in our SVD and LSI tutorial above, we intentionally used stopwords and zero synonyms/related terms, and these ended up in their corresponding clusters without necessarily being semantically related. This simple example shows that in LSI the SVD algorithm produces an output based on crunching numbers, not on making sense out of meaning or intelligence, and it contradicts the generalized opinion that LSI works at the level of meaning.</p>
<p style="text-align:left;">I have to conclude that while the video is intended to debunk LSI SEO myths (a noble effort), it uses incorrect arguments and hearsay lines from around the Web. Debunking hearsay with more hearsay: What a shame.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/09/finally-seos-are-getting-the-lsi-myth/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IRW Newsletter: Web &amp; Data Mining with RIAs</title>
		<link>http://irthoughts.wordpress.com/2009/04/08/irw-newsletter-web-data-mining-with-rias/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/08/irw-newsletter-web-data-mining-with-rias/#comments</comments>
		<pubDate>Wed, 08 Apr 2009 13:39:57 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=847</guid>
		<description><![CDATA[
The current issue of IRW should be in subscribers' inboxes today or tomorrow, at the latest.
In this issue of the newsletter we cover Rich Internet Applications (RIAs) and how these can be used for Web/Data Mining. A RIA is a browser-independent application that can be compiled and run from the desktop.
In this issue:
Featured article: Web [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:center;"><img class="aligncenter" src="http://www.miislita.com/irw/rias.gif" alt="RIAs" /></p>
<p>The current issue of IRW should be in subscribers' inboxes today or tomorrow, at the latest.</p>
<p>In this issue of the newsletter we cover Rich Internet Applications (RIAs) and how these can be used for Web/Data Mining. A RIA is a browser-independent application that can be compiled and run from the desktop.</p>
<p>In this issue:</p>
<p>Featured article: Web &amp; Data Mining with RIAs<br />
QA: Recommended RIAs<br />
Who is Who in IR: Bruce Croft<br />
Top CS Departments: UMass, Amherst<br />
Historical Notes: John von Neumann and Bugs<br />
Outstanding Graduate Theses<br />
Calls and Events<br />
Research Blogs<br />
and more&#8230;</p>
<p>IRW currently reaches a fine audience of university and government researchers and their labs. If you are a graduate student or IR practitioner and want to be known within this exclusive circle, submit a short article (2-3 pages, IRW format, free from marketing and sales pitches) for consideration.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/08/irw-newsletter-web-data-mining-with-rias/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://www.miislita.com/irw/rias.gif" medium="image">
			<media:title type="html">RIAs</media:title>
		</media:content>
	</item>
		<item>
		<title>Vector Space, Probabilistic LSI, and LDA</title>
		<link>http://irthoughts.wordpress.com/2009/04/03/vector-space-probabilistic-lsi-and-lda/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/03/vector-space-probabilistic-lsi-and-lda/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 13:17:07 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Latent Semantic Indexing]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=839</guid>
		<description><![CDATA[ 
source: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
There is some buzz about Probabilistic Latent Semantic Indexing, so here goes this post.
From VSM to LSI
Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:center;"> <img src="http://www.miislita.com/blog/images/lda.gif" alt="lda" /><br />
source: <a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf</a></p>
<p>There is some buzz about <a href="http://sphinn.com/story/82765">Probabilistic Latent Semantic Indexing</a>, so here goes this post.</p>
<p><strong>From VSM to LSI</strong></p>
<p>Prior to 1988 the prevalent IR model was Salton’s Vector Space Model (VSM). This model treats documents and queries as vectors in a multidimensional space. In this space a query is treated just as another document. In this term space, it is not possible to assign a position to terms, simply because the terms are the dimensions of the space. The coordinate values assigned to document and query vectors are given by term weights computed using a particular weighting scheme.</p>
<p>VSM and its many variants are based on matching query terms to terms found in documents. These models assume term independence. However, we know this assumption is not necessarily correct since terms can be dependent via (a) synonymy and (b) polysemy.</p>
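<p>A minimal sketch of term matching in a term space may help here. Everything in it is an assumption made for illustration: the two tiny documents, the raw term-frequency weighting (one of many possible weighting schemes), and the helper names <em>tf_vector</em> and <em>cosine</em>. It shows a query being treated as just another vector, and how a synonym that never literally matches scores zero.</p>

```python
import math
from collections import Counter

def tf_vector(text, vocabulary):
    """Raw term-frequency weights, one coordinate per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["car insurance rates", "automobile insurance quotes"]
vocabulary = sorted({term for doc in docs for term in doc.split()})
doc_vectors = [tf_vector(doc, vocabulary) for doc in docs]

# The query lives in the same term space as the documents.
query = tf_vector("car insurance", vocabulary)
scores = [cosine(query, d) for d in doc_vectors]
print(scores)

# Pure term matching cannot see that "automobile" and "car" are synonyms.
print(cosine(tf_vector("automobile", vocabulary), doc_vectors[0]))
```

<p>The first document outranks the second only because it shares more literal terms with the query, which is exactly the limitation that synonymy exposes.</p>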
<p>In 1988, Dumais and co-workers at Bellcore (now Telcordia) published two papers in which they applied Golub and Kahan’s 1965 SVD algorithm to “documents” exhibiting (a) and (b) and called that Latent Semantic Indexing (LSI).</p>
<p>LSI became an improvement over the simplistic point of view of term matching, accounting for term dependencies. The “documents” were not HTML Web documents (there were no Web documents back then), but just abstracts and memos from specific knowledge domains (HCI, scientific, medical). As expected, these consisted of synonyms and related terms used in those domains. Thus, clusters of these were obtained.</p>
<p>It was immediately claimed that LSI could be used to model basic linguistic phenomena, such as synonymy and polysemy, and how the human mind associates words with concepts and concepts with meaning.</p>
<p>Fast-forward twenty years: SEOs misread this outdated research and the synonym-stuffing myth was born.</p>
<p>There is now a crew of SEOs claiming that they can design &#8220;LSI-friendly&#8221; documents by making them rich in synonyms and related terms. We have demonstrated via our SVD and LSI tutorial series why this is not possible. These marketers are simply inventing LSI Myths out of thin air in order to better market whatever they sell or promote (often their own image as &#8220;experts&#8221;). The same goes for those who claim &#8220;PLSI-SEO&#8221; strategies.</p>
<p>Research findings suggest that what makes LSI work are the first- and higher-order co-occurrence paths hidden in the term-term LSI matrix. These paths are responsible for how and why term weights are redistributed in a truncated term-document matrix. Altering terms (even a single term) of this matrix provokes a redistribution of term weights across the entire matrix, whose outcome cannot be predicted. This is why “LSI-friendly” documents are plain SEO snake oil. Again, the same goes for those who claim &#8220;PLSI-SEO&#8221; strategies. Keep reading.</p>
<p><strong>Enter the </strong><a href="http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf" target="blank"><strong>Probabilistic Latent Semantic Indexing (PLSI) model</strong></a></p>
<p>In 1998 LSI was called into question: given a generative model of text, why adopt LSI when one could use Bayesian or maximum likelihood methods and fit the model to the data?</p>
<p>In 1999, Thomas Hofmann presented the Probabilistic Latent Semantic Indexing (PLSI) model, also known as the Aspect Model, as an alternative to LSI. PLSI (or PLSA) models each word in a document as a sample from a mixture model. The mixture components are multinomial random variables viewed as representations of topics.</p>
<p>Each word is generated from a single topic, and different words in a document can be generated from different topics. In this model each document is represented as a list of mixing proportions for these mixture components. Thus, documents are reduced to a probability distribution over a set of topics, which is the expected &#8220;reduced description&#8221; associated with the document.</p>
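<p>As a rough sketch of that &#8220;reduced description&#8221;, the snippet below mixes two topics into a per-document word distribution. All the numbers, the four-word vocabulary, and the names are invented for illustration; in the actual model these probabilities would be fitted to data rather than written by hand.</p>

```python
# Hypothetical two-topic aspect model over a four-word vocabulary.
vocabulary = ["gene", "dna", "ball", "goal"]

# Each topic is a multinomial over the vocabulary (each row sums to 1).
p_word_given_topic = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0
    [0.05, 0.05, 0.50, 0.40],  # topic 1
]

# Each document is reduced to a list of mixing proportions over the topics.
p_topic_given_doc = {"doc_a": [0.9, 0.1], "doc_b": [0.3, 0.7]}

def word_distribution(doc_id):
    """P(w | d) = sum over topics z of P(z | d) * P(w | z)."""
    mix = p_topic_given_doc[doc_id]
    return [
        sum(mix[z] * p_word_given_topic[z][w] for z in range(len(mix)))
        for w in range(len(vocabulary))
    ]

for doc_id in p_topic_given_doc:
    dist = word_distribution(doc_id)
    print(doc_id, [round(p, 3) for p in dist])
```

<p>Note that the mixing proportions are stored per document, one list for each document in the training set.</p>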
<p>But there is a problem.</p>
<p><strong>Enter the </strong><a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf" target="blank"><strong>Latent Dirichlet Allocation Model (LDA)</strong></a></p>
<p>By 2003 Hofmann’s PLSI model was called into question, this time by David Blei, Andrew Ng and Michael Jordan, who that year proposed the Latent Dirichlet Allocation Model (LDA). As noted by Blei et al. (and I quote), PLSI &#8220;is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers.&#8221;</p>
<p>Blei and co-workers then stated that this leads to two problems:</p>
<p>1. the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting</p>
<p>2. it is not clear how to assign probability to a document outside of the training set.</p>
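<p>The first problem is easy to see with a back-of-the-envelope count. The sketch below assumes the parameter counts as I read them in the Blei, Ng and Jordan paper (kV + kM for a k-topic PLSI model with V words and M training documents, versus k + kV for LDA); the function names and the sizes plugged in are hypothetical.</p>

```python
def plsi_parameters(k, V, M):
    """k topics x V words, plus k mixing proportions for each of M training documents."""
    return k * V + k * M

def lda_parameters(k, V):
    """k Dirichlet parameters plus k topics x V words; no per-document term."""
    return k + k * V

k, V = 100, 10_000  # hypothetical number of topics and vocabulary size
for M in (1_000, 10_000, 100_000):
    print(M, plsi_parameters(k, V, M), lda_parameters(k, V))
```

<p>Every extra training document adds k parameters to the PLSI count, while the LDA count does not depend on the size of the corpus.</p>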
<p>Thus, it is not true that PLSI is the preferred model to work with in IR, as some have claimed. In addition, the model has non-trivial theoretical flaws and limitations.</p>
<p>In Salton’s Term Vector Model, as in the LSI and PLSI models, word order does not matter. Documents are simply treated as a &#8220;bag of words&#8221;. However, common sense dictates that this is not a valid assumption, since word semantics is sensitive to word order. This explains why Google searches for <em>college junior</em> and <em>junior college</em> produce very different results.</p>
<p>To underscore the importance of word ordering, consider this: applying a similarity measure like a Jaccard Coefficient computed from a term-term matrix to the above two queries produces identical scores for both, so the computed similarity is disconnected from word semantics.</p>
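<p>A quick way to see this is to compare the two queries directly. The sketch below is a simplification: it computes a set-based Jaccard coefficient on the query terms rather than one derived from a term-term matrix, and the helper name <em>jaccard</em> is mine.</p>

```python
from collections import Counter

def jaccard(a, b):
    """Set-based Jaccard coefficient between the terms of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

q1, q2 = "college junior", "junior college"

# Both queries collapse to the same bag of words.
print(Counter(q1.split()) == Counter(q2.split()))

# Any measure built only on that bag cannot tell the two queries apart.
print(jaccard(q1, q2))
```

<p>The two queries are indistinguishable to the measure even though they mean different things.</p>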
<p>Blei and co-workers have argued that if we want to consider exchangeable representations for documents and words (representations in which the order of the items is ignored), we need to consider mixture models that capture the exchangeability of both words and documents. This is why they proposed their LDA model.</p>
<p>In LDA documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.</p>
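<p>That generative story can be sketched with nothing but the Python standard library. Treat this as an illustration under assumptions: the two topics, the vocabulary, and the Dirichlet parameters are invented, and the code only generates a document from given parameters; it does not do the inference that an actual LDA implementation needs.</p>

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(alpha, topics, vocabulary, n_words, rng):
    """Generate one document: a random topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # this document's mixture over latent topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocabulary, weights=topics[z])[0]      # pick a word from that topic
        words.append(w)
    return theta, words

vocabulary = ["gene", "dna", "ball", "goal"]
topics = [
    [0.50, 0.40, 0.05, 0.05],  # hypothetical topic 0: distribution over words
    [0.05, 0.05, 0.50, 0.40],  # hypothetical topic 1: distribution over words
]
alpha = [0.5, 0.5]  # hypothetical Dirichlet parameters

rng = random.Random(0)
theta, words = generate_document(alpha, topics, vocabulary, 20, rng)
print([round(t, 3) for t in theta])
print(" ".join(words))
```

<p>Unlike the per-document list of numbers in PLSI, the topic mixture here is itself drawn from a distribution, so a new document can be generated without adding parameters.</p>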
<p>I believe we are moving toward a Unified IR Theory where Co-Occurrence, Probability and Geometry will converge. In this unified framework there is no room for the idea of term independence or of documents as mere &#8220;bags of words&#8221;. The former is IR&#8217;s Original Sin and the latter is its copycat.</p>
<p>The image above reminds me of research work I conducted in the late &#8217;80s on sequential simplex optimization methods.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/03/vector-space-probabilistic-lsi-and-lda/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://www.miislita.com/blog/images/lda.gif" medium="image">
			<media:title type="html">lda</media:title>
		</media:content>
	</item>
		<item>
		<title>AIRWeb Course Announcement</title>
		<link>http://irthoughts.wordpress.com/2009/04/02/airweb-course-announcement/</link>
		<comments>http://irthoughts.wordpress.com/2009/04/02/airweb-course-announcement/#comments</comments>
		<pubDate>Thu, 02 Apr 2009 05:00:00 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[AIRWeb Course]]></category>
		<category><![CDATA[Graduate Courses]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=814</guid>
		<description><![CDATA[During the Fall of 2009, I will be teaching 
 Adversarial Information Retrieval on the Web:  A Graduate Course on Web Spam and Internet Vulnerabilities
This is a new, full one-semester graduate course to be offered at Polytechnic University of Puerto Rico. It is based on the material presented at the annual AIRWeb Workshops. KDDM graduate students are encouraged to [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>During the Fall of 2009, I will be teaching </p>
<p> <em>Adversarial Information Retrieval on the Web:  A Graduate Course on Web Spam and Internet Vulnerabilities</em></p>
<p>This is a new, full one-semester graduate course to be offered at Polytechnic University of Puerto Rico. It is based on the material presented at the annual AIRWeb Workshops. KDDM graduate students are encouraged to enroll. An early announcement and preliminary syllabus are available at</p>
<p><a href="http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf">http://www.miislita.com/courses/airweb-web-spam-syllabus.pdf</a></p>
<p>BTW, on November 5, 2008, PUPR became the First Academic Institution in the Caribbean to be Certified by the Committee on National Security Systems (CNSS). Additional information is available at <a href="http://www.pupr.edu/ias.html">http://www.pupr.edu/ias.html</a></p>
<p>Their goal is to become a Center of Academic Excellence in Information Assurance Education (CAE/IAE). This is great news. Nationwide, how many universities do you know that are in such an exclusive &#8220;club&#8221;?</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/04/02/airweb-course-announcement/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>RSJ-PM: Probabilistic Model Tutorial</title>
		<link>http://irthoughts.wordpress.com/2009/03/30/rsj-pm-probabilistic-model-tutorial/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/30/rsj-pm-probabilistic-model-tutorial/#comments</comments>
		<pubDate>Mon, 30 Mar 2009 20:35:44 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=811</guid>
		<description><![CDATA[As promised, I am pleased to announce the publication of the Robertson-Sparck Jones Probabilistic Model Tutorial.
It is available at Mi Islita.com in the Tutorials Section. A link is provided on the index page.
The tutorial guides you through the intricacies of RSJ-PM. It is a great start for CS students and teachers interested in probabilistic models in information [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As promised, I am pleased to announce the publication of the <a href="http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf">Robertson-Sparck Jones Probabilistic Model Tutorial</a>.</p>
<p>It is available at <a href="http://www.miislita.com/">Mi Islita.com</a> in the Tutorials Section. A link is provided on the index page.</p>
<p>The tutorial guides you through the intricacies of RSJ-PM. It is a great start for CS students and teachers interested in probabilistic models in information retrieval.</p>
<p>Enjoy it.</p>
<p>Due to the time spent on it, the April issue of the IR Watch newsletter will be a bit delayed.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/30/rsj-pm-probabilistic-model-tutorial/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>W3C 2009 Conference</title>
		<link>http://irthoughts.wordpress.com/2009/03/26/w3c-2009-conference/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/26/w3c-2009-conference/#comments</comments>
		<pubDate>Thu, 26 Mar 2009 15:09:17 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Conferences]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=808</guid>
		<description><![CDATA[Here is the final list for the 18th International Conference of the W3C, WWW2009, of which AIRWeb2009 is a workshop.
http://www.webshine.org/2009reg.html
A lot of good stuff to please IRs, CS students, spammers/SEOs, and hackers.
       <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=808&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Here is the final list for the 18th International Conference of the W3C, WWW2009, of which AIRWeb2009 is a workshop.</p>
<p><a href="http://www.webshine.org/2009reg.html">http://www.webshine.org/2009reg.html</a></p>
<p>A lot of good stuff to please IRs, CS students, spammers/SEOs, and hackers.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/26/w3c-2009-conference/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>SEOs and Their IDF Myths: Part 3</title>
		<link>http://irthoughts.wordpress.com/2009/03/20/seos-and-their-idf-myths-part-3/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/20/seos-and-their-idf-myths-part-3/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 15:45:34 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[SEO Myths]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=798</guid>
		<description><![CDATA[In SEOs and their IDF Myths, we covered how many are misinterpreting the measure of term specificity known as Inverse Document Frequency (IDF).
In SEOs and their IDF Myths: Part 2, we exposed some of these folks.
In Understanding TFIDF, we wrote a rebuttal.
We are still seeing so many bloggers mistaking IDF for something that it is not. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=798&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>In <a href="http://irthoughts.wordpress.com/2008/06/17/seos-and-their-idf-myths/">SEOs and their IDF Myths</a>, we covered how many are misinterpreting the measure of term specificity known as Inverse Document Frequency (IDF).</p>
<p>In <a href="http://irthoughts.wordpress.com/2008/07/03/seos-and-their-idf-myths-part-2/">SEOs and their IDF Myths: Part 2</a>, we exposed some of these folks.</p>
<p>In <a href="http://irthoughts.wordpress.com/2008/07/07/understanding-tfidf/">Understanding TFIDF</a>, we wrote a rebuttal.</p>
<p>We are still seeing so many bloggers mistaking IDF for something that it is not. We have to conclude that these pseudo-teachers either are just trying to sell something or don&#8217;t really understand what term specificity stands for. They should know that IDF is just a few pixels within the bigger picture of the Robertson-Sparck Jones Probabilistic Model for information retrieval.</p>
<p>Thus, we are writing a tutorial on RSJ-PM to put a stop, for good, to their intentionally misleading efforts. Hopefully, the tutorial will be ready before the month ends. It will be a great way of putting to rest all the false information flying around from the usual agents of misinformation (mostly SEOs). CS students interested in the pros and cons of probabilistic models in IR will find it useful.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/20/seos-and-their-idf-myths-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>A CBR Sharing Search Engine System</title>
		<link>http://irthoughts.wordpress.com/2009/03/17/a-cbr-sharing-search-engine-system/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/17/a-cbr-sharing-search-engine-system/#comments</comments>
		<pubDate>Tue, 17 Mar 2009 14:34:54 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=793</guid>
		<description><![CDATA[I&#8217;m reading with great interest the paper 
Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, and Ning Xiong, presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management (COMADEM 2007), Faro, Portugal, pp. 305-314.
I&#8217;m happy to read they referenced [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=793&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I&#8217;m reading with great interest the paper <a href="http://www.mrtc.mdh.se/publications/1269.pdf"><br />
Efficient Condition Monitoring and Diagnosis Using a Case-Based Experience Sharing System</a>, by Mobyen Uddin Ahmed, Erik Olsson, Peter Funk, and Ning Xiong, presented at the 20th International Congress and Exhibition on Condition Monitoring and Diagnostics Engineering Management (COMADEM 2007), Faro, Portugal, pp. 305-314.</p>
<p>I&#8217;m happy to read they referenced our <a href="http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html">Tutorial on Cosine Similarity Measures</a>. Their CBR-based search system combines a tf*IDF term vector scoring scheme and ontologies.</p>
<p>Their abstract follows:</p>
<blockquote><p>ABSTRACT<br />
In a dynamic industrial environment changes occur more and more rapidly, new machines, new staff when scaling up production and reduced staff when scaling down during a recession, staff with varying experience etc. This puts a high focus on experience reuse and sharing; much experience is lost during down-scaling and tied up in knowledge transfer/teaching during up-scaling. This is recognised as very costly for industry and reduces productivity and competitiveness. Condition Monitoring and diagnostics is such an area where lack on knowledge and mistakes can have severe consequences for a company’s long term existence. Maintenance staffs, technicians and engineers also gain much experience during their every day work, often during many years, but there are rarely any good processes for experience sharing and reuse inside the organisations. In this paper we present an experience sharing system based on case-based reasoning and limited natural language processing. The system is a tool for maintenance staff and engineers and enables efficient experience collection, reuse and sharing. The implemented prototype is web-based to promote access from any location and may be local or global enabling experience sharing openly or in clusters of collaborating companies. Case based reasoning has proven to be an efficient method to identify and reuse experience if the application domain has cases. Our target application domain has these features and there are plenty of cases valuable to reuse. We have validated this in close collaboration with maintenance engineers through field studies. The prototype developed shows promising features and will be tested in real industrial environments during 2007 and 2008.</p></blockquote>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/17/a-cbr-sharing-search-engine-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Centering Data in PCA</title>
		<link>http://irthoughts.wordpress.com/2009/03/13/centering-data-in-pca/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/13/centering-data-in-pca/#comments</comments>
		<pubDate>Fri, 13 Mar 2009 14:11:36 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=788</guid>
		<description><![CDATA[Yesterday I received the following email from a reader (name removed). I am reproducing it since the discussion might be of value to others with similar questions.
Hello Dr. Garcia,
I&#8217;ve seen your tutorial &#8220;PCA and SPCA Tutorial&#8221; while I was trying to
find out something about PCA. So I decided to ask this to you. I&#8217;ll be
happy [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=788&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Yesterday I received the following email from a reader (name removed). I am reproducing it since the discussion might be of value to others with similar questions.</p>
<blockquote><p>Hello Dr. Garcia,</p>
<p>I&#8217;ve seen your tutorial &#8220;PCA and SPCA Tutorial&#8221; while I was trying to<br />
find out something about PCA. So I decided to ask this to you. I&#8217;ll be<br />
happy if you answer.</p>
<p>My feature matrix contains 30 feature vectors and I want to reduce<br />
this dimension using 95% of variance explained. Ranges of some feature vectors have great differences. For example, one feature vector&#8217;s range is about [0,10], while some others is [-10^10,10^10]. So when I directly subtract the mean and calculate covariance matrix, one of the eigenvalues suppresses the others.</p>
<p>Is it a proper way to scale data (z=(x-mean)/std_dev) firstly and then subtract mean of the scaled version and calculate covariance matrix?</p>
<p>When I try this procedure eigenvalues seem to be correct but I cannot be sure if this is a correct way or not.</p>
<p>Do you think that this is correct? If not, what is the correct way<br />
using covariance matrix rather than correlation matrix?</p>
<p>Thanks in advance.</p>
<p>******</p></blockquote>
<p>My answer follows.</p>
<p>The purpose of centering data (here, transforming data to z-scores, which both centers and scales each feature) is to remove undesirable fluctuations. This is particularly useful when there is a common source of error, e.g. in a time series. Assuming this is your case, you are doing the right thing. Note that z-scores already have zero mean, so subtracting the mean a second time changes nothing, and the covariance matrix of z-scored data is the correlation matrix of the original data.</p>
<p>An advantage of data centering is that it is part of the PCA solution to minimizing the sum of squared errors (SSE). Overall, the goal is to find the best-fitting affine subspace.</p>
<p>Centering the data has other advantages. It makes the cosine of the angle between two centered vectors equal to Pearson&#8217;s Correlation Coefficient, so that similarity information can be explored. Also, once in z-score form, the data can be checked to see whether it follows a normal distribution.</p>
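<p>For readers who want to check these statements numerically, here is a minimal sketch in plain Python. The two feature vectors are made-up values chosen only to mimic the very different ranges described in the email, and the helper names (center, zscores, sample_cov, cosine, pearson) are illustrative, not taken from any tutorial or library.</p>

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def center(xs):
    # Centering: subtract the mean so the values average to zero.
    m = mean(xs)
    return [x - m for x in xs]

def sample_cov(xs, ys):
    # Sample covariance with the n - 1 denominator.
    cx, cy = center(xs), center(ys)
    return sum(a * b for a, b in zip(cx, cy)) / (len(xs) - 1)

def zscores(xs):
    # Standardizing: center, then divide by the sample standard deviation.
    sd = math.sqrt(sample_cov(xs, xs))
    return [v / sd for v in center(xs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(xs, ys):
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Hypothetical features with very different ranges (assumed data).
small = [1.0, 3.0, 4.0, 7.0, 10.0]
large = [2.0e9, -5.0e9, 1.0e9, 8.0e9, -3.0e9]

z_small, z_large = zscores(small), zscores(large)
r = pearson(small, large)

print("Pearson r:               ", r)
print("cosine of centered data: ", cosine(center(small), center(large)))
print("covariance of z-scores:  ", sample_cov(z_small, z_large))
print("variance of each z-score:", sample_cov(z_small, z_small), sample_cov(z_large, z_large))
print("mean of each z-score:    ", mean(z_small), mean(z_large))
```

<p>Up to floating-point rounding, the first three printed values should agree, each z-scored feature should have unit variance, and each mean should be zero. In other words, running covariance-based PCA on z-scores amounts to running PCA on the correlation matrix.</p>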
<p>For additional information, check these links:</p>
<p><a href="http://irthoughts.wordpress.com/2007/05/05/on-svd-and-pca-some-applications/">http://irthoughts.wordpress.com/2007/05/05/on-svd-and-pca-some-applications/</a></p>
<p><a href="http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/">http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/</a></p>
<p>I hope this helps.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/13/centering-data-in-pca/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IDF and Vector Space Models</title>
		<link>http://irthoughts.wordpress.com/2009/03/11/idf-and-vector-space-models/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/11/idf-and-vector-space-models/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 16:18:57 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=780</guid>
		<description><![CDATA[I&#8217;m back from SIDIM XXIV. It was a great conference in honor of Professor Oscar Moreno, from the Gauss Research Laboratory and NIC.PR (responsible for the .pr Internet domains).
Dr. Moreno is a legend in the area of pure and applied mathematics. I had the privilege of meeting him.
The conference plenary speakers were [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=780&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I&#8217;m back from SIDIM XXIV. It was a great conference in honor of Professor Oscar Moreno, from the Gauss Research Laboratory and NIC.PR (responsible for the .pr Internet domains).</p>
<p>Dr. Moreno is a legend in the area of pure and applied mathematics. I had the privilege of meeting him.</p>
<p>The three plenary speakers were legends as well:</p>
<p>Elwyn Berlekamp, University of California at Berkeley<br />
Solomon W. Golomb, University of Southern California<br />
Guang Gong, University of Waterloo</p>
<p>The event was a success, although some speakers read straight from their notes. As an interdisciplinary conference on pure and applied mathematics, all kinds of topics were covered.</p>
<p>I got the chance to present research work on a new global weight algorithm we are testing called scaled inverse document frequency (SIDF), a variant of the well-known IDF scheme.</p>
<p>For those unfamiliar with IDF and its implementation in ranking algorithms, <a href="http://www.cs.iitm.ernet.in/khemani/">Dr. Deepak Khemani</a> from the <a href="http://aidb.cs.iitm.ernet.in/">Artificial Intelligence &amp; Database Research Group</a> at the Indian Institute of Technology Madras has published a very useful tutorial presentation on <a href="http://aidb.cs.iitm.ernet.in/cs625/10.VectorSpace-model.pdf">Vector Space Models</a>.</p>
<p>The tutorial is based on our series of articles on the subject and provides a better understanding of the theory. We could not have done it better.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/11/idf-and-vector-space-models/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>SIDIM XXIV Conference</title>
		<link>http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/#comments</comments>
		<pubDate>Thu, 05 Mar 2009 05:00:00 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Homeland Security]]></category>
		<category><![CDATA[Queries]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=768</guid>
		<description><![CDATA[I am presenting at The Seminario Interuniversitario de Investigación en Ciencias Matemáticas (Interuniversity Seminar on Mathematical Sciences Research, SIDIM).
This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (http://sidim2009.uprr.pr/)
This year SIDIM will be held at the University of Puerto Rico, Rio Piedras, on March 6-7, 2009. The SIDIM program and book [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=768&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I am presenting at the <strong><em>Seminario Interuniversitario de Investigación en Ciencias Matemáticas</em> (<em>Interuniversity Seminar on Mathematical Sciences Research</em>, SIDIM)</strong>.</p>
<p>This is one of the most important activities held in Puerto Rico for the promotion of Mathematics research. (<a href="http://sidim2009.uprr.pr/">http://sidim2009.uprr.pr/</a>)</p>
<p>This year SIDIM will be held at the University of Puerto Rico, Rio Piedras, on March 6-7, 2009. The SIDIM program and book of abstracts are available at <a href="http://sidim.uprh.edu/libroSIDIM2009.pdf">http://sidim.uprh.edu/libroSIDIM2009.pdf</a></p>
<p>I will be presenting new research work on IDF and a new model for the conditional specificity of terms. If you have followed previous posts on the topic of inverse document frequency, now you will understand why I have dissected the topic several times. Thank you all for your private comments and feedback on the topic.</p>
<p>My abstract follows:</p>
<p><strong>Scaled Inverse Document Frequency: A Model for the Evaluation of the Conditional Specificity of Query Terms in Search Engine Collections</strong></p>
<p>Edel Garcia, Internet Business Development Center, Interamerican University of Puerto Rico, Metropolitan Campus</p>
<p>Inverse document frequency (IDF) is a measure of the specificity of query terms over a collection of D number of documents that has been successfully incorporated into numerous vector space information retrieval models. Since these models assume term independence, the specificity of a given term, present in different queries, is assumed to be unique and independent from other query terms. To the best of our knowledge, there are no known models that condition the specificity of terms to the presence of other terms in a query.</p>
<p>This paper proposes a new measure called scaled inverse document frequency (SIDF) which evaluates the conditional specificity of query terms over a subset S of D and without making any assumption about term independence. S can be estimated from search results, OR searches, or computed from inverted index data. We have evaluated SIDF values from commercial search engines by submitting queries relevant to the financial investment domain. Results compare favorably across search engines and queries. Our approach has practical applications for `real-world&#8217; scenarios like in Web Mining, Homeland Security, and keyword-driven marketing research scenarios. SIDF can be incorporated into a variety of information retrieval models as a global weight scoring system.</p>
<p><strong>Keywords:</strong> inverse document frequency, conditional term specificity, web mining, search engines</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Vector Normalization with Excel</title>
		<link>http://irthoughts.wordpress.com/2009/03/04/vector-normalization-with-excel/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/04/vector-normalization-with-excel/#comments</comments>
		<pubDate>Wed, 04 Mar 2009 14:53:50 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[IR Tutorials]]></category>
		<category><![CDATA[Newsletters]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=764</guid>
		<description><![CDATA[Unit vectors are frequently used in information retrieval and data mining studies because they simplify further calculations and analyses.
In the current issue of IR Watch, we show how easy it is to convert column vectors into unit vectors with Excel. It is assumed you know how to define spreadsheet arrays in Excel and how to enter formulas in them.
Say we have [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=764&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Unit vectors are frequently used in information retrieval and data mining studies because they simplify further calculations and analyses.</p>
<p>In the current issue of IR Watch, we show how easy it is to convert column vectors into unit vectors with Excel. It is assumed you know how to define spreadsheet arrays in Excel and how to enter formulas in them.</p>
<p>Say we have two vectors in columns A and B each with four elements. To convert these into unit vectors, do this:</p>
<p>1. In cell C1, enter the formula =A1/(SQRT(SUMSQ(A$1:A$4)))</p>
<p>2. Copy cell C1 and paste it into cell D1. Because the column reference is relative, the pasted formula adjusts to =B1/(SQRT(SUMSQ(B$1:B$4)))</p>
<p>3. Copy cells C1 and D1 and paste them into the remaining empty cells of these columns (C2:D4) by selecting them all at once. The row in the numerator adjusts for each pasted row, while the $-anchored range stays fixed.</p>
<p>C and D columns represent the unit vectors.</p>
<p>A figure with a step-by-step example is given in IRW (free subscription).</p>
<p>Below is another example, but with the final results.</p>
<table style="width:115pt;border-collapse:collapse;" border="0" cellspacing="0" cellpadding="0" width="152">
<col style="width:23pt;" span="1" width="30"></col>
<col style="width:24pt;" span="1" width="32"></col>
<col style="width:34pt;" span="2" width="45"></col>
<tbody>
<tr style="height:15pt;">
<td class="xl65" style="width:23pt;height:15pt;background-color:transparent;border:#f0f0f0;" width="30" height="20"><span style="font-size:small;font-family:Calibri;"><strong>A</strong></span></td>
<td class="xl65" style="width:24pt;background-color:transparent;border:#f0f0f0;" width="32"><span style="font-size:small;font-family:Calibri;"><strong>B</strong></span></td>
<td class="xl65" style="width:34pt;background-color:transparent;border:#f0f0f0;" width="45"><span style="font-size:small;font-family:Calibri;"><strong>C</strong></span></td>
<td class="xl65" style="width:34pt;background-color:transparent;border:#f0f0f0;" width="45"><strong><span style="font-size:small;font-family:Calibri;">D</span></strong></td>
</tr>
<tr style="height:15pt;">
<td class="xl66" style="height:15pt;background-color:transparent;border:#f0f0f0;" height="20"><span style="font-size:small;font-family:Calibri;">1</span></td>
<td class="xl66" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">8</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.13</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.36</span></td>
</tr>
<tr style="height:15pt;">
<td class="xl66" style="height:15pt;background-color:transparent;border:#f0f0f0;" height="20"><span style="font-size:small;font-family:Calibri;">2</span></td>
<td class="xl66" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">10</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.26</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.45</span></td>
</tr>
<tr style="height:15pt;">
<td class="xl66" style="height:15pt;background-color:transparent;border:#f0f0f0;" height="20"><span style="font-size:small;font-family:Calibri;">4</span></td>
<td class="xl66" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">12</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.53</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.53</span></td>
</tr>
<tr style="height:15pt;">
<td class="xl66" style="height:15pt;background-color:transparent;border:#f0f0f0;" height="20"><span style="font-size:small;font-family:Calibri;">6</span></td>
<td class="xl66" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">14</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.79</span></td>
<td class="xl67" style="background-color:transparent;border:#f0f0f0;"><span style="font-size:small;font-family:Calibri;">0.62</span></td>
</tr>
</tbody>
</table>
<p>That was easy!</p>
<p>If you use the first row to label columns, as in this example, be sure to readjust the formulas so that their ranges start at row 2 and run through row 5.</p>
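<p>For readers who want to cross-check the table outside Excel, here is a minimal Python sketch. It assumes that the last two columns hold the first two columns divided by their Euclidean lengths, which is what the tabulated values suggest; the function name <code>unit_normalize</code> is ours and is not an Excel function.</p>

```python
import math

def unit_normalize(values):
    """Divide each component by the vector's Euclidean length."""
    length = math.sqrt(sum(v * v for v in values))
    return [v / length for v in values]

# First two data columns of the table above (rows 2 to 5).
col_a = [1, 2, 4, 6]
col_b = [8, 10, 12, 14]

# Round to two decimals, as displayed in the worksheet.
col_c = [round(v, 2) for v in unit_normalize(col_a)]
col_d = [round(v, 2) for v in unit_normalize(col_b)]
print(col_c)
print(col_d)
```

<p>Rounded to two decimals, the two printed lists should agree with columns C and D.</p>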
<p>If you still have questions on how to do this, email me or subscribe to IRW.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/04/vector-normalization-with-excel/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IRW-March-2009:Data Mining Dates</title>
		<link>http://irthoughts.wordpress.com/2009/03/02/irw-march-2009-data-mining-dates/</link>
		<comments>http://irthoughts.wordpress.com/2009/03/02/irw-march-2009-data-mining-dates/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 13:55:00 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=753</guid>
		<description><![CDATA[
The current issue of the IRW newsletter is available now.
In this issue:
Featuring article: Data Mining Dates
QA: Excel Vector Normalization
Who is Who in IR: Stephen Robertson
Top CS Departments: School of Informatics, City University, London
Historical Notes: Mark and Colossus Computers
Outstanding Graduate Theses
Calls and Events
Research Blogs
and more&#8230;
The abstract of the featuring article is given below.
In this issue of [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=753&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:center;"><img class="aligncenter" src="http://www.miislita.com/irw/data-mining-dates.gif" alt="data mining dates" /></p>
<p>The current issue of the IRW newsletter is available now.</p>
<p>In this issue:</p>
<p>Featuring article: Data Mining Dates</p>
<p>QA: Excel Vector Normalization</p>
<p>Who is Who in IR: Stephen Robertson</p>
<p>Top CS Departments: School of Informatics, City University, London</p>
<p>Historical Notes: Mark and Colossus Computers</p>
<p>Outstanding Graduate Theses</p>
<p>Calls and Events</p>
<p>Research Blogs</p>
<p>and more&#8230;</p>
<p>The abstract of the featuring article is given below.</p>
<p>In this issue of the newsletter we examine the extraction of intelligence from dates. At first, a discussion on dates seems an unnecessary exercise. After all, many are inclined to take dates at face-value. But a date is more than a one-liner of information extracted from a calendar, headline, or footer. In the intelligence community, for example, dates provide a great amount of information about events, people, organized crime, terrorism, money laundering, unexpected situations, accidents, plots, chains of custody, validations, etc. Indeed, a date is a unique form of metadata, not to mention that these can be either relative or absolute. They can also be part of encryption schemes.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/03/02/irw-march-2009-data-mining-dates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://www.miislita.com/irw/data-mining-dates.gif" medium="image">
			<media:title type="html">data mining dates</media:title>
		</media:content>
	</item>
		<item>
		<title>When IDF is not enough</title>
		<link>http://irthoughts.wordpress.com/2009/02/25/when-idf-is-not-enough/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/25/when-idf-is-not-enough/#comments</comments>
		<pubDate>Wed, 25 Feb 2009 14:45:23 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Queries]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=744</guid>
		<description><![CDATA[I came across an interesting Collection of Ambiguous or Inconsistent/Incomplete Statements compiled by Jeff Gray, which illustrates that IDF as a measure of the discriminating power of a term is not enough. Gray writes:
According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=744&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I came across an interesting <a href="http://www.gray-area.org/Research/Ambig/">Collection of Ambiguous or Inconsistent/Incomplete Statements</a> compiled by Jeff Gray, which illustrates that IDF as a measure of the discriminating power of a term is not enough. Gray writes:</p>
<blockquote><p>According to the Oxford English Dictionary, the 500 words used most in the English language each have an average of 23 different meanings. The word &#8220;round,&#8221; for instance, has 70 distinctly different meanings. The variance of word meanings in natural language has always posed problems for those who attempt to construct an unambiguous and consistent statement. It is often the case that a written statement could be interpreted in several ways by different individuals, thus rendering the statement subjective rather than objective. The first detailed examination of this problem with respect to the specifications of computer systems is contained in [Hill, 72]. Hill provides a plethora of examples to illustrate this common problem. Peter G. Neumann illustrated this point by constructing a sentence which contained the restrictive qualifier &#8220;only.&#8221; He then showed that by placing the word &#8220;only&#8221; in 15 different places in the sentence resulted in over 20 different interpretations [Neumann, 84]. Moreover, other words like &#8220;never,&#8221; &#8220;should,&#8221; &#8220;nothing,&#8221; and &#8220;usually&#8221; are sometimes applied in a manner in which a double meaning can be ascribed. In particular, the word &#8220;nothing&#8221; was a favorite word often used by Lewis Carroll.</p></blockquote>
<p>Under these circumstances, why should we assume that the discriminating power of terms in a collection, particularly of polysemes and ambiguous terms, is the same (unique) regardless of their meanings or neighboring query terms? *</p>
<p>This is where IDF as a term specificity measure breaks down. This problem is intimately linked to <strong>The Original Sin of IR models: The Term Independence Assumption</strong>.</p>
<p>* I have slightly modified this question to make the point clearer.</p>
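<p>To make the point concrete, here is a toy sketch in Python. It assumes the common textbook form idf = log(N/n), where N is the number of documents and n the number of documents containing the term; the four-document collection and the function name are made up for illustration.</p>

```python
import math

def idf(term, docs):
    """Textbook inverse document frequency: log(N / n_t)."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_t)

# Hypothetical collection; each document is a set of terms.
docs = [
    {"round", "boxing", "knockout"},  # "round" as a stage of a fight
    {"round", "table", "shape"},      # "round" as a shape
    {"round", "drinks", "pub"},       # "round" as a set of drinks
    {"fractal", "chaos"},
]

# One weight per term, whatever sense each document uses.
print(idf("round", docs))
print(idf("fractal", docs))
```

<p>The three senses of &#8220;round&#8221; collapse into a single document count, so the weight cannot reflect which meaning a query intends.</p>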
<p><strong>References</strong></p>
<p><a href="http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/">http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/</a></p>
<p>Hill, I.D., &#8220;Wouldn&#8217;t it be nice if we could write computer programs in ordinary English &#8211; or would it?&#8221; The Computer Bulletin, June 1972, pp. 306-312.</p>
<p>Neumann, Peter G., &#8220;Only his Only Grammarian Can Only Say What Only He Means,&#8221; ACM SIGSOFT Software Engineering Notes, January 1984, pg. 6.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/25/when-idf-is-not-enough/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Glottochronology: Part IV</title>
		<link>http://irthoughts.wordpress.com/2009/02/20/glottochronology-part-iv/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/20/glottochronology-part-iv/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 14:05:43 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Queries]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=739</guid>
		<description><![CDATA[Please read Part I, Part II, and Part III before reading this post.
I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur&#8217;s book Discrete Dynamical Systems (Oxford, 1990).
1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=739&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Please read <a href="http://irthoughts.wordpress.com/2009/02/16/glottochronology-part-i/">Part I</a>, <a href="http://irthoughts.wordpress.com/2009/02/18/glottochronology-part-ii/">Part II</a>, and <a href="http://irthoughts.wordpress.com/2009/02/19/glottochronology-part-iii/">Part III</a> before reading this post.</p>
<p>I would like to end this series of posts on glottochronology with some exercises, taken from Sandefur&#8217;s book <em>Discrete Dynamical Systems</em> (Oxford, 1990).</p>
<p>1. Two groups of people have a common language. From a list of 250 words, the two groups have 220 in common. How long ago did these two groups split from one?</p>
<p>2. Consider the model of glottochronology. Assume a language is given today.</p>
<p>(a) How long will it take for 1/4 of the words to change?</p>
<p>(b) How long will it take for 10 per cent of the words to change?</p>
<p>3. Suppose that person <strong>A </strong>knows 60 per cent of a list of 1000 words, person <strong>B</strong> knows 70 per cent of that list, and person <strong>C</strong> knows 30 per cent of that list.</p>
<p>(a) How many words do you expect all three people know?</p>
<p>(b) What per cent of the words is known by <strong>A</strong> and <strong>B</strong> but not by <strong>C</strong>?</p>
<p>Hints:</p>
<p>Problems 1 and 2 are solved with the equations provided in the previous posts. Problem 3 is solved by applying the multiplication principle to <strong>A</strong>, <strong>B</strong>, and <strong>C</strong>.</p>
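<p>For readers who want to check their answers afterwards, a short Python sketch follows. It assumes the retention rate of 0.805 per 1000 years used in this series, so that a single language keeps (0.805)^(0.001 k) of a word list after k years and two separated groups share (0.805)^(0.002 k) of it; the function and variable names are ours.</p>

```python
import math

RATE = 0.805  # fraction of a word list retained per 1000 years (from this series)

def years_single(fraction_kept):
    """Years for one language to keep only fraction_kept of its list."""
    return 1000 * math.log(fraction_kept) / math.log(RATE)

def years_since_split(q):
    """Years since two groups split, given the shared fraction q."""
    return 500 * math.log(q) / math.log(RATE)

# Problem 1: 220 of 250 words in common.
p1 = years_since_split(220 / 250)

# Problem 2: (a) 1/4 of the words change, (b) 10 per cent change.
p2a = years_single(1 - 0.25)
p2b = years_single(1 - 0.10)

# Problem 3: multiplication principle for A, B, and C on 1000 words.
a, b, c = 0.60, 0.70, 0.30
p3a = a * b * c * 1000       # words known by all three
p3b = a * b * (1 - c) * 100  # per cent known by A and B but not by C

print(round(p1), round(p2a), round(p2b), round(p3a), round(p3b, 1))
```

<p>Note that Problem 2 concerns a single language, so it uses the one-language exponent, while Problem 1 compares two groups and uses the doubled exponent.</p>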
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/20/glottochronology-part-iv/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Glottochronology: Part III</title>
		<link>http://irthoughts.wordpress.com/2009/02/19/glottochronology-part-iii/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/19/glottochronology-part-iii/#comments</comments>
		<pubDate>Thu, 19 Feb 2009 14:35:51 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Queries]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=731</guid>
		<description><![CDATA[Please read Part I and Part II before proceeding with this post.
Applications to cultures
Sandefur provides the following example:
Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=731&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Please read <a href="http://irthoughts.wordpress.com/2009/02/16/glottochronology-part-i/">Part I</a> and <a href="http://irthoughts.wordpress.com/2009/02/18/glottochronology-part-ii/">Part II</a> before proceeding with this post.</p>
<p><strong>Applications to cultures</strong></p>
<p>Sandefur provides the following example:</p>
<blockquote><p>Suppose at time 0, a group of people separate themselves from their culture. A group of American Indians leaves the tribe and forms its own tribe, or a group sails to a deserted island and starts its own culture. We then have two cultures, <strong>A</strong> and <strong>B</strong>. At time 0, they have the same language, so that for a given list of <strong>L</strong> words, <strong>A(0) = B(0) = 1</strong> is the per cent of the words they both know (and have in common).</p>
<p>If we contact each of these cultures <strong>k</strong> years later, culture <strong>A</strong> will know <strong>A(k) = (0.805)^(0.001 k)</strong> per cent of the original list, and culture <strong>B</strong> will likewise know <strong>B(k) = (0.805)^(0.001 k)</strong> per cent. Thus the per cent of the words that both cultures know is, by the multiplication principle,</p>
<p><strong>Q = (0.805)^(0.001 k)*(0.805)^(0.001 k) = (0.805)^(0.002k)</strong></p>
<p>What the glottochronologist does now is to construct a list of words. From that list of words, the two cultures are studied and it is determined what per cent <strong>Q </strong>of this list of words is known by both cultures. Thus in the equation for <strong>Q </strong>above, <strong>Q</strong> is known, but <strong>k</strong>, the number of years since the two cultures separated, is unknown. Solving for k gives</p>
<p><strong>k = 500lnQ/ln 0.805</strong></p></blockquote>
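<p>Spelled out, the step from the equation for <strong>Q</strong> to the expression for <strong>k</strong> is just a logarithm. A sketch in LaTeX notation, using the same symbols as the quotation:</p>

```latex
\begin{align*}
Q &= (0.805)^{0.002k} \\
\ln Q &= 0.002\,k \ln 0.805 \\
k &= \frac{\ln Q}{0.002 \ln 0.805} = \frac{500 \ln Q}{\ln 0.805}
\end{align*}
```

<p>Since both <strong>Q</strong> and 0.805 are less than 1, both logarithms are negative and <strong>k</strong> comes out positive.</p>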
<p>To understand the significance of this ratio we need to look at some examples.</p>
<p><strong>Examples</strong></p>
<p>Sandefur provides several examples.</p>
<blockquote><p>Suppose that the natives of two islands have similar language. From a list of 300 words, 180 words are understood by both groups, that is <strong>Q = 180/300 = 0.6.</strong> Then</p>
<p><strong>k = 500ln 0.6/ln 0.805 = 1177.5</strong></p>
<p>We then conclude that the natives of these two islands came from a common ancestry, approximately <strong>1200</strong> years ago.</p>
<p>Suppose a collection of tribes with a similar language is considered. First, group the tribes into geographical regions. Then date the time separation <strong>k</strong> for pairs of tribes in each geographical region. It can be argued that the region with the pair of tribes with the largest time separation is the homeland of the tribes. The reason for this conclusion is as follows. Suppose one tribe separates into three tribes. One tribe might move away while the other two remain in the same general region. The tribe that moved away may split again in its geographical location, but the largest time separation will always be the two that remained in the original area.</p></blockquote>
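<p>Both examples can be reproduced with a few lines of Python. This is only a sketch: the function name is ours, and the regional figures in the second part are hypothetical numbers chosen to illustrate the homeland argument, not data from Sandefur.</p>

```python
import math

def years_since_split(q, rate=0.805):
    """k = 500 ln Q / ln 0.805, the formula quoted above."""
    return 500 * math.log(q) / math.log(rate)

# First example: 180 of 300 words understood by both groups.
q = 180 / 300
k = years_since_split(q)
print(round(k, 1))

# Second example, with hypothetical shared fractions for pairs of tribes
# grouped by region. The region holding the oldest split is the candidate homeland.
regions = {"north": [0.80, 0.70], "south": [0.55, 0.90]}
oldest = {name: max(years_since_split(x) for x in qs)
          for name, qs in regions.items()}
homeland = max(oldest, key=oldest.get)
print(homeland)
```

<p>The smaller the shared fraction, the older the split, so the region containing the pair with the lowest <strong>Q</strong> is the one selected.</p>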
<p><strong>Drawbacks and Pitfalls of Glottochronology</strong></p>
<p>The model rests on independence assumptions (see the discussion of the multiplication principle); that is, it assumes events co-occur only by chance. But we know that</p>
<p>If <strong>p(A1A2) = p(A1)p(A2)</strong>, events co-occur by chance.<br />
If <strong>p(A1A2) &gt; p(A1)p(A2)</strong>, events co-occur more often than by chance.<br />
If <strong>p(A1A2) &lt; p(A1)p(A2)</strong>, events co-occur less often than by chance.</p>
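<p>The three cases can be checked against counts. Here is a minimal Python sketch, assuming probabilities are estimated as simple relative frequencies; the function name and the counts are hypothetical.</p>

```python
import math

def cooccurrence(n_a1, n_a2, n_both, n_total):
    """Compare the observed joint probability with the independence product."""
    p_joint = n_both / n_total
    p_product = (n_a1 / n_total) * (n_a2 / n_total)
    if math.isclose(p_joint, p_product):
        return "by chance"
    if p_joint > p_product:
        return "more than by chance"
    return "less than by chance"

# Hypothetical counts out of 100 observations: A1 occurs 70 times, A2 80 times.
print(cooccurrence(70, 80, 56, 100))
print(cooccurrence(70, 80, 65, 100))
print(cooccurrence(70, 80, 50, 100))
```

<p>With real data an exact match is rare, so in practice one would ask whether the difference is larger than sampling noise rather than test for equality.</p>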
<p>One way terms deviate from independence is through their semantics (meaning). If the meanings of words change over time, how do we know whether all words in a word list change by the same amount?</p>
<p>As noted by Sandefur:</p>
<blockquote><p>&#8230;how do you determine if a word is the same for two cultures? If the spelling of a word or the pronunciation of a word changes &#8216;slightly&#8217;, we will still count it as being on the list. If the meaning of a word changes &#8216;significantly&#8217;, we will delete it from the list. Thus, there is some subjectivity in determining <strong>Q</strong> which could drastically change the results. Also some words are more likely to change than the others. But in the multiplication principle, we tacitly <strong>assumed</strong> that all words were equally likely to change. This can throw the results off.</p>
<p>The moral of this is that you need to be careful not to make more claims about your model than are justified.</p></blockquote>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/19/glottochronology-part-iii/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Glottochronology: Part II</title>
		<link>http://irthoughts.wordpress.com/2009/02/18/glottochronology-part-ii/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/18/glottochronology-part-ii/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 14:37:01 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Queries]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=722</guid>
		<description><![CDATA[Please read Glottochronology Part I before reading this post.
Language dating forecasts are based on independence assumptions. Let A1 and A2 be two different events. If the events are assumed to be independent, the probability of both co-occurring is
p(A1A2) = p(A1)p(A2)
Some authors like Sandefur call this the multiplication principle.
As Sandefur noted, and I quote:
Suppose two people each &#8216;know&#8217; a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=722&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Please read <a href="http://irthoughts.wordpress.com/2009/02/16/glottochronology-part-i/">Glottochronology Part I</a> before reading this post.</p>
<p>Language dating forecasts are based on independence assumptions. Let <strong>A1</strong> and <strong>A2</strong> be two different events. If the events are assumed to be independent, the probability of both co-occurring is</p>
<p><strong>p(A1A2) = p(A1)p(A2)</strong></p>
<p>Some authors like Sandefur call this the <strong>multiplication principle</strong>.</p>
<p>As Sandefur noted, and I quote:</p>
<blockquote><p>Suppose two people each &#8216;know&#8217; a certain per cent of a list of words. For example, suppose Frank knows 70 per cent of list <strong>L</strong> and Sue knows 80 per cent of list <strong>L</strong>, where <strong>L</strong> contains 100 words. Given any random sublist of words from list <strong>L</strong>, we would expect Frank to know 70 per cent of them and Sue to know 80 per cent of them.</p>
<p>Frank knows 70 of the original 100 words. We would expect Sue to know 80 percent of Frank&#8217;s 70 words, that is, 56 of Frank&#8217;s words. Thus, Sue and Frank know 56 words in common, that is, the per cent of the 100 words that Frank and Sue both know is (0.80)(0.70) = 0.56 or 56 per cent.</p>
<p><strong>Multiplication principle</strong>: suppose person <strong>A</strong> knows <strong>P</strong> per cent of a list of <strong>L</strong> words and person <strong>B</strong> knows <strong>Q</strong> per cent of the same list of <strong>L</strong> words (where <strong>P</strong> and <strong>Q</strong> are given as decimals). Given no additional information, we would expect <strong>A</strong> and <strong>B</strong> to both know <strong>PQ</strong> per cent of the words.</p></blockquote>
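<p>The Frank and Sue example can be checked with a small simulation. The Python sketch below reads &#8220;given no additional information&#8221; as &#8220;each person knows a random subset of the list, drawn independently of the other&#8221;; that reading, the function name, the number of trials, and the seed are our assumptions.</p>

```python
import random

def average_overlap(list_size, frank_known, sue_known, trials=2000, seed=0):
    """Average number of words both know when each knows a random subset."""
    rng = random.Random(seed)
    words = range(list_size)
    total = 0
    for _ in range(trials):
        frank = set(rng.sample(words, frank_known))
        sue = set(rng.sample(words, sue_known))
        total += len(frank.intersection(sue))
    return total / trials

expected = 0.80 * 0.70 * 100  # multiplication principle: words in common
simulated = average_overlap(100, 70, 80)
print(round(expected), round(simulated, 1))
```

<p>The simulated average lands close to the value given by the multiplication principle only because the two subsets are drawn independently. If Frank and Sue learned their words from the same teacher, the overlap would be larger and the principle would underestimate it.</p>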
<p>In Part III we will apply this principle to the evolution of cultures.</p>
<p>Later in this series we will explain how the independence assumption affects some of the reasoning and claims behind language dating models.</p>
<p>In the meantime: how relevant is this model to IR? Well, assume that <strong>A</strong> and <strong>B</strong> are not Franks and Sues, but passages, topics, documents, etc. Or suppose that instead of dealing with language dating we are trying to address the problem of duplicated content. The scenarios might be different, but the drawbacks and gross pitfalls introduced by independence assumptions are quite similar.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/18/glottochronology-part-ii/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>AND 2009 Conference</title>
		<link>http://irthoughts.wordpress.com/2009/02/17/and-2009-conference/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/17/and-2009-conference/#comments</comments>
		<pubDate>Tue, 17 Feb 2009 14:25:32 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Conferences]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=706</guid>
		<description><![CDATA[L. Venkata Subramaniam, PhD, Manager &#8211; Information Processing and Analytics, IBM India Research Lab http://lvs004.googlepages.com sent us email with some great news. They are having the Third Workshop on Analytics for Noisy Unstructured Text Data (AND) on July 23-24, 2009, at Barcelona, Spain and asked us to disseminate the news.
Copy of the email follows.
&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;
Dear Edel,
We [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=706&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>L. Venkata Subramaniam, PhD, Manager &#8211; Information Processing and Analytics, IBM India Research Lab (<a href="http://lvs004.googlepages.com">http://lvs004.googlepages.com</a>), sent us an email with some great news. They are holding the Third Workshop on Analytics for Noisy Unstructured Text Data (AND) on July 23-24, 2009, in Barcelona, Spain, and asked us to disseminate the news.</p>
<p>Copy of the email follows.</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;</p>
<p>Dear Edel,</p>
<p>We are organizing the Third Workshop on Analytics for Noisy Unstructured Text Data on July 23-24, 2009, at Barcelona, Spain.</p>
<p>We know you work in related areas and would be happy to have you submit your research work to this workshop.</p>
<p>Also I request you to add AND 2009 to the blog you are maintaining at: <a href="http://irthoughts.wordpress.com/">http://irthoughts.wordpress.com/</a></p>
<p>I know many IR researchers visit your blog and through the blog we will be able to reach.</p>
<p>AND 2009: <a href="http://and2009workshop.googlepages.com/">http://and2009workshop.googlepages.com/</a></p>
<p>This is the third in the series of workshops: AND 2007 at IJCAI 2007: <a href="http://research.ihost.com/and2007/">http://research.ihost.com/and2007/</a></p>
<p>AND 2008 at SIGIR 2008:<br />
<a href="http://and2008workshop.googlepages.com/">http://and2008workshop.googlepages.com/</a></p>
<p>Both earlier workshops resulted in ACM proceedings and journal special issues. Here are some details of AND 09:<br />
<a href="http://and2009workshop.googlepages.com/">http://and2009workshop.googlepages.com/</a></p>
<p>Workshop Name: Third Workshop on Analytics for Noisy Unstructured Text Data (AND 09) in conjunction with ICDAR 09</p>
<p>Submission Date: 20 April 2009<br />
Notification Date: 20 May 2009<br />
Workshop Dates: 23-24 July 2009<br />
Workshop Location: Barcelona, Spain</p>
<p>Regards<br />
Venkat</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-</p>
<p>So now you know. Start making plans to attend this great workshop. If you are visiting Madrid, swing by Barcelona for a few days. Don&#8217;t miss this unique opportunity.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/17/and-2009-conference/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Glottochronology: Part I</title>
		<link>http://irthoughts.wordpress.com/2009/02/16/glottochronology-part-i/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/16/glottochronology-part-i/#comments</comments>
		<pubDate>Mon, 16 Feb 2009 15:39:11 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Queries]]></category>
		<category><![CDATA[Vector Space Models]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=696</guid>
		<description><![CDATA[This week I will be posting on glottochronology, although not back-to-back.
Glottochronology is a combination of Greek terms which essentially means language dating.
Looking at some of my &#8220;old&#8221; collection of books on applied Chaos and Fractals from the &#8217;90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated a few pages [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=696&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>This week I will be posting on glottochronology, although not back-to-back.</p>
<p><em>Glottochronology</em> is a combination of Greek terms which essentially means language dating.</p>
<p>Looking at some of my &#8220;old&#8221; collection of books on applied Chaos and Fractals from the &#8217;90s (a topic close to my heart/doctoral thesis), I recalled that James T. Sandefur dedicated a few pages to the topic in his great book <em>Discrete Dynamical Systems</em> (Chapter 2, pages 81-83; Oxford, 1990). Yep. There is nothing new under the Sun, Web IRs.</p>
<p>Sandefur wrote:</p>
<blockquote><p>We all know that, over time, certain words disappear from usage and new words appear. Suppose that, at a certain point in time, we look at a list of <strong>L</strong> words (say <strong>L=250</strong>). At a later point in time, we study that same list of words and determine what per cent of the original list of words are still in use.</p>
<p>Let one unit of time be 1 year. Thus, time <strong>n</strong> will be <strong>n</strong> years. Let <strong>A(n)</strong> represent the per cent of the original list of words still in use <strong>n</strong> years later, given as a decimal. The basic assumption is that the percent <strong>A(n+1)</strong> of the original list of words in use at time <strong>n+1</strong> is proportional to the per cent of the original list of words in use at time <strong>n</strong>, that is,</p>
<p><strong>A(n+1) = rA(n),</strong></p>
<p>where <strong>r</strong> is a positive constant less than one. At time 0, all of the original list of words is in use, so <strong>A(0)=1</strong>. Therefore, at time <strong>k</strong>, <strong>A(k)=r^k(1) = r^k</strong> is the percent of the original list of words still in use, as a decimal.</p>
<p>Since languages change slowly, <strong>r</strong> should be close to 1 and would probably be hard to estimate on a year by year comparison. By comparing a written language today with the same language a millennium ago, glottochronologists can estimate <strong>r^1000</strong>. This number <strong>r</strong> also depends on the particular language. But glottochronologists have found that the number <strong>r^1000</strong> is usually close to <strong>0.805</strong>. So for languages with no written history, that is, for languages in which we cannot estimate <strong>r</strong>, we will assume that</p>
<p><strong>r^1000 = 0.805</strong></p>
<p>Thus, the per cent of the original list of words that are still in use <strong>k</strong> years later is</p>
<p><strong>r^k = (0.805)^(0.001 k)</strong>.</p></blockquote>
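<p>To make the model concrete, here is a minimal Python sketch of the formula quoted above. It assumes the retention value r^1000 = 0.805 from the quote. The function names, and the inverse function that estimates elapsed years from a retained fraction, are my own illustration and are not taken from Sandefur's book.</p>

```python
import math

# Retention per millennium, the value quoted above (r^1000 = 0.805).
R_1000 = 0.805


def fraction_retained(years: float) -> float:
    """A(k) = r^k with r^1000 = 0.805, i.e. 0.805 ** (0.001 * k)."""
    return R_1000 ** (0.001 * years)


def years_elapsed(fraction: float) -> float:
    """Hypothetical helper: solve A(k) = fraction for k by taking logarithms."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1000.0 * math.log(fraction) / math.log(R_1000)


if __name__ == "__main__":
    for k in (0, 500, 1000, 2000):
        print(k, round(fraction_retained(k), 4))
```

<p>Since the model is a plain geometric decay, dating reduces to a ratio of logarithms. Any real estimate inherits the assumptions questioned in the rest of this post.</p>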
<p>Glottochronology is one of those fields that were once popular but that many now doubt, because of questionable measurements and assumptions. One of those assumptions is term independence.</p>
<p>It seems that term independence is <strong>The Original Sin</strong> in Linguistic Studies as well as in IR models for noisy text collections, particularly in models that assume term independence with IDF scores. I&#8217;m working on a paper presentation on the subject.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/16/glottochronology-part-i/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>When newspapers stick to spam marketing</title>
		<link>http://irthoughts.wordpress.com/2009/02/12/when-newspapers-sticks-to-spam-marketing/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/12/when-newspapers-sticks-to-spam-marketing/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 14:51:08 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=688</guid>
		<description><![CDATA[That&#8217;s what local newspapers in Puerto Rico are doing: insisting on old spam marketing tactics. Wake up, local webmasters. It is not 1995. Redirections, splash page ads, and keyword spamming in meta keyword tags not only do not work with search engines, but also annoy users to no end.
This is exactly what some [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=688&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p style="text-align:left;">That&#8217;s what local newspapers in Puerto Rico are doing: insisting on old spam marketing tactics. Wake up, local webmasters. It is not 1995. Redirections, splash page ads, and keyword spamming in meta keyword tags not only do not work with search engines, but also annoy users to no end.</p>
<p style="text-align:left;">This is exactly what some newspapers like the local El Nuevo Dia newspaper are doing.</p>
<p style="text-align:left;">Today I typed <a href="http://endi.com">http://endi.com</a> and was redirected to a full-size splash page ad. Then I had to opt out to be redirected to a content page. Thank you for slapping those annoying ads in my face. Why chase away readers?</p>
<p style="text-align:left;">Adding insult to injury, when I looked at the source code of their horrible content page (<a href="http://www.elnuevodia.com/noticias">http://www.elnuevodia.com/noticias</a>), it was clear that those who designed the page insist on keyword spamming through meta keyword tags. How many keywords can you count in the following sample?</p>
<blockquote><p>&lt;meta name=&#8221;Keywords&#8221; content=&#8221;ENDI El Nuevo Dia, periodico, Puerto, Rico, internet, noticias, boricua, Clima, Horoscopo, Coqui, Sapo,Concho,Telefonica,Coqui net,RonNueva York ,Horoscopo,Coqui,Sapo Concho,Telefonica,Coqui net,Ron,Bacardi,Estatus,Sila Maria Calderon,Menudo,Carlos Romero Barcelo,Pedro Rossello,Anibal Acevedo Vila,Daddy Yankee,Tego lderon,Luis Fonsi,Ricky Martin,Calle 13,Don Omar,Marcony,Miss Universe,Miss Universo ,Jennifer Lopez,Chayanne,Educacion,Cine,Entretenimiento,Ejercicio,Bienestar,Recetas,Recetarios,Musica,Boricuas,Carlos Arroyo, Islanders,TUTV,Motoristas,Internautas ,Medicos,Historiadores,Venezuela,Santo Domingo,Republica Dominicana,Cuba ,Queens,Manhattan,Bronx,Nueva York ,Espiritualidad,Maestros,Televicentro,Telemundo,Univision,Bellas Artes,Telenovela,Comunidades,Isla,Culebrita,St. Thomas,Isla Nena,Culebra,Isla Mona,Vieques,Filiberto Ojeda,Capitolio,Turismo,Periodico,Periodismo,Veteranos,Marina,Soldados,Senado,Energia Electrica,Plena,Bomba,Reggaeton,Wisin y Yandel,Casinos,Salsa,Construccion,Vacaciones,Jazz,Tito Trinidad,Oscar de la Hoya,Millie Corretjer,Roberto Clemente,San Juan,Miguel Cotto,Becas,Tercera Edad,Museos,Clasificados,Clasificados online,Clasificados en linea,Boletines,Servicios de noticias,Luis A. Ferre,Ferre Rangel,Hepatitis,Dengue,Diabetes,Obituarios,Obesidad,Librerias,Libros,Viejo San Juan,El Morro,Ballaja,Museo de Arte de Ponce,Ponce,Mayaguez,Parque Indigena,Observatorio de Arecibo,Tainos,El Yunque,El tunel Guajataca,Zoologico,RUM,UPR,Sagrado Corazon,Interamericana,Universidades,Colegios,Escuelas,Fajardo,Bahia,Luquillo,Bioluminiscente,La Parguera,Carlos Delgado,Igor Gonzalez,Olga Tanon,Piculin Ortiz,Primera Hora,Zonai,Virtual,Beisbol,Mundial ,Raul Papaleo,Sondeos,Justas,Pavas,Palmas,Munoz Marin,Lyann Puig,Elaine Lopez,Javier Lopez,Remi,Poesia ,Olga Nolla,Rosario Ferrer,Mayra Montero&#8221;&gt;</p>
</blockquote>
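<p style="text-align:left;">For those who would rather not count by hand, here is a rough Python sketch that tallies the comma-separated entries of a meta keywords tag and lists the repeated ones. The sample string is an abridged excerpt of the tag quoted above, and the class and function names are hypothetical, chosen only for illustration.</p>

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywords(HTMLParser):
    """Collects the comma-separated entries of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


def keyword_report(html):
    """Return (total entries, sorted list of entries that occur more than once)."""
    parser = MetaKeywords()
    parser.feed(html)
    counts = Counter(k.lower() for k in parser.keywords)
    repeated = sorted(k for k, n in counts.items() if n > 1)
    return len(parser.keywords), repeated


# Abridged excerpt of the tag quoted above, shortened for illustration.
SAMPLE = ('<meta name="Keywords" content="ENDI El Nuevo Dia, periodico, '
          'Horoscopo, Coqui, Telefonica, Coqui net, Horoscopo, Coqui, '
          'Telefonica, Coqui net, Ron, Bacardi">')

total, repeated = keyword_report(SAMPLE)
print(total, repeated)
```

<p style="text-align:left;">Feeding the full page source to <code>keyword_report</code> would give the count for the real tag. The sketch only counts entries and says nothing about how a search engine treats them.</p>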
<p style="text-align:left;">How dumb is that? It is clear these folks have no clue about search engine marketing or how search engines work. Otherwise, why insist on old 1995 practices that have proven to be a waste today? I question whether they have a clue about optimizing news stories and press releases for search engines.</p>
<p style="text-align:left;">OK, SEO firms, pitch them, but don&#8217;t try to sell them snake oil like keyword density, SEO LSI, rare terms crap, synonym stuffing, etc.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/12/when-newspapers-sticks-to-spam-marketing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Building a Query Reduction Search Engine</title>
		<link>http://irthoughts.wordpress.com/2009/02/09/building-a-query-reduction-search-engine/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/09/building-a-query-reduction-search-engine/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 05:00:17 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Queries]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=681</guid>
		<description><![CDATA[As part of ongoing research, I&#8217;m building a search engine with on-demand query reduction capabilities. To our knowledge, none of the current commercial search engines provides such features.
Experimental machines that do this require the use of training sets, decision graphs and decision trees. For references on this topic, read
Query Expansion and Query Reduction in Document Retrieval
A Two-Step Approach [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=681&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As part of ongoing research, I&#8217;m building a search engine with on-demand query reduction capabilities. To our knowledge, none of the current commercial search engines provides such features.</p>
<p>Experimental machines that do this require the use of training sets, decision graphs and decision trees. For references on this topic, read</p>
<p><a href="http://www.csse.monash.edu.au/~ingrid/Publications/ICTAI03ZRW.pdf">Query Expansion and Query Reduction in Document Retrieval</a></p>
<p><a href="http://www.waset.org/pwaset/v23/v23-88.pdf">A Two-Step Approach for Tree-structured XPath Query Reduction</a></p>
<p>Unfortunately, these types of search engines are not popular, in part because they are not practical at the scale of the Web and because they require retraining of both the search engine and its users, not to mention that these types of search machines are not exactly user-friendly.</p>
<p>Think about this: In general, <a href="http://irthoughts.wordpress.com/2008/08/05/search-interface-usability-issues/">average users are lazy searchers</a>. They are too busy to do either query expansion or query reduction as we do in IR, and they are not prone to consulting lookup lists, thesauri, query logs, etc. to refine their searches while surfing across databases. At any given time, the mentality of non-IR searchers is: &#8220;Don&#8217;t make me think&#8221;.</p>
<p>Thus, building a search engine that does on-demand query reduction for the Web (and that users will use without being forced to think) is not that easy.</p>
<p>We would like to hear of others working on similar research as we believe we have found a promising solution, at least partially. Ours is different from the approaches given in the above two references.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/09/building-a-query-reduction-search-engine/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Time Series Semantics</title>
		<link>http://irthoughts.wordpress.com/2009/02/04/time-series-semantics/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/04/time-series-semantics/#comments</comments>
		<pubDate>Wed, 04 Feb 2009 16:26:48 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Latent Semantic Indexing]]></category>
		<category><![CDATA[Queries]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=676</guid>
		<description><![CDATA[The title of this post might be a bit confusing, but we couldn&#8217;t find a better choice of words. The point to be made is that definitions and associations of terms can be affected as events evolve in time.
Consider the key term [man].
Providing a meaning or perception for [man] during a good or a bad [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=676&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>The title of this post might be a bit confusing, but we couldn&#8217;t find a better choice of words. The point to be made is that definitions and associations of terms can be affected as events evolve in time.</p>
<p>Consider the key term [man].</p>
<p>Providing a meaning or perception for [man] during a good or a bad economy is a good example.</p>
<p>[man] means something different, depending on whether you ask an employed or an unemployed [man].</p>
<p>This can be illustrated by reading <a href="http://www.philly.com/philly/entertainment/20090204__quot_A_10_000-pound_gorilla__quot_.html">Why losing a job can hurt men more</a>.</p>
<p>Although it might be quite a depressing article (especially for those with a <span style="color:#ff00ff;"><strong>pink slip</strong></span> on their foreheads), note the key words/phrases that define its topic and overall semantics. Incidentally, the article starts with</p>
<p>&#8220;Thomas Schuler is a man.&#8221;</p>
<p>and almost at the end says:</p>
<p>&#8220;A man is what he does.&#8221;</p>
<p>The key words/phrases of the article can be taken as a semantic state in time attached to the [man] key term.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/04/time-series-semantics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>IRW: Data Mining Credit Cards</title>
		<link>http://irthoughts.wordpress.com/2009/02/02/irw-data-mining-credit-cards/</link>
		<comments>http://irthoughts.wordpress.com/2009/02/02/irw-data-mining-credit-cards/#comments</comments>
		<pubDate>Mon, 02 Feb 2009 16:29:39 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=669</guid>
		<description><![CDATA[
The current issue of IR Watch &#8211; The Newsletter will be available during the day. It consists of the following sections.
Featuring Article: Data Mining Credit Cards
In this issue of the newsletter we cover Luhn’s Algorithm, also known as the Modulus 10 or Mod-10 Test. This algorithm is used for data mining and validation of credit [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=669&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><img class="aligncenter size-full wp-image-672" title="data-mining-credit-cards1" src="http://irthoughts.files.wordpress.com/2009/02/data-mining-credit-cards1.gif?w=450&#038;h=310" alt="data-mining-credit-cards1" width="450" height="310" /></p>
<p>The current issue of IR Watch &#8211; The Newsletter will be available during the day. It consists of the following sections.</p>
<p>Featuring Article: Data Mining Credit Cards</p>
<blockquote><p>In this issue of the newsletter we cover Luhn’s Algorithm, also known as the Modulus 10 or Mod-10 Test. This algorithm is used for data mining and validation of credit cards. Credit card fraud is a topic that never goes away.</p></blockquote>
<p>QA: Types of Links</p>
<blockquote><p>What is the difference between in-links, out-links, co-citation, and co-reference?</p></blockquote>
<p>Historical Notes: The Whirlwind Project</p>
<p>Top CS: State University of New Jersey, Rutgers</p>
<p>Who is Who in IR: Tefko Saracevic</p>
<p>Graduate Theses</p>
<p>Data Mining Blogs</p>
<p>and more.</p>
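<p>As a taste of the featured article, the Mod-10 test can be sketched in a few lines of Python. This is the standard textbook form of Luhn's check, not the newsletter's own code, and the function name is hypothetical. Passing the test only means the check digit is consistent. It does not mean the card exists or is in good standing.</p>

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn (Mod-10) check."""
    # Assumption: spaces and hyphens are the only separators worth tolerating.
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely published test number, not a real card.
    print(luhn_valid("4111 1111 1111 1111"))
```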
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/02/02/irw-data-mining-credit-cards/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>

		<media:content url="http://irthoughts.files.wordpress.com/2009/02/data-mining-credit-cards1.gif" medium="image">
			<media:title type="html">data-mining-credit-cards1</media:title>
		</media:content>
	</item>
		<item>
		<title>When OR Clusters Behave as AND</title>
		<link>http://irthoughts.wordpress.com/2009/01/30/when-or-clusters-behave-as-and/</link>
		<comments>http://irthoughts.wordpress.com/2009/01/30/when-or-clusters-behave-as-and/#comments</comments>
		<pubDate>Fri, 30 Jan 2009 18:39:42 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[IR Tools]]></category>
		<category><![CDATA[Queries]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=662</guid>
		<description><![CDATA[We are currently doing some testing with a new experimental engine. The experiment consists of using OR as the default mode and IDF-only for scoring terms. IDF is precomputed straight from the inverted index, which is also computed at query time. We are also trying to replace IDF with entropy scores.
With large collections, the inverted index [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=662&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We are currently doing some testing with a new experimental engine. The experiment consists of using OR as the default mode and IDF-only for scoring terms. IDF is precomputed straight from the inverted index, which is also computed at query time. We are also trying to replace IDF with entropy scores.</p>
<p>With large collections, the inverted index is written to a text file and read at query time.</p>
<p>Since local information (e.g., term freq) is ignored, keyword spam is not an issue.</p>
<p>Instead of a Vector Space Model, we use a cumulative sum of IDF scores, so it is not necessary to compute cosine similarities (*).</p>
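<p>For readers who want to try something similar, here is a toy Python sketch of OR-mode retrieval with IDF-only scoring. It is not our engine. The IDF formula log(N/df), the whitespace tokenizer, the in-memory index, and all function names are assumptions made for illustration.</p>

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def idf(term, index, n_docs):
    """Assumed IDF form: log(N / df). Terms absent from the index score 0."""
    df = len(index.get(term, ()))
    return math.log(n_docs / df) if df else 0.0


def search_or(query, index, n_docs):
    """OR mode: a document matches if it contains any query term.

    The score is the cumulative sum of the IDF of each matched query term.
    Term frequency within a document is ignored.
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        weight = idf(term, index, n_docs)
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical four-document collection.
DOCS = {
    "d1": "query reduction search engine",
    "d2": "search engine spam spam spam",
    "d3": "query expansion",
    "d4": "latent semantic indexing",
}
INDEX = build_inverted_index(DOCS)
ranked = search_or("query reduction search", INDEX, len(DOCS))
print(ranked)
```

<p>In this toy collection the document matching all three query terms accumulates the largest sum and ranks first, while the repeated word in d2 adds nothing to its score.</p>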
<p>So far, the result of the experiment is that with multi-term queries two extreme clusters are obtained:</p>
<p>1. the top N ranked documents behave almost as if queried in AND mode and as if obeying the Cluster Hypothesis.</p>
<p>2. the M ranked documents at the bottom behave as if queried either in EXACT mode or with a single-term query. (**)</p>
<p>Between these extremes we have some noisy results.  </p>
<p>If anyone has tried this before, we would love to hear about it. Contact us by email.</p>
<p> </p>
<p>PS.</p>
<p>(*) In this way we don&#8217;t need to make independence assumptions.</p>
<p>(**) With a few changes, M now behaves as if queried with single-term queries or few query terms, which is what we expected. The N set is still the more interesting. The middle cases are now quite noisy.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/01/30/when-or-clusters-behave-as-and/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Coming Soon: Data Mining MP3 Players</title>
		<link>http://irthoughts.wordpress.com/2009/01/27/coming-soon-data-mining-mp3-players/</link>
		<comments>http://irthoughts.wordpress.com/2009/01/27/coming-soon-data-mining-mp3-players/#comments</comments>
		<pubDate>Tue, 27 Jan 2009 12:37:15 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Homeland Security]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=654</guid>
		<description><![CDATA[MP3 Confidentials: This morning I saw on CNN a technology news story about how military records, including the names, SSNs, phone numbers, etc. of soldiers, were discovered stored on an MP3 player. According to the story,
Chris Ogle of New Zealand was in Oklahoma about a year ago when he bought a used MP3 player from a thrift store [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=irthoughts.wordpress.com&blog=1041983&post=654&subd=irthoughts&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>MP3 Confidentials: This morning I saw on <a href="http://www.cnn.com/2009/TECH/01/27/confidential.mp3.player/index.html?iref=mpstoryview">CNN</a> a technology news story about how military records, including the names, SSNs, phone numbers, etc. of soldiers, were discovered stored on an MP3 player. According to the story,</p>
<blockquote><p>Chris Ogle of New Zealand was in Oklahoma about a year ago when he bought a used MP3 player from a thrift store for $9. A few weeks ago, he plugged it into his computer to download a song, and he instead discovered confidential U.S. military files.</p>
<p>&#8220;The more I look at it, the more I see, and the less I think I should be,&#8221; Ogle said with a nervous laugh in an interview with TVNZ.</p>
<p>The files included the home addresses, Social Security numbers and cell phone numbers of U.S. soldiers. The player also included what appeared to be mission briefings and lists of equipment deployed to hot spots in Afghanistan and Iraq.</p>
<p>Pentagon officials told CNN that they are aware of the MP3 player, but can&#8217;t talk about it until investigators confirm that the information came from the U.S. Department of Defense. </p>
<p>&#8220;The government isn&#8217;t doing a good job of protecting the information that it collects,&#8221; said Marc Rotenberg of the Electronic Privacy Information Center in Washington.</p>
<p>Despite government efforts to protect sensitive information, this is a growing problem, privacy experts say.</p>
<p>Two years ago, the Department of Veterans Affairs lost track of a laptop with the personal information of millions of soldiers. And computer hard drives with classified military information have been found for sale at street markets in Afghanistan.</p>
<p>&#8220;When you can identify American personnel, when you have their names, their home address, their cell phone numbers, you put people in a dangerous position,&#8221; Rotenberg said.</p></blockquote>
<p>It might be time to cover data mining of MP3 Players.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/01/27/coming-soon-data-mining-mp3-players/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
		<item>
		<title>Tons of Credit Card Transactions Exposed at HPY</title>
		<link>http://irthoughts.wordpress.com/2009/01/22/tons-of-credit-card-transactions-exposed-at-hpy/</link>
		<comments>http://irthoughts.wordpress.com/2009/01/22/tons-of-credit-card-transactions-exposed-at-hpy/#comments</comments>
		<pubDate>Thu, 22 Jan 2009 13:10:25 +0000</pubDate>
		<dc:creator>E. Garcia</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Newsletters]]></category>

		<guid isPermaLink="false">http://irthoughts.wordpress.com/?p=645</guid>
		<description><![CDATA[We learned about this news from a business associate:
According to USA Today, Heartland Payment Systems (HPY) on Tuesday disclosed that intruders hacked into the computers it uses to process 100 million payment card transactions per month for 175,000 merchants.
In IRW &#8211; The Newsletter, we have covered data mining of VINs, SSNs, web analytics fraud, and email [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We learned about this news from a business associate:</p>
<p>According to <a href="http://www.usatoday.com/money/perfi/credit/2009-01-20-heartland-credit-card-security-breach_N.htm">USA Today</a>, Heartland Payment Systems (HPY) on Tuesday disclosed that intruders hacked into the computers it uses to process 100 million payment card transactions per month for 175,000 merchants.</p>
<p>In IRW &#8211; The Newsletter, we have covered data mining of VINs, SSNs, web analytics fraud, and email headers. It might be time to cover credit card mining so readers understand the risks that arise when servers, even test servers, are not properly secured or supervised.</p>
<p>Data Mining at the intersection of Information Retrieval, Business Intelligence, and Information Security is here to stay.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://irthoughts.wordpress.com/2009/01/22/tons-of-credit-card-transactions-exposed-at-hpy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2d26d7051f681fdbb28379876c940a32?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">irthoughts</media:title>
		</media:content>
	</item>
	</channel>
</rss>