Archive for January, 2008

Back Mapping for the Masses

January 31, 2008

In a recent tutorial on association and scalar clusters, http://www.miislita.com/information-retrieval-tutorial/association-scalar-clusters-tutorial-1.pdf, I introduced a back mapping technique: once clusters of features are extracted from a set of objects, the clusters are mapped back to those objects.

The technique works well with clusters of terms extracted from documents. The reverse case is also possible: given a cluster of documents extracted from terms, it is possible to map these back to terms.

What do we gain from such two-way manipulations? A lot. Consider the first scenario, mapping term clusters back to documents; a tutorial on the second scenario will be available soon.

Back Mapping Term Clusters to Documents

A document is just a distribution over topics while topics are distributions over words. Thus, across a collection of documents there are topics hidden (latent) and waiting to be uncovered. Back mapping allows us to recover these, precisely.

Combinations of terms that do not amount to topics across the collection are discovered as well. Reasonably, one would expect these to be less relevant to other documents than the combinations that are distributed across the collection. In addition, one would expect the documents traced back to clusters to be the most relevant documents in the collection with respect to the topics.

The implications of this for search engine optimization and keyword bidding are quite obvious. Implementation is straightforward. To learn more about it, read Part 1 of the tutorial.

Web Mining Week 9

January 28, 2008

Week 9 Agenda

Intelligence Searching for Penetration Testers (PPT Presentation)
Searching for Terrorist Threats and Identity Thefts, the SSN Way (PPT Presentation)
Mining VIN numbers, Email Headers, and other Undocumented Commands (PPT Presentation)

Required Reading Material

Provided during lecture.

The Power of Document Linearization

January 25, 2008

In http://www.miislita.com/fractals/keyword-density-optimization.html I explained to the SEO community the concept of document linearization as part of document GAP analysis. Marketers learned what IR graduate students already know: that document linearization (i.e., markup removal) is just one component of document indexing.

Keyword distribution, word distances, phrase matching, etc. are obtained from the text stream that results from linearization, not from the apparent position of text as rendered by a browser and visually inspected by average end users. Document linearization debunks the common SEO Keyword Density Myth. The apparent distribution of words, as perceived when end users visually scan a document, is one thing; the actual word distribution, as parsed by a search engine, is another. The futility of computing KD values is quite obvious.
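To make the point concrete, here is a minimal sketch of linearization in Python. It uses the standard library `html.parser` module; the `Linearizer` class and `linearize` function are illustrative names of my own, and a real indexer does considerably more (tokenization rules, stop words, and so on). The sample markup assumes a CSS float that makes a browser show "cheap flights" on the left, even though "hotel deals" comes first in the source.

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Collect words in source order, ignoring markup and script/style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.tokens.extend(data.split())


def linearize(html):
    """Return the word stream of an HTML string, in source order."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = (
    '<div style="float:right">hotel deals</div>'
    "<div>cheap <b>flights</b></div>"
    "<script>var x = 1;</script>"
)
stream = linearize(page)
print(stream)
# Word distance measured on the stream, not on the rendered layout.
print(stream.index("flights") - stream.index("hotel"))
```

Any distance or density figure computed from the rendered layout would disagree with the one computed from this stream, which is the stream a parser actually sees.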

Here is a report from another SEO who recently discovered the power of document linearization:

http://seo-gw.blogspot.com/2008/01/fractal-semantics-linearization.html

The testimonial is worth reading.

The post http://irthoughts.wordpress.com/2007/12/20/from-keyword-density-to-william-tuttes-legacy/ is also relevant these days.

Search for posts on keyword density: http://irthoughts.wordpress.com/?s=keyword+density

Microsoft’s Black Cloud on Yahoo! & SEO Tag Clouds

January 23, 2008

From time to time rumors spread of the black cloud of Microsoft over Yahoo!; i.e., of Microsoft buying Yahoo!. This time things are less cloudy, especially now that Yahoo! is about to cut jobs.

Early this year, Jeremy Zawodny from Yahoo! wrote:

“Sure, there would be cultural problems, integration challenges, and many people who’d likely walk. But at the end of the day, Microsoft would end up with a much larger set of online services, a better advertising network, and people who know how to build, brand, and market web stuff that people actually use.”

Talking about clouds:

A student asked me about some SEOs claiming that text tag clouds are a kind of LSI technology.

Pure nonsense coming from many SEOs, as usual.

These clouds are easy to construct. No LSI is needed:

1. Sort terms from a document or lookup list by frequencies.
2. Normalize frequencies to fall within the [0, 1] interval.
3. Use normalized frequencies as parameters to be passed as font sizes.

For pizzazz, store the terms in an array to be sorted or randomized, and/or use some CSS.
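The three steps can be sketched in a few lines of Python. This is only an illustration: the function name `tag_cloud`, the pixel range (`min_px`, `max_px`), and the choice of min-max normalization are my assumptions, and dividing by the largest frequency would serve just as well.

```python
import random
from collections import Counter
from html import escape


def tag_cloud(text, min_px=12, max_px=36, shuffle=False, seed=None):
    """Return an HTML tag cloud for whitespace-separated terms in `text`."""
    counts = Counter(text.lower().split())
    if not counts:
        return ""
    # Step 1: sort terms by frequency (ties broken alphabetically).
    terms = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    if shuffle:
        # Optional: randomize the display order.
        random.Random(seed).shuffle(terms)
    parts = []
    for term, freq in terms:
        # Step 2: min-max normalize the frequency into [0, 1].
        weight = (freq - lo) / span if span else 1.0
        # Step 3: use the normalized value to pick a font size.
        size = round(min_px + weight * (max_px - min_px))
        parts.append(f'<span style="font-size:{size}px">{escape(term)}</span>')
    return " ".join(parts)


print(tag_cloud("lsi svd lsi cloud lsi svd"))
```

Nothing here goes beyond counting and scaling, which is the whole point.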

We can do the same with hit counts assigned to blog categories, links, etc. No special technology is needed.

Association & Scalar Clusters Tutorial - Part 1

January 22, 2008

I am writing a tutorial series on Cluster Analysis. It is my pleasure to announce that the Association and Scalar Clusters Tutorial - Part 1: Back Mapping Term Clusters to Documents was uploaded a few days ago.

Online publication was announced in advance to subscribers of IR Watch - The Newsletter, so they already have an edge over regular readers and visitors of Mi Islita.

Abstract follows:

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A “reaction” equation approach is used to break down the classification problem to a sequence of steps. From the initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified. A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters. Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation, and document classification are discussed.
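For readers who want a feel for the pipeline before opening the PDF, here is a rough Python sketch. It is not the tutorial's procedure. It assumes the textbook definitions (association values as co-occurrence sums over documents, scalar values as cosines between rows of the association matrix), and the back mapping score used here, the fraction of a cluster's terms present in a document, is a simple stand-in of my own for "degree of pertinence". All function names are illustrative.

```python
import math


def association_matrix(td):
    """td[i][j] = frequency of term i in document j; returns c[u][v] = sum_j td[u][j] * td[v][j]."""
    return [[sum(a * b for a, b in zip(ru, rv)) for rv in td] for ru in td]


def scalar_matrix(assoc):
    """Cosine similarity between every pair of rows of the association matrix."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in assoc]
    n = len(assoc)
    return [
        [
            sum(a * b for a, b in zip(assoc[u], assoc[v])) / (norms[u] * norms[v])
            if norms[u] and norms[v]
            else 0.0
            for v in range(n)
        ]
        for u in range(n)
    ]


def cluster_for(sim, u, size):
    """Term u plus the `size` terms most similar to it under the matrix `sim`."""
    others = sorted((v for v in range(len(sim)) if v != u), key=lambda v: (-sim[u][v], v))
    return [u] + others[:size]


def back_map(td, cluster):
    """Score each document by the fraction of cluster terms it contains (an assumed measure)."""
    n_docs = len(td[0])
    return [sum(1 for t in cluster if td[t][j] > 0) / len(cluster) for j in range(n_docs)]


# Toy term-document matrix: 4 terms (rows) by 3 documents (columns).
td = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
    [0, 1, 1],
]
assoc = association_matrix(td)
cluster = cluster_for(assoc, 0, 1)   # association cluster around term 0
print(cluster, back_map(td, cluster))
```

Passing `scalar_matrix(assoc)` to `cluster_for` instead of `assoc` gives the scalar-cluster variant of the same walk-through.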

During last night's lecture (Web Mining Course), I applied the back mapping technique to scalar clusters generated from LSI. The technique provides additional information and reasons as to how and why documents score as observed after implementing SVD. A clear connection with Fuzzy Set Theory was made.

Students taking the Web Mining Course will find this tutorial quite handy.

Web Mining Week 8

January 21, 2008

Week 8 Agenda

Take-Home Work 3 and Web Mining Course FAQs
LSI and Scalar Cluster Analysis: An EXCEL Spreadsheet Approach (PPT presentation)
LSI and Fuzzy Sets = Fuzzy LSI
Introduction to Intelligence Searches (PPT presentation)
Bonus: My IPAM Lost Pictures at the 2006 Document Indexing Workshop

Required Reading Material

http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf 
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Finding Topic-Specific Posts

January 18, 2008

Global Term Weights based on Entropies

January 16, 2008

A grad student taking the Web Mining, Search Engines, and Business Intelligence course asked me to clarify global weights G defined as entropies.

Global weights based on entropies are frequently combined with local and normalization weights into overall weights. These are then used to populate a term-document matrix. The matrix can be used with term vector models to rank documents. The same matrix can be decomposed with SVD (LSI) and used to rank documents.

The following set of equations defines the global entropy weight of term i in a collection of just 3 documents (N=3). I am providing two extreme cases:

Global Entropy Weights

Evidently,

G = 0 if the term is equally mentioned in all documents of the collection.
G = 1 if the term is present in just one document.

Any other combination of frequencies yields G values somewhere between 0 and 1. Thus, the model gives higher weights to terms whose occurrences are concentrated in a small number of documents, while lowering the weights of terms that are spread evenly across the collection.

Note that the convention is to set p log p = 0 when p = 0, since log 0 is undefined; when p = 1 the product is 0 anyway because log 1 = 0.
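The two extreme cases are easy to check numerically. The sketch below assumes the commonly used log-entropy form, G_i = 1 + (sum over j of p_ij log p_ij) / log N, where p_ij is the frequency of term i in document j divided by its total frequency in the collection. The function name `global_entropy_weight` is illustrative.

```python
import math


def global_entropy_weight(term_freqs):
    """Global entropy weight of one term, given its frequency in each of N documents.

    Assumes G = 1 + sum_j(p_j * log p_j) / log N with p_j = tf_j / sum(tf).
    """
    n_docs = len(term_freqs)
    total = sum(term_freqs)
    if n_docs < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence of the term")
    acc = 0.0
    for tf in term_freqs:
        p = tf / total
        if p > 0:  # convention: p log p = 0 when p = 0
            acc += p * math.log(p)
    return 1.0 + acc / math.log(n_docs)


print(global_entropy_weight([2, 2, 2]))  # equally mentioned in all 3 documents
print(global_entropy_weight([5, 0, 0]))  # present in just one document
print(global_entropy_weight([3, 1, 0]))  # an intermediate case
```

The first call comes out at 0 (up to floating-point rounding), the second at 1, and the third falls in between.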

Web Mining Week 7

January 14, 2008

Week 7 Agenda

Review of Association and Scalar Clusters
Review of Vector Space Models
LSI & SVD: Demystifying LSI SEO Myths (OJOBuscador Congress, Madrid; PDF Presentation)
LSI & Keyword Research (PDF Presentation)
SVD Noise Filtering: Principal Component Analysis (PCA)

Required Reading Material

Tutorial Series
This is part one of a five-part tutorial series:
http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

Fast Tracks
These are quick tutorials, with to-the-point calculations:
http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/latent-semantic-indexing-fast-track-tutorial.pdf
http://www.miislita.com/information-retrieval-tutorial/lsi-keyword-research-fast-track-tutorial.pdf

Blog Posts
These are IR blog posts designed to fight back against misinformation promoted by unethical SEOs and Spammers:
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
http://irthoughts.wordpress.com/page/1/?s=lsi
http://irthoughts.wordpress.com/page/2/?s=lsi

Blog Category
This is a blog category pointing to a collage of posts that demystify SEO nonsense about LSI. Some are about topics that overlap with LSI:
http://irthoughts.wordpress.com/category/latent-semantic-indexing/

Web Mining and Search Engines Architecture Courses

January 11, 2008

Winter back to school.

Here is the schedule for the Web Mining, Search Engines, and Business Intelligence graduate course for the coming weeks.

Jan 14 - LSI and SVD: A hands-on approach. Covers SEO LSI Myths

Jan 21 - Intelligence Searching: Ethical hacking and penetration testing with search engines

Jan 28 - Spam Intelligence: Ethical Spamming, spamdexing, and Adversarial IR strategies

Feb 4 - On-Topic Analysis and Co-Occurrence Theory

Feb 11 - TBA

Next Spring I will be teaching the advanced graduate course Search Engines Architecture.

This is a hands-on course where students will spend most of the time in the Software Testing Lab. We will build crawlers, dbas, parsers, search interfaces, etc. Students doing or interested in working on projects/theses with me are encouraged to take the course.

Triangular Link Swapping

January 10, 2008

Over the past few days we received emails requesting what the senders call “triangular link swapping”. Here is the most recent template-type request (I’m removing the links):

Dear Info,

I am writing on behalf of __________________.

We are looking for triangular link swapping with some good qualitysites as  yours. You must already aware that triangular link swapping is much more popular and beneficial than a reciprocal link exchange. This way both of us will be benefited. I would request you to place my link at your site and in return I will have a link / exclusive page created for your site on our Directory.

Here are details of my site :

URL: __________________

An EXACT search in Google for “triangular link swapping” returns results with almost the same wording.

This is nothing new, just the building blocks of link farms and link islands. Nice try.

Thesis: DNIDS Using the CSI-KNN Algorithm

January 4, 2008

Here is a great 2007 MS Thesis by Liwei (Vivian) Kuang of the School of Computing, Queen’s University, Kingston, Ontario, Canada: DNIDS: A Dependable Network Intrusion Detection System Using the CSI-KNN Algorithm.

I’m happy she quoted my Cosine Similarity Tutorial.

Part of the abstract states: “In this thesis, we propose a Dependable Network Intrusion Detection System (DNIDS) based on the Combined Strangeness and Isolation measure K-Nearest Neighbor (CSI-KNN) algorithm. The DNIDS can effectively detect network intrusions while providing continued service even under attacks. The intrusion detection algorithm analyzes different characteristics of network data by employing two measures: strangeness and isolation. Based on these measures, a correlation unit raises intrusion alerts with associated confidence estimates. In the DNIDS, multiple CSI-KNN classifiers work in parallel to deal with different types of network traffic. An intrusion-tolerant mechanism monitors the classifiers and the hosts on which the classifiers reside and enables the IDS to survive component failure due to intrusions. As soon as a failed IDS component is discovered, a copy of the component is installed to replace it and the detection service continues.”

“We evaluate our detection approach over the KDD’99 benchmark dataset. The experimental results show that the performance of our approach is better than the best result of the KDD’99 contest winner. In addition, the intrusion alerts generated by our algorithm provide graded confidence that offers some insight into the reliability of the intrusion detection. To verify the survivability of the DNIDS, we test the prototype in simulated attack scenarios. In addition, we evaluate the performance of the intrusion-tolerant mechanism and analyze the system reliability.”

SIMCALC, Binary Similarity Calculator

January 2, 2008

SIMCALC, my binary similarity calculator, is officially online.

The calculator was designed to compute similarity measures between any two binary strings of identical length. To use the calculator, users must be familiar with Data Mining and similarity analysis. Read the instructions. Since strings are treated as vectors, the tool also works as a vector analyzer.
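For those who prefer to see the arithmetic, here is a small Python sketch of a few standard binary similarity measures. It is not SIMCALC's code, and the particular measures chosen (simple matching, Jaccard, cosine, Hamming distance) are my selection of common textbook ones for illustration; the function name `binary_similarities` is hypothetical.

```python
import math


def binary_similarities(a, b):
    """Compare two equal-length strings of '0'/'1' characters."""
    if len(a) != len(b) or not a:
        raise ValueError("strings must be non-empty and of identical length")
    if set(a + b) - {"0", "1"}:
        raise ValueError("strings must contain only 0s and 1s")
    # Contingency counts: n11 = both 1, n00 = both 0, n10 / n01 = mismatches.
    n11 = sum(x == "1" and y == "1" for x, y in zip(a, b))
    n00 = sum(x == "0" and y == "0" for x, y in zip(a, b))
    n10 = sum(x == "1" and y == "0" for x, y in zip(a, b))
    n01 = sum(x == "0" and y == "1" for x, y in zip(a, b))
    union = n11 + n10 + n01
    norm = math.sqrt((n11 + n10) * (n11 + n01))
    return {
        "simple_matching": (n11 + n00) / len(a),
        "jaccard": n11 / union if union else None,  # undefined for two all-zero strings
        "cosine": n11 / norm if norm else None,     # undefined if either string is all zeros
        "hamming_distance": n10 + n01,
    }


print(binary_similarities("1010101", "1010101"))
print(binary_similarities("1100", "1010"))
```

Identical strings give the maximum value on every similarity measure and a Hamming distance of zero, which makes them a handy sanity check.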

Who could benefit from this tool?

Scholars

IR/Statistics teachers, students, and researchers can use this calculator for classroom demonstrations or to compare results or exams of the Right (1), Wrong (0) type.

Investigators

Investigators and testers can use it to examine possible cases of duplicated content, fraud, or plagiarism.

Marketers

Marketing and sales executives can use the tool to score consumers’ satisfaction questionnaires of the Yes (1), No (0) type.

Business Intelligence Analysts

Analysts can use it to extract patterns and correlations from polls, surveys, and similar intelligence instruments.

The following figure depicts SIMCALC sample results for the 1010101 and 1010101 vectors:

Binary Similarity Calculator