Learn about the power of X Searches (short for XOR and XNOR searches) for keyword discovery, disambiguation, clustering, information retrieval, and data mining in general.
This is a follow-up to the Beauty of XOR and XNOR searches post, describing possible applications of these search modes to Information Retrieval, Search Marketing, and Web Mining. The post is a snippet taken from http://www.minerazzi.com/help/xor-xnor.php
An IR researcher can test the performance of an LSI algorithm with a sample of documents retrieved through XOR and XNOR searches. Said sample should be rich in co-occurrence cases. Using a similar procedure, search marketers or Web intelligence specialists can identify sets of documents that emphasize keywords somehow related through different co-occurrence paths.
An interesting application consists of extracting all the unique terms (or just the high-frequency ones) from a text source and constructing an XOR query with them. We may refer to this as XORing a text source. This should help one identify a network of co-occurrence paths over a collection and which documents might be relevant to specific combinations of terms from the original source.
The text source can be a title, description, abstract, or paragraph of a document, or even an entire document. However, XORing a large document might be computationally intensive.
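As a minimal sketch of XORing a text source, the snippet below extracts the unique terms from a short text and joins them into a single query string. The tokenizer, the stopword list, the `top_n` cutoff, the function name `xor_query`, and the `XOR`/`XNOR` operator syntax are all assumptions made for illustration; the query syntax actually accepted by Minerazzi may differ.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real application would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on"}

def xor_query(text, top_n=None, operator="XOR"):
    """Build a query string from the unique terms of a text source.

    `operator` is an assumed syntax, not a documented one.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # most_common() orders terms by frequency; ties keep first-seen order.
    terms = [term for term, _ in counts.most_common(top_n)]
    return f" {operator} ".join(terms)

title = "Latent Semantic Indexing and the Beauty of Latent Structure"
print(xor_query(title))                            # all unique terms
print(xor_query(title, top_n=2, operator="XNOR"))  # high-frequency terms only
```

Passing `top_n` keeps only the most frequent terms, which is one way to limit the cost of XORing a long document.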
A similar exercise can be done by XNORing a text source. In both cases, the resultant output can be used to identify prospective competitors; i.e., documents relevant to similar concepts or belonging to companies within the same business space.
We are currently testing the XOR and XNOR search modes as a query disambiguation strategy.
PS. Today, 1-9-2014, we added new material that discusses these search modes for disambiguation and clustering. :)
Who said that IR and LSI cannot be fun? Detecting Cyberbullying: Query Terms and Techniques
As part of the development of Minerazzi, we have published an article explaining two of our search modes: XOR and XNOR. Additional articles explaining other modes will soon follow.
We believe that IR and SEO practitioners will find these search modes particularly useful.
The beauty of XOR and XNOR searches is that they allow users to run complex co-occurrence searches in a straightforward manner. This is important because the information that Latent Semantic Indexing (LSI) captures is related to term-term co-occurrence relationships.
On January 23-27, 2006 I was at the Institute for Pure and Applied Mathematics, UCLA, California attending a now infamous Document Space Workshop. I took some pictures, but did not find these until now.
I’ve posted these on my Facebook page. In them I am posing with the then IPAM director and with world-recognized LSI expert Dr. Michael Berry and his former students. To learn more about the workshop and the speakers, follow this link: http://www.miislita.com/ipam/ipam-document-space-workshop.pdf
Today I feel like giving away a little quiz material on applied linear algebra. The topic is relevant these days, when some misleading SEOs are playing the we-do-“science” game (quack “science”, after all).
The following is taken from the Search Engines Architecture grad course I taught back in 2008. I’m providing only one exercise with multiple parts. The quiz with answers might be a great topic for an IRW issue.
1.1 A search engine has three types of revenue channels: pay-per-click (PPC), pay-per-placement (PPP), and pay-for-conversion (PFC). In quarter 1, the revenues (in millions of dollars) were 20, 4, and 9, respectively. In quarter 2, PPC revenues were 20% less, PPP revenues doubled, and PFC revenues remained constant.
1.1.1 Write a matrix M1 expressing the revenue and quarter vectors for the first two quarters.
1.1.2 If the goal in quarter 3 is to increase by 20% all revenues earned in quarter 2, update M1 so it reflects such a goal as a new matrix M2.
1.1.3 If the goal in quarter 4 is to meet, for each channel, the average of its revenues over the previous quarters, update M2 so that it reflects that goal as a new matrix M3.
1.1.4 Express the above quarters as column unit vectors. Inspecting either rows or columns, construct a nearest neighbor similarity matrix Mnn and construct scalar clusters of quarters. Ignore cosine similarity deviations of 0.02 units or less. How similar are the quarters?
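For readers who want to check their work, here is a minimal sketch of the setup, not an official answer key. It assumes that channels are rows and quarters are columns, and that the quarter 4 goal means each channel's mean over quarters 1 to 3; the variable and function names are hypothetical.

```python
import math

# Revenues in millions of dollars, ordered as (PPC, PPP, PFC).
q1 = [20.0, 4.0, 9.0]
q2 = [q1[0] * 0.8, q1[1] * 2.0, q1[2]]  # PPC 20% less, PPP doubled, PFC constant
q3 = [1.2 * r for r in q2]              # 20% above quarter 2
# Assumption: quarter 4 targets each channel's mean over quarters 1-3.
q4 = [(a + b + c) / 3.0 for a, b, c in zip(q1, q2, q3)]

quarters = [q1, q2, q3, q4]

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(columns):
    """Pairwise cosine similarities between the given vectors."""
    units = [unit(c) for c in columns]
    return [[sum(a * b for a, b in zip(u, w)) for w in units] for u in units]

# M3 laid out with channels as rows and quarters as columns.
m3 = [list(row) for row in zip(*quarters)]
sim = cosine_matrix(quarters)

for row in m3:
    print([round(x, 2) for x in row])
for row in sim:
    print([round(s, 4) for s in row])
```

Because quarter 3 is a uniform 20% scaling of quarter 2, their unit vectors coincide and their cosine similarity is 1. The clustering step, with the 0.02 tolerance, is left to the reader.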
We have published a new article titled:
The traditional way of constructing fractal patterns is with mathematical algorithms. Some of these are based on pixel-by-pixel drawing techniques wherein the output of a recursive function is evaluated against a predefined condition. This is a slow process that requires a large number of iterations.
Other strategies or combinations of strategies have been proposed; for instance, using HTML tables, image files, VML, SVG, or the canvas tag introduced in HTML5. In most cases, implementing these strategies and techniques is not a straightforward process, involves a learning curve, or unnecessarily consumes Web server resources.
None of this is necessary with our approach. We are also providing the corresponding source code at Mi Islita.com, so others can reproduce or improve our results. These are WinRAR-zipped files. An unzipped, live example for the Sierpinski Gasket is provided.
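To make the pixel-by-pixel idea concrete, here is a minimal sketch using the well-known escape-time test for the Mandelbrot set, chosen only as a familiar example of the traditional approach. The grid size, the iteration cap, and the function names are assumptions made for illustration.

```python
def escape_time(c, max_iter=50, bailout=2.0):
    """Iterate z -> z*z + c; return the step at which |z| exceeds bailout."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > bailout:   # the predefined condition
            return n
        z = z * z + c          # the recursive function
    return max_iter

def render(width=60, height=24, max_iter=50):
    """Evaluate every 'pixel' of a text grid, one at a time."""
    rows = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 3.0 * i / (width - 1)
            inside = escape_time(complex(x, y), max_iter) == max_iter
            row += "#" if inside else "."
        rows.append(row)
    return rows

print("\n".join(render()))
```

In the worst case the work grows as width × height × max_iter, which is why this kind of rendering is slow at high resolutions.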
Here are some great papers on the Mann Iteration Method.
“What does it have to do with IR, CSS or Fractals?”, you might ask. Well, I’m using it in an upcoming article.
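For readers unfamiliar with the method, the Mann iteration approximates a fixed point of a map T with the update x_{n+1} = (1 - a_n) x_n + a_n T(x_n), where each a_n lies in (0, 1]. The sketch below is only an illustration: the choice of T = cos, the constant step a_n = 0.5, and the function name are assumptions, not taken from the papers or the upcoming article.

```python
import math

def mann_iteration(T, x0, alpha, steps=100):
    """Run x_{n+1} = (1 - a_n) * x_n + a_n * T(x_n) for a fixed number of steps."""
    x = x0
    for n in range(steps):
        a = alpha(n)  # step size a_n in (0, 1]
        x = (1.0 - a) * x + a * T(x)
    return x

# Assumed example: approximate the fixed point of cos(x) with a constant step.
fixed = mann_iteration(math.cos, x0=1.0, alpha=lambda n: 0.5)
print(fixed, math.cos(fixed))
```

With a_n = 1 for every n the update reduces to plain fixed-point (Picard) iteration.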
In a recent post (http://irthoughts.wordpress.com/2010/04/23/beware-of-seo-statistical-studies/) we warned readers against SEO statistical studies. For those who want a second opinion, here is a collection of correlation coefficient myths, taken from a reviewed manuscript written by top researchers from Sandia National Labs, Stony Brook University, Iowa State University, Lewis and Clark College, and Applied Biomathematics. Enjoy it.
The authors provide applications to risk analysis while debunking the following widespread myths:
1. All variables are mutually independent.
2. If X and Y are independent and Y and Z are independent, then X and Z are too.
3. Variables X and Y are independent if and only if they are uncorrelated.
4. Zero correlation between X and Y means there’s no relationship between X and Y.
5. Small correlations imply weak dependence.
6. Small correlations can be “safely ignored” in risk assessments.
7. Different correlation coefficients are similar.
8. A correlation coefficient specifies the dependence between two random variables.
9. Correlation coefficients vary between −1 and +1.
10. Any correlation can be specified between inputs.
11. Perfect dependencies between X and Y and between X and Z imply perfect dependence between Y and Z.
12. Monte Carlo simulations can account for dependencies between variables.
13. Varying correlation coefficients constitutes a sensitivity analysis for uncertainty about dependence.
14. A model should be expressed in terms of independent variables only.
15. You have to know the dependence to model it.
16. The notion of independence generalizes to imprecise probabilities.
Read the article to understand in which context these are listed, before agreeing or disagreeing with these.
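As a small illustration of why myths 3 and 4 are myths, the sketch below computes Pearson's correlation for a sample in which Y is completely determined by X, yet the coefficient is zero. The data and the helper function are made up for this example and are not taken from the manuscript.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # Y depends entirely on X, but not linearly
r = pearson(xs, ys)
print(r)
```

The symmetric sample cancels every linear trend, so the coefficient vanishes even though knowing X tells us Y exactly.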
They also provide reference material on many correlation coefficients:
“There are many different measures of correlation that are in common use and many more that have been proposed. The most commonly used measures are Pearson’s product-moment correlation and Spearman’s (1904) rank correlation, but there are a host of other measures that also arise in various engineering contexts, including Kendall’s rank correlation, concordance of various kinds (e.g., Hoeffding 1947; Lehmann 1966; Scarsini 1984), Blomqvist’s (1950) coefficient, Gini’s coefficient (Nelsen 1999), etc. Hutchinson and Lai (1990) review many of these.”
I’ve uploaded a new article on fractals:
In an upcoming article I will demonstrate how to create classic fractals as found in the literature, right in the user’s browser, with no HTML5 canvas tag and without high-level mathematical algorithms. We only need to use CSS and HTML. Markup code will be provided for those interested in reproducing the results.