Learning about Grover’s Algorithm: Quantum Database Search.
- Grover, L. K. (1996). A fast quantum mechanical algorithm for database search.
- Grover, L. K. (1997). Quantum mechanics helps in searching for a needle in a haystack. Phys. Rev. Lett., 79, 325.
- Lavor, C., Manssur, L.R.U., and Portugal, R. (2008). Grover’s Algorithm: Quantum Database Search.
- Wikipedia. Grover’s algorithm.
Learn about the power of X Searches (short for XOR and XNOR searches) for keyword discovery, disambiguation, clustering, information retrieval, and data mining in general.
This is a follow-up to the Beauty of XOR and XNOR Searches post, describing possible applications of these search modes to Information Retrieval, Search Marketing, and Web Mining. The post is a snippet taken from http://www.minerazzi.com/help/xor-xnor.php
An IR researcher can test the performance of an LSI algorithm with a sample of documents retrieved through XOR and XNOR searches. Said sample should be rich in co-occurrence cases. Using a similar procedure, search marketers or Web intelligence specialists can identify sets of documents that emphasize keywords somehow related through different co-occurrence paths.
An interesting application consists of extracting all the unique terms (or just the high-frequency ones) from a text source and constructing an XOR query with these. We may refer to this as XORing a text source. This should help one identify a network of co-occurrence paths over a collection and which documents might be relevant to specific combinations of terms from the original source.
The text source can be a title, description, abstract, or paragraph of a document, or even an entire document. However, XORing a large document might be compute-intensive.
A similar exercise can be done by XNORing a text source. In both cases, the resultant output can be used to identify prospective competitors; i.e., documents relevant to similar concepts or belonging to companies within the same business space.
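The XORing idea above can be sketched in a few lines. This is a minimal sketch, not Minerazzi's actual implementation: it assumes XOR generalizes to "an odd number of the query terms is present" and XNOR, its complement, to "an even number (including none)"; the function names and toy corpus are made up for illustration.

```python
# Sketch of XOR / XNOR search modes over a toy corpus.
# The odd/even generalization below is an assumption for illustration.

def terms_present(doc: str, query_terms: list[str]) -> int:
    """Count how many of the query terms occur in the document."""
    words = set(doc.lower().split())
    return sum(1 for t in query_terms if t in words)

def xor_search(docs: list[str], query_terms: list[str]) -> list[str]:
    """XOR: keep docs containing an odd number of the query terms."""
    return [d for d in docs if terms_present(d, query_terms) % 2 == 1]

def xnor_search(docs: list[str], query_terms: list[str]) -> list[str]:
    """XNOR: the complement, an even number of terms, including none."""
    return [d for d in docs if terms_present(d, query_terms) % 2 == 0]

docs = [
    "the jaguar is a big cat",    # one query term -> XOR match
    "a jaguar car review",        # both query terms -> XNOR match
    "weather report for today",   # no query terms -> XNOR match
]
query = ["jaguar", "car"]
```

For the two-term query above, XOR returns only the first document (exactly one term present), while XNOR returns the other two, which is how a co-occurrence-rich sample can be separated from the rest of a collection.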
We are currently testing the XOR and XNOR search modes as a query disambiguation strategy.
P.S. Today, 1-9-2014, we added new material discussing these search modes for disambiguation and clustering.
Who said that IR and LSI cannot be fun?
Detecting Cyberbullying: Query Terms and Techniques
As part of the development of Minerazzi, we have published an article explaining two of our search modes: XOR and XNOR. Additional articles explaining other modes will soon follow.
We believe that IR and SEO practitioners will find these search modes particularly useful.
The beauty of XOR and XNOR searches is that these allow users to run complex co-occurrence searches in a straightforward manner. This is important as Latent Semantic Indexing information is related to term-term co-occurrence relationships.
As we keep putting the final touches on Minerazzi, we have upgraded the article on its search operators to a series of articles. The first of the series can be found here: http://www.minerazzi.com/help/search-modes.php
We have taken the time to explain the difference between search modes and their complements using some Venn Diagrams.
In a nutshell, because most are based on flawed statistics.
The Question of Standard Deviations and Variances
If you have studied for the College Board Examination, you should know that standard deviations are not additive. You should also know that variances are additive for independent random variables. Read the article Why Variances Add — And Why It Matters. Many SEOs are unaware of this.
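A quick simulation illustrates the point. This is a minimal sketch using only Python's standard library, with made-up standard deviations of 3 and 4:

```python
import random
import statistics

random.seed(42)

# Two independent random variables with sd = 3 and sd = 4
a = [random.gauss(0, 3) for _ in range(100_000)]
b = [random.gauss(0, 4) for _ in range(100_000)]
s = [x + y for x, y in zip(a, b)]

# Variances add for independent variables: Var(A + B) = 9 + 16 = 25
var_sum = statistics.pvariance(s)

# Standard deviations do not add: sd(A + B) = sqrt(25) = 5, not 3 + 4 = 7
sd_sum = statistics.pstdev(s)
```

The simulated variance of the sum lands near 25 and its standard deviation near 5, nowhere near the 7 one would get by naively adding the standard deviations.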
The Question of Correlation Coefficients
Like standard deviations, correlation coefficients are not additive, period. Since they cannot be added, it is not possible to compute an arithmetic average out of them. The same can be said about cosines, cosine similarities, slopes, and, in general, about any dissimilar ratios. Read the Communications in Statistics article The Self-Weighting Model, in which flaws in the two main meta-analysis models are documented. Again, many SEOs do not understand this point.
The Question of Normality
Although no data set is exactly normally distributed, most statistical analyses require that the data be approximately normally distributed for their findings to be valid; otherwise one cannot claim that, for instance, a computed arithmetic mean (average) is a valid estimator of central tendency for the data at hand. Most SEOs and some "web analytic gurus" out there simply take some data and average them without first doing a normality test.
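The pitfall is easy to demonstrate with skewed data. A small sketch, using log-normal data as an example of a far-from-normal distribution:

```python
import random
import statistics

random.seed(7)

# Heavily right-skewed (log-normal) data: nowhere near normal
data = [random.lognormvariate(0, 1.5) for _ in range(50_000)]

mean = statistics.fmean(data)      # pulled up by the long right tail
median = statistics.median(data)   # a more robust center for these data

# For lognormal(0, 1.5) the median is 1 while the mean is about 3.1,
# so the "average" is not a faithful measure of central tendency here.
```

For these data the mean comes out roughly three times the median, so anyone who "simply averaged" without checking the distribution would report a badly misleading typical value.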
The Question of Big Data and the t-Test of Significance
When the Fathers of Statistics (Fisher and company) came up with the t-test of significance and similar tests, these were meant to be used with small data sets, not big data sets. To illustrate, if you take a very large data set of N paired results, compute a statistic (e.g., a correlation coefficient), and compare it against a t-table value, eventually it will pass the test of significance. This will be true for experimental correlations as small as 0.1, 0.01, 0.001, and so on, provided that N is large enough. Claims of statistical significance are in this case useless. This is why with big data you should try data stratification methods, followed by weighting methods. Big data can lead to big statistical pitfalls.
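The t statistic for testing a Pearson correlation against zero is t = r * sqrt((N - 2) / (1 - r^2)), so for any fixed r > 0 it grows without bound as N grows. A minimal sketch of the effect:

```python
import math

def t_statistic(r: float, n: int) -> float:
    """t for testing H0: rho = 0, given a Pearson r from n pairs."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# The two-tailed critical t at alpha = 0.05 for large df is about 1.96.
small_n = t_statistic(0.01, 100)        # about 0.10: not significant
big_n = t_statistic(0.01, 1_000_000)    # about 10: "significant"
```

A practically meaningless r = 0.01 fails the test at N = 100 but sails past the critical value at N = 1,000,000, which is exactly the big-data pitfall described above.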
The Question of Average of Ratios or Ratio of Averages
Ratios cannot be added and then averaged arithmetically, period. A ratio of averages must be used instead of an average of ratios. The reason is that a ratio distribution is Cauchy. A Cauchy distribution is often mistaken for a normal one, but it has no mean, variance, or higher moments: as more samples are taken, the sample mean and variance change with increasing bias. An arithmetic mean computed from a Cauchy distribution is not an estimate of central tendency. SEOs should know what they are averaging. Check one of my old posts and the comments that followed.
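A tiny example (the numbers are made up) shows how much the two quantities can disagree:

```python
# Made-up paired data, e.g., conversions (y) observed over visits (x)
x = [1, 10]
y = [2, 10]

# Average of ratios: (2/1 + 10/10) / 2 = 1.5
avg_of_ratios = sum(b / a for a, b in zip(x, y)) / len(x)

# Ratio of averages: 12 / 11, about 1.09
ratio_of_avgs = sum(y) / sum(x)

# The average of ratios lets the tiny-denominator pair (1 visit) dominate,
# while the ratio of averages weights each observation by its size.
```

The two figures (1.5 versus roughly 1.09) describe the same data, and small denominators are precisely what make averaged ratios unstable.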
To sum up, beware of SEO statistical “studies”.
Faster than saying "Look, mom: no MAC address needed!"
If during a browser-specific session a user queries a search engine and accesses subscription-based web services provided by the same search engine (web-based email accounts, gadgets, apps…), the IP used for searching can be associated with his or her web service credentials (username, password…). Therefore, it is possible for a search engine to guess the identity of that user and to know what the user is searching for, when, and how. With referrer and click-through data, it is also possible for that search engine to know where said user came from and where he or she went.
In most cases, geolocation data are much more accurate for devices with GPS, like smartphones, and for HTML5-compatible browsers. In general, users' privacy becomes increasingly compromised as more web services, apps, and device features are enabled. On the Web, most free stuff is not really free, but involves a privacy cost; otherwise, it would not be free.
Of course, if during a session a user lends the device to another searcher, the search engine might not be able to guess the identity of that user.
A lot of SEOs regurgitate these terms across the Web with "correlation is not causation", "co-occurrence is this or that…", and the like. When it comes to explaining their data, they simply conflate all of these concepts.
Well… some questions for them:
Correlation is not causation: So, how do you determine and measure causality?
The answer is here: Using Statistics to Determine Causal Relationships
Co-occurrence and association: One is affected by size. Which one?