On the Non-Additivity of Correlation Coefficients Part 3: The Bias & Nature of Correlation Coefficients



[Figure: statistical relationships]
This is the third and last part of a tutorial series on the non-additivity of correlation coefficients.

The bias and nature of correlation coefficients, their transformations, and approximations to normality are discussed, along with the risks of blindly transforming scores to ranks or arbitrarily converting r-to-Z or Z-to-r values (Fisher transformations). Shifted-up cosine approximations to normality are also covered.

Not all researchers know that score-to-rank transformations can change the sampling distribution of a statistic (e.g. a correlation coefficient) and that Fisher transformations are sensitive to normality violations. Combining both types of transformations is a recipe for a statistical disaster.

Alas, some meta-analysis and data analytics folks are guilty of that.


Hybrid Similarity Search (HSS) Algorithm for Chemistry Searching



Hybrid Similarity Search (HSS) Algorithm for Chemistry Searching for Fentanyl-related compounds and other drugs.
Free version: https://www.mswil.com/images/NIST/NIST17/GCMS-Hybrid-Search-AnalChem-2017.pdf

This is news from NIST from back in March (https://www.nist.gov/news-events/news/2018/03/free-software-can-help-spot-new-forms-fentanyl-and-other-illegal-drugs), found with the NIST RSS channel of the Chemical Substances Miner: http://www.minerazzi.com/chemsubstances/spp.php

It is a nice example of Information Retrieval applied to Chemistry. They used a modified cosine similarity function. I see possible applications to topic analysis.

Original Source:
Anal. Chem., 2017, 89 (24), pp 13261–13268 DOI: 10.1021/acs.analchem.7b03320

“A mass spectral library search algorithm that identifies compounds that differ from library compounds by a single “inert” structural component is described. This algorithm, the Hybrid Similarity Search, generates a similarity score based on matching both fragment ions and neutral losses. It employs the parameter DeltaMass, defined as the mass difference between query and library compounds, to shift neutral loss peaks in the library spectrum to match corresponding neutral loss peaks in the query spectrum. When the spectra being compared differ by a single structural feature, these matching neutral loss peaks should contain that structural feature. This method extends the scope of the library to include spectra of “nearest-neighbor” compounds that differ from library compounds by a single chemical moiety. Additionally, determination of the structural origin of the shifted peaks can aid in the determination of the chemical structure and fragmentation mechanism of the query compound. A variety of examples are presented, including the identification of designer drugs and chemical derivatives not present in the library.”

C.R. Rao: A Giant & Living Legend of Statistics Among Us



Curating collections requires going to original sources, which is gratifying.

As part of the effort of building a miner on the golden age of Statistics, I researched those from Ronald Fisher’s times who might still be alive. I found one researcher who happens to be Fisher’s only PhD: Calyampudi Radhakrishna Rao, now 98.

I asked Dr. Rao for help in identifying important references and moments from those times. He graciously sent me his CV listing references to all of his glorious books (15), articles (477), and moments.

Even in his retirement he is still publishing:

Dr. Rao also sent me a PDF with historical photos of him with Mahalanobis, Prime Minister Nehru, Prime Minister Indira Gandhi, and others, and of many glorious moments from his career. What an honor!

His work has impacted so many fields that there are several technical terms bearing his name.

Here is an appealing quote from him:

“We study physics to solve problems in physics, chemistry to solve problems in chemistry, and botany to solve problems in botany. There are no statistical problems which we solve using statistics. We use statistics to provide a course of action with minimum risk in all areas of human endeavor under available evidence. — C. R. Rao”

On Mind Retrieval: BrainNet



This post is part of a series on Mind Retrieval that we started a long time ago, back in 2010. Links to previous posts can be found here:

The Internet can be thought of as a network of nodes where each node has an assigned IP address. The Web and Deep Web are subsets of this. Traversing these nodes is done with crawlers. Retrieving the information that flows across them is what we know as information retrieval. The information might consist of data, text, sound, or images. The text might convey numbers, words, phrases, topics, ideas, theses…

Now if the nodes are human brains, extracting information from the parts of the brain responsible for thoughts, emotions, and senses can be loosely termed Mind Retrieval.

So Mind Retrieval requires communication between nodes (brains). The BrainNet Project is one step in this direction. Here is an interesting link (http://info.247apk.com/brainnet-can-have-three-brains-talk-to-each-other/) where three human brains were “connected” to accomplish specific tasks.

Imagine a large scale project where the brightest minds can interact to take on bigger tasks or solve important scientific and practical problems. Imagine artificial brains (bot brains) doing the same.

When I started this series years ago, many did not believe in the idea of mind retrieval, dismissing it as mere speculation.

I guess they have retracted since then. Mind Retrieval and brain-to-brain social networks are coming to a corner near you, along with peripherals: browsers, reprogramming, adverts, teleports, hacks, etc.

On Men and Ideas: Fisher vs. Pearson



Ronald Aylmer Fisher was considered an outsider by the statistical establishment of his time.

The links below (1-3) show his struggles & nuances with Karl Pearson, his son Egon, Bowley, their followers, and the Royal Statistical Society (RSS). His life was a story of accomplishments and noise (deceptions and nasty RSS politics). He was too far ahead of his time.

That reminds me of the struggles of another maverick: Benoit Mandelbrot. Eventually, as with Mandelbrot, Fisher’s greatness was recognized. Also like Mandelbrot, he was able to boost the signal-to-noise ratio of his career and life.

Most statisticians consider Fisher the Father of Modern Statistics (https://en.wikipedia.org/wiki/Ronald_Fisher), even though he was not allowed to teach Statistics at the University of Cambridge (they tried to silence Fisher).

Yes, scientists too can be demeaning to other scientists, more for personal reasons than for ideas and the Scientific Method. After all, they are also mostly carbon units called “humans”.

1. Fisher in 1921 https://projecteuclid.org/download/pdfview_1/euclid.ss/1118065041

2. Fisher vs Pearson: A 1935 Exchange from Nature

3. Fisher: The Outsider
R. A. Fisher: how an outsider revolutionized statistics


Predatory Journals Miner



Predatory Journals Miner. Our most recent miner. Find, research, and judge for yourself if a journal is predatory.

The problem is pretty bad, particularly when those publishing in said journals are rewarded with career promotions and job security.

It is all about the money, from all sides (authors, journals, publishers, and conferences).

Over 5,000 Japanese articles published in predatory journals

Predatory Publishers

Predatory Conferences

The problem is not unique to science conferences and “experts”. There are others out there that qualify as predatory; for instance, some predatory marketing conferences, some SEM/SEO “experts”, blah, blah,… Nothing new under the sun.

On a more positive note, here is a nice third-party tool to verify a journal

This one is predatory. That one isn’t. Really? Are all journals predatory?



Researching the origins of so-called trusted publishers (https://www.theguardian.com/science/2017/jun/27/profitable-business-scientific-publishing-bad-for-science) helped me understand the mentality behind alleged open access predatory journals & publishers, which are beating them at their own game.

There is nothing new under the sun. It is all about the money (https://en.wikipedia.org/wiki/Robert_Maxwell). Time to build a miner on that.



Regression & Correlation Calculator: Updates and Improvements



[Figure: Regression & Correlation]

We have updated and improved our Regression & Correlation Calculator to demonstrate, as shown in the above figure, that a Spearman’s Correlation Coefficient is just a Pearson’s Correlation Coefficient computed from ranks.

The tool uses an algorithm that converts values to ranks and averages any ties that might be present before calculating the correlations. This comes in handy when we need to compute a Spearman’s Correlation Coefficient from ranks with a large number of ties.

We have explained in the “What is Computed?” section of the tool’s page that, as the number of ties increases, the classic textbook formula for computing Spearman’s correlations

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

(where d_i is the difference between the two ranks of the i-th pair and n is the number of pairs) increasingly overestimates the results, even if ties were averaged.

By contrast, computing a Spearman’s as a Pearson’s always works, whether or not ties are present.

To illustrate the above, consider the following two sets:

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

Using Spearman’s classic equation, rs = 0.6364 ≈ 0.64.
By contrast, rs = 0.5222 ≈ 0.52 when computed as a Pearson coefficient derived from ranks. This is a nontrivial difference.
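
The two numbers above can be checked with a few lines of Python. This is only a minimal sketch, not the calculator's actual code: the function names (`average_ranks`, `pearson`, `spearman_classic`, `spearman_as_pearson`) are made up for illustration, and the tie handling assumes the "average the ranks of tied values" rule described earlier.

```python
from math import sqrt

def average_ranks(values):
    """Rank values 1..n; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_classic(x, y):
    """Textbook formula: 1 - 6*sum(d^2) / (n*(n^2 - 1)), on averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def spearman_as_pearson(x, y):
    """Spearman computed as a Pearson coefficient of the averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

print(round(spearman_classic(X, Y), 4))     # 0.6364
print(round(spearman_as_pearson(X, Y), 4))  # 0.5222
```

When neither set has ties, both functions return the same value, so the Pearson-on-ranks route loses nothing in the tie-free case.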

Accordingly, we can make a case as to why we should ditch Spearman’s classic formula for good.

We also demonstrate in the page’s tool why we should never arithmetically add or average Spearman’s correlation coefficients. The same goes for Pearson’s.
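
One simple way to see the problem with averaging is a toy case where two subsets have opposite trends. The sketch below is not taken from the tool; the data are hypothetical and chosen only so that the arithmetic is easy to follow. It compares the naive average of two Pearson coefficients with the coefficient computed from the pooled data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: subset A rises perfectly, subset B falls perfectly.
xa, ya = [1, 2, 3], [1, 2, 3]
xb, yb = [10, 11, 12], [12, 11, 10]

r_a = pearson(xa, ya)                 # perfect positive correlation
r_b = pearson(xb, yb)                 # perfect negative correlation
naive_average = (r_a + r_b) / 2       # arithmetic mean of the two r values
r_pooled = pearson(xa + xb, ya + yb)  # r of all six points together

print(r_a, r_b)            # 1.0 -1.0
print(naive_average)       # 0.0
print(round(r_pooled, 3))  # 0.968
```

The naive average suggests no relationship at all, while the pooled points are strongly correlated. This toy case only shows that the arithmetic mean of the r values does not describe the combined data; it does not say which combining procedure is appropriate.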

Early articles in the literature of correlation coefficients theory failed to recognize the non-additivity of Pearson’s and Spearman’s Correlation Coefficients.

Sad to say, this is sometimes reflected in current research articles, textbooks, and online publications. The worst offenders are some marketers and teachers who, in order to protect their failing models, resist considering up-to-date research on the topic.

PS. Updated on 09-14-2018 to include the numerical example and to rewrite some lines.

Chemistry at the Intersection of Similarity-Based Classification



I got a copy of this nice research work, written as a book chapter, Building Classes of Similar Chemical Elements from Binary Compounds and their Stoichiometries, from its author, Guillermo Restrepo.

It is great to see chemistry research at the intersection of similarity-based classification studies.

Read it. It is a nice work!