# Dataset for Sequential Sentence Classification

**10**
*Wednesday*
Apr 2019


A research group at the CSIE department of National Taiwan University (NTU), supervised by Prof. Shou-De Lin, is currently working on releasing a new dataset for sequential sentence classification and type classification of paper abstracts on arXiv.

This annotation and classification project is available at https://mslab.csie.ntu.edu.tw/~labeler/abstract.php

Annotations are made by the article authors after being invited or contacted by the research group. I’m happy to see that they included my “Local Term Weights Models from Power Transformations | Development of BM25IR” paper in the dataset (https://arxiv.org/ftp/arxiv/papers/1608/1608.01573.pdf).


**27**
*Tuesday*
Nov 2018

**Tags**

binary fractals, CSS Fractal Studio, Fractal Patterns, genetic matrices, Genetic Sequences, genetics, Kronecker, Sequencing

Back in 2017, Stepanyan & Petoukhov reported that long nucleotide sequences can be modeled as binary fractals by means of Kronecker exponentiation of matrices.

https://www.mdpi.com/2078-2489/8/1/12/pdf

see also https://arxiv.org/ftp/arxiv/papers/1310/1310.8469.pdf

Abstract reads in part:

“This method uses a set of symmetries of biochemical attributes of nucleotides. It also uses the possibility of presentation of every whole set of N-mers as one of the members of a Kronecker family of genetic matrices. With this method, a long nucleotide sequence can be visually represented as an individual fractal-like mosaic or another regular mosaic of binary type.”
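To get a feel for how Kronecker exponentiation can produce a binary fractal, here is a minimal sketch in Python. It is not the Stepanyan & Petoukhov procedure, which works from the biochemical attributes of nucleotides; the 2×2 seed matrix and the helper names `kron` and `kron_power` are assumptions chosen only for illustration.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_power(seed, k):
    """k-th Kronecker power of seed (k = 0 gives the 1x1 matrix [[1]])."""
    result = [[1]]
    for _ in range(k):
        result = kron(result, seed)
    return result


# Hypothetical binary seed; repeated Kronecker products of it give a
# Sierpinski-like mosaic of 1s and 0s.
SEED = [[1, 1],
        [1, 0]]

mosaic = kron_power(SEED, 3)  # 8 x 8 binary matrix
for row in mosaic:
    print("".join("#" if cell else "." for cell in row))
```

With this seed, each extra power doubles the side of the mosaic and triples the number of filled cells, which is what gives the pattern its self-similar look.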

We added the fractal resembling the pattern of the Homo sapiens chromosome 22 genomic scaffold nucleotide sequence to our Fractal Studio tool at

http://www.minerazzi.com/tools/fractals/studio.php

Researchers can reproduce its binary mosaic, shown above, by selecting the Homo Sapiens Mosaic option from the tool selection menu. Compare the results with Figures 4 and 8 of the Stepanyan & Petoukhov article. Compare also some of the multifractals that the tool generates with some of the genetic mosaics described in the article.

Multidisciplinary research is a beautiful thing.

**18**
*Sunday*
Nov 2018

Posted in Data Mining, Mathematics, Poems

Reverend Charles Lutwidge Dodgson, better known by his pseudonym Lewis Carroll as the creator of *Alice in Wonderland*, proposed in 1866 a new method for computing matrix determinants called Condensation of Determinants.

An elementary proof of Dodgson’s Condensation Method is available from Main, Donor, and Harwood.

Dodgson was a real genius, although with some dark sides (see Wikipedia).

Anyway, hidden in his poems are some gems of linear algebra. The following lines written by Lewis Carroll read the same horizontally and vertically when cases and nonletter characters are ignored.

I often wondered when I cursed,

Often feared where I would be—

Wondered where she’d yield her love

When I yield, so will she,

I would her will be pitied!

Cursed be love! She pitied me…

See figure. That is an example of a symmetric matrix.

**For Math Teachers**

Here is a nice homework: Assign integer values to unique words of the poem and compute its determinant with Dodgson’s Method.
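As a starting point for the homework, here is a minimal sketch in Python. The numbering of words by order of first appearance is an arbitrary assumption (any integer assignment will do), the helper names `poem_to_matrix` and `dodgson_determinant` are hypothetical, and the dash and ellipsis are left out of the poem lines since nonletter characters are ignored anyway.

```python
import re
from fractions import Fraction

POEM = [
    "I often wondered when I cursed,",
    "Often feared where I would be",
    "Wondered where she'd yield her love",
    "When I yield, so will she,",
    "I would her will be pitied!",
    "Cursed be love! She pitied me",
]


def poem_to_matrix(lines):
    """Map each unique word (case and nonletters ignored) to an integer."""
    ids = {}
    matrix = []
    for line in lines:
        words = re.sub(r"[^a-z ]", "", line.lower()).split()
        # Assumption: number words 1, 2, 3, ... by first appearance.
        matrix.append([ids.setdefault(w, len(ids) + 1) for w in words])
    return matrix


def dodgson_determinant(matrix):
    """Determinant by Dodgson condensation, using exact fractions.

    Raises ZeroDivisionError if an interior entry becomes zero, the
    case where plain condensation breaks down.
    """
    n = len(matrix)
    current = [[Fraction(x) for x in row] for row in matrix]
    # The divisors for the first condensation step are all 1.
    previous = [[Fraction(1)] * (n + 1) for _ in range(n + 1)]
    while len(current) > 1:
        size = len(current) - 1
        condensed = [
            [
                (current[i][j] * current[i + 1][j + 1]
                 - current[i][j + 1] * current[i + 1][j])
                / previous[i + 1][j + 1]  # interior entry from two steps back
                for j in range(size)
            ]
            for i in range(size)
        ]
        previous, current = current, condensed
    return current[0][0]


poem_matrix = poem_to_matrix(POEM)
for row in poem_matrix:
    print(row)
print("determinant:", dodgson_determinant(poem_matrix))
```

A different assignment of integers gives a different determinant, and an assignment that produces a zero interior entry needs a row or column swap before condensing.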

Challenge 1: Starting at the top-left corner, read the diagonal elements of the matrix. Explain the meaning of the diagonal message in the Carroll sense and then in the Dodgson sense.

Challenge 2: There are also other hidden messages in that matrix to read. Hint: Move through rows/cols adopting a traveling pattern. Can you find them?

**01**
*Thursday*
Nov 2018

This is the third and last part of a tutorial series on the non-additivity of correlation coefficients.

http://www.minerazzi.com/tutorials/nonadditivity-correlations-part-3.pdf

Their bias & nature, transformations, and approximations to normality are discussed. The risks of blindly transforming scores to ranks or arbitrarily converting r-to-Z values/Z-to-r values (Fisher Transformations) are discussed. Shifted up cosine approximations to normality are also covered.

Not all researchers know that score-to-rank transformations can change the sampling distribution of a statistic (e.g. a correlation coefficient) and that Fisher transformations are sensitive to normality violations. Combining both types of transformations is a recipe for a statistical disaster.

Alas, some meta-analysis and data analytics folks are guilty of that.

**09**
*Tuesday*
Oct 2018

Posted in Chemometrics, Data Mining, Mathematics, minerazzi, Statistics and Mathematics

Curating collections requires going to original sources, which is gratifying.

As part of the effort of building a miner on the golden age of Statistics, I researched those from Ronald Fisher's times who might still be alive. I found one researcher who is precisely Fisher’s only PhD student: Calyampudi Radhakrishna Rao, now 98.

I asked Dr. Rao for help in identifying important references and moments from those times. He graciously sent me his CV listing references to all of his glorious books (15), articles (477), and moments.

Even in his retirement he is still publishing:

http://www.pnas.org/content/pnas/early/2017/03/28/1702654114.full.pdf

Dr. Rao also sent me a PDF with historical photos of him with Mahalanobis, Prime Minister Nehru, Prime Minister Indira Gandhi, and others, and of many glorious moments from his career. What an honor!

His work has impacted so many fields that there are several technical terms bearing his name.

Here is an appealing quote from him:

“We study physics to solve problems in physics, chemistry to solve problems in chemistry, and botany to solve problems in botany. There are no statistical problems which we solve using statistics. We use statistics to provide a course of action with minimum risk in all areas of human endeavor under available evidence. — C. R. Rao”

**05**
*Friday*
Oct 2018

Ronald Aylmer Fisher was considered an outsider by the statistical establishment of his time.

The links below (1-3) show his struggles & nuances with Karl Pearson, his son Egon, Bowley, their followers, and the Royal Statistical Society (RSS). His life was a story of accomplishments and noise (deceptions and nasty RSS politics). He was too far ahead of his time.

That reminds me of the struggles of another maverick: Benoit Mandelbrot. Eventually, like Mandelbrot's, Fisher's greatness was recognized. Also like Mandelbrot, he was able to boost the signal-to-noise ratio of his career and life.

Most statisticians consider Fisher the Father of Modern Statistics (https://en.wikipedia.org/wiki/Ronald_Fisher), even though he was not allowed to teach Statistics at the University of Cambridge (they tried to silence Fisher).

Yes, scientists too can be demeaning to other scientists, more for personal reasons than for ideas and the Scientific Method. After all, they are also mostly carbon units called “humans”.

1. Fisher in 1921 https://projecteuclid.org/download/pdfview_1/euclid.ss/1118065041

2. Fisher vs Pearson: A 1935 Exchange from Nature

http://physics.princeton.edu/~mcdonald/examples/statistics/inman_as_48_2_94.pdf

3. Fisher: The Outsider

R. A. Fisher: how an outsider revolutionized statistics

#Fisher

#Pearson

**04**
*Thursday*
Oct 2018

A Goldmine for collection curators: Selected Bibliography of Statistical Literature, 1930-1957: Correlation and Regression Theory.

https://nvlpubs.nist.gov/nistpubs/jres/64B/jresv64Bn1p55_A1b.pdf

Enjoy it!

**14**
*Friday*
Sep 2018

We have updated and improved our Regression & Correlation Calculator to demonstrate, as shown in the above figure, that a Spearman’s Correlation Coefficient is just a Pearson’s Correlation Coefficient computed from ranks.

The tool uses an algorithm that converts values to ranks and averages any ties that might be present before calculating the correlations. This comes in handy when we need to compute a Spearman’s Correlation Coefficient from ranks with a large number of ties.

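We have explained in the “What is Computed?” section of the tool’s page that, as the number of ties increases, the classic textbook formula for computing Spearman’s correlations,

$$r_{s} = 1 - \frac{6 \sum d_{i}^{2}}{n(n^{2} - 1)}$$

where $d_{i}$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations, increasingly overestimates the results, even if ties were averaged.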

By contrast, computing a Spearman’s as a Pearson’s always works, with or without ties.

To illustrate the above, consider the following two sets:

X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

Using Spearman’s classic equation, r_{s} = 0.6364 ≈ 0.64.

By contrast, r_{s} = 0.5222 ≈ 0.52 when computed as a Pearson coefficient derived from ranks. This is a nontrivial difference.
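The two numbers above can be reproduced with a short Python sketch. This is not the calculator's own code; the helper names `average_ranks`, `pearson`, and `spearman_classic` are assumptions made for illustration, and the classic formula used is the textbook one based on squared rank differences.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_classic(rank_x, rank_y):
    """Textbook formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))


X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

rank_x, rank_y = average_ranks(X), average_ranks(Y)
print(f"classic formula:  {spearman_classic(rank_x, rank_y):.4f}")  # 0.6364
print(f"Pearson on ranks: {pearson(rank_x, rank_y):.4f}")           # 0.5222
```

The nine tied values in Y all receive the average rank 5, which is why the squared rank differences in the classic formula no longer reflect the actual covariance of the ranks.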

Accordingly, we can make a case for ditching Spearman’s classic formula for good.

We also demonstrate in the page’s tool why we should never arithmetically add or average Spearman’s correlation coefficients. The same goes for Pearson’s.
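One simple way to see the problem with averaging, sketched here in Python, is to compare the mean of two subgroup correlations with the correlation of the pooled data. This is not the demonstration given in the tool; the data and the `pearson` helper are hypothetical and chosen only to make the effect easy to check.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: two groups, each perfectly correlated on its own.
xa, ya = [1, 2, 3], [1, 2, 3]
xb, yb = [4, 5, 6], [10, 11, 12]

r_a = pearson(xa, ya)
r_b = pearson(xb, yb)
r_mean = (r_a + r_b) / 2              # arithmetic average of the two r values
r_pooled = pearson(xa + xb, ya + yb)  # correlation of all six points together

print(f"r_a = {r_a:.4f}, r_b = {r_b:.4f}")
print(f"average of r values: {r_mean:.4f}")
print(f"pooled r:            {r_pooled:.4f}")
```

Both groups have r = 1, so their average is 1, yet the six pooled points do not fall on a single line and their correlation is lower. The average of the coefficients does not describe the combined data.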

Early articles in the literature of correlation coefficients theory failed to recognize the non-additivity of Pearson’s and Spearman’s Correlation Coefficients.

Sad to say, this is sometimes reflected in current research articles, textbooks, and online publications. The worst offenders are some marketers and teachers who, in order to protect their failing models, resist considering up-to-date research on the topic.

PS. Updated on 09-14-2018 to include the numerical example and to rewrite some lines.

**11**
*Tuesday*
Sep 2018

Posted in chemical mining, Chemometrics, Data Mining, ir, Mathematics, Statistics and Mathematics

I got a copy of this nice research work, written as a book chapter, Building Classes of Similar Chemical Elements from Binary Compounds and their Stoichiometries, from its author, Guillermo Restrepo.

It is great to see chemistry research at the intersection of similarity-based classification studies.

Read it. It is a nice work!

**23**
*Thursday*
Aug 2018

Posted in Arithmetic Geometry, Data Mining, Mathematics, Perfectoid Spaces, Quantum Theory

Possible connections and applied research resources:

Perfectoid Spaces, Arithmetic Geometry, and Quantum Theory

Quantum Geometric Langlands Correspondence

https://ncatlab.org/nlab/show/quantum+geometric+Langlands+correspondence

Grand Unification of Mathematics and Physics

http://www.math.columbia.edu/~woit/wordpress/?p=7114

A new approach to Quantum Mechanics I : Overview

http://vixra.org/pdf/1803.0626v1.pdf

Is the tone appropriate? Is the mathematics at the right level?

https://mathematicswithoutapologies.wordpress.com/2018/06/02/is-the-tone-appropriate-is-the-mathematics-at-the-right-level/

Other useful references:

http://arxiv.org/abs/1201.6343

http://math.berkeley.edu/courses/fall-2014-math-274-001-lec

http://mathoverflow.net/questions/162803/p-adic-string-theory-and-the-string-orientation-of-topological-modular-forms-tm

http://ncatlab.org/nlab/show/LanglandsOxford2014

http://ncatlab.org/nlab/show/Weil

http://ncatlab.org/nlab/show/differential

http://ncatlab.org/nlab/show/function+field+analogy

http://ncatlab.org/nlab/show/geometric+Langlands+correspondence#GerasimovLebedevOblezin0

http://ncatlab.org/nlab/show/string+orientation+of+tmf

http://sciencematters.berkeley.edu/archives/volume2/issue12/story3.php

http://www.msri.org/programs/276

http://www.msri.org/workshops/710