A novel system for robust data analysis yielding maximum
information

1 Introduction

Mathematical Gnostics is a non-statistical tool for efficiently treating small
samples of strongly uncertain data. It consists of:

The axiomatic theory of uncertainty of individual uncertain data and small data
samples,

Numerical characteristics of data uncertainty resulting from this theory,

Algorithms and programs estimating true data values along with characteristics of
their uncertainty.

This approach resulted from long-term research at the Institute of
Information Theory and Automation of the Czechoslovak Academy of Sciences in Prague.
The theory was published by P. Kovanic (1984, 1986). Pragmatic readers not interested
in the theoretical foundations of the approach may skip the following two abstract
sections and go directly to programs and applications.

2 The Gnostic Theory of Individual Data

The theoretical model of uncertainty of individual data uses the
axioms of the measurement theory created by von Helmholtz (1887) as a system of
conditions necessary and sufficient for consistent quantification (counting or
measuring) of real quantities. Measurement theory considers only uncertainty-free
quantification, leaving possible errors to statistical treatment. In contrast,
the Gnostic model of uncertain quantification is two-dimensional: it applies the
aforementioned conditions to both the true and the uncertain component of the measured
quantity. The fundamental assumption of the theory is thus that the real data to be
treated were obtained by orderly (consistent) measurement or counting. No assumptions
of a statistical nature are used.

The consistency conditions, accepted as the first axiom of the theory, lead to the
representation of a data item as an element of a commutative bi-algebra, which implies
a Minkowskian metric on the data space. A surprising consequence of this metric can be
interpreted as maximization of the “damage” caused by Nature's (uncertain)
contribution to the quantified data value: the data item is moved along a special
(extreme) path in the Minkowskian space. To counter this damage, the analyst looks for
a way of minimizing the uncertainty, that is, for the best possible path back to the
hidden true value. The form of the former (“quantifying”, Minkowskian) path, together
with the latter (“estimating”, Euclidean) opposite path, justifies the introduction of
two pairs of non-linear uncertainty measures: the quantifying and the estimating data
weight and data irrelevance.

Using a plausible Gedanken-experiment, close connections to the classical (Clausius',
pre-statistical) concept of thermodynamic entropy are shown. They allow the entropy
change caused by the uncertainty of an individual data item to be evaluated and the
corresponding information of the data item to be estimated. Both the quantifying and
the estimating versions of these quantities reach extreme values when the quantifying
and estimating processes follow the mentioned paths, which form the closed Gnostic
Cycle. Along this Cycle, the entropy change causes the information change and vice
versa. An important but natural principle is proved: the effect of a finite
contribution of uncertainty to a data item value can be minimized by optimum
estimation, but it can never be completely removed. This result can be interpreted as
an information complement to the Second Law of Thermodynamics.

Gnostic data weight and irrelevance exhibit natural robustness
of two kinds: the estimating characteristics are robust with respect to outliers
(peripheral data), while the quantifying ones are robust with respect to inliers
(central data and noise in the data sample). This feature makes them suitable for
strongly uncertain data. However, it can be shown that all these characteristics
converge to statistical ones when the uncertainty is very weak: data irrelevances
converge to ordinary (Euclidean) errors and the data weights converge to squared
errors. This presents classical (non-robust) statistics as a tool for handling
“relatively good” data and Gnostics as its robust extension, with its own, different
theoretical foundation that justifies its application to small data
samples.
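
The two kinds of robustness and the weak-uncertainty limit can be illustrated numerically. The following is a minimal sketch, not the authors' program: the hyperbolic forms of the four characteristics, the function name gnostic_characteristics and the scale argument are assumptions made for illustration, since this text does not state the formulae.

```python
import math


def gnostic_characteristics(z, z0, scale=1.0):
    """Return (f_q, h_q, f_e, h_e) for an observed value z and a true value z0.

    Assumed forms (not given in this text), with omega = ln(z / z0) / scale:
      quantifying weight      f_q = cosh(2 * omega)
      quantifying irrelevance h_q = sinh(2 * omega)
      estimating weight       f_e = 1 / cosh(2 * omega)
      estimating irrelevance  h_e = tanh(2 * omega)
    """
    if z <= 0 or z0 <= 0 or scale <= 0:
        raise ValueError("z, z0 and scale must be positive")
    omega = math.log(z / z0) / scale
    f_q = math.cosh(2.0 * omega)
    h_q = math.sinh(2.0 * omega)
    # The estimating pair is the quantifying pair divided by f_q.
    return f_q, h_q, 1.0 / f_q, h_q / f_q


if __name__ == "__main__":
    # From an almost exact observation to a gross outlier (true value 1.0).
    for z in (1.001, 1.1, 2.0, 100.0):
        f_q, h_q, f_e, h_e = gnostic_characteristics(z, 1.0)
        print(f"z={z:8.3f}  f_q={f_q:12.4f}  h_q={h_q:12.4f}  "
              f"f_e={f_e:.6f}  h_e={h_e:.6f}")
```

Under these assumed forms the quantifying pair satisfies f_q² − h_q² = 1 (Minkowskian) and the estimating pair satisfies f_e² + h_e² = 1 (Euclidean). The estimating irrelevance stays bounded however far the outlier lies. For z close to z0 it is nearly proportional to the error, and the deviation of the estimating weight from 1 behaves like a squared error.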

3 The Gnostic Theory of Small Data Samples

In statistics, samples of 30 data items are considered large enough
for the Central Limit Theorem to apply at least approximately.
Unfortunately, some applications do not provide the analyst with samples of this
size or larger. Moreover, many statistical methods rest on a priori assumptions
about the statistical model of the data, which can be neither tested nor justified
with small data samples.

In contrast, the Gnostic model of individual data uncertainty is
based only on the assumption that the underlying data are real, i.e. obtained by
a consistent measuring process. To pass from the uncertainty of an individual data
item to the uncertainty of a data sample, one needs a well-justified Aggregation Law
for uncertain data and their characteristics. Fortunately, the Minkowskian metric
proved for the space of the quantification process has a surprising consequence: there
exists a Lorentz-invariant linear isomorphism between each quantifying pair of data
weight and data irrelevance and the corresponding energy-momentum pair of a charge-free
relativistic particle. The relativistic composition law is well justified by the
relativistic Energy and Momentum Conservation Law, which is additive. The requirement
of quantification/estimation consistency motivates accepting the additive Composition
Law for both the quantifying and the estimating data weight and data irrelevance (as
the second axiom of the theory).
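
The correspondence can be written out as a short worked relation. This is only a sketch: the hyperbolic form of the quantifying pair and the symbols f_Q, h_Q and Ω (a scaled deviation of the observed from the true value) are assumptions not stated in this text, while the relativistic side uses the standard rapidity parameterization with rest mass m and rapidity φ.

```latex
\begin{align}
  f_Q &= \cosh(2\Omega), & h_Q &= \sinh(2\Omega), &
  f_Q^2 - h_Q^2 &= 1, \\
  \frac{E}{mc^2} &= \cosh\varphi, & \frac{pc}{mc^2} &= \sinh\varphi, &
  \left(\frac{E}{mc^2}\right)^2 - \left(\frac{pc}{mc^2}\right)^2 &= 1 .
\end{align}
```

Identifying φ with 2Ω maps the weight to the scaled energy and the irrelevance to the scaled momentum. Because the energies and momenta of a system of particles add, the same additivity carries over to the weights and irrelevances of the items of a data sample.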

Data weights and irrelevances are parameterized by the ratio of the
observed to the true data value. However, the true data value is unknown. It must be
estimated from all items of the data sample: the missing true value is
estimated by maximizing the information of the sample's aggregated data
items.

Accepting the data weights and irrelevances (which depend non-linearly
on the data values) as uncertainty measures, one uses a geometry of the
Riemannian type, valid in a curved space. However, the curvature of the space and its
geometry are determined not subjectively by the analyst, but objectively by the
observed data. This is how Gnostics satisfies the requirement “Let the data speak for
themselves”.

4 Gnostic Algorithms

The theory was always developed in close interaction
with verification of the algorithms that use the Gnostic formulae in applications.
There are two classes of analytic tasks: marginal (one-dimensional) and
multi-dimensional analysis. To solve them with the Gnostic algorithms, one needs only
data; no statistical model assumptions are used.

The Gnostic one-dimensional analysis is based on four non-standard
types of probability distribution functions and their densities: ELDF (Estimating
Local), EGDF (Estimating Global), QLDF (Quantifying Local) and QGDF (Quantifying
Global) distribution functions. The estimating functions are robust with respect to
outliers, the quantifying ones with respect to inliers. The flexibility of the ELDF
can be controlled by the scale parameter to reveal details of the structure of a
non-homogeneous data sample. In contrast, the EGDF is relatively rigid and provides an
overall view of a homogeneous data sample. Its rigidity enables robust solution not
only of the “ordinary” tasks (estimation of probabilities and quantiles), but also of
some special tasks:

Estimation of the bounds of the data domain (data support),

Objective estimation of the membership in a sample,

Estimation of the left- and right-censored and interval data,

Reliable testing of data homogeneity,

Reliable one- and two-sample hypothesis testing,

Estimation of covariances and correlations,

Robust filtering of data,

Probabilistic prediction and monitoring of processes,

Automatic data exploration and classification robustly providing the detailed
information on data features and on their quality.
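
The role of the scale parameter in a local distribution function can be sketched with a simple kernel-type estimator. The code below is an illustration under stated assumptions, not the ELDF as implemented in the Gnostic programs: it assumes additive data (for example, logarithms of positive measurements), a tanh-shaped contribution of each data item, and the hypothetical names local_distribution and count_modes.

```python
import math


def local_distribution(x0, data, scale):
    """Return (distribution, density) of a kernel-type local estimate at x0.

    Assumption for this sketch: each data item x contributes
    (1 + tanh(2 * (x0 - x) / scale)) / 2 to the distribution function;
    the density is the derivative of that sum with respect to x0.
    """
    n = len(data)
    cdf = 0.0
    pdf = 0.0
    for x in data:
        t = math.tanh(2.0 * (x0 - x) / scale)
        cdf += 0.5 * (1.0 + t)
        pdf += (1.0 - t * t) / scale  # 1 - tanh^2 equals 1 / cosh^2
    return cdf / n, pdf / n


def count_modes(data, scale, grid):
    """Count strict local maxima of the density over the grid points."""
    dens = [local_distribution(g, data, scale)[1] for g in grid]
    return sum(1 for i in range(1, len(dens) - 1)
               if dens[i - 1] < dens[i] > dens[i + 1])


# Illustrative, made-up sample with two clusters (not real measurements).
sample = [-0.1, 0.0, 0.1, 2.9, 3.0, 3.1]
grid = [-2.0 + 0.05 * i for i in range(141)]

if __name__ == "__main__":
    for s in (1.0, 8.0):
        print(f"scale={s}: modes found = {count_modes(sample, s, grid)}")
```

In this sketch a small scale makes the estimate flexible enough to separate the two clusters, while a large scale smooths them into a single mode, which mirrors the way the scale parameter is described as controlling the level of detail.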

The quantifying distribution functions can be advantageously applied
to data contaminated by noise that partly masks the larger “signals”, carrying out
analogous functions in a “noise-robust” manner.

The Gnostic multi-dimensional analysis is mainly based on robust
identification of several types of regression models. All of them use a Gnostic
“influence” function to maximize the information of the results within the method
of Iterated Weighted Least Squares. Both explicit (ordinary) and implicit forms of the
regression model are solved, along with regression in probabilities, which expresses
the interdependence of data probabilities instead of the data themselves. All the types
offer some advantages and find reasonable applications.
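
The iteration itself can be sketched in a few lines. This is a generic iteratively reweighted least-squares loop for a straight line, not the Gnostic regression programs: the weight 1 − tanh²(2r/scale) is a placeholder chosen for this example because the Gnostic influence function is not defined in this text, and the names irls_line_fit and scale are hypothetical.

```python
import numpy as np


def irls_line_fit(x, y, scale=1.0, iterations=50, tol=1e-10):
    """Fit y ~ a + b * x by iteratively reweighted least squares.

    Placeholder weight (an assumption of this sketch): residual r gets the
    weight 1 - tanh(2 * r / scale)**2, which falls to zero for gross outliers.
    Returns the coefficient array [a, b].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    # Start from the ordinary (unweighted) least-squares solution.
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    for _ in range(iterations):
        residuals = y - design @ coef
        weights = 1.0 - np.tanh(2.0 * residuals / scale) ** 2
        root_w = np.sqrt(weights)
        # Weighted least squares via rescaled rows of the design matrix.
        new_coef = np.linalg.lstsq(design * root_w[:, None],
                                   y * root_w, rcond=None)[0]
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef


# Illustrative, made-up data: an exact line y = 1 + 2x with one gross outlier.
x_demo = np.arange(10.0)
y_demo = 1.0 + 2.0 * x_demo
y_demo[7] += 30.0

ols_coef = np.linalg.lstsq(np.column_stack([np.ones_like(x_demo), x_demo]),
                           y_demo, rcond=None)[0]
robust_coef = irls_line_fit(x_demo, y_demo)

if __name__ == "__main__":
    print("ordinary least squares:", ols_coef)
    print("reweighted fit:        ", robust_coef)
```

On this made-up data the ordinary fit is pulled toward the outlier, while the reweighted fit returns to the underlying line because the outlier's weight shrinks to zero during the iterations. The choice of scale matters: a value that is too small can also suppress good data.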

Being supported by a coherent theory, these algorithms are
applicable not only under some special assumptions but more universally and
objectively, because everything needed for data processing is determined by the data.
Optimality of the results consists in maximizing the information mined from the
data.

5 Applications

There is rich experience with successful applications of Gnostic programs in many
fields of science and technology:

Environmental control: analysis of pollutants in water, air and human organisms,
and of toxicity.