Installation

The BDQC framework has no requirements other than Python 3.3.2 or later.
The GCC toolchain is required for installation as some of its
components are C code that must be compiled.

After extracting the archive...

python3 setup.py install

...installs the framework, after which...

python3 -m bdqc.scan <directory>

...will analyze all files in <directory>, and

python3 -m bdqc.scan --help

...provides further help.
The contents of the online help are not repeated in this document.

Overview

What is it?

BDQC is a Python3 software framework and executable module.
Although it provides built-in capabilities that make it useful "out of the
box", being a "framework" means that users (knowledgeable in Python
programming) can extend its capabilities, and it is intended to
be so extended.

What is it for?

BDQC identifies anomalous files among large collections of files which are
a priori assumed to be "similar."
It was motivated by the realization that when faced with many thousands of
individual files it can be challenging to even confirm they all contain
approximately what they should.

These use cases merely highlight different sources of anomalies in data.
In the first, anomalies might be due to faulty data handling or acquisition
(e.g. sloppy manual procedures or faulty sensors). In the second, anomalies
might appear in data due to bugs in pipeline software or runtime failures
(e.g. power outages, network unavailability, etc.). Finally, anomalies that
can't be discounted as being due to technical problems might actually be
"interesting" observations to be followed up in research.

Although it was developed in the context of genomics research, it is
expressly not tied to a specific knowledge domain. It can be customized
as much as desired (via the plugin mechanism) for specific knowledge domains.

Importantly, files are its fundamental unit of operation.
This means that a file must constitute a meaningful unit of
information--one sample's data, for example--in any
application of BDQC.

What does it do?

BDQC analyzes a collection of files in two stages.
First, it analyzes each file individually and produces a summary of the
file's content (Within-file Analysis).
Second, the aggregated file summaries are analyzed heuristically
(Between-file Analysis) to identify possible anomalies.

The two stages of operation can be run independently.

BDQC can be run from the command line, and command line arguments control
which files are analyzed,
how files are summarized,
and how the summaries are aggregated and finally analyzed.
All command line arguments are optional; the framework will carry out
default actions. See command line help.

Alternatively, the bdqc.Executor Python class can be incorporated directly
into third party Python code. This allows it to be incorporated into
pipelines.

Results

A successful run of bdqc.scan ends with one of three general results:

nothing of interest found ("Everything is OK.")

two or more files were found to be incomparable

anomalies were detected in specific files

Files are considered "incomparable" when they are so different (e.g.
log files and JPEG image files) that comparison is essentially meaningless.
This rarely occurs because of the way Between-file Analysis works.

A file is considered "anomalous" when one or more of the statistics computed
on its content (Within-file Analysis) are outliers, either in the usual
sense of the word or as explained in Between-file Analysis.

In the second and third cases, a report is optionally generated (as text or HTML)
summarizing the evidence.

Design goals

The BDQC framework was developed with several explicit goals in mind:

Identify an "anomalous" file among a large collection of similar files of arbitrary type with as little guidance from the user as possible, ideally none. In other words, it should be useful "out of the box" with almost no learning curve.

"Simple things should be simple; complex things should be possible" [1] Although basic use should involve almost no learning curve, it should be possible to extend it with arbitrarily complex (and possibly domain-specific) analysis capabilities.

Plugins should be simple (for a competent Python programmer) to develop, and the framework must be robust to faults in plugins.

How does it work?

This section describes in more detail how BDQC works internally.
This and following sections are required reading for anyone
wanting to develop their own plugins.

The most important fact to understand about BDQC is that
plugins, not the framework, carry out all within-file analysis of input files.
The BDQC framework merely orchestrates the execution of plugins
and performs the final Between-file Analysis, but only plugins
examine a file's content.
(The BDQC package includes several "built-in" plugins which ensure
it is useful "out of the box." Though they are built-in, they are
nonetheless plugins because they follow the plugin architecture.)

Plugins are simply Python modules installable like any Python module.
Plugins provide functions that can read a file and produce one or more
summary statistics about it.
The functions are expected to take certain forms, and the plugin is expected
to export certain symbols used by the BDQC framework.

Within-file Analysis

The plugins that are executed on a file entirely determine
the content of the summary (the statistics) generated for that file.
The framework itself never looks inside a file; only the plugins examine
file content.

The framework:

assembles a list of paths identifying files to be analyzed,

executes a dynamically-determined subset of the available plugins on each file path,

merges the plugins' results into one (JSON-format) summary per analyzed file.

Each plugin can declare (as part of its implementation) that it depends
on zero or more other plugins.

The framework:

ensures that a plugin's dependencies execute before the plugin itself, and

provides each plugin with the results of its declared dependencies' execution.

By virtue of their declared dependencies, the set of all plugins available
to BDQC (installed on the user's computer and visible on the PYTHONPATH)
constitutes a directed acyclic graph (DAG), and a plugin that is "upstream"
in the DAG can determine how (or even whether or not) a downstream plugin runs.
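
The ordering guarantee described above can be illustrated with a small depth-first traversal. This is only a sketch of the idea, not the framework's actual scheduler: the function name execution_order, the input format (a dict mapping each plugin name to its declared dependencies) and the example.image plugin are all assumptions made for illustration, as is the dependency of filetype on extrinsic.

```python
def execution_order(dependencies):
    """Return plugin names ordered so dependencies precede dependents.

    `dependencies` is a hypothetical mapping: plugin name -> list of the
    plugin names it declares as dependencies.
    """
    order = []
    state = {}  # plugin name -> "visiting" or "done"

    def visit(name):
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            # A cycle means the plugins do not form a DAG.
            raise ValueError("dependency cycle involving " + name)
        state[name] = "visiting"
        for dep in dependencies.get(name, []):
            visit(dep)
        state[name] = "done"
        order.append(name)

    for name in sorted(dependencies):
        visit(name)
    return order


# Hypothetical dependency declarations, for illustration only.
plugins = {
    "example.image": ["bdqc.builtin.filetype"],
    "bdqc.builtin.filetype": ["bdqc.builtin.extrinsic"],
    "bdqc.builtin.extrinsic": [],
}
order = execution_order(plugins)
```

In this sketch every plugin appears in the result after all of its dependencies, so an upstream plugin's results are available by the time a downstream plugin would run.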

The framework minimizes work by only executing a plugin when required.
The figure above represents the skipping of plugins; plugin #3, for example,
was not run on file #N.

By default, the summary for file foo.txt is left in an adjacent file named
foo.txt.bdqc.
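
As a rough illustration of the naming convention, the sketch below writes a merged summary next to the analyzed file. The function write_summary and the layout of the summary (a top-level object keyed by plugin name, which is inferred from the statistic paths shown later in this document) are assumptions, not the framework's code.

```python
import json


def write_summary(path, results_by_plugin):
    """Write one JSON summary for `path` into an adjacent *.bdqc file.

    `results_by_plugin` is assumed to map plugin names to the data each
    plugin returned for this file.
    """
    summary_path = path + ".bdqc"
    with open(summary_path, "w") as fp:
        json.dump(results_by_plugin, fp, indent=2, sort_keys=True)
    return summary_path
```

Calling write_summary("foo.txt", {...}) would leave the summary in foo.txt.bdqc.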

Again, the BDQC framework does not read files' content; it only
handles filenames and paths.

Two or more files are considered incomparable when their summaries do not
contain the same set of statistics. This typically only occurs when files
are so different that different plugins ran, and it is usually the result of
insufficiently constraining the bdqc.scan run
(see the --include and --exclude options).
It can also occur when *.bdqc files from different bdqc.scan runs are
inappropriately aggregated in an independent bdqc.analysis run.

When incomparable files are detected it is impossible to determine which, if
any, are anomalous.
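
The comparability test can be pictured as a comparison of key sets. The sketch below assumes each file's summary has already been reduced to a flat mapping from statistic name to value; the function name and the example statistic names are hypothetical.

```python
def are_comparable(summaries):
    """Return True when all summaries contain the same set of statistics.

    `summaries` maps a filename to a flat dict of statistic name -> value.
    """
    key_sets = {frozenset(stats) for stats in summaries.values()}
    return len(key_sets) <= 1


# Hypothetical summaries: the log file and the image were analyzed by
# different plugins, so their statistics differ.
mixed = {
    "run.log": {"example.text/line_count": 120},
    "photo.jpg": {"example.image/width": 640},
}
uniform = {
    "a.tsv": {"example.table/column_count": 12},
    "b.tsv": {"example.table/column_count": 11},
}
```

Here are_comparable(mixed) is False, while are_comparable(uniform) is True even though the two values differ; differing values are a matter for anomaly detection, not comparability.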

Filtering

Recall that plugins exist in DAGs ("trees") defined by their dependencies.
This arrangement facilitates reuse by allowing capabilities to be
modularized and dynamically chained together at runtime.
Typically, upstream plugins are the most general-purpose (domain-blind),
and, conversely, downstream plugins are the most specialized (domain-aware).
Thus, the leaves of the plugin DAG are the most authoritative with respect
to what constitutes an anomalous file.
For this reason, only the results of "terminal plugins",
those in the "leaves" of the DAG, are included by default in Between-file Analysis. (However, this does not apply when
Between-file Analysis is launched independently of the bdqc.scan module.)

For example, one might launch BDQC on a directory tree, specifying a single
image-processing plugin to analyze image files. The image-processing plugin
might depend on a filetype plugin to identify files that it should process.
The results of the filetype plugin are not of ultimate interest; it is being
used as a filter by the image-processing plugin.
Only the results of the image-processing plugin are relevant to anomaly
detection.

Flattening

A plugin's output can be (almost) anything representable as JSON data.
In particular, the "statistic(s)" produced by a plugin need not be scalars
(numbers and strings); they can be compound data like matrices or sets.
However, only scalar statistics are (currently) used in subsequent analysis.

Since JSON includes compound types (Object and Array), it supports the
creation of hierarchical data representations.
Thus, the individual (scalar) statistics in plugins' summaries are
necessarily identified by paths in the JSON data.
For example, the following excerpt of output from the bdqc.builtin.tabular
plugin's analysis of one file shows some of the many statistics it produces:

The plugin inferred that the 3rd column in the file contains quantitative
data ("class"), and the mean value of that column was 47.38.
The process of "flattening" the JSON summaries creates one column in the
aggregate matrix from the values of the mean statistic for all files analyzed,
and that column's name is the path:

bdqc.builtin.tabular/table/columns/2/stats/mean.

These paths can be used to make heuristic analysis selective. (See
heuristic configuration (TODO)).

In summary, each *.bdqc file contains all plugins' statistics for one
analyzed file; each column in the aggregate matrix contains one statistic
(from one plugin) for all files analyzed.
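
The flattening and aggregation described in this section can be sketched as follows. This is a simplified illustration rather than BDQC's implementation: it flattens exhaustively, ignoring the conventions for Arrays described under Advanced topics. The function names and the sample data are assumptions, apart from the path bdqc.builtin.tabular/table/columns/2/stats/mean and the value 47.38 taken from the text above.

```python
def flatten(node, path=""):
    """Yield (path, value) pairs for every scalar leaf in JSON-like data."""
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, (list, tuple)):
        children = enumerate(node)
    else:
        yield path, node
        return
    for key, child in children:
        child_path = path + "/" + str(key) if path else str(key)
        yield from flatten(child, child_path)


def aggregate(summaries):
    """Build {statistic path: {filename: value}} from per-file summaries."""
    matrix = {}
    for filename, summary in summaries.items():
        for path, value in flatten(summary):
            matrix.setdefault(path, {})[filename] = value
    return matrix


def tabular_summary(third_column_mean):
    # Hypothetical, heavily trimmed stand-in for a tabular plugin's output.
    columns = [{"stats": {"mean": 1.0}},
               {"stats": {"mean": 2.0}},
               {"stats": {"mean": third_column_mean}}]
    return {"bdqc.builtin.tabular": {"table": {"columns": columns}}}


matrix = aggregate({"a.tsv": tabular_summary(47.38),
                    "b.tsv": tabular_summary(46.9)})
column = matrix["bdqc.builtin.tabular/table/columns/2/stats/mean"]
```

In this sketch, column holds one value per analyzed file (47.38 for a.tsv and 46.9 for b.tsv), which is the sense in which each path names one column of the aggregate matrix.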

Heuristic Analysis

Files that a priori are expected to be "similar" should be
effectively identical in specific, measurable ways.

For example, files that are known to contain tabular data typically should
have identical column counts. This need not always be the case, though,
which is why it is a heuristic.

In concrete terms this means that each column in the summary matrix should
contain a single value. (e.g. The bdqc.builtin.tabular/table/column_count
column in the summary matrix should contain only one value in all rows.)

If the column is not single-valued, then the analyzed files corresponding to
rows containing the minority value(s) will be reported as anomalies.
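
A minimal sketch of this single-value heuristic is shown below. It assumes a column is available as a mapping from filename to value; the function name is hypothetical, and the handling of ties (the first value to reach the highest count is treated as the majority) is a choice made for this example only.

```python
from collections import Counter


def minority_files(column):
    """Return the files whose value differs from the most common value.

    `column` maps filename -> categorical value for one statistic.
    """
    counts = Counter(column.values())
    if len(counts) <= 1:
        return []  # single-valued column: nothing to report
    majority_value, _ = counts.most_common(1)[0]
    return sorted(name for name, value in column.items()
                  if value != majority_value)


# Hypothetical column_count values for three tabular files.
column_count = {"a.tsv": 12, "b.tsv": 12, "c.tsv": 11}
anomalies = minority_files(column_count)
```

With these values, anomalies contains only c.tsv, the file holding the minority column count.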

Clearly, this heuristic cannot be applied to quantitative data since it
usually contains noise inherent in the phenomenon itself or its measurement.
However, a "relaxation" of the heuristic still applies:
a quantitative statistic should manifest central tendency and an absence
of outliers ("outliers" in the usual univariate statistical sense of the word).

For example, files containing genetic variant calls of many individuals
of the same species (one individual per file), performed on the same
sequencing platform, called by the same variant-calling algorithm, etc.
should typically be approximately the same size (in bytes).
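
This document does not specify which outlier test BDQC applies. Purely as an illustration of the relaxed heuristic, the sketch below flags values by a modified z-score based on the median and the median absolute deviation (MAD). The choice of test, the 3.5 threshold, the function name and the file sizes are all assumptions. It uses the statistics module, which requires Python 3.4 or later.

```python
import statistics


def quantitative_outliers(column, threshold=3.5):
    """Return files whose value is an outlier by modified z-score.

    `column` maps filename -> numeric value for one statistic.
    """
    values = list(column.values())
    if not values:
        return []
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the values equal the median; report any other value.
        return sorted(name for name, v in column.items() if v != center)
    return sorted(name for name, v in column.items()
                  if abs(0.6745 * (v - center) / mad) > threshold)


# Hypothetical file sizes in bytes.
sizes = {"s1.vcf": 1000, "s2.vcf": 1010, "s3.vcf": 990,
         "s4.vcf": 1005, "s5.vcf": 5000}
flagged = quantitative_outliers(sizes)
```

For these sizes only s5.vcf is flagged; the other four files cluster around the median and fall well inside the threshold.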

Finally, missing data is also treated as anomalous. A statistic that
contains a value of null (None in Python) is always considered an
anomaly.

Thus, BDQC identifies anomalous files by three different indicators:

outliers in quantitative data (the usual sense of the word "outlier")

outliers in categorical data, defined as the minority value(s) when a categorical column contains more than one value

missing values

Obviously, plugins must support this rationale by only producing
statistics that satisfy them (when files are "normal").

Finally, because heuristics are by definition not universally applicable,
plugins' output (the statistics) can be filtered so that the heuristic is
applied selectively. For example, in a particular context "normal" files
containing tabular data may actually be expected to contain variable column
counts, so this should not be reported as an anomaly.
(See heuristic configuration).

Plugins

The BDQC executable framework does not itself examine files' content.
All within-file analysis is performed by plugins.
Several plugins are included in (but are, nonetheless, distinct from) the
framework. These plugins are referred to as "Built-ins".

A plugin is simply a Python module with several required and optional
elements shown in the example below.

A plugin may provide a list called DEPENDENCIES (which may be empty). Each dependency is a fully-qualified Python package name (as a string).

A plugin may include a VERSION declaration. If present, it must be convertible to an integer (using int()).

The process function must return data built entirely of the basic Python types:

dict

list

tuple

a scalar (int, float, string)

None

These requirements do not limit what a plugin can do.
They merely define a packaging that allows the plugin to be hosted
by the framework. In particular, a plugin may invoke compiled code (e.g.
C or Fortran) and/or use arbitrary 3rd party libraries using standard
Python mechanisms.
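
As a hedged sketch of what such a module might look like: the names DEPENDENCIES, VERSION and process come from the description above, but the signature of process (a file path plus the results of the declared dependencies) and the size_bytes statistic are assumptions made for this example. The built-in plugins' source code remains the authoritative reference.

```python
import os

VERSION = 1        # optional; must be convertible with int()
DEPENDENCIES = []  # optional; fully-qualified module names as strings


def process(path, dependency_results):
    """Return summary statistics for the file at `path`.

    Assumed signature: `dependency_results` would carry the output of the
    plugins named in DEPENDENCIES (none in this example).
    """
    # Only basic Python types are returned: a dict of scalars.
    return {"size_bytes": os.path.getsize(path)}
```

This example returns a single statistic, in keeping with the advice below that a plugin should do one thing well.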

Moreover, while a plugin is free to return multiple statistics,
the Unix philosophy of "Do one thing and do it well" suggests that a
plugin should return few statistics (or even only one).
This promotes reuse, extensibility, and unit-testability of plugins, and is
part of the motivation behind the plugin architecture.

There is no provision for passing arguments to plugins from the framework
itself. Environment variables can be used when a plugin must be
parameterized.
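
One way a plugin might read such a parameter is sketched below. The variable name EXAMPLE_PLUGIN_THRESHOLD and the fallback behaviour are hypothetical; BDQC does not define them.

```python
import os


def threshold_from_environment(default=0.05):
    """Read a numeric parameter from the environment, else use `default`."""
    raw = os.environ.get("EXAMPLE_PLUGIN_THRESHOLD")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        # An unparsable value falls back to the default in this sketch.
        return default
```

Falling back to a default keeps the plugin usable when the variable is unset, which matches the goal of working "out of the box".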

Developers are advised to look at the source code of any of the built-in
plugins for examples of how to write their own. The bdqc.builtin.extrinsic
plugin is very simple; bdqc.builtin.tabular is much more complex and
demonstrates how to use C code.

The framework will incorporate the VERSION number, if present, into the plugin's output
automatically. The plugin's code need not (and should not) include it in the
returned value. The version number is used by the framework (along with other factors) to decide
whether to re-run a plugin.

A plugin should return a Python dict with the name(s) of its statistic(s) as keys.
If a plugin returns any of the other allowed types, the framework will wrap it in
a dict and its value will be associated with the key "value".
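
The wrapping rule amounts to the following, shown here as a sketch with a hypothetical function name rather than the framework's own code:

```python
def wrap_result(result):
    """Return `result` unchanged if it is a dict, else wrap it in one."""
    if isinstance(result, dict):
        return result
    return {"value": result}
```

So a plugin returning the bare number 3.14 would be recorded as if it had returned a dict holding 3.14 under the key "value".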

Built-ins

The BDQC software package includes several built-in plugins so that it is
useful "out of the box." These plugins provide very general purpose analyses
and assume nothing about the files they analyze.
Although their output is demonstrably useful on its own, the built-in plugins
may be viewed as a means to "bootstrap" more specific (more domain-aware)
analyses.

bdqc.builtin.extrinsic

Warning

Unfinished.

bdqc.builtin.filetype

Warning

Unfinished.

bdqc.builtin.tabular

Warning

Unfinished.

Advanced topics

Aggregation and "flattening" of JSON data

The JSON-formatted summaries generated by plugins are hierarchical in nature
since JSON Objects and Arrays can each contain other JSON Objects and Arrays.

The process of flattening the JSON to produce the summary matrix
need not, in general, result in columns of scalars (e.g. numbers and string
labels).
Although it is always possible to arrive at columns of scalars by flattening ("exploding")
JSON compound objects exhaustively, the process is intentionally not exhaustive by default.
Because we want plugins to be able to return compound values as results (e.g. sets,
vectors, matrices) without complicating JSON by defining special labeling
requirements, the following rules and conventions are observed:

Arrays of values of a single scalar type are not flattened (e.g. an Array with only Numbers).

Nested Arrays--Arrays that contain other Arrays of identical dimension--are also not flattened.

Arrays of the first type are interpreted as either vectors (1D matrices) or sets.
An Array is interpreted as a set when and only when it contains non-repeated
String values.
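
These conventions can be summarized in a small classifier. The sketch below is an interpretation of the rules as stated here, not BDQC's code: the function name and return labels are hypothetical, and matrices are only recognized in the two-dimensional numeric case.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def classify_array(array):
    """Label a JSON Array as "set", "vector", "matrix" or "other"."""
    if not array:
        return "other"
    if all(isinstance(item, str) for item in array):
        # Non-repeated Strings form a set; repeated Strings do not.
        return "set" if len(set(array)) == len(array) else "vector"
    if all(_is_number(item) for item in array):
        return "vector"
    if all(isinstance(item, list) for item in array):
        same_length = len({len(row) for row in array}) == 1
        numeric = all(_is_number(x) for row in array for x in row)
        if same_length and numeric:
            return "matrix"
    return "other"
```

Arrays labeled "other" (for example, an Array of Objects) are the ones that would be flattened further under the rules above.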

BDQC interprets the second use of JSON Arrays as matrices. For example, in...