pySPACE is an interface that automates data handling, data processing,
and parallelization.
The user defines Resources, how to process the data,
and in which Modality the processing is executed.
Where the relevant resources are located is specified
in a separate configuration file.

Datasets are structured in dataset_defs.
Datasets should comprise data that originates from the same source and type
(data_types),
i.e., the process that generated the data should be the same.
Several dataset_defs that originate from different sources
can be combined into a summary.
A typical example of a dataset is an EEG recording
measured from one subject in one session.
In contrast, a summary can contain the measurements of several different subjects or from different sessions.
Both dataset_defs and summaries consist
of the actual data and some metadata stored in the file metadata.yaml.
dataset_defs have a type, e.g.,
time_series or
feature_vector.
The metadata is crucial because it tells the software
which type of data is stored in which format.
The format matters as well, since different formats require
entirely different loading routines, e.g., comma-separated values without a header
versus whitespace-separated values with a header.
For the relevant types, there is a direct mapping between
dataset_defs types and
the data_types they contain.
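
As an illustration, a metadata.yaml might look roughly like the following
minimal sketch; apart from the type, the key names and values are assumptions
and depend on the dataset type and storage format:

    # sketch of a metadata.yaml for a time_series dataset (key names partly assumed)
    type: time_series            # dataset type, e.g. time_series or feature_vector
    storage_format: csv          # tells the loader which reading routine to use
    source: "subject1_session1"  # free-form provenance information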

A summary inherits the type of the datasets it comprises,
i.e., it must be homogeneous and contain only one type of dataset.
Its structure is simple:
it is normally a single folder
containing one subfolder per dataset_def.
Hence, there is no special type or implementation for summaries,
only the folder name in the configuration files and in the program.
Consequently, input_path and result_dir always refer to a summary in the code.
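
For example, a summary holding two EEG datasets could be laid out as follows
(the folder names are illustrative):

    my_eeg_summary/              <- referenced as input_path or result_dir
        subject1_session1/
            metadata.yaml
            ... data files ...
        subject2_session1/
            metadata.yaml
            ... data files ...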

Processing describes any kind of computation that transforms one kind of data
into another. In pySPACE, there exist different concepts of processing for
different levels of granularity, mostly depending on the Resources.
Predefined processing, which only concatenates other processing, is implemented
in the chains module.
The other components, where the programmer might integrate new processing
algorithms, are defined in the missions module and are
automatically integrated into the interface and documentation.

The main categories are:

operation chain:

Concatenation of operations, with summaries as input,
output and intermediate results

operation:

Transformation of a summary with several parameter settings
into a new summary

process:

A single part of a summary transformation, normally
operating with one parameter setting on one dataset
and producing a new dataset

node chain:

Concatenation of nodes to transform a dataset into a new one
(a specification sketch follows after this list)

node:

Normally transforms one component/sample of the dataset at a time

...:

External code and other elementary functions can be wrapped
or used in nodes.
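
To make the node chain and node categories more concrete: a node chain is
typically specified as a YAML list of nodes with optional parameters. The
following is only a sketch; the node names and the parameter are hypothetical
placeholders, not existing pySPACE nodes:

    # sketch of a node chain specification (node names are hypothetical)
    - node: SomeSourceNode           # reads the samples of the input dataset
    - node: SomePreprocessingNode    # transforms each sample
      parameters:
          some_parameter: 42
    - node: SomeSinkNode             # collects the transformed data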

On the next, main level, an operation
takes a summary as input
and produces a second summary as output.

Each operation
consists internally of a set of processes.
While the operations of an operation chain are dependent and thus processed
sequentially, the processes of an operation are independent
and can thus be processed in parallel.
The way an operation is divided into processes is not fixed;
for instance, an operation might have one process per dataset of the input
summary or one process per run applied to the input summary.

Types

Both processes and operations have a type.
Currently, most processes rely internally
on algorithms implemented as nodes or in Weka.
Correspondingly, there is one
NodeChainOperation/process.
There are currently two processes based on Weka.
One type is the WekaFilter-Process/Operation,
which is defined by its property to transform a summary of
datasets of type “feature_vector” into another one of the same type.
It might internally apply some feature selection, normalization etc.
The second Weka based type is the WekaClassification-Process/Operation.
This type is defined by its property of transforming
one summary of datasets of type “feature_vector” into a
PerformanceResultSummary.
Usually, it applies internally a set of classifiers to the datasets
and stores several statistics concerning their performance
(accuracy, precision, recall etc.)
as well as some properties of the input data into a result file.
Furthermore, there is also one AnalysisProcess/Operation
that analyzes the data contained in a
PerformanceResultSummary
and creates a set of plots that visualize and
evaluate the effect of various parameters on several metrics.
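
To give an impression of how such an operation is requested, a
weka_classification operation spec might minimally look like the sketch below.
Only keys described in this document are shown; everything else is omitted,
and the path is a placeholder:

    # minimal sketch of a weka_classification operation spec
    type: weka_classification                 # the operation type
    input_path: "my_feature_vector_summary"   # summary of feature_vector datasets
    # further type-specific settings (e.g. which classifiers to apply) follow here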

All processes of an operation
must have the same type.
In contrast, the operations of an operation chain typically have different
types.
The restriction is that each operation of an operation chain must be able
to process the summary produced by the preceding operation.

Operations
and operation chains can directly be started from the command line
and be configured by means of a YAML configuration file.
In contrast, processes cannot be started explicitly
but only as part of an operation or an operation chain.
Correspondingly, processes are not configured individually
but are created based on the specification of the
operations.
This specification file contains the type of the operation
(e.g. weka_classification), the input summary, and some information
that depend on the type of the operation.
For instance, a node_chain operation specifies which node_chain or
node chain template should be used,
which parameter values for the nodes of this chain should be tested,
and how many independent runs should be conducted (see example below).
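
A minimal node_chain operation spec could look roughly like the following
sketch. The type and input keys follow the description above; the remaining
key names, the template file, and the parameter placeholder are assumptions:

    # sketch of a node_chain operation spec (key names partly assumed)
    type: node_chain
    input_path: "my_time_series_summary"   # the input summary
    templates: ["my_node_chain.yaml"]      # node chain template to instantiate
    parameter_ranges:                      # parameter values to be tested
        __threshold__: [0.1, 0.5, 1.0]
    runs: 10                               # number of independent runs
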
The specification file of an operation chain consists of the input data
and the list of configuration files of the operations
that should be executed as part of the operation chain.
The specification files of operations and operation chains are located
in the specs directory (see The Specs Directory).
Examples for these files can be found in: Specification Files.
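
Based on the description above, an operation chain spec essentially names the
input summary and lists the operation spec files to run in order. A sketch,
where the key names and file names are assumptions:

    # sketch of an operation chain spec (key and file names are placeholders)
    input_path: "my_time_series_summary"
    operations:
        - "preprocessing_operation.yaml"
        - "classification_operation.yaml"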

The execution of an operation chain/operation
depends on the used backend.
The backend determines
on which computational modality the actual computation is performed.
Currently there are four different backends:

The SerialBackend is mainly meant for testing purposes.
It executes all processes sequentially on the local machine.

The MulticoreBackend also executes all processes on the local machine,
but potentially several processes in parallel, namely one per CPU core.

The MpiBackend uses MPI to distribute the processes
on a high-performance cluster.

The fourth backend is the LoadLevelerBackend.
This one distributes the processes on a cluster via the LoadLeveler software.
It requires that the operation/operation chain is started on a machine
with the software installed
and that some global file system is available
to which results can be written.
The same holds for the MpiBackend.

It should be pointed out here that the
SubflowHandler is in some
cases able to communicate with the backend and distribute subprocesses.
It is responsible for giving a
node_chain the ability for further
parallelization.
So pySPACE can support a two-level parallelization.

Though the aforementioned backend modalities are the standard, we want to mention
other possibilities for completeness.
When using certain algorithms as a library in the interactive interpreter or
in a script, no backend is used, but some parallelization is added by hand or
by using the SubflowHandler without communication with a backend.

Furthermore, the live package has its own parallelization concept to use
the same data for different
node chains.

The following graphic shows, for some operations,
which type of
dataset they take as input and which type they produce as output.
The graphic is not complete; further dataset types and operations could be added,
especially for the NodeChainOperation.
For more details see the dataset documentation.

Note

Though we may write that an operation takes a dataset type as input,
it is important to mention that the input is always a summary of
datasets of the same type (only one dataset in the extreme case)
and that the operation always produces a new summary, comprising datasets of the same
type, which may differ from the input type.