I/O

The I/O folder provides basic activities for reading different types of data from hard disk and writing them back to it. Each activity has an input port that specifies the files to read from or the destination to write to. File writers that can handle iterative incoming data can be configured to write either one file per iteration or a single file for all iterations. Every file writer provides a list of all written files.
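The two writer modes can be illustrated with a minimal sketch. It is not the plugin's implementation: the function name write_iterations, the file naming scheme, and the ".txt" extension are assumptions made only for this example.

```python
import tempfile
from pathlib import Path


def write_iterations(chunks, out_dir, base_name, one_file_per_iteration=True):
    """Write text chunks to disk and return the list of written files.

    Hypothetical helper: the naming scheme is an assumption of this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    if one_file_per_iteration:
        # One numbered file for every incoming iteration.
        for index, chunk in enumerate(chunks, start=1):
            path = out_dir / f"{base_name}_{index}.txt"
            path.write_text(chunk, encoding="utf-8")
            written.append(path)
    else:
        # A single file that collects the data of all iterations.
        path = out_dir / f"{base_name}.txt"
        with path.open("w", encoding="utf-8") as handle:
            for chunk in chunks:
                handle.write(chunk)
        written.append(path)
    return written


# Usage: write two iterations into a temporary directory in both modes.
demo_dir = tempfile.mkdtemp()
per_iteration = write_iterations(["a\n", "b\n"], demo_dir, "out", True)
single = write_iterations(["a\n", "b\n"], demo_dir, "all", False)
```

In both modes the return value plays the role of the "list of all written files" that every file writer provides.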

Draft of the SD-file reader and SD-file writer activities and the configuration panel of the writer activity. Note: The "One file per iteration" checkbox determines whether one file is written per iteration or a single file for all iterations.

ARFF File Reader

Reads ARFF files from hard disk. Sets the last attribute as the class attribute when its name is "Class".
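The class attribute rule can be sketched as follows. This is a simplified illustration that only inspects the ARFF header. The function name find_class_index is hypothetical, the name comparison is assumed to be case-sensitive, and quoted attribute names that contain spaces are not handled.

```python
def find_class_index(arff_text):
    """Return the index of the class attribute, or None.

    The last attribute is treated as the class attribute only when
    its name is exactly "Class" (case sensitivity is an assumption).
    """
    names = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break  # the header ends where the data section starts
        if lowered.startswith("@attribute"):
            parts = stripped.split(None, 2)
            if len(parts) >= 2:
                names.append(parts[1].strip("'\""))
    if names and names[-1] == "Class":
        return len(names) - 1
    return None


# Usage with a small in-memory ARFF header.
demo_header = (
    "@relation demo\n"
    "@attribute id string\n"
    "@attribute logP numeric\n"
    "@attribute Class numeric\n"
    "@data\n"
)
class_index = find_class_index(demo_header)
```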

SMILES File Writer

Text File Writer

XRFF File Reader

XRFF File Writer

Iterative I/O

The iterative I/O folder provides the ability to handle huge files by reading them iteratively. These activities are configured like the basic input activities. Each has an input port that specifies the files to read the data from. Additionally, you can adjust the number of elements read per iteration through the second port.

Draft of the iterative SD-File reader activity.

Note: To avoid out-of-memory errors, uncheck the In-memory storage checkbox. This is only necessary if the data caching feature of the plugin is disabled.
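Reading a fixed number of elements per iteration can be sketched as a generator over SD records. This is only an illustration of the idea, not the activity's code: the function name iter_sd_records is hypothetical, and the sketch relies solely on the "$$$$" line that terminates each record in an SD file.

```python
import io


def iter_sd_records(handle, records_per_iteration):
    """Yield lists of SD records, at most records_per_iteration per list."""
    if records_per_iteration < 1:
        raise ValueError("records_per_iteration must be at least 1")
    batch, current = [], []
    for line in handle:
        current.append(line)
        if line.strip() == "$$$$":  # SD record delimiter
            batch.append("".join(current))
            current = []
            if len(batch) == records_per_iteration:
                yield batch
                batch = []
    if any(text.strip() for text in current):
        # Keep a trailing record that lacks the final delimiter.
        batch.append("".join(current))
    if batch:
        yield batch


# Usage: three placeholder records, read two per iteration.
demo_file = io.StringIO("mol1\n$$$$\nmol2\n$$$$\nmol3\n$$$$\n")
batches = list(iter_sd_records(demo_file, 2))
```

Because only one batch is held at a time, memory use depends on the number of elements per iteration rather than on the file size.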

Consume State

This activity is needed for the iterative loop reader activities. You have to connect an activity to the "state" output port of the loop activities because otherwise the port will not be evaluated. For an example, have a look at the Loop SDFile Reader activity.

DataCollectorAcceptor/DataCollectorEmitter

These two activities can only be used in combination with each other. The acceptor activity caches all the data coming from an iterative source. Afterwards, the emitter activity reads the cached data at once and provides the whole data in a single invocation to the subsequent workflow.

Example workflow to show the usage of the DataCollectorAcceptor and DataCollectorEmitter activities.
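The accept-then-emit pattern can be sketched with a small in-memory cache. The class and method names below are hypothetical and only mirror the roles of the two activities; how the plugin actually stores the cached data is not described here.

```python
class DataCollector:
    """Caches data from many iterations and hands it over in one call."""

    def __init__(self):
        self._cache = []

    def accept(self, items):
        # Role of the acceptor: called once per iteration.
        self._cache.extend(items)

    def emit(self):
        # Role of the emitter: called once, returns everything collected.
        data, self._cache = self._cache, []
        return data


# Usage: three iterations are collected and emitted as one list.
collector = DataCollector()
for iteration in (["a", "b"], ["c"], ["d", "e"]):
    collector.accept(iteration)
collected = collector.emit()
```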

Iterative SDFile Reader

Loop SDFile Reader

Iterative file reader for MDL SDFiles. The difference from the Iterative RXN/SD File readers is that the whole nested workflow is executed before the next iteration step starts.

Example workflow to show the usage of the Loop SDFile Reader Activity.

To configure the loop condition, go to the "1 Details" tab and press "2 Add looping" in the advanced menu. Afterwards, set the looping condition under point 3 and confirm by pressing the "4 OK" button.

Loop RXN File Reader

Iterative file reader for MDL RXN files. The difference from the Iterative RXN/SD File readers is that the whole nested workflow is executed before the next iteration step starts. The configuration process is the same as for the Loop SDFile Reader activity.

String Converter

These activities are used to convert string data to the data format used within the CDK-Taverna 2.0 project and back. The activities do not need to be configured.

Write Molecule As PNG

Write Molecule As PDF

Write Reaction As PDF

QSAR

This folder contains activities for calculating and processing QSAR descriptor results. Example workflows can be found here. Note: It is strongly recommended to write one file per iteration during iterative workflows. When the workflow is finished, merge the CSVs with the Merge CSVs To QSAR Vector activity.

Calculate QSAR Vector Statistics

Evaluates some statistics about the calculated QSAR descriptor values and shows the ratio between calculated and not calculated QSAR descriptor values.
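As a rough illustration of the reported ratio, the sketch below computes the fraction of descriptor values that were calculated. The data layout (a dictionary of molecule IDs mapping to descriptor dictionaries, with None or NaN marking a value that was not calculated) and the function name are assumptions of this example, not the plugin's QSAR Vector format.

```python
import math


def calculated_fraction(qsar_vector):
    """Return the share of descriptor values that were calculated."""
    total = 0
    calculated = 0
    for descriptors in qsar_vector.values():
        for value in descriptors.values():
            total += 1
            # None and NaN both count as "not calculated" in this sketch.
            if value is not None and not math.isnan(value):
                calculated += 1
    return calculated / total if total else 0.0


# Usage: one of four values is missing.
demo_vector = {
    "mol-1": {"XLogP": 1.5, "TPSA": None},
    "mol-2": {"XLogP": 0.5, "TPSA": 20.0},
}
fraction = calculated_fraction(demo_vector)
```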

CSV To QSAR Vector

Merge CSVs To QSAR Vector

Multi-CSV file reader that merges the different CSV files into one QSAR Vector.
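Merging several CSV files can be sketched as below. The CSV layout is an assumption: a header row, the molecule ID in the first column, and one descriptor per remaining column. The function name merge_csvs is hypothetical, and values are kept as strings for simplicity.

```python
import csv
import io


def merge_csvs(handles):
    """Merge CSV files into one mapping: molecule ID -> descriptor values."""
    merged = {}
    for handle in handles:
        reader = csv.DictReader(handle)
        if not reader.fieldnames:
            continue  # skip empty files
        id_column = reader.fieldnames[0]
        for row in reader:
            values = {k: v for k, v in row.items() if k != id_column}
            # Rows with the same ID from different files are combined.
            merged.setdefault(row[id_column], {}).update(values)
    return merged


# Usage: two per-iteration files with the same columns.
first = io.StringIO("ID,XLogP\nm1,1.2\n")
second = io.StringIO("ID,XLogP\nm2,0.7\n")
merged_vector = merge_csvs([first, second])
```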

Curate QSAR Vector

Removes descriptor values that could not be calculated from the given QSAR Vector and removes columns whose minimum and maximum values do not differ. You can choose between three curation methods:

Dynamic curation between rows and columns: Tries to maximize the number of remaining descriptor values. This curation type is intermediate between the column-only and the row-only curation types.

Curate only columns: Discards the columns that contain descriptors which were not calculated.

Curate only rows: Discards the rows (molecules) that contain descriptors which were not calculated.

Additionally, you can choose whether columns whose descriptor values do not differ between minimum and maximum should be discarded.
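The column-only and row-only curation types and the optional removal of constant columns can be sketched as follows. The data layout (molecule ID mapping to a dictionary of descriptor values, with None for a value that was not calculated) and all function names are assumptions. The dynamic curation type is not shown because its strategy is not described here.

```python
def curate_columns(vector):
    """Discard every column that contains a missing (None) value."""
    incomplete = {name for row in vector.values()
                  for name, value in row.items() if value is None}
    return {mol_id: {n: v for n, v in row.items() if n not in incomplete}
            for mol_id, row in vector.items()}


def curate_rows(vector):
    """Discard every row (molecule) that contains a missing value."""
    return {mol_id: dict(row) for mol_id, row in vector.items()
            if all(value is not None for value in row.values())}


def drop_constant_columns(vector):
    """Discard columns whose minimum and maximum values are equal."""
    columns = {}
    for row in vector.values():
        for name, value in row.items():
            if value is not None:
                columns.setdefault(name, []).append(value)
    constant = {n for n, vals in columns.items() if min(vals) == max(vals)}
    return {mol_id: {n: v for n, v in row.items() if n not in constant}
            for mol_id, row in vector.items()}


# Usage: "b" has a missing value, "c" never varies.
demo = {
    "m1": {"a": 1.0, "b": None, "c": 5.0},
    "m2": {"a": 2.0, "b": 3.0, "c": 5.0},
}
by_columns = curate_columns(demo)
by_rows = curate_rows(demo)
informative = drop_constant_columns(by_columns)
```

Column curation keeps all molecules but loses descriptors, while row curation keeps all descriptors but loses molecules, which is the trade-off the dynamic type tries to balance.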

Configuration panel of the Curate QSAR Vector Activity.

Merge QSAR Vectors

Merges the given QSAR Vectors into one resulting QSAR Vector. In the process, a minimum subset of the existing descriptors is created. The number of QSAR Vectors to merge is configurable.
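One possible reading of the "minimum subset" is that only the descriptors present in every input vector are kept. The sketch below follows that interpretation, which is an assumption, as are the data layout (molecule ID mapping to descriptor dictionaries) and the function name.

```python
def merge_qsar_vectors(vectors):
    """Merge vectors, keeping only descriptors common to all rows."""
    common = None
    for vector in vectors:
        for row in vector.values():
            names = set(row)
            common = names if common is None else common & names
    common = common or set()
    merged = {}
    for vector in vectors:
        for mol_id, row in vector.items():
            merged[mol_id] = {name: row[name] for name in sorted(common)}
    return merged


# Usage: only "XLogP" is present in both vectors.
vector_a = {"m1": {"XLogP": 1.0, "TPSA": 20.0}}
vector_b = {"m2": {"XLogP": 2.0, "Weight": 180.2}}
merged_qsar = merge_qsar_vectors([vector_a, vector_b])
```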

QSAR Descriptor

This activity combines the power of all QSAR descriptors in one single activity. You can choose all available descriptors to be calculated at once.

Configuration panel of the QSAR Descriptor Activity.

QSAR Descriptor Threaded (Experimental)

This activity is based on the QSAR Descriptor activity but can use multithreading for the QSAR descriptor calculations. You can set the number of threads in the configuration panel. Note: It is tagged as experimental because the CDK is not explicitly thread-safe.
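The general idea of distributing descriptor calculations over a configurable number of threads can be sketched with a thread pool. This is not the activity's implementation. The function name calculate_threaded is hypothetical, and the built-in len is used as a stand-in for a real descriptor calculation.

```python
from concurrent.futures import ThreadPoolExecutor


def calculate_threaded(molecules, descriptor, threads=4):
    """Apply descriptor to every molecule using a pool of worker threads.

    The descriptor callable must be safe to call from several threads,
    which is exactly the property the note above says is not guaranteed.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns the results in the order of the input molecules.
        return list(pool.map(descriptor, molecules))


# Usage: string length stands in for a descriptor value.
demo_results = calculate_threaded(["CCO", "C", "CCCC"], len, threads=2)
```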

QSAR Vector Generator

Extracts the QSAR descriptor values from structures and generates a QSAR Vector.

Protein QSAR Descriptors

ART-2a Clustering

This folder provides activities for the classification of input data. The algorithm used is the ART-2a classification algorithm. Example workflows can be found here.

ART-2a Clusterer

This activity implements the ART-2a classification algorithm. There are six parameters to configure:

Number of classifications: Determines the number of classifications within the interval between the lower and upper vigilance parameter limits.

The upper vigilance limit. The vigilance parameter determines the number of resulting classes. The higher the vigilance parameter, the higher the number of resulting classes.

The lower vigilance limit.

The maximum classification time.

Scale fingerprint items to values between 0.0 and 1.0.

The output directory of the classification result files.

Configuration panel of the ART-2a Clusterer Activity.
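How the number of classifications, the two vigilance limits, and the scaling option might interact can be sketched as below. Both functions are hypothetical: evenly spaced vigilance values and min-max scaling per fingerprint are assumptions of this example, not documented behaviour of the activity.

```python
def vigilance_values(lower, upper, count):
    """Return count evenly spaced vigilance values from lower to upper."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [lower]
    step = (upper - lower) / (count - 1)
    return [lower + index * step for index in range(count)]


def scale_to_unit_interval(values):
    """Scale fingerprint items linearly to values between 0.0 and 1.0."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [0.0 for _ in values]  # no spread, so nothing to scale
    return [(v - smallest) / (largest - smallest) for v in values]


# Usage: three classifications between the limits 0.1 and 0.5.
demo_vigilance = vigilance_values(0.1, 0.5, 3)
demo_scaled = scale_to_unit_interval([2.0, 4.0, 6.0])
```

Each vigilance value would then correspond to one classification run, with higher values expected to produce more classes.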

ART-2a result As PDF

Visualizes the results of an ART-2a clustering in a PDF file. The output directory is the same as for the ART-2a Clusterer activity results.

ART-2a result As PDF File Reader

Has the same functionality as the ART-2a result As PDF activity, but the ART-2a results can be chosen directly from hard disk. The output directory is the directory of the input files.

ART-2a Result Considering Different Origins As PDF

This activity visualizes the fraction of each origin in the resulting classes so that it is possible to determine the similarity between different compound sources. Equal fractions within the classes indicate a high similarity between the sources. The output directory is the same as for the ART-2a Clusterer activity results.
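The fraction of each origin per class can be computed as in the following sketch. The input format, a sequence of (class ID, origin) pairs, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def origin_fractions(assignments):
    """Return, for every class, the fraction contributed by each origin."""
    counts = defaultdict(Counter)
    for class_id, origin in assignments:
        counts[class_id][origin] += 1
    fractions = {}
    for class_id, counter in counts.items():
        size = sum(counter.values())
        fractions[class_id] = {o: n / size for o, n in counter.items()}
    return fractions


# Usage: class 0 is shared equally, class 1 holds only source A.
demo_assignments = [(0, "A"), (0, "B"), (1, "A")]
demo_fractions = origin_fractions(demo_assignments)
```

In this example class 0 suggests similar sources, while class 1 contains compounds from a single source only.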

Weka Clustering

This folder provides activities for clustering and result visualisation. It uses the Weka Machine Learning library. All result activities use the same output directory as the Weka clustering worker. Example workflows can be found here.

Extract Clustering Result As CSV

Extract Clustering Result As PDF

Clustering Result Considering Different Origins As PDF

Visualizes the clustering result from different origins and shows the ratio of the sources in the different clusters. The file is saved as a PDF file. The activity uses the same output directory as the Weka clustering worker.

Create Weka Regression Dataset

This activity creates a regression dataset from a basic Weka dataset, e.g. one provided by the Create Weka Dataset From QSAR Vector activity. The first attribute of the dataset has to be a UUID followed by n numeric data attributes. The last field is the class field and has to be numeric and named "Class". The ID Class CSV File consists of two columns. The first column contains the UUID and the second column the numeric class value.

The Create Weka Regression Dataset activity.
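Joining the dataset rows with the ID Class CSV File can be sketched as a lookup by UUID. The function name, the in-memory row format, the skipping of a possible header line, and the dropping of rows without a class value are all assumptions of this example.

```python
import csv
import io


def attach_class_values(rows, id_class_csv):
    """Append the numeric class value to every row with a known UUID.

    rows: list of (uuid, [numeric values]) tuples.
    id_class_csv: file-like object with two columns, UUID and class value.
    """
    class_by_id = {}
    for record in csv.reader(id_class_csv):
        if len(record) < 2:
            continue
        try:
            class_by_id[record[0]] = float(record[1])
        except ValueError:
            continue  # skips a header line or a non-numeric class value
    return [(uuid, values + [class_by_id[uuid]])
            for uuid, values in rows if uuid in class_by_id]


# Usage: the class value becomes the last field of each row.
demo_rows = [("id-1", [0.5, 1.5]), ("id-2", [2.0, 3.0])]
demo_csv = io.StringIO("uuid,class\nid-1,7.25\nid-2,8.5\n")
regression_rows = attach_class_values(demo_rows, demo_csv)
```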

Split Dataset Into Train-/Testset

This activity splits a given dataset into a trainset and a testset. There are three algorithms available:

Single Global Max: Also uses the simple KMeans clusterer to assemble the sets. Afterwards, the algorithm tries to optimize the sets by switching the worst datapoint from the testset into the trainset. This step is performed for a certain number of iterations. Within every iteration a classification step is performed to determine the worst described datapoint in the testset. Blacklisting should usually be enabled because the algorithm is very prone to oscillation and blacklisting suppresses this behaviour.

The configuration panel for the Split Dataset Into Train-/Testset activity.

GA Attribute Selection

The activity uses a genetic algorithm to find an optimized set of attributes.

The configuration panel for the GA Attribute Selection activity.

Heuristic Attribute Selection

The activity tries to sort the attributes according to their relevance for the underlying machine learning problem. The algorithm evaluates the performance of every attribute and leaves out the worst. This step is repeated until only one attribute remains.
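The repeated "leave out the worst" step resembles a backward elimination, which can be sketched as follows. This is an illustration under assumptions, not the activity's algorithm: the function name is hypothetical, the caller supplies the score function (higher is better), and the worst attribute is taken to be the one whose removal leaves the best-scoring subset.

```python
def rank_attributes(attributes, score):
    """Return the attributes ordered from least to most relevant.

    score(subset) must return a number, where higher means better.
    """
    remaining = list(attributes)
    removed = []
    while len(remaining) > 1:
        # The worst attribute is the one we lose the least by dropping.
        worst = max(
            remaining,
            key=lambda a: score([x for x in remaining if x != a]),
        )
        remaining.remove(worst)
        removed.append(worst)
    return removed + remaining


# Usage with a toy score: the sum of fixed per-attribute weights.
weights = {"a": 3, "b": 1, "c": 2}
ranking = rank_attributes(["a", "b", "c"], lambda s: sum(weights[x] for x in s))
```

In practice the score would come from evaluating a machine learning model on the given attribute subset, which makes every elimination round as expensive as one model evaluation per remaining attribute.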

Evaluate Regression Results as PDF

This activity produces a PDF containing different plots and statistics characterising the used machine learning model.