April 2015

Software

BLOG is a data mining software designed specifically for DNA Barcode analysis applications.
The aim of the system is to identify logic rules that are able to recognize the species
(also referred to as the class) of a specimen by analyzing its barcode sequence.
The standard input of the program is a FASTA format file of barcode sequences containing
the training and testing sets. The FASTA format is an internationally agreed upon format
for nucleotide sequences.
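
As an illustration of the input format only, the following minimal Python sketch reads FASTA records into a dictionary. It is not part of BLOG: the function name parse_fasta and the "specimen|Species" header layout in the example are assumptions made for this sketch, and the header convention BLOG actually expects may differ.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records as {header: sequence}.

    A header line starts with '>'; every following non-empty line is
    sequence data for that header until the next header appears.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line.upper()
    return records


# Hypothetical two-record input; the header layout is an assumption.
example = """>specimen1|SpeciesA
ACGTAC
GTTA
>specimen2|SpeciesB
ttgaca
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, sequence)
```

Lower-case bases are normalized to upper case so that later comparisons do not depend on how the file was written.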

The DNA Barcode sequence classification problem may be approached as a supervised machine learning problem in the following way:
given a reference library composed of DNA Barcode sequences from specimens of known species and a collection of unknown DNA Barcode sequences
(the query set), assign each query sequence to one of the species present in the library. This problem may be solved with the dedicated software
procedure presented in this section.

LAF combines alignment-free sequence representations based on k-mer frequency counts with logic data mining. It therefore allows the analysis of biological sequences
without the strict requirement of an alignment or of an overlapping DNA gene region. This makes it possible to classify
non-coding DNA, which is not alignable, and whole genomes, which are very hard to align because whole genome alignment is computationally hard.
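
To make the k-mer frequency representation concrete, here is a minimal Python sketch that turns a sequence into a fixed-length vector of relative k-mer frequencies. It is an illustration under stated assumptions, not LAF's implementation: the function name kmer_frequencies, the restriction to the A, C, G, T alphabet, and the normalization by the number of valid k-mers are choices made for this example.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 2, alphabet: str = "ACGT") -> Dict[str, float]:
    """Return the relative frequency of every possible k-mer in seq.

    The result always has len(alphabet) ** k entries, so sequences of
    different lengths map to vectors of the same dimension and no
    alignment is needed to compare them.
    """
    seq = seq.upper()
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Only k-mers made of alphabet symbols are counted (e.g. 'N' is ignored).
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


freqs = kmer_frequencies("ACGT", k=2)
print(len(freqs), freqs["AC"], freqs["AA"])
```

Each sequence becomes one row of a numeric table with one column per k-mer, which is the kind of input a feature selection or rule extraction step can then work on.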

GELA (Gene Expression Logic Analyzer) is a novel tool for knowledge discovery in RNA-Seq gene expression profile data.
GELA and our knowledge extraction algorithm
have been tested on the public RNA-Seq dataset of The Cancer Genome Atlas (TCGA), obtaining promising results.

The software implements a method to evaluate the similarity between next generation sequencing (NGS) reads.
This method does not rely on the alignment of the reads; it is based on the distance between the frequencies of their substrings
of fixed length (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods:
Needleman-Wunsch and BLAST.
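
The idea of an alignment-free distance between two reads can be sketched in a few lines of Python. The text above does not say which distance is used, so the Euclidean distance below is an assumption, as are the function names kmer_profile and alignment_free_distance and the default k = 3.

```python
import math
from collections import Counter
from typing import Dict


def kmer_profile(read: str, k: int) -> Dict[str, float]:
    """Relative frequencies of the k-mers that occur in read."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def alignment_free_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency profiles of two reads.

    Euclidean distance is an assumption for this sketch; other
    distances between frequency vectors could be plugged in instead.
    """
    pa, pb = kmer_profile(a.upper(), k), kmer_profile(b.upper(), k)
    # A k-mer missing from one read has frequency 0 in that read.
    keys = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(m, 0.0) - pb.get(m, 0.0)) ** 2 for m in keys))


same = alignment_free_distance("ACGTACGT", "ACGTACGT")
far = alignment_free_distance("AAAA", "CCCC", k=2)
print(same, far)
```

Identical reads are at distance 0, and reads that share no k-mer are far apart, without either read ever being aligned to the other.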

MALA is specifically designed for the analysis of microarray data. The rational-valued data representing
gene expression are discretized into a limited number of intervals for each cell of the array;
the discrete variables so obtained are then used to select a small subset of the genes that have
strong discriminating power for the classes considered. The usual DMB algorithms for feature selection
and logic formula extraction are then used to identify networks of genes - and related thresholds on
their expression level - that characterize the classes.
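
The discretization step can be illustrated with a short Python sketch that maps expression values to interval indices. The text does not state how MALA chooses its interval boundaries, so the equal-width cut points used here are an assumption, and the names equal_width_cutpoints and discretize are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_width_cutpoints(values: Sequence[float], n_intervals: int) -> List[float]:
    """Return the n_intervals - 1 thresholds that split the value range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + width * i for i in range(1, n_intervals)]


def discretize(values: Sequence[float], cutpoints: Sequence[float]) -> List[int]:
    """Replace each value with the index of the interval it falls into.

    Interval 0 is below the first threshold; a value equal to a
    threshold is assigned to the interval above it.
    """
    return [bisect_right(cutpoints, v) for v in values]


# Hypothetical expression levels of one gene across six samples.
expression = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]
cuts = equal_width_cutpoints(expression, 3)
levels = discretize(expression, cuts)
print(cuts, levels)
```

The thresholds returned alongside the discrete levels are what allows a rule of the form "gene above threshold" to be read back in terms of the original expression values.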

DMIB is a general tool for the deployment of our software for logic data analysis. It is not designed for a specific
type of application (as is the case for MALA and BLOG). Its configuration
is a little more complex but, as usual, it requires a training set of tagged elements, an indication of the type of input
features, and some details on running time and solution dimensions.

The increasing availability of large network datasets, along with progress in experimental high-throughput technologies,
has promoted the need for tools allowing easy integration of experimental data with data derived from network computational
analysis. In order to enrich experimental data with network topological parameters, the Cytoscape plugin BiNAT
(Biological Networks Analysis Tool) has been developed by Fabio Cumbo.