INFORMATION
This program is mainly designed to work under Linux and has been tested under several distributions. Nevertheless, it should still work under Windows using the Cygwin environment (http://www.cygwin.com/).

Installation requires compiling the program. Although these instructions describe the process, elementary knowledge of a Linux-like environment may be required. Note also that the program DOES NOT have a GUI (Graphical User Interface), which may make it difficult for computer beginners.

The sections below give instructions for configuring, compiling, installing and running the program, together with simple examples.

INSTALLATION
The Boost library is required to compile the program. The R binary is optional for the program itself, but it is used in the example. Both can be installed from the distribution's repository under Linux, or downloaded as sources from the developers' websites, www.boost.org/ and www.r-project.org/ respectively, where installation instructions can also be found. Both the library and the binary should be located in the default path, so that the configuration script and the program can find them.

To install, simply type:

    ./configure; make

To obtain more information, execute:

    ./configure --help

RUNNING
Before running the application, check the help first by typing ./claim -h. All arguments may be passed either as command-line arguments or on the standard input. To check that the program works, simply type:

    ./claim < data/input.args

QUICK HELP
The main purpose of this work is to create an easy tool for biologists to manipulate different types of biological data and to help identify functional modules, i.e. sets of genes performing similar tasks in living organisms.

The program is easily extensible and configurable: users may add their own packages to process the data.

The analysis is defined by a data flow, i.e. a graph of dependencies in which the output of one or more packages is passed as the input of another. The user can define the input of a package in three different ways: <filename>, {-p <package> args}, \<package number>. In general, a program launch looks like this:

    ./claim <program options> -p <package name> <package options> -p <package name> -i {-p <package name> <options>: <options2>} -p <name> -i \1

where ':' indicates that the program should use the same package as the preceding one, but with different arguments: <options2>.
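The three input forms can be told apart by their first characters. The following Python sketch illustrates this; the function classify_input and its return labels are hypothetical names introduced here for illustration and are not part of claim, whose actual parser may behave differently.

```python
import re


def classify_input(spec):
    """Return which of the three documented input forms `spec` uses.

    Hypothetical helper for illustration only; it is not claim's parser.
    """
    spec = spec.strip()
    if spec.startswith("{") and spec.endswith("}"):
        return "nested package"      # {-p <package> args}
    if re.fullmatch(r"\\\d+", spec):
        return "package reference"   # \<package number>
    return "file"                    # <filename>


for example in ["data/AI_interactions.csv",
                "{-p ppi -i data/AI_interactions.csv}",
                r"\1"]:
    print(example, "->", classify_input(example))
```

Anything that is neither enclosed in braces nor a backslash followed by a number is treated as a file name in this sketch.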

In order to obtain more information on the available packages, one should run: ./claim --help

EXAMPLE
An example has been prepared and can be used as a reference. Besides designing new data flows, a user can reuse the application of claim described in the related paper by simply changing the names of the input and output files, provided that the data format requirements (see the end of this file) are met.

To run the example, execute:

    ./claim < data/input.args

Here "< data/input.args" means that the file "data/input.args" contains the actual configuration, and that its content is passed to the executable on the standard input.

The referenced example is based on the publication on the CLAIM software and has the following form (lines beginning with "#" are comments):

# Read Microarray from file. The delimiter in the file is tabulation (-d "\t")
# and the data should be read from data/AffyNaCl_Time-course_for_cliques.csv.
# Look into the file to see the format of the file.
-p microarray -d "\t" -i data/AffyNaCl_Time-course_for_cliques.csv } -r

# Define the third package; calculate the shortest path between the input graph
# (-i {...}) and store it as weights in the graph.
-p shortest_path -i {

# Define the fourth package. Read the graph from the file
# data/AI_interactions.csv, store it in boost adjacency_list structure (good
# for sparse graphs) and store the information on the weights in the short
# data type (2 bytes per edge). See data/AI_interactions.csv for the file
# format.
-p ppi -d ':' -g adjacency_list -t short -i data/AI_interactions.csv }

For the sake of clarity, note that different packages accept different data structures as input and deliver different structures as output. Although the internal, low-level representations differ, the user needs to be aware of only 3 structures: vector of vectors (representing the MA array), graph (represented either by an adjacency matrix or by an adjacency list) and, finally, results (sets of genes). There is also an additional type representing a set of structures: multiple. In the current set of packages, the structures are taken as input and output as follows:
* graph package takes a graph or a set of graphs as input and returns a graph as output;
* claim package takes results or a set of results as input and returns a multiple of results as output;
* kmeans package takes a graph as input and returns results as output;
* microarray package takes a vector of vectors or a multiple of them as input and returns a vector of vectors as output;
* ppi package takes a graph as input and returns a graph as output;
* shortest_path package takes a graph as input and returns a graph as output;
* corr package takes a vector of vectors as input and returns a graph as output;
* limit package takes a vector of vectors or a graph as input and returns the same structure as output;
* results package takes results as input and returns results as output.

The user should be aware of these data structures when defining the data flow. The output of a package must be compatible with the input of the package it is passed to.
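The compatibility rule can be checked mechanically before a data flow is run. The following Python sketch transcribes the package list above into a lookup table and walks a linear chain of packages. The table PACKAGES, the function check_chain and the type labels ("vectors", "graph", "results", "multiple ...") are hypothetical names chosen for illustration; they are not part of claim, and the chains in the usage lines are assumed examples rather than the flow shipped in data/input.args.

```python
# Hypothetical type table transcribed from the package list above.
# Each entry maps a package to (accepted input types, output type).
# An output of None means "same type as the input" (the limit package).
PACKAGES = {
    "graph": ({"graph", "multiple graph"}, "graph"),
    "claim": ({"results", "multiple results"}, "multiple results"),
    "kmeans": ({"graph"}, "results"),
    "microarray": ({"vectors", "multiple vectors"}, "vectors"),
    "ppi": ({"graph"}, "graph"),
    "shortest_path": ({"graph"}, "graph"),
    "corr": ({"vectors"}, "graph"),
    "limit": ({"vectors", "graph"}, None),
    "results": ({"results"}, "results"),
}


def check_chain(start_type, packages):
    """Walk a linear chain of packages and return the final output type.

    Raises ValueError at the first package that does not accept its input.
    """
    current = start_type
    for name in packages:
        accepted, produced = PACKAGES[name]
        if current not in accepted:
            raise ValueError(f"{name} does not accept '{current}' as input")
        # limit returns whatever structure it was given
        current = produced if produced is not None else current
    return current


print(check_chain("graph", ["ppi", "shortest_path"]))                     # graph
print(check_chain("vectors", ["microarray", "corr", "kmeans", "claim"]))  # multiple results
```

A chain such as check_chain("vectors", ["kmeans"]) raises ValueError, because kmeans accepts only a graph.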