The goal of the project is to validate predicted genes by computing a confidence score and flagging possible errors or untrusted regions in the sequence. The validation results will provide evidence on how sequence curation may be done, and can be useful in improving gene prediction tools or trying new approaches. The main target users of this tool are biologists who want to validate the data obtained in their own laboratories.

In other words, given a FASTA file with a number of sequences of the same type (mRNA or protein), the program retrieves, by calling BLAST, a set of reference sequences (those most similar to the current predicted gene). For the moment, of all the information provided by BLAST, we use only the lengths of the reference and predicted sequences, in order to start with length validation of the predicted sequence.
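As a small illustration of the input side only, the sketch below reads sequence lengths directly from FASTA text. It is not the tool's code: the function name fasta_lengths is hypothetical, and in the tool itself the reference lengths come from the BLAST results rather than from a parser like this one.

```ruby
# Hypothetical helper (not part of the tool): map each FASTA identifier
# to the total length of its sequence.
def fasta_lengths(fasta_text)
  lengths = {}
  current_id = nil
  fasta_text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?('>')
      # The identifier is the first word after '>'.
      current_id = line[1..-1].split.first.to_s
      lengths[current_id] = 0
    elsif current_id
      # Sequence data may span several lines; add them up.
      lengths[current_id] += line.length
    end
  end
  lengths
end

fasta = ">seq1 predicted\nMKV\nLLA\n>seq2\nMK\n"
p fasta_lengths(fasta)
```

Sequence lines that appear before any header are ignored in this sketch.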

As we observed, the distribution of reference lengths does not fit a bell curve. We currently find the majority lengths among the reference lengths by standard hierarchical clustering. First we assume that each length belongs to a separate cluster. At each step we merge the two closest clusters, until we obtain a cluster that contains more than 50% of the reference sequences.
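The merging procedure described above can be sketched in a few lines of Ruby. This is a minimal sketch, not the tool's implementation: the names mean and majority_cluster are hypothetical, and the distance between two clusters is assumed here to be the difference between their mean lengths, which the description above does not specify.

```ruby
# Hypothetical helper: arithmetic mean of a non-empty array of numbers.
def mean(values)
  values.sum.to_f / values.size
end

# Returns the first cluster that holds more than 50% of the lengths.
# Assumption: cluster distance = difference between cluster means.
def majority_cluster(lengths)
  return [] if lengths.empty?

  # Start with one cluster per length; sorting keeps clusters ordered,
  # so the two closest clusters are always neighbours in the list.
  clusters = lengths.sort.map { |len| [len] }
  threshold = lengths.size / 2.0

  loop do
    biggest = clusters.max_by(&:size)
    return biggest if biggest.size > threshold

    # Index of the neighbouring pair with the smallest gap between means.
    i = (0...clusters.size - 1).min_by do |k|
      mean(clusters[k + 1]) - mean(clusters[k])
    end
    # Replace the pair with a single merged cluster.
    clusters[i, 2] = [clusters[i] + clusters[i + 1]]
  end
end

reference_lengths = [100, 102, 101, 500, 98, 900, 103]
p majority_cluster(reference_lengths) # => [100, 101, 102, 103]
```

The loop always terminates, because in the worst case all lengths end up in a single cluster, which trivially contains more than half of them.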

The result of the clustering can be seen in the histogram (the densest cluster is in red). The length of the predicted sequence is shown as a black vertical line.

Here are some outputs computed for predicted protein sequences from the ant Solenopsis invicta [1]:

1) ACCEPTED predicted sequence

2) ACCEPTED predicted sequence

3) UNACCEPTED predicted sequence

4) UNACCEPTED predicted sequence

You can try the application yourself by cloning the code from GitHub [2] and meeting the requirements (some Ruby gems must be installed and some paths added to CLASSPATH; see the README). More histograms can be found here [3].

The next step is to add a confidence percentage for each length validation test.