The goal of the project is to validate predicted genes by computing a confidence score and suggesting possible errors or untrusted regions in the sequence. The validation results will provide evidence on how sequence curation could be done, and can be useful for improving gene prediction tools or trying new approaches. The main target users of this tool are biologists who want to validate the data obtained in their own laboratories.

So far we validate four things about the predicted genes: the length (by clustering and by ranking, i.e. the rank of the prediction among all the hits), the reading frame, whether there is a duplication in the prediction, and whether the prediction is in fact a merge of multiple genes.

We adjusted our previous approach to merge detection in order to get fewer false positives. I'll briefly explain how we do the validation now:
- we plot a 2D graph, using the start and end offsets of the matched regions in the prediction as the two axes
- we draw a line obtained by linear regression
- a gene merge is present if the slope of the line is between 0.4 and 1.2
- these thresholds were chosen empirically and will be adjusted after analyzing a larger amount of data.

Here is a simplified drawing explaining this:
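
The same slope check can also be sketched in a few lines of code. This is only an illustration in Python, not the tool's actual implementation: the function names, the inclusive bounds, and the choice of putting start offsets on the x axis and end offsets on the y axis are assumptions made for the example.

```python
from typing import List, Tuple

# Thresholds quoted in the post; they were chosen empirically and may change.
SLOPE_MIN = 0.4
SLOPE_MAX = 1.2


def regression_slope(points: List[Tuple[float, float]]) -> float:
    """Ordinary least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def looks_like_merge(hits: List[Tuple[float, float]]) -> bool:
    """Flag a possible gene merge from (start, end) offsets of matched regions.

    Assumption: start offsets go on the x axis, end offsets on the y axis,
    and both threshold bounds are inclusive.
    """
    slope = regression_slope(hits)
    return SLOPE_MIN <= slope <= SLOPE_MAX


if __name__ == "__main__":
    # Hypothetical hits: one group matches the first part of the prediction,
    # another group matches the second part.
    two_groups = [(10, 300), (20, 310), (15, 290),
                  (400, 700), (410, 690), (420, 710)]
    # Hypothetical hits that all span roughly the whole prediction.
    one_group = [(5, 700), (10, 690), (20, 700), (15, 695), (8, 705)]
    print(looks_like_merge(two_groups))
    print(looks_like_merge(one_group))
```

The intuition is that when the hits fall into two separate groups along the prediction, start and end offsets grow together and the fitted line is steep, whereas hits that all cover the same region give a nearly flat line.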