The choice of attributes, or features, used to model an entity is critical for correct classification.
For the problem of sequence classification, I have built an integrated process that I refer to as feature
generation. This algorithm allows the user to construct interesting features out of basic elements and to search a large space of
potential features effectively. I have applied this approach to the problem of splice-site prediction. Predictive models for acceptor and donor sites in two different organisms, human and Arabidopsis thaliana, have achieved significant improvements in accuracy over existing state-of-the-art approaches.
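As a rough illustration of constructing features from basic elements and searching the resulting space, the sketch below builds position-specific k-mer features from toy sequences and ranks them by information gain. It is not the feature generation algorithm itself: the feature type, the scoring function, the toy data, and every name (`kmer_features`, `information_gain`, `generate_and_select`) are assumptions chosen for brevity.

```python
import math


def kmer_features(seq, k):
    """Return the set of (position, k-mer) features present in seq."""
    return {(i, seq[i:i + k]) for i in range(len(seq) - k + 1)}


def entropy(pos, neg):
    """Binary entropy (in bits) of a split with pos/neg example counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(feature, data):
    """Information gain of a binary feature over (feature_set, label) pairs."""
    n = len(data)
    pos = sum(label for _, label in data)
    with_n = sum(1 for fs, _ in data if feature in fs)
    with_pos = sum(label for fs, label in data if feature in fs)
    without_n = n - with_n
    without_pos = pos - with_pos
    conditional = (
        (with_n / n) * entropy(with_pos, with_n - with_pos)
        + (without_n / n) * entropy(without_pos, without_n - without_pos)
    )
    return entropy(pos, n - pos) - conditional


def generate_and_select(sequences, labels, max_k=2, top_n=3):
    """Generate all k-mer features up to max_k and keep the top_n by gain."""
    data = []
    candidates = set()
    for seq, label in zip(sequences, labels):
        features = set()
        for k in range(1, max_k + 1):
            features |= kmer_features(seq, k)
        candidates |= features
        data.append((features, label))
    scored = [(f, information_gain(f, data)) for f in candidates]
    # Sort by descending gain; break ties on the feature itself for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Hypothetical toy data: label 1 = site, label 0 = non-site.
sequences = ["AAGTAA", "CCGTCC", "TTGTAC", "AAGCAA", "CCTTCC", "TTACGT"]
labels = [1, 1, 1, 0, 0, 0]
top = generate_and_select(sequences, labels, max_k=2, top_n=3)
```

On this toy data the highest-ranked feature is the dinucleotide GT at offset 2, the only feature that separates the two classes perfectly. A real search space would be far larger and would need pruning rather than exhaustive scoring.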
In each case, the identified sets of features were used to discover biologically interesting motifs. An easy-to-use website, SplicePort, can be used to predict new splice sites from user-input sequences, and to browse the whole collection of features for interesting signals.
I have expanded the algorithm to construct more complex features that also capture the three-dimensional characteristics of the genomic sequence. The new features improve the predictive power of the model and contribute to the discovery of new biological properties.
I am interested in applying this approach to the discovery and prediction of other interesting signals, and in extending my research to other challenging computational problems.