Using enhancing signals to improve specificity of Ab initio splice site sensors


ORGANIZING INSTITUTIONS

Nebraska EPSCoR

EPSCoR is an acronym for the Experimental Program to Stimulate Competitive Research that was initiated by the National Science Foundation (NSF) in 1980 to address concerns of the U.S. Congress regarding the distribution of federal funds supporting research and development. Nebraska EPSCoR is a statewide organization established to pursue research grant opportunities of the federal agency EPSCoR programs.

Nebraska Informatics Center for the Life Sciences

The Nebraska Informatics Center for the Life Sciences (NICLS) facilitates the integration of the biocomputing/informatics disciplines with the life sciences and coordinates cross-campus and state-wide efforts in bioinformatics, chemoinformatics, pharmacoinformatics, computational chemistry, and computational biology.

Nebraska Biomedical Research Infrastructure Network

The Nebraska Biomedical Research Infrastructure Network (BRIN) project is designed to enhance the competitiveness of biomedical research in Nebraska by developing the human and technological resources essential for cutting edge research in functional genomics. The foundation of the project is collaboration between seven undergraduate institutions, two community colleges and the three Ph.D. granting institutions located throughout the State.

UNMC Eppley Cancer Center

The mission of the UNMC Eppley Cancer Center is to coordinate basic research and clinical cancer research, patient care and educational programs and to facilitate application of new knowledge about the etiology, diagnosis, treatment and prevention of cancer and to improve health and quality of life.

RESEARCH TALKS

1

Eugene V. Koonin, Keynote Speaker

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda MD 20892-6510

A comprehensive evolutionary classification of genes is a must for making sense of genome sequences. By comparing the protein sequences encoded in 7 completely sequenced eukaryotic genomes, 6162 clusters of probable orthologs (euKaryotic Orthologous Groups, or KOGs), which include between 50 and 80% of all gene products of the respective organisms, were identified. By combining the most likely topology of the eukaryotic crown group phylogenetic tree and the phyletic patterns of the KOGs, the most parsimonious scenario of eukaryotic genome evolution and the minimal ancestral gene sets for ancestral eukaryotic forms were reconstructed. The reconstructed gene set of the last common ancestor of the eukaryotic crown group consists of 3365 KOGs and is substantially enriched in proteins involved in information processing and central metabolism; the reconstructed gene set for the last common ancestor of animals includes 4898 KOGs, many of which are implicated in signal transduction. In an attempt to reveal the major trends in the evolution of eukaryotic gene structure, intron positions were compared for 684 KOGs from 8 complete genomes of animals, plants, fungi, and protists, and parsimonious scenarios were constructed for evolution of exon-intron structure for the respective genes. Remarkable conservation of intron position through >1.5 billion years of evolution was revealed, with one third of the introns in the malaria parasite Plasmodium falciparum shared with at least one crown-group eukaryote. Paradoxically, humans share many more introns with the plant Arabidopsis thaliana than with fly or nematode. The evolutionary scenario inferred from this data holds that the common ancestor of Plasmodium and the crown group and especially the common ancestor of animals, plants and fungi had numerous introns. Most of these ancestral introns, which are retained in the genomes of vertebrates and plants, have been lost in fungi, nematodes and arthropods, and probably Plasmodium. Comparison of various features of ancient and younger introns starts shedding light on probable mechanisms of intron insertion. A strong positive correlation was noticed between the loss and gain of genes and introns in different eukaryotic lineages, pointing to the existence of distinct, lineage-specific trends toward genome shrinkage or expansion.

POWERPOINT PRESENTATION

Contact Information

Phone: (301) 435-5913

Fax: (301) 435-7794

koonin@ncbi.nlm.nih.gov

http://www.ncbi.nlm.nih.gov/CBBresearch/Koonin/

2

Lyle Middendorf

Sr. Vice President of Research & Development and CTO

LI-COR Biosciences

4308 Progressive Avenue

PO Box 4000

Lincoln, NE 68504

Title of Talk

Moore’s Law of Genomic Information

Abstract

Bioinformatics progresses through stages of increasing complexity that deliver data, information, knowledge, and wisdom. Mining of genome sequence data leads to the biological interpretation of that data giving rise to a genomics information set, which, when combined with other information sets (e.g. proteomics; cell signaling), provides a systems biology knowledge base that has the potential to deliver the wisdom associated with predictive and preventative healthcare. These stages of progression require interfaces that must match both technologically and economically in order to achieve successful implementation of the bioinformatics value chain. Within each stage, fundamental principles which constrain the evolution of complexity can be assessed by a “Moore’s Law” metric. To illustrate these constraints, an evaluation of the prospects for achieving a $1000 genome will be presented from both a throughput and a cost per base perspective.

POWERPOINT PRESENTATION NOT AVAILABLE

Information about Li-Cor

LI-COR Biosciences is a leader in the design and manufacture of instrument systems for plant biology, biotechnology, and environmental research. LI-COR instruments for photosynthesis, carbon dioxide analysis, and light measurement are recognized world-wide for standard-setting innovation in plant science research and environmental monitoring. The company pioneered the development of infrared fluorescence labeling and detection systems for imaging, DNA sequencing, genotyping, and AFLP® for genomic research and discovery. Founded in 1971, the privately held company is based in Lincoln, Nebraska, with subsidiaries near Frankfurt, Germany and in Cambridge, UK. LI-COR systems are used in over 100 countries and are supported by a global network of distributors.

http://bio.licor.com/CompInfo.htm

Parkinson's Disease is a chronic, progressive neurodegenerative disorder of unknown etiology that is genetically and clinically heterogeneous. One research challenge is to identify biomarkers to aid in clinical categorization and to serve in directing more optimized therapeutic regimens. To address this challenge, we are employing transcriptional profiling using microarray analyses to identify constellations of genes whose expression patterns serve as biomarkers and molecular signatures for Parkinson's Disease. To minimize invasiveness, we are using RNA templates isolated from freshly drawn blood or lymphoblastoid cell lines. To limit some aspects of the genetic and phenotypic variation, we are initially focusing on analyzing expression profiles in individuals with parkinsonism who have genetically distinct familial forms of Parkinson's disease. We will then compare these analyses with analyses in a cohort of individuals with sporadic Parkinson's disease, and relate gene expression profiles to the age of symptom onset and the severity of disease symptoms at the time of blood draw. We anticipate that these analyses will allow us to identify a constellation of genes whose expression pattern will serve as a biomarker for Parkinson's disease, and that they will allow for the identification of stage-specific and disease-type specific biomarkers.

POWER POINT PRESENTATION

Contact Information

E-mail: bchase@mail.unomaha.edu

Phone: (402) 554-2586

Fax: (402) 554-3532

http://www.unomaha.edu/~wwwbio/chase.html

6

Stephen Scott

Department of Computer Science and Engineering

University of Nebraska-Lincoln, Lincoln, NE 68588-0115

Title of Talk

Machine Learning in Bioinformatics

Abstract

Building machines that learn from experience is an important research goal of artificial intelligence (AI). The field of machine learning is a subarea of AI that is concerned with the question of how to construct computer programs that automatically improve with experience. In recent years many successful machine learning applications have been developed, including data mining programs that learn to detect fraudulent credit card transactions, information-filtering systems that learn users' reading preferences, and numerous approaches to biological sequence analysis, phylogenetic inference, and other applications in bioinformatics. We will introduce some of the fundamental concepts in machine learning and overview various applications of machine learning in bioinformatics.

POWERPOINT PRESENTATION

Contact Information

E-mail: sscott@cse.unl.edu

Phone: (402) 472-6994

Fax: (402) 472-7767

http://www.cse.unl.edu/~sscott

7

Hong Jiang

Department of Computer Science and Engineering,

University of Nebraska-Lincoln; Lincoln, NE 68588-0115

Title of Talk

A Case Study of Parallel I/O for Biological Sequence Search on Linux Clusters

Abstract

In this paper we analyze the I/O access patterns of a widely-used biological sequence search tool and implement two variations that employ parallel-I/O for data access based on PVFS (Parallel Virtual File System) and CEFT-PVFS (Cost-Effective Fault-Tolerant PVFS). Experiments show that the two variations outperform the original tool when equal or even fewer storage devices are used in the former. It is also found that although the performance of the two variations improves consistently when initially increasing the number of servers, this performance gain from parallel I/O becomes insignificant with further increase in server number.

We examine the effectiveness of two read performance optimization techniques in CEFT-PVFS by using this tool as a benchmark. Performance results indicate: (1) Doubling the degree of parallelism boosts the read performance to approach that of PVFS; (2) Skipping hot-spots can substantially improve the I/O performance when the load on data servers is highly imbalanced. The I/O resource contention due to the sharing of server nodes by multiple applications in a cluster has been shown to degrade the performance of the original tool and the variation based on PVFS by up to 10-fold and 21-fold, respectively, whereas the variation based on CEFT-PVFS suffered only a two-fold performance degradation.

POWER POINT PRESENTATION

Contact Information

E-mail: jiang@cse.unl.edu

Phone: (402) 472-6747

Fax: (402) 472-7767

http://cse.unl.edu/~jiang/

8

Alex Nicoll

Associate Director for Technology

Nebraska University Consortium on Information Assurance

College of Information Science and Technology

University of Nebraska at Omaha, Omaha, NE 68182

Title of Talk

User Friendly Cluster Computing

Abstract

Computing clusters have become more and more prevalent in high performance computing centers across the globe. However, the users of these clusters are more often scientific researchers, with little or no computing experience, than expert programmers. Therefore it is important to ensure that a cluster computing resource is accessible to all users, not just the experts. Covered in this talk will be an overview of how we at UNO have addressed that need, and a roadmap for future architecture improvements. Technical implementation details will be included.

POWERPOINT PRESENTATION

Contact Information

E-mail: anicoll@unomaha.edu

Phone: (402) 554-2060

9

Steven Hinrichs

Director, University of Nebraska Center for Biosecurity

University of Nebraska Medical Center, Omaha, NE 68198-6495

Title of Talk

Challenges and Opportunities in Bioinformatics and Homeland Security

Abstract

The threats posed by bioterrorism raise many new challenges to the US and its citizens. The fields of Information technology and Bioinformatics, from data exchange engines to computational analysis of DNA sequences, have important roles to play in meeting these challenges. This presentation will describe the opportunities presented by achieving a greater level of data exchange between both federal and state agencies with responsibility for emergency preparedness and by developing collaborative research programs between biologists and computer scientists. The dual use application of information technology solutions for not only bioterrorism preparedness but also public health in general will be discussed, including the ability to detect the outbreak of new infectious illnesses and for computational approaches to the rapid identification of biological materials of unknown origin.

POWERPOINT PRESENTATION

Contact Information

Email: shinrich@unmc.edu

Phone: (402) 559-4116

Fax: (402) 559-4077

PANEL: BIOINFORMATICS EDUCATION IN NEBRASKA

Hesham Ali, UNO, Moderator

University of Nebraska at Omaha



New Proposed Undergraduate Program in Bioinformatics at UNO

Presented by Hesham Ali



Graduate Degrees in Bioinformatics Through the MS in CS and the Ph.D. in IT Programs at UNO

Presented by Hesham Ali

POWERPOINT PRESENTATION

Contact Information for Hesham Ali

hesham@unomaha.edu

Phone: (402) 554-3623

Fax: (402) 554-3284

http://www.cs.unomaha.edu/fac-staff/hali.html

University of Nebraska Medical Center



The Nebraska Biomedical Infrastructure Network

Presented by William Chaney, UNMC

POWERPOINT PRESENTATION

Contact Information for William Chaney

E-mail: wchaney@unmc.edu

Phone: (402) 559-6657

Fax: (402) 559-6650

http://www.unmc.edu/Biochemistry/faculty/chaney.html



Bioinformatics Specialty Track, Department of Pathology-Microbiology (UNMC), in Conjunction with the College of Information Science and Technology (UNO)

A Genetic Algorithm for Simplifying Amino Acid Alphabets and Predicting Protein-Protein Interactions

Matt Palensky and Hesham Ali

Dept of Computer Science; College of Information Science and Technology

University of Nebraska at Omaha, Omaha NE 68182-0116

mpalensky@mail.unomaha.edu, hesham@unomaha.edu

A central problem in creating simplified amino acid alphabets is narrowing down the massive number of possible simplifications. Since considering all possible simplifications is intractable, effectively using heuristics is essential. Genetic algorithms have been effective in providing near-optimal solutions for similar combinatorial problems with large solution spaces. Simplifying amino acid alphabets may potentially reduce the degree of complexity for several difficult problems. In this project, we study the impact of reducing the alphabet in addressing an important open problem in microbiology, which is predicting protein-protein interactions. Various techniques for predicting protein-protein interactions exist, but no single method can effectively predict more than a small subset of interactions. Hence, a comprehensive listing of all of a cell's protein-protein interactions may require many complementary approaches. Simplified amino acid alphabets could uncover hidden relationships in protein sequences, and in turn provide a valuable first step in solving protein-related microbiological problems. In this research, we employ a new genetic algorithm to simplify amino acid alphabets and show the impact of reducing the alphabet in predicting protein interactions.

A Hidden Markov Model for Gene Functional Prediction

Xutao Deng, Hesham Ali

Dept of Computer Science

University of Nebraska at Omaha, Omaha, NE 68131

xdeng@mail.unomaha.edu, hesham@unomaha.edu

The prediction of the functional class of genes or Open Reading Frames (ORFs) is important for understanding the role of unknown genes and gene networks. Currently, the best accuracy of the prediction provided by available computational approaches is around 30%. In this project, we develop a gene functional prediction tool based on Hidden Markov Models (HMMs). The training data are solely time-series gene expression data from yeast experiments. Because time-series expression data have the Markov property and HMMs have shown great success in modeling sequential data sets in the area of speech recognition, we expect the prediction accuracy will be higher than that of other data mining tools such as Support Vector Machines (SVMs) and decision trees. Preliminary results showed that HMMs can be elegantly applied in gene expression data sets and achieve better performance than SVMs. Currently, we are integrating HMMs into Dynamic Bayesian Networks (DBNs) for functional prediction of genes.

Analysis of core promoter motifs in genes that are expressed in pancreas

The expression of genes in specific cells and tissues is regulated by the promoter of each gene. The best known element of the core promoter is the TATA box. This element is located upstream of the transcription start site (-25 to -32), and conforms more or less strictly to the TATAA sequence motif. A second core promoter element is the initiator that overlaps the start site of the transcription, with the sequence Py-Py-A-N-T/A-Py-Py, where A is the start site of transcription.

The major interest in our laboratory is to understand which DNA sequences regulate cell-type-specific expression of genes in various tissues and over time. In this study, we analyzed the features and composition of promoters associated with genes that range in their expression from highly specific for one or few tissues to broad or ubiquitous distribution. Data on tissue distribution patterns of expression are available for many genes from microarray assays, and relative specificity of each gene was classified on the basis of Shannon entropy as the information measure.
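The entropy-based specificity measure mentioned above can be illustrated with a minimal sketch. It assumes that a gene's expression levels are normalized across tissues into a probability distribution and that the Shannon entropy of that distribution is used as the score; the abstract does not state the exact normalization, and the function name and expression values below are hypothetical.

```python
import math


def tissue_entropy(expression):
    """Shannon entropy (in bits) of one gene's expression across tissues.

    `expression` maps tissue name -> non-negative expression level.
    Low entropy suggests tissue-specific expression; high entropy
    suggests broad or ubiquitous expression.
    """
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("expression values must sum to a positive number")
    entropy = 0.0
    for level in expression.values():
        if level > 0:  # a term with p = 0 contributes nothing
            p = level / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical microarray signals, for illustration only.
specific = {"pancreas": 980.0, "liver": 10.0, "brain": 5.0, "lung": 5.0}
ubiquitous = {"pancreas": 250.0, "liver": 250.0, "brain": 250.0, "lung": 250.0}

print(tissue_entropy(specific))    # close to 0: mostly one tissue
print(tissue_entropy(ubiquitous))  # log2(4) = 2 bits: evenly spread
```

Under this assumption the score ranges from 0 (expression in a single tissue) to log2 of the number of tissues (perfectly even expression), which gives a natural ordering from specific to ubiquitous genes.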

Our analyses of promoter composition indicate that promoters of human genes with high specificity for expression in pancreas preferentially contain the TATA box motif, with or without initiator. With decrease of cell-type specificity, the fraction of genes with TATA motifs in their core promoter decreases. Conversely, initiator motifs are more prevalent in widely expressed genes.

This pattern of differential promoter composition was confirmed for the corresponding mouse orthologous genes. Thus, tissue-specific and ubiquitous genes appear to be regulated by different core promoter elements. The relevance of our results for mechanisms of gene regulation in various tissues will be discussed.

BioExtract Server Metadata Mapping – Creating a Federated Biological Database

Xingming Du

Department of Computer Science,

University of South Dakota; Vermillion, SD 57069

xdu01@usd.edu

The rapid growth of biology research has resulted in an explosion of bioinformatics data (DNA Sequences, gene expression data) and databases. This generation of data and databases has promoted biology research even further. However, the distribution of those databases has hindered biology research in some way, which brought forth the demand of mapping federated biological databases. A federated database refers to a set of disparate databases which are viewed by researchers as one database. Federated biological databases represent an extraordinarily diverse collection. They are complicated by the complex data type and even further complicated by the kinds of interpretation supported by the databases. BioExtract Server Metadata mapping is one technique used to map search fields semantically to those in federated biological databases. BioExtract Server talks to federated biological databases semantically, extracts the related data from those databases and presents the data to researchers who can get the results with one step. BioExtract server provides flexible web-based query capabilities for researchers through the implementation of a relational meta-database. It also supports system administration functionality for integration of new federated databases via a web browser.

The long-term goal of this project is to create an expert system for HIV/AIDS research by using the power of computer and information sciences. The expert system will combine expertise in epidemiology, infectious diseases, neurosciences, biology, early detection and patient care. The system will allow clinicians and researchers to collect HIV/AIDS-related data in a convenient and efficient way, transform the data into statistical information, and use it in statistical models to predict the risk of AIDS development as well as estimate the survival rates of HIV/AIDS patients.

Comparative Analysis of Gene Prediction Methods and Development of a Fungal Genome Database System

Skanth Ganesan1; Steven Harris2; Etsuko N. Moriyama3

1Department of Computer Science; University of Nebraska-Lincoln

2Department of Plant Pathology; Plant Science Initiative; University of Nebraska-Lincoln

3School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

skanth@unlserve.unl.edu; sharri1@unl.edu; emoriyama2@unl.edu

Fungi, plants, and animals represent the three kingdoms of eukaryotic organisms. A vast number of fungi are filamentous and have enormous health, economic and ecological impact. As part of the Fungal Genome Initiative, the complete genome sequences of several filamentous fungi have recently become available. Multiple gene prediction programs are being used to address the problem of identifying coding regions within these genomes. Despite several limitations, existing methods of gene prediction and models of gene structure are often applied to newly sequenced organisms for which no model or method has yet been tuned. Our objective is to analyze the available gene mining methods by assessing their prediction performance as well as their use of varied genomic information. We are developing an integrated genome database system that will facilitate the genome annotation of three filamentous fungi: Neurospora crassa, Aspergillus nidulans and Fusarium graminearum.

Dichotomy Analysis of Proteomics and Genomics data

Marina Sapir and Simon Sherman

Eppley Institute for Research in Cancer and Allied Diseases

University of Nebraska Medical Center; Omaha NE 68198-6805

marina@sapir.us, ssherm@unmc.edu

We introduce an intuitive integrated approach for the analysis of genomics and proteomics data. The approach is based on a certain basic dichotomy of each feature. We use this dichotomy to evaluate the classification ability of the feature and to make an elementary classifier. A simple voting procedure aggregates these independent classifiers into the final decision rule. The proposed dichotomy test can be used to evaluate the statistical significance of the correlation between the feature and the class attribute. Applying the dichotomy analysis to the Leukemia and Ovarian Cancer datasets, we were able to find several features with strong classification abilities. The resulting classification rules, built with very few features, are comparable in prognostic accuracy with much more computationally intensive procedures applied to the same datasets.
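The general pattern described above, one elementary classifier per feature combined by simple voting, can be sketched as follows. The abstract does not define the dichotomy it uses, so this sketch assumes a plain threshold split at the midpoint between the two class means; that choice, the function names, and the toy data are all hypothetical and are not the authors' method.

```python
def fit_elementary(values, labels):
    """Fit a one-feature threshold classifier.

    Assumption: the feature is split at the midpoint between the class
    means. Returns (threshold, class predicted above the threshold).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    high_class = 1 if mean_pos >= mean_neg else 0
    return threshold, high_class


def predict_elementary(model, value):
    """Predict 0 or 1 from a single feature value."""
    threshold, high_class = model
    return high_class if value > threshold else 1 - high_class


def fit_voting(samples, labels):
    """Fit one elementary classifier per feature column."""
    n_features = len(samples[0])
    return [
        fit_elementary([s[j] for s in samples], labels)
        for j in range(n_features)
    ]


def predict_voting(models, sample):
    """Majority vote over the elementary classifiers."""
    votes = sum(predict_elementary(m, v) for m, v in zip(models, sample))
    return 1 if 2 * votes > len(models) else 0


# Toy data: two informative features and one noisy feature.
samples = [(1.0, 5.0, 0.2), (1.2, 4.8, 0.9), (3.0, 2.0, 0.4), (3.3, 1.9, 0.8)]
labels = [0, 0, 1, 1]
models = fit_voting(samples, labels)
print(predict_voting(models, (3.1, 2.1, 0.1)))
print(predict_voting(models, (1.1, 5.1, 0.95)))
```

Using an odd number of features avoids tied votes; with an even number, this sketch resolves ties in favor of class 0.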

DNA-Computing

Vladimir Ufimtsev and Vyacheslav Rykov

Department of Mathematics

University of Nebraska at Omaha, Omaha, NE 68132

vufimtsev@mail.unomaha.edu, vrykov@mail.unomaha.edu

Molecular computing is a field that focuses on manipulations with single molecules for computational purposes. One of the most powerful molecules that has been found for these purposes is deoxyribonucleic acid (DNA). Through the powers of biomolecular computing the extraordinary parallelism occurring in nature can be uncovered and used to our advantage. Great parallelism at nanoscales has been discovered to be inherent in natural phenomena and we can now realistically imagine this power being used to solve computational problems. The formulation of revolutionary algorithms in biomolecules would present a very effective alternative for the growing demands of computational power in our world. This paper will focus on the sticker model for DNA computing. Existing algorithms for NP-Complete problems have been adapted and new methods and operations are proposed for computations using the sticker model.
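For readers unfamiliar with the sticker model, the following is a minimal software simulation of the four operations usually associated with it in the literature (combine, separate, set, clear). It is an illustrative sketch only: memory strands are represented as bit tuples (1 = sticker annealed at that region), a test tube is a list of strands, and none of the new methods or operations proposed by the authors are reproduced here.

```python
from itertools import product


def combine(tube_a, tube_b):
    """Pour two tubes together."""
    return tube_a + tube_b


def separate(tube, i):
    """Split a tube into strands with bit i on and strands with bit i off."""
    on = [s for s in tube if s[i] == 1]
    off = [s for s in tube if s[i] == 0]
    return on, off


def set_bit(tube, i):
    """Turn bit i on in every strand of the tube."""
    return [s[:i] + (1,) + s[i + 1:] for s in tube]


def clear_bit(tube, i):
    """Turn bit i off in every strand of the tube."""
    return [s[:i] + (0,) + s[i + 1:] for s in tube]


# Start from a library holding every 3-bit memory complex.
library = list(product((0, 1), repeat=3))

# Compute bit 2 = bit 0 for every strand "in parallel":
# separate on bit 0, then set or clear bit 2 in each half, then recombine.
on, off = separate(library, 0)
result = combine(set_bit(on, 2), clear_bit(off, 2))
print(sorted(set(result)))
```

Each operation acts on every strand in a tube at once, which is how the model expresses the massive parallelism described above; the Python lists merely stand in for that physical parallelism.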

Evolution of 3-isopropylmalate dehydrogenase

Philip M. Terry and Hideaki Moriyama

Department of Chemistry; University of Nebraska-Lincoln, Lincoln, NE

pterry2@unl.edu, hmoriyama2@unl.edu

In excess of 150 protein sequences for a family of decarboxylating dehydrogenases, which includes those for 3-isopropylmalate, isocitrate, and tartrate, are now available for study of evolutionary, sequence, structure, and function relationships among species. Among them, 3-isopropylmalate dehydrogenase (IPMDH) is well-studied biophysically and biochemically.

To analyze sequence variation in IPMDH among the available sequences, we created multiple sequence alignments (MSA), using as input a set of BLASTP hits (E value < e-14) resulting from an IPMDH as query. Gaps and substitutions in columns of the MSA are being compared with available structures from the PDB to validate the alignment of sequences in the MSA. We project biochemical knowledge of IPMDH to the MSA to validate the alignments.

Horizontal (or lateral) gene transfer (HGT) can occur between distantly related species. This phenomenon is considered a major force in organismal evolution. However, questions still surround the mechanisms and validity of HGT. Phylogenetic analysis is the best currently available method for establishing incidences of ancient HGT. Here, we report phylogenetic analyses of SET-domain containing proteins in prokaryotes and eukaryotes. The SET domain has been defined as a highly conserved peptide (~130 amino acids) found in epigenetic regulators. Biochemically, the SET peptide carries lysine methylating activity that targets specific lysine residues from the tails of the nucleosomal histones. Because chromatin and histones are signature features of eukaryotes, it has been assumed that SET-genes are only found in eukaryotes. SET-domain coding genes were reported in some bacteria, but their initial identification only in parasitic and symbiotic species was assumed to represent transfer from a eukaryote to a prokaryote. Comprehensive analysis of ~150 fully sequenced bacterial and archaebacterial genomes identified ~30 prokaryotic species (pathogenic, symbiotic, and free-living) that carry SET domain coding genes. Even closely related species within the same family can differ by the presence/absence of SET genes. These data seemed to favor HGT. Further analysis, however, revealed SET-gene paralogs in bacteria. Phylogenetic analysis of prokaryotic and eukaryotic SET genes revealed a surprising picture indicating that the SET domain probably has a common ancestor. Therefore, the prokaryotic gene(s) did not come from horizontal gene transfer between the eukaryotic and prokaryotic domains of life. However, there are cases of an apparent SET-gene HGT between prokaryotic species, like the SET-genes in Bacillus and Methanosarcina. Finally, we show that in bacteria, a peptide downstream of the SET peptide (named the post-SET domain in eukaryotes) has co-evolved together with the SET domain to perform bacterial gene specific functions.

Genome-Wide Identification of Thiol/Disulfide Oxidoreductases

Dmitri E Fomenko, Stephen Scott, and Vadim N Gladyshev

Department of Biochemistry

University of Nebraska-Lincoln, Lincoln, NE 68588

dfomenko@genomics.unl.edu, sscott@cse.unl.edu, vgladyshev1@unl.edu

Thiol-dependent redox regulation is an important, but poorly characterized biological process that is involved in oxidative stress defense, signal transduction, protein folding and regulation of protein activity. Thiol-dependent redox processes are catalyzed by structurally distinct families of enzymes, thiol/disulfide oxidoreductases, which are difficult to identify by available protein function prediction programs. The CxxC motif (two cysteines separated by two residues) is most often present in thiol/disulfide oxidoreductases. We found that replacement of one of the cysteines in the CxxC motif with serine or threonine is also suitable for a catalytic redox function. We show that conserved Cxx(C|S|T), (C|S|T)xxC (x is any amino acid) sequences present in the context of a simple secondary structure pattern may be used as a predictor of redox function.
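The sequence patterns named above translate directly into a regular expression. The sketch below only locates Cxx(C|S|T) and (C|S|T)xxC occurrences in a protein sequence; it does not model the secondary structure context or the conservation filter that the abstract says are also required, and the function name and example sequences are hypothetical.

```python
import re

# Lookahead so that overlapping occurrences are all reported.
# "." stands for x (any amino acid).
CXXC_LIKE = re.compile(r"(?=(C..[CST]|[CST]..C))")


def find_cxxc_like(sequence):
    """Return (0-based position, matched 4-mer) for each motif occurrence."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in CXXC_LIKE.finditer(sequence)]


# Made-up sequences, for illustration only.
print(find_cxxc_like("MACGPCKA"))  # one CxxC occurrence
print(find_cxxc_like("SAACTT"))    # an SxxC occurrence
print(find_cxxc_like("AAAA"))      # no occurrence
```

A 4-mer such as CGPC satisfies both patterns but is reported once per start position, so the hit count reflects candidate sites and not pattern matches.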

Cross-talk between cell adhesion molecules (CAMs) on cancer cells and specific host microenvironment cells is critical for tumor invasion and metastasis. Identifying peptidomimetics that bind membrane receptors seemingly on vascular endothelial cells of specific organs is significant in organ-selective targeting or blocking. Eleven unique peptides that can bind specifically to lung, liver, bone marrow or brain were identified by in vivo selection using a phage display peptide library in NOD-SCID mice. These organ-specific peptides are seven amino acids in length, and they are the critical binding residues involved in CAM specific protein interactions. We have developed a high-throughput strategy based on the mouse genome and proteome to identify known CAMs containing these peptides in their extracellular regions. The strategy involves three overlapping methods comprising nucleotide/protein sequence, annotation and mRNA expression based database searches. These searches were done using the peptides as queries against different databases, including a Local Mouse Cell Adhesion Molecule (LMCAM) sequence database developed in our lab. The resultant proteins were analyzed using a filtering algorithm that selected approximately 30 known CAMs, including a family of proteins called semaphorins. Experimental analysis of SEMA5A mRNA expression in human pancreatic cancer cell lines showed expression in lines originating from metastatic tumors, but not in those from primary tumors. The results are promising as suggested from the examination of public microarray and SAGE expression databases, and protein structure information. Therefore, a number of new CAMs can be identified with these combined computational and experimental methodologies as an initial approach, thereby paving the way for complete understanding of various disease processes, and making specific targeting possible using these peptidomimetics.

Identification of microorganisms at the species level by comparing strings derived from their DNA sequences

A new approach to evaluate the relatedness of DNA sequences that eliminates the requirement to align sequences prior to analysis has recently been described and termed the Relative Complexity Measure (RCM). The first step in the RCM method yields a “dictionary” composed of “strings” derived from the sequence being analyzed. In this study the dictionary of strings derived by the RCM algorithm from the 18S rDNA and cytochrome b gene sequences was utilized to evaluate the feasibility of identifying microorganisms based on the similarity of strings present in their respective dictionaries. The 18S rDNA and cytochrome b gene sequences from multiple strains of the following organisms were obtained from GenBank and evaluated: Candida albicans, Candida glabrata, Candida parapsilosis, Candida krusei, including a single strain of Candida dubliniensis, Candida lusitaniae, Candida tropicalis and Malassezia furfur. Using the RCM algorithm a unique dictionary was created for each species. The dictionaries were then compared using a second algorithm called RCM-C, which extracted the “common strings” and “unique strings” and calculated the membership of strings for the dictionaries (membership = number of common strings / (number of common + unique strings)). The membership values reflected the degree of similarity between the two dictionaries. Using this method we compared the dictionaries obtained from the cytochrome b and 18S rDNA gene sequences separately from seven Candida species and M. furfur. In addition, the dictionaries obtained from cytochrome b and 18S rDNA sequences for each species were combined and queried with the dictionaries obtained from either the cytochrome b or 18S rDNA sequence. The results showed that the RCM-C approach correctly differentiated these microorganisms at the species level when either one of the target sequences was used for query. Combining the dictionaries from two different target sequences did not alter the ability of this approach to identify microorganisms at the species level. These results demonstrate that comparing strings derived from multiple target DNA sequences using RCM and RCM-C algorithms was able to identify fungal organisms at the species level, and this approach was a dependable alternative to pairwise sequence comparison.
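The membership ratio quoted in this abstract can be computed with plain set operations. The sketch below assumes that "unique strings" means strings present in only one of the two dictionaries, which makes the ratio equivalent to the Jaccard index; the abstract does not spell this out. The function name and the dictionaries shown are made up for illustration and are not actual RCM output.

```python
def membership(dict_a, dict_b):
    """Common strings divided by (common + unique) strings.

    Assumption: "unique" strings are those found in exactly one of the
    two dictionaries.
    """
    a, b = set(dict_a), set(dict_b)
    common = a & b
    unique = a ^ b  # symmetric difference
    total = len(common) + len(unique)
    return len(common) / total if total else 0.0


# Hypothetical dictionaries of strings.
species_1 = {"ACG", "TT", "GA"}
species_2 = {"ACG", "TT", "CC"}

print(membership(species_1, species_2))  # 2 common, 2 unique -> 0.5
print(membership(species_1, species_1))  # identical dictionaries -> 1.0
```

Values near 1 indicate nearly identical dictionaries, and values near 0 indicate little shared content, matching the abstract's statement that membership reflects the degree of similarity.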

Identifying Splice Variants Through EST Assembly

Yi-feng Li and Hesham Ali

Department of Computer Science; College of Information Science and Technology;

University of Nebraska at Omaha, Omaha, NE 68182-0116

yl1@unmc.edu; hesham@unomaha.edu

Alternative splicing has recently emerged as the most important mechanism to increase protein diversity. To further explore its functional roles and regulatory mechanisms, it is essential to identify different splice forms from available resources. The Expressed Sequence Tags (EST) database, which contains a broad sample of mRNA, provides an ideal source of hints on different splicing patterns. Furthermore, the UniGene system at NCBI has partitioned EST sequences into a non-redundant set of gene-oriented clusters. In this project, a program tailored for EST assembly is developed to reconstruct each EST cluster into contigs that correspond to different transcripts. After assembly, the reconstructed transcripts are aligned with the parent genomic DNA to reveal possible splicing patterns. This assembly approach significantly facilitates the discovery of splice variants from EST data.

2Center for Biotechnology, School of Biological Sciences; University of Nebraska-Lincoln

cgwang@bigred.unl.edu, glu3@unlnotes.unl.edu

Studies of mitochondrial genetic diseases and mitochondrial DNA (mtDNA) intraspecies diversity are key topics in population genetics and medicine. Most mtDNA variations within and among populations are single base variants, known as single nucleotide polymorphisms (SNPs). SNPs, as an abundant form of mitochondrial genome variation, however, have not been systematically studied in the field of human molecular evolution and genetic diseases. This research uses the mitochondrial genome as a model to study molecular evolution and disease-associated SNPs in humans. For this purpose, a bioinformatics tool consolidating mtSNP information from various public repositories and the literature is developed. We will present here the preliminary findings on mitochondrial SNPs potentially associated with human population evolution and genetic diseases.

Mining Principal Components in Very Large Gene Expression Profiles

Li Xiao; Simon Sherman

Eppley Institute for Research in Cancer and Allied Diseases

University of Nebraska Medical Center, Omaha, NE 68198-6805

lxiao@unmc.edu, ssherm@unmc.edu

The microarray is a technique for monitoring the expression of thousands of genes simultaneously. The gene expression profiles in a microarray experiment often form a huge multi-dimensional dataset. Principal Component Analysis can present the variance structure of a set of variables through a few new variables, which are linear combinations of the original ones. However, the computational cost of obtaining the principal components of a large multi-dimensional dataset is very high. In this work, a method to efficiently mine the principal components in very large gene expression profiles is proposed. The silhouette validation technique was applied to optimize the k value in k-means classification of the gene expression profiles. The dimensions of the data set are then reduced by using the average values within each suitable class of genes instead of the individual values of each gene. It was shown that, for very large multi-dimensional gene expression profiles, the principal components could be calculated in a very reasonable computational time.
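The dimension-reduction step described above (replacing the genes in each class by their class average) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class labels are assumed to come from a prior k-means run with k chosen by silhouette validation, and the function name `average_by_class` is hypothetical.

```python
from collections import defaultdict


def average_by_class(profiles, labels):
    """Replace the genes in each class by the mean expression profile of that class.

    profiles: list of per-gene expression vectors (one value per sample).
    labels:   class index for each gene (assumed to come from k-means).
    Returns a dict mapping class index -> averaged profile.
    """
    groups = defaultdict(list)
    for profile, label in zip(profiles, labels):
        groups[label].append(profile)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


if __name__ == "__main__":
    # Toy data: three genes measured in two samples, grouped into two classes.
    toy_profiles = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
    toy_labels = [0, 0, 1]
    print(average_by_class(toy_profiles, toy_labels))
```

Principal components would then be computed on the k averaged profiles rather than on every gene, which is where the saving in computation time comes from.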

Since small peptides with turn structures are highly flexible, their characterization by either NMR or UV-CD spectroscopy is usually difficult and yields only time-averaged spectra with contributions from each structure type present. NPGQ, GKDG, DDKG, DEKS, VPaH, and VPsH were previously characterized as β-turns or turn-forming cores of longer peptides by one or more of the above methods. Therefore, in this study 25 ns molecular dynamics simulations of these structures were performed. The DSSP method and clustering were used to analyze the trajectories. DSSP analysis of the trajectories showed a fluctuation between β-turn and unordered structure for all sequences, although it failed to recognize bend structures because of the insufficient peptide chain length.

The SET domain is a motif of approximately 130 amino acids identified in plants, animals, and yeast, and considered to be associated with eukaryotic functions. SET-domain proteins both activate and repress gene transcription. Proteins in different families contain unique sets of other domains that are not shared between families. In order to elucidate the evolutionary relationships and distribution of this protein family across eukaryotes, we are conducting large-scale searches of various fungal genomic databases as well as those of protozoa and other eukaryotes. Our results indicate that some SET-domain protein groups are unique to filamentous fungal species. Phylogenetic analysis shows that these proteins can be classified based on the internal architecture of their SET domain sequences.

On Clustering Biological Data Using Message Passing

Huimin Geng*; Dhundy Bastola†; Hesham Ali*

*Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116

†Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE 68198-6495

hgeng@mail.unomaha.edu, dbastola@unmc.edu, hesham@unomaha.edu

Clustering algorithms have been used frequently in many areas of bioinformatics to classify biological data, for example in the analysis of gene expression and in the building of phylogenetic trees. In this study, we propose a new clustering algorithm that employs the concept of message passing. Message Passing Clustering (MPC) allows data elements to communicate with each other and produces clusters by intrinsic processes, and hence simulates human intelligence. We have used 35 simulated data sets of dynamic gene expression typical of microarray experiments to evaluate the proposed method. In our experiments, a 95% hit rate is achieved, with 639 genes out of a total of 674 correctly clustered. We have also applied MPC to real datasets to build phylogenetic trees. The results obtained show higher classification accuracy compared to traditional clustering methods.

Ontology Specific Data Mining Based on Dynamic Grammars

Daniel Quest; Hesham Ali

Dept of Computer Science; College of Information Science and Technology; University of Nebraska at Omaha, Omaha, NE 68182-0116

daniel_quest@cox.net, hesham@unomaha.edu

In this project, we introduce a new formal approach for mining biological databases. The proposed grammar-based approach provides a flexible and powerful tool for advanced sequence comparison and data mining. The approach benefits from the power of regular expressions in allowing bioinformatics researchers to use advanced queries in comparing sequences and searching for motifs in biological databases. A common hypothesis is that biological sequences contain elements or functional units that determine the interactions of the molecule. These elements may not be detectable by a homology search using simple alignment tools because of the interference and noise produced by mutations in the evolutionary process. However, these consensus subsequences or expressions are the key to the functionality of the sequence or to understanding the relationship between the sequence and other biological units. In this paper, we introduce a formal grammar and a corresponding data mining engine capable of extracting records.
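As a small illustration of regular-expression motif searching of the kind mentioned above, the sketch below uses Python's standard `re` module. It does not reproduce the grammar proposed in the abstract, which is not specified here. The pattern is the well-known PROSITE N-glycosylation site (PS00001, N-{P}-[ST]-{P}) rewritten as a regular expression; the function name `find_motif` and the test sequence are made up for the example.

```python
import re

# PROSITE PS00001 (N-glycosylation site), N-{P}-[ST]-{P}, as a regular expression.
# The lookahead wrapper lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(pattern, sequence):
    """Return (start, matched_text) for every occurrence, overlapping ones included."""
    return [(match.start(), match.group(1)) for match in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Made-up protein fragment, used only to exercise the pattern.
    print(find_motif(N_GLYCOSYLATION, "MKNGSAANPTQ"))
```

A motif such as this one tolerates substitutions at the variable positions, which is why it can find sites that a simple alignment-based homology search would miss.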

Partition Coding and Its Application to Analysis of Complex Disease Data.

of Biology has endowed us with an extravagant amount of new knowledge. From such knowledge we develop a better understanding of the functionality of the human being. Genetics has enabled us to detect a class of diseases known as complex diseases. At the present time, diagnosis of complex diseases such as ADHD is a problem that is still being studied. The mathematical tools that we discuss will aid the analysis of complex disease data. These methods present new implications of the partitioning of data sequences. We define a new concept of distance (based on the Hamming distance) between two distinct sets of unordered partitions, which we call a partition-distance. We can verify that this distance is a valid metric on the space of unordered partitions of any finite set S of size n, where each partition contains <= q disjoint subsets of S. Using the distinct partitions of a set S, endowed with the proposed metric, we investigate a new class of codes, which we call q-partition codes.
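The abstract does not give the exact definition of the partition-distance, so the following is only one plausible Hamming-based construction, offered as a sketch. Each partition is written as a vector of block labels over the elements of S, and because the blocks are unordered, the distance is taken as the smallest Hamming distance over all relabelings of the q blocks. The function names are hypothetical, and the brute-force search over permutations is practical only for small q.

```python
from itertools import permutations


def to_labels(partition, elements):
    """Map each element of S to the index of the block that contains it."""
    block_of = {x: i for i, block in enumerate(partition) for x in block}
    return [block_of[x] for x in elements]


def partition_distance(p1, p2, q):
    """Minimum Hamming distance between label vectors over all block relabelings.

    Assumptions: p1 and p2 partition the same set S, and each has at most q blocks.
    """
    elements = sorted(x for block in p1 for x in block)
    a = to_labels(p1, elements)
    b = to_labels(p2, elements)
    best = len(elements)
    for relabel in permutations(range(q)):
        hamming = sum(1 for x, y in zip(a, b) if relabel[x] != y)
        best = min(best, hamming)
    return best


if __name__ == "__main__":
    first = [{1, 2}, {3, 4}]
    second = [{1}, {2, 3, 4}]
    print(partition_distance(first, second, q=2))
```

Under this construction two partitions that differ only in the order in which their blocks are listed are at distance zero, as one would expect of a distance on unordered partitions.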

peptides from tandem mass spectrometry can be used to search by homology and have been found to double the number of peptides added to the percent coverage of a protein or to identify a homologous protein that mass database searching could not determine. We have developed a web-based program, using the BLAST algorithm and custom databases, to automatically sort through de novo peptides and display the data. The predicted de novo sequenced peptides that result from the PEAKS program have inherent errors due to the quality of the spectra that it interprets. In order to determine which sequence is the most accurate for protein identification, searching the sequences by homology can help find the errors without having to physically look at each individual spectrum. Hand sorting through de novo peptides is inaccurate, biased, and time consuming, and requires knowledge of mass spectrometry (often not known by the researchers). Proteomic questions can be categorized into two main groups: known protein confirmation, in which the researchers are looking for expression levels (i.e., presence/absence), and unknown protein identification. In the case where a protein identification was made, maximizing confidence is achieved by % coverage. In the case of an unknown protein, identification can occur, but often does not, because sequence variation causes mass database searching to fail. In these cases, raw data can show that peptide sequences do exist; de novo sequences must be used. In order to limit wasting useful data, automating de novo

2School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

cstrope@cse.unl, emoriyama2@unl.edu

Objectives:

Phylogenetic trees are reconstructed based on multiple alignments. Using any phylogenetic method (e.g., Neighbor-Joining, Maximum Parsimony, or Maximum Likelihood), we can examine evolutionary relationships among protein sequences and elucidate the hypothetical ancestral protein sequences. Multiple alignments of protein sequences are generally more useful if the protein sequences have undergone only point mutations with a limited number of insertion/deletion events. However, this approach is not very effective for modeling more dynamic changes, such as duplication, translocation, insertion, and deletion of large protein regions or domains. In this study, we analyzed the performance of different methods of reconstructing phylogenetic trees from protein sequences with such dynamic evolutionary histories.

Prediction of amphipathic helices using statistical analysis

Mamta Bajaj1; Hideaki Moriyama2; Etsuko N. Moriyama3

1Department of Computer Science; University of Nebraska-Lincoln

2Department of Chemistry; University of Nebraska-Lincoln

3School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

mam_b99@yahoo.com, hmoriyama2@unl.edu, emoriyama2@unl.edu

Many secondary structure prediction methods have been developed. However, very few methods are available for predicting amphipathic helices. Amphipathic alpha helices are very important for protein structure and function. These alpha helices have a hydrophobic face and a hydrophilic face, which correspond to the protein side and the other side. Locating such a helix helps in predicting the function of a protein, for example in DNA-binding proteins. We are developing a method that predicts such alpha helices based on a set of new statistics. Training sets consisting of amphipathic alpha helices are prepared from the Protein Data Bank (PDB). The helices in the PDB are located by calculating torsion angles. Surface accessibility is also calculated to find amphipathic helices, in addition to manual examination. Using this training data, we optimize a set of statistics that discriminates between amphipathic alpha helices and non-amphipathic helices. We will discuss the performance of this new method compared to other methods.

MUC1, a glycosylated transmembrane mucin that is substantially overexpressed and aberrantly glycosylated in many tumors, is believed to contribute to metastasis. The cytoplasmic tail (CT) of MUC1 co-localizes with β-catenin, which interacts with transcription factors to regulate gene expression. We hypothesized that the MUC1 CT is involved in the regulation of expression of genes that contribute to tumor growth and metastasis. To investigate alterations in gene expression in pancreatic tumor cells, we performed microarray experiments using a cDNA array. The results revealed that overexpression of MUC1 in the pancreatic cell line S2-013 differentially regulated 28 genes. Deletion of the MUC1 CT partially restored the expression of 7 genes. The expression levels of 5 of these genes directly correlated with increasing metastatic potential of pancreatic cell lines. To determine whether this is a direct effect of MUC1 CT mediated signal transduction, we used bioinformatics strategies to identify the common transcription factor (TF) binding sites present in the promoter regions of these genes. We interrogated the sequences of the promoter regions of these genes using the MatInspector program to search for TF binding sites. We found that a putative binding site for activator protein 4 was present in 6 of the 7 sequences but not in any of 3 control sequences. We also used the motif discovery tool MEME to identify consensus motifs that may be potential TF binding sites in the 7 sequences, but failed to find specific motifs due to the limitations of the program. These results suggest that bioinformatics tools combined with biological techniques are a promising approach for the discovery of downstream TFs that are regulated by novel signal transduction pathways. To further refine our strategy, we need to utilize existing bioinformatics tools more efficiently and also develop new tools appropriate to our study.

Federated QTool: Multidatabase Queries Simplified

Matthew Smart

Department of Computer Science

University of South Dakota; Vermillion, SD 57069

msmart@usd.edu

QTool allows researchers with varying levels of technical skill to interact with the data in their relational databases in order to generate tables of data for reports. It is very flexible in that it can easily be connected to almost any relational database management system on the market today. It is also capable of generating queries that span multiple databases (federated database queries). The user interface has been simplified so that it does not require knowledge of query languages to perform queries. The interface can also be incorporated into a web browser for inclusion in a web-based system.

Ranking Differentially-Expressed Genes in Microarray Data

Linfeng Cao; Li Xiao; Simon Sherman

Eppley Institute for Research in Cancer and Allied Diseases;

University of Nebraska Medical Center; Omaha NE 68198-6805

linfengcao@hotmail.com, lxiao@unmc.edu, ssherm@unmc.edu

In microarray analysis, the expression levels of several thousand genes can be measured simultaneously. To extract biologically meaningful information from microarray data, statistical methods are used. The purpose of this work is to develop and implement in a new software tool, MicroMultitest1.0, a number of different algorithms for statistical testing (such as t-test, p-value, adapted SAM method, p-value adjustment and multiple testing), as well as the Receiver Operating Characteristic (ROC) analysis technique to quantify the accuracies of different methods for analyzing DNA microarray data. We propose to rank differentially-expressed genes by the joint use of several statistical methods. We also propose to use ROC curves to (i) estimate the accuracies of different statistical methods, and (ii) find the optimal cutoffs for the statistical methods.
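The use of a ROC curve to choose a cutoff can be sketched as follows. This is not the MicroMultitest1.0 implementation. The abstract does not state how the optimal cutoff is defined, so the sketch assumes Youden's J (true positive rate minus false positive rate) as the criterion; the function names and the toy scores are hypothetical.

```python
def roc_points(scores, truth):
    """Return (cutoff, false_positive_rate, true_positive_rate) per distinct cutoff.

    A gene is called differentially expressed when its score >= cutoff.
    Assumes truth contains at least one positive and one negative label.
    """
    positives = sum(truth)
    negatives = len(truth) - positives
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, truth) if s >= cutoff and t)
        fp = sum(1 for s, t in zip(scores, truth) if s >= cutoff and not t)
        points.append((cutoff, fp / negatives, tp / positives))
    return points


def best_cutoff(points):
    """Pick the cutoff that maximises TPR - FPR (Youden's J, an assumed criterion)."""
    return max(points, key=lambda p: p[2] - p[1])[0]


if __name__ == "__main__":
    # Toy test statistics with known labels (1 = truly differentially expressed).
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    toy_truth = [1, 1, 1, 0, 0, 1, 0, 0]
    print(best_cutoff(roc_points(toy_scores, toy_truth)))
```

Running the same procedure for each statistical method on data with known labels gives one curve per method, which is what allows their accuracies to be compared on a common footing.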

SPV: A Similar Parikh Vector Search Algorithm for Protein Sequences

Xiaolu Huang, Anguraj Sadanandam, Rakesh Singh and Hesham Ali

Department of Computer Science; College of Information Science and Technology

University of Nebraska at Omaha 68182-0116

xhuang@unmc.edu, hesham@unomaha.edu

Tumor markers are polypeptides expressed at the surface of tumor cells. These molecules can adhere to receptors at the surface of normal tissue cells, and are considered to be important for tumor cell metastasis. Previous studies showed that only 4-7 critical residues are required for protein-protein interactions. These critical residues need not appear in the protein sequence in a specific order or in a contiguous manner in order to perform their function. Understanding these critical residues is very important in drug design and tumor metastasis research. Given an ordered alphabet A of k elements, with redundant elements permissible, the Parikh vector of a word w on the alphabet A is the integer vector v = (n1, n2, …, nk), where ni is the number of occurrences of the i-th letter of A in w. In this project, we propose a new Similar Parikh Vector (SPV) search algorithm. SPV is well suited to tumor marker search and prediction because, unlike traditional alignment algorithms, it does not depend on residue order.
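The Parikh vector defined above is straightforward to compute, as the sketch below shows. The search function is only a hypothetical illustration of how order-independent matching might work: it slides a window of the query's length along a sequence and keeps windows whose Parikh vector is within a given L1 distance of the query's. The abstract does not describe the actual SPV algorithm, so the window scheme, the distance measure and all names are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordered alphabet A (k = 20)


def parikh_vector(word, alphabet=AMINO_ACIDS):
    """Count the occurrences of each letter of the alphabet in word."""
    counts = Counter(word)
    return tuple(counts[letter] for letter in alphabet)


def similar_windows(sequence, query, max_distance=0, alphabet=AMINO_ACIDS):
    """Hypothetical search: windows whose Parikh vector is close to the query's."""
    target = parikh_vector(query, alphabet)
    width = len(query)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        vector = parikh_vector(window, alphabet)
        distance = sum(abs(a - b) for a, b in zip(vector, target))
        if distance <= max_distance:
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    # Made-up sequence: the query residues occur once in order and once reversed.
    print(similar_windows("AKRGDLLDGRW", "RGD"))
```

Because the comparison uses letter counts only, a window containing the same residues in a different order scores the same as an exact match, which an order-dependent alignment would not report.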

We present a back-propagation neural network (BPNN) design for 50 taxol derivatives evaluated with a feature vector of 27 numerically quantified physical and chemical properties. The training set contains 40 compounds with known antitumor activities as output. A cascade of correlation and discriminant analyses then decreases the number of inputs to 8, in order to construct an optimal NN prototype. Based on the training data set and BPNN architecture, meaningful and accurate predictions of the anticancer activity for the 10 tested analogues are achieved. The system design depends greatly on the nature of the non-linearity to be modeled. For data sets containing periodicity (signature), the results indicate that the BPNN is more flexible, with better performance than statistical analyses based on the assumption of normally distributed inputs. In this study, BPNN is used as a powerful tool for the design of quantitative structure-activity relationships (QSAR) with screening of structurally similar taxol analogues for their anticancer activities. The BPNN prototype was validated by synthesis of these compounds and subsequent tests that indicate enhanced antitumor activities in 8 out of 10 predicted taxol analogues. This is more than two times better than the approximately 35% accuracy expected from a statistical classifier.

Usage of multivariate methods in the analysis of protein sequences

Stephen O. Opiyo1, Han Asard2, Stephen Kachman3 and Etsuko N. Moriyama4

1Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583-0915

2Department of Biochemistry, Plant Science Initiative,

University of Nebraska-Lincoln, Lincoln, NE 68588-0664

3Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE 68583-0712

The number of amino acid sequences in databases is increasing. Various methods are needed to extract information from this wealth of data. Multivariate methods have been little used in the analysis of protein sequences in bioinformatics. In this study, we examine two multivariate analysis methods: principal component analysis (PCA) and cluster analysis (CA). Proteins included in the study are Cytochrome b561 (Cyt-b561) and fatty acid desaturase enzymes. The objectives of this study are to use PCA to extract information from the physico-chemical properties of the 20 amino acids, to use auto and cross covariance (ACC) to transform amino acid sequences into quantitative measures, and to use PCA and CA to analyze the transformed protein sequences. We started from 13 physico-chemical properties of the 20 amino acids and used PCA to reduce the dimensionality of this data set. Three principal components (S scores) are extracted. Various sizes of the amino acid range (lag) used to calculate ACC are investigated using amino acid sequences from the proteins listed above. So far we have reduced the lag size to 5 amino acids while retaining the maximum classification power. The use of ACC in data transformation makes it possible to translate amino acid sequences of different lengths into the same number of variables. This enables us to use various multivariate analysis methods without relying on multiple alignments but still including positional information in our analyses. The results from this study show that multivariate methods can be used in protein sequence classification.
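The ACC transformation can be sketched as below, using one commonly cited form of the auto and cross covariance: for descriptors j and k and a given lag, the products of descriptor j at position i and descriptor k at position i + lag are averaged over the sequence. The abstract does not state which variant was used (for example, whether values are centred or normalised), so this form is an assumption. The score table is a placeholder invented for the example and does not contain the S scores from the study; all names are hypothetical.

```python
# Placeholder per-residue descriptors (NOT the S scores from the study).
PLACEHOLDER_SCORES = {
    "A": (0.1, -0.5, 0.3),
    "G": (-0.2, 0.4, 0.0),
    "L": (0.9, -0.1, -0.6),
    "K": (-0.8, 0.7, 0.2),
}


def encode(sequence, table=PLACEHOLDER_SCORES):
    """Turn a sequence into a list of descriptor vectors, one per residue."""
    return [table[residue] for residue in sequence]


def acc_transform(scores, max_lag):
    """Auto and cross covariance terms for lags 1..max_lag.

    scores: list of descriptor vectors, one per residue; len(scores) > max_lag.
    Returns d * d * max_lag values, whatever the sequence length.
    """
    n = len(scores)
    d = len(scores[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(scores[i][j] * scores[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    short = acc_transform(encode("AGLKAG"), max_lag=2)
    longer = acc_transform(encode("AGLKAGLKKA"), max_lag=2)
    print(len(short), len(longer))
```

With three descriptors and a maximum lag of 5, every sequence maps to 3 × 3 × 5 = 45 variables regardless of its length, which is what makes the transformed data suitable for PCA and cluster analysis without a multiple alignment.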

2Department of Computer Science; College of Information Science and Technology; University of Nebraska, Omaha, NE 68182

1Department of Computer Science and Engineering,

University of Nebraska-Lincoln, Lincoln, NE 68588;

achurbanov@unomaha.edu, deogun@cse.unl.edu, hesham@unomaha.edu

In this paper, we describe a new approach to improve the precision of splice site annotation in human genes. The problem is known to be extremely challenging since the human splice signals are highly indistinct and frequent cryptic sites confuse signal sensors. There is strong evidence that Exonic Splicing Enhancers (ESE) and Exonic Splicing Silencers (ESS) influence commitment to splicing at early stages. We propose the use of Bayesian Networks (BN) combined with a Boltzmann machine splice sensor to improve the specificity of splice site prediction. The new program, SpliceScan, was implemented to demonstrate the feasibility of specificity enhancement based on ESE/ESS signal interactions. The performance of SpliceScan was assessed by comparing it to the recently developed GeneSplicer program. Our experimental results show that SpliceScan outperforms GeneSplicer and produces fewer false negatives for the test cases used. The proposed approach is of particular value for Ab initio gene annotation.