In Summer 2012, we participated in the ad-hoc search task of the TREC Microblog Track. We focused on the vocabulary mismatch problem between tweets and queries, and proposed two approaches to address it. The first is query expansion through pseudo-relevance feedback; the second is document expansion of tweets using web documents linked from the tweet body. The two approaches gave additive gains in MAP and P@30, and our best run was in the top 10 of the submitted automatic runs.
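As a minimal sketch of pseudo-relevance feedback (illustrative only, not our exact run configuration), one can take the top-ranked documents from an initial retrieval and append their most frequent non-query terms to the query:

```python
from collections import Counter

def prf_expand(query_terms, feedback_docs, k=5):
    """Expand a query with the k most frequent non-query terms
    found in the top-ranked (pseudo-relevant) documents."""
    counts = Counter()
    for doc in feedback_docs:
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(k)]
    return list(query_terms) + expansion
```

A real system would weight expansion terms (e.g., with a relevance model) rather than simply appending them; this sketch only shows the term-selection idea.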

We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation, but treat the complete complex morphological tag as a factor instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side and evaluate how our translations improve in BLEU scores. Our maximal set of source- and target-side transformations, coupled with some additional techniques, provides a 39% relative improvement from a baseline of 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Since syntactic analysis is already available on the English side, we also experiment with longer-distance constituent reordering to bring the English constituent order closer to Turkish, but find that these transformations provide no additional consistent, tangible gains when averaged over the 10 sets.
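The relative improvement figure follows directly from the two averaged BLEU scores:

```python
# Averaged BLEU scores over the 10 training/test sets.
baseline, final = 17.08, 23.78
relative_gain = (final - baseline) / baseline
print(f"{relative_gain:.1%}")  # 39.2%, reported as 39%
```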

This project was supported by the Qatar Foundation through Carnegie Mellon University's Seed Research program.

De-identified medical records are critical to biomedical research. Text de-identification software exists, including “resynthesis” components that replace real identifiers with synthetic ones. The goal of this research is to evaluate the effectiveness of resynthesis and to examine the possible bias it introduces into de-identification software. We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, on clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files: laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records, and measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule.

The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Training and testing on real records yielded 0.990 accuracy and 0.960 F-measure. The results improved when training and testing on resynthesized records (0.998 accuracy, 0.980 F-measure) but deteriorated moderately when training on real records and testing on resynthesized records (0.989 accuracy, 0.862 F-measure). Moreover, the results declined significantly when training on resynthesized records and testing on real records (0.942 accuracy, 0.728 F-measure). The de-identification tool achieves high accuracy when the training and test sets are homogeneous (i.e., both real or both resynthesized). The resynthesis component regularizes the data and makes them less “realistic,” resulting in a loss of performance, particularly when training on resynthesized data and testing on real data.
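The reported scores can be computed token by token from binary PHI/non-PHI labels; a self-contained sketch (with illustrative labels, not the Vanderbilt data):

```python
def phi_metrics(gold, pred):
    """Token-level precision, recall, F-measure and accuracy for
    PHI detection, where each label is 1 (PHI) or 0 (non-PHI)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f1, accuracy
```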

The project was supported by the Vanderbilt International Office (VIO) Grants Program.

We introduce a controlled natural language for biomedical queries, called BioQueryCNL, and present an algorithm to convert a biomedical query in this language into a program in answer set programming (ASP), a formal framework for automating reasoning about knowledge. BioQueryCNL allows users to express complex queries (possibly containing nested relative clauses and cardinality constraints) over biomedical ontologies; transforming BioQueryCNL queries into ASP programs makes it possible to automate reasoning about these ontologies with ASP solvers. We precisely describe the grammar of BioQueryCNL, implement our transformation algorithm, and illustrate its applicability to biomedical queries with examples.
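As an illustration of the idea only (a hypothetical one-pattern toy, not the actual BioQueryCNL grammar or transformation algorithm), a controlled query can be mapped to an ASP rule by pattern matching:

```python
import re

def cnl_to_asp(query):
    """Translate one toy CNL pattern into an ASP rule.
    The predicate names gene/1 and targets/2 are made up for this sketch."""
    m = re.match(r'What are the genes that are targeted by "(\w+)"\?', query)
    if not m:
        raise ValueError("unsupported query pattern")
    drug = m.group(1)
    return f'answer(G) :- gene(G), targets("{drug}", G).'
```

The real transformation handles nested relative clauses and cardinality constraints, which requires a full grammar rather than a single regular expression.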

We approached this problem in two ways. Our first method used factor analysis to examine the underlying structure among the variables. Our second used a genetic algorithm to find a subset of features that helps us better classify CEOs.

Fall 2007

Developing A New Approach to Measure the Similarities of Protein Structures Using Network Properties

Protein structure prediction is one of the most important research areas in bioinformatics. CASP is a world-wide experiment that assesses the quality of prediction methods and results from international research groups. CASP evaluation is based on comparing each predicted model with the corresponding native structure.
In this work we estimate a new function that measures the similarity between model and native protein structures. Moments of graph-theoretic properties are used as the basis of the similarity measure, and multiple linear regression is applied to these graph properties to estimate the new function.
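A minimal multiple-linear-regression fit via the normal equations (a generic sketch with made-up data points, not our actual graph-property features):

```python
def fit_linear(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bd*xd.
    Solves the normal equations (A^T A) beta = A^T y by Gaussian elimination."""
    n, d = len(X), len(X[0])
    A = [[1.0] + list(row) for row in X]  # prepend intercept column
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(d + 1)]
         for i in range(d + 1)]
    v = [sum(A[k][i] * y[k] for k in range(n)) for i in range(d + 1)]
    # Forward elimination with partial pivoting.
    for col in range(d + 1):
        piv = max(range(col, d + 1), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, d + 1):
            f = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    # Back substitution.
    beta = [0.0] * (d + 1)
    for i in range(d, -1, -1):
        beta[i] = (v[i] - sum(M[i][j] * beta[j]
                              for j in range(i + 1, d + 1))) / M[i][i]
    return beta
```

In practice one would use a linear-algebra library; the explicit solver is shown only to keep the sketch self-contained.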

Suveyda Yeniterzi, Reyyan Yeniterzi, Alper Kucukural, Nilay Noyan and Ugur Sezerman, A New Approach to Measure the Similarities of Protein Structures Using Network Properties, presented at HIBIT'08, International Symposium on Health Informatics and Bioinformatics, May 18-20, 2008, Istanbul, Turkey.

A New Approach to Measure the Similarities of Protein Structures Using Network Properties, poster presented at BIOSYSBIO 2008: Synthetic Biology, Systems Biology and Bioinformatics, April 20-22, 2008, London, UK.

The ESP Game is the best-known example of Games With a Purpose: games played by humans that, in the background, collect the players' computations for use in research. Luis von Ahn developed many such games to improve the accuracy of search and other computer computations.

Today, many machine learning applications need more accurate data; statistical machine translation is one of them. In this project we aim to address this problem by collecting word alignments through a game called "E.T. English Turkish Alignment Game". In this game, two players simultaneously try to align the words of the same English and Turkish sentences; alignments on which the players agree are stored, together with statistics about them.

Fall 2006 - Spring 2007

Using Genetic Algorithms to Select the Minimum Number of Features for Classification

Selecting the most relevant factors from genetic profiles that can optimally characterize cellular states is of crucial importance in identifying complex disease genes and biomarkers for disease diagnosis and for assessing drug efficacy. In this work, we present a genetic algorithm approach to the feature subset selection problem that can be used to select an optimum set of genes for the classification of gene expression data. We implemented a dynamic parent generation procedure inspired by nature: the idea that fewer, fitter genes (features) make up fitter, more evolved parents enabled us to dynamically reduce the number of genes. In this way we obtained the optimum number of features with the highest classification accuracy for each data set.
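A skeletal version of GA-based feature subset selection (with a made-up surrogate fitness in place of a real classifier, and hypothetical "informative" feature indices; the real fitness would be cross-validated classification accuracy):

```python
import random

INFORMATIVE = {1, 4, 7}  # hypothetical disease-relevant features among 10

def fitness(mask):
    """Toy surrogate: reward informative features, penalize subset size."""
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.2 * sum(mask)

def ga_select(n_feat=10, pop_size=20, gens=40, seed=0):
    """Evolve bitmasks (1 = feature selected) toward high fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # elitism: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feat)
            child = a[:cut] + b[cut:]   # one-point crossover
            i = rng.randrange(n_feat)   # single-bit mutation
            child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

The size penalty in the surrogate fitness mirrors the "fewer, fitter genes" idea: subsets that classify well with fewer features dominate the population over generations.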

Automatic speech recognition (ASR) is the process of finding the most likely word sequence for a given acoustic speech signal. An ASR system consists of several components: feature extraction, an acoustic model, and a decoder, which combines a language model and a lexical model. In this project we mainly dealt with the language model and the lexical model; we used off-the-shelf acoustic models and produced an ASR system for the recognition of a set of 911 audio files.
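For the language-model component, a minimal maximum-likelihood bigram model over a toy corpus (illustrative only, not the model we actually trained):

```python
from collections import Counter

def bigram_lm(corpus):
    """Build an MLE bigram model P(w2 | w1) from tokenized sentences,
    with <s> and </s> as sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])              # history counts
        bigrams.update(zip(toks[:-1], toks[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
```

A deployable model would add smoothing (e.g., Kneser-Ney) so unseen bigrams do not get zero probability.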

Spring 2006

Developed an Online Search Database for SU Sponsored Research Award/Proposal Projects

Transcription factors (TFs) control the expression levels of genes by binding to regulatory DNA sequences in the genome. Finding these regulatory sequences enables the determination of the TFs that bind them. We used data mining tools to find TF binding motifs. Using structural TF-DNA complex information, we performed association rule mining to determine the binding residues of TFs. By combining these rules, we built a predictor of binding sites. Moreover, using the rules derived from genomic sequences together with TF sequences, our algorithm can determine the possible regulatory motifs of a given TF.
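The rule-mining step can be sketched with simple one-to-one association rules over transactions of co-occurring items (the residue/motif item names and the thresholds below are hypothetical, not from our data):

```python
from itertools import combinations
from collections import Counter

def mine_rules(transactions, min_support=0.5, min_conf=0.8):
    """Mine one-to-one association rules A -> B, returning tuples
    (antecedent, consequent, support, confidence)."""
    n = len(transactions)
    item_counts, pair_counts = Counter(), Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n < min_support:
            continue  # pair too rare to be interesting
        for ante, cons in ((a, b), (b, a)):
            conf = c / item_counts[ante]
            if conf >= min_conf:
                rules.append((ante, cons, c / n, conf))
    return rules
```

Full Apriori-style mining would also handle multi-item antecedents; this sketch covers only the single-antecedent case.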