G protein-coupled receptor 40 (GPR40) has become an attractive target for the treatment of diabetes since it was shown clinically to promote glucose-stimulated insulin secretion. Herein, we report our efforts to develop highly selective and potent GPR40 agonists with a dual mechanism of action, promoting both glucose-dependent insulin and incretin secretion. Employing strategies to increase polarity and the ratio of sp3/sp2 character of the chemotype, we identified BMS-986118 (compound 4), which showed potent and selective GPR40 agonist activity in vitro...

Background: The selection, development, or comparison of machine learning methods in data mining can be difficult, depending on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists...

We utilized evidence for enhancer-promoter interactions from functional genomics data to build biological filters that narrow down the search space for two-way Single Nucleotide Polymorphism (SNP) interactions in Type 2 Diabetes (T2D) Genome Wide Association Studies (GWAS). This has led us to the identification of a reproducible, statistically significant SNP pair associated with T2D. As more functional genomics data are being generated that can help identify potentially interacting enhancer-promoter pairs in a larger collection of tissues/cells, this approach has implications for investigation of epistasis from GWAS in general...

Noncoding DNA, once called "junk," has revealed itself to be full of function. Technology development has allowed researchers to gather genome-scale data pointing towards complex regulatory regions, expression and function of noncoding RNA genes, and conserved elements. Variation in these regions has been tied to variation in biological function and human disease. This PSB session tackles the problem of handling, analyzing and interpreting the data relating to variation in and interactions between noncoding regions through computational biology...

A central challenge of developing and evaluating artificial intelligence and machine learning methods for regression and classification is access to data that illuminates the strengths and weaknesses of different methods. Open data plays an important role in this process by making it easy for computational researchers to access real data for this purpose. Genomics has in some cases taken a leading role in the open data effort, starting with DNA microarrays. While real data from experimental and observational studies is necessary for developing computational methods, it is not sufficient...

As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset...

Electronic Health Records (EHRs) contain a wealth of patient data useful to biomedical researchers. At present, both the extraction of data and methods for analyses are frequently designed to work with a single snapshot of a patient's record. Health care providers often perform and record actions in small batches over time. By extracting these care events, a sequence can be formed providing a trajectory for a patient's interactions with the health care system. These care events also offer a basic heuristic for the level of attention a patient receives from health care providers...
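
The idea of grouping recorded actions into care events can be sketched as follows. This is a minimal illustration only, not the authors' extraction method: the function name `group_care_events` and the 24-hour gap threshold are hypothetical choices made for this example. Timestamps that fall within the threshold of the previous action are placed in the same batch, and the number of resulting batches serves as a rough count of care events.

```python
from datetime import datetime, timedelta


def group_care_events(timestamps, max_gap=timedelta(hours=24)):
    """Group action timestamps into batches ("care events").

    Hypothetical helper: a new batch starts whenever the gap to the
    previous action exceeds max_gap (an assumed threshold).
    """
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1][-1] <= max_gap:
            # Close enough to the previous action: same care event.
            events[-1].append(ts)
        else:
            # Gap too large (or first action): start a new care event.
            events.append([ts])
    return events


# Made-up record: three actions on one day, two actions nine days later.
record = [
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 9, 30),
    datetime(2020, 1, 1, 15, 0),
    datetime(2020, 1, 10, 8, 0),
    datetime(2020, 1, 10, 8, 5),
]
care_events = group_care_events(record)
event_sizes = [len(event) for event in care_events]
```

The ordered list of batches gives a simple patient trajectory, and its length is one possible heuristic for the level of attention received.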

Processing of amyloid-β (Aβ) precursor protein (APP) by γ-secretase produces multiple species of Aβ: Aβ40, short Aβ peptides (Aβ37-39), and longer Aβ peptides (Aβ42-43). γ-Secretase modulators, a class of Alzheimer's disease therapeutics, reduce production of the pathogenic Aβ42 but increase the relative abundance of short Aβ peptides. To evaluate the pathological relevance of these peptides, we expressed Aβ36-40 and Aβ42-43 in Drosophila melanogaster to evaluate inherent toxicity and potential modulatory effects on Aβ42 toxicity...

Objectives: This study proposes a novel Prior knowledge guided Integrated likelihood Estimation (PIE) method to correct bias in estimations of associations due to misclassification of electronic health record (EHR)-derived binary phenotypes, and evaluates the performance of the proposed method by comparing it to 2 methods in common practice. Methods: We conducted simulation studies and data analysis of real EHR-derived data on diabetes from Kaiser Permanente Washington to compare the estimation bias of associations using the proposed method, the method ignoring phenotyping errors, the maximum likelihood method with misspecified sensitivity and specificity, and the maximum likelihood method with correctly specified sensitivity and specificity (gold standard)...
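
To see why phenotype misclassification biases an association estimate, consider the following minimal sketch. It is not the PIE method; it shows only the standard relationship between true and observed prevalence under a given sensitivity and specificity, and the textbook back-correction when both are known. All numeric values and function names are assumptions chosen for illustration.

```python
def observed_prevalence(p_true, sensitivity, specificity):
    """Prevalence seen after misclassification of a binary phenotype."""
    return sensitivity * p_true + (1.0 - specificity) * (1.0 - p_true)


def corrected_prevalence(p_obs, sensitivity, specificity):
    """Invert the relationship above when sensitivity and specificity are known."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    p = (p_obs + specificity - 1.0) / denom
    return min(max(p, 0.0), 1.0)  # clamp to a valid probability


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio comparing two group prevalences."""
    return (p_exposed / (1.0 - p_exposed)) / (p_unexposed / (1.0 - p_unexposed))


# Assumed illustrative values, not taken from the study.
se, sp = 0.80, 0.95
p1_true, p0_true = 0.30, 0.15  # true prevalence in exposed / unexposed

p1_obs = observed_prevalence(p1_true, se, sp)
p0_obs = observed_prevalence(p0_true, se, sp)

or_true = odds_ratio(p1_true, p0_true)
or_naive = odds_ratio(p1_obs, p0_obs)  # ignores phenotyping error
or_fixed = odds_ratio(
    corrected_prevalence(p1_obs, se, sp),
    corrected_prevalence(p0_obs, se, sp),
)
```

With these assumed values, the naive odds ratio is pulled toward 1 relative to the true one, while the corrected estimate recovers it. The correction depends on sensitivity and specificity being specified correctly.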

Gene-by-environment (G × E) interactions are important in explaining the missing heritability and understanding the causation of complex diseases, but a single, moderately sized study often has limited statistical power to detect such interactions. With the increasing need for integrating data and reporting results from multiple collaborative studies or sites, debate over choice between mega- versus meta-analysis continues. In principle, data from different sites can be integrated at the individual level into a "mega" data set, which can be fit by a joint "mega-analysis...

Genome-wide, imputed, sequence, and structural data are now available for exceedingly large sample sizes. The needs for data management, handling population structure and related samples, and performing associations have largely been met. However, the infrastructure to support analyses involving complexity beyond genome-wide association studies is not standardized or centralized. We provide the PLatform for the Analysis, Translation, and Organization of large-scale data (PLATO), a software tool equipped to handle multi-omic data for hundreds of thousands of samples to explore complexity using genetic interactions, environment-wide association studies and gene-environment interactions, phenome-wide association studies, as well as copy number and rare variant analyses...

Background: Understanding the factors that affect water quality and the ecological services provided by freshwater ecosystems is an urgent global environmental issue. Predicting how water quality will respond to global changes not only requires water quality data, but also information about the ecological context of individual water bodies across broad spatial extents. Because lake water quality is usually sampled in limited geographic regions, often for limited time periods, assessing the environmental controls of water quality requires compilation of many datasets across broad regions and across time into an integrated database...

The goal of this unit is to introduce epistasis, or gene-gene interactions, as a significant contributor to the genetic architecture of complex traits, including disease susceptibility. This unit begins with an historical overview of the concept of epistasis and the challenges inherent in the identification of potential gene-gene interactions. Then, it reviews statistical and machine learning methods for discovering epistasis in the context of genetic studies of quantitative and categorical traits. This unit concludes with a discussion of meta-analysis, replication, and other topics of active research...