
WP4: High-throughput methods - Microarrays and GWAS

Tools for robust analysis in genome-wide association studies using STATA

Genetic association studies (GAS) and genome-wide association studies (GWAS) can be analyzed with a variety of statistical techniques, but a common problem is that the underlying model of inheritance is unknown. Several approaches have been proposed for deriving robust procedures that detect the true underlying model of inheritance and, at the same time, perform the analysis while maximizing power and preserving the nominal type I error rate. The primary goal of this work is to implement as many robust methods as possible within the statistical package STATA and subsequently to make the software available to the scientific community. Robust methods based on the MAX statistic, the MERT statistic and MIN2, as well as the GMS and GME procedures, were implemented in STATA, and immediate commands were constructed. The main difficulty in implementing these methods is that they are computationally intensive: with the exception of MERT, the asymptotic properties of the estimators cannot be derived analytically, so other methods are needed. For MAX, GMS and GME, we used several fast Monte Carlo simulation methods to calculate accurate p-values, whereas for MIN2 we relied on numerical integration. This is the first complete effort to implement procedures for robust analysis and selection of the appropriate genetic model in GAS or GWAS using STATA. Since only a few software implementations of robust methods are available for meta-analysis of GAS or GWAS, our future goal is to extend our software to meta-analysis using STATA.
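To make the idea behind the MAX statistic concrete, the following is a minimal sketch in Python rather than STATA. It assumes the common definition of MAX as the largest absolute Cochran-Armitage trend statistic over the recessive, additive and dominant scorings, and it estimates the p-value with a plain label permutation, which is a simple stand-in and not the fast Monte Carlo methods used in the STATA commands. All function names are hypothetical.

```python
import math
import random

# Genotype scores (0, x, 1) for the three classical inheritance models.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}

def trend_z(cases, controls, scores):
    """Cochran-Armitage trend statistic for a 2x3 genotype table."""
    r_tot, s_tot = sum(cases), sum(controls)
    n_tot = r_tot + s_tot
    n = [r + s for r, s in zip(cases, controls)]
    t = sum(x * (s_tot * r - r_tot * s)
            for x, r, s in zip(scores, cases, controls))
    sx = sum(x * k for x, k in zip(scores, n))
    sxx = sum(x * x * k for x, k in zip(scores, n))
    var = r_tot * s_tot / n_tot * (n_tot * sxx - sx * sx)
    return t / math.sqrt(var) if var > 0 else 0.0

def max_stat(cases, controls):
    """MAX = largest |Z| over the three genetic models."""
    return max(abs(trend_z(cases, controls, s)) for s in SCORES.values())

def max_test(cases, controls, n_perm=2000, seed=1):
    """Return (MAX, permutation p-value) for genotype counts (AA, Aa, aa)."""
    rng = random.Random(seed)
    observed = max_stat(cases, controls)
    totals = [r + s for r, s in zip(cases, controls)]
    # One entry per individual; shuffling and taking the first n_cases
    # entries as cases is equivalent to permuting case/control labels.
    genotypes = [g for g in range(3) for _ in range(totals[g])]
    n_cases = sum(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(genotypes)
        perm_cases = [0, 0, 0]
        for g in genotypes[:n_cases]:
            perm_cases[g] += 1
        perm_controls = [totals[g] - perm_cases[g] for g in range(3)]
        if max_stat(perm_cases, perm_controls) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

For example, `max_test([20, 50, 30], [50, 40, 10])` takes hypothetical case and control genotype counts and returns the MAX statistic with its estimated p-value. The permutation is needed because the three trend statistics are correlated, so the maximum does not follow a standard normal distribution.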

Field Synopsis and Risk Estimation of Cardiovascular Diseases using Genetic Factors

Genetic tests are expected to improve the prediction of a healthy person’s probability of developing cardiovascular diseases. We conducted a field synopsis of 163 studies reporting statistically significant associations with gene polymorphisms. A total of 340 genes involving 558 gene polymorphisms were reported. The overall risk for three cardiovascular subtypes and five different individual groups was calculated under a multiplicative approach. The BioCompendium tool was used to uncover molecular and metabolic characteristics of the selected genes. Nine biochemical pathways were constructed, with the cytokine-cytokine receptor interaction pathway containing the most genes, implicating inflammation in the development of atherosclerosis.
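The abstract does not spell out the multiplicative approach. A minimal sketch in Python, assuming that the per-variant relative risks act independently and are simply multiplied, might look as follows. The function name and the example risk values are hypothetical and are not taken from the field synopsis.

```python
import math

def combined_relative_risk(relative_risks):
    """Multiply per-variant relative risks (assumes independent effects)."""
    if any(rr <= 0 for rr in relative_risks):
        raise ValueError("relative risks must be positive")
    # Summing logs is numerically safer than a long running product.
    return math.exp(sum(math.log(rr) for rr in relative_risks))

# Hypothetical per-variant relative risks for one individual's genotypes.
example = [1.2, 1.5, 0.8]
print(round(combined_relative_risk(example), 2))
```

Under this assumption a protective variant (relative risk below 1) lowers the combined estimate, and an individual carrying no listed variants has a combined relative risk of 1.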

Field Synopsis and Construction of genomic Profile for the Prediction of colorectal Cancer

Colorectal cancer (CRC), the third most common cancer type, calls for genetic tests that can identify individuals at high risk before the onset of the disease, at a time when primary prevention strategies could be safely administered. We conducted a field synopsis in which a total of 129 genes with 180 SNPs were found to be statistically significantly associated with CRC, and 5 genes with 4 SNPs with adenoma. The overall risk for CRC and for different individual groups was calculated under a multiplicative approach (RRc=20.00 for Whites). The BioCompendium tool was used to test the molecular and biochemical function of the selected genes. Nine biochemical pathways were constructed, with the Bladder Cancer, Focal Adhesion and Cytokine-Cytokine Receptor Interaction pathways involving the majority of the genes, while several oncogenes were also present.

Field Synopsis and Risk Estimation of Diabetes Mellitus using Genetic Factors

Diabetes mellitus has reached epidemic proportions worldwide. This makes understanding the pathogenesis of diabetes of utmost importance for the implementation of national diagnostic and treatment strategies. To improve the predictive value of the genetic tests currently in use, a field synopsis was conducted and the overall risk of diabetes for different individual groups was calculated under a multiplicative approach. A total of 6 genes with a statistically significant genetic association for T1D and 73 genes with 120 polymorphisms for T2D were retrieved. The BioCompendium tool was used to capture the molecular and biochemical characteristics of the selected genes. The analysis revealed that certain genes are associated with both T2D and some cancer types; further research is needed to elucidate this finding.

Field Synopsis and Risk Estimation of Autoimmune Diseases using Genetic Factors

Autoimmune diseases are currently the major cause of chronic illness, affecting the health of more people than heart disease or cancer. Thus, the development of genetic tests is necessary for prevention strategies and enhanced therapeutic protocols. We conducted a systematic review of 149 studies reporting statistically significant associations between gene polymorphisms and five autoimmune diseases. A total of 402 genes involving 863 gene polymorphisms were reported. The overall risk for the five autoimmune diseases (Crohn’s disease [CD], ulcerative colitis [UC], systemic lupus erythematosus [SLE], systemic sclerosis [SS] and multiple sclerosis [MS]) and for different individual groups was calculated under a multiplicative approach. The BioCompendium tool was used to uncover molecular characteristics and common metabolic pathways of interrelated genes. The finding that a few of the genes we identified are already known to be responsible for autoimmune diseases and the innate immune response increases the value of our method and opens the way for commercially available prognostic genetic tests.

Field Synopsis and Risk Estimation of Breast Cancer using Genetic Factors

Breast cancer (BC) is the most prevalent cancer in women worldwide. In 2012, 1.7 million new cases were diagnosed worldwide. In Western Europe in particular, 90 new cases per 100,000 women are diagnosed annually, compared with 30 per 100,000 in eastern Africa. It is estimated that 231,840 new cases of invasive breast cancer will be diagnosed among women in the USA during 2015. In the present study, we conducted a field synopsis in order to construct a gene profile revealing the genetic predisposition to BC. A total of 180 genes with 285 polymorphisms showing a statistically significant genetic association were identified. The overall risk for BC and for various ethnic groups was calculated under a multiplicative approach. The BioCompendium tool was used to evaluate our method by investigating the function of the identified genes, their interrelation and their participation in biochemical pathways already known to be involved in BC development.

Identification of differentially expressed genes in cardiovascular diseases: a Meta-analysis

In this study, we performed a meta-analysis of gene expression studies in order to identify genes that are differentially expressed in cardiovascular diseases (CVD). To this end, we combined information from multiple sources. For myocardial infarction (MI), four GEO datasets, including 31,180 genes, 93 patients with MI and 89 controls, were retrieved. For coronary artery disease (CAD), 9 GEO datasets that included 26,156 genes, 838 CAD patients and 644 controls were collected. Thus, two independent meta-analyses, for MI and CAD, were conducted. From the meta-analyses, we identified a total of 2,101 and 288 differentially expressed genes that are potentially associated with MI and CAD, respectively; the FDR threshold was set at 0.01. The two diseases share 33 differentially expressed genes. Several other methods of multiple testing correction (Sidak, Bonferroni, Holm, Holland) were also applied in order to reduce the number of false positive genes. These genes could therefore be used as potential biomarkers for the diagnosis, prognosis and monitoring of different types of CVD. In this way, gene expression signatures associated with CVD risk were discovered.
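The multiple testing corrections named above can be illustrated with a short sketch in Python. It shows textbook versions of the Bonferroni, Sidak and Holm adjustments and the Benjamini-Hochberg procedure, which is assumed here to be the FDR method (the abstract does not name one). The Holland procedure is omitted, and the function names and example p-values are hypothetical rather than taken from the study.

```python
def bonferroni(pvals):
    """Single-step Bonferroni adjustment: p * m, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def sidak(pvals):
    """Single-step Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]

def holm(pvals):
    """Holm step-down adjustment (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps adjusted values monotone.
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k  # rank of this p-value in ascending order
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

# Hypothetical per-gene p-values from a meta-analysis.
pvals = [0.01, 0.04, 0.03, 0.005]
significant = [q <= 0.01 for q in benjamini_hochberg(pvals)]
```

A gene is called differentially expressed when its adjusted value falls at or below the chosen threshold, for instance 0.01 as in the abstract. The family-wise methods (Bonferroni, Sidak, Holm) are stricter than FDR control and therefore typically return fewer genes.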

Efficient radiation therapy is characterized by enhanced tumor cell killing involving the activation of the immune system (tumor immunogenicity), while at the same time minimizing chronic inflammation and radiation adverse effects in healthy tissue. The aim of this study was to identify gene products involved in immune and inflammatory responses upon exposure to ionizing radiation by using various bioinformatic tools. Ionizing radiation is known to elicit different effects at the cellular and organismal levels, i.e. the DNA Damage Response (DDR), DNA repair, apoptosis and, most importantly, systemic effects through the instigation of inflammatory ‘danger’ signals and the activation of the innate immune response. Genes implicated in both radiation and immune/inflammatory responses were collected manually from the scientific literature using a combination of relevant keywords. The experimentally validated and literature-based results were inspected, and genes involved in radiation, immune and inflammatory responses were pooled. This kind of analysis was performed for the first time for both healthy and tumor tissues. In this way, a set of 24 genes common to all three phenomena was identified. These genes were found to form a highly connected network. Useful conclusions are drawn regarding the potential application of these genes as markers of response to radiation for both healthy and tumor tissues through the modulation of immune and/or inflammatory mechanisms.

Unraveling the mechanisms of extreme radioresistance in prokaryotes: Lessons from nature

Over the last 50 years, a variety of archaea and bacteria able to withstand extremely high doses of ionizing radiation have been discovered. Several lines of evidence suggest a variety of mechanisms explaining the extreme radioresistance of microorganisms usually found in isolated environments on Earth. These findings are discussed thoroughly in this study. Although none of the strategies discussed here appears to be universal against ionizing radiation, a general trend was found. There are two cellular mechanisms by which radioresistance is achieved: a) protection of the proteome and DNA from damage induced by ionizing radiation and b) recruitment of advanced and highly sophisticated DNA repair mechanisms in order to reconstruct a fully functional genome. In this review, we critically discuss various protective mechanisms (antioxidant enzymes, presence or absence of certain elements, high metal ion or salt concentration etc.) and repair mechanisms (Homologous Recombination, Single-Strand Annealing, Extended Synthesis-Dependent Strand Annealing) that have been proposed to account for the extraordinary abilities of radioresistant organisms, as well as the homologous radioresistance signature genes in these organisms. In addition, based on structural comparative analysis of major radioresistant organisms, we suggest future directions and how humans could innately improve their resistance to radiation-induced toxicity, based on this knowledge.

Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. Several permutation and bootstrap t-test methods have been proposed for finding differentially expressed genes. In this study, we implemented in STATA the Significance Analysis of Microarrays (SAM) method and four previously proposed modified t-test methods. The source code of the methods was tested on different data sets and is available in ttest_stata_code.zip for public use.
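As an illustration of the permutation idea behind such methods, the following Python sketch computes a SAM-style moderated statistic for one gene and a permutation p-value. It is an assumption-laden simplification and not the contents of ttest_stata_code.zip: the fudge constant `s0` is supplied by the user here, whereas SAM estimates it from all genes, and SAM itself calls genes significant by comparing ordered statistics with their permutation expectations rather than by per-gene p-values. Function names are hypothetical.

```python
import math
import random

def moderated_t(a, b, s0=0.0):
    """SAM-style statistic: mean difference / (pooled standard error + s0)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    se = math.sqrt((1.0 / na + 1.0 / nb) * ss / (na + nb - 2))
    denom = se + s0
    return (ma - mb) / denom if denom > 0 else 0.0

def permutation_pvalue(a, b, n_perm=2000, seed=1, s0=0.0):
    """Two-sided p-value obtained by shuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(moderated_t(a, b, s0))
    pooled = list(a) + list(b)
    na = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(moderated_t(pooled[:na], pooled[na:], s0)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical log-expression values for one gene in two conditions.
disease = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]
control = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
p_value = permutation_pvalue(disease, control, s0=0.1)
```

The constant `s0` prevents genes with very small variance from producing inflated statistics, which is the main difference from an ordinary t-test. Across thousands of genes, the resulting p-values would still need a multiple testing correction.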

We implemented the principal component analysis (PCA) feature selection method in the statistical program STATA. This method ranks the characteristics (variables) of a data set in order to find the features that retain the maximum information about the full data set. It can be used to select genes that play a crucial role in certain situations (e.g., association with diseases). The PCA feature selection method can also be used for analyzing microarray data: a small number of differentially expressed genes can be selected from a large data set of thousands of genes, and a gene signature can be created. The source code of the method was tested on different data sets and is available in pca_stata_code.zip for public use.
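One simple variant of PCA-based feature selection can be sketched in Python as follows. It ranks variables by the absolute value of their loading on the first principal component, found by power iteration on the covariance matrix. This ranking criterion is an assumption for illustration, since the selection rule used in pca_stata_code.zip is not described here, and all names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a samples-by-features table."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, n_iter=500, tol=1e-12):
    """First principal component via power iteration.

    Assumes the uniform start vector is not orthogonal to the leading
    eigenvector; a production version would use a full eigensolver.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero covariance: no direction to find
        w = [x / norm for x in w]
        converged = max(abs(x - y) for x, y in zip(w, v)) < tol
        v = w
        if converged:
            break
    return v

def rank_features(rows, names):
    """Sort features by absolute loading on the first component."""
    loadings = leading_component(covariance_matrix(rows))
    return sorted(zip(names, loadings), key=lambda t: abs(t[1]), reverse=True)

# Hypothetical expression table: 5 samples (rows) by 3 genes (columns).
data = [
    [1.0, 5.0, 2.0],
    [3.0, 5.1, 2.0],
    [5.0, 4.9, 2.0],
    [7.0, 5.0, 2.0],
    [9.0, 5.1, 2.0],
]
ranking = rank_features(data, ["g1", "g2", "g3"])
top_gene = ranking[0][0]
```

Because the covariance matrix is used without scaling, genes with larger variance dominate the ranking. Standardizing each gene first (equivalent to using the correlation matrix) is a common alternative when genes are measured on different scales.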

Key challenges for the creation and maintenance of specialist protein resources

As the volume of data relating to proteins increases, researchers rely more and more on the analysis of published data, thus increasing the importance of good access to these data that vary from the supplemental material of individual articles, all the way to major reference databases with professional staff and long-term funding. Specialist protein resources fill an important middle ground, providing interactive web interfaces to their databases for a focused topic or family of proteins, using specialized approaches that are not feasible in the major reference databases. Many are labors of love, run by a single lab with little or no dedicated funding and there are many challenges to building and maintaining them. This perspective arose from a meeting of several specialist protein resources and major reference databases held at the Wellcome Trust Genome Campus (Cambridge, UK) on August 11 and 12, 2014. During this meeting some common key challenges involved in creating and maintaining such resources were discussed, along with various approaches to address them. In laying out these challenges, we aim to inform users about how these issues impact our resources and illustrate ways in which our working together could enhance their accuracy, currency, and overall value.