This month, GT brings you a technical guide covering best practices in GWAS. Over the past two years, the number of genome-wide association studies has increased dramatically. With that has come whopping numbers of new disease-associated genes.

While two years has given researchers time to perfect the discovery phase of their GWAS, there has recently been a big push to put standards into place that move these studies toward diagnostic and therapeutic value. Increasingly important are larger data sets for replication studies, integration of different phenotype data across studies, maintenance of individual patient privacy, and keeping pace with changing technologies.

In this guide, we’ve culled expert opinion on best practices from the forefront of genome-wide association study analysis. Our questions cover a broad range of topics, from how to collaborate successfully to how to maintain patient privacy among distributed data sets, the best way to deal with false positives in replication studies, and more. For answers to all your GWAS questions, don’t miss our experts’ timely advice as well as the helpful resource section at the end.

— Jeanene Swanson

Index of Experts

Many thanks to our tireless experts for taking the time to contribute to this technical guide, which would not be possible without them.

Yohan Bossé
Laval University

Stephen Chanock
NCI

Jeanette Erdmann
University of Lübeck

Hakon Hakonarson
The Children’s Hospital of Philadelphia

Kevin Jacobs
NCI

Andrew Patterson
The Hospital for Sick Children, Toronto

Meredith Yeager
NCI

Q1: What makes a successful collaboration when it comes to performing studies on large sample sets?

The high standards for reporting and replicating GWA studies have created a pressing need for collaboration. The large numbers of clinically well-characterized samples required to conduct GWA studies have made such an enterprise practically impossible for a single investigator or site. This has led to an unprecedented willingness among geneticists to share and combine their data. Many successful collaborative networks have now been created; the Genetic Association Information Network and the Wellcome Trust Case Control Consortium are good models.

A successful collaboration is a multifaceted endeavor that requires careful planning. Planning is especially important for future data sharing. The importance of proper consent documents and approvals by IRB cannot be stressed enough. Common problems with consents include restriction of data use for a single study or investigator and failure to address the following points: data sharing, potential risks for the participants, options for withdrawals, and discussion about genetic research. A setup that makes individual-level genotype and phenotype data available greatly facilitates collaborations.

Collaboration can also be facilitated by establishing the roles of all parties from the start. Proper discussions about intellectual property and publication timing must take place early in the process. These are delicate negotiations where important issues such as career advancement and grant money are on the line, but often not articulated openly. A big challenge for collaborations in GWA studies is the difficulty of acknowledging individual contributions. The current practice of institutions and funding bodies of evaluating scientists’ performance on poor indicators (number of papers, position in author lists, and journal impact factor) creates bumpy roads for good collaborations in large-scale projects such as GWA studies. Novel ways of acknowledging authors and contributors are required.— Yohan Bossé

It’s really amazing how the scientific community is now putting together all its data to perform large-scale studies involving 50,000 to 100,000 individuals and even more. This development was not foreseen a few years ago and I personally feel very excited about it. Today, consortia like DIAGRAM (meta-analysis for diabetes), GIANT (meta-analysis for anthropometric phenotypes), and CARDIoGRAM (meta-analysis for myocardial infarction and coronary artery disease) have emerged; the latter is coordinated by the University of Lübeck.

In our experience from the recent papers we published in Nature Genetics, each with more than 65 co-authors, the prerequisite for a fruitful collaboration is open discussion among all participants throughout the whole project. The framework for such a collaboration should be defined very early. A written agreement describing the terms of collaboration is very helpful, as this avoids awkward discussions about the role of each participant during the project. The role of each partner should be defined precisely in this agreement, so everyone is aware of their rights and duties.

To keep the group together, regular conference calls and updates by e-mail are essential. The whole group needs to be involved in the project over the full period.— Jeanette Erdmann

Availability of DNA samples, good quality phenotypes, and willingness to share data are key factors. Experience with high-throughput genotyping/sequencing and the proper infrastructure to be able to handle and analyze large datasets are the other components.— Hakon Hakonarson

The most important are consistent phenotyping and measurement of traits across sites, as well as clarity about inclusion and exclusion criteria. In addition, meticulous attention to both the quality and quantity of DNA for genotyping is necessary. Tracking of samples and data is also necessary.— Andrew Patterson

In our experience, successful collaboration is achieved when a group of investigators, epidemiologists, biostatisticians, geneticists, etc., work together and bring their diverse sets of expertise to the project. This helps to ensure that all angles of these studies are explored most effectively. When combining information from different studies, it is also important to pay close attention to the harmonization of phenotype and other metadata, as even minor differences in semantics can alter one’s findings.— Meredith Yeager, Stephen Chanock, Kevin Jacobs

Q2: What do you take into account when integrating GWAS across a large number of phenotypes from multiple human tissues?

I see three main considerations when one wants to combine data from multiple studies. First, genotype data must be combined. The challenge of this task is related to the rapid development of genotyping technologies, which constantly changes the panels of SNPs genotyped in GWA studies. In addition, multiple platforms and generations of chips containing only a small number of overlapping SNPs are available. A common way to get around this problem is to probabilistically infer the missing genotypes. Major progress, facilitated by the availability of large genotype databases (e.g. HapMap), has been made in this field. A variety of methods are currently available for imputing and testing ungenotyped markers in unrelated and family-based data. With these recent developments, it is relatively straightforward to infer the missing genotypes and combine genotyping data from multiple studies.
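The merging step Bossé describes can be illustrated with a toy sketch (hypothetical function and variable names, not any consortium's actual pipeline). SNPs typed on both chips are taken directly; SNPs missing from one panel are filled with the expected dosage 2f from a reference allele frequency f, a deliberately naive stand-in for the proper haplotype-based imputation tools (IMPUTE, MACH, BEAGLE) that real studies use:

```python
def merge_panels(panel_a, panel_b, ref_freqs):
    """panel_a/panel_b: dict snp_id -> list of dosages (0/1/2) per sample.
    ref_freqs: dict snp_id -> reference allele frequency.
    Returns dict snp_id -> (dosages_a, dosages_b); SNPs untyped in one
    panel get the HapMap-style expected dosage 2*f for every sample."""
    merged = {}
    n_a = len(next(iter(panel_a.values())))  # samples per panel
    n_b = len(next(iter(panel_b.values())))
    for snp in set(panel_a) | set(panel_b):
        if snp not in ref_freqs:
            continue  # cannot impute without a reference frequency
        f = ref_freqs[snp]
        a = panel_a.get(snp, [2 * f] * n_a)  # expected dosage if untyped
        b = panel_b.get(snp, [2 * f] * n_b)
        merged[snp] = (a, b)
    return merged
```

For example, a SNP typed only on panel A and with reference frequency 0.25 would be carried into panel B as a constant dosage of 0.5 per sample.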

The second consideration is related to phenotype integration. This task can be the easiest or the most challenging one depending on the diseases or biological traits under study. Objective biological traits that are easy and cheap to measure can be combined readily across studies. The GIANT consortium is a good example, where body mass index measurements from tens of thousands of individuals were combined to identify obesity susceptibility loci. In contrast, great heterogeneity in the classification of outcomes across studies can make phenotype data integration really difficult. Asthma is a good example, where patients can be deemed asthmatic based on patient self-report or physician diagnosis, with or without varying thresholds from objective measures of airway responsiveness. A cure for this problem is to have standard disease definitions set by well-respected organizations that are accepted and applied internationally. Unfortunately, many completed studies were not designed with data sharing and future collaboration in mind. New projects can be planned and developed with this new reality to facilitate integration across multiple studies. For completed and ongoing projects on biological traits and diseases that lack standardization, it is basically left up to the investigators of the different studies to come up with a middle-ground definition that allows comparison. So the solution for phenotype integration across multiple studies is really disease-specific.

Finally, the third consideration is related to study design and data analyses. The first question that must be asked when combining multiple studies is whether study designs are similar enough to allow individual genotype and phenotype data to be combined prior to analysis. The alternative is to perform the analyses for each study individually and then combine pre-computed association data, which becomes more like a meta-analysis. The latter is favored when heterogeneity across studies exists in terms of genetic background and environmental exposures. In many cases, I have found it more informative to report both individual and combined genetic association results. Accordingly, the best solution in terms of data analysis is specific to the nature of the studies being combined.— Yohan Bossé
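Combining pre-computed association data, as described above, is typically done by fixed-effect inverse-variance meta-analysis of the per-study effect estimates. A minimal sketch (the function name is ours, not from any package):

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis of per-study effect
    estimates (betas) and their standard errors (ses). Each study is
    weighted by 1/se^2; returns the pooled beta, its standard error,
    and a two-sided z-test p-value."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    # two-sided normal p-value via the error function
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return beta, se, p
```

Two studies each estimating beta = 0.1 with se = 0.05 pool to the same beta with a smaller standard error, which is exactly why underpowered individual scans become informative in combination.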

The handling of these very large data sets is really a demanding task. In our team we have database administrators who ensure the correct merging of the different data sets. However, before we start an analysis, a very stringent quality control of the data is a prerequisite.— Jeanette Erdmann

The ancestry of the population is key, and one needs to ensure there is homogeneity among cases and controls; optimal population matching and high-quality genotype data will give you the answers if you have a large enough sample size. GWAS is a numbers game where the allele frequencies, effect sizes (the unknown), population size, and quality of data determine your success.— Hakon Hakonarson
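The "numbers game" can be made concrete with a rough power calculation for a case-control allele-frequency comparison under the standard normal approximation (a back-of-the-envelope sketch with an invented function name; dedicated tools such as CaTS or the Genetic Power Calculator do this properly):

```python
import math
from statistics import NormalDist

def allele_test_power(p_cases, p_controls, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allele-frequency comparison at
    significance level alpha (genome-wide by default), using the normal
    approximation and 2N alleles per group of N individuals."""
    nd = NormalDist()
    se = math.sqrt(p_cases * (1 - p_cases) / (2 * n_cases)
                   + p_controls * (1 - p_controls) / (2 * n_controls))
    ncp = abs(p_cases - p_controls) / se   # noncentrality of the z statistic
    z_crit = nd.inv_cdf(1 - alpha / 2)     # ~5.45 for alpha = 5e-8
    return nd.cdf(ncp - z_crit)
```

For a risk-allele frequency of 0.32 in cases versus 0.30 in controls, 5,000 cases and controls give essentially no power at genome-wide significance, while 20,000 of each give reasonable power, which is why the 50,000-to-100,000-sample consortia mentioned above exist.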

We certainly perform GWAS on numerous traits measured from the same individuals, some traits from the same organ, others from across different tissues. Typically we make no adjustment for the analysis of multiple traits. The only exception may be when we have measures of traits that are very highly correlated, for example, adjacent lipid fractions from density gradient ultracentrifugation.— Andrew Patterson

Q3: What guidelines do you follow to maintain patient privacy?

We have implemented a multi-layer model to conduct genomic research that involves three parties: 1) medical staff who see study participants and collect clinical data; 2) tissue banks that store and manage biological materials; and 3) laboratory staff, researchers, and analysts who generate and interpret genomic data. Scientific and ethics committees oversee the entire pipeline. All samples are coded at the tissue bank layer and only electronic files of de-identified individual-level participant data are submitted to the next layer. In our system, the geneticists never see the study participants and do not have access to any information that could identify them. Research participants enrolled in a new study are invited to sign a specific informed consent for the study. They are also invited to sign a second, broader informed consent to have their tissues and clinical data stored in the tissue bank and used for future genetic/genomic research. Individuals who give consent for broader uses are not re-contacted for future studies conducted with their tissues and data.— Yohan Bossé

Within the CARDIoGRAM consortium and other collaborations, we decided to share only summary statistics and never any individual genotype information. As far as I know, this is how all large consortia handle the issue of patient privacy. Summary statistics allow meta-analysis of the GWAS data; however, no individual patient data are shared between the different collaborators.— Jeanette Erdmann

We have all samples/data encrypted multiple times so there is 100 percent protection of privacy. The researchers who handle the clinical samples/phenotypes are uncoupled from the genotypes to protect privacy. We do not give information back to the patients for that same reason.— Hakon Hakonarson

We use identifiers that are only linked to identifiable information by the study investigators so that people in the lab, as well as those performing the statistical analysis, cannot associate the samples with any identifying information. In addition, genotype data are stored on secured servers that are not accessible to the Internet.— Andrew Patterson

Patient protection is central in our minds during the conduct of GWAS. No patient-identifying information is ever requested or received at our genotyping center. We rely on our clinical collaborators to supply anonymized DNA samples and phenotype information. All requests for access to data generated in our labs must be approved by data access committees. Applicants must be bona fide researchers and must agree to a data use certification agreement that requires careful control of data and forbids any action that would jeopardize the privacy of the GWAS participants.— Meredith Yeager, Stephen Chanock, Kevin Jacobs

Q4: How do you handle data sharing and data management?

Nowadays, this question can imply either data sharing among a group of collaborators working on a specific disease or broader data sharing to the entire scientific community. In both instances, it is important to keep the data secure. This goes beyond password-protected electronic databases that are controlled by local investigators.

During the last few years, data sharing structures have been created to facilitate broad distribution of data generated with public funds. For GWA studies, investigators can now submit their data to the database of Genotypes and Phenotypes (dbGaP) or the European Genotype Archive. These databases were developed to archive and distribute the results derived from GWA studies in order to accelerate scientific discoveries and, at the same time, protect the identity of study participants.— Yohan Bossé

Database administrators are in charge of data management. Every computer script written in our group for data merging or data analysis is cross-checked. Our statisticians undertake very stringent QC before data analysis. We document every step of the QC, and the resulting documents are shared with collaborators.

Before we share data with other groups, we have two independent individuals cross-check the relevant data files to ensure that only correct data are shared with others. In this era of data sharing between so many different groups, everyone has to ensure that he or she complies with very high standards of data quality. In my view, it would be helpful to define more general guidelines for sharing GWAS data. Such guidelines could be of great help for future collaborations.— Jeanette Erdmann

In terms of data sharing, this is typically done using summary statistics from each study (e.g. the SNP-specific parameter estimates, standard errors, and p-values). Care is required to ensure consistent allele coding between labs, especially for imputed SNPs and across different genotyping platforms.— Andrew Patterson
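The allele-coding harmonization Patterson raises can be sketched as follows (an illustrative routine with invented names, not part of any published pipeline): align each study's effect estimate to a reference coding, flipping the sign when the effect allele is swapped and trying the complementary strand when alleles don't match directly. Strand-ambiguous A/T and C/G SNPs are rejected outright, since allele labels alone cannot resolve their strand:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def harmonize(effect_allele, other_allele, ref_effect, ref_other, beta):
    """Align a study's effect estimate to a reference allele coding.
    Returns beta (possibly sign-flipped), or None if the alleles cannot
    be reconciled or the SNP is strand-ambiguous."""
    a1, a2 = effect_allele.upper(), other_allele.upper()
    if {a1, a2} in ({"A", "T"}, {"C", "G"}):
        return None  # palindromic SNP: strand cannot be resolved
    for flip_strand in (False, True):
        e, o = (COMPLEMENT[a1], COMPLEMENT[a2]) if flip_strand else (a1, a2)
        if (e, o) == (ref_effect, ref_other):
            return beta   # same effect allele, same direction
        if (o, e) == (ref_effect, ref_other):
            return -beta  # effect allele swapped: flip the sign
    return None           # incompatible alleles: likely a data error
```

A study reporting its effect for G where the reference uses A at an A/G SNP simply has its beta negated; a study typed on the opposite strand (A/C reported against a T/G reference) is matched after complementing.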

For us, proper data management starts with strict data integrity and version controls. Given the number of incremental revisions that go on during the conduct of a GWAS, one must always be sure what version of the data is being used. Requests for data access are reviewed by a project manager for completeness and then forwarded for consideration by the sitting data access committee members. If approved, authorization is added to a specially designed data access portal for the requester. This portal is password-protected and allows access to only the approved data sets. Authorized users may browse GWAS data online, run simple queries, or elect to bulk download the results using secure encrypted data transfer protocols.— Meredith Yeager, Stephen Chanock, Kevin Jacobs

Q5: How do you address issues related to quality control, normalization, analysis, and biological interpretation of your GWAS data?

We have established different procedures to ensure the quality of genotyping data. First, we include one standard HapMap sample, one study sample duplicate, and one blank on each 96-well plate. After genotyping, we perform quality control at three levels: SNPs, samples, and genotype data. For SNPs, we check call rate, minor allele frequency, Hardy-Weinberg equilibrium, and concordance with HapMap genotypes and among internal duplicates. For samples, we check call rate and sex misidentification. For genotype data, we check for population substructure and cryptic relatedness. Single-point association tests for variants passing quality control are carried out with PLINK. The choice of association tests depends on the study design and the phenotypes under investigation. We use the genomic control method to correct for possible case-control differences in genetic structure. We obtain the variance inflation factor by genotyping 100 ancestry-informative markers. Significant SNPs that resist adjustment for multiple testing are inspected in greater detail for quality control by visualizing the signal intensity cluster plot. To visualize GWA results we use quantile-quantile and genome-wide Manhattan plots. Association tests with CNVs are becoming a systematic part of GWA studies.
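The per-SNP checks described above (call rate, minor allele frequency, and a Hardy-Weinberg equilibrium test) can be sketched in a few lines; this is a toy illustration with thresholds we chose for the example (PLINK's --geno, --maf, and --hwe filters are the real-world equivalent):

```python
def snp_qc(genotypes, min_call=0.95, min_maf=0.01, max_hwe_chi2=23.9):
    """Per-SNP QC on genotypes coded 0/1/2 (None = no call).
    Checks call rate, minor allele frequency, and a 1-df chi-square
    goodness-of-fit test against Hardy-Weinberg expectations
    (chi2 > 23.9 corresponds to roughly p < 1e-6).
    Returns (passed, reason)."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0 or n / len(genotypes) < min_call:
        return False, "call rate"
    freq = sum(called) / (2 * n)        # frequency of the allele coded 1
    maf = min(freq, 1 - freq)
    if maf < min_maf:
        return False, "maf"
    # observed genotype counts vs Hardy-Weinberg expectations
    obs = [called.count(0), called.count(1), called.count(2)]
    exp = [n * (1 - freq) ** 2, n * 2 * freq * (1 - freq), n * freq ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)
    if chi2 > max_hwe_chi2:
        return False, "hwe"
    return True, "ok"
```

A SNP with 25/50/25 genotype counts sits exactly at Hardy-Weinberg expectation and passes, while an all-heterozygote SNP (a classic genotyping-artifact signature) fails the HWE check.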

Biological interpretation of GWA data is a main challenge. Associations are often found for SNPs located in gene deserts or at a significant distance from any known gene. One strategy that we have recently used is to restrict the analyses to SNPs located within 10 kb of a known gene. This is not a practice that I like to encourage, since a large and certainly important fraction of the genome is ignored, but the strategy greatly facilitates biological interpretation and helps us focus on things we understand better.

Biological interpretation also remains a challenge at many levels for SNPs located in known genes. First one needs to identify the causal variant. Then experimental proofs are required to demonstrate the molecular effect of the variant on the gene product and the disease/phenotype. All these functional laboratory-based validations are highly dependent on the type of variants, genes, and diseases. A current and promising trend is to run additional high-throughput genomic tools (e.g. whole-genome expression arrays on relevant tissues) in parallel to the GWA studies in order to facilitate biological interpretation.

Finally, many bioinformatics databases and tools are currently available to help with the biological interpretation of GWAS data. These are especially useful in terms of data gathering and quick access to available information.— Yohan Bossé

To address the above mentioned issues we have brought together an interdisciplinary team of experts: molecular biologists, cardiologists, bioinformaticians, human geneticists, and statisticians. Each brings in a lot of experience and ensures that the project runs successfully.

Important issues like stringent quality control of the genotypic data are thoroughly discussed between the people doing the wet lab applications and the statisticians, who work with the data in the end.— Jeanette Erdmann

We have put together a streamlined process — the data are all QC’ed the same way, ancestry is determined the same way, etc. We have had great success with this system.— Hakon Hakonarson

This all starts with DNA quality, and we typically use PicoGreen to assess every sample before putting it on a chip. I know that CIDR and other sites use a set of SNPs to detect cryptic duplicates and potential sample mix-ups prior to GWAS genotyping and to fingerprint the samples so that they can be tracked through the genotyping process. In addition, if other genotype data has been generated, we compare the genotypes from that chip with the GWAS data to detect sample mix-ups. Standard quality control at the individual and SNP level are particularly important to help identify contaminated samples, poor-quality DNA, or SNPs that do not produce clear genotype clusters. For the traits, we check the distribution of values, perform appropriate transformations, and build multivariate models to take into account the effect of important covariates. Biological interpretation typically takes a long time after replication of association signals in appropriately powered studies.— Andrew Patterson
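The mix-up detection Patterson describes, comparing genotypes from an earlier chip against the GWAS data for the same nominal sample, reduces to a concordance check at overlapping SNPs. A minimal sketch (invented function name; real fingerprinting panels are chosen for high heterozygosity):

```python
def concordance(geno_a, geno_b, min_overlap=20):
    """Fraction of matching genotype calls at SNPs typed on both
    platforms for the same nominal sample (dicts: snp_id -> 0/1/2 or
    None for no-calls). Low concordance for a sample that should be
    identical to itself suggests a swap or contamination; returns None
    when too few overlapping calls exist to judge."""
    shared = [s for s in geno_a if s in geno_b
              and geno_a[s] is not None and geno_b[s] is not None]
    if len(shared) < min_overlap:
        return None
    matches = sum(geno_a[s] == geno_b[s] for s in shared)
    return matches / len(shared)
```

In practice one would flag any self-comparison below roughly 0.98, since duplicate genotyping of the same DNA should be nearly perfectly concordant.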

Our group is involved in nearly a dozen GWAS projects and has formed a core team of bioinformaticians and geneticists to process the resulting data, apply quality control procedures, harmonize genomic metadata, manage the phenotype data, perform association analysis, and aid in the interpretation of the results. In addition to developing a core group with significant expertise, this centralized approach allows for the development of increasingly sophisticated tools and methods, as insights and solutions applied to one scan may be immediately leveraged in all of the others.— Meredith Yeager, Stephen Chanock, Kevin Jacobs

Q6: What are your recommendations for dealing with false positives in replication studies?

A big challenge for GWA studies is to separate true associations from the large numbers of false positives. The current gold standard to achieve this is to replicate the findings in independent data sets. However, what constitutes a proper replication is still a matter of debate. Specific criteria for unequivocally establishing a valid genotype-phenotype association are not available.

Follow-up studies of initial findings are more credible when the same SNP (or a SNP in tight linkage disequilibrium) and the same direction of risk conferred by a given allele are found. One must also consider how well the replication studies mirror the phenotype and genetic background of the initial study. The strength of the genotype-phenotype association (p-value) must also be considered. However, a fixed p-value threshold cannot be used, as it is influenced by many factors, including the total number of genetic variants validated in the replication sets and the type of variants. It is a common trend to relax the replication threshold for non-synonymous SNPs or for functional genetic variants supported by laboratory experiments. Similarly, an association with a biologically relevant gene is also more convincing.— Yohan Bossé

I recommend analyzing GWAS data in a three-stage study design. I would suggest starting with a screening GWAS of meaningful size (at least 1,000 cases and 1,000 controls for common complex diseases). The second step should be an in silico replication step, meaning that replication is sought in available GWAS data from collaborators. The threshold for SNPs to be carried into the in silico stage should be less than 10⁻³. In most GWAS this means you have to replicate several hundred SNPs. After the in silico replication step, one will be left with only a few SNPs showing nominal replication and an effect in the same direction in every GWAS data set. These SNPs (I would call them the really interesting ones) will then go into further replication. As the number of positively replicated SNPs left is very likely relatively small (not more than 5-10), the third step could be a wet lab replication step.

We claim a SNP is rock-solid associated with the phenotype if the combined p-value of the three stages is less than 5×10⁻⁸.— Jeanette Erdmann
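One common way to form a combined p-value across stages like these is Stouffer's sample-size-weighted Z method (signed-Z or inverse-variance meta-analysis are equally standard; Erdmann's answer does not specify which her group uses, so treat this as one illustrative option):

```python
import math
from statistics import NormalDist

def stouffer_combined_p(p_values, sample_sizes):
    """Combine one-sided stage-wise p-values using Stouffer's method,
    weighting each stage's z-score by the square root of its sample
    size. Returns the combined one-sided p-value."""
    nd = NormalDist()
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * nd.inv_cdf(1 - p) for w, p in zip(weights, p_values))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1 - nd.cdf(z)
```

Three stages each at p = 0.01 with equal sample sizes combine to roughly 3×10⁻⁵, illustrating how modest but consistent evidence across independent stages accumulates toward a genome-wide-significant combined result.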

There are always certain SNPs that misbehave to give you false positives, but these are relatively easy to sort out by finding SNPs in LD and genotyping on a different platform for the replication. The multiple testing performed in GWAS requires replication in ideally two or more independent populations. If this is done and the data is quality controlled, there is matching in ancestry and multiple SNPs support the signal, one can trust the results if the study was GW significant to begin with and replicates to a significant p-value.— Hakon Hakonarson

The use of appropriate multiple testing correction is the most common approach to dealing with false positives. Replication in independent datasets is the most obvious way to deal with false positives, but we have to be very careful about the comparability of the discovery and replication datasets. Many studies have complex and subtle ascertainment schemes, as well as participation biases, which can make it rather difficult to identify any single study that could be considered a true mirror of the discovery dataset. It is therefore important to describe differences between datasets, since these could be responsible for non-replication. The other important consideration is bias in effect size estimates obtained from the discovery dataset, commonly known as the winner's curse (or Beavis effect). This can result in an inflated estimate in the discovery dataset and consequently require much larger replication studies to have sufficient power to detect an unbiased effect size.— Andrew Patterson
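The multiple testing corrections Patterson mentions come in two standard flavors: the Bonferroni bound (controlling the family-wise error rate, the basis of the 5×10⁻⁸ genome-wide threshold) and the Benjamini-Hochberg step-up procedure (controlling the false discovery rate). A minimal sketch of both:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: reject hypothesis i if p_i < alpha / m,
    controlling the family-wise error rate at alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest k with
    p_(k) <= k*q/m over the sorted p-values and reject the k smallest,
    controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject
```

On the same set of p-values, Benjamini-Hochberg typically rejects more hypotheses than Bonferroni, which is why FDR control is popular for hypothesis-generating scans while the stricter Bonferroni-style threshold is reserved for declaring genome-wide significance.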

The simplest approach to dealing with them is through additional replication studies that increase the sample size and will hopefully add additional evidence for a true association or exclude the finding. Investigators should be aware of opportunities for collaboration, data sharing, and meta-analysis. Likewise, each investigator should attempt to share their data and results, so that others may leverage their findings. To do otherwise would be inefficient, slow our progress, and ultimately add to the already significant burden caused by serious diseases.— Meredith Yeager, Stephen Chanock, Kevin Jacobs

List of Resources

Our panel of experts referred to a number of publications and online tools that may be able to help you get a handle on genome-wide association studies. Whether you’re a novice or a pro at best practices in GWAS, these resources are sure to come in handy.