Frequently Asked Questions

Sample Preparation FAQ

What are Complete Genomics Sample Requirements?

At this time, Complete Genomics accepts human DNA only. DNA may be extracted from blood, frozen tissue, cell lines, or saliva. Complete Genomics cannot accept DNA that has been treated with whole genome amplification (WGA) or that is derived from formalin-fixed, paraffin-embedded (FFPE) samples. The DNA sample requirements are as follows:

A260/280 ratios should be between 1.8 and 2.0. Values outside of this range suggest that there are impurities in the sample that could affect sequencing performance.
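This acceptance window is easy to check programmatically. The following sketch (a hypothetical helper, not a Complete Genomics tool) flags readings outside the 1.8 to 2.0 range:

```python
def check_purity(a260: float, a280: float) -> bool:
    """Return True if the A260/280 ratio falls within the accepted 1.8-2.0 window."""
    ratio = a260 / a280
    return 1.8 <= ratio <= 2.0

print(check_purity(1.90, 1.00))  # True: ratio 1.90 is within range
print(check_purity(1.60, 1.00))  # False: a low ratio suggests protein contamination
```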

Acceptance criteria are based only on measurements performed by Complete Genomics, and are not based on the amounts reported by the customer. Because there is inherent variability in measurement between sites and users, targeting the minimal amount (3.5 μg) will likely result in a significant number of samples failing to meet acceptance criteria and lead to a delay in sample processing. For this reason, Complete Genomics strongly encourages customers to send additional DNA when available. Unused DNA can be returned after sequencing, upon request.

Which kits are recommended for DNA extraction?

DNA samples extracted using the kits listed here have consistently provided high-quality results. In general, commercial kits are recommended because the associated reagents have been subjected to quality control before use and are not likely to introduce problems. Complete Genomics strongly recommends following the manufacturer’s guidelines with respect to the amount of cell/tissue extract loaded per column:

Overloading columns decreases DNA yield and increases the likelihood of producing ‘dirty’ DNA that will not perform well in the sequencing process.

Reducing the volume of the column washing solution may result in carryover of contamination, which interferes with the DNA fragmentation and library construction process.

* The prepIT•L2P kit (which is based on ethanol precipitation) and the PD-PR-015 whole sample protocol are recommended for extraction of DNA from saliva samples. Extracting the entire Oragene sample allows for maximum DNA recovery and concentration. At this time, the prepIT•C2D kit (which is column-based) is not recommended because it does not allow for the entire sample to be extracted with a single aliquot.

How can I ensure that my saliva-derived DNA is within the correct concentration range?

If using the recommended prepIT•L2P kit and PD-PR-015 protocol for extraction, the DNA can be eluted in 0.2 to 1 mL of TE solution. An elution volume of 500 μl is recommended to keep the DNA within the correct concentration range for Complete Genomics sequencing while ensuring that it is fully re-hydrated. Take care not to over-vortex the sample or to heat it for longer than the hour prescribed in the protocol.

How can I reduce the chance of ethanol contamination in my DNA extraction using a column-based protocol?

It is important to make sure that the ethanol is completely removed from the sample. If using a column purification method, spinning the column twice after the ethanol wash step helps to dry it completely and to ensure removal of excess alcohol prior to elution of the DNA.

What type of ethanol should I use for best results?

Wash buffers for DNA extraction should be made using high-quality “200-proof” or “absolute” ethanol. Some vendors sell versions that are “molecular biology grade” or “biotechnology grade.” Do not use denatured alcohol or denatured ethanol as they contain isopropanol, methanol, and/or other solvents that can interfere with the production of high-quality DNA.

How would I know if there is ethanol in my DNA?

Ethanol contamination in DNA can cause the following:

DNA prep smells odd or reminiscent of ethanol.

DNA “floats” back out of the well when loading a gel, even if loading dye is added to the sample.

DNA doesn’t freeze well at –20°C.

Do you accept DNA extracted with Phenol-Chloroform?

While we have successfully processed samples from customers who have used phenol-chloroform to extract DNA, we have also observed issues. These issues are likely the result of residual amounts of organic solvents that may interfere with sample handling. If phenol-chloroform has been used to extract DNA, Complete Genomics highly recommends performing a cleanup step—such as that available in the DNeasy Blood and Tissue Kit—after the extraction and before shipping to Complete Genomics.

Do you accept DNA extracted from formalin-fixed, paraffin-embedded (FFPE) samples?

Complete Genomics does not currently accept FFPE samples.

What if there is RNA in the sample?

Contaminating RNA can result in an overestimation of DNA concentration when measured using UV spectrophotometry such as NanoDrop. The use of a quantitation method that specifically measures double-stranded DNA, such as PicoGreen, should help avoid such overestimation. However, even with such assays, large amounts of contaminating RNA can still result in overestimation of DNA concentration, because regions of RNA secondary structure are double-stranded. See Which kits should I use for DNA quantitation? for more information. Contaminating RNA should not affect sequence quality.

What if there is protein in the sample?

Protein-DNA complexes have slower mobility than pure DNA when run through an agarose gel and might be present as a slower-migrating DNA band or doublet running above the primary genomic DNA band. Alternatively, protein may appear as intercalator-stained material visible inside the agarose gel wells, since it has not entered the gel. Initial results suggest that the presence of such a slow-migrating band may increase the risk of sequencing problems, including an inability to deliver CNV and SV results and an increase in small variant error rates.

To minimize the risk of sequencing problems, Complete Genomics highly recommends that you provide a replacement sample that does not contain slow-migrating DNA. If there is no replacement available, Complete Genomics recommends removing the protein by proteinase K treatment followed by column-based purification. Gel electrophoresis should be repeated to ensure that the DNA-protein complex band has been eliminated.

What is the effect of high temperature during DNA extraction?

Incubation at temperatures above ~45°C during DNA extraction can lead to the generation of single-stranded DNA. This can have the same effect as protein contamination, leading to increased risk of sequencing problems, including an inability to deliver CNV and SV results and higher small variant error rates. Some protocols recommend incubations with Proteinase K at higher temperatures. In this case, it is important to limit the amount of time the DNA sample spends at the higher temperature, ideally to a maximum of a couple of hours.

Which buffer should I use to store my extracted DNA?

To maintain DNA integrity, Complete Genomics recommends that all DNA samples be provided in 1x TE, pH 8.0. Samples should not be supplied in H2O.

My DNA is eluted in QIAGEN buffer AE. Is this acceptable?

Yes. Although buffer AE is essentially 1x TE at pH 9.0 rather than pH 8.0, sequencing has been successful with DNA eluted and stored in this buffer.

My DNA was eluted in water instead of 1xTE pH 8. What should I do?

Tris-EDTA (TE) buffer prevents DNA degradation by inactivating nucleases and maintaining a neutral pH. To prevent DNA degradation during shipment, Complete Genomics recommends adding a small amount of 10x TE to bring the final concentration up to 1x. Keep the concentration requirements in mind when doing so to avoid diluting the DNA too much.
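The amount of 10x TE needed follows from the final-concentration condition 10v / (V + v) = 1, which gives v = V/9 for a sample of volume V. A minimal sketch (the helper name is ours, not part of any protocol):

```python
def te_topup_volume(sample_ul: float) -> float:
    """Volume of 10x TE (in µl) to add to a water-eluted sample of the given
    volume so the final TE concentration is 1x: 10v/(V+v) = 1  =>  v = V/9."""
    return sample_ul / 9.0

v = te_topup_volume(90.0)
print(v)                  # 10.0 µl of 10x TE for a 90 µl sample
print(90.0 / (90.0 + v))  # 0.9: the DNA is only modestly diluted
```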

How much of the DNA source is required to obtain sufficient DNA for sequencing?

If you are planning to isolate DNA from blood, refer to the following article from QIAGEN, which investigates the effects of a combination of factors on final DNA yield for one of its recommended kits, the QIAGEN Gentra Puregene Blood Kit:

Follow the commercial instructions of the kits exactly and do not overload the columns in an attempt to get additional DNA. For more information, refer to the specifications of the kit used for extraction.

For DNA collection from saliva samples, 2 mL of saliva mixed with 2 mL of Oragene solution, with a total volume of Oragene/saliva solution of about 4 mL, should result in > 20 μg DNA (using the recommended extraction kit and protocol). Saliva sponges used to collect saliva from infants or young children will yield less DNA, as described in this DNA Genotek collection white paper:

How do I quantitate my DNA?

Complete Genomics performs several quality checks prior to accepting samples for sequencing. These include determining sample quantity using the Quant-iT™ PicoGreen® dsDNA kit from Invitrogen and determining sample integrity by electrophoresis. Both of these approaches should be used for each sample before the sample is submitted to Complete Genomics. Details on the sample requirements can be found in the Sample Submission Instructions, and details on the Sample QC performed after sample receipt can be found in the Sample Quality Control Protocol.

What if I don’t have access to a recommended DNA quantitation kit?

All samples received by Complete Genomics will be measured using PicoGreen, and the internal measurements will be used to determine which samples can proceed toward sequencing and which require replacement or top-off.

The benefit of PicoGreen for DNA measurement is that it specifically measures double-stranded DNA, the template used for Complete Genomics whole genome sequencing. When using alternate approaches such as spectrophotometry, contaminating protein, RNA, or single-stranded DNA will contribute to the measured A260 absorbance and thus result in an overestimation of the amount of DNA present in the sample. Other contaminants can also contribute to the A260 absorbance resulting in overestimation of the DNA concentration.

If there is no access to PicoGreen or another approach for DNA measurement using fluorometry, Complete Genomics strongly recommends sending at least 2 times more DNA than estimated by NanoDrop spectrophotometry or an alternate approach. Sending as much DNA as possible is the best way to ensure that there is sufficient DNA for sample acceptance. An alternate approach using fluorometry for DNA measurement is the Invitrogen Qubit® fluorometer.

Which kits should I use for DNA quantitation?

Complete Genomics recommends the following fluorometry-based kits that specifically measure the concentration of double-stranded DNA.

Note: Using the proper standards to create the standard curve is essential to determining the proper DNA quantity. Complete Genomics strongly recommends using the standards provided by the kits.

Note: Using UV absorbance to quantitate DNA may result in a different quantity measured than the assays in this table.

Note: PicoGreen can be highly sensitive to the detergent CTAB, sometimes used for DNA extraction. If you use this approach, make sure to remove any residual CTAB prior to quantitation.

To mitigate the risk of sample failure due to low quantity, Complete Genomics recommends sending as much extra DNA as possible.

Why do PicoGreen and NanoDrop Results Differ?

PicoGreen specifically measures double-stranded DNA and therefore provides a more accurate detection of DNA concentration and amount. NanoDrop spectrophotometry indirectly infers DNA concentration by determining the absorbance of light at 260 nm. Unfortunately, while double-stranded DNA has the highest absorbance at this wavelength, other molecules also absorb light, including protein, RNA, single-stranded DNA, free nucleotides, phenol, and other contaminants. Therefore, NanoDrop spectrophotometry measurements are prone to overestimation of DNA amounts due to the detection of contaminants in the sample.

Can I get information on the Sample QC protocol followed by Complete Genomics?

Yes. Details on the Sample QC performed after sample receipt can be found in the Sample Quality Control Protocol.

What if I’m not sure about my gel results?

Complete Genomics recommends that you provide a gel image to your Complete Genomics Project Manager prior to shipping your samples. If you are not sure about the compatibility of any DNA samples with our sequencing technology, the Project Manager will provide feedback on whether or not the samples are suitable to be shipped and sequenced.

What should I look for in the gel results?

DNA should be intact, with a single band indicating that it is > 20 kb, and minimal degradation. See the Sample Submission Instructions for examples of gel images.

Some agarose gels reveal a smear or band migrating above the main band of DNA, indicating that the sample contains some DNA with abnormally slow mobility (potentially single stranded DNA or the effect of protein contamination). See What if there is protein in the sample? for more information. This slower migrating material may also result from high-voltage or pH gradients (which are more likely when using TAE buffer) during electrophoresis.

What if the DNA is not in the proper concentration range?

Complete Genomics does not concentrate or dilute samples that are outside of the required concentration range, as described in the Sample Submission Instructions. The liquid handling systems used for sample and library preparation have not been validated for concentrations and volumes outside of the required ranges. Because samples cannot be accepted outside of the concentration range, and there can be variability between customer measurements and Complete Genomics measurements, we recommend that the sample have a concentration well within the range required, rather than at the border of the range.

If the DNA is too concentrated, it can be diluted by adding 1x TE, pH 8.0.

If the DNA is too dilute, recommended options to concentrate the sample include the following:

Note that there can be loss of DNA during sample concentration, as the yields are not 100%. For example, the estimated yield for genomic DNA using the QIAamp DNA Micro Kit is 60%. Using a speed vacuum to concentrate samples is not recommended, due to the risk of cross-contamination and of increasing the salt concentration in the sample; it should be used only when 100% yield is absolutely necessary. If taking this approach, take extra care to ensure that the sample does not dry completely, which can also reduce yield.
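For the simpler case of diluting an over-concentrated sample with 1x TE, the volume to add follows from conservation of mass (c0·v0 = c_target·(v0 + v)). A sketch with hypothetical names:

```python
def te_dilution_volume(c0_ng_ul: float, v0_ul: float, target_ng_ul: float) -> float:
    """Volume of 1x TE (µl) to add so the final concentration equals target:
    c0*v0 = target*(v0 + v)  =>  v = v0*(c0/target - 1)."""
    if not 0 < target_ng_ul < c0_ng_ul:
        raise ValueError("target must be positive and below the current concentration")
    return v0_ul * (c0_ng_ul / target_ng_ul - 1.0)

# 450 ng/µl in 60 µl, diluted to 150 ng/µl (well inside the 30-300 ng/µl window);
# the resulting final volume (180 µl) also stays within the 50-200 µl requirement.
print(te_dilution_volume(450, 60, 150))  # 120.0 µl of 1x TE
```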

What are the most common reasons that samples fail quality control prior to sequencing?

Insufficient amounts of DNA or incorrect concentration. To ensure that samples pass QC, we recommend that customers measure sample DNA concentration using the PicoGreen assay as described in the Sample Quality Control Protocol, and ship additional DNA to buffer against any difference between the customer measurement and Complete Genomics’ measurement.

Partial degradation. DNA must be provided in 1x TE, pH 8.0. To minimize the likelihood that samples will fail Sample QC due to partial or extensive degradation, we recommend resolving all samples on a 0.8% agarose gel and sharing the results with your Project Manager prior to shipping samples.

Improper DNA storage and shipment. Ensure that sample wells are tightly sealed and that the plate is pre-frozen before placing it in the shipping container. Maintain DNA in proper storage conditions (in a –80°C freezer) until ready to ship and then ship securely in sufficient dry ice to ensure that the sample does not thaw in transit.

What is a Sample Manifest?

The Sample Manifest is a Microsoft Excel form that provides all of the details of the samples that are being shipped on a plate. Each sample plate is matched with one sample manifest, and the manifest includes all of the information for all of the samples on the plate.

Where do I get a Sample Manifest?

The Complete Genomics Project Manager will provide you with a Sample Manifest. If you have previously shipped samples to Complete Genomics, please check with the Project Manager as there may be an updated version of the Sample Manifest available for your next project.

Where do I get information on how to fill out the Sample Manifest?

Details on how to fill out the Sample Manifest are provided in the Sample Submission Instructions. If you have any questions on how to fill out the Sample Manifest, please contact your Project Manager.

What is the difference between a Top-off Sample and a Replacement Sample?

Top-off samples supplement a previously submitted sample that failed Sample QC due to an inadequate amount of DNA, or whose concentration needed adjustment into the acceptable range. Each top-off sample must be identical to the sample that failed Sample QC because the two will be mixed together.

Replacement samples are sent to be full replacements of failed samples; they may be the same as the failed sample or they may be completely different samples.

Can I ship new samples and replacement samples on the same plate?

All top-off samples must be sent on a separate plate from either new samples or replacement samples. This is because they will be taken through our top-off protocol, which includes confirmation that the two samples to be mixed are identical, before starting our standard Sample QC. Replacement samples and new samples can be on the same plate. See What is the difference between a Top-off Sample and a Replacement Sample? to understand the difference between these sample types.

Who is Complete Genomics?

Complete Genomics, Inc. is a leading commercial provider of complete human genome sequencing services. Our sequencing service provides high-coverage and high-accuracy results at an affordable price. We do not sell instruments or reagent kits; rather we provide a service that includes sample quality control, library construction, complete genome DNA sequencing, and bioinformatics analysis of human DNA samples.

Where can I learn more about Complete Genomics sequencing?

A publication in the journal Science authored by Complete Genomics scientists and collaborators (Drmanac et al. Science 2010; Science Express 2009) describes our process and also reviews results from three reference genomes. A publication in the Journal of Computational Biology authored by Complete Genomics bioinformaticians (Carnevali et al. J Comp Bio 2011) describes our original variant calling pipeline. Updates since this earlier version include the introduction of allele fraction calling and indel rescoring. Current details on variant calling are captured in the Small Variants Methods document as well as other Methods documents available on our website.

Where can I get sample Complete Genomics data and what is available?

Complete Genomics has made several complete human genome data sets available on its FTP server (ftp2.completegenomics.com). The genomes were sequenced at the Complete Genomics commercial genome sequencing center in Mountain View, California as part of our Complete Genomics Analysis Service (CGA™ Service). These data are largely consistent with the quality and attributes of data provided to Complete Genomics customers.

These data sets include 69 genomes representing the output from the Standard Sequencing Service, including trios, a large pedigree, and a diverse set of samples from nine different populations. Collections were drawn from the Coriell Institute for Medical Research.

There are also four samples representing the output from the Cancer Sequencing Service, including two tumor-normal cell line pairs. These collections were drawn from ATCC.

Where do I get technical support for data sets and tools produced by Complete Genomics?

What is the turnaround time for the Complete Genomics sequencing service?

Complete Genomics quotes the turnaround time at 90 to 120 days. In late 2010, we delivered data to customers with an average turnaround time of 83 days. In 2011, the median turnaround time was just 68 days. Complete Genomics continues to focus on driving this number down.

What are the input sample requirements?

Complete Genomics recommends ≥ 5 µg and requires 3.5 µg (based on quantity measurements performed by Complete Genomics at the time of sample QC) of high molecular-weight double-stranded DNA (majority over 20 kb). Samples must be at a concentration of 30 to 300 ng/µl, with a volume of 50 to 200 µl, and in TE, pH 8.0. Note that because there is inherent variability in quantitation between sites and users, targeting 3.5 µg for sample submission could result in a significant number of samples failing to meet sample acceptance criteria, resulting in a delay in sample processing. To ensure that samples are processed efficiently, Complete Genomics strongly encourages customers to send ≥ 5 µg when available.

Currently, whole genome amplified (WGA) DNA and formalin-fixed, paraffin-embedded (FFPE) samples are not supported.

DNA should be quantified by a PicoGreen assay (preferably with the Quant-iT™ PicoGreen® dsDNA kit from Invitrogen). Spectrophotometric quantification (by optical density) is not recommended, as contaminating protein and RNA may result in inaccurate estimation of concentration. DNA concentration should be measured after diluting to the range specified above to improve the accuracy of DNA quantification, minimizing the number of sample failures due to insufficient DNA. A detailed protocol for how PicoGreen quantitation is performed at Complete Genomics is available in the Sample Quality Control Protocol.

The minimum number of samples Complete Genomics accepts is eight. Complete Genomics is actively working on additional protocols to support smaller sample DNA amounts.
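The numeric requirements above can be checked before shipping. This sketch encodes the stated ranges (the function is illustrative, not an official acceptance tool, and does not replace Complete Genomics’ own measurements):

```python
def sample_within_spec(conc_ng_ul: float, volume_ul: float):
    """Check a sample against the submission ranges stated in this FAQ:
    30-300 ng/µl, 50-200 µl, and >= 3.5 µg total DNA (>= 5 µg recommended).
    Returns (within_spec, total_micrograms)."""
    total_ug = conc_ng_ul * volume_ul / 1000.0
    within = (30 <= conc_ng_ul <= 300
              and 50 <= volume_ul <= 200
              and total_ug >= 3.5)
    return within, total_ug

print(sample_within_spec(100, 100))  # (True, 10.0): comfortably above the 3.5 µg minimum
print(sample_within_spec(25, 100))   # (False, 2.5): concentration and quantity both too low
```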

What sequencing technology does Complete Genomics use?

Complete Genomics’ sequencing platform employs high-density DNA nanoarrays that are populated with DNA nanoballs (DNBs™). Base identification is performed using an unchained ligation-based read technology known as combinatorial probe-anchor ligation (cPAL™). The sequencing instrumentation is custom-developed to support this process. Details are described in our Science publication (Drmanac et al., 2010). See Where can I learn more about Complete Genomics sequencing?

Does Complete Genomics perform paired-end or mate-pair sequencing? What is the gap size between the reads? What are the implications of this?

DNBs are a mate-pair construct. We target ~400-500 bp inserts to maximize the power of the data (a) for assembly through many duplications and repeats (most particularly, Alu elements, which are numerous), and (b) for identification of structural variants and larger indels. The actual mate gap size in any specific library can be empirically measured from the mapping results, and (as of version 1.7 of the Complete Genomics Analysis Pipeline) we provide such a distribution with each genome. Useful metrics, such as the mean mate gap estimated for the library and the 95% confidence interval for the mate gap distribution range, are reported in the summary-[ASM-ID].tsv file in the ASM directory.
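As a sketch of how such metrics might be pulled out programmatically (the two-column layout and metric names below are hypothetical; the exact contents of the summary file vary by Assembly Pipeline version):

```python
import csv
import io

# Hypothetical excerpt of a summary TSV; a real summary file may differ.
example = (
    "Mate gap, mean\t423\n"
    "Mate gap, 95% CI low\t286\n"
    "Mate gap, 95% CI high\t580\n"
)

# Parse tab-separated name/value rows into a dictionary of metrics.
metrics = {name: float(value)
           for name, value in csv.reader(io.StringIO(example), delimiter="\t")}
print(metrics["Mate gap, mean"])  # 423.0
```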

Some mate-pair library protocols have been known to generate biased or low-complexity libraries. Complete Genomics’ recent laboratory protocols have been extensively tuned to reduce bias (for example, as a function of AT/GC ratio), and we achieve very high complexity (as measured by low duplication rates) using these methods. As a result, we have been able to provide thousands of genomes with a median genome call rate > 96% and exome call rate > 98% in the first three quarters of 2012. Additional sequencing metrics that quantify library performance are present in the summary files provided with each genome.

Can I order a phased genome, using Long Fragment Read (LFR) technology?

Complete Genomics is currently working on commercializing the LFR technology, and plans to offer whole genome sequencing with phasing based on LFR in 2013.

How does Complete Genomics map reads and call variants?

Reads are initially mapped to the reference genome using a fast algorithm, and these initial mappings are both expanded and refined by a form of local de novo assembly, which is applied to all regions of the genome that appear to contain variants (SNPs, indels, and block substitutions) based on the initial mappings. The de novo assembly leverages mate-pair information, which allows reads to be recruited into variant calling with higher sensitivity than genome-wide mapping methods provide. Assemblies are diploid, and thus we produce two separate result sequences for each locus in diploid regions. Variants are called by independently comparing each of the diploid assemblies to the reference. The process is described in our Science paper (Drmanac et al. Science, Jan 2010), and our assembly algorithms are described in detail in our publication in the Journal of Computational Biology (Carnevali et al, Journal of Computational Biology 2011).

Copy number variable (CNV) regions are called based on depth-of-coverage analysis. Sequence coverage is averaged and corrected for GC bias over a fixed window and normalized relative to a set of standard genomes. In the case of a tumor-normal comparative analysis provided through our Cancer Sequencing Service, coverage in the tumor genome is normalized to coverage for the same region in the matched normal genome. A hidden Markov model (HMM) is used to classify segments of the genome as having 0, 1, 2, 3 copies…up to a maximum value.

Structural variations (SVs) are detected by analyzing DNB mappings found during the standard assembly process described above and identifying clusters of DNBs in which each arm maps uniquely to the reference genome, but with an unexpected mate pair length or anomalous orientation. Local de novo assembly is applied to refine junction breakpoints and resolve the transition sequence. Novel insertions of mobile elements into the sequenced genome are identified as clusters of reads that uniquely map to the reference genome with one arm and to ubiquitous sequence with the other arm.

The location, type, and orientation of the inserted elements are identified using mate pairs that map in the vicinity of the insertion site, aligning each unmapped arm to sequences of a defined set of mobile elements. The process for CNV and SV detection is described in more detail in Complete Genomics Data File Formats.

What type of events does Complete Genomics call?

Complete Genomics identifies small variants, including SNPs, indels, and block substitutions, as well as copy number variations (CNVs), structural variations (SVs), and mobile element insertions (MEIs). For the Standard Sequencing Service, all of these variation types are identified in comparison to the human genome reference. For the Cancer Sequencing Service, somatic small variants, CNVs, and SVs are each also identified in comparison to the baseline sample within a pair or trio.

Please explain the gaps within the reads. What are the implications of these gaps on mapping, assembly, and variant calling?

By contrast with some other sequencing technologies, which have a high rate of within-read indels (most single-molecule sequencing and pyrosequencing-based methods have this attribute), the intra-read gaps in Complete Genomics data are relatively easy to handle. First, they always occur at, and only at, precisely known locations in each paired end. Second, the gap sizes are highly predictable and generally only +/- 1 base pair from the known mid-value. Thus, algorithms can readily be designed to map, assemble, and call variants in these reads. Complete Genomics’ analysis methods for these gapped reads have been shown to produce high-quality results for both SNP and indel variant calls.

Because of the gaps, coverage for comparable detection power does need to be modestly higher than if the reads had no gaps. However, this coverage requirement is balanced by the improved base-call accuracy (and consistent accuracy over the length of the read) in Complete Genomics sequences, improving the power of the data on a per base-call basis. This accuracy is enabled by the gapped construct that provides multiple sequencing reaction priming sites in each arm of each DNB. Because the sequencing of DNBs is highly cost-effective, Complete Genomics can also generate very deep coverage of these reads and thus produce high-quality variant calls over a large fraction of the genome.

What do you mean by a “called” base or locus?

We use stringent thresholds in our variant-calling algorithms that take into account base-call accuracy, mis-mapping probability, and both quantity and consistency of evidence. A fully called position is one where we have determined the full diploid sequence (that is, we have assembled both alleles) at these thresholds. By contrast with some other pipelines, Complete Genomics’ data analysis methods are careful to distinguish regions of the genome that are confidently called homozygous reference from those which are no-called. This greatly facilitates comparison between genomes by reducing false negatives.

For clarification, when we measure a percentage of the genome called, we are referring to a percentage of the bases corresponding to the complete NCBI reference genome sequence. We are not referring to a fraction of the non-repetitive or non-degenerate genome, or to a fraction of the genome within a certain AT/GC range.

If Complete Genomics adds a new feature to its pipeline and I wish to have my data reprocessed, can I?

Customers with genomes processed by Assembly Pipeline version 1.5.0 or later can order re-analysis of these genomes using Assembly Pipeline version 1.10 and later. Customers have the option to indicate whether they prefer a specific version or would prefer reanalysis on the most current Assembly Pipeline version at the time of processing. Since Complete Genomics does not retain customer data, the complete and original data set must be shipped back to Complete Genomics via hard disk drive. For more information, see the Reanalysis Flyer, or contact us at info@completegenomics.com.

Can I get a copy of Complete Genomics’ data processing pipeline to run on my computers?

Complete Genomics’ data processing software is not distributed at this time.

Does Complete Genomics retain customer data after it has been delivered to a customer?

Complete Genomics retains data for not less than thirty days after delivery to a customer, but deletes the data thereafter. After receiving data from Complete Genomics, customers are strongly advised to confirm immediately that the files are valid and to create a backup copy.

How big is the data sent by Complete Genomics for each genome? Are the data compressed?

A single genome at standard coverage (40x) is approximately 300 to 350 GB, although the data set may be larger. Genomes at higher coverage (80x) are approximately double (600 to 700 GB). A tumor-normal pair is therefore approximately 1.2 to 1.4 TB, and a trio is approximately 1.8 to 2.1 TB.

About ninety percent of this volume is used by the reads and initial mappings, while the processed data comprises the remaining ten percent. For part numbers providing variations only (no reads and mappings), the data set is approximately 35 GB to 60 GB per genome, depending on coverage.

Samples submitted to the Cancer Sequencing Service will have additional data in the Evidence directories, resulting in data sets of approximately 75 GB per genome for variations only.

Most of the files are shipped compressed. Uncompressing all of the data files will increase the required storage for a single genome approximately 3- to 4-fold (for example, to over 1.5 TB). Decompression is not required for compatibility with Complete Genomics’ downstream analysis tool package, CGA Tools™. For these reasons, most of Complete Genomics’ customers leave many of these files in their compressed format.
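Planning storage from the figures above is straightforward. This sketch (our own helper, using only the numbers quoted in this FAQ) returns rough per-genome ranges:

```python
def genome_storage_gb(coverage: int = 40, variations_only: bool = False):
    """Rough per-genome storage range in GB, from the figures in this FAQ."""
    if variations_only:
        return (35.0, 60.0)            # variations-only part numbers
    low, high = 300.0, 350.0           # standard coverage (40x)
    if coverage >= 80:
        low, high = 2 * low, 2 * high  # higher coverage roughly doubles the data
    return (low, high)

# A tumor-normal pair at high coverage: 2 x (600-700 GB), roughly 1.2-1.4 TB
low, high = genome_storage_gb(80)
print(2 * low / 1000, 2 * high / 1000)  # 1.2 1.4
```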

What data formats does Complete Genomics use?

All Complete Genomics data are provided as text files that can be examined and further analyzed using many different tools on all standard computing systems.

Many of Complete Genomics’ text data file formats are specific to our platform and provide rich descriptions of the data we generate. These files are also optimized for information density and to keep file sizes as manageable as possible. In addition to platform-specific files, both Standard and Cancer Sequencing Services provide variant calls in VCF 4.1 format.

Do I get the individual reads? Can I re-map or re-assemble them using some other software?

Customers receive a complete read data set unless they have ordered variant-only services. Read-level data includes all reads and mappings, as well as Phred-scale base quality scores and other useful related information such as library gap size distributions.

Complete Genomics is not presently aware of any broadly released programs optimized to handle the unique aspects of Complete Genomics read data, such as the intra-read gap structure. We have found that mapping and assembly programs such as MAQ or Velvet, which are well optimized for other data types, will not produce satisfactory results with Complete Genomics data.

Can I get the mappings in some other format, like SAM/BAM?

Complete Genomics has an open source tools package, Complete Genomics Analysis Tools (CGA Tools), for downstream analysis of Complete Genomics data. Currently, CGA Tools contains file format converters to transform Complete Genomics data to other formats (such as SAM/BAM and VCF). However, please thoroughly consider the response to "Can I call variants from mapped Complete Genomics reads using some other program?" before doing so.

How accurate are the individual reads? How does accuracy change over the length of a read?

We have examined a number of our data sets in detail and found that the highest scoring 85% of all raw base-calls in uniquely mapped reads are >99.5% concordant with the reference (corresponding to a Phred score >23). We also find that our calibrated Phred-scale quality scores are excellent predictors of base-call accuracy. It is important to note that this low discordance rate is achieved with no additional filtering of raw reads. Note also that the small number of discordances at these higher quality scores includes not only sequencing errors but also real polymorphisms.

Because of the unique aspects of our sequencing chemistry, our read accuracy does not degrade over the length of a read, and the error profile is relatively flat. There are modest fluctuations in accuracy of some positions over others owing to the different oligonucleotides used in each ligation. Our algorithms measure this position-specific discordance rate and use it as a prior on error rate in variant calling.
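The concordance-to-Phred conversion quoted above follows directly from the standard Phred definition, Q = -10 log10(error probability). A minimal sketch in Python; the 99.5%/Phred 23 figures come from the text, while the function names are illustrative:

```python
import math

def phred_from_error(p_error):
    """Phred-scale quality score for a given error probability."""
    return -10 * math.log10(p_error)

def error_from_phred(q):
    """Error probability implied by a Phred-scale quality score."""
    return 10 ** (-q / 10)

# 99.5% concordance with the reference implies a 0.5% discordance rate
print(round(phred_from_error(1 - 0.995), 1))  # 23.0
```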

What is the coverage provided?

Complete Genomics offers two coverage levels. For the standard-coverage products, Complete Genomics generates ≥ 120 GB of reads mappable to the reference genome, providing an average coverage of ≥ 40X across the reference genome. Furthermore, Complete Genomics provides ≥ 90% completeness, defined as making a diploid call (i.e., both alleles) at unique loci in over 90% of the reference genome. We believe this high level of coverage provides excellent accuracy for calls over the vast majority of the genome of any sample. Genomes sequenced to date have typically well exceeded these metrics. These metrics are reported for each sequenced genome in an output file (summary-[ASM-ID].tsv) that is provided to customers.

For the high-coverage products, the coverage level is doubled, generating ≥ 240 GB of reads mappable to the reference genome, providing an average coverage of ≥ 80X across the reference genome. The additional coverage is useful for increased sensitivity, particularly for heterogeneous samples such as tumors. In the case of saliva samples, the amount of sequencing performed is equivalent to the high coverage products, but because of the bacterial DNA also present and also sequenced, the mapping rate is lower. For saliva samples, Complete Genomics guarantees an average coverage of ≥ 50X across the reference genome. Samples with low bacterial load will generally receive significantly higher coverage. Samples with high bacterial load will receive additional free sequencing to ensure that ≥ 50X coverage is attained for the samples.
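The relationship between mappable yield and average coverage described above is simple division by the reference length. A sketch, assuming an illustrative haploid reference size of 3.0 Gb (the exact denominator used by Complete Genomics is not stated here):

```python
def mean_coverage(mappable_bases, reference_length=3.0e9):
    """Average fold-coverage implied by a given mappable yield.

    reference_length is an illustrative approximation of the human
    reference size, not a value specified by Complete Genomics.
    """
    return mappable_bases / reference_length

print(mean_coverage(120e9))  # 40.0, standard-coverage product
print(mean_coverage(240e9))  # 80.0, high-coverage product
```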

What is the read length? What coverage of the genome does this allow?

We sequence 70 bases per DNB, 35 from each end. At the high level and uniformity of base-call accuracy we achieve, a 35-base read has mapping power equivalent to that of somewhat longer reads from other methods. Perhaps more importantly, the vast majority of mapped base calls contribute significantly to variant detection (such as SNP calling), in contrast with other technologies where accuracy drops off significantly along the length of the read.

We have a variety of data, both from actual assemblies and simulation studies, which show that about 96% of the reference human genome is addressable using this library and sequencing strategy, including a significant fraction of the high-copy repeat sequences in the genome. The remaining 4% includes degenerate regions and larger, highly conserved sequences which are difficult to access using most sequencing methods.

What is the stated accuracy in a Complete Genomics data set?

We have multiple data points regarding accuracy from validation studies (comparing Complete Genomics data to other laboratory methods), technical replicates, and family studies (using Mendelian constraints to measure errors). These data suggest that Complete Genomics fully calls approximately 97% of the reference genome (and 98% of the exome) with SNP false positive and false negative rates of 1.56 × 10⁻⁶ and 1.67 × 10⁻⁶, respectively. The net Mendelian Inheritance Error concordance of all small variant call types (SNPs, indels and substitutions) was observed to be 99.99971% in called non-repetitive bases and 99.99947% genome-wide. For more information, contact your sales representative or customer support for a copy of Complete Genomics' Accuracy White Paper.

What bioinformatics skills and IT infrastructure do I need to work with Complete Genomics data?

Many current Complete Genomics users have had excellent scientific success by studying only the processed results provided by Complete Genomics, in particular the called variants and their annotations. These customers find they require neither highly specialized bioinformatics skills (such as in genome assembly) nor expensive high-end compute clusters to work with the data. Many sophisticated analyses of these data can be done on high-end desktops and mid-range servers with access to enough disk storage to keep the data. However, even the processed data in the variant and annotation files are large, and these files can be difficult to work with using many desktop software programs. Most notably, this applies to Microsoft Excel, where even the most recent versions have a limit of roughly one million rows. Also, visually inspecting even a fraction of the variant calls in any genome can be daunting.

Since there are important bioinformatics considerations as well as logistical issues when interpreting any large genomic data set (including those produced by Complete Genomics), we often recommend that customers have access to at least one individual with good bioinformatics skills, including basic programming (Perl or Python scripting is common), and access to a Unix/Linux environment. This person should have technical experience manipulating large data sets and a good scientific understanding of genetics, genomic sequence, and genome-annotation databases.

Do I need a data processing pipeline for mapping, assembly, or variant detection to work with Complete Genomics data?

No. Mapping, assembly, variant detection, and annotation are performed by Complete Genomics and are included in the data set provided to customers.

If I am just using the variant files and other processed output, can I get rid of the reads and initial mappings? At least, can I keep them off my computer?

It is up to you to determine which data you need to archive, but keep in mind that Complete Genomics deletes customer data a short time after delivery to you, so any data you permanently delete is irretrievable. Also, recall that all disk drives, including those sent by Complete Genomics, have a finite lifetime and a failure rate. Complete Genomics strongly recommends that you make and keep backup copies (at least two separate copies on separate devices) of any critical data.

If you intend to publish your results, then you may be required by the journal or by your funding source to submit the reads to a central database. You may wish to investigate any such requirement before making decisions about data retention.

If you will be focusing on the processed data from Complete Genomics (such as variant calls) but wish to retain the reads and initial mappings, you may want to consider storing them on slower, less expensive storage than the other files. Cloud storage such as Amazon Web Services (AWS) may also be an option worth considering. AWS is an infrastructure web services platform that provides remote computing power, storage, and other services.

The ST001V and SHC001VAR part number options offer delivery of all variant files and other processed output, without including the reads and initial mappings, for customers that do not intend to use or store the raw data at all.

Can I call variants from mapped Complete Genomics reads using some other program?

Yes, however the results will differ from those that the Complete Genomics pipeline generates, and customers should be cautious, as the results may be far less accurate. We are not aware of any broadly released variant-calling tools optimized for Complete Genomics data.

Complete Genomics’ assembly and variation calling methods have been tuned to various aspects of Complete Genomics’ data, such as the flat error profile, the presence of specific length intra-read gaps, and the properties of the analytical process we have chosen. Because of the division of labor between our mapping and assembly processes, our initial mappings have a somewhat different character than mappings often produced for other platforms. For example, traditional SNP calling directly from these initial alignments tends to produce far less satisfactory results than our local de novo assembly approach.

Where can I get tools for further processing or visualizing Complete Genomics data?

Tools for further processing and visualizing Complete Genomics data are available at the Complete Genomics User Community. The Tool Repository at this site includes scripts and programs that have been written by Complete Genomics experts, but these are not formal product offerings and, as such, are not fully supported by Complete Genomics.

How does the Cancer Sequencing Service differ from the Standard Sequencing Service?

The Standard Sequencing Service supports whole genome sequencing and data delivery for individual genomes. The Cancer Sequencing Service supports whole genome sequencing and data delivery for genome pairs and genome trios. Genome pairs consist of a tumor genome with a matched normal, while genome trios consist of two tumor genomes from the same patient along with the matched normal genome for those two tumors. Pairs and trios (also referred to as Sample Groups) are tracked and coordinated throughout QC, processing, sequencing, assembly, and delivery. Data delivered includes the full data set for each genome within the Sample Group as well as results from paired analysis between the tumor and the matched normal genome samples.

Can I get deeper coverage for my tumor samples?

Yes! Complete Genomics offers complete human genome sequencing at two coverage levels. For standard coverage, Complete Genomics guarantees a minimum average of 40x coverage across the complete genome. For high-coverage genomes, the number of sequencing lanes applied to sequencing each genome is doubled, with a guarantee of a minimum average of 80x coverage across the complete genome. It is common for cancer researchers to apply high coverage to tumor samples to mitigate some of the challenges introduced by heterogeneity (the presence of contaminating DNA from multiple genomes within a sample) and gross aneuploidy (widespread copy-number changes). The high coverage option is recommended for researchers working with samples known or expected to exhibit heterogeneity or gross aneuploidy.

Can a tumor sample be submitted using the Standard Sequencing Service rather than the Cancer Sequencing Service?

Yes. Unpaired tumors should be submitted using the Standard Sequencing Service, as they represent individual genomes. Tumor-normal genome pairs could be submitted separately using the Standard Sequencing Service, but there is no benefit to doing this, and there will be no paired analysis provided (i.e., no identification of somatic events). It is recommended that tumor-normal pairs (or trios) be submitted using the Cancer Sequencing Service to enable the detection of somatic events specific to the tumor samples.

Can a sample set containing only tumors or only non-tumors be submitted using the Cancer Sequencing Service rather than the Standard Sequencing Service?

There is no restriction on the types of samples that are submitted using either product type, but it is important to understand some caveats to submitting tumor samples without matched normal samples using the Cancer Sequencing Service. These include the following:

Somatic output, summarizing small variants, copy number variation, and structural variations, is unidirectional. It is produced by comparing the non-baseline sample to the baseline sample only. A comparison in the reverse direction is not performed.

CNV calling will work best when the baseline genomes are diploid/euploid.

Can I get the same paired analysis provided by the Cancer Sequencing Service using CGA Tools?

CGA™ Tools supports some of the paired analyses provided in the Cancer Sequencing Service. For example, calldiff and junctiondiff enable the identification of somatic small variants and structural variations. Not all of the analyses provided in the Cancer Sequencing Service are reproduced by CGA Tools. Refer to the Genome Comparison Tools section of the CGA Tools User Guide.

How are Sample Groups treated differently than individual samples throughout the complete workflow?

Samples submitted as a group under the Cancer Sequencing Service are treated as a Sample Group. The workflow and output for samples submitted as a Sample Group differs slightly from samples submitted as individual genomes under the Standard Sequencing Service as follows:

Sample QC involves a confirmation that the samples within a group are in fact related, using a panel of 96 SNPs.

Sequencing results for samples within a group are included in and influence the output for samples within a Sample Group. Specifically:

Paired output is provided for each pair, in which one sample is compared to the paired baseline to identify the somatic events specific to the sample. Somatic output includes somatic small variants, somatic CNVs, somatic structural variations, and a somatic Circos plot. A VCF file is provided including variants from both samples within a pair.

Allele read counts for paired samples are included in the masterVar file for each given sample.

Evidence (mapped reads) at a given locus is provided not only wherever a variant is found in the sample itself, but also wherever a variant is found in the paired sample, even if no variant was called in the sample in question.

What if my Sample Group contains more than three samples, or if I want additional comparisons within my group?

For Sample Groups that contain greater than three samples or that require additional sample comparisons, submit all samples within the Sample Group as a combination of pairs, trios, and/or individuals, depending on sample numbers. For the additional comparisons desired that are not accomplished through the assignment of pairs and trios in the first submission, new sample comparisons can be performed by choosing one of the following options:

CGA Tools comparisons: Complete Genomics provides analysis tools (CGA Tools) that identify and score somatic small variants and identify somatic SVs. Identifying somatic CNVs by directly comparing the CNV output for each sample is also often performed by customers. Please contact support for more information.

Reanalysis: Sample groups can be submitted to our Professional Services group for automatic reanalysis immediately after sequencing so that the additional desired paired analyses are delivered shortly after the primary assemblies. If this is the desired route, please inform your Complete Genomics sales representative and Project Manager.

How do I assess the quality of the CNV, SV, or MEI data for my genome?

The summary file (summary-[ASM-ID].tsv) contains a variety of CNV, SV, and MEI metrics that one would expect to be roughly consistent across genomes from individuals of the same ethnicity or disease type. These metrics can be quite useful for quality assessment. They include:

Total CNV segment count

Total number of bases in CNV segments

Fraction of novel CNV (by segment count)

Fraction of novel CNV (by base count)

Total junction count

Mobile element insertion count

Fraction of novel MEI

Please note that while the application of these and other metrics to normal diploid genomes is relatively clear, correctly interpreting these and similar calculations for a cancer or non-diploid genome can be more difficult.

In addition, the REPORTS directory includes several files reporting various aspects of the sequence data that can be used to assess the quality of the delivered genome. For example:

circos-[ASM-ID].html and circos-[ASM-ID].png: visual summary of small and large variation data for each genome; includes junctions, called level or called ploidy, and Lesser Allele Fraction (LAF). somaticCircos-[ASM-ID].html and somaticCircos-[ASM-ID].png summarize variants present in the tumor sample that are absent from the matched normal (or baseline) sample.

coverage-[ASM-ID].tsv: Reports number of bases in the reference genome covered (overlapped) by no reads, by one read, by two reads, etc. Two forms of coverage are computed and reported: uniquely mapping mated reads, and multiply mapping mated reads, appropriately weighted by a mapping confidence factor between 0 and 1 (“weight-sum” coverage). With this information, you can create a plot of genome-wide coverage distribution. For standard-coverage genomes, you would expect the mean coverage to be at least 40, and for high-coverage genomes the mean coverage would be at least 80.

coverageCoding-[ASM-ID].tsv: Reports same information as coverage-[ASM-ID].tsv for only the coding regions of the reference genome.

coverageByGcContent-[ASM-ID].tsv: Reports normalized coverage for cumulative GC base content percentile, allowing you to assess the level of GC bias across the genome.

coverageByGcContentCoding-[ASM-ID].tsv: Reports normalized coverage for cumulative GC base content percentile, allowing you to assess the level of GC bias across the exome.
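As an illustration of using coverage-[ASM-ID].tsv to compute the genome-wide mean, the sketch below treats the file as a depth histogram (depth, number of reference bases at that depth). The column names "coverage" and "bases" are placeholders, not confirmed header names; check an actual delivery before use:

```python
import csv

def mean_depth(path, depth_col="coverage", count_col="bases"):
    """Mean coverage from a depth-histogram TSV.

    Assumes one row per depth value giving the number of reference bases
    observed at that depth; column names are illustrative only.
    """
    total_bases = 0
    total_depth = 0
    with open(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            n = int(row[count_col])
            total_bases += n
            total_depth += int(row[depth_col]) * n
    return total_depth / total_bases
```

For a standard-coverage genome you would expect the resulting mean to be at least 40, and for a high-coverage genome at least 80.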

How can I get updated CNV/SV data for the genomes that Complete Genomics has already sequenced and delivered to me?

Customers with genomes processed by Analysis Pipeline version 1.5.0 or later can order re-analysis of these genomes using Analysis Pipeline version 1.10 or later. Customers have the option to indicate whether they prefer a specific version or reanalysis on the most current Analysis Pipeline version at the time of processing. Since Complete Genomics does not retain customer data, the complete and original data set must be shipped back to Complete Genomics via hard disk drive. For more information, see the Complete Genomics Reanalysis Service Flyer, or contact support.

Where can I find more information on Complete Genomics data and results?

What is the difference between the different CNV files, and how do I select which file to use for my samples?

Complete Genomics provides two versions of each CNV summary file: one version for samples that are presumably diploid and one version for samples expected to exhibit gross copy number abnormalities.

CNV files and when to use them:

cnvDetailsDiploidBeta, cnvSegmentsDiploidBeta: Use this set of files when interested in identifying CNVs (compared to the reference genome) in samples that are primarily diploid. Typically, this case includes the majority of non-tumor samples and samples that are homogeneous with regard to limited copy number changes, including, for example, trisomy 21.

cnvDetailsNondiploidBeta, cnvSegmentsNondiploidBeta: Use this set of files when interested in identifying CNVs (compared to the reference) in samples that exhibit widespread or long-range divergence from a copy number of two across all autosomal chromosomes. This case includes the majority of tumor samples and perhaps genomes with mosaic aneuploidy.

somaticCnvDetailsDiploidBeta*, somaticCnvSegmentsDiploidBeta*: Use this set of files when interested in identifying somatic copy number events (compared to a matched normal sample) in samples that are primarily diploid. While this is unusual for tumor samples, these files may be useful for some tumor types that are known to exhibit low levels of genomic copy number alterations, such as leukemias.

somaticCnvDetailsNondiploidBeta*, somaticCnvSegmentsNondiploidBeta*: Use this set of files when interested in identifying somatic copy number events (compared to a matched normal sample) in samples exhibiting gross copy number abnormalities.

* Cancer Sequencing Service output only.

How should I use the various coverage values provided in the depthOfCoverage and coverageRefScore files?

The coverageRefScore and depthOfCoverage files come with several different calculations for coverage, including: uniqueSequenceCoverage, weightSumSequenceCoverage, gcCorrectedCvg, and avgNormalizedCoverage. Each of these values may be useful in understanding certain aspects of the genome or the analysis process.

uniqueSequenceCoverage: This value counts only full mappings of DNBs whose weight ratio is > 0.99, indicating that the estimated probability of the mapping being correct is > 99%. Its primary value over the other measures described here is that in repeat regions of the genome (especially ubiquitous repeats or segmental duplications), only those DNBs that can be assigned to a specific copy with some confidence are counted. For repeat copies that contain unique differences from other copies, this may allow determination that a specific copy of a repeat has been duplicated or deleted. It may also be useful in evaluating the quality of a variation call in a repeat region, or in understanding why a given region was no-called despite there being many (nonunique) mappings to the region.

weightSumSequenceCoverage: This value counts all mappings, giving fractional attribution to alternative placements of a single DNB. It can be useful, compared with uniqueSequenceCoverage, in evaluating regions of the genome that are near-perfect copies of other regions of the genome, as such regions will receive essentially no uniqueSequenceCoverage. The ratio of the two measures (as captured by the fractionUnique metric in cnvDetails files) may provide insight into whether a CNV called at a particular location is really a duplication or loss of that region per se, or whether it might be better understood as reflecting a change in the copy number to a repeat class rather than a specific instance. As compared to gcCorrectedCvg, this value is closer to the raw data, and might help decide whether a given contrast post-GC correction seems justified. It might be useful as input to an alternative approach to bias correction.

gcCorrectedCvg: This value is based on the weightSumSequenceCoverage, and reflects a transformation of that value that adjusts for coverage biases corresponding to local GC content; coverage at each position is adjusted based on the GC content of a 1000-base window whose 501st base is the position of interest. This correction improves the overall comparability of coverage values sample-to-sample in the event of any library-to-library change in the extent and pattern of GC content bias to coverage. As such, this value might be preferred as input to a sample-to-sample comparison or normalization process.

avgNormalizedCoverage: This value is based on the gcCorrectedCvg, and reflects a transformation of that value that adjusts for forms of repeatable coverage bias other than local GC content. Coverage in each window is adjusted based on the coverage in a collection of ‘baseline’ samples. This adjustment attempts to account for the apparent copy number relative to what would be expected of a sample that was diploid in that window. This may be useful in estimating copy number in the current sample using a model that expects its inputs to be proportional to absolute copy number (as opposed to relative change in copy number relative to a reference standard).
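The relationship between the first two measures can be made concrete: the fractionUnique metric in the cnvDetails files, mentioned above, is in essence the ratio of unique to weight-sum coverage. A hypothetical sketch, not Complete Genomics' implementation:

```python
def fraction_unique(unique_cvg, weight_sum_cvg):
    """Share of coverage attributable to confidently placed mappings.

    Values near 1.0 suggest unique sequence; values near 0.0 suggest a
    near-perfect repeat where DNBs cannot be assigned to a specific copy.
    """
    if weight_sum_cvg == 0:
        return 0.0
    return unique_cvg / weight_sum_cvg

print(fraction_unique(38, 40))  # 0.95, a mostly unique region
print(fraction_unique(1, 40))   # 0.025, likely a high-copy repeat
```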

How do I interpret CNV types “hypervariable” and “invariant”? Should I consider them as candidate CNV regions?

In the cnvSegmentsDiploidBeta file, a segment labeled “invariant” indicates that coverage does not support normal ploidy, but the abnormal ploidy is observed to be invariant in both the sequenced genome and the 52 samples that represent the reference genomes for CNV analysis. See Figure 1. Invariant regions may result from errors in the reference (such as missed contig overlaps) or from reference sequence that is rare in the population. In either case, it may not be useful to include the segment in a list of CNVs. Segments labeled “hypervariable” indicate that coverage does not support normal ploidy, but coverage is highly variable across the sequenced genome and the 52 reference genomes, without appreciable clustering. See Figure 2. Hypervariable regions may result from sequencing artifacts or from high-copy repeat regions (such as STRs or segmental duplications) with a high degree of polymorphism. Depending on your relative tolerance for false positives and false negatives, it may be reasonable to include these as candidate CNV segments to be validated using orthogonal technologies.

Figure 1: Invariant Region

An invariant region is called when coverage consistently indicates the same CNV across many genomes, including the sequenced genome itself.

Figure 2: Hypervariable Region

A hypervariable region is called when coverage is highly variable across many genomes, including the sequenced genome, without appreciable clustering.

In comparing CNV data, how can a region be labeled with cnvType “invariant” in one genome and be labeled ‘+’ in the other if assignment of “hypervariable” and “invariant” regions is based on the coverage profiles of the reference genomes?

The assignment of “hypervariable” and “invariant” regions depends not only on the reference genome set, but also on the specifics of the genome in question. Specifically, there are some heuristic cutoffs for determining whether to label a region “hypervariable” or “invariant”. The cutoffs include the coverage of the genome of interest. Thus, it is possible for the same genomic region to be assigned different cnvType values in different genomes. This occurs when the coverage for the genome in question is far enough outside the range of values seen in the baseline set. Furthermore, regions are labeled “hypervariable” or “invariant” only if they would otherwise be cnvType '+' or '-'. Thus, a genome can be labeled cnvType '+' or '-' in a region that is sometimes called “invariant” or “hypervariable” in other genomes. Additionally, a genome can be labeled cnvType '=' in a region that is sometimes called “invariant” or “hypervariable” in other genomes.

Figure 3 illustrates an example of a region that could potentially be called “hypervariable” in some genomes and cnvType ‘+’ or ‘-‘ in other genomes. This figure shows coverage profile plots for individual genomes in the reference genome set, along with the coverage profile of the sequenced genome of interest (represented by the thicker grey line). The red called-segment line marks a region where the lack of sharply separated clusters is considered grounds for assignment of “hypervariable” region. However, because the sequenced genome of interest is clearly well separated from the reference genomes, the region is assigned a cnvType ‘-‘. If, in a different sequenced genome, the coverage is not well separated from the reference genomes, the region would be assigned “hypervariable”.

Figure 3: Coverage Profile Plots

A segment with cnvType ‘-‘ is called for the sequenced genome in a region where the reference set genomes display hypervariable characteristics.

What is “calledLevel” in my cnvSegmentsNondiploidBeta file and how does it relate to ploidy?

In tumor genome processing, we identify discrete coverage levels based on the distribution of the observed normalized coverage levels in the tumor genome. Once the levels are determined, a hidden Markov model (HMM) is used to segment the genome into regions assigned to the identified levels. The called levels are identified by their coverage relative to the median of the diploid portion of the genome. Thus, the results describe segments with values > 1 being amplified relative to the genome median and values < 1 being reduced relative to the genome median. Due to tumor heterogeneity, normal contamination (the presence of DNA from normal cells in the tumor sample), or both, coverage levels may not correspond to integer-valued ploidy.

How do I identify segments that are amplified or reduced in my tumor genome?

Because tumors are usually nondiploid samples (due to heterogeneity and widespread copy number aberrations), the CNV files for ‘Nondiploid’ samples are generally the most appropriate for studying copy number changes in these samples. In these files, segment ploidy and CNV type are not reported, but it is possible to filter on the calledLevel column of the cnvSegmentsNondiploidBeta file. Look at the header of this file and identify the #MEAN_LEVEL value closest to 1; this indicates the coverage level most representative of the median genome coverage. Filtering the calledLevel column for values not equal to that level will help you obtain segments of interest. However, this approach does not necessarily correspond to identifying non-diploid segments; for instance, in a tumor that is largely tetraploid, it will result in labeling regions that have three copies or fewer as “reduced” and regions that have five copies or more as “amplified.” Due to the potential for misinterpretation, we do not explicitly identify levels as amplified or reduced.
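The filtering procedure described above can be sketched as follows. Parsing of the #MEAN_LEVEL header entries is omitted and the data structures are hypothetical; confirm the layout against an actual cnvSegmentsNondiploidBeta file:

```python
def segments_off_median(segments, mean_levels):
    """Keep segments whose calledLevel is not the level closest to 1.0."""
    median_like = min(mean_levels, key=lambda level: abs(level - 1.0))
    return [s for s in segments if s["calledLevel"] != median_like]

# Hypothetical example data
levels = [0.52, 1.03, 1.49, 2.10]
segments = [
    {"chr": "chr1", "begin": 0, "end": 2000, "calledLevel": 1.03},
    {"chr": "chr1", "begin": 2000, "end": 8000, "calledLevel": 2.10},
]
print(segments_off_median(segments, levels))  # keeps only the 2.10 segment
```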

I have matched tumor and normal samples. How do I get “somatic” CNV calling?

For samples submitted for the Cancer Sequencing Service, Analysis Pipeline versions 2.0 and later provide paired analysis between tumors and baseline samples (generally matched normals) submitted as a group. The output is provided in the somaticCnvDetails-[ASM-ID]-N1 and somaticCnvSegments-[ASM-ID]-N1 files and reflects the direct comparison of coverage in the tumor sample with coverage in the matched normal.

For samples submitted for the Standard Sequencing Service, no somatic analysis is performed. The CGA™ Tools mkvcf command can be used to compare coverage windows between different samples, though it does not output somatic CNV calls. Files that could be useful for comparing copy number between two samples include the cnvDetails file, which provides normalized coverage as well as GC-corrected coverage, and cnvSegmentsNondiploid, which identifies called CNV segments and boundaries.

What are the limitations of using a single matched genome as the baseline for the paired analysis used to identify somatic CNVs?

Although the paired analysis approach is most appropriate for identifying somatic CNVs in tumor-normal pairs, there are certain limitations of the paired analysis, where a single sample is used as the baseline genome:

The CNV calling is based on an HMM containing a fixed set of states; in the case of the ‘diploid’ analysis, these correspond to integer-valued copy number, while in the case of the ‘nondiploid’ analysis, they correspond to strongly exhibited relative coverage levels. In portions of the genome where the paired baseline sample is itself not diploid, a change in copy number in the derived sample, for example a gain of one copy, will result in different changes in relative coverage than in portions of the genome where the baseline sample is diploid. For instance, if the baseline sample is triploid, an increase in the derived genome to tetraploid will result in a relative coverage of 1.33, whereas in a region where the baseline is diploid, an increase to triploid will result in a relative coverage of 1.5. This may lead to difficulty calling the correct copy number change in the derived sample in regions where the baseline is nondiploid. In the limit, if a highly aneuploid sample is used as the baseline, paired CNV calling may be substantially degraded; this particularly affects the analysis of matched samples in which the sample designated as the baseline is an aneuploid tumor.

The paired baseline sample, being a single sample, may by chance be more different from the target sample in some regions than is the standard baseline. That is, the measurement variance on the baseline sample will be higher than that for the multi-genome baseline. This may lead to a modest number of somatic CNV calls even where two paired samples are truly copy number identical.

How do I evaluate the confidence of a called CNV? What sorts of underlying evidence can I look at?

The Phred-like scores reflecting the confidence that the segment has the called ploidy and that the segment has the correct CNV type are reported in the ploidyScore and CNVTypeScore fields. The higher the ploidyScore and CNVTypeScore, the greater the confidence in the call. In addition to these statistical measures of confidence, segments spanning three or more window lengths (6 kb) are more likely to be true CNVs; segments smaller than 6 kb tend to be associated with higher false positive rates.
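A minimal filtering sketch, assuming segments have already been parsed into records with begin/end coordinates and the two score fields (the score threshold of 40 is illustrative, not a pipeline default):

```python
def high_confidence_segments(segments, min_score=40, min_length=6000):
    """Keep CNV segments whose Phred-like ploidyScore and CNVTypeScore both
    meet min_score and which span at least min_length bases (three or more
    window lengths)."""
    keep = []
    for seg in segments:
        length = seg["end"] - seg["begin"]
        if (seg["ploidyScore"] >= min_score
                and seg["CNVTypeScore"] >= min_score
                and length >= min_length):
            keep.append(seg)
    return keep

segments = [
    {"begin": 0, "end": 10000, "ploidyScore": 55, "CNVTypeScore": 60},
    {"begin": 0, "end": 4000,  "ploidyScore": 90, "CNVTypeScore": 90},  # too short
    {"begin": 0, "end": 20000, "ploidyScore": 12, "CNVTypeScore": 70},  # low ploidyScore
]
kept = high_confidence_segments(segments)
```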

What does it mean when “CNVTypeScore” is much greater than “ploidyScore”?

This means that we are much more confident that the segment is either amplified (if CNV type = “+”) or deleted (if CNV type = “-”) than we are in the actual magnitude of the copy number change.

How do CNV and SV results relate to one another? Would structural variation events that involve changes in copy number be reported in both CNV and SV files?

Results from CNV and SV analysis are generated independently, using different methods. We employ read depth and discordant mate pair methods to detect CNVs and SVs, respectively. Each approach has strengths that allow detection of a class of event that the other does not. For example, read-depth analysis is able to identify CNVs in complex regions of the genome rich in segmental duplications, while discordant mate pair mapping is able to detect events, such as inversions, that do not result in copy number changes. While some events, such as deletions, may be detected by both methods and reported in both CNV and SV files, we do not provide information that would directly link the calls that represent the same event.

Why did Complete Genomics miss a known event or an event that is obvious from the raw data in my sample?

A known limitation in our current pipeline that we are working to improve is reduced sensitivity to CNVs less than 5 kb in length and low sensitivity to CNVs less than 3 kb in length. Also, regions of high copy number polymorphism (such as high-copy segmental duplications, short tandem repeats, and heterochromatic regions) are generally no-called. For tumor genome segmentation, sufficient heterogeneity in a tumor may make it difficult to correctly identify all the relevant coverage levels. In addition, excessive normal contamination may cause differences in ploidy within the tumor portion of a sample to produce differences in coverage that are too small to be modeled, even if the tumor is itself homogeneous.

Can I get access to the reference data set used to generate the baseline?

Yes. For samples analyzed using Analysis Pipeline version 2.0 or later, the baseline genome set comprises 52 unrelated genomes from the Complete Genomics Diversity Panel. A file that summarizes the underlying data and normalization constants for each of the CNV baseline genomes is available from the Complete Genomics FTP site.

The accompanying CNV Baseline Genome Dataset: Data Format Description document provides the identifiers for each genome in the CNV baseline set and describes the data file format for the CNV baseline genome composite file. For more details on how the CNV baseline genomes are used to normalize and ‘no-call’ CNV data for the sequenced genome, please consult the CNV Methods document. Note that the same genomes are used to construct our CNV, SV, and MEI baseline sets.

What is the Lesser Allele Fraction (LAF), and where can I find it?

The allele fractions at a heterozygous site in a diploid portion of the genome should be 50% each, but due to heterogeneity, loss of heterozygosity (LOH), and copy number variation, the fraction of each allele may be greater or less than 50%. The Lesser Allele Fraction (LAF) is the fraction of the sample carrying the less abundant allele, that is, the allele present in ≤ 50% of the sample. The range of LAF values is therefore 0 to 0.5. The LAF is similar to “B-allele frequency” estimates from microarray genotyping data, but captures the fraction of the less abundant allele/haplotype rather than the fraction of an arbitrary allele at each locus.

Single-sample LAF estimates are provided for all genomes. These calculations apply to 100 kb windows and are based on the read counts of each allele at all fully-called loci within the window. Paired-sample LAF estimates are provided for tumor (non-baseline) samples submitted to the Cancer Sequencing Service. Paired-sample LAF calculations apply to 100 kb windows and are based on allele read counts in the tumor at loci that are called heterozygous in the matched normal (baseline) sample. LAF measurements are provided in several output files, including masterVarBeta, all cnvSegments and cnvDetails files, the vcfBeta and somaticVcfBeta files, and all Circos plots.
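The per-site and per-window calculations can be sketched as follows. Pooling lesser/total read counts across sites is one plausible aggregation; the pipeline's actual estimator may differ.

```python
def lesser_allele_fraction(count_a, count_b):
    """LAF at one heterozygous site: the fraction of reads carrying the
    less abundant allele. Always in [0, 0.5]."""
    total = count_a + count_b
    return min(count_a, count_b) / total if total else 0.0

def window_laf(site_counts):
    """LAF for a 100 kb window, pooling per-allele read counts over all
    fully called heterozygous sites (a, b) in the window."""
    lesser = sum(min(a, b) for a, b in site_counts)
    total = sum(a + b for a, b in site_counts)
    return lesser / total if total else 0.0
```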

Single-sample LAF:

Where LAF = 0.5, the simplest explanation is that the sample is diploid in that region, with two distinct haplotypes, though this can also reflect any even copy number with equal numbers of two haplotypes.

An LAF < 0.5 indicates copy number variation with unequal numbers of the major and minor haplotypes.

Regions that have experienced loss of heterozygosity (LOH) will have a low but typically non-zero estimated LAF, and the ‘noisiness’ of the estimate will be apparent in extended regions. This is because the sites used for LAF estimation will be false heterozygous variant calls, which are sparse and typically supported by low read counts.

Paired-sample LAF:

Where LAF = 0.5, the simplest explanation is that the sample is diploid in that region, with two distinct haplotypes, though this can also reflect any even copy number with equal numbers of two haplotypes.

An LAF < 0.5 indicates copy number variation with unequal numbers of the major and minor haplotypes. The LAF, in conjunction with depth-of-coverage information, may permit improved estimation of major and minor allele copy numbers, as well as estimation of normal contamination or other heterogeneity.

Where LAF = 0 across a region of the genome, this indicates Loss of Heterozygosity (LOH).
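These interpretation rules can be summarized in a small helper. The 0.05 tolerance is illustrative, and noisy data should be judged over extended regions rather than single windows.

```python
def classify_laf(laf, tol=0.05):
    """Rough interpretation of a paired-sample LAF value for one window."""
    if laf >= 0.5 - tol:
        return "balanced"       # e.g. diploid with two distinct haplotypes
    if laf <= tol:
        return "LOH"            # loss of heterozygosity
    return "allelic imbalance"  # unequal major/minor haplotype counts
```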

How do I identify Loss of Heterozygosity (LOH)?

In single samples, LOH may be identified using the single-sample LAF measurements as described in What is the Lesser Allele Fraction (LAF), and where can I find it?. In tumor samples submitted to the Cancer Sequencing Service, there are multiple methods available to identify LOH:

Paired-sample LAF measurements can indicate LOH regions of hundreds of kilobases or greater. These measurements, provided in the somaticCnvSegmentsNonDiploid and somaticCnvDetailsNonDiploid files, indicate regions of LOH when the paired-sample LAF values are low, approaching zero.

Variant flags can be examined to identify shorter putative regions of LOH. The masterVarBeta-[ASM-ID]-T1 (baseline sample) and somaticVcfBeta files contain flags indicating small variant loci that are homozygous in the tumor but heterozygous in the matched normal. In the masterVarBeta file, such loci are flagged with the lohVar-[comparison-modifier] flag in the varFlags column. The results of the comparison between the two genomes are also provided in the locusDiffClassification column(s). In the somaticVcfBeta file, such loci are flagged with the value “LOH” in the Somatic Status (SS) tag in the FORMAT column. Note that the information on variants whose genotypes are consistent with LOH in the masterVarBeta and somaticVcfBeta files is based on a simple comparison at the level of an individual locus or superlocus. Any attempt to identify LOH blocks using this information would require an algorithm that parses this information over larger blocks and tolerates some errors. For more information about Complete Genomics files, see the Data File Format documents.
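As a sketch, loci carrying an lohVar-* flag can be collected from parsed masterVarBeta records like this. The semicolon separator and the specific flag modifiers shown are assumptions for illustration only.

```python
def loh_flagged_loci(rows):
    """Collect loci whose varFlags value contains an lohVar-* flag
    (homozygous in the tumor, heterozygous in the matched normal).
    Each record here is a dict standing in for one parsed row."""
    hits = []
    for row in rows:
        flags = row.get("varFlags", "").split(";")
        if any(f.startswith("lohVar") for f in flags):
            hits.append(row["locus"])
    return hits

# Synthetic records; the modifier values are hypothetical.
rows = [
    {"locus": "101", "varFlags": "lohVar-somatic"},
    {"locus": "102", "varFlags": ""},
    {"locus": "103", "varFlags": "otherFlag;lohVar-het"},
]
```

As the answer above notes, identifying LOH blocks would still require grouping such loci over larger regions while tolerating some errors.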

What is a junction?

A junction is defined as two separate regions of the reference genome that appear to be near each other in the genome being sequenced. Deletions are represented by a single junction (Figure 1), while other events such as inversions and intrachromosomal translocations can be represented by more than one junction (Figure 2).

As shown in Figure 1, deletion of segment BC in the sequenced genome would be represented by junction AD: a junction that connects sections A and D. leftStrand, leftPosition, rightStrand, rightPosition, and distance are fields reported in junction files. The leftStrand and rightStrand values indicate that the left and right sides of the junction have the same strand orientation, while the distance value of 2,000 indicates that the positions on the left and right sides of the junction closest to the breakpoint are 2,000 bp apart on the reference genome.
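The relationship between breakpoint positions and the distance field can be illustrated with hypothetical coordinates for junction AD:

```python
def junction_distance(left_position, right_position):
    """Reference-genome separation between a junction's left and right
    breakpoints, as reported in the 'distance' field. For a same-strand
    junction consistent with a simple deletion, this approximates the
    size of the deleted segment."""
    return abs(right_position - left_position)

# Hypothetical coordinates for Figure 1: segment A ends at reference
# position 10,000 and segment D begins at 12,000, so the 2,000 bp
# segment BC between them is deleted.
size = junction_distance(10000, 12000)
```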

Figure 2: Inversion of Segment BC in the Sequenced Genome Represented by Two Different Junctions

As shown in Figure 2, inversion of segment BC in the sequenced genome would be represented by two different junctions: junction AC, which connects sections A and C, and junction BD, which connects sections B and D. Note that unlike what is shown in Figure 2, coordinates for paired junctions are not typically identical for real events. The leftStrand and rightStrand values indicate that, for both junctions, the left and right sides of the junction have opposite strand orientations, while the distance value of 2,000 indicates that the positions on the left and right sides of the junction closest to the breakpoint are 2,000 bp apart on the reference genome.

Does Complete Genomics indicate the structural variant type represented by a junction in SV files?

Complete Genomics provides two files—allSvEventsBeta and highConfidenceSvEventsBeta—that report structural variation events involving identified junctions found in the allJunctionsBeta and highConfidenceJunctionBeta files, respectively. The CGA™ Tools junctions2events command is used to identify structural variation events such as deletions, inversions, and translocations from lists of junctions. It determines which event type a junction is consistent with by identifying possible relationships among the provided junctions. Single-sample junctions are rationalized into event types using this tool, but somatic junctions are not rationalized into event types at this time.

Are somatic events identified for my tumor-normal pair?

Currently, Complete Genomics does not report somatic events; that is, we do not attempt to rationalize somatic junctions into somatic events. The EventId, Type, and RelatedJunctions annotations in the somaticAllJunctionsBeta and somaticHighConfidenceJunctionBeta files refer to the events identified in the tumor sample analyzed as a single genome.

Does Complete Genomics indicate zygosity of the junction?

Currently, we do not attempt to call zygosity of the junction. However, zygosity can be inferred, to a certain extent, by interrogating the coverage in the junction region. For example, if coverage in a putative deletion junction region is near zero, you can infer that it is likely a homozygous deletion event.

Are the indels reported in the small variant files also reported in the junctions files?

Small insertion and deletion events are detected during the assembly process. They are only reported in the small variant files (e.g. var, masterVarBeta, vcfBeta), and not repeated in the junctions files, as they are not detected by the discordant mate pair mapping method employed for the detection of larger structural variations.

The allJunctionsBeta, highConfidenceJunctionBeta, somaticAllJunctionsBeta, somaticHighConfidenceJunctionBeta, and evidenceJunctionClustersBeta files have the same file format. What are the differences among these files?

These five files represent outputs at various steps of our SV detection pipeline. Junctions are detected by identifying clusters of DNBs in which each arm maps uniquely to the reference genome, but with an unexpected mate pair distance or anomalous orientation. If a cluster contains three or more DNBs, a junction is output. These junctions, together with annotations such as the putative junction breakpoints, the size of the structural variant, and the transition length estimated from this initial clustering of DNBs, are reported in the evidenceJunctionClustersBeta file. Once junctions are detected, local de novo assembly is attempted on each junction. The same junctions, with annotations such as breakpoint, SV size, transition sequence, and transition length refined by local de novo assembly, are reported in the allJunctionsBeta file. So, while the evidenceJunctionClustersBeta and allJunctionsBeta files report the same junctions, the annotations differ for junctions in which local de novo assembly was successful.

A set of filtering criteria is applied to junctions in the allJunctionsBeta file to obtain a list of high-confidence junctions, which are then reported in the highConfidenceJunctionBeta file. So, the highConfidenceJunctionBeta file contains a subset of the junctions in the allJunctionsBeta file, and the annotations for junctions found in both files are the same.

For samples submitted for the Cancer Sequencing Service, two additional files are provided. The somaticAllJunctionsBeta file represents junctions that are identified in the allJunctionsBeta file for the tumor but not in the allJunctionsBeta file for the normal sample. The somaticHighConfidenceJunctionBeta file includes junctions identified in the highConfidenceJunctionBeta file for the tumor, but not in the allJunctionsBeta file for the normal match.
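The spirit of this tumor-versus-normal comparison can be sketched as a set difference over junction records. The keys and the 200 bp matching tolerance are simplifications for illustration; the pipeline's actual comparison is more involved.

```python
def somatic_junctions(tumor, normal, slop=200):
    """Junctions in the tumor list with no counterpart (same chromosomes,
    both breakpoints within 'slop' bp) in the normal list."""
    def matches(j, k):
        return (j["leftChr"] == k["leftChr"]
                and j["rightChr"] == k["rightChr"]
                and abs(j["leftPosition"] - k["leftPosition"]) <= slop
                and abs(j["rightPosition"] - k["rightPosition"]) <= slop)
    return [j for j in tumor if not any(matches(j, k) for k in normal)]

# Synthetic junction records:
tumor = [
    {"leftChr": "chr1", "leftPosition": 5000,
     "rightChr": "chr1", "rightPosition": 9000},   # also seen in normal
    {"leftChr": "chr2", "leftPosition": 100,
     "rightChr": "chr7", "rightPosition": 300},    # tumor-only
]
normal = [
    {"leftChr": "chr1", "leftPosition": 5100,
     "rightChr": "chr1", "rightPosition": 8950},
]
somatic = somatic_junctions(tumor, normal)
```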

How do I evaluate the confidence of a called junction? What sorts of underlying evidence can I look at?

There are several columns of information in the allJunctionsBeta and highConfidenceJunctionBeta files that can be used to gauge the confidence level of the called junction. These same metrics are used to filter for high-confidence junctions reported in the highConfidenceJunctionBeta file.

DiscordantMatePairAlignments indicates the number of discordant mate pairs in the cluster that supports the called junction. A higher number of discordant mate pairs supporting a junction indicates higher confidence that the junction is present in the sequenced genome.

junctionSequenceResolved indicates whether assembly of sequence across the two sides of the junction was successful. A value of “Y” indicates success and lends strong support to the inference that there is a physical connection between the left and right sides of the junction (that is, higher confidence that the junction corresponds to a real event).

KnownUnderrepresentedRepeat indicates whether the left or right section of the junction overlaps repetitive genomic elements that are known to be underrepresented in the human reference genome. A value of “Y” indicates overlap and therefore less confidence that the junction is real.

LeftLength and RightLength indicate the lengths of the left and right sections of the junction. Longer lengths indicate higher confidence that the junction is real.

If a junction implies an interchromosomal translocation event (the left and right sections of the junction map to different chromosomes), the FrequencyInBaseline field can be used to gauge confidence in the junction, along with the metrics described above. FrequencyInBaseline indicates how often the junction is detected in the 52 normal genomes used as the baseline reference set. Because interchromosomal events are rare in normal genomes, a higher frequency indicates less confidence that the junction is real: the junction is more likely to have resulted from a processing artifact, or from sequence similarity of one mate pair to another region of the genome, than from a true physical connection between the left and right sections.
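Taken together, these metrics suggest a filter along the following lines. The thresholds are examples only, not the actual cutoffs used to produce the highConfidenceJunctionBeta file, and the exact field-name casing should be checked against the Data File Format document.

```python
def junction_passes(j, min_dnbs=5, min_side_len=70):
    """Illustrative high-confidence filter over the junction metrics."""
    if j["discordantMatePairAlignments"] < min_dnbs:
        return False                      # too little mate-pair support
    if j["junctionSequenceResolved"] != "Y":
        return False                      # assembly across junction failed
    if j["knownUnderrepresentedRepeat"] == "Y":
        return False                      # overlaps underrepresented repeat
    if j["leftLength"] < min_side_len or j["rightLength"] < min_side_len:
        return False                      # short flanking sections
    if j["leftChr"] != j["rightChr"] and j["frequencyInBaseline"] > 0:
        return False                      # interchromosomal, seen in baseline
    return True

good = {"discordantMatePairAlignments": 12, "junctionSequenceResolved": "Y",
        "knownUnderrepresentedRepeat": "N", "leftLength": 300,
        "rightLength": 250, "leftChr": "chr3", "rightChr": "chr3",
        "frequencyInBaseline": 0.0}
bad = dict(good, junctionSequenceResolved="N")
```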

Are there limitations on the classes of junctions Complete Genomics can discover?

Our pipeline has known limitations that we are working to improve. These limitations are:

Reduced sensitivity for deletion events < 1 kb in size.

Because we use only unique mappings for junction detection, we do not detect junctions involving high-identity repeats such as segmental and tandem duplications.

Insertions of transposable elements such as common LINEs and AluY or insertions of sequences not found in the reference genomes are not detected.

Can I get access to the reference data used to create the SV baseline?

Yes. For samples analyzed using Analysis Pipeline version 2.0 or later, the SV baseline genome set comprises 52 genomes from the Complete Genomics Diversity Panel. You can download a file that summarizes the detected junctions and their frequencies across the SV baseline set from the Complete Genomics FTP site.

The accompanying SV Baseline Genome Dataset: Data Format Description document provides the identifiers for each genome in the SV baseline set and describes the data file format for the SV baseline genome composite file. Note that the same genomes are used to construct our CNV, SV, and MEI baseline sets.

Are MEIs detected using the same method as SV detection?

No. The discordant mate pair mapping method used to detect SVs considers only mate pairs in which both ends map uniquely to the reference genome. To detect MEI events, we identify mate pairs in which one end maps outside the event and one end maps within the event. Thus, we look for clusters of mate pairs in which one end maps uniquely to the reference genome while the other end maps to a sequence that is ubiquitous in the reference, that is, to a sequence for which reads are marked as “overflow”.

What is the resolution of insertion site detection? Does Complete Genomics assemble the insertion site?

The range of likely insertion sites is reported for each event in the InsertionRangeBegin and InsertionRangeEnd fields within the mobileElementInsertionsBeta file. This range is determined from the initial mapping of mate pairs that map in the vicinity of the insertion site with one arm and map to ubiquitous sequence with the other arm. Currently, we do not attempt to perform local de novo assembly of the mate pairs that map across the insertion point to refine the position to a single base pair resolution.

What MEI type does Complete Genomics detect?

We align each unmapped end of the mate pair clusters that support MEI events to a defined set of mobile element sequences. This set contains the most active mobile element sequences, such as L1s, Alus, and SVAs, which are the most likely to contribute to structural variation in the human genome. Please refer to the Data File Format document for information on this list. As more information becomes available regarding the contribution of other mobile element sequences to variation in the human genome, Complete Genomics will add the relevant sequences to its database.

How should I filter for high-confidence MEIs?

A ROC curve is generated for each sequenced genome to demonstrate the tradeoff in sensitivity and specificity as a function of two metrics that are well correlated with the quality of event detection: insertionScore and insertionDnbCount. This curve is provided in the mobileElementInsertionsROCBeta file; Figure 1 shows an example. The Y-axis estimates sensitivity at a given insertionScore or insertionDnbCount, computed as the fraction of 1000 Genomes Project MEI events that are also called in the sequenced genome. The X-axis estimates specificity, using the count of MEI events called in the sequenced genome that were not called by the 1000 Genomes Project. These will not all be false positives, as suggested by the nearly linear relationship below score 1598, which could indicate that the set of true high-scoring events is a mix of events found by the 1000 Genomes Project and events missed by the 1000 Genomes Project. The curve enables you to determine the insertionScore or insertionDnbCount that would achieve a desirable balance between sensitivity and specificity. Once this threshold is determined, events can be easily filtered by this threshold value using the KnownEventSensitivityForInsertionScore or NovelEventCountForInsertionScore columns in the mobileElementInsertionsBeta file.

Figure 1: ROC Curve

This image demonstrates the trade-off in sensitivity and specificity as a function of insertionScore and insertionDnbCount.

The ROC curve graph in the mobileElementInsertionsROCBeta-[ASM-ID].png file is provided to facilitate selection of a threshold that would best meet your requirements on sensitivity and specificity of the MEI detection.
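Once a threshold is chosen from the ROC curve, filtering is straightforward. This sketch assumes events have been parsed into records with an insertionScore field, and borrows the score 1598 mentioned above purely as an example threshold.

```python
def filter_mei_calls(events, min_score):
    """Keep MEI calls whose insertionScore meets the chosen threshold;
    filtering on insertionDnbCount works the same way."""
    return [e for e in events if e["insertionScore"] >= min_score]

# Synthetic records:
events = [
    {"element": "AluY", "insertionScore": 2400},
    {"element": "L1",   "insertionScore": 900},
]
kept = filter_mei_calls(events, min_score=1598)
```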

For each candidate mobile element insertion site, the data also include the count of DNBs that map across the insertion site—DNBs where one arm maps upstream and one arm maps downstream of the reference range where the insertion is likely to be located—with a mate gap distance that would be unlikely had the DNBs come from the allele where the insertion was present. The count is reported in the referenceDnbCount field of the mobileElementInsertionsBeta.tsv file and allows determination of the zygosity of MEI events. A distribution graph of these counts for the sequenced genome is provided in mobileElementInsertionsRefCountsBeta-[ASM-ID].png to help with the selection of an appropriate threshold to separate heterozygous and homozygous events.

Does Complete Genomics identify somatic MEIs?

At this time, Complete Genomics does not provide somatic analysis for MEI detection.

Is the zygosity of events reported?

The zygosity of an insertion event is not reported in the mobileElementInsertionsBeta.tsv file. However, the plot provided in the mobileElementInsertionsRefCountsBeta-[ASM-ID].png file can be used to filter for homozygous and heterozygous events. This plot shows the distribution of mate pairs that support the reference allele for MEI events detected by the 1000 Genomes Project that were also detected in the sequenced genome. As shown in Figure 2, this distribution is usually bi-modal, corresponding to homozygous insertions (peaking at zero DNBs) and heterozygous insertions (centered at approximately 80 DNBs for this genome). Thus, for this genome, any threshold between 10 and 30 DNBs would be reasonable to apply when filtering on the ReferenceDnbCount column to separate homozygous and heterozygous MEI events.

Figure 2: Bi-Modal Distribution of Mate Pairs
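Separating the two modes can then be expressed as a simple threshold on the reference DNB count. The value of 20 below is just one point in the 10 to 30 range suggested for the example genome; choose yours from your own mobileElementInsertionsRefCountsBeta plot.

```python
def mei_zygosity(reference_dnb_count, threshold=20):
    """Classify an MEI call as homozygous or heterozygous from the number
    of DNBs supporting the reference (no-insertion) allele. Few or no
    reference DNBs means both alleles likely carry the insertion."""
    return "homozygous" if reference_dnb_count < threshold else "heterozygous"
```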

Can I get access to the reference data used to create the MEI baseline?

A composite file profiling MEIs and their frequencies across the baseline set that is used to annotate the mobileElementInsertionsBeta file is currently not provided. Because the same genomes were used to construct the CNV, SV, and MEI baseline sets, see either the CNV Baseline Genome Dataset: Data Format Description or the SV Baseline Genome Dataset: Data Format Description document for the list of genomes. These documents are packaged with the Baseline Genome Set summaries available from the Complete Genomics FTP site.

How does Complete Genomics map reads and call variations?

Reads are initially mapped to the reference genome using a fast algorithm, and these initial mappings are both expanded and refined by a form of local de novo assembly, which is applied to all regions of the genome that appear to contain variation (SNPs, indels, and block substitutions) based on these initial mappings. The de novo assembly fully leverages mate-pair information, allowing reads to be recruited into variant calling with higher sensitivity than genome-wide mapping methods alone typically provide. Assemblies are diploid, and we produce two separate result sequences for each locus in diploid regions (exceptions: mitochondria are assembled as haploid and for males the nonpseudoautosomal regions in the sex chromosomes are assembled as haploid). Variants are called by independently comparing each of the diploid assemblies to the reference.

Copy number variable (CNV) regions are called based on depth-of-coverage analysis. Sequence coverage is averaged and corrected for GC bias over a fixed window and normalized relative to a set of standard genomes. A hidden Markov model (HMM) is used to classify segments of the genome as having 0, 1, 2, 3 copies…up to a maximum value.

Structural variations (SVs) are detected by analyzing DNB mappings found during the standard assembly process described above and identifying clusters of DNBs in which each arm maps uniquely to the reference genome, but with an unexpected mate pair length or anomalous orientation. Local de novo assembly is applied to refine junction breakpoints and resolve the transition sequence. The process for CNV and SV detection is described in more detail in Complete Genomics Data File Formats.

How do I assess the quality of a genome produced by Complete Genomics?

In the summary file (summary-[ASM-ID].tsv), you will see a variety of metrics that may be helpful in understanding the quality of the delivered genome. For example:

Fully called genome fraction — percentage of reference genome with full (diploid) calls in the sequenced sample (following assembly)

Fully called exome fraction — percentage of reference exome with full (diploid) calls in the sequenced sample (following assembly)

Genome fraction where weightSumSequenceCoverage ≥n — Fraction of the reference genome bases where coverage is greater than or equal to n, with n being 5x, 10x, 20x, 30x and 40x.

Exome fraction where weightSumSequenceCoverage ≥n — Fraction of the reference exome bases where coverage is greater than or equal to n, with n being 5x, 10x, 20x, 30x and 40x.

There are additional biological metrics that one would expect to be roughly consistent across genomes from individuals of the same ethnicity (even for genomes sequenced using other methods). These are also quite useful for quality control. They include:

SNP total count (for genome and exome)

SNP heterozygous/homozygous ratio (for genome and exome)

SNP transitions/transversions ratio (for genome and exome)

SNP novelty fraction (for genome and exome)

Please note that while the application of these and other metrics to normal diploid genomes is relatively clear, correctly interpreting these and similar calculations for a cancer or non-diploid genome can be more difficult.

In the REPORTS directory of our data delivery, you will find several files reporting various aspects of the sequence data that can be used to assess the quality of the delivered genome. For example:

circos-[ASM-ID].html and circos-[ASM-ID].png (also: somaticCircos-[ASM-ID].html and somaticCircos-[ASM-ID].png): Shows a visual summary of small and large variation data for each genome. The image includes density of homozygous SNPs, density of heterozygous SNPs, gene symbols for impacted genes, and, when applicable, density of somatic variants.

coverage-[ASM-ID].tsv: Reports number of bases in the reference genome covered (overlapped) by no reads, by one read, by two reads, etc. Two forms of coverage are computed and reported: uniquely mapping mated reads, and multiply mapping mated reads, appropriately weighted by a mapping confidence factor between 0 and 1 (“weight-sum” coverage). With this information, you can create a plot of genome-wide coverage distribution. For standard-coverage genomes, you would expect the mean coverage to be at least 40, and for high-coverage genomes the mean coverage would be at least 80.

coverageCoding-[ASM-ID].tsv: Reports same information as coverage-[ASM-ID].tsv for only the coding regions of the reference genome.

coverageByGcContent-[ASM-ID].tsv: Reports normalized coverage for cumulative GC base content percentile, allowing you to assess the level of GC bias across the genome.

coverageByGcContentCoding-[ASM-ID].tsv: Reports normalized coverage for cumulative GC base content percentile, allowing you to assess the level of GC bias across the exome.
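For example, the genome-wide mean coverage can be computed from the histogram in coverage-[ASM-ID].tsv. This sketch assumes the file has already been parsed into (coverage_level, base_count) pairs.

```python
def mean_coverage(histogram):
    """Mean genome-wide coverage from a coverage-[ASM-ID].tsv-style
    histogram given as (coverage_level, base_count) pairs."""
    total_bases = sum(n for _, n in histogram)
    if total_bases == 0:
        return 0.0
    return sum(level * n for level, n in histogram) / total_bases

# Synthetic histogram: 10 bases uncovered, 80 bases at 40x, 10 bases at 80x.
avg = mean_coverage([(0, 10), (40, 80), (80, 10)])
```

For a standard-coverage genome you would expect a result of at least 40, as noted above.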

What is the difference between “Gross mapping yield” and “Both arms mapped yield” in the summary file?

“Gross mapping yield” counts aligned bases within DNBs where at least one arm is mapped to the reference genome, excluding reads marked as overflow (large number of mappings to the reference genome indicative of highly repetitive sequence). “Both arms mapped yield” counts aligned bases within DNBs where both arms mapped to the reference genome on the correct strand and orientation and within the expected distance.

What are the definitions for Fully Called, Partially Called, Half-Called and No-Called?

“Fully called” indicates that the assemblies of both diploid alleles meet the minimum required confidence thresholds, and thus both alleles are considered called. In this case, both alleles may be variant, or one may be reference and the other variant. If both are variant, they may be the same (homozygous) or different (heterozygous).

At a “partially called” or “half-called” site, only one allele meets the threshold to call the site confidently while the other does not. The Complete Genomics software reports this partial information for that locus (rather than no-calling the site entirely). Effectively, this is a statement that “we know this allele is present” but we can say little about what other allele is also present in a diploid region.

In the summary-[ASM-ID].tsv file, how is the number of homozygous SNPs calculated?

The number of homozygous SNPs is calculated from the var-[ASM-ID].tsv file, and is equal to the sum of all diploid loci where the same SNP is present on both alleles.

In the summary-[ASM-ID].tsv file, how is the number of heterozygous SNPs calculated?

The number of heterozygous SNPs is calculated from the var-[ASM-ID].tsv file, and is equal to the sum of SNPs present in the following types of loci:

het-ref SNP: A single-base diploid locus where a SNP is present on one allele, and the other allele is reference.

alt-alt SNP: A single-base diploid locus where each allele contains a different SNP.

In the summary-[ASM-ID].tsv file, how is the total number of SNPs calculated?

The total number of SNPs is calculated from the var-[ASM-ID].tsv file, and includes SNPs present in all of the following types of loci:

het-ref SNP: A single-base diploid locus where a SNP is present on one allele, and the other allele is reference.

hom SNP: A single-base diploid locus where the same SNP is present on both alleles.

alt-alt SNP: A single-base diploid locus where each allele contains a different SNP.

hap SNP: A single-base haploid locus where a SNP is present on the single allele.

other: All SNPs that occur in loci that do not fall into the above categories, including SNPs present in loci containing no-calls.
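The category tallies above can be sketched as follows, assuming each locus has been reduced to its zygosity category (real var files spread this information over several columns):

```python
from collections import Counter

def snp_summary(locus_types):
    """Tally SNP loci by category: heterozygous counts combine het-ref and
    alt-alt loci, homozygous counts are hom loci, and the total includes
    hap and 'other' loci as well."""
    counts = Counter(locus_types)
    return {
        "heterozygous": counts["het-ref"] + counts["alt-alt"],
        "homozygous": counts["hom"],
        "total": sum(counts.values()),
    }

summary = snp_summary(["het-ref", "het-ref", "hom", "alt-alt", "hap", "other"])
```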

In the summary-[ASM-ID].tsv file, what regions of the genome are included in the “exome”?

The exome is defined as the coding regions (CDS) of protein-coding genes, plus all of the untranslated genes, minus any transcripts (coding or otherwise) that are rejected by the annotation pipeline. A small percentage of transcripts in Build 36 and Build 37 are excluded from the annotation results due to one or more of the following reasons:

When Complete Genomics calls a variation (or a variant allele) what exactly does that mean?

A “variation” or “variant” refers to an allele sequence that differs from the reference at as little as a single base or over a longer (potentially much longer) interval. In general, the distinction between “variation” and “polymorphism” is that polymorphisms are by definition variable sites within or between populations. “Variation” makes no assumption about the degree of polymorphism except by comparison between a sample and the reference (recall that the reference sequence can be wrong at some sites). Thus, scientists will sometimes use the term Single Nucleotide Variant (SNV) over SNP (Single Nucleotide Polymorphism). However, we continue to use the acronym SNP as it is more ubiquitous, if not entirely precise in this case.

What types of variants are indicated in the variation files?

SNPs, small insertions and deletions, and small block substitutions are indicated as variants in the Complete Genomics variation files, found in the ASM directory. By “small”, we mean that reported indels may be up to about 50 bases in length, although the precise upper limit varies by region and coverage. In addition to these variants, we also call Copy Number Variants (CNVs), Structural Variants (SVs), and Mobile Element Insertions (MEIs), which are reported in separate folders within the ASM directory. While these variants are all determined in comparison to the human genome reference, genomes submitted for the Cancer Sequencing Service are additionally analyzed for somatic variants called in comparison to the baseline genome within the submitted pair or trio.

What exactly is a reference call? How is this different from a no-call?

Complete Genomics makes a strong distinction between a no-call and a confident homozygous reference call. Some other pipelines identify variants in sequence but do not make this distinction. Where they fail to call variants, one must rely on rough surrogate measures (such as depth of coverage and mapping scores) to help interpret whether non-variant sites are homozygous reference or are simply not callable. This distinction can be one source of confusion when comparing data across technologies.

What is a “sub” or a “delins”?

A “sub” is a block substitution, where a series of nearby reference bases have been replaced with a different series of bases in an allele. The sample’s allele and reference may be the same length (“length-conserving”) or not (“length-changing”). In data generated by Complete Genomics pipeline versions prior to 1.7 a “sub” was denoted as a “delins”.

What defines a “locus”? Are loci variant or are alleles variant? How should the asymmetric calls Complete Genomics produces at some loci be interpreted?

Complete Genomics calls variants on each allele by comparing the assembly of that allele to the reference sequence. This process is repeated independently for each of the two diploid alleles at each autosomal locus. Bases in the genome with variants on either or both alleles in close proximity are grouped together as a single variant locus. For example, the middle three positions at the site below are considered one variant locus:

Reference: TAG TCG CCT
Allele 1:  TAG TTG CCT   (one ref + one SNP + one ref)
Allele 2:  TAG CAC CCT   (a 3-base block substitution)

How does Complete Genomics determine when to call a site with multiple variant bases as one locus or as multiple loci? For example, two neighboring variant bases could be coded as two SNPs or as one two-base block substitution.

Generally, if two or more reference bases on both alleles are called between two variant sequences, then the site is broken into smaller events.
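For length-conserving alignments, this grouping rule can be sketched as follows. This is a simplified illustration of the splitting behavior, not the production algorithm:

```python
def split_variant_loci(ref, allele1, allele2, min_ref_run=2):
    """Group aligned variant columns into loci: columns where either allele
    differs from the reference belong to one locus unless separated by a
    run of at least `min_ref_run` columns that match the reference on BOTH
    alleles. Returns 0-based half-open (start, end) intervals. Assumes a
    length-conserving, gap-free alignment (an assumption of this sketch)."""
    var_cols = [i for i, (r, a, b) in enumerate(zip(ref, allele1, allele2))
                if a != r or b != r]
    loci = []
    for i in var_cols:
        # Gap between this variant column and the end of the current locus
        # is the number of intervening all-reference columns.
        if loci and i - loci[-1][1] < min_ref_run:
            loci[-1][1] = i + 1          # extend the current locus
        else:
            loci.append([i, i + 1])      # start a new locus
    return [tuple(l) for l in loci]
```

On the example above, the three middle columns group into a single locus; in the later “individual B” example, the SNP and the distant 3-base substitution split into two loci.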

Each variant allele has been identified as allele “1” or “2”. Does that mean that all of the allele 1 variants are located on the same parental chromosome?

No. The allele number is assigned arbitrarily at each locus and does not indicate phase. Where phase is determined, generally because variants are within the same vicinity, the haplink field in the variations file will be populated to indicate this. Variant alleles with the same haplink ID are known to be in cis-phase, that is, on the same parental chromosome.

Note that prior to pipeline version 1.8, the “allele” column in the variation file was called “haplotype”.

What do “N” and “?” in calls mean? Are they always in alleles marked as nocalls?

An “N” indicates that a specific base could not be resolved on the allele in question; however the flanking (non-N) sequence may have been called. A “?” indicates that the unresolved region may include zero or more unknown bases. For example, “ATGC?” means that the exact number and composition of bases (if any) immediately after ATGC on that allele could not be determined.

Please explain “no-call-ri”, “no-call-rc”, “ref-consistent” and “ref-inconsistent” in the var file. How should I use these?

All no-call variant types indicate that the sequence could not be fully resolved, either because of limited or no information, or because of contradictory information. When some portions of the allele sequence can be called but others cannot, we indicate this as “no-call-rc” (no-call, reference-consistent) if the called portions are the same as the reference, and as “no-call-ri” (no-call, reference-inconsistent) if they are not. “Ref-consistent” and “ref-inconsistent” are the names for no-call-rc and no-call-ri, respectively, used by Complete Genomics pipeline versions prior to 1.7. We changed the names to highlight the fact that these alleles contain no-calls.

In some cases, one may wish to be conservative and consider any such region entirely no-called, and thus neither a match nor a mismatch between sample and reference.

What causes loci to be partially (or half) called?

For a small fraction of assembled loci, there can be support for one allele (reference or variant) but some ambiguity as to whether the other allele is supported by the data. This can happen, for example, when very few reads from one of the two chromosomes are seen. Also, in regions of low coverage, the algorithm may see reads consistent with a single allele (i.e., consistent with a homozygous call), but may judge that too few reads in total were seen to have had a good chance of sampling both chromosomes. In these cases the variation file reports a partial or half-called locus: a fully resolved allele (reference or variant) on one chromosome, but a no-call on the other.

Does Complete Genomics assume a diploid model when calling small variants?

Diploidy is not assumed when calling small variants. The small variant caller considers heterozygous hypotheses at a wide range of allele frequencies between 20% and 80%, including but not limited to 50%. This is to accommodate small variants that occur at sites of copy number variation as well as in samples that are not pure: for example, due to tumor heterogeneity or sample mosaicism.

Note that two variant scores are provided for each called allele: one derived from the probability of this call assuming variable allele fractions (allele1VarScoreVAF, allele2VarScoreVAF, or varScoreVAF), and one derived from the probability of the given call assuming equal allele fraction, or diploidy (allele1VarScoreEAF, allele2VarScoreEAF, or varScoreEAF). Additionally, triploid hypotheses are considered in the assembly optimization step, and the set of alleles in an evidenceInterval record may describe a triploid top hypothesis. Regardless of the models used to call small variants, the results of variation intervals where the top hypothesis is triploid will still be presented as two alleles at each locus.

Which score do I use when filtering my small variant calls for quality?

The varScoreVAF and varScoreEAF are the best indicators of variant quality (these correspond to allele1VarScoreVAF, allele2VarScoreVAF, allele1VarScoreEAF, and allele2VarScoreEAF in the master variations file masterVar). The varScoreEAF best reflects the quality of a call for variants at 50% allele fraction, while the varScoreVAF is a better score for variants at low allele fraction.

For reference-called positions, Complete Genomics provides scores in the coverageRefScore files in the REF directory, rather than the var or masterVar files. The reference scores within that directory are the best indicator of the quality of reference calls.

For variants, select the score based on the type of sample being studied, as follows:

The VarScoreEAF is based on the assumption that the sample is diploid. It will generally be the best fit for samples that are homogeneous and are not expected to exhibit gross copy number changes. Examples include population studies, samples representing congenital disorders, and the matched normal samples in tumor-normal pairs.

The VarScoreVAF is based on the assumption that the alleles in the sample can vary greatly from being diploid. It will generally be the best fit for samples that are heterogeneous and/or exhibit gross copy number changes. Examples include most tumor samples as well as samples expected to contain mosaicism.

When there is not enough information about the sample to determine the best score approach, Complete Genomics recommends using varScoreVAF as the general-purpose variant score.
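This guidance can be turned into a simple masterVar row filter. The function below is a sketch: the 40 dB default threshold is illustrative only (not an official cutoff), and the score column names follow the masterVar format described above.

```python
def passes_quality(row, min_score=40, heterogeneous_sample=True):
    """Keep a masterVar small-variant row (a dict keyed by column name) if
    both allele scores clear a threshold. Uses varScoreVAF for heterogeneous
    samples (tumors, mosaicism) and varScoreEAF for homogeneous diploid
    samples. The 40 dB default is an illustrative choice, not an official
    recommendation."""
    key = "allele{}VarScoreVAF" if heterogeneous_sample else "allele{}VarScoreEAF"
    scores = [int(row[key.format(i)]) for i in (1, 2)
              if row.get(key.format(i), "") not in ("", "N/A")]
    # Require at least one populated score and that the weaker allele passes.
    return bool(scores) and min(scores) >= min_score
```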

How does Complete Genomics handle mitochondrial sequences?

Mitochondrial sequences are treated as having a ploidy of 1. The circular nature of mitochondrial DNA is taken into account so that coverage is not suppressed at the start and end of the mitochondrial chromosome.

How does Complete Genomics handle the sex chromosomes?

In males, the majority of the X chromosome is treated as having a ploidy of 1 while in females the X chromosome has a ploidy of 2. In males, variants in the pseudoautosomal region of the Y chromosome are reported on the corresponding regions of the X chromosome, where ploidy 2 is assumed. The pseudoautosomal region of the Y chromosome itself will be indicated as “PAR-called-in-X” in the variant file.

How does Complete Genomics handle regions of the genome where multiple divergent references are known, such as MHC?

Areas of the genome that are highly variable are assembled using the default reference sequence at this time. Therefore, the no-call rate may be higher than other locations of the genome. We are looking into improved calling methods for these regions in the future.

I see loci in the variant file with the same start and end position and a “?” for the sequence. What is a zero length no-call?

This is a locus in the genome where we cannot rule out the possibility that there is an insertion present.

Does Complete Genomics call block substitutions?

Yes, we find both length-conserving and length-changing block substitutions in our assembly process, at both homozygous and heterozygous loci. In many genomes, we find a number of these substitutions where a portion of the locus is a known variant (such as a SNP), while the remainder of the substitution is novel and called with high confidence.

What source is used for calling variants?

We presently call variants relative to the NCBI reference in each genome sequenced. This facilitates comparison between any set of samples desired.

Complete Genomics develops an open source tools package, Complete Genomics Analysis Tools (CGA™ Tools), for downstream analysis of Complete Genomics data. Currently, CGA Tools contains tools for comparing variants between two genomes. We are working on additional methods for multiple sample and other comparison tools. For more information on CGA Tools, see the Complete Genomics website.

Are known variants (such as those in dbSNP) considered when assembling and calling loci?

Yes. Known variants are used as a supplementary source of seeds for local de novo assembly when searching for candidate variants. Knowledge of which variants are known and which are novel is not used in variant scoring. The set of known variants used to supplement local de novo assembly is comprised of indels and short block substitutions from dbSNP (dbSNP 130 for Build 36, and dbSNP 132 for Build 37) and the Complete Genomics Diversity Panel (69 genomes using assembly pipeline version 1.10).

Can Complete Genomics use a different reference or directly assemble one genome against another?

No. Currently Complete Genomics does not use any reference other than NCBI, nor can we directly assemble one genome against another.

Does Complete Genomics remove duplicate reads?

Any pair of DNBs from the same library that have at least one arm whose initial mappings have a common mapping (based on chromosome, offset, and strand) are considered candidates for deduplication. Each pair of candidate duplicate DNBs is evaluated for sequence similarity. If the DNBs have at most four discordances for each arm (up to two discordances per read), allowing the gaps to differ by up to two bases (except for the clone end reads), they are considered duplicates, and one of the two DNBs is selected at random for removal. De-duplication is performed such that it does not affect the initial mappings or coverage, but it does apply to all of the following:
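The rule above can be roughly sketched in code. This is an illustration only: the record layout ('id', 'mapping', 'reads') is a simplified stand-in for real mapping data, and the gap-difference and clone-end-read conditions are omitted.

```python
import random
from collections import defaultdict

def mismatches(a, b):
    """Count positionwise base differences between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def deduplicate(dnbs, max_per_read=2, seed=0):
    """Sketch of the duplicate-removal rule: DNBs sharing an initial mapping
    (chromosome, offset, strand) are candidate duplicates; if each read
    differs by at most `max_per_read` bases, one of the pair is dropped at
    random. `dnbs` is a list of dicts with 'id', 'mapping', and 'reads'
    keys -- a hypothetical simplified record, not the real file format."""
    rng = random.Random(seed)
    by_mapping = defaultdict(list)
    for d in dnbs:
        by_mapping[d["mapping"]].append(d)
    removed = set()
    for group in by_mapping.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if a["id"] in removed or b["id"] in removed:
                    continue
                if all(mismatches(ra, rb) <= max_per_read
                       for ra, rb in zip(a["reads"], b["reads"])):
                    removed.add(rng.choice((a["id"], b["id"])))
    return [d for d in dnbs if d["id"] not in removed]
```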

What annotations does Complete Genomics provide describing the variants called?

Complete Genomics currently annotates called variants using five external data sources:

dbSNP: to cross-reference variants

RefSeq: to identify overlap with and impact to genes (as well as some ncRNAs)

Database of Genomic Variants (DGV): to cross-reference called CNVs

COSMIC: to cross-reference variants

miRBase: to identify variants that overlap with microRNAs

How does Complete Genomics compute the functional impact of variants in coding regions?

Complete Genomics uses the alignment data from the seq_gene file contained in a NCBI annotation build (see What is NCBI build 36.3 (or 37.2)? How does it differ from build 36 (or 37)?) to compute the location of the variant within the RefSeq mRNA sequence. The variant sequence is then substituted into the corresponding location within the mRNA sequence. The resulting nucleotide sequence is translated to protein sequence using the appropriate codon table (standard code or vertebrate mitochondrial code). This permuted protein sequence is then compared to the RefSeq protein sequence, and the impact, if any, is noted. Synonymous changes are noted as such.
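A stripped-down version of this substitute-and-translate step is sketched below, using the standard genetic code only. It is illustrative: the mitochondrial code, frameshifts, splice effects, and the RefSeq alignment lookup are all omitted.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG codon order; '*' marks stop codons.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AA[i] for i, c in enumerate(product(BASES, repeat=3))}

def translate(cds):
    """Translate a coding sequence codon by codon with the standard code."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))

def coding_impact(cds, pos, ref, alt):
    """Substitute `alt` for `ref` at 0-based CDS offset `pos`, translate
    both sequences, and compare -- a simplified version of the annotation
    step described above."""
    assert cds[pos:pos + len(ref)] == ref, "ref allele does not match CDS"
    mutated = cds[:pos] + alt + cds[pos + len(ref):]
    return "synonymous" if translate(cds) == translate(mutated) else "non-synonymous"
```

For example, in the toy CDS ATG GCT TGA, changing the third base of the alanine codon (GCT to GCC) is synonymous, while changing its second base (GCT to GTT, valine) is not.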

An important consideration is that RefSeq mRNA sequences often differ from corresponding sequences in the reference genome. This difference is because the reference genome is derived from a single individual at any given locus (the same individual was not used for the entire genome), meaning that it will contain alleles different than those seen by RefSeq curators, particularly when the reference genome allele is the rare allele. Complete Genomics explicitly annotates calls that are variant with respect to the reference genome but yield protein sequences identical to that reported in RefSeq.

What version of the reference genome database is used? What versions of the annotation databases are used?

Customers can choose either NCBI build 36 (corresponding to Hg18) or NCBI build 37 (Hg19) as the reference genome. The gene annotations are those provided in each build. For Build 36, the known-variant annotations are from dbSNP 130; for Build 37, the known-variant annotations are from dbSNP 131, or dbSNP 132 for data generated on assembly pipeline version 1.11 or greater. Prior to software version 1.8, the genome build was NCBI Build 36 and dbSNP 129; the format version was indicated in the #VERSION header.

What is NCBI build 36.3 (or 37.2)? How does it differ from build 36 (or 37)?

NCBI build 36 refers to the genome build—the FASTA files describing chromosomes 1-22, X, and Y. Build numbers such as 36.1 describe annotation releases provided by the NCBI, where features such as mRNAs from RefSeq are mapped to the genome build. There may be multiple annotation releases, each with a different version of the annotation source data, all mapped to the same reference genome build. For example, builds 36.1, 36.2, and 36.3 all contain annotations mapped to genome build 36, but the RefSeq sequences used in each are from three different points in time. Similarly, build 37.1 contains annotations mapped to genome build 37.

The nucleotide and amino acid sequence of the RefSeqs used by NCBI in the production of an annotation build can be obtained interactively using the Entrez website (http://www.ncbi.nlm.nih.gov/Entrez/) or in batch using NCBI eUtils (http://eutils.ncbi.nlm.nih.gov/). The accession and version specified in the seq_gene file must be used in your query to ensure you are using the exact sequence described in the alignments.

What scores are produced for variant calls? What score thresholds are used?

In all putatively variant regions, the assembler considers many hypotheses (essentially, possible consensus sequences) and computes probabilities of the observed read data under each of these hypotheses. We perform a likelihood ratio test between the most likely hypothesis and the next most likely, and we express this score in decibels (dB). Bioinformaticists will recognize dB as the basis of the Phred scale: 10 dB means the likelihood ratio is 10:1, 20 dB means 100:1, 30 dB means 1000:1, etc. The variant scores factor in quantity of evidence (read depth), quality of evidence (base call quality values), and mapping probabilities. The column header for the variation score is “total score” in the variations file.
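The decibel conversion is simply ten times the base-10 logarithm of the likelihood ratio:

```python
import math

def db_from_likelihood_ratio(lr):
    """Convert a likelihood ratio to the decibel (Phred-like) scale used
    for variant scores: 10:1 -> 10 dB, 100:1 -> 20 dB, 1000:1 -> 30 dB."""
    return 10 * math.log10(lr)

def likelihood_ratio_from_db(db):
    """Inverse conversion: a 30 dB score corresponds to a 1000:1 ratio."""
    return 10 ** (db / 10)
```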

Scores for variants are not calibrated on an absolute scale to error rate. A score of 30 dB does not necessarily indicate that the P(error)=0.001.

Can I compare scores across variants?

Scores for variants can be compared, but only robustly within a specific class of variant, as the probability model for each class of variant is slightly different. For example, a SNP call with a score of 60 is more likely correct than a SNP call with a score of 50, and a deletion with a score of 60 is more likely correct than one at 50. However these numbers do not precisely indicate the strength of evidence for either of the SNPs relative to that for either of the deletions.

The score calibration produced by Complete Genomics may provide insight on the actual quality of scores, using varType, zygosity, and local coverage. For more information on score calibration, see Calibration Methods.

Complete Genomics produces separate scores for the two alleles at a locus. How do I interpret this information?

Roughly, the higher of the two scores can be considered the strength of evidence that this allele is present at the locus. The lesser score similarly indicates the strength that we have fully called the complete diploid genotype correctly at the locus.

What criteria are used to call a region homozygous reference?

We report the strength of evidence for homozygous reference calls in the coverageRefScore files (see “What is the Reference Score and what is it used for?”). The reference score is one metric used to flag regions of the genome for de novo assembly. (A second metric, based on De Bruijn graphs, is designed to also flag regions for de novo assembly containing indels and block substitutions, where mapping individual reads is more difficult.)

To call an unassembled region homozygous reference, a reference score of 10 dB or higher must be achieved. To achieve this score, typically at least four reads need to map to the position and be consistent with the reference sequence, although the precise number needed depends on mapping probabilities, base-call quality scores, and the number of concordant and discordant calls. In addition, a small number of homozygous reference regions are determined by the assembler.

Can I change calling thresholds to detect more true variants (at a higher false-positive rate) or detect fewer (at a higher false-negative rate)?

The variants shipped by Complete Genomics all meet our minimum thresholds and we do not report possible variants below that level. Thus it is not possible to drop thresholds below this level.

What depth of coverage is required to call a variant allele?

Various parameters are taken into account when calling and scoring variant alleles, such as coverage depth, mapping probabilities of reads, and base-call qualities. Heterozygous variants marked as VQHIGH generally require at least two high quality, well mapped reads per allele. Homozygous variants marked as VQHIGH generally require at least seven reads. Variants marked as VQLOW may have fewer reads supporting the call, and are accompanied by a lower score indicating the lower confidence in the call.

Note that most variants are called with much greater read count support (depth of coverage) than these minimums, and we find that the scores (reference scores and variant scores) are excellent indicators of call quality.

What criteria are used to no-call a region on one or both alleles?

Sites that do not meet the criteria to be called either homozygous reference or variant are considered no-calls. Loci can be partially (or half) called, where one allele sequence is determined but the other is not.

Alleles can also be “incompletely” called in some cases, which is a different behavior than a partial call. In this case, some of the bases of an allele are determined at the minimum threshold but others are not. These alleles will have “N” or “?” in their sequence, and will be marked as “no-call-rc” or “no-call-ri”. Some loci are considered partially called because one of the two alleles is incompletely called.

How are duplications and highly conserved repeats in the reference handled?

Mate-pair information is used both in mapping and to recruit reads for assembly. Even when the initial scan indicates that one or both ends of a clone may have multiple possible locations in the genome, the pair may have a single location consistent with the library’s known orientation and distance of the read ends. We are thus able to assign many reads to assemblies even in non-unique regions and accurately call variants in those regions.

Even when a single location for a mate-pair is not indicated by the initial mappings, we can allow these reads to participate in more than one assembly, weighted by their mapping probabilities. Recall that assembly is both more sensitive than initial mapping (as it can allow greater degrees of mismatch and yet still align) and more stringent (as it demands that the accepted set of reads consistently explain the final consensus sequence). We have found that this approach allows us to accurately call variants in many (certainly not all) duplications. When a read contributes to variation calls at multiple loci due to sequence similarity, the scores for all affected calls are adjusted down to reflect the correlated evidence. This is not only reflected in the variant scores but also is reported in the correlations file contained in the EVIDENCE folder. If the correlations are too high, which happens when the duplicate regions cannot be well discriminated, then scores will be below threshold and some or all variations within the correlated regions are no-called.

How do I identify somatic small variants?

For samples submitted through the Cancer Sequencing Service, each tumor sample is compared to the baseline sample within the pair or trio. The small variants that are unique to the tumor genome and not present in the normal can be isolated by looking at either the somaticRank column or the somaticScore column in the master variations file. The somaticRank column provides the estimated rank of the mutation amongst all true variants for the specific classification of variant. The value is a number between 0 and 1, and is empty for mutations that are not somatic. The somaticScore column indicates the quality for all somatic mutation calls, and is empty for mutations that are not somatic.
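As an illustration, somatic rows can be pulled out of a masterVar file by testing the somaticScore column. This sketch assumes a tab-separated file whose header row (prefixed with ">" in real files) includes somaticScore and somaticRank columns, as in the documented masterVar format.

```python
import csv

def somatic_calls(mastervar_path, min_somatic_score=None):
    """Yield masterVar rows flagged as somatic, i.e. rows whose somaticScore
    column is non-empty, optionally requiring a minimum score. '#' metadata
    lines are skipped and the '>' header prefix is stripped; other format
    details of real masterVar files are not handled in this sketch."""
    with open(mastervar_path) as fh:
        lines = [l.lstrip(">") for l in fh if l.strip() and not l.startswith("#")]
    for row in csv.DictReader(lines, delimiter="\t"):
        score = row.get("somaticScore", "")
        if score == "":
            continue                      # empty column: not a somatic call
        if min_somatic_score is None or float(score) >= min_somatic_score:
            yield row
```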

For samples submitted through the Standard Sequencing Service, somatic events can be identified using cgatools calldiff, which includes somatic scoring.

How do I find the variants which might be disrupting or changing a known protein coding gene?

The information provided by Complete Genomics in the geneVarSummary and gene files may help. See Complete Genomics Data File Formats for details. Alternatively, you can use the coordinates in the variation file to compare against any database of annotations you wish.

Are SNPs or indels called more likely to be true-positives?

In our Science paper (Drmanac et al, Science 2010) we showed that the false positive rate for indel variants was somewhat higher than that for SNPs at the thresholds used. This is consistent with additional data, including family studies, which we have analyzed since. We expect these methods to continue to improve over time.

Are novel vs. known variants more likely to be true-positives?

Random sequencing errors are most likely to appear as novel variants. Depending on the goals of the analysis, one’s statistical prior on the P(error) of a novel variant call might be greater than that for a known variant call.

How do I find variants that might be disrupting or changing a known locus that is not a protein coding gene?

The gene file provided by Complete Genomics summarizes changes in the coding portion of transcripts annotated on the genome in the NCBI build. For other annotated loci in the genome, one would need to look in the variations file by chromosome and position.

How do I separate novel vs. known SNPs and indels?

The dbSNP file provided compares the results of a Complete Genomics assembly to known variants in dbSNP, and reports a genotype of each site based on the Complete Genomics sequence data. The versions of dbSNP that we are presently using (for Assembly Pipeline version 1.11) are 130 for NCBI Build 36 and 132 for Build 37 (note: dbSNP Build 130 incorporated the 1000 genomes project data). As of pipeline version 1.8 release, we have added the dbSNP version number for when each SNP was added to the database. This can be helpful for filtering novel SNPs from different dbSNP database releases. When looking at these data, please keep in mind that most estimates of the error rate in dbSNP are relatively high, as are most estimates of the error rate in publicly available dbSNP genotypes of reference samples. As these rates are generally thought to be higher than the error rate of Complete Genomics sequence data, a number of discrepancies are to be expected.

How do I compare variants between two or more samples?

Complete Genomics has an open source tools package, Complete Genomics Analysis Tools (CGA™ Tools), for downstream analysis of Complete Genomics data. For more information on CGA Tools, see the Complete Genomics website.

When is comparing SNPs between samples problematic?

As discussed in the section above, we believe Complete Genomics has excellent sensitivity to detect not only SNPs but also insertions, deletions and block substitutions. However as we look at many genomes with this method, we discover that the rate of non-SNP alleles we call is high, and in some cases they occur at loci with simple SNP variants on the other allele. This leads to some complexity when comparing samples.

To illustrate, we’ll re-use an example shown previously and refer to it as individual A:

Position:  123 456 789 012
Reference: TAG TCG CCT ACG   (locus includes bases 4 to 6)
Allele A1: TAG CAC CCT ACG   (3-base block substitution)
Allele A2: TAG TTG CCT ACG   (one ref + one SNP + one ref)

This same location in another genome (individual B) might contain the following called as two separate loci because the changes are further apart:

Position:  123 456 789 012
Reference: TAG TCG CCT ACG
Allele B2: TAG CCG CCT ACG   (locus #1 at base 4, one het SNP)
Allele B1: TAG TCG CCA TGG   (3-base het sub, locus #2)

Methods to handle such situations vary depending on the scientific goal. Questions for choosing a method would include whether you wish to consider, for example, Allele B2 above (a single SNP) the same variant as the corresponding part of Allele A1 (the block sub, which includes the same SNP). If so, one would consider A1 as having a second variant locus at position 6, distinct from the other SNP. Alternatively one might wish to consider all four of these alleles shown as distinct versions of this locus.

We see a few modes of comparing data at this level of detail that have different uses. One method is to maximally break loci into SNPs, which may be required when comparing Complete Genomics data against (array or sequence based) data sets focused on SNP calls. Another would be to group these loci together into blocks when looking across samples to check for conservation of the entire region.

Complete Genomics also has an open source tools package, Complete Genomics Analysis Tools (CGA™ Tools), for downstream analysis of Complete Genomics data, such as performing comparisons of SNP calls between two samples. For more information please contact support.

How do I compare indel variants between two or more samples?

A further complication arises in indels and in non-length-conserving block substitutions. The exact point where any algorithm will define the start and stop of the change is based on rules, but even small differences can move the start and end coordinates of a variation, making comparison based on coordinates more difficult. For example:

Reference: ATAATTTTTTTTTGTGTGTGT
Allele 1:  ATAATTTTTTTT-GTGTGTGT
Allele 2:  ATAATTTTTTTTT-TGTGTGT

Homopolymers and simple sequence repeats (SSRs) (such as AAAAA, CACACA…, or TAGTAGTAG…) present most obvious examples of this problem, where the choice of indel point is essentially arbitrary. While a fixed rule for defining that point does work well, it fails to provide consistency when a handful of other sites in the SSR sequence have changed (errors or real variation), which can greatly influence the alignment.
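One common fixed rule is left-alignment: shift the indel to its leftmost equivalent position so that coordinates become comparable across callers. The sketch below illustrates this for deletions only; it is a generic normalization technique, not necessarily the exact rule used by any particular pipeline.

```python
def left_shift_deletion(seq, start, length):
    """Shift a deletion to its leftmost equivalent position. `seq` is the
    reference sequence and [start, start + length) the deleted span
    (0-based). Each step moves the deletion left by one base whenever the
    base entering the span from the left equals the base leaving it on the
    right, which keeps the resulting sequence identical."""
    while start > 0 and seq[start - 1] == seq[start + length - 1]:
        start -= 1
    return start
```

In the homopolymer example above, a one-base T deletion reported anywhere in the T run normalizes to the start of the run.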

For example, consider a complex variant, spanning bases 12 through 18 in the alignment shown below.

           1234567890123456789012
Reference: GGAACTGAACA-----GCTAGC
Allele 1:  GGAACTGATAAGAAATGCTAGC
Allele 2:  GGAACTGAAGA-------TAGC

With a single base change, Allele #2 (and the entire diploid locus) could be assigned a different start and end position (10 and 16):

           1234567890123456789012
Reference: GGAACTGAACA-----GCTAGC
Allele 1:  GGAACTGATAAGAAATGCTAGC
Allele 2:  GGAACTGAA-------GCTAGC

Complete Genomics also has an open source tools package, Complete Genomics Analysis Tools (CGA™ Tools). Use CGA Tools for downstream analysis of Complete Genomics data, such as performing comparisons of indel calls between two or more samples. For more information please contact support.

How should I handle alleles with N’s or ?’s when comparing variants between two samples?

In many cases, one may wish to be conservative and consider any such region entirely no-called, and thus neither a match nor a mismatch between the samples.

For more precision, one can use the notion of compatible vs. incompatible alleles. Alleles are “compatible” if one can align the two and in doing so does not reject the hypothesis that they are the same. Incompatible alleles are those that must be different according to this type of analysis. For the reasons described above, an alignment based analysis is required to avoid falsely calling sites different that are in fact compatible.
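This notion of compatibility can be sketched as a pattern-intersection check, treating “N” as any single base and “?” as any run of zero or more bases. This is an illustrative implementation of the idea, not an official algorithm.

```python
from functools import lru_cache

def compatible(a, b):
    """True if allele strings a and b could represent the same sequence:
    'N' matches any single base, '?' matches any run of zero or more bases
    (on either side). Implemented as a memoized two-pointer recursion over
    both strings."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(a) and j == len(b):
            return True                       # both alleles fully consumed
        if i < len(a) and a[i] == "?":
            # '?' absorbs nothing, or one more symbol of b.
            return go(i + 1, j) or (j < len(b) and go(i, j + 1))
        if j < len(b) and b[j] == "?":
            return go(i, j + 1) or (i < len(a) and go(i + 1, j))
        if i < len(a) and j < len(b) and (a[i] == b[j] or "N" in (a[i], b[j])):
            return go(i + 1, j + 1)
        return False
    return go(0, 0)
```

For example, “ATGC?” is compatible with “ATGCTT” (the “?” can stand for “TT”), while “AT” and “AC” are incompatible.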

I am doing a family study. Can I use the constraints of Mendelian inheritance to further reduce errors? How?

Yes. We note a few things to take into account. First, simple Mendelian constraints on a variant by variant basis will detect a number of errors, but certainly not all, particularly in smaller families. A more powerful method has recently been presented by the Institute for Systems Biology using Complete Genomics data (Roach et al, Science 2010). This method involves first using the variant data to build a complete high-resolution recombination map of the family. This map allows greater power to detect errors than the simpler method. With families larger than a trio the power of this method to detect errors can be quite high.

How big are local assemblies and how much of the genome is typically assembled this way?

Complete Genomics local de novo assemblies are typically 30-40 bp, although they can be smaller or much larger (100s of bp). Approximately 5-10% of the genome is typically assembled this way.

Do variations reported correspond one-to-one with individual assembled regions? What are the implications of this on phasing closely neighboring variants?

No, one assembly can contain multiple variant loci. It’s possible to phase closely neighboring variants if they are inside the same local assembly. The variant file has a haplink column that indicates phasing, if it can be determined.

How do I see the assembly around a variant call?

By chromosome and position, you can find the appropriate row in the corresponding evidenceIntervals file. These data provide the assembler results (essentially, consensus sequences) for each allele, with a gapped alignment of each result against the reference. The underlying reads in each assembly can be found in the corresponding evidenceDnbs file, by looking up records (rows) using the evidence interval ID found in the evidenceIntervals file.

Why is the reference allele always included in an evidence interval even when it is not called (that is, when neither allele appears to be reference sequence)?

The possibility of one or both alleles at a site being reference is always considered as a possible hypothesis in the likelihood ratio tests. Essentially, we demand that the data disprove this hypothesis, which is an appropriate null in that most sites in the genome of any sample are indeed reference.

Why are there reads in the evidenceDnbs file assigned to a particular locus that are not mapped to that location in the initial mappings files? Similarly, why are there reads mapped to a location that are not in the evidenceDnbs file?

The assembly process has both greater sensitivity and greater specificity than the initial mapping process. Reads that were sufficiently different from the reference (for example, those containing many indels or groups of SNPs), or that had multiple possible initial mapping locations, may not be initially mapped to the variant site but can be brought into assembly using the mapping of the corresponding mate-pair.

Conversely, reads initially mapping to a region but proving inconsistent with the preponderance of evidence in an assembly can be down-weighted or excluded. Moreover, reads must have their mate-pair mapped near a variant region to participate in an assembly; thus reads with only one end mapped are presently always excluded from assembly.

How do I find the evidence underlying a site called homozygous reference? How do I know if another interpretation (such as variant) might be possible?

Complete Genomics does not produce assemblies (evidence intervals) for regions of the genome where the mapped reads are highly consistent with the reference sequence. Assemblies that are performed but in which no variant is found are likewise not reported (an unusual case, however). Thus, variation scores are also not produced for these regions.

Instead, Complete Genomics computes a “reference score” for every base in the reference genome that is reported in the coverageRefScore file. This score indicates whether the corresponding mapped reads are consistent with the reference sequence (positive values) or not (negative values). This score is an excellent predictor for the strength of evidence for homozygous reference calls. See What is the “Reference Score” and what is it used for?

How do I see evidence for any possibility other than the called variants at a particular locus?

Complete Genomics provides an assembly for the reference allele for all sites, including those called diploid variant.

Determining evidence for other alternate hypotheses is not easily supported by our assembler output at this time. Users would need to perform read-level analysis, essentially recapitulating the details of the recruitment and assembly processes at a site (as described in Drmanac et al., Science 2010).

How should I interpret the depth of coverage in Complete Genomics’ coverageRefScore files?

As of Complete Genomics pipeline version 1.7.1, our calculation of depth includes only bases of reads that, based strictly on the initial mapping results, are highly likely to be uniquely placed. Thus, it generally undercounts actual coverage in areas of duplication or repeats. The initial mapping algorithm is also not resilient to indels or to high degrees of divergence between the sample and the reference, so those regions are also undercounted.

In pipeline version 1.7.2 we added an additional “weighted depth” metric to the coverageRefScore files. This score may be better for some purposes (such as CNV detection) because it also counts non-uniquely mapped reads, giving them fractional counts corresponding to the confidence of each mapping. Like the unique depth metric, it does not consider post-assembly depth. This value is used as input to our CNV detection pipeline.

In pipeline version 1.10 we added a “GC-bias corrected depth” metric to the coverageRefScore files. The depth-of-coverage approach we take to call CNVs assumes that the number of reads mapping to a genomic region is proportional to the genomic copy number of the region. This assumption is violated when the distribution of mappings is biased (for example, by high or low GC content in the DNA sequences). To correct for this bias, coverage is adjusted for GC content calculated over a fixed window. In pipeline version 1.10 a “gross weighted depth” metric was also added to the coverageRefScore files. This value represents the number of half-DNBs that may map to this location, each weighted by its mapping weight ratio.

What is the “Reference Score” and what is it used for?

Complete Genomics computes a value called the reference score reported in the coverageRefScore file. This score indicates whether the corresponding mapped reads are consistent with the reference sequence (positive values) or not (negative values). This score is an excellent predictor for the strength of evidence for homozygous reference calls.

Similarly to the method by which variant scores are computed, the reference score is the log-likelihood ratio of P(ref) over P(non-ref), expressed in dB, where the P(non-ref) involves examining only a limited number of alternate hypotheses. These include all possible SNPs at every position in homozygous and heterozygous form, plus, at selected positions, one-base insertions and deletions, as well as some changes in homopolymer or tandem repeat length. This computation is performed based on the initial mapping results and, like the variation scores, is not precisely calibrated to P(error). Reference scores are also not precisely calibrated to variation scores.
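As a concrete illustration (not taken from this FAQ): a score in dB is ten times the base-10 logarithm of the likelihood ratio, so a ratio of 100:1 in favor of the reference corresponds to a reference score of 20 dB.

```shell
# Convert a likelihood ratio P(ref)/P(non-ref) to decibels: 10 * log10(ratio).
# awk has only natural log, so divide by log(10). Prints "20 dB".
awk 'BEGIN { ratio = 100; printf "%.0f dB\n", 10 * log(ratio) / log(10) }'
```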

In spite of the lack of calibration, a reference score in one sample can be considered against the variation score of another sample to assist in sample-sample comparison, particularly when asking whether a variant seen in one sample might be a false negative in another.

I see very high levels of depth in some locations. Why?

This happens at some repeats; the centromeric regions are the most extreme example. Conversely, many repetitive genomic regions have the “overflow” flag set on all reads in the mappings file because the reads map to too many sites to be computationally tractable; these regions have no mappings reported and can appear as coverage zero.

Why does the coverage vary from one base to the very next one?

This is a consequence of mapping reads with intra-read gaps and is consistent with our knowledge of the biochemistry and alignment properties of these reads.

I just received a shipment of data from Complete Genomics. What should I do first?

After receipt of your data, you should immediately verify that all data files are present and uncorrupted (see How do I verify that the data files are present and uncorrupted? When should I do this?). Complete Genomics strongly recommends that you make backup copies (at least two separate copies on separate devices) of any critical data. Complete Genomics makes no commitment to retain data after delivery. If you delete or lose data that you have not backed up, it is not retrievable.

If I am just using the variant files and other processed output, can I get rid of the reads and initial mappings?

It is up to you to determine which data you need to archive, but keep in mind that Complete Genomics does not retain customer data, so any data you permanently delete is irretrievable. Also, recall that all disk drives, including those sent by Complete Genomics, have a finite lifetime and a failure rate. Complete Genomics strongly recommends that you make and keep backup copies (at least two separate copies on separate devices) of any critical data.

If you intend to publish your results, then you may be required by the journal or by your funding source to submit the reads to a central database. You may wish to investigate any such requirement before making decisions about data retention.

If you will be focusing on the processed data from Complete Genomics (such as variant calls) but wish to retain the reads and initial mappings, you may want to consider storing them on slower, less expensive storage than the other files. Cloud storage such as Amazon Web Services (AWS) may also be an option worth considering. AWS is an infrastructure web services platform that provides remote computing power, storage, and other services.

What type of computer do I need to transfer the data from a Complete Genomics disk drive?

Disk drives provided by Complete Genomics are formatted using NTFS, which is readable by most operating systems.

What is the fastest method for copying data from drives shipped by Complete Genomics? How long will it take?

Drives are shipped with a USB 3.0 interface. Expect the transfer of one standard (40x) genome from the drive to the computer to take 15-30 minutes, depending on other aspects of the system.

If you are connecting the drive to one computer and transferring the data through your network to a second computer (such as a file server), then the network speed will also greatly impact the time required. Almost any wireless network will be quite slow, as will either a 10Mbit or 100Mbit wired connection. Furthermore, when using these types of networks, your data copy may effectively monopolize the network and impact other users. A 1-gigabit wired network with good quality switches is strongly preferred. You may even wish to set up a dedicated subnet for this purpose.

We recommend that you contact your computer support staff for help on these issues. Complete Genomics can only be of limited assistance because each computing and network environment is unique.

How do I verify that the data files are present and uncorrupted? When should I do this?

As of Complete Genomics’ pipeline version 1.9, we provide a manifest file containing SHA-256 checksums of each file, suitable for use with the sha256sum tool present on most Linux operating systems (and available for many other platforms). The checksums are computed on all the delivered files in our EXP package (except for the manifest.all and manifest.all.sig). If the files are uncompressed and recompressed then the SHA-256 hashes may not match.

Assuming the data is copied to another system immediately upon receipt (both to provide working storage and as a backup), customers should check the SHA-256 sums on the copy made, and if any problems arise check the SHA-256 sums on the original hard disk drive. Customers should immediately contact support if any issues are noticed.

If no errors were reported at either step, the verification was successful.
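The verification described above can be sketched with `sha256sum -c`. The manifest name (manifest.all) is from this FAQ; the directory and sample file below are small stand-ins for a real delivery.

```shell
# Create a stand-in for a delivered data file and its manifest.
mkdir -p delivery
printf 'example data\n' > delivery/sample.tsv
( cd delivery && sha256sum sample.tsv > manifest.all )  # stands in for the shipped manifest

# Verify: prints "sample.tsv: OK" per file and exits nonzero on any mismatch.
( cd delivery && sha256sum -c manifest.all )
```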

Can I put the Complete Genomics data files on a Windows PC and work with them?

Yes, but there are important caveats to keep in mind. Refer to What type of computer do I need to transfer the data from a Complete Genomics disk drive? for more information on compatibility of specific Windows versions. Further, while text files are in general the same between Windows and other operating systems (particularly Mac OS X, Unix, and Linux), Windows text files use a different convention for marking the ends of lines: Windows uses a CR-LF pair, while the other systems use LF alone. Because most Complete Genomics users have requested to work with the data on Unix, Linux, and Mac OS, Complete Genomics files are provided with LF-only line breaks. Some Windows software works well on these files despite the lack of a CR character, while other Windows programs require that the files be converted to work properly. Contact your Windows support staff for utilities that can help with this conversion.

Some Complete Genomics users prefer to use a Unix-like environment on their Windows machines to handle this data, most commonly the free Cygwin package. Cygwin can be installed in a mode where LF-only files are read and written (Cygwin can also be installed in a CR-LF mode, which is not recommended for working with our data). Note that using such an environment requires command-line Unix skills. In addition, the FAT32 file system on Windows allows a maximum file size of only 4 GB, which is too small for many genomic data sets, including many of the Complete Genomics data files. Contact your Windows support staff for help, particularly if you have an older computer or are using external drives (which often come pre-formatted with FAT32).
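The line-ending conversion mentioned above can be sketched with GNU sed (utilities such as unix2dos from the dos2unix package do the same job); the file names here are hypothetical.

```shell
# A small LF-only file, as Complete Genomics delivers them.
printf 'chr1\t100\tA\n' > lf-example.tsv

# Append a CR before each LF for Windows tools that require CR-LF endings.
# GNU sed interprets \r as a carriage return.
sed 's/$/\r/' lf-example.tsv > crlf-example.tsv
```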

Can I put the Complete Genomics data files on a Linux, Unix, or Mac OS X computer and work with them without having to do data format conversion?

Yes; this is what most of our customers do.

Should I uncompress the data?

Uncompressing all of the data files will increase the required storage for a single genome approximately 3- to 4-fold, typically to over 1.5 TB per standard genome and over 3 TB per high-coverage genome. Approximately ninety percent of this volume is used by the reads and mappings files, and most Complete Genomics customers leave these files compressed for this reason. Further, the utilities in CGA™ Tools work with compressed data.

In general, there are a number of methods for analyzing data in compressed files. For example, for those familiar with Unix and Linux commands, you can use streaming decompression to count the chromosome 21 entries in the dbSNP annotation file of a Complete Genomics genome (here, GS0000084-ASM) with the command:

bzcat dbSNPAnnotated-GS0000084-ASM.tsv.bz2 | grep chr21 | wc -l

Something seems wrong with a file or I deleted some data. Can Complete Genomics get me a replacement copy of it?

Complete Genomics does not retain data after delivery to a customer. Upon receipt of data from Complete Genomics, if any data appears to be missing or corrupted you should contact support immediately.

Where can I find the format version number for the data files? What is the difference between the format version and the software version numbers?

For Analysis Pipeline 2.0 and later, the versions of the data file formats and the software are synchronized to the same number. The version is provided in the version file located in the root directory (e.g., “/DID-ID/ASM-ID/GSXXXXX_DNA-YYY”) and also in the header of all Complete Genomics data files under the keys “FORMAT_VERSION” and “SOFTWARE_VERSION”.

For Analysis Pipelines prior to 2.0, data included a separate data format version number and software version number. All Complete Genomics data files contained a header (#FORMAT_VERSION) indicating the data format version number. In addition, a small text file named “version” was included in the individual genome results directory (for example, “GS00001-DNA-A01”) of the data file hierarchy for each genome containing the same value. We recommend that your processing programs check this version number to ensure compatibility.

Prior to software version 1.7, the format version was in the #VERSION header; we changed the name for clarity. Prior to software version 1.8, the version file was located in the top-level directory.
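A quick way to check these values from the command line, shown on a tiny stand-in file (real delivered files are bzip2-compressed, so pipe them through bzcat first, as in the streaming-decompression example elsewhere in this FAQ):

```shell
# Stand-in for the header of a Complete Genomics data file; the header key
# names are from this FAQ, the values are illustrative.
printf '#FORMAT_VERSION\t2.0\n#SOFTWARE_VERSION\t2.0\n' > header-demo.tsv

# Print the format and software version headers.
grep -E '^#(FORMAT|SOFTWARE)_VERSION' header-demo.tsv
```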

How do I determine which version of the reference human genome was used during mapping and assembly?

The header in Complete Genomics data files produced with pipeline version 1.7 or greater specifies the reference genome build used (#GENOME_REFERENCE). Data generated with pipeline versions prior to 1.7 (before this header was included) used NCBI build 36. Starting in software version 1.8, you can choose either NCBI build 36 (corresponding to Hg18) or GRCh37 (Hg19) as the reference genome.

How do I determine which version of the annotation databases (such as RefSeq or dbSNP) was used?

The header in Complete Genomics data files produced with pipeline version 1.7 or greater contains annotation version information for the relevant data sources (#GENE_ANNOTATIONS and #DBSNP_BUILD). Data generated with pipeline versions prior to 1.7 (before this header was included) always used dbSNP 129, NCBI build 36.3 and RefSeq alignments.

Complete Genomics sent me disk drives. Can I just work directly with the data on those devices?

With the exception of eSATA or USB 3.0 connections, working directly with data on external drives will be much slower than working with the files on either a local disk drive or on a good-quality file server over a fast network. It is also strongly recommended that you create a backup copy of your data before working with it (see I just received a shipment of data from Complete Genomics. What should I do first?). Apple® Mac® users mounting the NTFS hard drive will have read-only access to the files.

What is a .tsv file? Why is this extension not recognized by my OS or software?

TSV stands for “tab-separated values.” This generic format is also called “tab-delimited text.” Unfortunately, there is no consistent naming convention across all software and systems for this format. Some software defaults to looking for such data in .tab, .tdt, .text, or .txt files. Microsoft applications default to .csv (generally meaning “comma-separated values”) when importing delimited files. To find the file, select “All Files” in the file-type filter, then import the .tsv file into your software program of choice.
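Because a .tsv file is ordinary tab-delimited text, command-line tools can also read it directly without any import step. A toy example (the file contents are illustrative only):

```shell
# Toy tab-delimited file.
printf 'chromosome\tbegin\tend\nchr1\t100\t200\n' > toy.tsv

# Print the first two columns of each row, splitting on tabs.
awk -F'\t' '{ print $1, $2 }' toy.tsv
```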

I have a file derived from Complete Genomics-produced data that is not mentioned in the Data File Formats document. Who can help me with it?

Complete Genomics may be of limited assistance in understanding data that has been transformed by tools not supplied by Complete Genomics, or that has been further analyzed by non-Complete Genomics software.

What is Complete Genomics C++ library and API? Do I need to know how to program in C++ to use Complete Genomics data?

While the variants were provided in text files, Complete Genomics data prior to pipeline version 1.4 required use of a C++ library and API to access the individual reads and mappings. The C++ API does not work with data produced by pipeline version 1.4 or later (that is, data generated after September 2009).

I have the data transferred, backed up, and checked for integrity. What do I do next?

Genome Voyager™ FAQ

The Genome Voyager platform is an interpretation and reporting solution that enables institutions to analyze and interpret sequencing data based on gene panels, exomes, and whole genomes. The Genome Voyager platform has not been evaluated by the U.S. Food and Drug Administration or any other regulatory body for diagnostic use.

BETA release means that the core functionality of the solution has been implemented and has gone through internal testing. However, the solution is not yet production quality and may contain known and unknown bugs. Our BETA program will be limited in duration, in the number of participants, and in the amount of uploaded sequencing data. The platform may be unavailable for use for periods of time during updates. Access is offered without any representations or warranties. The goal of the BETA release is to get real-world testing and feedback to identify bugs and gather input for future product enhancements.