The algorithm version is v1.0.6, the original version used in conjunction with HAPSEG. For an overview of the original published workflow, using Affy SNP data, see the Using HAPSEG and ABSOLUTE in GenePattern page. The CGA group provides an example dataset for download to use in the workflow. Although data for twenty samples are provided, two are sufficient for the workflow. For an explanation of versioning on GenePattern, see the Concepts guide.

For an alternative workflow in development that uses next-generation sequencing data, contact CGA at recapseg@broadinstitute.org and ask for AllelicCapseg. AllelicCapseg segments sequencing data, including whole exome sequences, independently of the sequencing platform. The AllelicCapseg to ABSOLUTE workflow is recommended for sequence data.

Background

Elucidation of the sequence of the multiple genomic events that give rise to tumorigenesis is an ongoing area of research. Genomic events include functional mutations, genomic rearrangements including translocations and chromothripsis, gene conversion or loss of heterozygosity (LOH), and somatic copy number alterations (SCNAs) that range from regional and chromosomal amplifications and deletions to whole genome duplications (Burrell et al.). SCNAs can lead to gene dosage changes impacting phenotype; SCNAs and copy neutral LOH events at heterozygous or mutant loci can lead to unequal dose contributions of one allele over the other.

Current models measure somatic alterations in units of genomes or DNA mass, and the results are interpreted in the context of a tumor's purity and overall ploidy. However, to compare across samples, copy numbers should be measured in copies per cancer cell. Absolute copy numbers could be inferred by normalizing relative data against cytological measurements of DNA mass per cell or against single-cell sequencing data. Alternatively, ABSOLUTE can be used to mathematically model solutions of tumor cell purity and ploidy.
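As a rough illustration of how purity and ploidy enter such a model, the sketch below follows the mixture relationship described in Carter et al. (2012): a segment present at q copies per cancer cell, in a sample with tumor purity a and mean cancer-cell ploidy t, has an expected relative copy ratio of (a*q + 2*(1 - a)) / (a*t + 2*(1 - a)). The function names are illustrative only and are not part of the ABSOLUTE module.

```python
def expected_copy_ratio(q, purity, ploidy):
    """Expected relative copy ratio for a segment with q copies per cancer cell."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    # Average ploidy of the mixed sample: tumor cells plus diploid normal cells.
    sample_ploidy = purity * ploidy + 2.0 * (1.0 - purity)
    return (purity * q + 2.0 * (1.0 - purity)) / sample_ploidy


def absolute_copy_number(ratio, purity, ploidy):
    """Invert the relationship: copies per cancer cell implied by a copy ratio."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    sample_ploidy = purity * ploidy + 2.0 * (1.0 - purity)
    return (ratio * sample_ploidy - 2.0 * (1.0 - purity)) / purity


# A single-copy gain (q = 3) in a diploid tumor at 50% purity.
print(expected_copy_ratio(3, purity=0.5, ploidy=2.0))
```

In a pure diploid tumor the same gain would give a ratio of 1.5, so normal-cell contamination compresses relative copy ratios toward 1.0; this is why purity and ploidy must be modeled before ratios can be read as copies per cancer cell.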

Inferring absolute copy numbers and ranking solutions depends on three factors: (1) sample heterogeneity from copy ratios and mutation data, (2) karyotype models from a reference panel built into ABSOLUTE algorithm v1.0.6, and (3) allelic fraction from mutation data. Providing mutation data, though optional, is recommended.

(1) Sample heterogeneity. Samples are heterogeneous at two tiers. (i) Tumor purity is the fraction of cells in a sample that are tumor cells; the remainder are the normal cells, e.g. normal tissue and blood cells, that nearly always contaminate samples. Normal cells are diploid (2N) and are further identified by normal genotype. (ii) Tumor cell heterogeneity, if any, arises from polygenomic populations, either segregated or intermixed, due to ongoing subclonal evolution. Each tumor population is grouped by ploidy, which is defined in units of normal haploid genomes for genomic segments. Segments are defined beforehand by equal copy ratio.

One method the authors used to validate purity estimates compared calculated and histological purity estimates with methylation signatures characteristic of leukocytes, given that blood is a common sample contaminant.

(2) Karyotype models. Copy ratio data provide a number of putative integer-valued ploidy solutions, from which purity is then inferred. In the first two charts above, solutions are in different colors (circles and bars). To better rank these solutions, ABSOLUTE refers to external data in the form of karyotype models. These mixture models of recurrent cancer karyotypes were bootstrapped from thousands of pre-TCGA tumor samples matched to cytological data (Carter et al., 2012). Karyotype models do not impact calculation of individual solutions, only their ranking. Likelihoods from the SCNAs, SSNVs, and pan-cancer karyotype models are combined to produce rankings.

For increased sensitivity in ambiguous cases, when given a primary disease parameter, ABSOLUTE incorporates karyotype models specific to the tumor type in ranking solutions. The impact of this is seen, for example, in differentiating ambiguous solutions, one of which implies a genome doubling event. The frequency of genome doublings varies across tumor types and reflects disease-tissue specific biology. Genome doublings are rare in hematopoietic neoplasms, e.g. ALL and CLL, and have a higher incidence in other types of cancer, such as oesophageal adenocarcinoma (Barrett et al. 1999).

(3) Allelic fraction. ABSOLUTE utilizes the optionally provided, but recommended, mutation data in two ways. (i) ABSOLUTE infers purity of a sample with copy number data in conjunction with mutation data. (ii) ABSOLUTE estimates cellular multiplicity, that is, average allelic copies per cancer cell, to potentially reveal subclonal populations as diagrammed in the fourth chart. Putative solutions incorporating mutation data aid in the manual selection of a best solution. What is key for ABSOLUTE is that the mutation data consist of somatic events.
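The multiplicity estimate can be sketched as follows, assuming the relationship given in Carter et al. (2012), in which a somatic mutation present at s copies per cancer cell, at a locus with q copies per cancer cell in a sample of purity a, has an expected allelic fraction of a*s / (2*(1 - a) + a*q). The function name is hypothetical, and this is a simplified point estimate rather than the module's own probabilistic treatment.

```python
def mutation_multiplicity(alt_count, ref_count, purity, local_copy_number):
    """Point estimate of mutated allele copies per cancer cell."""
    depth = alt_count + ref_count
    if depth == 0:
        raise ValueError("no reads cover this mutation")
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    allelic_fraction = alt_count / depth
    # Average number of copies of this locus per cell in the mixed sample:
    # two from each normal cell, local_copy_number from each cancer cell.
    locus_copies = 2.0 * (1.0 - purity) + purity * local_copy_number
    return allelic_fraction * locus_copies / purity


# 25 alternate reads out of 100 at a two-copy locus in a 50% pure sample.
print(mutation_multiplicity(25, 75, purity=0.5, local_copy_number=2))
```

Under a candidate purity and copy number, a value near a whole number is consistent with a clonal mutation, while a value well below 1 suggests the mutation is carried by only a fraction of the cancer cells.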

Because different types of genomic events in cancer likely have different causes, SCNAs alone provide limited resolution in inferring tumor heterogeneity. Sequence mutation information gives ABSOLUTE an alternative point of reference, one that records tumor progression in finer increments and so allows a more comprehensive modeling of tumor heterogeneity.

High-confidence calling is possible for somatic point mutations, a type of somatic single nucleotide variant (SSNV), with algorithms such as VarScan or MuTect. ABSOLUTE algorithm v1.0.6 expects point mutations; if given other types of mutations, e.g. insertions, it still treats them as point mutations, which is not best practice. A future version of ABSOLUTE will handle insertion and deletion mutations differently from point mutations.

Inclusion of germline variants leads to inflated purity estimates because they are present clonally in both the tumor and the normal.

The module's default parameters reflect the original analysis aims of balancing over-fitting subclonal copy alterations to derive more complex karyotypes against the applicability of a simpler solution in finding tumor samples with high purity. For example, default parameters discard solutions with greater than 5% subclonal fractions and thus skew presented solutions to those of increased ploidy. Change default parameters for samples expected to have a higher proportion of heterogeneous nuclei, especially those for which mutation data are also provided.

Algorithm

Equations used in the algorithm are in the Carter et al. publication.

ABSOLUTE extracts the absolute copy number of local DNA segments per cancer cell from the mixed DNA population in three steps:

(1) Estimates the tumor purity and ploidy from observed relative copy profiles and, if provided, from somatic point mutation data.

(2) Resolves ambiguous cases of purity and ploidy using pre-computed statistical models of recurrent cancer karyotypes based on a large and diverse reference sample collection.

(3) Attempts to account for copy number alterations and point mutations in tumor subclones.

ABSOLUTE expects copy-ratios very close to 1.0 and will fail if ratios are less than 0.75 or greater than 1.25. ABSOLUTE analysis can also fail when the segment count exceeds the max.as.seg.count threshold. An excessive number of segments is associated with noisy or poor-quality data.

Parameters

A HAPSEG output file (<plate.name>_<array.name>.segdat.RData) or other segmented copy number data file. If you supply a tab-delimited segmentation file, see the Input Files section for file details.

output file name base *

If specified, provides a base filename for all output files. The default value is the sample name parameter.

Note that the downstream module ABSOLUTE.summarize requires each sample name to be unique, not just the output file name. To this end, when several input files are run concurrently, only the sample name parameter needs to be varied to obtain unique sample and file names.

Primary disease of the sample for specific tumor karyotype matching. Enter 'NA' to use pan-cancer karyotype reference. This parameter impacts ranking of solutions and not solutions themselves. If a provided input does not match to the following list, then ABSOLUTE defaults to the pan-cancer reference:

The name of the sample for display and for use in downstream module ABSOLUTE.summarize, which, for multiple concurrent file input, requires unique sample names.

max as seg count *

Maximum number of allelic segments. Samples with a higher segment count will be flagged as 'failed'. Default: 1500

max neg genome *

Sometimes, due to noise in the data, ABSOLUTE may model the fraction of the genome attributed to tumor subclones to be less than zero. This parameter specifies the maximum allowable fraction of the genome that can be modeled as being less than zero without discarding a given solution. Default: 0.005

max non clonal *

Maximum genome fraction that may be modeled as non-clonal — that is, as being derived from tumor subclones. Solutions implying greater values will be discarded. Default: 0.05

Increase this parameter for samples expected to have a higher proportion of heterogeneous nuclei, especially if mutation data is also provided.

copy number type *

The copy number type to assess based on input data type.

allelic (default) for data from HAPSEG or AllelicCapseg

total for all other data

maf file

If available, somatic mutation data in mutation annotation format (MAF) that includes t_ref_count and t_alt_count columns. See Input Files section for more details. If using this parameter, also specify the min mut af parameter described next.

min mut af

Mutations with lower allelic fractions than the indicated minimum mutation allelic fraction will be excluded from analysis. Zero is an accepted value. Note that if maf file is specified, min mut af must also be specified.
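The effect of min mut af can be pictured with the minimal sketch below. It assumes the allelic fraction is computed as t_alt_count / (t_ref_count + t_alt_count) and that mutations with no covering reads are dropped; both are assumptions made for illustration, and the function name is hypothetical rather than taken from the module.

```python
def filter_by_allelic_fraction(mutations, min_mut_af):
    """Keep mutations whose tumor allelic fraction is at least min_mut_af.

    `mutations` is a list of dicts with integer 't_ref_count' and
    't_alt_count' entries, mirroring the MAF column names.
    """
    kept = []
    for mutation in mutations:
        depth = mutation["t_ref_count"] + mutation["t_alt_count"]
        if depth == 0:
            # Assumption: a mutation with no covering reads cannot be assessed.
            continue
        allelic_fraction = mutation["t_alt_count"] / depth
        if allelic_fraction >= min_mut_af:
            kept.append(mutation)
    return kept


calls = [
    {"t_ref_count": 90, "t_alt_count": 10},  # allelic fraction 0.10
    {"t_ref_count": 99, "t_alt_count": 1},   # allelic fraction 0.01
]
print(len(filter_by_allelic_fraction(calls, 0.05)))
```

With a minimum of zero, every mutation with at least one covering read is retained.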

Input Files

Segmented copy ratios data file in either of the following two formats:

For ALLELIC copy number type analysis, supply an RData file produced by HAPSEG or AllelicCapseg. These datasets allow incorporation of copy neutral LOH events. Segmentation data produced by any other means must conform to the output formats of HAPSEG/AllelicCapseg for ABSOLUTE to consider copy neutral LOH events.

For TOTAL copy number type analysis, supply a tab-delimited segmentation file in plain-text format. File extension does not matter. ABSOLUTE algorithm v1.0.6 requires the following five columns. Additional columns are ignored.

Chromosome

In either chr# or # format.

Start

End

Num_Probes

Segment_Mean
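Before submitting a tab-delimited segmentation file, it can help to confirm that the five required columns are present and to see how many segments the file holds. The sketch below is a hypothetical pre-flight check, not part of the module, and assumes that the column names appear in the first row exactly as listed above.

```python
import csv
import io

REQUIRED_SEG_COLUMNS = ["Chromosome", "Start", "End", "Num_Probes", "Segment_Mean"]


def check_seg_file(handle):
    """Return (missing_columns, segment_count) for a tab-delimited seg file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_SEG_COLUMNS if name not in header]
    # Count the data rows; extra columns are simply carried along.
    segment_count = sum(1 for _ in reader)
    return missing, segment_count


# Usage with an in-memory example; for a real file use open(path, newline="").
example = io.StringIO(
    "Chromosome\tStart\tEnd\tNum_Probes\tSegment_Mean\n"
    "1\t100\t5000\t12\t0.98\n"
    "chr2\t200\t9000\t30\t1.10\n"
)
print(check_seg_file(example))
```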

(Optional) Somatic mutation data in mutation annotation format (MAF) and as a plain text file. File extension does not matter and hashtagged header rows (#) may be present. ABSOLUTE algorithm v1.0.6 requires the following seven columns. Additional columns are ignored.

t_ref_count OR i_t_ref_count

Count of reference alleles in tumor.

t_alt_count OR i_t_alt_count

Count of alternate alleles in tumor. Together with t_ref_count adds up to the depth of reads in the tumor BAM alignment. You can calculate a missing value if two of these three values are known or with read depth and the frequency of the alternate allele within the sample. These and other MuTect output columns are described further in the GATK forum.

dbSNP_Val_Status

Fields may be blank. Multiple values are separated by semicolons without spaces. Example values include bySubmitter, by1000genomes, by2Hit2Allele, and byHapMap.

Start_position

Note the lowercase "p". Also, note that the End_position column is not required. This implies that ABSOLUTE algorithm v1.0.6 treats all mutation data equally as point mutations, the expected type of mutation data.

Tumor_Sample_Barcode

Fields may be blank.

Hugo_Symbol

Fields may be blank or "unknown".

Chromosome

Must be in # format and not chr# format. The # value must match the one in the segmented copy ratios data file exactly. For example, ABSOLUTE does not equate X with 23 and will exclude these mutations as unmapped mutations. Note that ABSOLUTE algorithm v1.0.6 excludes X chromosome data but not data for a numbered chromosome, e.g. chr23.
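Two of the MAF requirements above lend themselves to a quick check before running the module: deriving a missing tumor read count, and spotting chromosome labels that have no exact counterpart in the segmentation file. The helpers below are hypothetical illustrations, not functions of the module, and assume that read depth equals t_ref_count plus t_alt_count as described for those columns.

```python
def fill_tumor_counts(depth=None, ref_count=None, alt_count=None, alt_fraction=None):
    """Return (t_ref_count, t_alt_count) from whichever values are known."""
    if ref_count is not None and alt_count is not None:
        return ref_count, alt_count
    if depth is not None and alt_count is not None:
        return depth - alt_count, alt_count
    if depth is not None and ref_count is not None:
        return ref_count, depth - ref_count
    if depth is not None and alt_fraction is not None:
        # Counts are whole reads, so round the implied alternate count.
        alt = round(depth * alt_fraction)
        return depth - alt, alt
    raise ValueError("not enough information to derive both counts")


def strip_chr(label):
    """Reduce a chr#-style label to its # value; leave # labels unchanged."""
    label = str(label)
    return label[3:] if label.lower().startswith("chr") else label


def unmatched_maf_chromosomes(maf_chroms, seg_chroms):
    """MAF chromosome labels that match no # value in the segmentation data."""
    seg_values = {strip_chr(c) for c in seg_chroms}
    # MAF labels are compared as-is, so a chr-prefixed MAF label is reported too.
    return sorted({str(c) for c in maf_chroms if str(c) not in seg_values})


print(fill_tumor_counts(depth=100, alt_fraction=0.25))
print(unmatched_maf_chromosomes(["1", "X", "chr2"], ["chr1", "chr2", "23"]))
```

Any label reported by the second helper would be expected to show up among the unmapped mutations, so it is worth relabeling those rows before the run.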

Output Files

<output.file.name.base>_plot.pdf

Three to four types of plots showing a number of modeled solutions. The fourth plot type is given if mutation data is provided. Each modeled solution is represented by a color across the plot types and presented in the order of combined likelihood. Please refer to the Analyzing ABSOLUTE Data page for detailed descriptions of the plots. These plots are (1) purity and ploidy solutions, (2) likelihoods of each of the solutions based on SCNAs, karyotype, SSNVs (if given mutation data), and combined, (3) genomic fraction versus copy ratio on an absolute scale for each proposed solution, and (4) if given mutation data, SSNV allelic fraction, SSNV multiplicity, and cancer cell fraction (CCF) charts for each solution.

The order of the presented solutions represents the ranking. Review these solutions and note the rank number of the solution you consider best. You will use this number when you modify the calls file to override the top ranked solution in finalized results.

<output.file.name.base>.RData

An R file containing an object ‘seg.dat’ which provides all of the information used to generate the plots. This file serves as the input to ABSOLUTE.summarize.

Whether or not you get an error message, and especially if a PDF is not produced, examine the stdout.txt and stderr.txt files from your jobs for clues on what may have caused an error and to note what portions of data were excluded from the analysis by the filtering mechanisms in place.

For example, the stdout.txt tells you how many mutations were unmapped, that is, had no corresponding segment to map to in the segmentation file and thus were excluded. Segmentation data may exclude chromosome end regions for which data were too noisy to obtain copy ratios.

If a PDF plot is not produced alongside the RData file, then the stdout.txt may show that all the solutions, that is, modes, were removed based on parameter settings.

The results were then passed through ABSOLUTE.summarize, manually reviewed and augmented to select for alternative solutions and finalized through ABSOLUTE.review. Download these example results and the example override file using the following links:

The ABSOLUTE module runs only on GenePattern 3.4.2 or above and requires R2.15 with the following packages, each of which will automatically download and install when the module is installed:

numDeriv_2012.9-1

getopt_1.17

optparse_0.9.5

Please install R2.15.3 instead of R2.15.2 before installing the module. The GenePattern team has confirmed test data reproducibility for this module using R2.15.3 compared to R2.15.2 and can only provide limited support for other versions. The GenePattern team recommends R2.15.3, which fixes significant bugs in R2.15.2, and which must be installed and configured independently as discussed in Using Different Versions of R and Using the R Installer Plug-in. These sections also provide information on patch level fixes that are necessary when additional installations of R are made and considerations for those who use R outside of GenePattern.