Abstract

Small-cell lung cancer (SCLC) is an exceptionally aggressive disease with poor prognosis. Here, we obtained exome, transcriptome and copy-number alteration data from approximately 53 samples consisting of 36 primary human SCLC and normal tissue pairs and 17 matched SCLC and lymphoblastoid cell lines. We also obtained data for 4 primary tumors and 23 SCLC cell lines. We identified 22 significantly mutated genes in SCLC, including genes encoding kinases, G protein–coupled receptors and chromatin-modifying proteins. We found that several members of the SOX family of genes were mutated in SCLC. We also found SOX2 amplification in ~27% of the samples. Suppression of SOX2 using shRNAs blocked proliferation of SOX2-amplified SCLC lines. RNA sequencing identified multiple fusion transcripts and a recurrent RLF-MYCL1 fusion. Silencing of MYCL1 in SCLC cell lines that had the RLF-MYCL1 fusion decreased cell proliferation. These data provide an in-depth view of the spectrum of genomic alterations in SCLC and identify several potential targets for therapeutic intervention.

Lung cancer is the leading cause of cancer mortality in the United States, where it is responsible for over 160,000 deaths annually1. Approximately 10–15% of the new lung cancer cases diagnosed each year are SCLC2. The genomic landscape of SCLC is of particular interest compared to those of other solid tumors, given the unique biological characteristics of this tumor type3. SCLC is an exceptionally aggressive malignancy with a high proliferative index and an unusually strong predilection for early metastasis.

Previous efforts to characterize the genetic alterations present in SCLC tumors identified high prevalence of inactivating mutations in TP53 (75–90%)4, RB1 (60–90%)5,6 and PTEN (2–4%)7, rare activating mutations in PIK3CA, EGFR and KRAS8–10, amplification of MYC family members, EGFR and BCL2, and loss of RASSF1A, PTEN and FHIT6,11.

A better understanding of the genomic changes in this cancer will be essential to developing new therapeutics. To this end, we have applied next-generation sequencing technologies to characterize multiple exomes and a single genome of primary SCLC, as well as exomes of SCLC cell lines, together with genome-wide copy-number analysis and whole-transcriptome sequencing.

Exome capture, sequencing and analysis of 42 SCLC tumor–normal tissue pairs identified 26,406 somatic mutations. Approximately 30% (7,977) of these mutations were protein altering (Fig. 1a and Supplementary Table 2). The somatic mutations identified included 7,154 missense, 536 nonsense, 12 stop loss, 243 essential splice site, 32 protein-altering insertion and/or deletion (indel), 2,674 synonymous, 11,460 intronic and 4,295 other types (Fig. 1a and Supplementary Tables 3 and 4). Comparison of the protein-altering changes identified in this study with those reported in the Catalogue of Somatic Mutations in Cancer (COSMIC)12 showed that 98% (7,824/7,977) of these variations are newly identified somatic changes. Nineteen percent of the protein-altering somatic mutations reported were validated using RNA sequencing (RNA-seq) data or mass spectrometry genotyping, with a validation rate of 91% (Supplementary Table 3). We confirmed the effect of several splice-site mutations using RNA-seq data (Supplementary Table 3). We validated all of the indels reported using Sanger sequencing (Supplementary Table 4). One sample represented a distinct profile, with 2,953 mutations (757 validated protein-altering variants; Fig. 1a and Supplementary Table 3). Given the exceptionally high number of mutations in this sample, we excluded it from our calculations of the background mutation rate. Excluding the hypermutated sample, the SCLC tumors had an average of 175 protein-altering single-nucleotide variants (range 31–388) with a mean nonsynonymous mutation rate of 5.5 mutations per megabase (Fig. 1a). This is comparable to the 92 protein-altering variants observed in the previously sequenced genome of a single SCLC cell line13.
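The mutation tallies reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative and are not taken from the study's analysis code.

```python
# Somatic mutation counts by class, as reported in the text.
protein_altering = {
    "missense": 7154,
    "nonsense": 536,
    "stop_loss": 12,
    "essential_splice_site": 243,
    "protein_altering_indel": 32,
}
not_protein_altering = {
    "synonymous": 2674,
    "intronic": 11460,
    "other": 4295,
}

# Totals implied by the per-class counts.
n_protein_altering = sum(protein_altering.values())
n_total = n_protein_altering + sum(not_protein_altering.values())

# Fraction of all somatic mutations that alter the protein.
fraction_protein_altering = n_protein_altering / n_total

# Fraction of protein-altering changes not previously listed in COSMIC.
fraction_new = 7824 / n_protein_altering

print(n_protein_altering, n_total)
print(f"{fraction_protein_altering:.1%} protein altering, {fraction_new:.1%} newly identified")
```

The class counts sum to the reported 7,977 protein-altering and 26,406 total somatic mutations, and the two ratios round to the stated 30% and 98%.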

Analysis of the base-level transitions and transversions showed that G-to-T transversions were predominant, followed in prevalence by G-to-A and A-to-G transitions (Fig. 1b), both at the exome (Fig. 1c) and whole-genome (Fig. 1d,e and Supplementary Fig. 1) levels. This pattern is consistent with demonstrated effects of tobacco smoke carcinogens on DNA13.

Our mutation analysis identified protein-altering somatic single-nucleotide variants in 5,179 genes, including 4,775 genes that were mutated in the non-hypermutated SCLC sample set. Frequently mutated classes included genes encoding kinases, G protein–coupled receptors and chromatin-modifying proteins. To further understand the impact of the mutations on gene function, we applied SIFT14, PolyPhen15 and Condel16 and found that ~53% of the somatic mutations identified are likely to have functional consequences according to at least two of the three methods (Supplementary Table 3). In contrast, only approximately 17% of germline variants identified in the normal samples are predicted by these methods to have a functional impact (Supplementary Fig. 2).

To further assess the relevance of mutated genes, we applied a q-score metric17 to rank significantly mutated cancer-associated genes. We identified 22 significantly mutated genes in SCLC (q score ≥ 1; false discovery rate ≤ 10%; Supplementary Table 5). These genes included TP53 and RB1 and several genes that have not previously been reported as mutated in SCLC (Fig. 2a and Supplementary Table 5). To further confirm the relevance of the 22 genes, we assessed the mutation frequency for these genes using exome data from a set of 21 additional samples (Supplementary Table 6). We found a significant correlation between the mutation frequencies of the 22 genes in the initial sample set and the validation cohort (P = 1.16 × 10⁻⁵, r² = 0.63; Supplementary Table 7). In addition, we found that 42 genes that were mutated in our primary tumor samples (Supplementary Table 8) were also previously reported to be mutated in the genome of the NCI-H209 SCLC cell line13.

Mutational hotspots are indicative of genes that are relevant to cancer. In this study, we have identified 17 genes with 18 hotspot mutations (Supplementary Table 9). By comparing our mutations with those reported in COSMIC12 and a large-scale colon cancer mutation screen18, we identified an additional 150 hotspot mutations in 116 genes (Supplementary Table 9). Besides known hotspots in TP53, RB1, PIK3CA, CDKN2A and PTEN, several new hotspot mutations were identified. These included genes encoding Ras family regulators (RAB37, RASGRF1 and RASGRF2), chromatin-modifying enzymes or transcriptional regulators (EP300, DMBX1, MLL2, MED12L, TRRAP and RUNX1T1), ionotropic glutamate receptor (GRID1), kinases (STK38, LRRK2, PRKD3 and CDK14), protein phosphatases (PTPRD and PPEF2) and G protein–coupled receptors (GPR55, GPR113 and GPR133). Further, three of the genes with the top q scores—RUNX1T1, CDYL and RIMS2—contained a hotspot mutation.

In addition to the hotspots, we found mutations clustering in particular gene families and pathways (Supplementary Table 10). Evidence of clustering was found in genes in the phosphatidylinositol 3-kinase (PI3K) pathway (PIK3CA, AKT1–3, MTOR, RPS6KA2 and RPS6KA6), the mediator complex (MED12, MED12L, MED13, MED13L, MED15, MED24, MED25, MED27 and MED29), Notch and Hedgehog family members (NOTCH1, NOTCH2, NOTCH3 and SMO), glutamate receptor family members (GRIA1, GRIA2, GRIA3, GRIA4, GRIND1, GRID2, GRM1–3, GRM5, GRM7 and GRM8), SOX family members (SOX3, SOX4, SOX5, SOX6, SOX9, SOX11, SOX14 and SOX17; Fig. 2b) and DNA repair and/or checkpoint pathway genes (ATM, ATR, CHEK1 and CHEK2). The mutations in SOX family members were mutually exclusive (Supplementary Fig. 3). In contrast to non–small-cell lung cancer (NSCLC)12, we did not observe any SCLC samples with a KRAS mutation. Among the receptor tyrosine kinase genes, we identified mutations in FLT1, FLT4, KDR and KIT and members of the Ephrin family (EPHA1–7 and EPHB4). Notably, the KIT mutation affecting codon 761 has previously been reported in mast cell activation disorder and is likely an activating change19 (Supplementary Fig. 4).

In addition, we identified high levels of amplification (copy number of ≥4) of SOX2 in ~27% (15/56) of the SCLC samples (Fig. 3b). RNA-seq data showed that the majority of the SCLC samples, including those with SOX2 amplification, had higher SOX2 expression compared to adjacent normal samples (Fig. 3c). We further examined the expression of SOX2 by immunohistochemistry (IHC) and copy-number change by FISH in an independent cohort of 110 primary SCLC tumor samples (Fig. 4a,b). Expression of SOX2 was strongly correlated with increased gene copy number and with clinical stage (Fig. 4c,d).

To further assess the relevance of SOX2 in SCLC, we analyzed a panel of SCLC cell lines for SOX2 protein expression and gene copy number (Supplementary Fig. 5). Among these cell lines, H446 and H720 both had strong SOX2 protein expression, and H720 was found to have elevated gene copy number. SOX2 has previously been implicated in the maintenance of proliferative potential and stem cell function22–25. To test whether H446 and H720 were dependent on SOX2 for continued growth and proliferation, we stably transduced them with lentiviruses carrying either a doxycycline-inducible SOX2-targeting short hairpin RNA (shRNA) or a scrambled control shRNA. Induction of SOX2 shRNA in both H446 and H720 resulted in lower amounts of SOX2 protein and reduced cell proliferation (Fig. 3d,e). Previously, amplification of SOX2 and its role as an oncogene have been reported in lung and esophageal squamous cell carcinoma26. Our findings further support the idea of SOX2 as a genuine SCLC driver gene.

Analysis of RNA-seq data obtained from SCLC samples for fusion transcripts identified 41 gene fusions, including 4 recurrent fusions (Supplementary Table 13). A majority of the predicted gene fusions were intrachromosomal (83%, 34/41). All of the gene fusions reported were verified and confirmed to be somatic by RT-PCR (Supplementary Table 13). A fusion involving RLF and MYCL1 (Supplementary Fig. 6a) was found in one primary SCLC tumor and four SCLC cell lines (H889, HCC33, H1092 and COR-L47). RLF and MYCL1 are ~259 kb apart and are encoded by opposing strands. The observed fusion requires an inversion event that brings exon 1 of RLF in frame with MYCL1, leading to the expression of a fusion protein composed of the first 79 amino acids of RLF and a MYCL1 protein lacking its first 27 amino acids, thereby generating a 446-residue fusion protein. The clinical sample that had the RLF-MYCL1 fusion also overexpressed MYCL1. This fusion has previously been noted27, but its role as an oncogene in SCLC has not been established. We found that small interfering RNA (siRNA)-mediated targeting of MYCL1 in H1092 and COR-L47 fusion-positive cells effectively reduced the proliferation of these cells, strongly suggesting a functional role for MYCL1 in SCLC (Supplementary Fig. 6).

Multiple gene fusions involving kinase genes have recently been shown to be activating28. We identified four such fusions—NPEPPS-EPHA6, SKP1-CDKL3, NEK4-SFMBT1 and ZAK-RAPGEF4—that are predicted by sequence to result in functional fusion proteins (Fig. 5 and Supplementary Figs. 7–9). The roles of these fusion products in cancer remain to be elucidated.

In this study, we have identified multiple new recurrent somatic mutations in SCLC, including multiple mutations and copy-number alterations in SOX gene family members. The potential role of SOX family members in SCLC is further emphasized here by the identification of SOX2 amplification and overexpression in approximately a quarter of the SCLC samples analyzed. SOX proteins have an important role in diverse biological processes, including cell type specification. Among the SOX family members, SOX2 in particular is a key factor in the maintenance of pluripotency and self-renewal of stem cells23. Aberrant SOX2 expression has also been implicated in reprogramming mature cells to acquired pluripotency24. Its expression in mouse fibroblasts, together with FoxG1, has been shown to generate self-renewing neural precursor cells25. Conditional deletion of Sox2 in mice indicates its critical role in lung development22. Conversely, overexpression of SOX2 in lung epithelial cells has been shown to promote tumorigenesis29.

Notably, conditional induction of SOX2 in lung epithelial cells is also known to increase the number of neural progenitor cells30. SCLCs are tumors with neuroendocrine features. SOX2 protein overexpression has previously been noted in high-grade SCLC31, and immunoreactive antibodies against SOX2 have been detected in sera from SCLC patients32. These observations, together with the frequent amplifications identified here, imply that SOX2 has an important role as a putative lineage-survival oncogene in SCLC. This suggestion is further supported by the correlation of SOX2 expression with SCLC stage and the role of SOX2 expression in maintaining SCLC proliferation.

The recurrent nature of the RLF-MYCL1 fusion and its functional relevance provide additional opportunities for therapeutic intervention in SCLC. Recently, oncogenic kinase gene fusions have become a major focus of interest in the therapeutic targeting of NSCLC33–35. Understanding the role of tumor-specific in-frame kinase fusion transcripts identified in SCLC in this study may provide promising opportunities for targeted therapy development.

Patient-matched fresh-frozen primary SCLC tumors and normal tissue samples were obtained from commercial sources or the Johns Hopkins tissue repository (Supplementary Table 1). All samples used in the study had appropriate IRB approval and informed consent from study participants. All tumor and normal tissues were subjected to review by a pathologist to confirm diagnosis and tumor content. The Qiagen AllPrep DNA/RNA kit was used to prepare DNA and RNA.

Sequence data processing

All sequencing reads were evaluated for quality using the Bioconductor ShortRead package36. Sample identity was confirmed by comparing data derived from exome sequencing and RNA-seq against Illumina 2.5M array data as described18.

Variant calling and validation

Sequencing reads were mapped to the UCSC human reference genome (GRCh37/hg19) using Burrows-Wheeler Aligner (BWA) software37 set to default parameters. Local realignment, duplicate marking and raw variant calling were performed as described previously38. Known germline variations represented in dbSNP Build 131 (ref. 39) but not represented in COSMIC v56 (ref. 12) were filtered out. Variations present in the tumor sample but absent in matched normal tissue were predicted to be somatic. Predicted somatic variations were additionally filtered to include only positions with a minimum of 10× coverage in both the tumor and matched normal tissue, as well as an observed variant allele frequency of <3% in the matched normal tissue and a significant difference in variant allele counts, as determined using Fisher’s exact test. To control for possible low-level tumor contamination in adjacent normal tissue, the allele frequency cutoff was expanded to 5% if a gene was significantly mutated, allowing for an additional 11 variants to be included. We performed whole-genome sequencing of the one hypermutated sample and only report the 755 protein-altering variants that were found in both the exome and whole-genome data for this sample. This sample was excluded from background mutation rate calculations. For unpaired samples, in addition to dbSNP, variants were filtered against normal variants from this data set, as well as normal variants from a published colon data set18. In addition, data from 2,500 normal exomes in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project were used to filter out variants and hotspot mutations. To evaluate the performance of the variant calling algorithm, we randomly selected 594 protein-altering variants and validated them using Sequenom, as described previously17. Of these variants, 91% (539) were validated as somatic. All variants that were invalidated were removed from the final set. Variants that were also validated by RNA-seq are labeled as VALIDATED: RNA-Seq to show confirmed expression of the variant (Supplementary Table 3). Indels were called using the GATK Indel Genotyper Version 2 (ref. 28). Indel validation was performed as described in a recent study18. The effects of all nonsynonymous somatic mutations on gene function were predicted using SIFT14, PolyPhen15 and Condel16. All variants were annotated using Ensembl (release 59).
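The somatic filter described above (minimum 10× coverage in tumor and normal, variant allele frequency below 3% in the normal, and a significant Fisher's exact test on allele counts) can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names, the two-sided form of the test and the significance threshold `alpha=0.05` are assumptions, as the text does not state them.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)


def is_somatic(tumor_ref, tumor_alt, normal_ref, normal_alt,
               min_depth=10, max_normal_vaf=0.03, alpha=0.05):
    """Apply the coverage, normal-VAF and Fisher filters to one position.

    alpha is an assumed threshold; the source only says 'significant'.
    """
    tumor_depth = tumor_ref + tumor_alt
    normal_depth = normal_ref + normal_alt
    if tumor_depth < min_depth or normal_depth < min_depth:
        return False  # fewer than 10 reads in tumor or normal
    if tumor_alt == 0:
        return False  # the variant must be observed in the tumor
    if normal_alt / normal_depth >= max_normal_vaf:
        return False  # too much variant signal in the matched normal
    p_value = fisher_exact_two_sided(tumor_alt, tumor_ref, normal_alt, normal_ref)
    return p_value < alpha


# Hypothetical read counts: 20 of 50 tumor reads and 0 of 50 normal reads carry the variant.
print(is_somatic(tumor_ref=30, tumor_alt=20, normal_ref=50, normal_alt=0))
```

Under these assumptions, the relaxed cutoff applied to significantly mutated genes would correspond to calling `is_somatic(..., max_normal_vaf=0.05)`.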

Mutational significance

We evaluated the mutational significance of genes using a previously described method17, with the addition of an expression filter, as mutation rates are known to vary with expression level13,40. The hypermutated sample was excluded from analysis so that it did not affect the background mutation rate. Because of the variability in background mutation, the uniform background mutation rate used to assess the significance of mutation in cancer-associated genes is at times lower than the actual mutation rate in some regions, resulting in false positive candidates, such as the olfactory genes, seeming to be significantly mutated cancer-associated genes. To address this, a recent study used an RNA-seq–based expression filter to focus on expressed genes, thereby potentially filtering out genes that are expressed at very low levels or are not expressed at all41. In this study, we classified average gene expression on the basis of RNA-seq data into tertiles (high, medium and low) and used this information to remove low expressors that would otherwise be identified as significantly mutated cancer-associated genes.
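The tertile-based expression filter can be illustrated with a short sketch. The rank-based tertile assignment, the handling of genes without expression data (dropped here) and the gene names are all assumptions made for illustration; the text does not specify these details.

```python
def expression_tertiles(mean_expression):
    """Label each gene 'low', 'medium' or 'high' by the rank of its mean expression."""
    ranked = sorted(mean_expression, key=mean_expression.get)
    n = len(ranked)
    labels = {}
    for rank, gene in enumerate(ranked):
        if rank < n / 3:
            labels[gene] = "low"
        elif rank < 2 * n / 3:
            labels[gene] = "medium"
        else:
            labels[gene] = "high"
    return labels


def drop_low_expressors(candidate_genes, mean_expression):
    """Keep candidate genes whose mean expression falls in the medium or high tertile."""
    labels = expression_tertiles(mean_expression)
    # Assumption: genes with no expression measurement are also dropped.
    return [g for g in candidate_genes if labels.get(g) in ("medium", "high")]


# Hypothetical mean RNA-seq expression values for six placeholder genes.
expression = {"GENE_A": 0.1, "GENE_B": 0.5, "GENE_C": 3.0,
              "GENE_D": 8.0, "GENE_E": 20.0, "GENE_F": 50.0}
candidates = ["GENE_A", "GENE_D", "GENE_F"]
print(drop_low_expressors(candidates, expression))
```

In this toy input, GENE_A sits in the lowest tertile and is removed from the candidate list, which mirrors how poorly expressed genes such as olfactory receptors would be excluded.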

Whole-genome, RNA-seq and pathway analysis

Whole-genome analysis, RNA-seq–based expression assessment and pathway-level analysis were performed as described previously18.

SNP array data generation and analysis

Illumina HumanOmni2.5_4v1 arrays were used to assay 56 samples (36 primary tumor–normal pairs, 15 SCLC cell line–normal pairs, 1 SCLC cell line and 4 unpaired primary tumors) for genotype, DNA copy number and loss of heterozygosity (LOH) at ~2.5 million SNP positions. These samples all passed our quality control metrics for sample identity and data quality. A subset of 2,295,239 high-quality SNPs was selected for all analyses.

After making modifications to permit use with Illumina array data, we applied the PICNIC42 algorithm to estimate total copy number, allele-specific copy number and LOH, as described recently18. Recurrent genomic regions with DNA copy gain and loss were identified using GISTIC, version 2.0 (ref. 43).

Fusion detection and validation

Fusion identification and validation were performed as has been recently described18.

Cell lines and culture conditions

All cell lines used in the study, except where noted, were cultured in RPMI 1640 supplemented with 10% FBS. H446 and H720 were cultured in RPMI 1640 with 10% tetracycline-free FBS (Hyclone, R10). Cell line identity for lines used to assess SOX2 copy number was confirmed by short tandem repeat (STR) profiling using the StemElite ID System (Promega). HCC33, HCC2433, H289, H2141, H2107, H209, H1963, H1672, H1607, H1450, H1339, H1184, H2171, HCC1772, HCC970, H128 and H2195 SCLC cell lines, their patient-matched lymphoblastoid lines and their culture conditions have been described previously44–46 (Supplementary Table 1). Additional SCLC cell lines were obtained from the American Type Culture Collection (ATCC).

Scrambled or SOX2-targeting (TRC Clone TRCN0000003253) shRNAs were cloned as annealed oligonucleotides (Sigma) into Tet-pLKO-puro (Addgene plasmid 21915) digested with AgeI and EcoRI according to published protocols47,48. Sequence-verified clones were used to produce lentiviral particles according to TRC protocols. Lentiviral supernatants were used to infect cultured H446 or H720 cells in R10 medium at low multiplicity of infection in the presence of 8 µg/ml polybrene for 16 h. After incubation, medium was replaced with fresh R10, and cells were cultured for an additional 24 h before being selected and maintained in 500 ng/ml puromycin. The optimal doxycycline dose for inducible knockdown was determined to be 2 µg/ml, which was the minimum dose that resulted in maximal knockdown of SOX2 after 96 h. The effect of SOX2 knockdown on the amount of SOX2 protein was assessed by protein blot using antibody to SOX2 (Cell Signaling Technology 27485) or GAPDH (Santa Cruz Biotechnology, sc-25778) and horseradish peroxidase (HRP)-conjugated secondary antibodies, followed by signal detection with chemiluminescence (GE Healthcare Life Sciences).

Cell viability and proliferation assays

Stable cell lines were plated in quadruplicate at a density of 1 × 10³ cells per well in opaque 96-well plates in the presence or absence of 2 µg/ml doxycycline. Cells were plated in replicate plates for each time point tested. ATP content was measured as an indicator of metabolically active cells using the CellTiter-Glo Luminescent Cell Viability Assay (Promega) read on a SpectraMax M2e plate reader in luminescence mode (Molecular Devices). Viability was normalized between cell lines at 48 h to correct for differences in the initial number of cells plated in each group. All experiments were repeated a minimum of three times with similar results, and one representative experiment is shown.

Analysis of copy-number variation in SCLC cell lines

SOX2 copy number was assessed by quantitative RT-PCR using TaqMan Copy Number Assays (Hs02719379_cn) on a StepOnePlus Real-Time PCR System (Applied Biosystems). RPPH1 served as the reference gene (Applied Biosystems). Copy-number calls relative to normal human genomic DNA (Promega) were made with CopyCaller v2.0 (Applied Biosystems).

Tissue microarrays

SCLC tissue microarrays were obtained from US Biomax (LC703, LC802a, LC1009 and LC10010a) for IHC and FISH as fresh-cut slides. The four tissue microarrays contain replicate cores and a small set of overlapping cases. For analysis, missing or inconclusive cores were removed, and the replicate or overlapping case core with the highest percentage of tumor area was used for analysis, yielding 110 unique SCLC cases and 15 normal lung cases. Histological diagnosis with SCLC was confirmed by an attending pathologist.

Immunohistochemistry

IHC for SOX2 was performed on the tissue microarrays using a Leica Bond-III automated slide stainer (Leica Microsystems). The 4-µm sections were deparaffinized and subjected to antigen retrieval with Cell Conditioning Solution (high pH CC1 standard, Ventana Medical Systems) for 60 min. Sections were then incubated for 44 min with rabbit monoclonal antibody to SOX2 (1:100 dilution; clone SP76, Cellmarque). Reactions were developed through biotin-free, polymer detection (Ultra-view, Ventana Medical Systems) according to the manufacturer’s instructions.

Scoring was performed on each sample. Nuclear labeling was scored by intensity (no (0), weak (1), moderate (2) or strong (3)) and for extent (expressed as the percentage of nuclei that were positive). Results were expressed by assigning a composite IHC score that was calculated by multiplying the intensity score by the percentage of nuclei with positive staining, with a maximum value of 300.
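The composite IHC score defined above is a simple product, sketched below. The function name and input validation are illustrative additions.

```python
def composite_ihc_score(intensity, percent_positive):
    """Composite score = intensity (0-3) x percentage of positive nuclei (0-100)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0 (none), 1 (weak), 2 (moderate) or 3 (strong)")
    if not 0 <= percent_positive <= 100:
        raise ValueError("percent_positive must be between 0 and 100")
    return intensity * percent_positive


# Hypothetical core: moderate staining in 40% of nuclei.
print(composite_ihc_score(2, 40))
```

Strong staining in every nucleus gives the stated maximum of 300.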

FISH analysis

FISH was performed on the tissue microarrays. The BAC clone RP11-459K6 containing a human DNA insert from the genomic region of SOX2 (previously validated by PCR) was used for preparation of the SOX2 FISH probe. The SOX2 probe was validated for chromosome mapping and quality of hybridization in the human lymphoblastoid cell line AG09391 (Coriell Institute).

One slide of each tissue microarray was subjected to a two-color FISH assay using a mixture of the SOX2 probe (red) and a commercially available probe for the chromosome 3 centromere (Kreatech) (green). The steps before hybridization were performed using the Zymed Spot-Light Tissue Pretreatment kit (Invitrogen) according to the manufacturer’s instructions.

Analysis was performed on an epifluorescence microscope using single interference filter sets for blue (DAPI), green (FITC) and red (Texas red). For each interference filter, monochromatic images were acquired and merged using CytoVision (Leica Microsystems). Tumor cells were scored for copy-number signals of SOX2 in 30–50 cells. In this analysis, a scoring system was proposed to identify increased levels of copy number per cell. Scores were assigned on a scale from 1–6 (according to pattern of copy-number gain, median per-cell change): 1 (no, 1–2), 2 (low, 2–3), 3 (moderate, 3–4), 4 (high, 4–5), 5 (very high, >5), 6 (gene amplification, gene clusters).
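The six-level FISH score can be expressed as a small lookup. Because adjacent ranges in the scale share their end points (for example 1–2 and 2–3), this sketch assumes that each upper bound is inclusive and that the scored quantity is the median number of SOX2 signals per cell; both choices, along with the function and parameter names, are assumptions rather than details given in the text.

```python
from statistics import median


def fish_score(signals_per_cell, has_gene_clusters=False):
    """Map per-cell SOX2 signal counts (30-50 cells scored) to the 1-6 scale."""
    if has_gene_clusters:
        return 6  # gene amplification seen as signal clusters
    m = median(signals_per_cell)
    if m <= 2:
        return 1  # no gain
    if m <= 3:
        return 2  # low
    if m <= 4:
        return 3  # moderate
    if m <= 5:
        return 4  # high
    return 5      # very high (>5)


# Hypothetical core with a median of three signals per cell.
print(fish_score([2, 3, 3, 4, 3]))
```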

MYCL1 knockdown studies

The SCLC cell lines NCI-H1092, COR-L47 and NCI-H2171 were transfected with siRNA pools targeting MYCL1 (Dharmacon) or with a non-targeting control siRNA (Dharmacon) following a reverse transfection protocol. The cells were incubated at 37 °C for 5 d after transfection and were subjected to a cell viability assay using the CellTiter-Glo kit (Promega).

MYCL1 (Hs00420495_m1) and GAPDH (Hs00266705_g1) TaqMan probes and primers were obtained from Life Technologies and were used to assess knockdown according to the manufacturer’s instructions. Data were analyzed using the ΔΔCT method by normalizing to GAPDH and mock-transfected controls. TaqMan reactions were performed in duplicate to obtain a mean value and s.d. P values were calculated by t test.
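The ΔΔCT calculation used to quantify knockdown can be sketched as follows, assuming the standard 2^(−ΔΔCT) form with roughly 100% amplification efficiency. The function name and the CT values are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean


def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Relative target expression by the ddCT method (assumes ~100% PCR efficiency)."""
    # dCT: target CT minus reference-gene (GAPDH) CT, using replicate means.
    delta_treated = mean(ct_target_treated) - mean(ct_ref_treated)
    delta_control = mean(ct_target_control) - mean(ct_ref_control)
    # ddCT: treated sample relative to the mock-transfected control.
    delta_delta = delta_treated - delta_control
    return 2 ** (-delta_delta)


# Hypothetical duplicate CT values: MYCL1 siRNA versus mock-transfected cells.
fold = relative_expression(
    ct_target_treated=[26.0, 26.0], ct_ref_treated=[18.0, 18.0],
    ct_target_control=[24.0, 24.0], ct_ref_control=[18.0, 18.0],
)
print(fold)
```

With these illustrative inputs, the target CT rises by two cycles relative to GAPDH after knockdown, which corresponds to one quarter of the control expression level.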

ACKNOWLEDGMENTS

The authors would like to thank Genentech DNA Sequencing and Oligo groups for their help with the project. We thank M.A. Huntley and J. Degenhardt for bioinformatics support and the Pathology Core Labs for providing histology, IHC and tissue management support. This work was also supported by grants from the Burroughs Wellcome Fund, the Flight Attendant Medical Research Institute, the Johns Hopkins Specialized Programs of Research Excellence (SPORE) NCI P50CA058184 (M.V.B. and C.M.R.), the Colorado SPORE NCI P50 CA058187 (M.V.-G) and the University of Texas SPORE NCI P50CA70907 (J.D.M., A.F.G. and K.E.H.). D.D.P. is supported by the Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES) Foundation and the Ministry of Education of Brazil.

Footnotes

Accession codes. Sequencing and genotype data have been deposited at the European Genome-phenome Archive, which is hosted by the European Bioinformatics Institute (EBI), under accession EGAS00001000334.

Note: Supplementary information is available in the online version of the paper.