Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain
the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in
Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles
and JavaScript.

Subjects

Abstract

Tobacco smoking causes lung cancer1,2,3, a process that is driven by more than 60 carcinogens in cigarette smoke that directly damage and mutate DNA4,5. The profound effects of tobacco on the genome of lung cancer cells are well-documented6,7,8,9,10, but equivalent data for normal bronchial cells are lacking. Here we sequenced whole genomes of 632 colonies derived from single bronchial epithelial cells across 16 subjects. Tobacco smoking was the major influence on mutational burden, typically adding from 1,000 to 10,000 mutations per cell; massively increasing the variance both within and between subjects; and generating several distinct mutational signatures of substitutions and of insertions and deletions. A population of cells in individuals with a history of smoking had mutational burdens that were equivalent to those expected for people who had never smoked: these cells had less damage from tobacco-specific mutational processes, were fourfold more frequent in ex-smokers than current smokers and had considerably longer telomeres than their more-mutated counterparts. Driver mutations increased in frequency with age, affecting 4–14% of cells in middle-aged subjects who had never smoked. In current smokers, at least 25% of cells carried driver mutations and 0–6% of cells had two or even three drivers. Thus, tobacco smoking increases mutational burden, cell-to-cell heterogeneity and driver mutations, but quitting promotes replenishment of the bronchial epithelium from mitotically quiescent cells that have avoided tobacco mutagenesis.

Data availability

Sequencing data have been deposited at the European Genome-phenome Archive (http://www.ebi.ac.uk/ega/) under the accession number EGAD00001005193. Somatic-mutation calls, including single-base substitutions, indels and structural variants, from all 632 samples have been deposited on Mendeley Data with the identifier: https://doi.org/10.17632/b53h2kwpyy.2.

International Agency for Research on Cancer. Tobacco Smoke and Involuntary Smoking. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans Vol. 83 (IARC and World Health Organization, 2004).

Acknowledgements

This work was supported by a Cancer Research UK Grand Challenge Award (C98/A24032) and the Wellcome Trust. P.J.C. and S.M.J. are Wellcome Trust Senior Clinical Fellows (WT088340MA); S.M.J. receives funding as a member of the UK Regenerative Medicine Platform (UKRMP2) Engineered Cell Environment Hub (MRC; MR/R015635/1) and the Longfonds BREATH lung regeneration consortium, and is further supported by The Rosetrees Trust, the Stoneygate Trust, the British Lung Foundation and the UCLH Charitable Foundation; K.Y. is supported by a Japan Society for the Promotion of Science (JSPS) Overseas Research Fellowship and The Mochida Memorial Foundation for Medical and Pharmaceutical Research; S.M.J. and R.E.H. are supported by the Roy Castle Lung Cancer Foundation; R.E.H. is a Wellcome Trust Sir Henry Wellcome Fellow (WT209199/Z/17/Z); and I.M. is funded by Cancer Research UK (C57387/A21777). The authors thank S. Broad and F. Watt for providing 3T3-J2 fibroblasts, and B. Carroll for help with sample collection.

Contributions

S.M.J., P.J.C., K.Y., K.H.C.G. and H.L.-S. designed the experiments. K.H.C.G performed all of the sample collection, cell isolation, clonal expansion and DNA extraction, with help from D.P.C., E.F.M. and F.R.M. E.F.M. and C.R.B. collected the paediatric samples, and E.F.M., D.P.C. and R.M.T. collected the adult samples. E.A. made sequencing libraries. K.Y. performed most of the data curation and statistical analysis, with H.L.-S., T.C., K.B., A.M., N.K. and T.H. providing assistance and advice. S.E.C. oversaw all of the clinical data collection and curation, and performed the flow cytometry characterization of the clones. R.E.H. and K.H.C.G. performed the qPCR characterization of the clones. M.R.S. oversaw the analysis of mutational signatures. P.J.C. and I.M. oversaw statistical analyses. R.E.H., A.P., K.H.C.G., K.Y., S.M.J. and P.J.C. performed data interpretation and, together with D.P.C., helped to draft and revise the manuscript.

Extended data figures and tables

a, Sorting of EpCAM+ epithelial cells from human airway biopsies. Human haematopoietic and endothelial cells were stained with antibodies against CD45 and CD31, respectively. Within the population of cells negative for those markers, EpCAM-expressing cells were gated. Single, live (DAPI-negative) cells were flow-sorted from this population into individual wells of 96-well plates. b, Quantitative PCR (qPCR) analysis of cultures of clonally derived airway epithelial cells. Airway basal cells express integrin subunit α 6 (ITGA6), keratin 5 (KRT5), cadherin 1 (CDH1) and TP63. Expression is shown in clonally derived cell cultures (n = 13 from 3 donors, coloured blue, green and orange) compared to control bulk human bronchial epithelial cell cultures (HBECs) that were expanded in the same culture conditions and lung fibroblast cell cultures (lung fibs) that served as a negative control. The centre values and error bars indicate mean and s.e.m., respectively. Conditions in which no expression was detected are shown as 0. c, Colony-forming efficiency of CD45−CD31−EPCAM+ cells after single-cell sorting from endobronchial biopsy samples (n = 16). For one ex-smoker, EpCAM was not used to select cells and only CD45−CD31− cells were sorted; as expected, this was the patient with the lowest colony-forming efficiency.

a, Stacked bar chart showing the proportion of reads attributed to the human genome, mouse genome, both, neither, or with ambiguous mapping for the pure mouse fibroblast feeder line (left) or a pure human sample (right), assessed with the Xenome pipeline. b, Clean-up of mutation calls using the Xenome pipeline for one of the samples that was more heavily contaminated by the mouse feeder layer. The Venn diagram on the left shows the overlap in mutation calls before and after removing non-human reads by Xenome. c, Histograms of VAFs for two representative colonies in the sample set. The plot on the left shows a tight distribution around 50%, as expected for a colony derived from a single cell without contamination. The plot on the right shows a bimodal distribution with one peak at 50% (mutations present in the original basal cell) and a second peak at around 25% (probably representing mutations that were acquired in vitro during colony expansion). These second peaks at less than 50% are more evident in colonies from children, owing to the low number of mutations in the original basal cell. d, Histogram of VAFs for a colony seeded by more than one basal cell, leading to a peak at much less than 50%. e, Estimated sensitivity of mutation calling according to sequencing depth. Heterozygous germline polymorphisms were identified in each subject; for each colony sequenced, we calculated the fraction of these polymorphisms that was recalled by our algorithms. f, Comparison of mutational burden in normal bronchial epithelial cells that neighbour a carcinoma in situ (CIS) versus cells distant from the CIS in five patients. The box-and-whisker plots show the distribution of mutational burden per colony within each subject, with the boxes indicating median and interquartile range and the whiskers denoting the range. The overlaid points are the observed mutational burden of individual colonies.

a, Density distribution of mutational burden in cells from ex-smokers (green) and current smokers (purple). The black vertical line shows the threshold for near-normal mutational burden derived for each patient. The x axis is on a logarithmic scale. Note the frequently bimodal distribution of mutational burden, especially in the ex-smokers, with the modes separated at the threshold for near-normal mutational burden. b, Flow cytometric analysis of clones for expression of KRT5, EpCAM, ITGA6, podoplanin (PDPN), NGFR and CD45 or CD31. Lung fibroblasts are included as a comparison. Fluorescence minus one (FMO) is shown. Plots for one clone with a near-normal mutational burden (low-mutant clone) and one with an increased burden (high-mutant clone) are shown, and are representative of five clones from one patient. c, Bright-field images of expanded clones at passage 3, showing cobblestone epithelial morphology. Images are representative of five clones from one patient. A clone with an increased mutational burden is shown at the top, and a clone from an ex-smoker with a near-normal mutational burden is shown at the bottom. For the left images, the magnification is ×10 and the scale bar is 200 μm; for the right images, the magnification is ×20 and the scale bar is 100 μm.

a, Relationship of burden of indels per cell with age. The points represent individual colonies (n = 632) and are coloured by smoking status. The black line represents the fitted effect of age on indel burden, which was estimated from LME models after correction for smoking status and within-patient correlation structure. The blue shaded area represents the 95% CI for the fitted line. b, Stacked bar plot showing the distribution of colonies with 0–7 copy-number changes and structural variants across the 16 subjects. c, Three examples of chromoplexy in normal bronchial cells. Structural variants are shown as coloured arcs that join two positions in the genome around the circumference. The instances of chromoplexy all consist of three translocations (purple). d, An example of chromothripsis in a cell from an 11-month old child. The plot on the right shows the copy number of genomic windows in the relevant region of chromosome 1 (black points); the lines and arcs denote the positions of observed structural variants.

a, Trinucleotide contexts for the signatures extracted by the hierarchical Dirichlet process (HDP) (left) and MutationalPatterns non-negative matrix factorization (right). The six substitution types are shown across the top of each signature. Within each signature, the trinucleotide context is shown as four sets of four bars, grouped by whether an A, C, G or T respectively is 5′ to the mutated base, and within each group of four by whether A, C, G or T is 3′ to the mutated base (the order of bars is the same as that shown in Fig. 2b). Where signatures show high cosine similarity scores between algorithms, they are lined up horizontally. We note that Signature C in MutationalPatterns does not have a match in the signatures extracted by the HDP algorithm, but appears very similar to Signature A in MutationalPatterns (or SBS-5 from the HDP). This means that it probably represents over-splitting of the signatures. b, Heat map showing the cosine similarities of signatures extracted by MutationalPatterns with those extracted by the HDP. Only cosine-similarity scores that are greater than 0.75 are coloured. c, Scatter plots showing the fraction of mutations in each colony (n = 632) assigned to each signature by the HDP algorithm (x axis) versus the MutationalPatterns algorithm (y axis). The correlation values quoted are Pearson’s correlation coefficients (R2). d, Transcriptional strand bias of A>G mutations in an N[A]T context before and after TSSs. Note the absence of transcriptional strand bias in intergenic regions but evidence for both transcription-coupled damage and repair after the TSS, applying similarly in both never-smokers and ex- or current smokers.

Phylogenetic trees showing clonal relationships among normal bronchial cells in the 13 subjects not shown in Fig. 3a. Branch lengths are proportional to the number of mutations (x axis) specific to that clone or subclone. Each branch is coloured by the proportion of mutations on that branch that are attributed to the various SBS signatures.

a, Five indel signatures (ID-1, ID-2, ID-3, ID-5 and ID-8) were extracted by the HDP. The contributions of different types of indels to each signature are shown, grouped by whether variants are deletions or insertions; the size of the event; whether they occur at repeat units; and the sequence content of the indel. b, Stacked bar plot showing the proportional contribution of mutational signatures to indels across the 632 colonies derived from normal bronchial cells, extracted using the HDP. Within each patient, colonies are sorted from left to right by increasing indel burden (bar chart in dark grey above coloured signature-attribution stacks).

a, Six DBS signatures were extracted by the HDP. The contributions of different types of double-base substitution to each signature are shown, grouped by the sequence that is mutated and by what it is mutated to. Five of the signatures have been observed in cancer genomes24, and one (DBS Sig-C) is a novel signature that was extracted here. b, Stacked bar plot showing the proportional contribution of mutational signatures to double-base substitutions across the 632 normal bronchial cells, extracted using the HDP. Note that some of the colonies in children have no double-base substitutions. Within each patient, colonies are sorted from left to right by increasing burden of double-base substitutions (bar chart in dark grey above coloured signature-attribution stacks).

a, Stick plots showing distribution of mutations in TP53, NOTCH1 and other genes that were significantly mutated in our sample set. Mutations are coloured by type. The gene structure is shown horizontally in the centre of each plot, with domains as coloured bars. Above the gene are mutations in this sample set, and below the gene are mutations found in squamous cell carcinomas from the TCGA sample set. b, Fraction of cells with driver mutations in TP53 (left), NOTCH1 (middle) or all other significant cancer genes (right), split by smoking status.

Scatter plot of estimated telomere lengths (y axis) against the age of the subject (x axis). Individual points represent colonies (n = 398 colonies in which less than 10% of the DNA was derived from the mouse feeder layer). Cells with a near-normal mutational burden are coloured gold.