National Cancer Institute

at the National Institutes of Health

Some may have thought that completion of the Human Genome Project almost a decade ago was the conclusion of genomic research. In reality, sequencing the human genome was just the beginning. Now that the findings from that landmark effort are widely available, scientists are using the latest sequencing techniques to put that data to work in understanding the genetic causes of many diseases, including cancer.

The human genome has about 3 billion base pairs of DNA, which act as the body’s chemical information database. Genes are units within our DNA that influence everything about us, from how we look, to how well we fight infection, to our risk of certain kinds of cancer. People can inherit certain cancer-risk genes from their parents (often called germline or Mendelian genes) or acquire cancer-causing changes during their lifetime (somatic mutations) through exposure to environmental agents, such as ultraviolet light, chemicals, or radiation.

Only about 350 of an estimated 2,000 cancer genes have been identified, according to scientists at St. Jude Children’s Research Hospital, one of the NCI-designated Cancer Centers working on various genome projects. Since we all share the same basic set of genes as well as many of the same regulatory pathways within our bodies, scientists often use reference sequences of the human genome, commonly referred to as genomic datasets, to serve as a starting point for comparison studies of cancer.

Interpreting, translating, and sharing genomic data

Investigators at NCI, and their partners at NIH and elsewhere, are sequencing DNA to learn everything they can about genes that may cause cancer, with the goal of developing better diagnostics, targeted medications, and treatments for those with certain gene mutations. Among the many worldwide efforts that are disseminating their findings for other researchers to build upon are:

The Cancer Genome Atlas (TCGA), a key initiative at NCI, is a joint program with a sister NIH entity, the National Human Genome Research Institute. TCGA’s goal is to accelerate the understanding of cancer through the application of genome analysis technologies. TCGA is genetically characterizing 20 kinds of human cancers based on hundreds, if not thousands, of samples.

A recent example of TCGA work is a large-scale study of colon and rectal cancer tissue specimens. Historically, the scientific community has treated colon tumors as distinct from rectal cases. Using genomic analyses, however, researchers discovered that colon and rectal cancers were nearly indistinguishable at a genetic level, leading them to conclude that these two cancer types can be grouped together as colorectal cancer. As a result of TCGA’s finding, the data taken and analyzed from the study’s 224 colorectal tumor samples may serve as the foundation for new, precision cancer studies.

Genome-Wide Association Studies (GWAS), performed by many organizations, compare common genetic factors across complete sequences of DNA of many different people to find genetic variants associated with a disease or trait.

GWAS researchers have uncovered genetic variations associated with breast, colorectal, lung, and prostate cancer and with melanoma, as well as other diseases.

For example, researchers at NCI have used GWAS of prostate cancer data sets to identify and localize common and rare variants associated with many prostate cancers. In one such study, they discovered that a common genetic variant, previously associated with an increased risk of prostate cancer, reduces the expression of a gene called MSMB in prostate tissue. This finding supports past research suggesting that MSMB plays a role in prostate cancer development, making it a potential target for drug treatment or other therapy.
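The case-versus-control comparison behind a GWAS can be illustrated with a small sketch. It assumes a single variant is tested at a time by comparing allele counts in people with and without a disease, using a Pearson chi-square statistic on a 2x2 table. The function name, variant labels, and counts are hypothetical and chosen only for illustration; real studies test a very large number of variants with dedicated statistical software and much stricter significance thresholds.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Rows are cases and controls; columns are the variant (alt) allele
    and the reference allele. Assumes every row and column total is > 0.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref).
variants = {
    "variant_A": (60, 40, 40, 60),  # variant allele more common in cases
    "variant_B": (50, 50, 50, 50),  # identical frequency in both groups
}

scores = {name: allelic_chi_square(*counts) for name, counts in variants.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

A larger statistic means the variant's frequency differs more between the two groups than chance alone would suggest; in this toy data only `variant_A` stands out.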

The Pediatric Cancer Genome Project, an initiative of St. Jude Children’s Research Hospital and Washington University, provides genetic data for childhood cancers. Scientists are sequencing the entire genome of their pediatric cancer patients. To date, they have released 520 data sets of normal and tumor tissue samples, taken from 260 infants and children.
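Pairing a normal and a tumor sample from the same patient is what lets researchers separate inherited changes from those acquired by the tumor. The sketch below is a deliberately simplified illustration of that idea: it assumes three short, already aligned sequences of equal length and ignores real-world complications such as sequencing errors, two copies of each chromosome, and alignment. The function name and the toy sequences are hypothetical.

```python
def classify_variants(reference, normal, tumor):
    """Compare aligned, equal-length sequences position by position.

    Returns two lists of (position, old_base, new_base) tuples:
    - germline: the normal sample differs from the reference and the
      tumor carries the same change (present in the patient's own DNA)
    - somatic: the tumor differs from the patient's normal sample
      (acquired by the tumor)
    """
    if not (len(reference) == len(normal) == len(tumor)):
        raise ValueError("sequences must be aligned to the same length")

    germline, somatic = [], []
    for pos, (ref, norm, tum) in enumerate(zip(reference, normal, tumor)):
        if norm != ref and tum == norm:
            germline.append((pos, ref, norm))
        elif tum != norm:
            somatic.append((pos, norm, tum))
    return germline, somatic


# Toy sequences (hypothetical, 10 bases each).
reference = "ACGTACGTAC"
normal = "ACGTTCGTAC"  # differs from the reference at position 4
tumor = "ACGTTCGAAC"   # shares that change and adds one at position 7

germline, somatic = classify_variants(reference, normal, tumor)
print("germline:", germline)  # [(4, 'A', 'T')]
print("somatic:", somatic)    # [(7, 'T', 'A')]
```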

Pediatric cancer sequencing has already provided insights into aggressive childhood cancers. For example, scientists uncovered a gene mutation associated with neuroblastoma, a nerve cancer in children and young adults. The scientists found that the ATRX gene was mutated only in patients age five and older, and that older children and adolescents were more likely than their younger counterparts to have a chronic form of neuroblastoma, causing them to die years after diagnosis. This was the first genetic clue in developing different treatments for pediatric neuroblastoma patients at different ages at diagnosis.

More about genome sequencing

Image: a benchtop sequencer that can perform a variety of next-generation applications

Two sequencing technologies that are commonly used by scientists in their genetic research are whole exome sequencing and whole genome sequencing.

Whole exome sequencing looks only at the exome, the part of the human genome made up of regions that code for proteins. These protein-coding genes account for less than 2 percent of the human genome. Although they make up a small share of the genome, these genes are the heavy hitters of cell function, and may be linked to as many as 85 percent of all diseases, including cancers.

Protein-coding genes are commonly found in both inherited cancers and in tumors that are caused by environmental factors. They may be examined in studies to identify tumor mutations or to identify genetic conditions prevalent within certain populations.

While whole exome sequencing reduces study time, analysis, and cost, false positives (genes that may, for example, appear mutated when they are not) may occur with this platform. In some types of analyses, additional studies, such as genotyping (comparing specific genomic regions to reference samples), may also be required to confirm results. Still, exome sequencing can be a valuable tool to predict or diagnose cancer.

Whole genome sequencing decodes the entire genome (all of a person’s DNA), exploring millions of combinations of genes that make up the human genome. By analyzing this information, researchers can identify mutated or abnormal genes across a complete strand of DNA. These genes can, in some cases, be the origin of a disease such as cancer.

Whole genome sequencing allows researchers to study not just the protein-coding exome but also the roughly 98 percent of the human genome that is non-coding or structural, so that 100 percent of the genome can be examined.

Non-coding genes do not encode proteins but may be essential to chromosome structure. While the function of non-coding DNA is still poorly understood, scientists believe that some of these regions may have biological functions, or carry clues, associated with positive selection and evolution. Scientists have also suggested that, rather than being activated, or turned on, by proteins, some non-coding genes can themselves switch on or block the activity of a nearby gene.

Structural changes that are found in the chromosome (an organized structure of DNA and protein found in cells) include:

Copy-number variations: alterations of the DNA of a genome that result in the cell having an abnormal number of copies of one or more sections of DNA

Deletions: mutations in which part of a chromosome, or a sequence of DNA, is missing

Duplications: mutations in which a section of DNA is copied one or more times. Copying DNA is the basis for biological inheritance, a process that occurs whenever living organisms replicate their DNA, but inadvertent duplications can be harmful.

Inversions: mutations in which an entire section of DNA is reversed.

Translocations: transfer of one part of a chromosome to another chromosome during cell division that can occur with or without a loss or gain of any chromosome material.
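The structural changes listed above can be pictured with a toy model in which a chromosome is a string of labeled segments, one letter per segment. This is only an illustrative sketch: the function names and the example chromosomes are hypothetical, and the model ignores base-level details (for instance, a real inversion also flips the DNA strand).

```python
def delete(chrom, start, end):
    """Deletion: the segments in [start, end) are missing."""
    return chrom[:start] + chrom[end:]


def duplicate(chrom, start, end):
    """Duplication: the segments in [start, end) appear twice in a row."""
    return chrom[:end] + chrom[start:end] + chrom[end:]


def invert(chrom, start, end):
    """Inversion: the order of the segments in [start, end) is reversed."""
    return chrom[:start] + chrom[start:end][::-1] + chrom[end:]


def translocate(source, start, end, target, insert_at):
    """Translocation: move segments [start, end) from one chromosome to another."""
    segment = source[start:end]
    new_source = source[:start] + source[end:]
    new_target = target[:insert_at] + segment + target[insert_at:]
    return new_source, new_target


def copy_number(chrom, segment):
    """Copy number: how many non-overlapping copies of a segment are present."""
    return chrom.count(segment)


chrom1 = "ABCDE"  # hypothetical chromosome with five labeled segments
chrom2 = "VWXYZ"  # a second hypothetical chromosome

print(delete(chrom1, 1, 3))                  # ADE
print(duplicate(chrom1, 1, 3))               # ABCBCDE
print(invert(chrom1, 1, 4))                  # ADCBE
print(translocate(chrom1, 3, 5, chrom2, 0))  # ('ABC', 'DEVWXYZ')
print(copy_number(duplicate(chrom1, 1, 3), "BC"))  # 2
```

Here the translocation moves material without any loss or gain, and the duplication is what raises the copy number of the `BC` section from one to two.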

Next steps

The cost of genome sequencing has plummeted since the start of the Human Genome Project. In 1990, sequencing a single genome cost an estimated $3 billion. Now, researchers are envisioning $1,000 genomes within the next five years, reports Eric Lander, director of the Genome Biology Program at the Broad Institute of MIT and Harvard, and one of the leaders of the Human Genome Project.

Lander attributes this drop to two factors. First, the federal government signaled to industry and academia the scientific importance of DNA sequencing, which elicited creative ideas for genome sequencing that in turn prompted dramatic cost decreases. Second, technological improvements made generating genomic data easier and more efficient. For example, scientists can now view billions of samples on a slide, where once the number was just a few thousand.

Still, as genomic data grows in volume and complexity, analyzing massive amounts of data becomes more challenging. Stay tuned, as scientists continue to search for faster, less costly, and more comprehensive tools in this burgeoning field.