
Tag Archives: CNV

The sequence data tsunami begins to crash into the shore, at the feet of clinicians and patients who want answers and treatment directions. But sometimes the tsunami is washing in debris. As the amount of sequence and variation information grows, some of it comes without clear evaluations of the impacts. Some of it comes with conflicting information. And some of it comes in wrong.

Attempting to wrangle the information into useful understanding and treatments with standardized descriptions, the team building the ClinGen resources published a paper last week that details their efforts. The paper describes their history and goals, and how they are moving to get to a point where they have useful information for and from patients, their doctors, testing labs, and researchers. Because of the different needs of different groups, there are several moving parts to the overall ClinGen collection.

In addition to the paper–and several related articles in this NEJM special report–there are videos on their site that tackle different aspects of the ClinGen projects. I’m going to highlight one of them here as the Tip of the Week, but you should also check out the others that are available on their webinars page or their YouTube channel. This video shows the Dosage Sensitivity Map features.

This video provides some of the history and framework for the ClinGen efforts, and then introduces one of the tools they have made available, a dosage sensitivity map. This piece focuses on “evidence-based reviews of dosage sensitivity”, flagging haploinsufficiency (losses of regions) and triplosensitivity (duplications of regions). They describe the scoring system they use to rank structural variations (CNVs, SVs), and their curation of the evidence to support or refute dosage sensitivity. They also note that their process is conservative, and you should keep that in mind as you consider their team’s review of the evidence. But they are definitely open to and interested in feedback, and they hope you will contact them if you have a different understanding from their posted evaluations.
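For reference, those dosage reviews boil down to a small ordinal score for each gene or region. A minimal sketch of the score tiers as I understand them follows; treat the descriptions as paraphrase, not official wording, and check the ClinGen dosage map site for the authoritative definitions:

```python
# Sketch of the ClinGen dosage sensitivity score tiers (paraphrased;
# consult the ClinGen dosage map documentation for official definitions).
DOSAGE_SCORES = {
    0: "no evidence for dosage pathogenicity",
    1: "little evidence for dosage pathogenicity",
    2: "some evidence for dosage pathogenicity",
    3: "sufficient evidence for dosage pathogenicity",
    30: "gene associated with autosomal recessive phenotype",
    40: "dosage sensitivity unlikely",
}

def describe(score):
    """Map a haploinsufficiency or triplosensitivity score to its meaning."""
    return DOSAGE_SCORES.get(score, "unknown score")
```

So a gene like ZEB2 with a haploinsufficiency score of 3 has been curated as having sufficient evidence for dosage pathogenicity.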

To follow along with the video, use this site to explore the features of this part of the ClinGen tool set: http://www.ncbi.nlm.nih.gov/projects/dbvar/clingen/. You can also just click their example genes–for instance, the ZEB2 link shows you a typical page with the score information, links to other resources, and a genome viewer right on the page. Or you can choose to look at external browsers at NCBI, Ensembl, or UCSC. I clicked the UCSC Genome Browser one to see how it displayed, and it automatically presents tracks with the relevant ClinGen data loaded.

In other tips I’ll talk about other pieces of the infrastructure that they are building or coordinating with. Some we’ve talked about before–you can see a previous tip that included the ClinVar resource at NCBI that is foundational to the ClinGen suite and is discussed in their paper as well. They also note the importance of the data from OMIM, and how their mutual efforts are providing important feedback loops to be alerted to needed updates. ClinGen also employs the Human Phenotype Ontology that keeps coming up at OpenHelix lately. Another important piece to this is the standards for naming variants that were recently described by the American College of Medical Genetics and Genomics (paper linked below).

ClinGen, and the various component tools within, are worth looking at, and contributing to, as we try to move more and better information to the clinic for patients and doctors to use effectively. Steven Salzberg has a take on the value of ClinGen here: 17% Of Our Genetic Knowledge Is Wrong.

It’s also very possible that some really important things will happen in the database–new submissions, changes to the status of a variant–that will occur before any papers come out about it. Or it is even possible that a paper never will come out about it. Spend some time learning about the features; I think it will be worth the time.

As I’ve mentioned before, once I start looking over some new tools I’m often led to others in the same arena that offer related but different features. That’s what happened when I looked at the Proband iPad app for human pedigrees. I noted that they are using important community standards, and I decided to follow those threads a bit. That led me to last week’s tip, the Human Phenotype Ontology (HPO).

HPO has been around for a while and I’ve been aware of it, but this recent re-investigation made me realize how mature it has become, and I was impressed with the amount of adoption there’s been in the genomics community in the big projects. But it also led me to some new tools that I hadn’t encountered before. This week’s tip highlights PhenogramViz–combining my appreciation for controlled vocabularies, standards, and data visualization.

The PhenogramViz team illustrates how they analyze and visualize gene-phenotype relationships

Here’s how the PhenogramViz team describes their tool:

A tool that automatically analyses and visualizes gene-to-phenotype relations for a set of genes affected by CNV of a patient and a set of HPO-terms representing the symptoms of said patient. The tool makes full use of the cross-species phenotype ontology “uberpheno” (see here).

So if you have a patient with copy-number variation in their genome, you may be able to use this tool to identify the genes in that CNV segment that convey certain phenotypes. The goal, as stated in their paper linked below, is to assist with the clinical interpretation of genome alterations.
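PhenogramViz does this analysis inside Cytoscape, but the core idea can be sketched naively: intersect the genes in the patient’s CNV with gene-to-phenotype annotations, and score each gene against the patient’s HPO terms. A minimal sketch, with all gene names and term IDs invented for illustration:

```python
# Naive sketch of the idea behind PhenogramViz: rank the genes inside a
# patient's CNV by how many of the patient's HPO terms each gene is
# annotated with. All identifiers below are invented for illustration.
gene_to_hpo = {
    "GENE_A": {"HP:0000001", "HP:0000002"},
    "GENE_B": {"HP:0000002", "HP:0000003", "HP:0000004"},
    "GENE_C": {"HP:0000005"},
}

def rank_cnv_genes(cnv_genes, patient_terms):
    """Return (gene, overlap_count) pairs, best match first."""
    scored = [
        (gene, len(gene_to_hpo.get(gene, set()) & patient_terms))
        for gene in cnv_genes
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

patient = {"HP:0000002", "HP:0000003"}
ranking = rank_cnv_genes(["GENE_A", "GENE_B", "GENE_C"], patient)
# GENE_B overlaps 2 patient terms, GENE_A overlaps 1, GENE_C overlaps 0
```

The real tool is far more sophisticated (it uses ontology structure and cross-species annotations rather than exact term matches), but this is the flavor of the gene-phenotype ranking.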

The additional layer of this effort that I find useful is that they use another ontology to take this even further for supporting information. They employ the “Uberpheno” cross-species phenotype ontology to find further details in model organisms.

I’ll let you get a sense of how this works with one of the tutorial videos from their YouTube channel. They have others too, which will help you with different aspects of everything from installation to analyses. I’ll embed the one that shows how you start with a list of patient symptoms or phenotypes, then load the CNVs or genes, and then, from the results list, simply click for graphical representations of the gene-phenotype relationships. Then with the Cytoscape tools you can interact with the “phenograms” in more detail. There’s no sound; you can read the guidance in the callouts.

The videos include some abbreviations–like HPO. That’s why I talked last week about the Human Phenotype Ontology: I was prepping you for this one. And in another video (Prioritization of pathogenic CNVs) they reference the scoring strategies, which are explained more fully in their paper linked below (the Journal of Medical Genetics one). I would spend some time looking over how the scoring and ranking happens to understand what’s shown.

Although the focus of this is using the data for human diagnosis, I think it could also help researchers choose more appropriate animal models for further testing. There are lots of complaints about the unsuitability of animal models for a range of subjects–so refining those choices, and saving resources in the process, would be another worthwhile use of this tool.

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

I LOVE the idea of the obituary section for NAR! RT @LabSpaces: Jerm Looks at the Annual NAR Web Server Issue http://bit.ly/pImJma @jermdemo with a great post on bioinformatics and new stuff in the field [Mary]


Fascinating work-around for a CNV region in the human genome. RT @Awesomics: Hydatidiform moles help close gap in clinically important region of genome MT @GenomeRef The CCL3L1 region of chr17 http://bit.ly/ff01Hy [Mary]

I love the smell of fresh new attributes… RT @NCBI: New attributes have been added to dbSNP to allow for searching and filtering human genomic variations. http://1.usa.gov/kUeJw8 [Mary]

Why we really do bioinformatics? Um, not so much for me. But did crack me up. RT @davebroome: @SimonBux But then you’d have to take out the part saying “bioinformatics”, which is how you impress the ladies. [Mary]

RT @aaronquinlan: IGV 2.0 is out with new NGS features: “split-view”, “view as pairs”, “splice-junctions”. very nice. #bioinformatics http://goo.gl/OyNdv [Mary]

Pierre has uploaded some slides that are great, and made me laugh out loud in the twitter + facebook section: RT @yokofakun: I’ve uploaded a *draft* of my presentation “being a Bioinformatician 2.0″ http://slidesha.re/jinWgY [Mary]

We all know and love dbSNP, and DGV, and 1000 Genomes, and HapMap, and OMIM, and the couple dozen other variation databases I can think of off the top of my head. But–even though there’s a lot of stuff out there–you never know what you aren’t seeing. What *isn’t* yet stored in those resources? One new consortium suggests that there’s a lot you aren’t seeing. And they aim to make it easier to collect variation data, curate it, visualize it, and have it all in one place. The resource they are constructing is called MutaDATABASE.

MutaDATABASE is a new effort to bring together a lot of variation information that is just not getting into existing databases as it should be. The group is described as “a large consortium of diagnostic testing laboratories in Europe, the United States, Australia, and Asia.” In their Nature Biotechnology correspondence they describe many of the barriers facing deposition of new variants in databases. Among them are lack of incentive (or lack of pressure by publishers and other organizations), challenging/difficult software interfaces for submissions, privacy concerns for medical testing situations, and some desire to withhold novel variations as intellectual property. Not all of these issues can be overcome with some software, but they aim to try.

The structural organization of the consortium and contributor community that they wish to develop is described in this slide, which is like Figure 1 in the publication:

So there is a group of MutaAdministrators who oversee the project as a whole (this name makes me giggle a little bit–like something a sci-fi government might be called…). There are MutaCurators who assemble and review data on a given gene (is it really just genes? what about non-genic regions and large deletions and such? this isn’t entirely clear to me). Clinicians can give input into the curation, and MutaCircles are groups of labs that do diagnostic testing for a gene and can also discuss, submit, and evaluate data. The MutaCurator acts as gatekeeper and is accountable for the final appearance of each record.

The gene-specific collections will be freely available online in their database, and link to disease/phenotype information associated with those variations as well. In the tip-of-the-week movie I’ll show you an example of how you might expect a gene record to look when it’s been filled out to some extent.

MutaREVIEWS is a new “Gene review journal” published only online, which is freely available to all users. It consists of a compilation of gene review studies that describe the most common human disease genes in a standardised way and lists all observed gene variants. The variants include monogenic variants with high penetrance, rare variants with reduced penetrance, and polymorphisms without clinical significance. Each gene review is edited by a specific MutaCURATOR for that gene. These gene reviews are updated every 6 months. There are 12 issues per year.

This project is certainly in its early stages. A lot of the genes I checked just haven’t been curated yet, and I understand that. I hope it works out: I do like the organization and structure, and a one-stop shop would be handy. But the “build a platform and they will come and curate” approach has had mixed success elsewhere in biology. And some of the things that need to happen for this to take off are philosophical, or possibly legal, barriers that are going to vary quite a bit around the research and genetic testing world.

One thing I’d like to see them do is permit and encourage citizen science curation by people who are adopters of personal genomics and looking at data, and by disease community groups who have a specific interest in these genes, but have even more barriers to contribution than the researchers often do. I’ve found stuff from my genome scan that I don’t really have any place to take, and there’s no way to supplement records at that provider’s site as far as I know. But maybe that’s another variation project somewhere….

Anyway, have a look at MutaDATABASE and see what you think. Or if you participate in this project and I’ve not got some part of this right, drop a note in the comments. I know it’s early in the project and I may not have all the finer points in hand from my looking around and reading.

This notice came from DGV (Database of Genomic Variants) while I was on vacation last week, but I wanted to highlight it for a couple of reasons. First–it’s very cool that these groups have now chosen to establish a standard across databases for representing copy-number variation displays. But I also like that they are now providing support for red-green colorblind users. As someone from a family of colorblind people, I appreciate that kind of accessibility.

Here’s the note from the mailing list:

As a result of discussions surrounding the representation of structural variants at the recent ISCA meeting, groups at DGV, NCBI and DECIPHER have decided to standardize colour schemes for gains and losses. Moving forward, deletions/losses will be displayed as red, gains/duplications will be displayed as blue. Regions where both gains and losses occur at the same locus will be represented as brown, and we will continue to represent inversions as purple(indigo). In addition to ensuring the colour schemes are consistent across databases, changes have also been implemented to ensure ease of use for individuals with red-green colour blindness.
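The standardized scheme in that note maps cleanly to a small lookup table, handy if you render CNV tracks yourself. A minimal sketch; the hex values are my own approximations of the named colors, not official ones:

```python
# Standardized CNV display colors per the DGV/NCBI/DECIPHER note above.
# Hex values are illustrative approximations, not the databases' official codes.
CNV_COLORS = {
    "loss": "#cc0000",           # deletions/losses: red
    "gain": "#0066cc",           # gains/duplications: blue
    "gain_and_loss": "#996633",  # both at the same locus: brown
    "inversion": "#4b0082",      # inversions: purple (indigo)
}

def color_for(variant_type):
    """Look up the display color for a CNV call, grey for anything unknown."""
    return CNV_COLORS.get(variant_type, "#999999")
```

Note that the red/blue pairing (rather than red/green) is what makes the scheme workable for red-green colorblind viewers.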

For this week’s Tip of the Week I’ll introduce Varietas, a resource that integrates human variation information such as SNP and CNV data, and offers a handy tabular output with links to additional databases that will enable researchers to quickly explore other sources of information about the variations or regions of interest.

I think this is the first resource I’ve used from Finland. And it’s definitely the first resource I have used that is plaid. But it struck me that plaid is a pretty good conceptualization of the variations that we see in genomes. Some are a single thread, some are larger sections, and the overlaps between the variations we observe in the genome are important to our understanding of them as well. And the history of computation leads back to textile manufacturing, in fact. So I thought it was a pretty good concept.

But let’s explore the threads of Varietas. You can read the paper, which is linked below, but here I’ll just summarize some of the main features. First let me say the focus of this database appears to be human variation, although the site doesn’t make that very clear; as far as I could tell there isn’t any other species’ data. But if you want human variation data, you’ll find a variety of threads available to you. If you check out the About page, you’ll see the source data includes Ensembl, the NHGRI GWAS catalog, SNPedia, and GAD. These sources also provide OMIM data, HGNC nomenclature, phenotypes, and MeSH terms. And the outbound threads include dbSNP, PubMed, SNPedia, and WikiGenes as well. This is also summarized nicely in Figure 1 of their paper.

It’s a very straightforward interface. There is a basic search with a text box for quick searching, and you select the type of data you are starting with: SNPs, genes, keywords, or locations. And the output will be a table with the results that correspond to your query.

If you have larger sets of features that you want to interrogate you can use the advanced forms to enter more data.

The tabular output can be viewed on the web with all the handy links. Or you can download the data as a text file to be used in other ways.

I’ll demonstrate the sample search in the movie, but you won’t see the full range of data that’s available there. I wish they had samples for each type of search. But I found one sample that will also show CNV results: choose the Location radio button and enter the range 6:1234-123400 to see some CNV entries.
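That location query appears to be a simple chrom:start-end string. Here is a small sketch of parsing one, with the format inferred from the example above rather than from any Varietas documentation:

```python
import re

# Parse a Varietas-style location query like "6:1234-123400" into
# (chromosome, start, end). The format is inferred from the post's
# example, not taken from Varietas documentation.
LOCATION_RE = re.compile(r"^(\w+):(\d+)-(\d+)$")

def parse_location(text):
    match = LOCATION_RE.match(text.strip())
    if not match:
        raise ValueError(f"not a chrom:start-end location: {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end

print(parse_location("6:1234-123400"))  # ('6', 1234, 123400)
```

The same shape of string shows up in lots of genomics tools (UCSC, Ensembl, samtools regions), though some use "chr6" prefixes and some treat coordinates as 0-based, so check each tool’s convention.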

Welcome to our Friday feature link dump: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…


Researchers to Map Ozzy Osbourne’s Genome, Find Out Why He’s Alive – Um, ok, so are they using Keith Richards’ & Ronnie Wood’s genomes to confirm the findings? And what about the controls–people who didn’t survive massive overdoses (Janis Joplin, Jimi Hendrix, Sid Vicious, etc.)? And don’t even get me started on the quack genetic products that might result… hat tip to Cyndy [Jennifer]

The Economist has a special report on human genomics. Favorite line so far: “The casual observer, then, might be forgiven for thinking the whole thing a damp squib, and the $3 billion spent on the project to be so much wasted money. But the casual observer would be wrong.” Chew on that Mandel. Hat tip to Eric Topol’s tweeting. [Mary]

So, remember that tidal wave of data we were going to get from the human genome project? Yeah. That was a puddle compared to what’s coming your way now. For this week’s tip of the week I will introduce the very ambitious big data project from the International Cancer Genome Consortium (ICGC). In addition, you’ll get your first look at the shiny new interface for BioMart!

People reading this blog know that we have made great progress on many fronts in the war on cancer. But there’s an awful lot we don’t know yet. The ICGC network of researchers plans to change that. This international group of researchers has organized and standardized an effort to learn about tumors. From their homepage:

ICGC Goal: To obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe.

Check that out:

50 tumor types. Oh–and by the way–they will also obtain a normal tissue sample from the same individual so you can see what’s part of the normal constitution and what has changed in the tumor.

Hundreds of samples of that tumor type. Except for some rare tumors, they intend to obtain 500 samples of each tumor type.

More than a dozen types of cancer. Breast, lung, brain, pancreas, liver, leukemia…and on and on.

Genomic. Transcriptomic. Epigenomic. Each of these is a separate data set that needs to be obtained. Oh, and already there are simple variations (small numbers of nucleotides), CNVs, structural re-arrangements, expression data….And that’s just the initial release.

Are you overwhelmed yet? 50 x 500 x more than a dozen x 3+ types of data (and that’s just back-of-the-napkin, there’s more…). I am daunted just thinking about the scale of this.
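The back-of-the-napkin multiplication is easy to make concrete, using three data layers as a deliberate undercount:

```python
# Back-of-the-napkin scale of the ICGC effort, using the numbers above.
tumor_types = 50
samples_per_type = 500   # and each is paired with a normal tissue sample
data_layers = 3          # genomic, transcriptomic, epigenomic (at least)

datasets = tumor_types * samples_per_type * data_layers
print(datasets)  # 75000 datasets, before counting variant calls, CNVs, ...
```

Seventy-five thousand datasets is the floor, not the ceiling, since each layer spawns multiple derived data types.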

They have organized and standardized the protocols, technologies, data collection, data submissions, and more. You should check out their marker paper for a complete description of their framework. They are going to make two types of data available: open access data that is de-identified, and a controlled access data set with clinical details that you’ll have to register to use.

Do note though: the data (like all these large data projects) is subject to data usage policies that you need to be aware of. There is a publication moratorium that enables the data submitters a window to publish their findings before others are allowed to publish. It’s that typical balance of rapid access to data + a non-scoop window for the data providers. Be sure to familiarize yourself with the policies if you are going to use this data.

But let’s say you are ready for it–you understand the framework, you understand the usage policies–how do you get the data? You use the very cool new interface for BioMart to do it! This is your first opportunity to look at the GUI developed for BioMart v 0.8. There’s more coming, this is an early version. But that’s how you are going to be able to build great custom queries on the underlying data and pull it down. You may be familiar with BioMart from any number of places now (Ensembl, Gramene, FlyBase, WormBase….more). But this is the first implementation of the new look–you are going to want to check that out.
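For the curious, a BioMart query under the hood is an XML document posted to a mart’s web service; the GUI builds it for you, but you can also construct one directly. A minimal sketch of the shape, where the dataset, filter, and attribute names are placeholders rather than real ICGC mart names:

```python
# Sketch of the XML query format BioMart accepts (sent to a mart's
# martservice endpoint as the "query" parameter). Dataset, filter, and
# attribute names below are placeholders, not real ICGC mart identifiers.
QUERY_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">
  <Dataset name="{dataset}">
    <Filter name="{filter_name}" value="{filter_value}"/>
    <Attribute name="gene_symbol"/>
    <Attribute name="mutation_type"/>
  </Dataset>
</Query>"""

query = QUERY_TEMPLATE.format(
    dataset="example_somatic_mutations",  # placeholder dataset name
    filter_name="chromosome",
    filter_value="17",
)
# To actually run it, POST {"query": query} to the mart's martservice URL.
```

The Filter elements narrow the rows returned and the Attribute elements pick the columns, which is exactly what the point-and-click interface is assembling behind the scenes.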

For this week’s Tip of the Week you’ll see the ICGC site, and a quick query of the initial data that is available in the Data Coordination Center (DCC). But this is just an appetizer. Brace yourselves–the deluge is coming.

A Nature News article offers a nice overview, but be sure to check out the full paper for the project details.

Be sure to contact the ICGC team if you have any questions. They want to help you use this data, and will be happy to answer your questions. Personally, I’m making it a mission to help them populate the FAQ–I’ve sent in questions, and so far the answers have come back quite speedily.

As I was browsing over NCBI’s homepage, I happened to notice an announcement dated March 2nd that stated that the dbVar resource that Mary mentioned briefly in a weekly tip a while back is now publicly available. Here’s the brief announcement:

Tuesday, March 02, 2010, 1:00:00 PM
NCBI’s new database of Genomic Structural Variation (dbVar) archives large scale genomic variation data as well as associations of defined variants with phenotypic information.

From the dbVar documentation, it looks like it is mostly in ‘collection mode’ at the moment with lots and lots of data being added, FAQs on how to submit to dbVar, and some background information on what structural variation is, and how it is detected. It looks like the actual graphical displays of the variations use NCBI’s Sequence Viewer. It will be interesting to see how this new NCBI resource grows and is utilized.

edit: 3/16 9am – links to dbVar all appear to be down today. We have an email in to NCBI & will keep you posted on anything that we hear from them.

edit 2: 3/16 1pm – The links to dbVar are working for me now. Thanks, NCBI, for the quick fix!