Lab Discussion

Large-scale biology projects such as the sequencing of the human genome and gene expression surveys using RNA-seq, microarrays and other technologies have created a wealth of data for biologists. However, the challenge facing scientists is analyzing and even accessing these data to extract useful information pertaining to the system being studied. This course focuses on employing existing bioinformatic resources – mainly web-based programs and databases – to access the wealth of data to answer questions relevant to the average biologist, and is highly hands-on.
Topics covered include multiple sequence alignments, phylogenetics, gene expression data analysis, and protein interaction networks, in two separate parts.
The first part, Bioinformatic Methods I, dealt with databases, BLAST, multiple sequence alignments, phylogenetics, selection analysis and metagenomics.
This, the second part, Bioinformatic Methods II, will cover motif searching, protein-protein interactions, structural bioinformatics, gene expression data analysis, and cis-element predictions.
This pair of courses is useful to any student considering graduate school in the biological sciences, as well as students considering molecular medicine.
These courses are based on one taught at the University of Toronto to upper-level undergraduates who have some understanding of basic molecular biology. If you're not familiar with this, something like https://learn.saylor.org/course/bio101 might be helpful. No programming is required for this course although some command line work (though within a web browser) occurs in the 5th module.
Bioinformatic Methods II is regularly updated, and was last updated for March 2019.

MA

I really appreciate this series of courses. I want to thank Prof. Provart and his colleagues for their great job preparing and presenting the series. Thanks a lot!

AR

Aug 24, 2015

Excellent course with excellent support by the Professor and Mentor. Covers what you need to know about protein bioinformatics and detailed expression analysis.

From this lesson

Gene Expression Analysis II

When and where genes are expressed (active) in tissues or cells is one of the main determinants of what makes that tissue or cell the way it is, both in terms of morphology and in terms of response to external stimuli. Several different methods exist for generating gene expression levels for all of the genes in the genome in tissues or even at cell-type-specific resolution. In this class we'll be hierarchically clustering our significantly differentially expressed genes from last time using Bioconductor and the built-in function of an online tool called Expression Browser. Then we'll be using another online tool that uses a similarity metric, the Pearson correlation coefficient, to identify genes responding in a similar manner to our gene of interest, in this case AP3. We'll use a second tool, ATTED-II, to corroborate our gene list. We'll also be exploring some online databases of gene expression and an online tool for doing a Gene Ontology enrichment analysis.

Taught by

Nicholas James Provart

Professor

Transcript

In today's lab we're further exploring gene expression data, and we will be focusing on APETALA3. Again, as I described last lab, APETALA3 acts with PISTILLATA to specify organ identity in the petals and stamens of the developing flower. There are a couple of other floral homeotic genes that we'll see in this lab, APETALA2 and AGAMOUS, as well as APETALA1 and SEPALATA, that are all involved in specifying floral organ identity, a very cool discovery in the early 1990s by Enrico Coen and colleagues.
So what we're doing first of all is clustering some of our expression data from last week, using the built-in clustering of the heatmap function. In our default view, what we're seeing is some patterns of gene expression. They're actually pretty well organized, but that's more by chance. We can then also group the rows, which represent genes, by the similarity of their expression patterns. The tree, as I've talked about in the lecture, represents how similar those expression patterns are, so the branch length here represents the similarity of expression patterns. There are some slightly dodgy results here, and that's due to the similarity metric that's actually used, Euclidean distance, and the complete linkage analysis. So we need to take these clustering results with a grain of salt.
However, when would we want to cluster expression data by genes? The answer to that question is that we would want to do so to see similar patterns of expression for groups of genes. We could then take those groups of genes and do, say, a Gene Ontology enrichment analysis or other kinds of analyses, perhaps cis-element prediction for the promoters of those genes, to see if there are any cis-elements in common to those promoters. We can then also cluster the data by samples and treatments, and we do that by simply leaving out the Colv=NA parameter in this function call here.
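As a rough illustration of the clustering described above, here is a minimal Python sketch of agglomerative clustering with Euclidean distance and complete linkage. It is not the Bioconductor code used in the lab; the function names and the toy expression values are made up for illustration. In the toy data, geneE has the same shape as geneA and geneB but at a higher magnitude, which is one way Euclidean distance can separate genes whose patterns look alike.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage_clusters(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Under complete linkage, the distance between two clusters is the
    LARGEST pairwise distance between their members.
    """
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical toy profiles (rows = genes, columns = samples); not lab data.
profiles = {
    "geneA": [1.0, 1.2, 8.0, 8.5],
    "geneB": [1.1, 1.0, 7.8, 8.7],
    "geneC": [9.0, 8.8, 1.0, 1.2],
    "geneD": [8.7, 9.1, 1.3, 0.9],
    "geneE": [2.0, 2.4, 16.0, 17.0],  # same shape as geneA/geneB, doubled
}

clusters = complete_linkage_clusters(profiles, 3)
print(clusters)
```

Clustering samples instead of genes would use the same procedure on the transposed table, with one profile per sample.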
And the results that we get are similar to the Phyloplot results, the quality control results, that we saw last week, whereby samples showing similar expression profiles across all of the genes are grouped together. We might want to do this to determine the function of mutants. Say we had a mutant plant: we could do gene expression profiling on it and then compare its profile across all genes to the other samples in the database. We might be able to figure out where that mutant lies and what its mode of action is. This can actually be a very powerful method for determining what the mutation could be. So that's the answer to Question d.
Then we use a tool called Expression Angler. Question e asks how this analysis is different from that conducted in Bioconductor in terms of the similarity metrics. In the case of the clustering, what we're doing is grouping genes with similar patterns of expression. We're building a hierarchical tree. In the case of Expression Angler, we're simply taking a query gene, comparing its expression profile across all of the samples to the profiles of all other genes in the database, and pulling out the ones that show the greatest similarity. So we're just ranking them. We're not actually building a tree of any kind. But we are still computing the expression pattern similarities, in this case 28,000 times. In the case of all-by-all it would be a much larger number.
So in which tissues do you see AP3 and its co-expressors showing the strongest expression? Those tissues would be these ones here and here, which are stamens and petals of stage 12 flowers and stage 15 flowers. Question g asks whether there are any proteins of unknown function in the list of co-expressed genes. There are: basically any gene here that doesn't have a name beside it.
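The query-and-rank idea described for Expression Angler can be sketched in a few lines of Python. This is only an illustration of ranking by Pearson correlation coefficient, not the tool's actual implementation; the gene names and expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        # A flat profile has no defined correlation; treated as 0 in this sketch.
        return 0.0
    return cov / (sd_x * sd_y)

def rank_coexpressed(query, profiles):
    """Score every other gene against the query and sort, best match first."""
    q = profiles[query]
    scores = [(name, pearson(q, p))
              for name, p in profiles.items() if name != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical profiles across four samples; not real AP3 data.
profiles = {
    "query_gene": [1, 2, 9, 8],
    "gene1": [2, 4, 18, 16],  # same shape at twice the level
    "gene2": [9, 8, 1, 2],    # opposite pattern
    "gene3": [5, 5, 5, 5],    # flat
}

ranked = rank_coexpressed("query_gene", profiles)
for name, r in ranked:
    print(name, round(r, 3))
```

With one query gene, this takes one comparison per other gene in the database, which is why the lab mentions roughly 28,000 comparisons rather than the far larger all-by-all number needed to build a tree.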
Any gene identifier that doesn't have a name beside it is one whose function isn't known. If you mouse over these gene identifiers you'll see the annotation appear here, and you can quickly identify the ones that have no function ascribed to them. The Pearson correlation coefficient also appears in this box.
We then use a tool for querying expression data across many samples, Expression Browser, and the Expression Browser output actually clusters the gene expression patterns based on their similarity. Question h asks which gene is most highly co-expressed with AP3 across the greater number of samples in the AtGenExpress Plus series. The answer, if we zoom in on this region where AP3 is found, is PISTILLATA. Recall from the introductory slide that APETALA3 and PISTILLATA act together to define the organ identity of whorl two and whorl three. So it's not surprising that those two genes are very highly co-expressed across all of these samples in the database.
Question i asks whether there are any tissues where AP3 co-expressed genes seem to be more strongly expressed. The answer is yes: the floral samples over here, stamens and petals. There's also a set of samples here from the developing ovary of the Arabidopsis flower, where AP3 also seems to be co-expressed. And the answer to the quiz question is these samples here, where you see blue expression. Those are the ones where the levels of AP3 and the other co-expressors seem to be lower than average.
So then we use an online tool, AgriGO, to assess whether or not any particular GO Biological Processes are enriched in that list of 50 genes. Here we see that the GO Biological Process for post-embryonic organ development is enriched, as are the GO BPs for whorl development and floral organ development. This makes sense, as all of these genes, or several of them, are floral homeotic genes.
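To give a feel for what "enriched" means here, the sketch below computes a one-sided hypergeometric p-value, a common choice for GO enrichment tests. The transcript does not say which statistical test AgriGO was run with in the lab, so treat the choice of test as an assumption. Apart from the 50-gene list size, the example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, list_size, hits):
    """P(X >= hits) when drawing list_size genes without replacement from a
    population in which `annotated` genes carry the GO term of interest."""
    total = comb(population, list_size)
    upper = min(annotated, list_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical numbers: a background of 28,000 genes, 200 of which carry the
# term, and 8 of the 50 co-expressed genes annotated with it.
p_value = hypergeom_enrichment_p(28000, 200, 50, 8)
print(p_value)
```

A small p-value means that seeing that many annotated genes in a list of this size would be unlikely by chance alone. Real tools also correct for testing many GO terms at once, which this sketch leaves out.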
So we would expect to see those categories enriched in our GO enrichment analysis.
Then we're exploring expression data on a gene-by-gene basis, using the eFP Browser that my lab has developed. Here, in the default view, we see again, as we saw in the Expression Browser, that APETALA3 is expressed in stamens and petals of stage 12 flowers and stage 15 flowers. Question k asks where expression of AP3 is the strongest. It's in fact strongly expressed in the seed coat of developing seeds. I spoke with a seed coat expert, George Haughn, at UBC today and asked him if AP3's involvement in seed coat development is known. He said no, but it could be possible, as APETALA2 does seem to be involved in seed coat development in spite of being identified as a floral homeotic gene. So we've already generated a hypothesis just by looking at the data.
Then we're using another tool for coexpression analysis, ATTED-II, which, in a similar way to Expression Angler, generates coexpression scores for a query gene. Here we're looking at the genes that are co-expressed with APETALA3. We download the results and then use the Venn Selector tool to ask what the commonality is between these two sets of genes. There are about 12 genes here at the bottom of the list that are in common. So if we were trying to generate hypotheses about genes whose function is not known, we might want to focus on this intersection between the two lists to help focus our search.
Finally, we finish up with GeneMANIA, and Question m asks what other floral homeotic genes are in the APETALA3 network. There are several: PISTILLATA, AGAMOUS, SEPALATA3 and APETALA1. Now, all of these AGAMOUS-like genes are actually floral homeotic genes too. The nice thing about the GeneMANIA output is that it tells us which dataset has contributed to the interaction that's predicted to occur based on these datasets.
So we can sort of get a feel for where that prediction came from. We don't see any of the players from our coexpression lists in GeneMANIA, but that's because the expression datasets that were used are quite different from those that were used in ATTED-II and in Expression Angler. So just keep that in mind when you're doing a co-expression analysis: it does make a difference what datasets you're using.
Alright, that's lab five. The objectives of labs four and five are the following. You should know the main technologies for generating expression data and for extracting information from microarray hybridizations or RNA-seq reads. You should understand the sources of error associated with the outputs of a transcriptomics experiment. You should understand the importance of normalization and how to interpret both MA and box plots, and other quality control parameters. You should be able to select significant genes and to organize them using hierarchical clustering. You should know what the Pearson correlation coefficient score measures. And you should also know how to use gene expression databases to leverage the existing gene expression data. You should know how to use Bioconductor for normalizing expression datasets, to select significantly differentially expressed genes, to create heatmaps, and for clustering. You should also be able to use some online tools to identify co-expressed genes using existing expression databases. And you should also be able to explore expression patterns. Finally, you should be familiar with the archival expression databases such as GEO and the Sequence Read Archive. That's it. I hope you enjoyed the lab, and we'll see you next week. [MUSIC].
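The comparison of the Expression Angler and ATTED-II lists described in the lab amounts to a set intersection. Here is a minimal Python sketch of that step; it is not the Venn Selector tool itself, and the function name and gene identifiers are placeholders rather than real results.

```python
def shared_genes(list_a, list_b):
    """Return identifiers found in both lists, in list_a order.

    Identifiers are trimmed and upper-cased before comparison, so that
    differences in case or stray spaces between two tools' downloads
    do not hide a real match.
    """
    in_b = {gene.strip().upper() for gene in list_b}
    seen = set()
    common = []
    for gene in list_a:
        key = gene.strip().upper()
        if key in in_b and key not in seen:
            seen.add(key)
            common.append(key)
    return common

# Placeholder identifiers standing in for the two downloaded gene lists.
angler_list = ["geneA", "geneB ", "genec", "geneD"]
atted_list = ["GENEC", "geneA", "geneZ"]

overlap = shared_genes(angler_list, atted_list)
print(overlap)
```

Genes that turn up in both lists were recovered from different underlying datasets, which is the reason the lab suggests focusing on the intersection when looking for candidates of unknown function.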