Category Archives: database

An open letter to the fungal research community regarding genome database resources (from the Broad Institute & FungiDB/EuPathDB):

As many of you are already aware, fungal genome websites at the Broad Institute are undergoing a major transition. These resources were originally developed in support of sequencing projects, many of which have long since been completed. While we have tried to keep such sites operational for as long as possible without funding, infrastructure changes now underway will make these websites nonfunctional over the coming weeks. We are therefore replacing formerly interactive websites with a static page providing information on fungal projects, along with links to the Broad FTP site where datasets can still be downloaded, and links to NCBI – the primary repository for all genomic data, where all genomes and annotations have been deposited and can be accessed, queried, and downloaded. We are also working to incorporate genomic data into other sites that support comparative analysis of fungal genomes, including FungiDB and MycoCosm.

The EuPathDB family of databases (funded by NIAID/NIH and the Wellcome Trust) supports a wide range of microbial eukaryotes; FungiDB includes many fungal (and oomycete) species, including non-pathogens. This resource has been designed to provide sustainable, cost-effective automated analysis of multiple genomes, integrating curated information (when available), with comments and supporting evidence from the user community (PubMed IDs, phenotypic information, images, datasets, etc). In addition to gene records, browser views, and data downloads, FungiDB offers sophisticated tools for integrating and mining diverse Omics datasets that fungal biologists will find quite useful. See the sidebar on the FungiDB web site for access to tutorials, videos, and exercises.

MycoCosm (supported by JGI/DOE) offers the largest available collection of fungal genomes, for comparative genomics across phylo- and eco-groups, along with interactive web-based tools for genome downloading, searching and browsing, and a form for nominating new species for sequencing to fill gaps in the Fungal Tree of Life.

For many years the Broad has been pleased to work closely with various fungal research communities, and we will continue to work with EuPathDB and MycoCosm to transition data valued by the community. Please direct any inquiries or requests for help to help@FungiDB.org.

The latest release of FungiDB (2.3) is now live and includes 52 genomes, 11 of which are new for this release. This was a longer than expected release cycle due to reintegration with the EuPathDB software team. Programmers Raghu Ramamurthy and Edward Liaw at UC Riverside did nearly all the fungal-specific work, collaborating closely with the EuPathDB team, who provided many site-specific corrections and assistance in running the workflow. This is a joint project between UCR and Oregon State (FungiDB) and U Penn and Univ of Georgia (EuPathDB). The work in this release was funded through grants from the Burroughs Wellcome Fund, the Alfred P. Sloan Foundation, and the USDA-NIFA.

Aspergillus fumigatus
Neurospora crassa
Saccharomyces cerevisiae
Schizosaccharomyces japonicus
Schizosaccharomyces octosporus

New genomics data available in this release include additional RNA-Seq experiments for Coprinopsis cinerea. A High Throughput SNP (HTS) discovery module has been added for Aspergillus fumigatus and a population of 23 strains from JCVI.

Data fixes and updates

Updates in this release include new versions of annotation for:

Aspergillus fumigatus – s03-m02-r18 from AspGD
Aspergillus nidulans – s09-m05-r03 from AspGD
Fusarium oxysporum f. sp. lycopersici – correcting some annotation problems in Broad v2
Neurospora discreta – correcting some annotation problems from JGI
Saccharomyces cerevisiae – version from 2012-11-20

The current annotation for N. crassa is still the v10 release and does not reflect the v12 release made in March 2013. The updated version will be available in the 3.0 release of FungiDB.

Corrections

The Coccidioides RNA-Seq data in the previous release had the labels of the spherule and mycelium results flipped. This has been corrected.

Errors in previous loading of gene product information for P. sojae had left many genes without sufficient product information and description. This has been corrected.

Synteny results between several species were not properly loaded in the previous release. This has been corrected.

Data summary tables of genomes and gene metrics have been updated to reflect the current state of the database.

Known errors

Alternative splicing and starting/ending non-coding exons may not be properly represented in GBrowse and in the GFF files available for download.

On behalf of the FungiDB development team I am pleased to announce the release of FungiDB 2.1, which includes 39 fungal genomes from Ascomycota, Basidiomycota, and Mucormycotina (Zygomycota) and 6 genomes of Oomycetes. This release builds on the 2.0 release from August to include 6 additional species, RNA-Seq from a population of Neurospora strains, and growth time points in 3 fungi (Coprinopsis, Neurospora, and Rhizopus) and in Phytophthora species. The 6 new genomes are Batrachochytrium dendrobatidis, Coprinopsis cinerea, Histoplasma capsulatum, Coccidioides posadasii, Rhizopus delemar (formerly oryzae), and Ustilago maydis.

The Oomycetes are not true Fungi, as phylogenetically they are in a distinctly different clade, but we have included them in the database as part of a collaboration with Brett Tyler. It may be that some aspects of the convergent evolutionary patterns among these groups can be revealed by having the data in a common system and using the same tools.

Several human pathogenic and opportunistic fungi are now available in the system, including 2 strains of Histoplasma capsulatum, 2 species of Coccidioides, Candida albicans, 2 Cryptococcus gattii strains, C. neoformans var grubii, 2 C. neoformans var neoformans strains, Fusarium oxysporum, Aspergillus fumigatus, and A. terreus. With the homolog tools available in the FungiDB system, one can map functional data onto genes in these fungi from related models among the filamentous or yeast species.

The plant pathogens Magnaporthe grisea, Ustilago maydis, Puccinia graminis, and several Fusarium species, together with the collection of 6 Oomycetes, also provide a platform for comparative genomics among plant pathogens.

Functional annotation data have been imported from model system databases for Aspergillus nidulans, Saccharomyces cerevisiae, and C. albicans. We also generate predicted GO annotations from InterPro-based analyses.

This release was the work of the development team at UC Riverside, including Raghu Ramamurthy, past member Daniel Borcherding, and new member Edward Liaw; our collaborators on Oomycete data at Oregon State, Brett Tyler and Sucheta Tripathy; and the EuPathDB developers and systems teams, who have been essential partners in everything from assisting in data development and software debugging to database administration and web and systems administration.

Future work

Work is likely to begin in the next quarter to curate and support further literature-based annotation of gene function in the Cryptococcus species. In addition, we plan to expand the supported phenotypic data for Neurospora to support work from the Program Project grant and the phenotyping of the systematic gene deletion collection.

Additional support will be rolled out for more functional and evolutionary genomics data, including expanded RNA-Seq datasets and population genetic datasets for several species where cohorts of strains have been sequenced. We plan to continue to add species, with priorities focused on pathogens and model systems, but we are interested in community feedback on specific species that are must-include targets for future releases. Please email help[AT]fungidb.org with your suggestions or fill out the feedback form at the “Contact Us” link on the FungiDB page.

Support

The work in this release was supported by grants from the Burroughs Wellcome Fund and the Alfred P. Sloan Foundation. The Oregon State team is supported by grants from the Agricultural and Food Research Initiative of the USDA National Institute for Food and Agriculture. The EuPathDB team is supported by grants from the NIH, Gates Foundation, and Wellcome Trust. Without the direct and indirect support of these funders none of this would have been possible. All web and computational resources for FungiDB are currently housed at the Univ of Pennsylvania or the University of Georgia. Thanks to the many system administrators who keep these services running and have allowed us to make this release.

New genomes from Microsporidia are on the way from the Broad Institute and other groups, and will be a boon to those working on these fascinating creatures. Microsporidia are obligate intracellular parasites of eukaryotic cells and many can cause serious disease in humans. Some parasitize worms and insects too. The evolutionary placement of these species in the fungi is still debated, with recent evidence placing them as derived members of the Mucormycotina based on shared synteny (conserved gene order), in particular around the mating type locus. Their highly derived characteristics and long branches still make them hard to place within the Fungal kingdom. The synteny-based evidence was another way to find a phylogenetic placement for them, but it would be helpful to have additional support in the form of more shared derived characteristics that group Mucormycotina and Microsporidia. There is hope that an increased number of genome sequences and phylogenomic approaches can help resolve the placement and further our understanding of the evolution of the group.

For data analysis, a new genome database for comparing these genomes, called MicrosporidiaDB, is online. This project has begun incorporating the available genomes and providing a data mining interface that extends from the EuPathDB project.

As part of the background work in preparing a grant, I ended up writing a few scripts to see the distribution of fungal species with ITS data in GenBank. The whole spreadsheet of the data is public and available here, and I walk you through the data generation and summary below.

ITS (Internal Transcribed Spacer) is the barcode typically used for identifying fungi at the species level, as it works for most (but not all) groups of Fungi. It falls between highly conserved nuclear rDNA genes (18S, 5.8S, 28S) but tends to be hypervariable, making it a reasonable locus for identifying species: it tends to be unique between species but fairly unchanged among individuals from the same species. You can see a map of the amplified region at Tom Bruns’s site or info at Rytas Vilgalys’s site, among others.

The scripts to extract these and dump the numbers from GenBank use Perl and BioPerl, and the results are plotted in a Google Docs table. I queried for all ITS sequences with a pretty simple query. Some people use a better, more thorough query to get the list of GIs, so I separated the GI query from the statistics about taxonomy.

The GI query code uses BioPerl and queries GenBank over the web to dump out a file of GI numbers. The code is in this Perl script.

This generates a file with GI (GenBank identifier) numbers for nucleotide records. It is not cleaned up to remove problematic sequences, but since we’re interested in overall statistics, I don’t think it is that important if some records have problems. You might want to do some cleanup of these data and expand the query before using it as a reference ITS database for your BLAST queries. See tools built by Henrik Nilsson and others, like Emerencia, for some of the cleanup and detection of problems in an ITS dataset like this.

But given a list of GIs from any query (in our case, of ITS sequences), what is the distribution of taxa, based on what is specified by the submitter (which is not always correct!)? Of course some aren’t specified to the species level or even to the genus level, so the code has to be smart enough to put those in a different category. But of those specified to a particular taxonomic level, what are they? This script tallies the information about the phyla and genera and dumps them out. It takes a while to run the first time because it must build a database of all the GI to taxon record links (the gi_taxid_nucl.dmp file from NCBI taxonomy), so be prepared to wait a while and dedicate several dozen gigabytes to get this all working the first time.
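The tallying step can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl/BioPerl script described above: the function name `tally_by_rank`, the pre-loaded dictionaries, and the "unspecified" bucket label are all assumptions made for the example, and the toy data are not real GenBank records. It assumes the GI to taxon links and the lineage for each taxon have already been loaded into memory.

```python
from collections import Counter


def tally_by_rank(gis, gi_to_taxid, lineages, rank="genus"):
    """Count GIs per taxon name at the requested rank.

    gis         -- iterable of GI numbers
    gi_to_taxid -- dict mapping GI -> taxon ID (hypothetical, pre-loaded)
    lineages    -- dict mapping taxon ID -> {rank: name} (hypothetical, pre-loaded)
    """
    counts = Counter()
    for gi in gis:
        taxid = gi_to_taxid.get(gi)
        lineage = lineages.get(taxid, {})
        # Records not identified down to this rank go in a separate bucket.
        counts[lineage.get(rank, "unspecified")] += 1
    return counts


# Toy data for illustration only.
gi_to_taxid = {1: 100, 2: 100, 3: 200, 4: 300}
lineages = {
    100: {"phylum": "Ascomycota", "genus": "Fusarium"},
    200: {"phylum": "Basidiomycota", "genus": "Coprinopsis"},
    300: {"phylum": "Ascomycota"},  # identified only to phylum
}
all_gis = [1, 2, 3, 4, 5]  # GI 5 has no taxon link at all
genus_counts = tally_by_rank(all_gis, gi_to_taxid, lineages)
phylum_counts = tally_by_rank(all_gis, gi_to_taxid, lineages, rank="phylum")
```

Keeping the "unspecified" bucket explicit makes it easy to see how much of the dataset cannot be assigned at a given rank, rather than silently dropping those records.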

So what is the most abundant deposited genus? According to this analysis it is Fusarium, whose members are found everywhere, especially in soil. This distribution may have much more to do with the types of places being sampled and the types of questions researchers are working on than with relative abundance worldwide, so take it as an interesting observation of what is in the databases! Only in particular environments with studies dedicated to fungal species (for example, the indoor environment, a particular area of a forest, fungi associated with trees in urban and rural environments, or one of many other studies not mentioned) can we really say something about abundance. It is also important to note that massively parallel sequencing studies using 454 are coming online and are not necessarily being deposited directly into this particular database at GenBank. These numbers mainly represent Sanger clone-sequenced data from years past, but it will be a whole new ball game in the next few years as studies start using 454 sequencing as the primary means to identify community structure.

click on image to see this in google docs spreadsheet

So who is generating all that data? Well, I wrote another version of the script which dumps out the authors for records from a particular taxon by querying the GenBank record for the author field of all the records that came from that taxon.
The data are in this spreadsheet.
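
As a rough illustration of the author tally, here is a simplified Python sketch (again, not the Perl/BioPerl script itself). It assumes the records are already in hand as GenBank flat-file text and pulls names from the AUTHORS lines with plain string handling; the function name `tally_authors` and the toy records are made up for the example, and a real script would more sensibly rely on a proper GenBank parser.

```python
import re
from collections import Counter


def tally_authors(genbank_text):
    """Count author names found on AUTHORS lines of GenBank flat-file text."""
    counts = Counter()
    buffer = []          # pieces of the AUTHORS field currently being read
    collecting = False   # True while inside an AUTHORS field

    def flush():
        if buffer:
            joined = " ".join(buffer)
            # Names are separated by ", " or " and "; the comma inside
            # "Smith,J." has no following space, so it is left alone.
            names = re.split(r",\s+|\s+and\s+", joined)
            counts.update(n.strip() for n in names if n.strip())
            buffer.clear()

    for line in genbank_text.splitlines():
        if line.startswith("  AUTHORS"):
            flush()
            buffer.append(line[12:].strip())
            collecting = True
        elif collecting and line.startswith(" " * 12):
            buffer.append(line.strip())  # continuation of the AUTHORS field
        else:
            flush()
            collecting = False
    flush()
    return counts


# Toy records for illustration only.
sample = "\n".join([
    "REFERENCE   1  (bases 1 to 600)",
    "  AUTHORS   Smith,J., Jones,A.B. and",
    "            Lee,C.",
    "  TITLE     Toy record one",
    "REFERENCE   1  (bases 1 to 580)",
    "  AUTHORS   Smith,J. and Lee,C.",
    "  TITLE     Toy record two",
])
author_counts = tally_authors(sample)
```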

So that is a few bits of code using queries of GenBank and BioPerl to link things together. I hope you get some sense of what is out there and maybe can think of interesting variations on this theme to address other data mining questions.

Shepard Fairey has gotten a lot of notice lately for his Obama art that has been replicated pretty much everywhere. I mocked up a homage to his earlier street art, and here we’ll discuss the growing Aspergillus genome posse.

I think a lot of other projects have a Posse too (or maybe just a loosely organized band) in terms of a community of people working on related species and willing to work together to coordinate. As these sorts of “clade” databases start to develop we will have better clusters of information that can be mapped among multiple species.

Eventually I hope this will spur efforts toward more coordinated genome databases for comparative genomics and the transfer of known gene and functional information between experimental systems. These efforts really require coordination or centralization of the data so that gene models can be updated, along with orthologs and phylogenomic inference of function.

The JGI, in collaboration with our lab at Berkeley, has released the Neurospora tetrasperma (mat A) and N. discreta (mat A) genome sequences and annotation after about two years of work. These two species are closely related to the well-studied laboratory workhorse Neurospora crassa.

The N. tetrasperma assembly (8X) has an N50 of 976 kb and is highly colinear with the N. crassa genome. With the JGI, we’ve also done some additional 454 sequencing, which will give an improved assembly at 23X coverage in the next release. We also did some comparative scaffolding and can basically double that N50, and most of the result looks good when compared to the improved V2 assembly.

The N. discreta assembly (8X) is also quite good, with an N50 of 2.3 Mb. For comparison, the V7 of N. crassa has an N50 of 664 kb, although with genetic map information the 250+ contigs can be scaffolded into 7 chromosomes with 146 unmapped contigs.
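
For anyone unfamiliar with the N50 statistic quoted for these assemblies: it is the length L such that contigs (or scaffolds) of length at least L together contain at least half of the assembled bases. A minimal Python sketch, using a hypothetical function name and toy lengths rather than real Neurospora data:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total assembled bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
toy_n50 = n50(toy_lengths)
```

Because the statistic depends only on the sorted lengths, joining contigs into longer scaffolds raises the N50 even when no new sequence is added, which is why comparative scaffolding can change it so much.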

Both the N. discreta and N. tetrasperma genomes contain about 10k predicted genes, similar to counts in other related species like N. crassa and Podospora anserina.

We’re finalizing several analyses to present at the Asilomar meeting to describe these Neurospora genomes and comparisons with other Sordariomycete species.

I’m working to make more data available in the genome browsers for fungi. One addition is the primer information from the Neurospora KO project, which has been added to the Neurospora browser to indicate the position and primer sequences for all the gene knockouts being (or already) constructed. At least 60% of the genes have been knocked out and are available from the FGSC.

We’re also integrating SNP data using the HapMap glyphs; you can see one way to view this information in the Genome Browser for Coccidioides. We are working on other information, including PhastCons conservation profiles, on our development server and hope to make this public soon.

The Broad Institute, in collaboration with many of the Coprinopsis cinerea (Coprinus cinereus) community of researchers, has updated the genome annotation for C. cinerea with additional gene calls based on ESTs and improved gene callers. The annotation was made on the 13 chromosome assembly produced by the SEMO fungal biology group and collaborators across the globe, including a BAC map from H. Muraguchi. Thanks to Jonathan Goldberg and colleagues at the Broad Institute for getting this updated annotation out the door.

This updated annotation is able to join and split several sets of genes and the gene count sits at just under 14k genes in this 36Mb genome. There are a couple of hiccups in the GTF and Genome contig/supercontig file naming that I am told will be fixed by early next week. Additional work to annotate the “Kinome” by the Broad team provides some promising new insight to this genome annotation as well.

We’re using this updated genome assembly to address questions about the evolution of genome structure by studying syntenic conservation and aspects of crossing over points during meiosis. The C. cinerea system has long been used as a model for fungal development and morphogenesis of mushrooms, as it is straightforward to induce mushroom fruiting in the laboratory. It is also a model for studying meiosis due to the synchronized meiosis occurring in the cells in the cap of the mushroom.