SoyKB and iPlant streamline complex bioinformatics analysis

Next-generation sequencing (NGS), in which millions or billions of DNA nucleotides are sequenced in parallel, is the backbone of novel discoveries in life sciences, anthropology, social sciences, biomedical sciences and plant sciences. Read about the SoyKB and iPlant collaboration that is taking plant sciences to the next level.

Image courtesy US National Science Foundation/Thinkstock.

Even though next-generation sequencing (NGS) - with millions or billions of DNA nucleotides sequenced in parallel - is much less costly than first-generation sequencing, it remains too expensive for many labs. NGS platform start-up costs can easily surpass hundreds of thousands of dollars, and individual sequencing reactions can cost thousands of dollars per genome.

Extracting accurate information from the data can also be time-consuming and requires special knowledge of bioinformatics. Even so, this high-throughput computational analysis is the backbone of novel discoveries in the life sciences, as well as in other domains including anthropology, social sciences, and plant sciences.

"Using next-generation sequencing you're getting a snapshot of everything that is happening in a given genome up to that point," says Trupti Joshi, assistant research professor in computer science and core faculty at the Informatics Institute at the University of Missouri (MU), Columbia, US.

In addition to integrating SoyKB - which already includes many built-in informatics tools - with existing iPlant tools, the MU team is developing additional toolsets that will also be available to the iPlant community. "Right now we are building the infrastructure so that we can submit jobs - RNA-seq analysis is just one example - to iPlant Atmosphere." Joshi says three to four different analysis capabilities will be available in a couple of months.

SoyKB includes the tens of thousands of genes in the soybean genome, experimental data on gene expression, fast-neutron mutation data, and genome-wide association study (GWAS) data for soybean lines. SoyKB is unique in that it includes 'multi-omics' experimental data that a particular researcher might otherwise consider irrelevant at a particular time and throw out. By making all research data available, SoyKB lets individual experiments take on an increasingly important role in the bigger picture and enables future researchers to narrow their own results.

iPlant provides storage resources for large datasets, along with high-performance and cloud computing services for analyzing and solving complex research problems, including genome assembly, annotation, and association studies. Video courtesy iPlant Collaborative.

Researchers may want to look at soybeans that have a high oil content, for example, or a high protein content. Or they may want to focus on soybean lines that are more resistant to drought, disease, or insects. Scientists can access data on particular genomic variations directly in SoyKB, using tools to quickly query and isolate items of interest.

More than 19,000 users take part in the iPlant Collaborative, and about 2,500 of them use Atmosphere - iPlant's cloud service that is fully integrated with user management and the Data Store (570 terabytes). "Atmosphere is one of the nicest academic cloud implementations available," says Rynge. "I would say it is on par with Amazon in terms of user interface; really well done."

Rynge is developing a SoyKB submit infrastructure and Pegasus workflows for scientists to pull data from the data store, analyze it, and deposit the results back in the data store - all with the click of a button. The ultimate goal is to make the workflows general enough to be mapped to other infrastructures, which future sequencing groups can use as a starting point.
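The pull, analyze, and deposit pattern described here can be pictured with a minimal Python sketch. It does not use the Pegasus API or the iPlant Data Store: the in-memory dictionary `data_store`, the function names, the file paths, and the toy gene-counting "analysis" are all hypothetical stand-ins chosen only to show how three steps chain into a single call.

```python
from collections import Counter

# Hypothetical stand-in for a remote data store: path -> file contents.
data_store = {
    "input/sample1.txt": "GeneA\nGeneB\nGeneA\nGeneC\nGeneA\n",
}

def pull(store, path):
    """Stage-in step: fetch the raw input from the store."""
    return store[path]

def analyze(raw_text):
    """Toy analysis step: count how often each gene ID appears."""
    gene_ids = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return Counter(gene_ids)

def deposit(store, path, counts):
    """Stage-out step: write the results back as tab-separated text."""
    lines = [f"{gene}\t{n}" for gene, n in sorted(counts.items())]
    store[path] = "\n".join(lines) + "\n"

def run_workflow(store, in_path, out_path):
    """Chain the three steps so that one call runs the whole pipeline."""
    counts = analyze(pull(store, in_path))
    deposit(store, out_path, counts)
    return counts

counts = run_workflow(data_store, "input/sample1.txt", "output/sample1_counts.tsv")
print(data_store["output/sample1_counts.tsv"])
```

Because each step only sees the store through `pull` and `deposit`, swapping the dictionary for a different storage backend would leave the analysis step untouched, which is the kind of portability the article attributes to general-purpose workflows.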

As NGS techniques continue to amass more data than labs and researchers can handle on their own, high-performance computing and infrastructures capable of presenting, analyzing, and storing data will remain critical resources for complex bioinformatics analysis. After all, with 50,000 to 70,000 genes in a single soybean genome, studies spanning thousands of soybean genomes can produce several gigabytes of data for each soybean line.
