Saturday, October 10, 2015

Letter RE "Data analysis: Create a cloud commons," Nature

We read with great interest the recent Commentary advocating a system for uploading and accessing genomic data on commercial clouds. While we think this is a step in the right direction, a better approach would be to develop a single, standardized, international, non-commercial depository cloud -- a global enclave -- that could be independently managed and policed, and built to the specifications needed to satisfy both the funding agencies' policies and the nature of the data and its analysis. It would contain a single copy of all the large biomedical datasets.

Putting all important genomic data into an easily and universally accessible system could change the ecosystem in which biocomputing takes place. Researchers are currently hampered by having to upload and download datasets to and from a haphazard assemblage of proprietary websites and clusters, each with its own idiosyncratic methods for categorizing and formatting. Interacting with a single global enclave could be very efficient. However, researchers will still need to develop code on systems more under their own control. Here, we would urge a focus on creating "stub-datasets" that have the look and feel of the typical online datasets but that would be freely available to all researchers, holding no personal information and, as such, carrying no privacy restrictions. Furthermore, these sets would be much smaller than the typical analysis set, permitting upload and download on interactive timescales. The key point is that these datasets would share many of the statistical characteristics of their larger cousins and, consequently, could be used to develop and profile code before deployment on the global enclave.
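
To make the stub-dataset idea concrete, here is a minimal sketch in Python of one possible construction; it is not a method proposed in the Commentary or worked out in this letter. It assumes a numeric table and draws each column independently from the values observed in the full set, so the stub keeps the per-column distributions while not copying any record whole (a real row could still recur by chance). The function name `make_stub`, the column names, and the simulated "full" dataset are all hypothetical. Independent resampling discards correlations between columns and would not by itself satisfy a privacy review, so a real stub would need a more careful design.

```python
import random
import statistics


def make_stub(full_rows, stub_size, seed=0):
    """Return stub_size synthetic rows, drawing each column independently
    (with replacement) from the values observed in full_rows."""
    rng = random.Random(seed)
    columns = list(zip(*full_rows))  # transpose rows into columns
    stub_columns = [rng.choices(col, k=stub_size) for col in columns]
    return [list(row) for row in zip(*stub_columns)]  # transpose back


# Hypothetical stand-in for a large dataset: (read depth, allele frequency).
data_rng = random.Random(42)
full = [(data_rng.gauss(30.0, 5.0), data_rng.betavariate(2.0, 5.0))
        for _ in range(10_000)]

# A stub twenty times smaller than the full set.
stub = make_stub(full, stub_size=500)

# Compare per-column means of the full set and the stub.
for i, name in enumerate(["depth", "allele_freq"]):
    full_mean = statistics.mean(row[i] for row in full)
    stub_mean = statistics.mean(row[i] for row in stub)
    print(f"{name}: full mean {full_mean:.3f}, stub mean {stub_mean:.3f}")
```

Code developed and profiled against `stub` on a local machine could then be run unchanged against the full table inside the enclave, provided both share the same column layout.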