Mass data storage

Many research projects and organisations have large volumes of research data that they need to store. As well as providing high performance computing systems for research, PDC also offers long-term storage for research data – currently this is primarily via our Mass Storage System (MSS) which is managed by the IBM Spectrum Protect software. PDC is also heavily involved in an ongoing project to establish a storage system for Swedish research data using an approach based on the iRODS software for data management.

The Swedish National Infrastructure for Computing (SNIC) provides Swedish researchers with storage for active research data (that is, data that is being collected and analyzed). However, once the data becomes static (for example, when the results of the research have been published), the university that employs the researchers becomes responsible for storing the data. At present, storage for active data is provided by SNIC through the Swestore project and should be available in the near future via the SNIC iRODS project.

If you are involved in a research project that needs to store data long-term, you are welcome to contact PDC Support to discuss purchasing storage from PDC or becoming a pilot user of the PDC iRODS system. The PDC mass storage can be extended fairly easily and cheaply (by buying more tapes for the MSS and extra licenses for the software, or by adding more disks or tapes to the iRODS system), so this can be more economical than alternatives such as setting up and maintaining a tape storage system dedicated to a single project, or buying storage from commercial companies.

Mass Storage System (MSS)

PDC's Mass Storage System is essentially a large library of magnetic tape cartridges that are accessed using a tape robot. This is a very efficient way to archive data that is not accessed frequently but that needs to be stored for a long time.

PDC's IBM TS4500 tape library currently has

14 TS1150 tape drives

850 IBM 3592 JD tape cartridges

~3500 available tape slots

With all ~3500 slots filled, this gives a total uncompressed data capacity of ~35 PB, extendable to 17,550 slots and ~175 PB.
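The capacity figures above follow from multiplying the number of slots by the native capacity of one cartridge. The short sketch below reproduces that arithmetic. It assumes decimal units (1 PB = 1000 TB) and one 10 TB IBM 3592 JD cartridge per slot; the function and variable names are illustrative only and are not part of any PDC tooling.

```python
TB_PER_PB = 1000  # assumption: decimal units, as commonly used in tape specifications


def library_capacity_pb(slots: int, native_tb_per_cartridge: float = 10.0) -> float:
    """Uncompressed capacity in PB if every slot holds one cartridge."""
    return slots * native_tb_per_cartridge / TB_PER_PB


# Figures taken from the text above.
current_slots_pb = library_capacity_pb(3500)   # ~3500 available slots
max_slots_pb = library_capacity_pb(17550)      # fully extended library
installed_pb = library_capacity_pb(850)        # the 850 cartridges listed above

print(f"All current slots filled: {current_slots_pb} PB")
print(f"Fully extended library:   {max_slots_pb} PB")
print(f"850 cartridges:           {installed_pb} PB")
```

Under these assumptions the 850 cartridges listed above hold 8.5 PB uncompressed, and the ~35 PB figure corresponds to a cartridge in every available slot.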

Each IBM TS1150 tape drive has

built-in compression, up to 3:1

dual 8 Gbit/s fibre channel interface

360 MB/s native speed

up to 700 MB/s with compression

Each IBM 3592 JD cartridge has

10 TB native data capacity

up to 30 TB compressed data capacity
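As a rough illustration of what the drive and cartridge figures imply, the sketch below estimates how long it takes to stream one full cartridge at native speed, and the combined native throughput of all 14 drives. It is a back-of-the-envelope calculation only: it assumes decimal units and sustained streaming, and it ignores tape mounts, seeks, and any limits elsewhere in the system. The function and variable names are hypothetical.

```python
def hours_to_stream(data_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to stream data_tb terabytes at rate_mb_per_s (decimal units)."""
    seconds = data_tb * 1_000_000 / rate_mb_per_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


# One 10 TB cartridge written at the 360 MB/s native drive speed.
cartridge_fill_hours = hours_to_stream(10, 360)

# Upper bound on native throughput with all 14 drives streaming at once.
aggregate_native_mb_per_s = 14 * 360

print(f"Time to fill one cartridge: {cartridge_fill_hours:.1f} hours")
print(f"Aggregate native throughput: {aggregate_native_mb_per_s} MB/s")
```

Filling a single cartridge at native speed therefore takes several hours, which is one reason tape suits data that is written once and read rarely.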

IBM Spectrum Protect

IBM Spectrum Protect is the software that PDC uses to manage the data archived in the MSS: this includes storing the data, backing up the data, and recovering damaged data. This system is sometimes referred to as PDC's Backup and Archiving Solution.

Various projects and organizations use PDC's Backup and Archiving Solution, which is connected to the MSS. Data first lands in disk storage pools and, after a certain time, is migrated to the MSS. Off-site backup of the system is performed twice a day to the National Supercomputer Centre (NSC) in Linköping. For example, the Swedish Human Protein Atlas (HPA) program archives its data at PDC and has stored about 370 TB, which is mirrored off-site, in this particular case to the High Performance Computing Center North (HPC2N) in Umeå.

iRODS storage

In collaboration with other partners from the Swedish National Infrastructure for Computing (SNIC), PDC is developing a new service for research data storage. The new service will make it easier to manage research data and will be a significant step towards providing open access data. Researchers who make use of this storage service for live project data will be able to start generating metadata right from the very beginning of their project. This will make it a lot easier to package data and archive it when their projects come to an end. The expectation is that archiving services will also benefit from these efforts immediately, as metadata is an important part of making data searchable and is also useful when it comes to publishing data.

The Integrated Rule-Oriented Data System (iRODS) is open-source software that provides a comprehensive set of tools to support data management tasks from the initial collection of data through to archiving and reusing the data. This is particularly important given the implications of the worldwide movement towards Open Science and Open Data Access. iRODS is supported and maintained by the iRODS Consortium, and is used by research organisations and government agencies worldwide. Over the last few years the iRODS software has been subjected to very significant refactoring and reorganisation – the result of which is that iRODS is now being released as a production-level software distribution with commercial support, as well as a strong user community.

PDC's goal during 2017 has been to implement both a new iRODS-based storage service for SNIC and a separate service that will be available at PDC for Swedish researchers whose research data storage requirements are not addressed by the available SNIC services.

The new iRODS-based SNIC service is being developed to expand the service portfolio of Swestore, the Swedish National Research Data Storage Infrastructure operated by SNIC. The new service will have the advantage that the data will be stored in such a way that it will be interoperable with the services for European research data provided through the EUDAT Common Data Infrastructure (CDI). This will make it easier for Swedish researchers to collaborate and share data with other European researchers, as well as to have transparent access to European e-Infrastructures (such as PRACE, EUDAT and EGI), which is in line with the aims of the European Open Science Cloud initiative.

Some projects storing data at PDC

The following projects have their primary data archives at PDC.

CENTER-TBI

CENTER-TBI, or “Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injuries”, is a large European project that aims to improve the care for patients with traumatic brain injuries and identify the most effective clinical interventions for managing such injuries.

Odin satellite project

The Odin satellite combines two scientific disciplines on a single spacecraft in studies of star formation and the early solar system (astronomy) and the mechanisms behind the depletion of the ozone layer in the Earth's atmosphere and the effects of global warming (aeronomy). The Swedish Space Corporation, on behalf of the Swedish National Space Board and the space agencies of Canada (CSA), Finland (TEKES) and France (CNES), has developed the satellite for astronomers and atmospheric researchers in the participating countries.

Prisma

Prisma is a Swedish-led satellite project that aims to develop and qualify new technology necessary for future science missions in space. Many future projects involve formation flying and rendezvous, so several spacecraft need to communicate and interact with each other with high precision. That requires exceptional accuracy in measuring and controlling the inter-satellite orientation.

SNIC-SENS

SNIC-SENS is a Swedish project that uses high performance computing resources for analyzing sensitive data. PDC is a partner in this project and provides a backup resource for the National Genomics Infrastructure (NGI), which includes backup of sensitive personal data. The system is based on the IBM Spectrum Protect software and provides backup for the NGI facilities at the KTH Royal Institute of Technology and Uppsala University, and also acts as the backup of the NGI production systems which are operated by the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) at Uppsala University.

Human Protein Atlas

The Human Protein Atlas (HPA) is a large Swedish program that aims to map all the human proteins in cells, tissues and organs. The Human Protein Atlas consists of three separate parts, each focusing on a particular aspect of the genome-wide analysis of the human proteins: the Tissue Atlas showing the distribution of the proteins across all major tissues and organs in the human body, the Cell Atlas showing the subcellular localization of proteins in single cells, and finally the Pathology Atlas showing the impact of protein levels for survival of patients with cancer. All the data in the atlases is open access to allow researchers, both in academia and industry, to freely access the data for exploration of the human proteome. The data in the atlases is archived and backed up at PDC.