NFS – HPCwire
Since 1987 – Covering the Fastest Computers in the World and the People Who Run Them

Stepping Up to the Life Science Storage System Challenge
https://www.hpcwire.com/2015/10/05/stepping-up-to-the-life-science-storage-system-challenge/
October 5, 2015

Storage and data management have become perhaps the most challenging computational bottlenecks in life sciences (LS) research. The volume and diversity of data generated by modern life science lab instruments and the varying requirements of analysis applications make creating effective solutions far from trivial. What’s more, where LS was once adequately served by conventional cluster technology, HPC is now becoming important – by one estimate, 25 percent of bench scientists will require HPC resources in 2015.

Currently, the emphasis is on sequence data analysis, although imaging data is quickly joining the fray. Sometimes the sequence data is generated in one place and largely kept there – think of major biomedical research and sequencing centers such as the Broad Institute and the Wellcome Trust Sanger Institute. Other times, the data is generated by thousands of far-flung researchers whose results must be pooled to maximize the benefit to the LS community – think of the Cancer Genome Hub (CGHub) at UC Santa Cruz, which now holds about 2.3 petabytes of data contributed by researchers worldwide.

Given the twin imperatives of collaboration and faster analysis turnaround times, optimizing storage system performance is a high priority. Complicating the effort is the fact that genomics analysis workflows are themselves complex: each step can be IO or CPU intensive and can involve repeatedly reading and writing many large files to and from disk. Beyond the need to scale storage capacity to support what can be petabytes of data in a single laboratory or organization, there is usually a need for a high-performance distributed file system to take advantage of today's high-core-density, multi-server compute clusters.

Broadly speaking, accelerating genomics analysis pipelines can be tricky. CPU and memory issues are typically easier to resolve. Disk throughput is often the most difficult variable to tweak and researchers report it’s not always clear which combination of disk technology and distributed file system (NFS, GlusterFS, Lustre, PanFS, etc.) will produce the best results. IO is especially problematic.

Alignment and de-duplication, for example, is usually a multi-step, disk-intensive process: perform the alignment and write a BAM file to disk, sort that BAM file and write it back to disk, then deduplicate the sorted BAM file on disk. Researchers are using a full arsenal of approaches – powerful hardware, parallelization, algorithm refinement, storage system optimization – to accelerate throughput. Simply put, storage infrastructures must address two general areas (a sketch of a typical pipeline follows them):

The infrastructure must be capable of handling the volume of data being generated by today’s lab equipment. Modern DNA sequencers already produce up to a few hundred TB per instrument per year, a rate that is expected to grow 100-fold as capacities increase and more annotation data is captured. With many genomics workflows, many terabytes of data must routinely be moved from the DNA sequencing machines that generate the data to the computational component that performs the DNA alignment, assembly, and subsequent genomic analysis.

During the analysis process, multiple tools are applied to the data in its various forms, and each tool has different IO characteristics. For example, in a typical workflow, the output data from a sequencer might be prepared in some way, partitioning it into smaller working packages for an initial analysis. This type of operation involves many reads and writes and is IO bound. The output from that step might then be used to perform a read alignment, an operation that is CPU bound. Finally, the work done on the smaller packages of data needs to be sorted and merged, aggregating the results into one file. This process requires many temporary file writes and is IO bound.
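To make the IO profile of such a workflow concrete, here is a minimal, hypothetical Python driver for the align, sort, and deduplicate steps sketched above. The tool names (bwa, samtools), thread counts, and file paths are illustrative assumptions rather than a recommended configuration; real pipelines typically add further steps.

```python
# Hypothetical driver for the align -> sort -> deduplicate workflow described
# above. Paths, sample names, and thread counts are placeholders; bwa and
# samtools are assumed to be installed and on PATH.
import subprocess
from pathlib import Path

def run(cmd, stdout=None):
    """Run one pipeline stage, optionally redirecting stdout to a file."""
    print("running:", " ".join(cmd))
    if stdout is None:
        subprocess.run(cmd, check=True)
    else:
        with open(stdout, "wb") as fh:
            subprocess.run(cmd, check=True, stdout=fh)

def align_sort_dedup(ref, fq1, fq2, workdir):
    work = Path(workdir)
    sam = work / "aligned.sam"          # large intermediate written to disk
    sorted_bam = work / "sorted.bam"
    dedup_bam = work / "dedup.bam"

    # 1. Alignment: CPU-bound, but the SAM output is a large sequential write.
    run(["bwa", "mem", "-t", "8", ref, fq1, fq2], stdout=sam)

    # 2. Sort: IO-bound; samtools spills temporary chunks to disk, then merges.
    run(["samtools", "sort", "-@", "8", "-o", str(sorted_bam), str(sam)])

    # 3. Mark duplicates: another full read and rewrite of the BAM on disk.
    #    (Production pipelines usually run samtools fixmate -m beforehand.)
    run(["samtools", "markdup", str(sorted_bam), str(dedup_bam)])
    return dedup_bam

if __name__ == "__main__":
    align_sort_dedup("ref.fa", "sample_R1.fastq", "sample_R2.fastq", ".")
```

Each stage reads and rewrites the full dataset on disk, which is why the IO-bound steps come to dominate once many samples run concurrently.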

On the compute side, there are a variety of solutions available to help meet the raw processing demands of today’s genomics analysis workflows. Organizations can select high-performance servers with multi-core, multi-threaded processors; systems with large amounts of shared memory; analytics nodes with lots of high-speed flash; systems that make use of in-memory processing; and servers that take advantage of co-processors or other forms of hardware-based acceleration.

On the storage side, the choices are more limited. Life sciences code, as noted earlier, tends to be IO bound, with large numbers of rapid read/write calls, and the throughput demands per core can easily exceed practical IO limits. Given the size and number of files being moved in genomic analysis workflows, traditional NAS storage solutions and NFS-based file systems frequently fail to scale out adequately and slow performance. High-performance parallel file systems such as Lustre and the General Parallel File System (GPFS) are often needed.

Determining the right file system isn’t always straightforward. One example of this challenge is a project undertaken by a major biomedical research organization seeking to conduct whole genome sequencing (WGS) analysis on 1,500 subjects; that translated into 110 terabytes (TB) of data, with each whole genome sample accounting for about 75GB. Samples were processed in batches of 75 to optimize throughput, requiring about 5TB of data to be read and written to disk multiple times during the 96-hour processing workflow, with intermediate files adding another 5TB to the I/O load.
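As a quick back-of-the-envelope check (treating 1 TB as 1,000 GB for simplicity), the quoted figures line up roughly as follows; the snippet below simply restates the arithmetic from the paragraph above.

```python
# Back-of-the-envelope check of the volumes quoted above (all sizes approximate).
samples = 1500
per_sample_gb = 75
batch_size = 75

total_tb = samples * per_sample_gb / 1000      # ~112 TB, quoted as ~110 TB
batch_tb = batch_size * per_sample_gb / 1000   # ~5.6 TB read/written per batch

print(f"total input: ~{total_tb:.0f} TB, per 75-sample batch: ~{batch_tb:.1f} TB")
```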

Many of the processing steps were IO intensive and involved reading and writing large 100GB BAM files to and from disk; these did not scale well. Several strategies were tried (e.g., upgrading the network bandwidth, minimizing IO operations, improving workload splitting). Despite the I/O improvements, significant bottlenecks remained when running disk-intensive processes at scale. Specifically, post-alignment processing slowed down on NFS shared file systems due to a high number of concurrent file writes. In this instance, switching to Lustre delivered a threefold improvement in write performance.

Conversely, Purdue chose GPFS during an upgrade of its cyberinfrastructure, which serves a large community of very diverse domains.

“We have researchers pulling in data from instruments to a scratch file and this may be the sole repository of their data for several months while they are analyzing it, cleaning the data, and haven’t yet put it into archives,” said Mike Shuey, research infrastructure architect at Purdue. “We are taking advantage of a couple of GPFS RAS (reliability, availability, and serviceability) features, specifically data replication and snapshot capabilities to protect against site-wide failure and to protect against accidental data deletion. While Lustre is great for other workloads – and we use it in some places – it doesn’t have those sorts of features right now,” said Shuey.

LS processing requirements – a major portion of Purdue’s research activity – can be problematic in a mixed-use environment. Shuey noted LS workflows often have millions of tiny files whose IO access requirements can interfere with the more typical IO stream of simulation applications; larger files in a mechanical engineering simulation, for example, can be slowed by accesses to these millions of tiny files from a life sciences workflow. Purdue deployed DataDirect Networks acceleration technology to help cope with this issue.

Two relatively new technologies that continue to gain traction are Hadoop and iRODS.

Hadoop, of course, uses a distributed file system and framework (MapReduce) to break large data sets into chunks, to distribute and store (Map) those chunks on nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is that it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating data and processing power (proximity computing) significantly accelerates performance.
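As a rough illustration of that map/shuffle/reduce pattern, here is a small, self-contained Python sketch (plain Python, not the Hadoop API) that counts aligned reads per chromosome from simplified SAM-like records; the input format and the counting task are illustrative assumptions.

```python
# Conceptual illustration of the map/reduce pattern described above, written as
# plain Python rather than against the Hadoop API. The input format (simplified
# SAM-like lines) and the per-chromosome read count are illustrative choices.
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: emit (chromosome, 1) for each aligned read in one chunk of records."""
    for line in chunk:
        fields = line.split("\t")
        chrom = fields[2]          # reference-name column in SAM
        if chrom != "*":           # '*' means the read is unmapped
            yield chrom, 1

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each chromosome."""
    totals = defaultdict(int)
    for chrom, count in pairs:
        totals[chrom] += count
    return dict(totals)

# In Hadoop the chunks would live on the nodes that process them (data locality);
# here we just split a small in-memory example into two "blocks".
records = [
    "r1\t0\tchr1\t100\t60\t...",
    "r2\t0\tchr2\t200\t60\t...",
    "r3\t4\t*\t0\t0\t...",
    "r4\t0\tchr1\t300\t60\t...",
]
chunks = [records[:2], records[2:]]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(mapped))   # {'chr1': 2, 'chr2': 1}
```

Hadoop’s contribution is to run the map tasks on the nodes that already hold the corresponding data blocks and to handle the shuffle, sort, and failure recovery between the two phases.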

It also turns out that Hadoop architecture is a good choice for many life sciences applications, largely because so much of life sciences data is semi- or unstructured file-based data ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g., a Linux cluster) keeps costs down, and little or no hardware modification is required. Still, issues remain, say some observers.

“[W]hile genome scientists have adopted the concept of MapReduce for parallelizing IO, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally…Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements,” noted Alan Day (principal data scientist) and Sungwook Yoon (data scientist) at vendor MapR, in a blog post during the Strata & Hadoop World conference, held earlier this month.

MapR’s approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. “The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale,” they wrote, adding that the MapR solution follows Hadoop and GATK best practices. They argue results can be generated on easily available hardware and that users can expect immediate ROI by moving existing GATK use cases to Hadoop.

iRODS solves a different challenge. It is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set.[1]
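To make the rule-engine idea concrete, here is a minimal conceptual sketch in plain Python (not the iRODS rule language or its API) of an “on ingest, compress and replicate” policy; the event name, resource paths, and helper functions are hypothetical.

```python
# Conceptual sketch (plain Python, not the iRODS rule language or API) of the
# kind of policy described above: when a file is registered, a rule fires that
# compresses it and replicates it to a second storage resource. Resource names
# and helper functions are hypothetical.
import gzip
import shutil
from pathlib import Path

RULES = []

def rule(event):
    """Register a function as a policy to run whenever `event` occurs."""
    def wrap(fn):
        RULES.append((event, fn))
        return fn
    return wrap

def fire(event, **ctx):
    for ev, fn in RULES:
        if ev == event:
            fn(**ctx)

@rule("on_ingest")
def compress_and_replicate(path, replica_resource):
    """Policy: gzip the newly registered object and copy it to a second resource."""
    src = Path(path)
    compressed = src.with_suffix(src.suffix + ".gz")
    with open(src, "rb") as fin, gzip.open(compressed, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    replica_dir = Path(replica_resource)
    replica_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(compressed, replica_dir / compressed.name)

# A data ingest would then trigger the policy, for example:
# fire("on_ingest", path="/data/run42/sample.bam", replica_resource="/archive/resc2")
```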

At the Renaissance Computing Institute (RENCI) of the University of North Carolina, iRODS has been used in several aspects of its genomics analysis pipeline. When analytical pipelines process the data, they also register that data into iRODS, according to Charles Schmitt, director of informatics, RENCI[iv]. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use.

Broad benefits cited by the iRODS consortium include:

iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.

iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.

iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.

What NFS 4.1 and pNFS Mean for NAS Owners
https://www.hpcwire.com/2010/06/08/what_nfs_4_1_and_pnfs_mean_for_nas_owners/
June 8, 2010

NFS Version 4.1 is on the horizon — again. It has been years of predictions and promises for data storage managers, but NFS Version 4.1, which supports parallel NFS, is getting closer to shipping in NAS systems now that the pNFS spec has been approved. What’s next for NFSv4.1? How can pNFS benefit data storage administrators? Will pNFS be worth the wait?

Parallel NFS Is the Future Standard to Manage Petabyte Level Growth
https://www.hpcwire.com/2009/11/19/parallel_nfs_is_the_future_standard_to_manage_petabyte_level_growth/
November 19, 2009

IT professionals are constantly being challenged to manage exponential growth that has reached petabyte levels. With more and more data taxing the system, performance sacrifices are always a consideration. And for applications that demand high performance and scale, the stakes are even higher because any I/O bottlenecks in the system can essentially bring a project to its knees.

So it’s no surprise that removing I/O bottlenecks can have a direct impact on the profitability of a business. Potential benefits include faster time to results, the deployment of more powerful analytical algorithms and filters, and the management of higher-resolution datasets. As pressures increase on IT to deliver even higher levels of productivity and efficiency, a new-generation file system standard will be required to maximize utilization of powerful server and cluster resources while minimizing management overhead.

Challenges of Networked Storage Systems

The question is, “How soon will we eliminate performance bottlenecks, non-scalable file systems, complex client management, vendor lock-in and forklift upgrades?” Most customers have independent networked storage systems that are not capable of achieving the ideal performance, capacity, and client management utilization efficiencies. Storage administrators are constantly looking for ways to address these challenges by reducing management costs and increasing performance to lower the total cost of ownership for networked storage purchases. Bottom line: organizations need to reduce operational costs, increase productivity, or solve a unique problem for a competitive advantage.

Network File System (NFS) answers many of these challenges. NFS is a communication protocol that makes data stored on file servers available to any computer on a network. NFS clients are included in all common operating systems and allow servers to communicate with the file system in the storage network. NFS also ensures interoperability between vendor solutions, allows users to choose best-of-breed products for their storage networks, and eliminates risks associated with proprietary technology.

The NFS v4.1 protocol (approved December 2008) brings many storage management enhancements. These include a global namespace, a feature that helps storage administrators make different hardware components appear as a single system, as well as head and storage scaling. In addition, storage administrators can now perform non-disruptive upgrades without impacting performance. Combined, these features reduce storage operating costs, improve storage network performance, and consolidate systems to reduce management hours.

Parallel NFS (pNFS) Kicks NAS Performance Up a Notch

pNFS kicks NAS performance up an order of magnitude by allowing users to access storage devices directly and in parallel, leveraging the combination of parallel I/O and NFS. Files can be broken up and striped across NAS heads and, leveraging multiple data paths and processors, delivered in parallel to the requestor to provide a significant performance boost. pNFS also introduces the ability to bypass NAS heads for file delivery altogether, and it supports block-, file- and object-based storage layouts.

Parallel I/O delivers higher levels of application performance and allows for massive scalability without diminished performance. Single sequential I/O patterns have many bottlenecks that adversely affect performance, including no load balancing and the inability to aggregate other devices. pNFS solves these issues by providing global namespace functionality without requiring forklift upgrades, while allowing storage administrators to scale performance and storage capacity without disruption. It also eliminates vendor lock-in, providing added flexibility for future upgrades. As a result, customers can further impact their bottom line by lowering their total cost of ownership and maximizing consolidation of storage.
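The performance argument rests on striping and parallel data paths. The following is a conceptual sketch in plain Python (not the pNFS protocol itself) of how a file might be striped across several data servers and reassembled over parallel connections; the stripe size and the directory-per-server layout are illustrative assumptions.

```python
# Conceptual sketch of the striping idea behind pNFS, in plain Python rather
# than the actual protocol: a file is cut into fixed-size stripes spread
# round-robin across several data servers (modeled here as directories), and a
# client reassembles it by fetching the stripes in parallel.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

STRIPE_SIZE = 4 * 1024 * 1024  # 4 MiB stripes, an arbitrary illustrative choice

def write_striped(data: bytes, servers: list[Path], name: str) -> int:
    """Split `data` into stripes and place stripe i on server i % len(servers)."""
    stripes = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    for i, stripe in enumerate(stripes):
        target = servers[i % len(servers)]
        target.mkdir(parents=True, exist_ok=True)
        (target / f"{name}.{i}").write_bytes(stripe)
    return len(stripes)

def read_striped(servers: list[Path], name: str, n_stripes: int) -> bytes:
    """Fetch all stripes concurrently (one path per data server) and reassemble."""
    def fetch(i: int) -> bytes:
        return (servers[i % len(servers)] / f"{name}.{i}").read_bytes()
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        return b"".join(pool.map(fetch, range(n_stripes)))

servers = [Path("ds0"), Path("ds1"), Path("ds2")]
payload = b"x" * (10 * STRIPE_SIZE + 123)
n = write_striped(payload, servers, "bigfile")
assert read_striped(servers, "bigfile", n) == payload
```

In real pNFS deployments the metadata server hands the client a layout describing where the stripes live, and the client then talks to the data servers directly; the sketch only mimics the resulting parallel data paths.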

pNFS Status

The Internet Engineering Task Force (IETF) is expected to complete ratification of pNFS by the end of 2009. pNFS support in standard Linux distributions is expected to be available in mid-2010. Storage customers ultimately have the power to accelerate adoption of new standards if and when they see the value. The first step is to learn more about pNFS and understand its value. Get involved — ask your application vendors and infrastructure providers what their plans are for supporting pNFS. The SNIA NFS Special Interest Group is the recommended source for community-endorsed pNFS education content and events.