Overview

The high-performance I/O requirements and massive scalability needs of high performance computing (HPC)introduces unique challenges for data storage and access. This document aims to give a high-level overview of how parallel file systems (PFS) can improve the I/O performance of Azure-based HPC solutions, and how to deploy parallel file systems using Azure Resource Manager.

HPC storage on Azure

HPC is used to solve complex problems, such as geological simulations or genome analysis, that are not practical or cost effective to handle with traditional computing techniques. It does this through a combination of parallel processing and massive scalability to perform large and complicated computing tasks quickly, efficiently, and reliably.

In Azure HPC clusters, compute nodes are virtual machines that can be spun up, as needed, to perform whatever jobs the cluster has been assigned. These nodes spread computation tasks across the cluster to achieve the high performance parallel processing needed to solve the complex problems HPC is applied to.

Compute nodes need to perform read/write operations on shared working storage while executing jobs. The way nodes access this storage falls on a continuum between these two scenarios:

·One set of data to many compute nodes

·Many sets of data to many compute nodes

One set of data to many compute nodes

In this scenario, there is a single data source on the network that all the compute nodes access for working data. While structurally simple, any I/O operations are limited by the I/O capacity of the storage location.

Example of one set of data to many compute nodes scenario

Many sets of data to many compute nodes

In this scenario, there are several different data sources, and compute nodes access whichever is relevant to their particular task. While this approach can improve overall I/O performance, it adds a considerable amount of effort in managing and tracking files, and does not completely eliminate the threat of a single storage location causing I/O bottlenecks if one storage location has a file frequently accessed by multiple compute nodes.

Example of many sets of data to many compute nodes

Limits of NFS

Network file system (NFS) is commonly used to provide access to shared storage locations. With NFS a server VM shares out its local file system, which in the case of Azure is stored on one or more virtual hard disks (VHD) hosted in Azure Storage. Clients can then mount the server's shared files and access the shared location directly.

NFS has the advantage of being easy to setup and maintain, and is supported on both Linux and Windows operating systems. Multiple NFS servers can be used to spread storage across a network, but individual files are only accessible through a single server.

In HPC scenarios, the file server can often serve as a bottleneck, throttling overall performance. Currently the maximum per VM disk capacity is 80,000 IOPS and 2 Gbps of throughput. Attempts to access uncached data from a single NFS server at rates higher than this will result in throttling.

In a scenario where dozens of clients are attempting to work on data stored on a single NFS server, these limits can easily be reached, causing your entire application's performance to suffer. The closer to a pure one-to-many scenario your HPC application uses, the sooner you will run up against these limitations.

Multiple storage devices and multiple paths to data are utilized to provide a high degree of parallelism, reducing bottlenecks imposed by accessing only a single node at a time. However, parallel I/O can be difficult to coordinate and optimize if working directly at the level of API or POSIX I/O Interface. By introducing intermediate data access and coordination layers, parallel file systems provide application developers a high-level interface between the application layer and the I/O layer.

Metadata services Metadata services store namespace metadata, such as filenames, directories, access permissions, and file layout. Depending on the particular parallel file system, metadata services are provided either through a separate server cluster or as an integrated part of the overall storage node distribution.

The advantages of distributed storage and superior I/O performance makes parallel file systems preferable to NFS in most HPC scenarios, particularly when it comes to shared working storage space.

Azure-supported parallel file systems

There are many different parallel file system implementations to choose from. The following examples are both compatible with Azure, and have example Resource Manager templates that will make deploying them much simpler.

Lustre

Lustre is currently the most widely used parallel file system in HPC solutions. Lustre file systems can scale to tens of thousands of client nodes, tens of petabytes of storage, and over a terabyte per second of I/O throughput. Lustre is popular in industries such as meteorology, simulation, oil and gas, life science, rich media, and finance.

Lustre uses centralized metadata and management servers, and requires at least one VM for each of those, in addition to the VMs assigned as object storage nodes.

See also

GlusterFS

GlusterFS, developed by Red Hat, also provides a scalable parallel file system specially optimized for cloud storage and media streaming. It does not have separate metadata servers, as metadata is integrated into file storage.

GlusterFS also has Azure Resource Manager templates available:

Drupal 8 VM Scale Set (with GlusterFS and MySQL) – This template deploys an Azure Virtual Machine Scale Set behind a load balancer/NAT with each VM running Drupal (Apache / PHP). All the nodes share the created GlusterFS file storage and MySQL database.