Solving the noisy neighbour problem

In a compute cluster with shared storage, a small number of jobs can overload the network or file system. This degrades the performance of every job on the cluster and can even bring the cluster down entirely. This is known as the noisy neighbour problem.

Sometimes this problem is caused by rogue jobs that have been submitted to the cluster by mistake. At other times the cluster may simply be overloaded by a large number of I/O-hungry jobs.

Mistral monitors application I/O and cluster performance so that jobs exceeding the expected I/O thresholds can be automatically identified and slowed down through I/O throttling.

Development customer

ARM

Olly Stephens, Engineering Systems Architect at ARM, said of the project to develop Mistral with Ellexus:

We wanted to develop a system that will allow the infrastructure to protect itself somewhat against I/O behaviour that is considered a risk. In particular, we wanted the ability for aggressive use of the storage infrastructure to be automatically detected early and remedial steps taken quickly. Previously this activity was done by the HPC support staff, who were able to monitor and detect issues, but this was a slow and difficult process, primarily due to the lack of available information.

The data and system control provided by Mistral will allow the infrastructure to prevent risky I/O patterns and give us a lot more information to learn from.

Whole cluster I/O monitoring

Mistral can monitor application I/O either by wrapping the jobs on the compute node or by intercepting I/O traffic as it passes through an NFS or SMB gateway.

Jobs with higher than expected I/O or higher than expected latency generate an alert.

Mistral monitors the number of read() and write() operations, the I/O bandwidth for reads and writes, and the number of metadata operations such as open() or stat().
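Mistral's own interception happens at a lower level and its internals are not shown here, but the kinds of counters described above can be sketched in a few lines. The following is a purely illustrative Python sketch (the IOAccountant class and its names are hypothetical, not part of Mistral) that tracks call counts, bytes moved, and metadata operations for a small workload:

```python
import os
import tempfile
from collections import Counter

class IOAccountant:
    """Toy accounting of the counters an I/O monitor tracks:
    call counts, bytes moved, and metadata operations.
    (Illustrative only; Mistral intercepts I/O at a lower level.)"""

    def __init__(self):
        self.counts = Counter()   # read/write/open/stat call counts
        self.bytes = Counter()    # bytes read and written

    def open(self, path, mode="r"):
        self.counts["open"] += 1
        return _TrackedFile(open(path, mode), self)

    def stat(self, path):
        self.counts["stat"] += 1
        return os.stat(path)

class _TrackedFile:
    """Wraps a file object and reports each operation to the accountant."""

    def __init__(self, f, acct):
        self._f, self._acct = f, acct

    def read(self, *args):
        data = self._f.read(*args)
        self._acct.counts["read"] += 1
        self._acct.bytes["read"] += len(data)
        return data

    def write(self, data):
        n = self._f.write(data)
        self._acct.counts["write"] += 1
        self._acct.bytes["write"] += n
        return n

    def close(self):
        self._f.close()

# Example: account for a small write-then-read workload.
acct = IOAccountant()
path = os.path.join(tempfile.mkdtemp(), "job.dat")
f = acct.open(path, "w")
f.write("x" * 1024)
f.close()
acct.stat(path)
f = acct.open(path)
f.read()
f.close()
print(dict(acct.counts))  # 2 opens, 1 write, 1 read, 1 stat
print(dict(acct.bytes))   # 1024 bytes written and 1024 read
```

Comparing counters like these against per-job thresholds is what lets alerts fire when a job reads, writes, or issues metadata operations faster than expected.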

Development customer

Software design company

Checking I/O patterns in software test frameworks
There are many reasons why a job might overload the file system. One of our customers is a software company that uses Mistral to look for I/O problems in its test infrastructure. Not only does the company want to protect its own file system, it also wants to stop bad I/O patterns from reaching customers in released software.

Mistral picks up situations where a test generates too many I/O calls or too much data, so the problem can be fixed before release.

Load balancing for shared storage

As well as monitoring I/O to detect rogue jobs, Mistral can throttle the I/O of problem jobs so that the cluster can recover and the remaining well-behaved jobs continue with good performance.

High-priority jobs can be given high limits so that they get a large share of the storage bandwidth. Jobs that perform unexpectedly high I/O are throttled early so that the rest of the cluster is not affected.
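How Mistral implements throttling internally is not described here, but bandwidth limiting of this kind is commonly built on a token-bucket scheme. The sketch below is a generic, hypothetical illustration (not Mistral's code): each job draws from a budget that refills at its assigned rate, so a high-priority job is simply given a larger rate.

```python
import time

class BandwidthThrottle:
    """Token bucket: a job may move at most `rate` bytes per second,
    with short bursts of up to `burst` bytes. (Generic sketch, not
    Mistral's implementation.)"""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self, nbytes):
        """Block until `nbytes` of bandwidth budget is available."""
        while True:
            now = time.monotonic()
            # Refill the bucket in proportion to elapsed time, capped at burst.
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

# A job limited to 1 MB/s calls acquire() before each write;
# four 64 KiB "writes" after the initial burst take roughly 0.2 s.
throttle = BandwidthThrottle(rate=1_000_000, burst=65_536)
start = time.monotonic()
for _ in range(4):
    throttle.acquire(65_536)
elapsed = time.monotonic() - start
print(f"throttled 256 KiB in {elapsed:.2f}s")
```

The first write consumes the initial burst immediately; each subsequent write waits for the bucket to refill, which is what slows a noisy job down without stopping it entirely.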

User education

Mistral can log a history of I/O patterns for each job, building up a picture of how much I/O each job is expected to do and supporting more advanced tuning. This information can be used to educate your users and to redesign your workflows and software pipelines to make better use of your shared storage.