
Parallel Computations Efficiency: Abaqus, Ansys and Simmakers

Posted on: 13.11.2014

Hardware that is based on parallel computing architecture has recently been gaining increasing popularity in high performance computing.

The efficiency of parallel processing hardware in engineering problem solving, such as the computer simulation of physical processes, does not scale directly with the number of processors: four CPU cores do not in fact solve complex engineering problems four times faster than one CPU core. Similarly, transferring computation to graphics cards with hundreds of cores cannot provide a hundredfold increase in speed.

First of all, the acceleration gained from parallel computation is limited by the computational algorithms themselves; running algorithms with a low degree of parallelization on supercomputers and high-performance workstations is wasteful. The notion of "efficiency of parallelization" is explained by Amdahl's law, according to which if at least 1/10 of the program is executed sequentially, then the speedup cannot exceed 10 times the original speed regardless of the number of cores employed.
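The limit described by Amdahl's law can be illustrated with a minimal Python sketch. The function name `amdahl_speedup` and the core counts in the loop are chosen here for illustration only; the formula is the standard form of the law, speedup = 1 / (s + (1 - s) / N), where s is the serial fraction and N is the number of cores.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup predicted by Amdahl's law.

    serial_fraction: share of the run time that cannot be parallelized (0..1).
    cores: number of processing cores used for the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # With 1/10 of the program serial, the speedup approaches
    # but never reaches 10x, however many cores are added.
    for cores in (1, 4, 448, 2880):
        print(f"{cores:>5} cores: {amdahl_speedup(0.1, cores):.2f}x")
```

Under this model, adding cores yields diminishing returns: the serial part of the program sets the ceiling, not the hardware.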

Telling examples of the limited effectiveness of algorithm parallelization for solving engineering problems are provided in the relatively weak results of worldwide leaders in computer-aided engineering (CAE) software - Abaqus and Ansys.

In SIMULIA's Abaqus, moving a computation from 2 CPU cores to 4 CPU cores gave a speedup factor of 1.7. Transferring these algorithms to the CUDA architecture, with the 448 cores of an Nvidia Tesla C2075 working alongside 4 CPU cores, resulted in an increase of only 3.5 times [Source].

Ansys achieved a parallelization efficiency comparable to that of Abaqus. When the number of CPU cores was increased from two to eight, the processing speed of the Ansys Mechanical 15.0 package tripled. Using 2 CPU cores together with the 2880 cores of the Nvidia Tesla K40 accelerator was 3.5 times faster than using the 2 CPU cores alone [Source].

The mathematical solvers embedded in the «Frost 3D Universal» software achieve more effective parallelization of the computational algorithms and make more efficient use of parallel architectures.

A computer model of production wells was used to compare the parallel computing speed on CPUs and GPUs.

The hardware was selected from widely available user computing resources such as the Intel Core i7 CPU and the Nvidia Titan graphics card.

| Specification | Intel Core i7-3770 | Nvidia GeForce GTX Titan |
|---|---|---|
| Cores | 4 | 2688 |
| Base clock | 3.4 GHz | 836 MHz |
| Boost clock | 3.9 GHz | 876 MHz |
| Power | 77 W | 250 W |
| Recommended price | $305 | $1080 |

The three-dimensional model was discretized with different spatial steps, producing meshes of approximately 2 million, 4 million, 8 million and 16 million nodes. Each computational mesh was computed on 1 core of the Intel Core i7, on 4 cores of the Intel Core i7 and on the GeForce GTX Titan video card. The computational results for a two-year simulation forecast are given below.

| Number of nodes | Processing time: 1 core of Intel Core i7 | Processing time: 4 cores of Intel Core i7 | Processing time: GeForce GTX Titan | Speedup factor: 4 cores of Intel Core i7 to 1 core | Speedup factor: GeForce GTX Titan to 4 cores of Intel Core i7 | Speedup factor: GeForce GTX Titan to 1 core of Intel Core i7 |
|---|---|---|---|---|---|---|
| 2,000,000 | 9.62 h (34,632 s) | 5.97 h (21,504 s) | 34.11 min (2,047 s) | 1.61x | 10.50x | 16.91x |
| 4,000,000 | 18.16 h (65,388 s) | 10.63 h (38,287 s) | 57.65 min (3,459 s) | 1.70x | 11.06x | 18.90x |
| 8,000,000 | 34.33 h (123,600 s) | 19.22 h (69,221 s) | 1.62 h (5,844 s) | 1.78x | 11.84x | 21.14x |
| 16,000,000 | 61.14 h (220,104 s) | 32.98 h (118,736 s) | 2.62 h (9,456 s) | 1.85x | 12.55x | 23.27x |

The performance of 1 core of the Intel Core i7 corresponds to a speedup factor of 1x.
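The speedup factors can be recomputed from the measured processing times with a short Python sketch. The timings are copied from the results above; the names `TIMINGS`, `speedup`, `parallel_efficiency` and `summarize` are hypothetical, and parallel efficiency (speedup divided by core count) is a commonly used metric added here for illustration, not one reported in the original measurements.

```python
# Processing times in seconds, copied from the results above:
# nodes -> (1 CPU core, 4 CPU cores, GeForce GTX Titan)
TIMINGS = {
    2_000_000: (34_632, 21_504, 2_047),
    4_000_000: (65_388, 38_287, 3_459),
    8_000_000: (123_600, 69_221, 5_844),
    16_000_000: (220_104, 118_736, 9_456),
}


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Ratio of the baseline run time to the accelerated run time."""
    return baseline_s / accelerated_s


def parallel_efficiency(speedup_factor: float, cores: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return speedup_factor / cores


def summarize(timings: dict) -> list:
    """Build one summary row per mesh size."""
    rows = []
    for nodes, (cpu1, cpu4, gpu) in timings.items():
        cpu4_vs_cpu1 = speedup(cpu1, cpu4)
        rows.append({
            "nodes": nodes,
            "cpu4_vs_cpu1": cpu4_vs_cpu1,
            "gpu_vs_cpu4": speedup(cpu4, gpu),
            "gpu_vs_cpu1": speedup(cpu1, gpu),
            "cpu4_efficiency": parallel_efficiency(cpu4_vs_cpu1, 4),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(TIMINGS):
        print(
            f"{row['nodes']:>10,} nodes: "
            f"4 cores {row['cpu4_vs_cpu1']:.2f}x "
            f"(efficiency {row['cpu4_efficiency']:.0%}), "
            f"GPU vs 4 cores {row['gpu_vs_cpu4']:.2f}x, "
            f"GPU vs 1 core {row['gpu_vs_cpu1']:.2f}x"
        )
```

The recomputed ratios match the published factors to within rounding. Every speedup grows with the mesh size, so the larger models make better use of the parallel hardware.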

It should be noted that, when comparing computational speed on multi-core architectures, the following model parameters have a significant impact on the acceleration:
- the number of materials;
- the number of boundary conditions;
- mesh uniformity;
- whether the number of mesh nodes is a multiple of the number of computational cores;
- conformity of the thermophysical properties of the materials.

This means that the maximum acceleration on parallel architectures is achieved on the simplest models, with a uniform computational mesh and a minimum number of materials and boundary conditions. In practice, however, computational models are more complicated, which is why our speed analysis was based on the production wells simulation model, to give more objective results.

Conclusions:

The use of computational algorithms with a low degree of parallelization is inefficient on multi-core processors and video accelerators.

The major engineering analysis software packages on the market contain a high proportion of serial code, which significantly hampers the acceleration potential of parallel computing. This is largely due to their use of now-dated mathematical solver algorithms, which were developed before technologies such as CUDA existed and were therefore not designed to take advantage of them.

Mathematical algorithms in the latest generation of CAE software are designed around parallel processing technology. This allows a speedup of ten times or more when computation is transferred from a single CPU core to a many-core graphics accelerator.
