
Performance & Best Practices

Wednesday Dec 08, 2010

This Oil & Gas benchmark highlights both the computational performance
improvements of the Sun Blade X6275 M2 server module
over the previous generation server module
and the linear scalability achievable for total application
throughput, using a Sun Storage 7410 system to deliver almost 2 GB/sec effective write
performance.

Oracle's Sun Storage 7410 system attached via 10 Gigabit Ethernet to a
cluster of Oracle's Sun Blade X6275 M2 server modules was used to
demonstrate the performance of a 3D VTI Reverse Time Migration application,
a heavily used geophysical imaging and modeling application for Oil & Gas Exploration.
The total application throughput scaling and computational kernel performance improvements
are presented for imaging two production-sized grids using 800 input samples.

The Sun Blade X6275 M2 server module showed up to a 40% performance
improvement over the previous generation server module with super-linear scalability
to 16 nodes for the 9-Point Stencil used in this Reverse Time Migration
computational kernel.
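
For context, a 9-point stencil along each axis corresponds to 8th order in space. The sketch below illustrates the general shape of such a kernel in C; the array names, coefficients, and OpenMP scheduling are illustrative assumptions, not the benchmark's actual code.

/* Minimal 9-point-per-axis (8th-order-in-space) stencil sketch.
 * All names and coefficients are illustrative, not the benchmark code. */
#include <stddef.h>

#define PAD 4  /* a 4-point pad on each side gives the 9-point stencil */

/* p: current wavefield, q: output; padded cube of nx x ny x nz points.
 * c[0..4]: finite-difference coefficients, c[0] the center weight. */
void stencil_9pt(const float *p, float *q,
                 size_t nx, size_t ny, size_t nz, const float c[5])
{
    size_t sy = nx, sz = nx * ny;  /* strides in the flattened cube */
    #pragma omp parallel for collapse(2)
    for (size_t k = PAD; k < nz - PAD; k++)
        for (size_t j = PAD; j < ny - PAD; j++)
            for (size_t i = PAD; i < nx - PAD; i++) {
                size_t o = k * sz + j * sy + i;
                float acc = 3.0f * c[0] * p[o];  /* center, once per axis */
                for (int r = 1; r <= 4; r++)
                    acc += c[r] * (p[o + r]      + p[o - r]        /* x */
                                 + p[o + r * sy] + p[o - r * sy]   /* y */
                                 + p[o + r * sz] + p[o - r * sz]); /* z */
                q[o] = acc;
            }
}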

The balanced combination of Oracle's Sun Storage 7410 system, connected over 10 GbE to the Sun Blade X6275 M2 server module cluster, showed linear scalability for the total application throughput, including the I/O and MPI communication, to produce a final 3-D seismic depth-imaged cube for interpretation.

The final image write from the Sun Blade X6275 M2 server module nodes to Oracle's Sun Storage 7410 system achieved the 10 GbE line speed of 1.25 GBytes/second or better. The combined effects of I/O buffer caching on
the Sun Blade X6275 M2 server module nodes and the 34 GByte write optimized cache on the Sun Storage 7410 system gave up to 1.8 GBytes/second effective write performance.

Application Scaling

Performance and scaling results of the total application, including I/O,
for the reverse time migration demonstration
application are presented. Results were obtained using a
Sun Blade X6275 M2 server cluster with a Sun Storage 7410
system as the file server.
The servers were running with hyperthreading enabled,
allowing for 24 OpenMP threads per server node.

Application Scaling Across Multiple Nodes

Number     Grid Size 1243 x 1151 x 1231                  Grid Size 2486 x 1151 x 1231
of         Total Time  Kernel Time  Total    Kernel      Total Time  Kernel Time  Total    Kernel
Nodes        (sec)       (sec)      Speedup  Speedup       (sec)       (sec)      Speedup  Speedup

  16          501         242        2.1*     2.3*          1060         576        2.0      2.1*
  14          583         271        1.8      2.0           1219         679        1.7      1.8
  12          681         346        1.6      1.6           1420         797        1.5      1.5
  10          807         390        1.3      1.4           1688         890        1.2      1.3
   8         1058         555        1.0      1.0           2085        1193        1.0      1.0

* Super-linear scaling due to the compute kernel fitting better into available cache at larger node counts

Image File Effective Write Performance

The performance of writing the final 3D image from the Sun Blade X6275 M2 server cluster over 10 Gigabit Ethernet to the Sun Storage 7410 system is presented here. One core per node was reserved for MPI I/O, leaving 22 OpenMP compute threads per node with hyperthreading enabled. Performance analytics captured from the Sun Storage 7410 system indicate effective use of its 34 GByte write optimized cache.

Benchmark Description

This Vertical Transverse Isotropy (VTI) Anisotropic Reverse Time Depth
Migration (RTM) application measures the total time it takes to image
800 samples of various production-size grids and write the final image
to disk for the next workflow step, 3-D seismic volume interpretation.
In doing so, it reports the
compute, interprocessor communication, and I/O performance of the individual
functions that comprise the
total solution. Unlike most Reverse Time Migration references,
which focus solely on the performance of the 3D stencil compute kernel,
this demonstration code additionally reports the total throughput involved in
processing large data sets with a full 3D Anisotropic RTM application.
It provides valuable insight into configuration and sizing
for specific seismic processing requirements. The performance effects of
new processors, interconnects, I/O subsystems, and software technologies
can be evaluated while solving a real Exploration business problem.

This benchmark study uses the "in-core"
implementation of this demonstration code where each node reads in only the trace, velocity, and
conditioning data to be processed by that node plus a 4 element array
pad (based on spatial order 8) shared with its neighbors to the left
and right during the initialization phase (a minimal sketch of this pad
exchange follows this paragraph). It maintains previous, current, and next
wavefield state information for each of the source, receiver, and anisotropic wavefields in
memory. The second and third grid dimensions used in this benchmark
are specifically chosen to be prime numbers to exaggerate the effects
of data alignment. Algorithm adaptations for processing higher orders in space and
alternative "out-of-core" solutions using SSDs for wave state checkpointing are
implemented in this demonstration application to better understand the effects of
problem size scaling. Care is taken to handle absorption boundary conditioning and a
variety of imaging conditions appropriately.
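
As a concrete illustration of the pad exchange mentioned above, here is a minimal MPI sketch assuming a one-dimensional slab decomposition with a 4-plane pad; the function and variable names are assumptions, not the demonstration code:

/* Minimal halo-exchange sketch for a 1-D slab decomposition with the
 * 4-plane pad required by spatial order 8. Names are illustrative. */
#include <mpi.h>

#define PAD 4  /* pad width in planes for spatial order 8 */

/* field: local slab of (nx_local + 2*PAD) planes, each of 'plane' floats */
void exchange_pad(float *field, int nx_local, int plane, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    int n = PAD * plane;  /* floats in one pad region */

    /* leftmost interior planes go left; right pad arrives from the right */
    MPI_Sendrecv(field + PAD * plane,              n, MPI_FLOAT, left,  0,
                 field + (PAD + nx_local) * plane, n, MPI_FLOAT, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* rightmost interior planes go right; left pad arrives from the left */
    MPI_Sendrecv(field + nx_local * plane,         n, MPI_FLOAT, right, 1,
                 field,                            n, MPI_FLOAT, left,  1,
                 comm, MPI_STATUS_IGNORE);
}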

Key Points and Best Practices

This demonstration application represents a full
Reverse Time Migration solution.
Many references to the RTM application
tend to focus on the compute kernel and ignore the
complexity that the input, communication, and output bring to the
task.

The Swat analytics data captured from the Sun Storage 7410 system indicated an initial write performance of about 100 MB/sec with the MPI non-blocking implementation. After switching to MPI blocking writes, Swat showed between 1.3 and 1.8 GB/sec, with up to 13000 write ops/sec, when writing the final output image. The Swat results are consistent with the actual measured performance and provide valuable insight into the Reverse Time Migration application's I/O performance.

The reason for this large improvement is the MPI file access mode, specifically whether it is sequential (MPI_MODE_SEQUENTIAL, O_SYNC, O_DSYNC). The MPI non-blocking routines, MPI_File_iwrite_at and MPI_Wait, typically used to overlap I/O and computation, do not support sequential file access mode. Therefore, the application could not take full advantage of the Sun Storage 7410 system's write optimized cache. In contrast, the MPI blocking routine, MPI_File_write_at, defaults to MPI sequential mode, and the performance advantages of the write optimized cache are realized. Since the final image is written at the end of the RTM execution, there is no need to overlap this I/O with computation.
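
A minimal sketch of the blocking approach, assuming each rank writes one contiguous slab of the final image at its own offset (the file name, function name, and arguments are illustrative, not the demonstration code):

/* Blocking final-image write sketch: MPI_File_write_at in place of
 * MPI_File_iwrite_at/MPI_Wait. File and variable names are assumptions. */
#include <mpi.h>

void write_final_image(const float *slab, int count,
                       MPI_Offset offset_elems, MPI_Comm comm)
{
    MPI_File fh;
    MPI_File_open(comm, "final_image.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Each rank writes its slab at its own byte offset; the call blocks,
     * which is acceptable because the write follows the last time step. */
    MPI_File_write_at(fh, offset_elems * (MPI_Offset)sizeof(float),
                      slab, count, MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}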

Additional MPI and OpenMP environment settings used:

setenv SUNW_MP_PROCBIND true
setenv MPI_SPIN 1
setenv MPI_PROC_BIND 1

Adjusting the Level of Multithreading for Performance

The level of multithreading (8, 10, 12, 22, or 24 threads) for the various components of the RTM should be adjustable based on the type of computation taking place. It is best to use the OpenMP num_threads clause to set the level of multithreading for each particular work task, and to use numactl to specify how the threads are allocated to cores in accordance with the OpenMP parallelism level, as sketched below.
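
A minimal sketch of per-task team sizing with the num_threads clause (the task split and thread counts shown are assumptions for illustration):

/* Per-task thread-team sizing sketch; thread counts are illustrative. */
static void axpy(float *a, const float *b, long n, int nthreads)
{
    /* num_threads sizes the team for this particular work task */
    #pragma omp parallel for num_threads(nthreads)
    for (long i = 0; i < n; i++)
        a[i] += 0.5f * b[i];
}

void time_step(float *img, const float *tmp, long n)
{
    axpy(img, tmp, n, 22);  /* compute-heavy task: 22 threads, one
                               core left free for MPI I/O */
    axpy(img, tmp, n, 8);   /* lighter, bandwidth-bound task */
}

The process can then be pinned externally, for example with an invocation along the lines of numactl --cpunodebind=0,1 ./rtm.x (a hypothetical command line), so that core placement matches the chosen team sizes.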

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 12/07/2010.
