The Centre for Australian Weather and Climate Research A partnership between CSIRO and the Bureau of Meteorology Optimisation progress for UM7.8 Ilia Bermous.


1
The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology
Optimisation progress for UM7.8
Ilia Bermous
7 April 2011
Thank you to Joerg Henrichs, Martin Dix and Mike Naughton for their help and advice during this work.

2
Description of global forecast test job
• Global model with N320L70 resolution
• Based on Fabrizio’s xazje job (which came from the forecast step of Chris’s APS1 ACCESS-G development suite)
• 24-hour integration with ~30 GB of output
• ~3100-3250 GCR iterations per run (3162 with UM7.5 and 3223 with UM7.8) for 120 time steps
• Timing results are given as Elapsed CPU Time and Elapsed Wallclock Time from the internal UM model timers, as reported in the UM job output for each run
• All runs used Mike’s version of the UM run script; this script is very simple and flexible for these kinds of tasks
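To put the GCR iteration counts on a per-time-step basis, the totals quoted above can be divided by the 120 time steps. The short sketch below does only that arithmetic; the variable and function names are illustrative and do not come from the UM.

```python
# Totals quoted on the slide: GCR iterations over a 120-step, 24-hour run.
TIME_STEPS = 120
gcr_iterations = {"UM7.5": 3162, "UM7.8": 3223}


def iterations_per_step(total_iterations, steps=TIME_STEPS):
    """Average number of GCR solver iterations per model time step."""
    return total_iterations / steps


per_step = {version: iterations_per_step(total)
            for version, total in gcr_iterations.items()}

for version, value in per_step.items():
    print(f"{version}: {value:.2f} GCR iterations per time step")
```

Both versions average between 26 and 27 solver iterations per time step, so the solver workload of the two runs is closely comparable.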

4
Major UM7.8 developments for performance improvement
• Asynchronous parallel I/O (requires OpenMP)
  - This new feature is activated at both the build and run stages
  - Only works with OpenMP: the UMUI “Use OpenMP” option in the “User Information and Submit Method => Job submission method” panel must be selected
• PMSL revised algorithm (Jacobi algorithm)
  - A revised algorithm based on a Jacobi solver is introduced and the number of iterations is increased. Even with more iterations the new method is cheaper and scales at higher node counts.
• Optimisation for FILL_EXTERNAL_HALOS
  - Resulted in a ~5% reduction in the runtime cost of the fill_external_halos routine
• Improved QPOS algorithms
  - All new versions are significantly quicker at scale (~10%) but require scientific validation, as the science is altered somewhat. The "level" method has been validated in PS25 and is now used in the global model at the Met Office.
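For readers unfamiliar with Jacobi iteration, the following is a minimal generic sketch of the method for a small linear system. It is not the UM PMSL code; the matrix, right-hand side and iteration count are made up for illustration. The point it shows is that each new value depends only on the previous iterate, so all updates within a sweep are independent, which is the property that generally makes Jacobi-type methods parallelise well.

```python
def jacobi(a, b, iterations):
    """Run a fixed number of Jacobi sweeps for the linear system a x = b.

    `a` is a square matrix given as a list of rows and is assumed to be
    diagonally dominant so that the iteration converges.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Build the new iterate entirely from the old one: no update in
        # this sweep reads a value written during the same sweep.
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x


# Illustrative diagonally dominant system whose exact solution is [1, 1, 1].
a = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

solution = jacobi(a, b, iterations=50)
print(solution)
```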

5
Summary of attempts made for UM7.8
• OpenMP usage with Intel compiler and 1 thread on Solar
• OpenMP usage with Intel compiler and 2 threads on Solar
• OpenMP usage with SunStudio compiler and 2 threads on Solar
• OpenMP usage with Intel compiler and 2 threads on NCI system
• OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar

6
OpenMP usage with Intel compiler and 1 thread on Solar
• Problems resolved and reported
  - Found a number of cases in the sources of inconsistent usage of allocate/deallocate statements and IF block logic => reported to the UM developers
  - Performance is significantly affected if TMPDIR is used and modified in the UM scripts => reported to the developers
  - A couple of missing environment variables, such as OMP_NUM_THREADS and OMP_STACKSIZE, should be set by the UMUI scripts if multithreading is used
  - A run-time crash with the Intel11.0.083 compiler was resolved by using Intel11.1.073, the latest compiler available at our site
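As an illustration of the missing environment variables mentioned above, the sketch below builds an environment containing OMP_NUM_THREADS and OMP_STACKSIZE that a launcher script could pass to the model executable. The helper name and the stack size value "1G" are assumptions chosen for the example; they are not settings taken from the UMUI or from these runs.

```python
import os


def openmp_environment(threads, stack_size, base_env=None):
    """Return a copy of the environment with the OpenMP variables set.

    threads    -- number of OpenMP threads per MPI process
    stack_size -- per-thread stack size, e.g. "1G" (hypothetical value)
    base_env   -- environment to extend; defaults to the current one
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    env["OMP_STACKSIZE"] = stack_size
    return env


# Start from an empty environment so the result is easy to inspect.
env = openmp_environment(threads=2, stack_size="1G", base_env={})
print(env)
```

The resulting dictionary can be handed to `subprocess.run(..., env=env)` when starting a job from Python.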

8
OpenMP usage with Intel compiler and 1 thread on Solar (cont #3)
• Performance comparison of the top 6 sections for UM7.8 and UM7.5, without I/O (times in sec)

UM7.8                       UM7.5
PE_Helmholtz      68.15     PE_Helmholtz      70.76
SL_Full_wind      31.35     ATM_STEP          52.30
ATM_STEP          26.46     SL_Full_wind      31.19
SL_Thermo         21.52     SL_Thermo         27.20
READDUMP          13.53     READDUMP          12.67
Atmos_Physics2     9.55     NI_filter_Ctl     15.63

Conclusions:
1. Comparing the top sections, the major performance improvements come from ATM_STEP (25.74 sec), NI_filter_Ctl (9.59 sec) and SL_Thermo (5.68 sec), which gives a total of 41.01 sec
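The per-section savings can be recomputed from the timer values in the comparison above. The sketch below assumes the values as transcribed (including reading the last UM7.8 entry as Atmos_Physics2 with 9.55 sec) and only covers sections that appear in both top-6 lists; NI_filter_Ctl is not in the UM7.8 top 6, so its saving cannot be reproduced from this slide alone.

```python
# Section timings in seconds, as transcribed from the slide.
um78 = {"PE_Helmholtz": 68.15, "SL_Full_wind": 31.35, "ATM_STEP": 26.46,
        "SL_Thermo": 21.52, "READDUMP": 13.53, "Atmos_Physics2": 9.55}
um75 = {"PE_Helmholtz": 70.76, "ATM_STEP": 52.30, "SL_Full_wind": 31.19,
        "SL_Thermo": 27.20, "READDUMP": 12.67, "NI_filter_Ctl": 15.63}

# Positive numbers mean UM7.8 is faster than UM7.5 for that section.
savings = {name: round(um75[name] - um78[name], 2)
           for name in um75 if name in um78}

for name, seconds in sorted(savings.items(), key=lambda item: -item[1]):
    print(f"{name:<14s} {seconds:+7.2f} sec")
```

Computed this way, SL_Thermo matches the quoted 5.68 sec, while ATM_STEP comes out as 25.84 sec rather than the quoted 25.74 sec, which may reflect rounding or a transcription difference on the slide.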

10
OpenMP usage with Intel compiler and 2 threads on Solar
• The same slow performance issue that was found and investigated in detail for UM7.5, and reported in August 2010, still exists; it has not been addressed by Intel at all
• Monitoring model execution shows that sometimes a job runs fast for the first 10-15 steps and then slows down significantly; sometimes the slowdown happens from the start of a run
• Elapsed times for a 14x18 decomposition with 2 threads per MPI process (504 cores), without I/O, are (4915 sec; 4921 sec), compared with (360 sec; 366 sec) using 14x18x1 without I/O
• Conclusions:
  - This long-standing problem must be addressed by Intel
  - At the moment it is not the most critical issue in getting asynchronous parallel I/O functionality with UM7.8
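The size of the slowdown follows directly from the elapsed times quoted above. The sketch below assumes that the two numbers in each pair are the same two timers in the same order, so they can be compared element by element; the names are illustrative.

```python
def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


# Elapsed time pairs from the slide, both for a 14x18 decomposition, no I/O.
two_threads = (4915, 4921)   # 2 OpenMP threads per MPI process, 504 cores
one_thread = (360, 366)      # 14x18x1

ratios = [slowdown(slow, fast) for slow, fast in zip(two_threads, one_thread)]
for ratio in ratios:
    print(f"2-thread run is {ratio:.1f}x slower")
```

With twice as many cores, the two-thread run takes more than 13 times longer than the single-thread run.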

11
OpenMP usage with SunStudio compiler and 2 threads on Solar
• Problems resolved and reported
  - Usage of POINTER INTENT attributes, which is not supported by the Fortran standard => used a workaround and reported the problem to the UM development team
• Multithreading performance results
  - UM performance using 2 threads per MPI process is better than without OpenMP, but scaling is very poor (Lustre striping was not used):
    14x18x2 threads + FLUME_IOS_NPROC=4 => 508 cores => 753 sec
    14x18x1 thread + FLUME_IOS_NPROC=0 => 252 cores => 794 sec
• Notes:
  - Output of the date command was used to calculate elapsed times
  - Several runs using other configurations, such as 20x24x1 and 16x32x1, crashed; the nature of these problems has not been investigated
  - Different optimisation options, such as -O3, -O5, -xtarget=native, -xarch=native, -dalign and -g, make no visible difference to the performance results
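The core counts and the "very poor" scaling above can be checked with a few lines of arithmetic. The sketch assumes that each MPI process uses one core per thread and that each I/O server occupies one additional core, which is consistent with the 508 and 252 core figures on the slide; the function names are illustrative.

```python
def total_cores(nx, ny, threads, io_servers):
    """Cores used by an nx x ny decomposition plus dedicated I/O servers."""
    return nx * ny * threads + io_servers


def relative_efficiency(base_cores, base_time, cores, time):
    """Speedup relative to the base run divided by the increase in cores."""
    speedup = base_time / time
    return speedup / (cores / base_cores)


cores_2t = total_cores(14, 18, threads=2, io_servers=4)
cores_1t = total_cores(14, 18, threads=1, io_servers=0)
efficiency = relative_efficiency(cores_1t, 794, cores_2t, 753)

print(cores_2t, cores_1t)           # core counts of the two runs
print(f"{efficiency:.2f}")          # fraction of ideal scaling achieved
```

Doubling the cores reduces the elapsed time by only about 5%, so the relative parallel efficiency is a little over 50%.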

12
OpenMP usage with Intel compiler and 2 threads on NCI system
• The same slow performance issue as on Solar exists on the NCI system using the Intel11.1.073 compiler and the openmpi1.4.3 library: (3565 sec; 3703 sec)
• Monitoring execution of the model: the job ran slowly from the first time step
• Usage of the latest Intel12.0.084 compiler
  - Compilation crashes for one file; the workaround recommended by Martin, using “-O0” instead of “-O2”, fixes the problem
  - Execution with a 14x18 decomposition using a single thread crashes; this run-time problem has not been investigated

13
OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar
Because of the slow performance issue with multithreading in the computational part, the main idea of this approach is
• to compile all UM7.8 sources except the “io_services” library without the “-openmp” compilation option
• to compile the “io_services” library with multithreading, using the “-openmp” compilation option
UM7.8 major terms in relation to asynchronous parallel I/O:
• FLUME_IOS_NPROC: number of MPI tasks allocated to act as IO servers
• IOS_Spacing: the gap between IO servers in MPI_COMM_WORLD (for optimal performance a node has no more than one IO server)
• buffer_size: amount of data (MB) that each IO server can have outstanding
• IOS_use_async_stash: use asynchronous communications to accelerate diagnostics output
• IOS_use_async_dump: asynchronous DUMP output, not currently available
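The guideline that a node should host no more than one IO server can be checked before a run. The sketch below is an illustration only: it assumes that IO servers sit at ranks 0, IOS_Spacing, 2 x IOS_Spacing and so on, and that ranks are placed on nodes in consecutive blocks. Neither assumption is stated in the slides, and the actual placement in UM7.8 may differ.

```python
from collections import Counter


def io_server_ranks(n_io_servers, spacing):
    """Hypothetical placement: one IO server every `spacing` ranks from 0."""
    return [index * spacing for index in range(n_io_servers)]


def max_io_servers_per_node(ranks, procs_per_node):
    """Largest number of IO servers on one node, assuming block placement
    (rank r runs on node r // procs_per_node)."""
    per_node = Counter(rank // procs_per_node for rank in ranks)
    return max(per_node.values(), default=0)


# 8 IO servers, 8 MPI processes per node.
well_spaced = max_io_servers_per_node(io_server_ranks(8, spacing=8), 8)
too_dense = max_io_servers_per_node(io_server_ranks(8, spacing=4), 8)

print(well_spaced, too_dense)
```

Under these assumptions a spacing equal to the number of MPI processes per node gives one IO server per node, while a smaller spacing places several servers on the same node.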

14
OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #2)
• Problems/issues found
  - Using a 14x18x2 configuration with FLUME_IOS_NPROC=8 and a spacing of 8 (8 nodes are overcommitted), a run-time error was produced:
    forrtl: severe (40): recursive I/O operation, unit 6, file unknown
    Workaround: several write statements producing similar diagnostic output have been commented out (as per Joerg’s message, Peter Kerney has reported this problem to Intel Support)
  - The main asynchronous parallel I/O functionality in the latest model development is available only if the MPI library allows multiple threads to call MPI with no restrictions (MPI_THREAD_MULTIPLE). Unfortunately, our MPI library (OpenMPI) provides only single-threaded support (MPI_THREAD_SINGLE); this is checked by an MPI_QUERY_THREAD call, which returns the current level of thread support
• Comment: this is another example of the obstacles that arise when the user's platform differs from the platform used by the developer
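The thread support check described above amounts to comparing the level an MPI library provides with the level the application needs. The MPI standard defines four levels and guarantees that they are ordered MPI_THREAD_SINGLE < MPI_THREAD_FUNNELED < MPI_THREAD_SERIALIZED < MPI_THREAD_MULTIPLE. The sketch below models that comparison in plain Python without calling MPI; the numeric values are illustrative, since the real constants are implementation defined.

```python
from enum import IntEnum


class ThreadLevel(IntEnum):
    """MPI thread support levels in increasing order of capability.

    The integer values are illustrative; only their ordering matters.
    """
    SINGLE = 0       # only one thread exists in the process
    FUNNELED = 1     # multi-threaded, but only the main thread calls MPI
    SERIALIZED = 2   # any thread may call MPI, one at a time
    MULTIPLE = 3     # any thread may call MPI at any time


def supports(provided, required):
    """True if the provided thread level is at least the required one."""
    return provided >= required


provided = ThreadLevel.SINGLE  # the situation reported on the slide
print(supports(provided, ThreadLevel.MULTIPLE))
```

In a real MPI program the provided level would be obtained from the library, for example through MPI_Init_thread or MPI_Query_thread, rather than set by hand.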

15
OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #3)
• To be able to use some parts of the implemented UM7.8 functionality with the current version of the MPI library, Joerg suggested overriding the UM7.8 setting of MPI_THREAD_SINGLE with MPI_THREAD_FUNNELED (the task can be multi-threaded, but only the main thread makes MPI calls; all MPI calls are funneled to the main thread)
• Results (in sec) using a 20x24 decomposition with Lustre file system striping (4 MB, 8 ways), buffer_size=6000

488 cores (FLUME_IOS_NPROC=8),     560 cores (FLUME_IOS_NPROC=10),
8 MPI processes per node           7 MPI processes per node
431; 436                           390; 395
425; 429                           404; 413
423; 428                           406; 414

16
OpenMP only for “io_services” library with Intel compiler and 2 threads on Solar (cont #4)
• Conclusions
  - Usage of MPI_THREAD_FUNNELED does not provide a visible performance improvement compared with the results achieved using a single thread only
  - Not overcommitting the nodes gives slightly better performance than overcommitting them; the results are similar to those obtained with a single thread only
  - The number of wasted cores in the second configuration, where only 7 MPI processes per node are used, can be reduced to 0 by using Joerg’s mprun.py script in its explicit form, which takes 10-15 lines of text for a single run command
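The node and core counts behind the two configurations can be reproduced with simple arithmetic. The sketch below assumes 8 cores per node, a figure inferred from the core counts quoted in the results (488 and 560 cores) rather than stated in the slides, and it counts a core as unused when no MPI process is assigned to it; the second thread of an IO server may in practice occupy some of those cores.

```python
import math

CORES_PER_NODE = 8  # assumption inferred from the quoted core counts


def node_usage(mpi_processes, processes_per_node,
               cores_per_node=CORES_PER_NODE):
    """Return (nodes, reserved cores, cores without an MPI process)."""
    nodes = math.ceil(mpi_processes / processes_per_node)
    reserved = nodes * cores_per_node
    without_process = reserved - mpi_processes
    return nodes, reserved, without_process


# 20x24 decomposition plus IO servers, as in the two configurations.
packed = node_usage(20 * 24 + 8, processes_per_node=8)
spread = node_usage(20 * 24 + 10, processes_per_node=7)

print(packed)
print(spread)
```

With 7 MPI processes per node, one core on every node has no MPI process assigned to it, which is where the gap between 490 processes and 560 reserved cores comes from.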

17
Next steps for future work
• Merge Joerg’s byte swapping procedure from UM7.5 into UM7.8 (Joerg agreed to do this task)
• Ask Solar Help whether a version of the MPI library with MPI_THREAD_MULTIPLE support could be provided to our site, so that the asynchronous functionality in UM7.8 can be used
• Validation of the numerical results produced with UM7.8
• Send the information presented here to Paul Selwood and ask him:
  - What are the main IOS parameter settings used at the UKMO site with UM7.8?
  - What kind of performance improvement is obtained for the I/O part in comparison with UM7.5?
  - What parameter settings can be recommended for our case?