This example modifies the PE layout for our original run, b40.B2000. We
now target the model to run on the jaguar supercomputer and modify our
PE layout to use a common load balance configuration for CESM on large
CRAY XT5 machines.

In our original example, b40.B2000, we used 128 pes with each
component running sequentially over the entire set of processors.

[Figure: 128-pes/128-tasks layout]
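In env_mach_pes.xml, this sequential layout corresponds to every component using the same 128 MPI tasks, one thread each, rooted at global task 0. The following is a sketch of the relevant settings, written as id = value pairs and inferred from the description above rather than copied from the actual case file:

    NTASKS_ATM = 128   NTHRDS_ATM = 1   ROOTPE_ATM = 0
    NTASKS_LND = 128   NTHRDS_LND = 1   ROOTPE_LND = 0
    NTASKS_ICE = 128   NTHRDS_ICE = 1   ROOTPE_ICE = 0
    NTASKS_OCN = 128   NTHRDS_OCN = 1   ROOTPE_OCN = 0
    NTASKS_CPL = 128   NTHRDS_CPL = 1   ROOTPE_CPL = 0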

Now we change the layout to use 1728 processors and run the ice, lnd, and cpl models concurrently on the same processors as the atm model, while the ocean model runs on its own set of processors. The atm model runs on 1664 pes using 832 MPI tasks, each threaded 2 ways, starting at global MPI task 0. The ice model runs using 320 MPI tasks, not threaded, also starting at global MPI task 0. The lnd model runs on 384 pes using 192 MPI tasks, each threaded 2 ways, starting at global MPI task 320, and the coupler runs on 320 pes using 320 MPI tasks starting at global MPI task 512. The ocn model runs on the remaining 64 pes using 64 MPI tasks starting at global MPI task 832.

[Figure: 1728-pes/896-tasks layout]
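To express the layout just described in env_mach_pes.xml, the task count, thread count, and root task of each component can be changed with the xmlchange utility. The commands below are a sketch that assumes the CESM1 scripts' xmlchange syntax and that xmlchange is available in the case directory; the values themselves follow directly from the layout above:

    > cd $CASEROOT
    > ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 832
    > ./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2
    > ./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
    > ./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 320
    > ./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
    > ./xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 0
    > ./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 192
    > ./xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 2
    > ./xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 320
    > ./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 320
    > ./xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 1
    > ./xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 512
    > ./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 64
    > ./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1
    > ./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 832

With these settings the total number of MPI tasks is 832 (atm, with ice, lnd, and cpl overlapping it) + 64 (ocn) = 896, and since the 832 atm tasks are threaded 2 ways the layout occupies 832 x 2 + 64 = 1728 pes, matching the figure caption above.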

Since we will be modifying env_mach_pes.xml after configure has already been invoked, the following steps are required:
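With the CESM1 scripts, the clean/reconfigure/rebuild sequence looks roughly as follows; the -cleanmach option and the build script name (here assumed to be b40.B2000.jaguar.build for this case and machine) are based on the standard CESM1 workflow and may differ for your scripts version:

    > cd $CASEROOT
    > configure -cleanmach
    > configure -case
    > ./b40.B2000.jaguar.build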

Note that since env_mach_pes.xml has changed, the model has to be reconfigured and rebuilt.

It is interesting to compare the timings from the 128- and
1728-processor runs. The timing output below shows that the original model
run on 128 pes cost 851 pe-hours/simulated_year. Running on 1728 pes,
the model costs more than 5 times as much but runs more than two and a
half times faster.
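As a rough consistency check on those ratios, using only the numbers quoted above (the 2.6x speedup is an assumed round figure for "more than two and a half times faster"):

    pe ratio           = 1728 / 128 = 13.5
    wallclock speedup  ≈ 2.6
    cost ratio         ≈ 13.5 / 2.6 ≈ 5.2

That is, running on 13.5 times as many processors while gaining only about a 2.6-fold increase in throughput implies roughly 5 times the pe-hours per simulated year, in line with the comparison above.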