CESM on Stampede2 (TACC)

I've been attempting to get CESM1.2.2 up and running on Stampede's KNL system, Stampede2, but we've run into a few issues. This seems to be architecture (or potentially compiler type and version) specific, as this model works on the NERSC Cori KNL system but not on Stampede. Unfortunately, we are not able to back-migrate compilers or try other versions as they are not available on this system.

We are running at a resolution of two degrees in the B_1850-2000_CN compset. The main error message being produced now is the following, which seems to stem from the ice_transport_remap.F90 file:

This message was produced after almost 1.5 simulated years of a long run, but appeared again after just 15 days when we re-submitted the same run script. It seems there's an underlying problem we're unaware of that is leading to unpredictable failures.

We are using the Intel compiler (version 17.0.4), with the following flags supplied to the model in the config_compilers.xml:

This case did not work for me on Stampede2, so for my initial testing I switched to I_1850_CN, which runs fine.
For B_1850_CN did you get runtime warnings from NetCDF? Are you sure all the input files are on hand?

Thanks for checking this out, Max. It's odd that something would be systematically wrong with the B_1850_CN (or the B_1850-2000_CN) compset. The appropriate files are on hand, but there were a few NetCDF errors/warnings that appeared in the cesm log (i.e., "NetCDF: Invalid dimension", "NetCDF: Variable not found", "NetCDF: Attribute not found", etc.). These had appeared even in simulations that were able to successfully run for a few months, however, so I'm unsure how this relates to this particular failure mode.

Did you make any changes in the machine files of CESM to run on Stampede2 instead of 1? If so, could I glance at the modifications you made?

As of now, it seems that we can work around the problem by assigning only 32 tasks per node instead of 64. The error doesn't seem to be related to a memory limitation though (based on some memory usage checking by TACC support), so it's also unclear why this is getting rid of the array index out of bounds problem we were experiencing previously.

So ICE is not active in I_1850_CN—that explains why it works, but B_1850* doesn't.

Interesting that you can run with MAX_TASKS_PER_NODE=32; I have been using 64.
I will test that myself!

I was able to run B_1850_CN with MAX_TASKS_PER_NODE=64 up to the point where it started writing restart files,
or at least that is how it appeared to me, a relatively inexperienced user (I'm really a benchmarker).
But all components time-stepped, without any NetCDF warnings (those started in the shutdown phase).
To achieve this partial success, I built MODEL="cice" with O0 instead of O2, so I suspect a compiler bug.
I am continuing to chase it.
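Roughly what that override looked like, as a hedged sketch (the exact Macros mechanics differ between CESM versions and ports, so treat the variable handling here as illustrative, not my exact change):

```makefile
# Illustrative only: force -O0 for the CICE component in the Macros
# Makefile fragment; $(MODEL) is set per-component by the CESM build.
ifeq ($(MODEL),cice)
  FFLAGS := $(filter-out -O2,$(FFLAGS)) -O0
endif
```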

As for sharing my port, personally I would love to—but I am working under contract, so first I'd need to ask my client.

It certainly does seem to be related to the ice model in particular. How do you go about changing the optimization flag for only that model component? Is this something you include in the machine build script? If it makes any difference, I've still got MAX_TASKS_PER_NODE set to 256 (in theory, each node has 64 cores with 4 threads per core, so 256 logical cores) and specify --ntasks-per-node=32 via SBATCH to get a functioning simulation.

Understandable that you need to check before sharing these machine files. If your client is open to it though, I'd be incredibly appreciative! Happy to share what we've worked out machine-file wise in return as well.

Interesting, I wasn't aware that you could specify compiler flags for each component of the model - good to know! I've heard that OpenMP is a problem for CESM as well - this has been an issue on multiple machines now, and stems from the ice_transport_remap.F90 file. I believe you can still use OpenMP for the other model components, but it can't be enabled for the ice model.

As far as other systems go, the model seems to work fine on NERSC's Cori-KNL machine, even with 64 tasks per node. This is partly why we're having such a hard time finding the exact cause of this problem - the same model works on one machine, but takes some odd configurations to get working on another machine with fairly similar architecture. So far, the 32 tasks work-around is holding, but it's frustrating to be unable to use the entire node.

Do you happen to know the size of memory on Cori-KNL nodes (cat /proc/meminfo)? Never mind—I checked: both Cori and Stampede2 KNL nodes have 96GB. I had suspected a memory issue (32 tasks have twice as much memory apiece as 64). This would have also explained why multithreading fails when ICE is active. It's still possible that on Stampede2 the memory isn't all available for some reason.

I will test 32 tasks per node myself, for B_1850_CN, the compset that I'd most like to run.

My client allowed me to share configuration, but not results; so once I get comfortable with CESM I can give some specifics. But I didn't change much; at least, it doesn't seem like a lot.

I agree that a memory limitation would make sense. That said, we've worked with someone at TACC on this, and from their results it doesn't seem so obvious that memory is the problem. Simulations with 32 tasks per node have plenty of available memory; it's hard to say with 64 tasks per node, because those jobs fail too quickly to get a useful memory reading. Perhaps you'll have more luck determining whether memory is the main problem. Did the B_1850_CN compset you tried work?

The test you suggest on Cori might be useful. I worry that the results might not be directly applicable, though, since 64 tasks per node works on Cori but not on Stampede2. Next time I'm testing over there, I'll be sure to check.

All my latest tests have failed, including a run with MAX_TASKS_PER_NODE=32.

There appear to be "magic numbers" in cice—were you aware of that? nproc=320 or 640, for example; with those you get very different decompositions than with nproc=512 or 1024.

But even using the "blessed" values of nproc my runs are dying with a "ridging error." So now I've increased the iteration limit from 20 to 100; and I'm testing a build with "-fp-model precise" instead of "source" in case there is some precision issue at play.

How many total cores are you running on? Are you using the Intel 17 tools?

I wasn't aware of any "magic numbers" in cice, no. I assume it won't lead to different solutions scientifically though, correct? I've dealt with the "ridging error" frequently. One thing to try is to increase your node count by one (so if you're running a job that would only require 8 nodes, try assigning it 9 instead). That's gotten around the issue occasionally. The most recent case in which I encountered the ice ridging error seems to actually be related to settings in the ocean model. I had specified an input spun-up ocean file to the POP model, but hadn't changed init_ts_file_fmt from 'bin' to 'nc' (the input file was netCDF, not binary, in my case).
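Concretely, that one-line fix in the POP namelist looked like this (a sketch; apply it wherever your workflow sets pop2_in, e.g., via the build-namelist scripts):

```fortran
! In pop2_in: the spun-up initial-condition file is netCDF, not binary.
init_ts_file_fmt = 'nc'   ! was 'bin'
```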

Updating to -fp-model precise should also help. That solved a few error messages we had been encountering previously. Right now, I'm running on 256 cores - 32 tasks per node. I've assigned 9 nodes (for some reason our model wouldn't work on 8 nodes, though the problem doesn't seem to be reproducible). We're using Intel 17 still; I think that's the main one on Stampede2.

Please see models/ice/cice/bld/cice_decomp.xml (and I have no idea about the science...).
I'll check OCN init_ts_file_fmt.
I already tried "precise" - no apparent effect. I suppose "strict" would be the last hope.
So you are getting your case to work on Stampede2???

Thanks for sharing your compiler settings! Below are ours as well; a lot of trial and error has gone into this, so it's not a guarantee that everything in there is necessary or will help. But so far, this seems to be a functioning configuration:

You don't have "-D_USE_FLOW_CONTROL" in POP2 CPPDEFS; I did—I've removed it now.

Your CPPDEFS have CRMACCEL and STRATOKILLER, which I don't. No idea what those do, but I'll add them.

Other than those I don't see anything significant except your SLIBS: are you loading hdf5 and netcdf, or phdf5 and parallel-netcdf? I've been using PnetCDF, and also loading parallel-netcdf (with phdf5). When you invoke "nf-config --flibs" you get "-L$(TACC_HDF5_DIR)/lib" but not "-lhdf5", so that's a possible factor, although I've tried it both ways, including adding or not adding it to LDFLAGS.

I have never set TRILINOS_PATH, not thinking it mattered because I don't switch on Trilinos—do you?

More experiments, oh joy!

P.S.: mpicc is not the same as mpiicc.

Notes added later: "-DINTEL -DCPRINTEL" get added to Macros anyway, so your removal of "-DIntel -DCPRINTEL" from CPPDEFS seems a no-op. Also I found an entry

<!--sungduk: it seems that STAMPEDE has CESM inputdata in the above directories, but permission problem arises. So I have to use manual downloading option. To do that i added the following two lines. -->

I wouldn't think that this would make a huge difference, but it's possible that some changes are important (i.e., PES_PER_NODE). Most of the modules (pnetcdf, hdf5, etc.) I use are loaded in the env_mach_specific.stampede-knl file:

# Replacing the above two lines with the following, based on TACC support advice

module reset

module load hdf5 netcdf pnetcdf intel

module load cmake

module load impi

I don't believe I switch on Trilinos, no, but it was set before, so I haven't changed it. Have you had any luck? Hopefully some of this will help. I'll add a warning though: my successful long simulation has just failed after running for almost 150 years, and I've yet to debug the exact cause, so that remains to be done.

(I'm still working out some refinements so I won't have to sync too many settings.)

I need to keep PES_PER_NODE=68 because I use that value to construct a pin map, although at this stage I'm not using it. Anyway MAX_TASKS_PER_NODE is what matters to mkbatch.

I ran various cases dozens of times since Friday, with my best result being B_1850-2000_CAM5 (0.9x1.25_gx1v6) using 256 cores (no multithreading), which did fine until dying in cam. By the way, the same build on 320 cores crashed with what was probably a NaN somewhere, because NetCDF couldn't represent it.

This is all very frustrating; I have the feeling I'm missing something stupid, because a model like this shouldn't be so fragile.

I expect to gain access to Skylake soon, so perhaps I'll have better luck on that, maybe even learn what's going wrong on KNL.

I don't see too many glaring differences there. One question - does your model throw an error regarding loading the perl module? Mine did previously, and when I checked with TACC I was told that it's no longer a module, but is available by default from the system. We might also be using different impi versions, though I think mine uses the default version most often.

One more thing to try may be setting I_MPI_PIN_DOMAIN in your mkbatch script. I've used this as an analog to "-c" on Cori-KNL, which users have said is key. If you want to run with 64 tasks per node, set it to 4; for 32 tasks per node, it's set to 8. If you send me an email (mdfowler@uci.edu), I can give you the directory to my Machine files on Stampede as well, perhaps doing a direct "diff" would illuminate some differences that can get the model up and running?
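The arithmetic behind those values can be sketched as follows (the 256-logical-CPU figure comes from the 64-core, 4-thread KNL node described above; the variable names are just for illustration):

```shell
# Derive I_MPI_PIN_DOMAIN as logical CPUs per MPI task on a KNL node:
# 64 cores x 4 hardware threads = 256 logical CPUs.
LOGICAL_CPUS=256
TASKS_PER_NODE=32                      # or 64
DOMAIN=$(( LOGICAL_CPUS / TASKS_PER_NODE ))
echo "I_MPI_PIN_DOMAIN=$DOMAIN"        # 8 for 32 tasks/node, 4 for 64
export I_MPI_PIN_DOMAIN=$DOMAIN
```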

Yes, I stopped loading Perl, using /bin/perl by default. Also I think you got bad advice: "module reset" doesn't purge modules; I do a purge, then load only the modules I need, just to be safe. (My configuration achieved self-awareness yesterday: it's now much more sophisticated, after I put in some hours....)

AFAIK there is only one IMPI on Stampede2.

Use I_MPI_DEBUG=4 (setenv I_MPI_DEBUG 4) to see where ranks are placed/bound. I have used I_MPI_PIN_DOMAIN, but at the moment am less concerned about task/thread placement than getting interesting cases to run. But I will check output again, just to confirm the MPI ranks are where they ought to be.
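For a bash-based job script, the equivalent of that setenv is simply (trivial sketch):

```shell
# Ask Intel MPI to print rank placement/binding at startup.
export I_MPI_DEBUG=4
echo "I_MPI_DEBUG is $I_MPI_DEBUG"
```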

Yesterday I found something, and fixed it. Now I can run B_1850_CN for at least a short simulated while. It still fails in longer tests, and I'm tracking down the reason.

Because I'm getting into sensitive (i.e., competitive-advantage) territory, I gladly accept your invitation to take our discussion offline.

Next time you run on Cori, try this.
While CESM is running, log onto one of the nodes it's using, get the PIDs of its MPI tasks, and monitor VmPeak in /proc/$pid/status. That gives the maximum memory used by the process since it started—memory highwater, in other words. The information is only available while the process exists, so you have to check while CESM is running.

I will do the same on Stampede2 with a MAX_TASKS_PER_NODE=32 run, if that works for me.

Hello, I've been experiencing a similar problem: the model randomly crashes at line 1879 in ice_transport_remap.F90. Our compiler flags contain "-check uninit", and the log indicated that a variable used as an array index was uninitialized. None of the conditions of the branches that initialize it were satisfied, because of a NaN.
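For reference, a debug-flag set along these lines can surface this class of bug (a sketch; apart from -check uninit, which we used, the other Intel Fortran flags here are illustrative additions one might try):

```makefile
# Illustrative Intel Fortran debug flags for catching uninitialized values
# and trapping the NaNs they produce:
FFLAGS += -g -traceback -check uninit -init=snan,arrays -fpe0
```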

I'd like to know if you have solved the problem. If so, could you share the solution or a workaround?

I wish I had an easy answer for you. There were a series of issues that cropped up along the way before we could solve the error, resulting in additional errors. But if you're running on Stampede2 (KNL), I can at least share the final configuration options with you:

In config_compilers.xml:

<!--Meg: Settings for KNL -->

<compiler MACH="stampede-knl">

<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>

<NETCDF_PATH>$(TACC_NETCDF_DIR)</NETCDF_PATH>

<!--PNETCDF_PATH>$(TACC_NETCDF_DIR)</PNETCDF_PATH-->

<!-- <ADD_CPPDEFS> -DHAVE_NANOTIME -DCLOUDKILLER </ADD_CPPDEFS> THIS IS FOR CLOUDKILLER-->

<!-- <ADD_CPPDEFS> -DHAVE_NANOTIME -DASYMTSI </ADD_CPPDEFS> THIS IS FOR ASYMTSI-->

<ADD_CPPDEFS> -DHAVE_NANOTIME </ADD_CPPDEFS>

</compiler>

Ultimately, it also seemed that we were running out of space on the nodes we requested. I believe we wound up running with an extra node; for example, if we ran with 64 tasks per node and wanted 2 nodes, I would increase the total nodes requested from 2 to 3, just to be on the safe side. There's no real logical reason this should work, and I'm not at all sure it's still necessary (it shouldn't be), but you could always try it. Let me know if it would help to share any other bits of code or files regarding workflow.
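The node bookkeeping for that workaround can be sketched as follows (the task counts match the 2-to-3 example, purely for illustration):

```shell
# Nodes strictly required for the task count, plus one spare as a workaround.
TOTAL_TASKS=128
TASKS_PER_NODE=64
NODES=$(( (TOTAL_TASKS + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))   # ceiling
echo "strict: $NODES, with spare: $(( NODES + 1 ))"
```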

Thanks for answering. Actually, I'm not running on Stampede2. I compared our configuration with yours and it seems that only "-DHAVE_NANOTIME" is different. I have inspected our memory usage and I'm pretty sure we aren't running out of memory. Our model often crashes on the 15th day of a month. I'm wondering if your situation is similar. Also, could you please share your compiler and MPI configuration, or any modifications to the code?