Clear up craycc issues:
1. Split the configure tests for sync_synchronize and fetch_add
2. Split the implementation selection for CmiMemory*Fence from that for
CmiMemoryAtomic*, so each uses whichever underlying intrinsic is available
according to the newly split configure tests
3. Add a test and support for __builtin_ia32_lfence() so that craycc builds can use native intrinsics for CmiMemory*Fence
4. Fix a typo in configure.in

With these changes in place, AMPI builds and passes make test when using craycc.
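
A minimal sketch of selecting the fence and the atomic independently. The
function names are hypothetical, and compiler-predefined macros stand in for
the results of the configure tests, whose actual macro names are not given
here:

```cpp
// Fence selection: prefer the x86 lfence intrinsic when the compiler
// advertises SSE2, otherwise fall back to a full barrier.
// (Assumption: a real build would key this off the configure test results.)
inline void read_fence_sketch() {
#if defined(__SSE2__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_lfence();
#else
    __sync_synchronize();
#endif
}

// Atomic selection is decided separately from the fence above, so a compiler
// that supports only one of the two intrinsics can still use that one.
inline int fetch_add_sketch(int* target, int delta) {
    return __sync_fetch_and_add(target, delta);  // returns the previous value
}
```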

CrayXC: Add targets gni-crayxc and mpi-crayxc for the new Cray system based on
the Aries interconnect. All the code is borrowed from CrayXE. In the process,
also rename the GEMINI tag in the CRAYXE build to GNI.

Use the new dynamic CPU set allocation API to support very large SMP machines

In particular, allow a multicore binary built on Stampede to run on
Blacklight. (It's not needed if you build on Blacklight, because the
default CPU set size in the headers has been increased on that
machine.) This should be safe, as it doesn't use the new interface
unless the default size is too small.
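
A minimal sketch of that fallback pattern on Linux with glibc, not the actual
change: try the fixed-size cpu_set_t first, and only when the kernel rejects
it with EINVAL retry with a dynamically allocated set. The function name and
the upper bound on the set size are assumptions.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes the CPU_* macros in <sched.h>
#endif
#include <sched.h>
#include <cerrno>
#include <cstddef>

// Returns the number of CPUs in this process's affinity mask, or -1 on error.
inline int count_affinity_cpus() {
    cpu_set_t fixed;
    if (sched_getaffinity(0, sizeof(fixed), &fixed) == 0)
        return CPU_COUNT(&fixed);      // default size was large enough
    if (errno != EINVAL) return -1;    // failed for some other reason

    // Default set too small: grow a dynamically allocated set until it fits.
    for (int ncpus = 2 * CPU_SETSIZE; ncpus <= (1 << 20); ncpus *= 2) {
        cpu_set_t* set = CPU_ALLOC(ncpus);
        if (!set) return -1;
        std::size_t size = CPU_ALLOC_SIZE(ncpus);
        CPU_ZERO_S(size, set);
        int rc = sched_getaffinity(0, size, set);
        int err = errno;
        int count = (rc == 0) ? CPU_COUNT_S(size, set) : -1;
        CPU_FREE(set);
        if (rc == 0) return count;
        if (err != EINVAL) return -1;
    }
    return -1;
}
```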

Global and local element counters in CkReductionMgr need to be reset only when
the redn mgr is serving a chare array (and not when it's serving a group). Implementing
this via an input flag to flushStates() caused flushStates() to become overloaded in
the child class. This overload hid the base class virtual method, causing some compilers
(icc?) to emit warnings. This somewhat cleaner solution should avoid the
warnings, but it still smells like a hack.
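
To illustrate the hiding problem, here is a sketch with hypothetical class and
member names. It shows one way to keep the base signature intact and is not
necessarily what this commit does:

```cpp
struct MgrBaseSketch {
    bool flushed = false;
    virtual ~MgrBaseSketch() {}
    virtual void flushStates() { flushed = true; }
};

struct MgrChildSketch : MgrBaseSketch {
    bool servesArray = true;  // hypothetical stand-in for "array vs. group"
    int localElements = 5;

    // Declaring flushStates(bool) here instead would hide the base class's
    // flushStates(), which is what triggers the "hidden virtual" warning.
    // Overriding the same signature and consulting a member avoids that.
    void flushStates() override {
        MgrBaseSketch::flushStates();
        if (servesArray) localElements = 0;  // reset counters only for arrays
    }
};
```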

Change the counts of sent and received messages and bytes from int to
unsigned int, to avoid the undefined behavior that occurs when they
eventually overflow. Incidentally, this doubles their useful range as
well.
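
A small sketch of why the unsigned type is safe here. The struct and its
member names are hypothetical; the point is only that unsigned addition wraps
modulo 2^N instead of invoking undefined behavior:

```cpp
struct MsgCountersSketch {
    unsigned int sentMsgs = 0;
    unsigned int sentBytes = 0;

    // Both updates are well defined even when a counter passes its maximum:
    // unsigned arithmetic wraps around to 0 and keeps counting.
    void recordSend(unsigned int bytes) {
        ++sentMsgs;
        sentBytes += bytes;
    }
};
```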

Shifting by the full width of a variable or more is undefined behavior in
C and C++. When trying to rotate by 0, a shift of the full width
resulted. Avoid that by returning the unmodified argument when no
rotation is to be applied. The test should be free, either because the
value was just computed and so the condition code is set, or because
the compiler emits a rotate opcode anyway.
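
A minimal sketch of the guarded rotate. The function name and the 32-bit
operand type are assumptions, not the code touched by this commit:

```cpp
#include <climits>
#include <cstdint>

inline std::uint32_t rotate_left_sketch(std::uint32_t value, unsigned count) {
    const unsigned width = sizeof(value) * CHAR_BIT;
    count %= width;
    // Without this check, count == 0 would evaluate value >> width below,
    // which is a full-width shift and therefore undefined.
    if (count == 0) return value;
    return (value << count) | (value >> (width - count));
}
```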

In C and C++, shifting left into or past the sign bit of a signed
integer results in undefined behavior. In cases where we're just
setting up a bit mask, use the bitwise inverse of an unsigned 0
instead, on which shifting is perfectly well defined.
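
For illustration, a sketch of building a low-bits mask from an unsigned 0.
The helper name is hypothetical, and the full-width case is handled
separately because that shift would itself be undefined:

```cpp
#include <climits>

// Returns a mask with the low `bits` bits set.
inline unsigned low_bits_mask_sketch(unsigned bits) {
    const unsigned width = sizeof(unsigned) * CHAR_BIT;
    if (bits >= width) return ~0u;  // all bits set; avoids a full-width shift
    // ~0u is unsigned, so shifting it left never touches a sign bit.
    return ~(~0u << bits);
}
```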

Overflowing a signed int is undefined behavior in C and C++, but the
hash code seems to assume 2's-complement wrap-around. That's exactly
the specified behavior of unsigned arithmetic, so use that explicitly.
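
A sketch of the idea with a generic multiplicative hash. This is not the
project's hash function; the name and the multiplier 31 are assumptions
chosen for illustration:

```cpp
#include <cstddef>

inline unsigned hash_bytes_sketch(const unsigned char* data, std::size_t len) {
    unsigned h = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // Unsigned multiply and add wrap modulo 2^N by definition, so the
        // wrap-around the hash relies on is well defined here.
        h = h * 31u + data[i];
    }
    return h;
}
```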

MPI has its own throttling built in, with its own API for this where
available. Disable this test on the MPI layer, because that layer doesn't
visibly provide the gni_pub.h header and consequently the BI API.