* Replaced the STREAM C code with equivalent assembly code.
* Used the vendor's MPI optimization of the MPI_Cart_* functions to place communicating neighbors on the same physical node.
* Used the MPIRandomAccess optimization from Sandia, which aggregates many small messages into fewer large messages that are then exchanged via alltoall operations. This work is documented at: http://www.cs.sandia.gov/~sjplimp/algorithms.html#gups
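
The message-combining idea in the last item can be sketched in plain C. This is a minimal illustration, not the Sandia code: `NUM_RANKS`, `owner_rank`, and `pack_updates` are hypothetical names, and the modulo ownership rule is an assumption made only for this example. The sketch groups pending updates by destination rank into one contiguous buffer and produces per-rank counts and displacements, the layout a call such as `MPI_Alltoallv` expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_RANKS 4 /* hypothetical process count, for illustration only */

/* Hypothetical ownership rule: the update value modulo the rank count.
 * The real benchmark derives the owner from the table index instead. */
static int owner_rank(uint64_t update)
{
    return (int)(update % NUM_RANKS);
}

/*
 * Group n pending updates by destination rank (a counting sort).
 * On return, counts[r] holds the number of updates bound for rank r and
 * displs[r] the offset of that rank's block inside sendbuf, so a single
 * all-to-all exchange can replace n separate small messages.
 */
static void pack_updates(const uint64_t *updates, size_t n,
                         uint64_t *sendbuf,
                         int counts[NUM_RANKS], int displs[NUM_RANKS])
{
    int cursor[NUM_RANKS];

    assert(updates != NULL && sendbuf != NULL);

    /* Pass 1: count how many updates go to each rank. */
    memset(counts, 0, NUM_RANKS * sizeof(int));
    for (size_t i = 0; i < n; i++)
        counts[owner_rank(updates[i])]++;

    /* Prefix sum of the counts gives the start of each rank's block. */
    displs[0] = 0;
    for (int r = 1; r < NUM_RANKS; r++)
        displs[r] = displs[r - 1] + counts[r - 1];

    /* Pass 2: copy each update into its destination block. */
    memcpy(cursor, displs, sizeof cursor);
    for (size_t i = 0; i < n; i++) {
        int r = owner_rank(updates[i]);
        sendbuf[cursor[r]++] = updates[i];
    }
}
```

In an MPI program, the ranks would typically first exchange `counts` (for example with `MPI_Alltoall`) so each receiver can size its buffer, then pass `sendbuf`, `counts`, and `displs` to `MPI_Alltoallv` to move all of the batched updates in one collective.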