The fastest supercomputers today, such as Blue Gene/L, Blue Gene/P,
Cray XT3 and XT4, are connected by a three-dimensional torus/mesh
interconnect. Applications running on these machines can benefit
from topology-awareness while mapping tasks to processors at
runtime. By co-locating communicating tasks on nearby processors,
the distance traveled by messages and hence the communication
traffic can be minimized, thereby reducing communication latency
and contention on the network. This paper describes preliminary
work utilizing this technique and performance improvements
resulting from it in the context of an n-dimensional k-point stencil
program. It shows that even for simple benchmarks, topology-aware
mapping can have a significant impact on performance. Automated
topology-aware mapping by the runtime, using similar ideas, can
relieve the application writer of the burden of mapping tasks
manually and result in better performance. Preliminary work towards
achieving this for a
molecular dynamics application, NAMD, is also presented. Results on
up to 32,768 processors of IBM's Blue Gene/L, 4,096 processors of
IBM's Blue Gene/P and 2,048 processors of Cray's XT3 support the
ideas discussed in the paper.
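
As a rough illustration of why co-locating communicating tasks matters,
the sketch below compares the average number of network hops per message
for two task-to-processor mappings of a stencil computation on a torus.
It is not the benchmark or the mapping algorithm used in the paper. The
4x4x4 torus, the periodic 7-point stencil, the random "oblivious"
mapping, and all function names are assumptions made for this example.

```python
import random
from itertools import product


def torus_distance(a, b, dims):
    """Hop count between coordinates a and b on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def stencil_neighbors(task, dims):
    """Yield the 2n neighbors of a task in a periodic n-dimensional stencil."""
    for axis, d in enumerate(dims):
        for step in (-1, 1):
            nb = list(task)
            nb[axis] = (nb[axis] + step) % d
            yield tuple(nb)


def average_hops(mapping, dims):
    """Mean hop count over all stencil messages for a task -> processor mapping."""
    total = count = 0
    for task, proc in mapping.items():
        for nb in stencil_neighbors(task, dims):
            total += torus_distance(proc, mapping[nb], dims)
            count += 1
    return total / count


# Assumed setup: one task per processor on a 4x4x4 torus.
dims = (4, 4, 4)
tasks = list(product(*(range(d) for d in dims)))

# Topology-aware mapping: task (i, j, k) runs on processor (i, j, k).
aware = {t: t for t in tasks}

# Topology-oblivious mapping: tasks placed on processors in random order.
rng = random.Random(0)
shuffled = tasks[:]
rng.shuffle(shuffled)
oblivious = dict(zip(tasks, shuffled))

print("aware:    ", average_hops(aware, dims))
print("oblivious:", average_hops(oblivious, dims))
```

Under these assumptions every message in the topology-aware mapping
travels exactly one hop, while the random mapping sends messages across
several links on average, so each message occupies more of the network.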