ECMWF Slide 3: What is Parallel Computing?
- The simultaneous use of more than one processor or computer to solve a problem

ECMWF Slide 4: Why do we need Parallel Computing?
- Serial computing is too slow
- Need for large amounts of memory not accessible by a single processor

ECMWF Slide 5: An IFS operational TL511L60 forecast model takes about one hour of wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 1.3 GHz system (1920 CPUs in total). How long would this model take on a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?

ECMWF Slide 6: Answer: about 8 days. This PC would also need about 25 Gbytes of memory. 8 days is too long for a 10 day forecast! Even 2-3 hours is too long …

ECMWF Slide 22: What performance do Meteorological Applications achieve?
Vector computers
- About 30 to 50 percent of peak performance
- Relatively more expensive
- Also have front-end scalar nodes
Scalar computers
- About 5 to 10 percent of peak performance
- Relatively less expensive
Both vector and scalar computers are being used in Met Centres around the world.
Is it harder to parallelize than to vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility

ECMWF Slide 26: Parallel Programming Languages?
High Performance Fortran (HPF)
- directive-based extension to Fortran
- works on both shared and distributed memory systems
- not widely used (more popular in Japan?)
- not suited to applications using irregular grids
- http://www.crpc.rice.edu/HPFF/home.html
OpenMP
- directive-based; support for Fortran 90/95 and C/C++
- shared memory programming only
- http://www.openmp.org

ECMWF Slide 28: The myth of automatic parallelization (two common versions)
Compilers can do anything (but we may have to wait a while)
- Automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source.
Compilers can't do anything (now or ever)
- Automatic parallelization is useless. It'll never work on real code. If you want to port an application to a parallel machine, you have to restructure it extensively. This is a fundamental limitation and will never be overcome.

ECMWF Slide 33: Cache is …
- Small and fast memory
- Cache line typically 128 bytes
- Cache line has state (copy, exclusive, owner)
- Coherency protocol
- Mapping, sets, ways
- Replacement strategy
- Write-through or not
Important for performance:
- Unit-stride access is always the best!
- Try to avoid writes to the same cache line from different CPUs
But don't lose sleep over this.