Abstract:

Java is a widely used programming language, although its use amongst the scientific and High Performance Computing (HPC) communities remains relatively low. This thesis investigates the use of Java for developing scientific applications intended for execution in HPC environments.
The data reduction pipeline for the Gaia space astronomy mission is an example of a large software project that has been written in Java and will run in HPC environments. The efficient execution of this pipeline was one of the main motivations behind this thesis, although the thesis remains largely a general investigation into the use of Java in HPC.
HPC is a fast-changing field in terms of hardware, software, and the scale of the problems that are being tackled. Amongst the most significant trends in HPC in recent years have been the increase in the number of cores per computing node and the increase in the size of the datasets that must be processed.
A significant challenge in HPC is ensuring that data is available on a given node when a core is ready to process it, thereby avoiding dead time and sustaining high throughput. One threat to throughput is the degradation in the performance of shared storage devices as the number of concurrent processes accessing those devices increases.
Given the trends mentioned above, efficient data communication is very important for many applications running in HPC environments. In this thesis, we survey the current options for providing efficient data communication to Java applications in HPC environments. We evaluate a number of implementations of Message Passing in Java (MPJ) and compare their performance.
We present a new communication middleware application, called MPJ-Cache. This middleware makes use of an underlying MPJ implementation and adds prefetching, caching, and file-splitting functionality. It presents application developers with a high-level API, thus providing high performance while enabling high productivity amongst application developers. We compare the aggregate data rate that can be achieved through the use of this middleware against that which can be achieved through direct access to a high-performance shared storage device (GPFS) while distributing data amongst the nodes of a computer cluster. The use of MPJ-Cache has been shown to provide an aggregate data rate of up to 103 Gbps.
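To make the idea of prefetching and caching concrete, the following is a minimal sketch in Java. It is not the MPJ-Cache API: the class name `PrefetchingChunkCache`, the `loader` function (which stands in for whatever fetches one chunk of a split file, for example over an MPJ implementation), and the `prefetchDepth` parameter are all hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

/** Hypothetical sketch of a prefetching chunk cache; not the MPJ-Cache API. */
class PrefetchingChunkCache implements AutoCloseable {
    // Stand-in for a remote fetch of one chunk of a split file (assumption).
    private final IntFunction<byte[]> loader;
    private final int chunkCount;
    private final int prefetchDepth;
    // Each chunk index maps to a future, so a chunk is loaded at most once.
    private final Map<Integer, CompletableFuture<byte[]>> cache = new ConcurrentHashMap<>();
    // Daemon threads so that a forgotten close() does not keep the JVM alive.
    private final ExecutorService pool = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r, "chunk-prefetch");
        t.setDaemon(true);
        return t;
    });

    PrefetchingChunkCache(IntFunction<byte[]> loader, int chunkCount, int prefetchDepth) {
        this.loader = loader;
        this.chunkCount = chunkCount;
        this.prefetchDepth = prefetchDepth;
    }

    /** Returns the requested chunk, blocking until it is loaded, and schedules the following chunks. */
    byte[] get(int index) {
        if (index < 0 || index >= chunkCount) {
            throw new IndexOutOfBoundsException("chunk " + index);
        }
        CompletableFuture<byte[]> wanted = schedule(index);
        // Prefetch: start loading the next chunks in the background.
        for (int i = index + 1; i <= index + prefetchDepth && i < chunkCount; i++) {
            schedule(i);
        }
        return wanted.join();
    }

    /** True if a load for this chunk has been scheduled or completed. */
    boolean isCached(int index) {
        return cache.containsKey(index);
    }

    private CompletableFuture<byte[]> schedule(int index) {
        return cache.computeIfAbsent(index,
                i -> CompletableFuture.supplyAsync(() -> loader.apply(i), pool));
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

A caller that processes chunks sequentially would call `get(0)`, `get(1)`, and so on. With a `prefetchDepth` of 2, the loads for the next two chunks overlap with the processing of the current one, which is the kind of dead-time avoidance described above. A real middleware would also need eviction and error handling, which this sketch omits.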
Java applications are executed within a Java Virtual Machine (JVM), which is a managed runtime environment. The execution of applications within such a runtime environment is very different from the execution of native code that was compiled ahead of time. The Java runtime environment consists of several sophisticated components, including the core runtime system, a garbage collector, and a Just-In-Time (JIT) compiler. Modern JVMs strive to provide high performance out of the box; however, in some situations users may want to tune the JVM to better suit the behaviour and needs of a particular application. In order to do this, a profile of the target application should be obtained.
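As a small illustration of what a first, coarse profile might contain, the sketch below reads heap usage, garbage collector activity, and JIT compilation time through the standard `java.lang.management` API. The class name `JvmSnapshot` and the output format are assumptions made for this example, and the sketch is not the profiling approach used in the thesis.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that summarises basic JVM runtime statistics. */
class JvmSnapshot {
    static String describe() {
        StringBuilder sb = new StringBuilder();

        // Current heap occupancy as reported by the memory management bean.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heap used (bytes): ").append(heap.getUsed()).append('\n');

        // One bean per collector; counts and times are cumulative since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append("gc ").append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(", time(ms)=").append(gc.getCollectionTime())
              .append('\n');
        }

        // The compilation bean is null when the JVM has no JIT compiler.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit != null && jit.isCompilationTimeMonitoringSupported()) {
            sb.append("jit ").append(jit.getName())
              .append(": total compile time(ms)=").append(jit.getTotalCompilationTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Calling `System.out.print(JvmSnapshot.describe())` at the start and end of a run gives a rough picture of how much time went to garbage collection and JIT compilation. Such figures can suggest whether settings such as the maximum heap size (`-Xmx`) are worth adjusting for a given application.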