The following list is not exhaustive. There is always some student work to be done in various research projects. You can send an email (e.g. to Josef Weidendorfer or Dai Yang), asking for currently available topics.

In a current project, the Chair for Computer Architecture analyzes modern HPC systems with heterogeneous architectures towards exascale computing. Real-world applications that represent a class of typical HPC problems are an important element of this work. One example is the maximum likelihood expectation maximization (MLEM) algorithm [KWS+09], which is used for image reconstruction in positron emission tomography (PET).

PET visualizes functional processes by measuring the distribution of a radioisotope tracer injected into a subject’s body. Clinical PET scanners, for example, assist in tumor diagnosis. PET research currently focuses on improving the spatial resolution and sensitivity of the technique. Our research is done on small-animal PET scanners for preclinical studies, in cooperation with the Medical Institute Rechts der Isar (MRI).

The MLEM algorithm is based on sparse matrix-vector multiplication (SpMV). The efficient use of heterogeneous systems with accelerator cards such as Intel Xeon Phi is still an open challenge. We have already developed an efficient implementation of MLEM on multicore architectures. In this work, we seek an efficient implementation of the MLEM algorithm on Xeon Phi (Knights Landing) using high-bandwidth memory (HBM). Verification is to be done by benchmarking against the Intel Math Kernel Library (MKL). A cluster system consisting of Xeon Phis is available at LRZ (CooLMUC3).
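To illustrate why MLEM reduces to SpMV, here is a minimal sketch in plain Python. It is not the chair's implementation. It assumes the system matrix is stored in CSR format and uses the textbook multiplicative MLEM update. All function and variable names (spmv, spmv_transposed, mlem, sensitivity) are chosen for illustration only.

```python
def spmv(row_ptr, col_idx, values, x):
    """Forward projection y = A @ x for a matrix A in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def spmv_transposed(row_ptr, col_idx, values, y, n_cols):
    """Back projection x = A^T @ y, traversing the same CSR arrays."""
    x = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            x[col_idx[k]] += values[k] * y[i]
    return x


def mlem(row_ptr, col_idx, values, n_cols, measured, iterations):
    """Textbook MLEM: image *= A^T (measured / (A image)) / (A^T 1)."""
    n_rows = len(row_ptr) - 1
    # Column sums of A (assumed name: sensitivity), computed once.
    sensitivity = spmv_transposed(row_ptr, col_idx, values, [1.0] * n_rows, n_cols)
    image = [1.0] * n_cols  # uniform, strictly positive start image
    for _ in range(iterations):
        forward = spmv(row_ptr, col_idx, values, image)
        ratio = [m / f if f > 0.0 else 0.0 for m, f in zip(measured, forward)]
        back = spmv_transposed(row_ptr, col_idx, values, ratio, n_cols)
        image = [v * b / s if s > 0.0 else 0.0
                 for v, b, s in zip(image, back, sensitivity)]
    return image


# Tiny illustrative system: A = [[1, 1], [0, 1]] in CSR form.
row_ptr, col_idx, values = [0, 2, 3], [0, 1, 1], [1.0, 1.0, 1.0]
image = mlem(row_ptr, col_idx, values, n_cols=2, measured=[4.0, 2.0], iterations=10)
```

Each iteration performs one SpMV and one transposed SpMV, so memory bandwidth rather than arithmetic tends to dominate the run time. That is one reason high-bandwidth memory is of interest for this kernel.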

Comparison and Integration of Fault Resilience Mechanisms for Distributed Applications

In a current project at our chair, we are analyzing modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. Major challenges in exascale computing include an increasing number of nodes, dynamic resource allocation and organization, and fault resilience. So far, we have developed an extensible yet lightweight library (LAIK) to dynamically manage the application workload for better load balancing and proactive fault tolerance. This way, an upcoming failure can be avoided by proactively migrating application data to another physical location. Furthermore, by using our library, a global rebalancing can be triggered to keep the application load balanced.
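The idea of proactive migration can be sketched as a repartitioning step: once a node is predicted to fail, the data is redistributed over the remaining nodes, and only the items whose owner changes need to move. The following Python sketch does not use the LAIK API. The functions block_partition, owner_of and migration_plan, as well as the node names, are hypothetical and assume a simple contiguous block distribution.

```python
def block_partition(n_items, nodes):
    """Split indices 0..n_items-1 into contiguous blocks, one per node."""
    base, extra = divmod(n_items, len(nodes))
    parts, start = {}, 0
    for rank, node in enumerate(nodes):
        size = base + (1 if rank < extra else 0)  # first `extra` nodes get one more
        parts[node] = range(start, start + size)
        start += size
    return parts


def owner_of(parts):
    """Invert a partitioning into an item -> node mapping."""
    return {i: node for node, block in parts.items() for i in block}


def migration_plan(n_items, nodes, failing):
    """List (item, old_node, new_node) for every item that has to move
    when the nodes in `failing` are excluded before they actually fail."""
    healthy = [n for n in nodes if n not in failing]
    if not healthy:
        raise ValueError("no healthy node left to migrate to")
    old = owner_of(block_partition(n_items, nodes))
    new = owner_of(block_partition(n_items, healthy))
    return [(i, old[i], new[i]) for i in range(n_items) if old[i] != new[i]]


# Hypothetical example: node "n1" is predicted to fail.
plan = migration_plan(6, ["n0", "n1", "n2"], failing={"n1"})
```

Because the new partitioning is computed over all healthy nodes, the same step also rebalances the load globally instead of placing the whole block of the failing node on a single neighbor.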

Our project partners at RWTH Aachen have developed an application-transparent framework in which running applications can be migrated to another physical location to overcome a failure by using virtualization or container technology.

In this master’s thesis, the two approaches are to be compared as strategies for application fault tolerance with respect to performance and complexity. In addition, a way for the two mechanisms to collaborate (e.g. in decision making) is to be designed and developed.

In a current project at our chair, we are analyzing modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. Major challenges in exascale computing include an increasing number of nodes, dynamic resource allocation and organization, and fault resilience. So far, we have developed an extensible yet lightweight library (LAIK) to dynamically manage the application workload for better load balancing and proactive fault tolerance. This way, an upcoming failure can be avoided by proactively migrating application data to another physical location. Furthermore, by using our library, a global rebalancing can be triggered to keep the application load balanced.

To assess and improve the performance of our library, runtime results from suitable high performance benchmarks are required. One of the most common benchmark suites is the NAS Parallel Benchmarks (NPB). It mimics the data flow and computations of different kinds of typical HPC applications. These benchmarks are written in C and/or Fortran. In this master’s thesis, a selected subset of the NPB is to be ported to our LAIK library. Performance tests and analyses are to be conducted on the ported NPB benchmarks.

Analysis and Implementation of In-Memory and Near-Node Checkpointing/Restart Mechanism for HPC Applications

Background

In a current project at our chair, we are analyzing modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. Major challenges in exascale computing include an increasing number of nodes, dynamic resource allocation and organization, and fault resilience. So far, we have developed an extensible yet lightweight library (LAIK) to dynamically manage the application workload for better load balancing and proactive fault tolerance. A central element for achieving full functionality in our library is recovery-based reactive fault tolerance.

In this master’s thesis, a strategy for reactive fault resilience based on in-memory and near-node checkpointing mechanisms is to be developed and integrated into our LAIK library. By the end of this project, our library shall be capable of dynamically recovering from an arbitrary number of node failures. Existing checkpoint/restart approaches for application fault tolerance in HPC shall be analyzed and eventually adapted to be used within LAIK. By demonstrating their functionality on an example MPI-based application, the efficiency and performance of these algorithms shall be assessed and validated. Analysis shall be done on state-of-the-art hardware such as the Linux clusters and the SuperMUC at LRZ (Leibniz Supercomputing Center).
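One common in-memory scheme is buddy checkpointing, in which every process keeps a copy of its state in the memory of a neighboring process. The sketch below simulates this idea in a single Python process. It does not use MPI or the LAIK API. The class BuddyCheckpointStore and all of its methods are hypothetical, and the ring-neighbor buddy choice is an assumption made for illustration.

```python
import copy


class BuddyCheckpointStore:
    """Simulated in-memory checkpointing: rank r stores its checkpoint
    in the memory of its buddy, here assumed to be rank (r + 1) % n_ranks."""

    def __init__(self, n_ranks):
        if n_ranks < 2:
            raise ValueError("buddy checkpointing needs at least two ranks")
        self.n_ranks = n_ranks
        # memory[r] models the RAM of rank r: checkpoints it holds for others.
        self.memory = [dict() for _ in range(n_ranks)]

    def buddy(self, rank):
        return (rank + 1) % self.n_ranks

    def checkpoint(self, rank, state):
        """Copy the state of `rank` into its buddy's memory."""
        self.memory[self.buddy(rank)][rank] = copy.deepcopy(state)

    def fail(self, rank):
        """Model a node failure: everything held in that rank's RAM is lost."""
        self.memory[rank] = None

    def restore(self, rank):
        """Fetch the last checkpoint of `rank` from its buddy, if it survived."""
        holder = self.memory[self.buddy(rank)]
        if holder is None or rank not in holder:
            raise RuntimeError(f"checkpoint of rank {rank} is lost")
        return copy.deepcopy(holder[rank])


# Hypothetical usage: rank 0 fails after a checkpoint and is recovered.
store = BuddyCheckpointStore(n_ranks=4)
store.checkpoint(0, {"iteration": 42, "data": [1.0, 2.0]})
store.fail(0)
recovered = store.restore(0)
```

This simple scheme loses data as soon as a rank and its buddy fail together. Tolerating an arbitrary number of node failures, as targeted in this thesis, therefore needs more than a single in-memory copy, for example additional replicas or a near-node storage level.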

Master’s Practical Course (Masterpraktikum) for Games Engineering (IN2016, IN2257)

We offer a master’s practical course for Games Engineering students.

If you are interested, please contact Dai Yang. A group of 4 students is required. The topic is arranged together with the student group, so anything related to our research might be a suitable topic for you. Students are encouraged to propose their own topic WITHIN our research interests for further discussion.

Please make sure that you contact us as a group of 4 students. If you want to do your master’s practical course in semester N, you have to contact us in semester N-1. A later arrangement is not possible for administrative reasons.

Design and Implementation of a Benchmark for Predicting System Health in High Performance Computing

In a current project at our chair, we analyze modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. A central element of this project is to find a reliable prediction method that can determine the current system health state in a given HPC environment. Our research group is currently evaluating different methods. One of them is to run an efficient and fast benchmark in order to detect abnormalities in system performance. Using such a benchmark and the corresponding historical data, one can predict upcoming system faults, which may lead to a failure.
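The benchmark-based approach can be sketched as follows: time a short, fixed workload and compare the result with historical timings of the same workload. The Python sketch below uses a simple z-score test as the comparison. This is an assumed, deliberately basic criterion and not the prediction method evaluated in the project. The names run_benchmark and is_abnormal, the workload and the threshold of 3 standard deviations are all illustrative.

```python
import statistics
import time


def run_benchmark(n=200_000):
    """Time a small fixed compute loop (a placeholder workload)
    and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    acc = 0.0
    for i in range(n):
        acc += i * 0.5
    return time.perf_counter() - start


def is_abnormal(history, sample, threshold=3.0):
    """Flag `sample` if it deviates from the historical mean by more
    than `threshold` standard deviations (needs >= 2 history values)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0.0:
        return sample != mean
    return abs(sample - mean) / stdev > threshold


# Hypothetical usage: build a short history, then check a new measurement.
history = [run_benchmark() for _ in range(5)]
suspicious = is_abnormal(history, run_benchmark())
```

In practice, the history would be collected per node over a longer period, and a flagged measurement would only be treated as a hint of an upcoming fault, since timing noise from other jobs can also cause outliers.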