Projects

CACHETM - Memory Hierarchy and Cache Coherence for Many-core CMP

Introduction

Future many-core CMPs have the potential to offer unprecedented
computing power and energy efficiency, but only as long as processor
cores are able to access the data that they need to process. This is
not straightforward due to the well-known memory-wall problem
(i.e., the growing speed gap between the cores and memory) and the
complexity of communication among so many cores. To overcome these
hurdles, on-chip caches and powerful interconnection networks are
crucial.

Unfortunately, the presence of private or partially shared caches
makes sharing data between cores difficult. In order to preserve the
shared-memory programming model, which is familiar to most
programmers, a cache coherence protocol needs to be implemented.

In addition to traditional scientific and commercial parallel
workloads, currently one of the most practical uses of many-core CMPs
is server consolidation (usually by means of virtualization). At the
same time, new or revived programming models like transactional memory
and message-passing will benefit from hardware support at the level of
the cache coherence protocol.

The aims of this research line are to provide new solutions to the
cache coherence problem, to study novel organizations and management
policies for the on-chip caches and to propose techniques to enable
the efficient execution of established and emerging parallel
workloads.

Main Results Achieved

We have designed Direct Coherence, a new family of cache coherence
protocols that avoids the indirection of directory-based protocols
without relying on broadcast communication. It does so by shifting
the responsibility for ordering requests and keeping coherence
information from the home node, as used by directory-based protocols,
to the owner node. Direct Coherence for cc-NUMAs was presented at the
HiPC 2007 conference [Ros2007] and was later extended for CMPs and
presented at the IPDPS 2008 conference [Ros2008].
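The benefit of ordering requests at the owner can be pictured in terms of hops on the critical path of a cache miss. The following sketch is purely illustrative (hop labels and function names are our own, not the published protocol): it contrasts the three-hop indirection of a directory protocol with the two-hop path obtained when the requester sends the request straight to the owner.

```python
# Hypothetical sketch, not the published protocol: hop counts on the
# critical path of a miss that must fetch data from the owner core.

def directory_hops():
    # requester -> home node (directory lookup) -> owner -> requester
    return ["requester->home", "home->owner", "owner->requester"]

def direct_coherence_hops():
    # requester predicts the owner and sends the request straight to it;
    # the owner both orders the request and replies with the data
    return ["requester->owner", "owner->requester"]

print(len(directory_hops()))         # 3 hops on the critical path
print(len(direct_coherence_hops()))  # 2 hops on the critical path
```

In practice the requester must predict the owner, so a misprediction falls back to a longer path; the sketch shows only the common case.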

We also proposed novel mechanisms for conflict detection and resolution in Hardware Transactional Memory (HTM) systems. The first of these works proposes a directory-based conflict detection scheme which alleviates the performance degradation that eager systems experience when contention is high, and which has the potential to minimize the effect of false positives when hash signatures are used for transactional book-keeping. This work was published in the HiPC 2008 conference [Titos2008].

Later, we proposed a scheme for speculative conflict resolution that allows a writer transaction to continue its execution past a conflicting access with other concurrent readers. This proposal was published in the IPDPS 2009 conference [Titos2009].

Furthermore, we have designed a hybrid HTM system capable of selecting the most appropriate policy, eager or lazy, for managing each individual cache line, depending on the line's past behaviour with regard to contention. This data-centric design combines the best of both worlds, as it achieves truly parallel commits when contention is low, while still extracting good concurrency in situations of high contention. This proposal was published in ICS 2011 [Titos2011a].

Finally, we have thoroughly analysed the implications that common structural optimizations such as store buffering have on the performance of both eager and lazy systems, which had been ignored in the literature. Our findings confirm that when write buffering is employed, the behaviour of eager systems is lazified and both HTM designs exhibit comparable performance, debunking the generalized perception that lazy systems consistently outperform their eager counterparts. This study was published in ICPP 2011 [Titos2011b].
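The data-centric idea of choosing eager or lazy management per cache line can be sketched as a small policy table. This is an illustrative simplification, not the published design; the threshold and the counting scheme are assumptions made for the example.

```python
# Illustrative sketch of per-line eager/lazy policy selection based on
# observed contention. The threshold is a hypothetical tuning knob.

CONTENTION_THRESHOLD = 2

class LinePolicy:
    def __init__(self):
        self.conflicts = {}  # line address -> observed conflict count

    def record_conflict(self, addr):
        self.conflicts[addr] = self.conflicts.get(addr, 0) + 1

    def policy(self, addr):
        # Contended lines are managed eagerly (conflicts detected early);
        # uncontended lines are managed lazily, enabling parallel commits.
        if self.conflicts.get(addr, 0) >= CONTENTION_THRESHOLD:
            return "eager"
        return "lazy"

p = LinePolicy()
print(p.policy(0x40))   # lazy: no conflicts observed yet on this line
p.record_conflict(0x40)
p.record_conflict(0x40)
print(p.policy(0x40))   # eager: the line has proven contended
```

A real implementation would track this state in hardware alongside the line and decay the counters over time; the sketch only conveys the selection logic.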

In the context of consolidated servers, we proposed an
operating-system-based distance-aware round-robin mapping policy that
tries to map memory pages to the cache bank belonging to the core
that most frequently accesses the blocks within each page. This work
was presented at HiPC 2009 [Ros2009]. More recently, we analysed
previously proposed cache coherence protocols for server
consolidation using virtualization, found problems in their handling
of shared memory (i.e., deduplicated pages), and presented a solution
at the SBAC-PAD 2010 conference [García-Guirado2010].
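The intuition behind a distance-aware round-robin mapping can be sketched as follows. This is a loose illustration under assumed parameters (a 4x4 mesh with one bank per core, radius-1 neighbourhood), not the policy evaluated in the paper: pages are placed in banks near the core that touches them, rotating among nearby banks to balance load.

```python
# Hedged sketch of distance-aware round-robin page-to-bank mapping on a
# mesh CMP. Mesh size, radius and the first-touch trigger are assumptions.

MESH_DIM = 4  # 4x4 mesh, one cache bank co-located with each core

def nearby_banks(core, radius=1):
    # banks whose tile lies within `radius` of the core's tile
    x, y = core % MESH_DIM, core // MESH_DIM
    banks = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < MESH_DIM and 0 <= ny < MESH_DIM:
                banks.append(ny * MESH_DIM + nx)
    return banks

rr_counter = {}  # per-core round-robin position

def map_page(touching_core):
    # rotate among the banks close to the core that touches the page,
    # keeping pages near their consumer while spreading load
    candidates = nearby_banks(touching_core)
    i = rr_counter.get(touching_core, 0)
    rr_counter[touching_core] = i + 1
    return candidates[i % len(candidates)]

print(map_page(5))  # a bank within one hop of core 5's tile
print(map_page(5))  # the next nearby bank in round-robin order
```

The actual policy in [Ros2009] additionally considers which core accesses a page most frequently; the sketch shows only the distance-aware rotation.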

Publications

The full list of publications can be found by visiting the personal web pages of the group members, where most of the papers can be downloaded in PDF format.

FTCMP - Fault-tolerance in CMP Architectures

Introduction

Current technology trends are increasing the number of available transistors per chip. However, these trends are also making transistors more prone to permanent, intermittent and transient faults. To overcome these problems, we need to develop new architectural techniques that ensure the reliability of the chip. Traditionally, this has been achieved by adding a significant amount of redundant hardware, which increases the cost of the device and decreases its performance and energy efficiency.

Our main goal is to provide fault tolerance with minimal performance degradation. To achieve this, we propose fault-tolerance techniques both at the microarchitectural level and at the interconnection network level.

Main Results Achieved

At the microarchitectural level, we proposed REPAS, a fault-tolerant architecture based on redundant execution within SMT cores. With this proposal, we improve on previous works in terms of both performance degradation and area overhead. The results were published in the Euro-Par 2009 conference [Sánchez2009].

To minimise the added complexity and the hardware and performance overheads, we presented the Log-Based Redundant Architecture (LBRA) at HiPC 2010 [Sánchez2010], a highly decoupled redundant architecture built on a hardware transactional memory implementation. We leverage the existing hardware of LogTM-SE to provide a consistent view of memory between master and slave threads through a virtualized memory log, achieving transient fault detection and recovery with better scalability, higher decoupling and lower performance overhead than previous proposals.
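The core idea of log-based redundant execution can be conveyed with a toy model. This is a loose, hypothetical simplification (function names and the fault-signalling are ours, not LBRA's): the master records its loaded values in a log, the slave re-executes from the log so both threads see identical inputs, and a divergence in the computed results signals a transient fault.

```python
# Toy sketch of log-based redundancy: master logs its inputs, slave
# replays them, and mismatching results indicate a transient fault.

def master(compute, memory, log):
    vals = []
    for addr in sorted(memory):
        v = memory[addr]
        log.append(v)   # logged so the slave sees identical inputs
        vals.append(v)
    return compute(vals)

def slave(compute, log):
    # re-executes entirely from the log, decoupled from live memory
    return compute(list(log))

def run_redundant(compute, memory):
    log = []
    r1 = master(compute, memory, log)
    r2 = slave(compute, log)
    if r1 != r2:
        raise RuntimeError("transient fault detected: results diverge")
    return r1

print(run_redundant(sum, {0x10: 3, 0x14: 4}))  # 7, both copies agree
```

In the real architecture the log is the (virtualized) LogTM-SE memory log and the comparison happens in hardware; the sketch only shows why a shared log gives the slave a consistent replayable view of memory.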

For handling faults that occur in the on-chip interconnection network of CMPs, we propose adding fault tolerance at the level of the cache coherence protocol instead of at the level of the interconnection network itself. We have shown the viability of this approach and have developed several fault-tolerant cache coherence protocols. These results have been published in well-known international conferences (such as HPCA [Fernández2007] and DSN [Fernández2008a]) and in the TPDS journal ([Fernández2008b] and [Fernández2010]).

Finally, we have studied the impact of hard faults on cache memories. To this end, we developed an analytical model that determines the effect on the cache miss rate of word/block disabling techniques applied due to random cell failures; it was presented at IOLTS 2011 [Sánchez2011]. The proposed model is distinct from previous work in that it is exact rather than an approximation. It is also simpler than previous experimental frameworks, which rely on fault maps as a brute-force approach to statistically approximate the effect of random cell failures on caches.
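As a back-of-the-envelope illustration of why such a model matters (this is not the exact model from the paper, just elementary probability under an independence assumption): with independent cell failures at probability p, a block of b bits is fault-free with probability (1 - p)^b, so block disabling removes an expected capacity fraction of 1 - (1 - p)^b, which grows quickly with p.

```python
# Simple expectation under independent cell failures; illustrative only,
# not the exact miss-rate model of [Sánchez2011].

def expected_disabled_fraction(p_cell_fail, block_bits=64 * 8):
    # probability that every bit of a 64-byte block is functional
    p_block_ok = (1.0 - p_cell_fail) ** block_bits
    return 1.0 - p_block_ok

for p in (1e-5, 1e-4, 1e-3):
    print(p, round(expected_disabled_fraction(p), 4))
```

Even a per-cell failure probability of 10^-3 disables roughly 40% of 64-byte blocks in expectation, which is why the miss-rate impact of disabling schemes needs careful modelling rather than guesswork.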

Publications

The full list of publications can be found by visiting the personal web pages of the group members, where most of the papers can be downloaded in PDF format.

NOVELGPU - Benchmarking the GPU for Novel Computations

Introduction

Current processors integrate many simpler cores and thus have tremendous potential in terms of peak performance. Moreover, emergent platforms such as Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs) and Accelerated Processing Units (APUs) have been consolidated for developing scientific applications in areas including bioinformatics, finance, seismic processing and fluid dynamics.
However, it is not a trivial task to take advantage of the potential performance that these platforms offer the scientific community. In this task, we develop scientific applications from different fields, such as linear algebra, systems biology, natural computing and image processing, on these emergent platforms, providing insight into the peculiarities of their programming models and architectures. Currently, we are researching the application of those models to challenging problems, mainly derived from bioinformatics.

Main Results Achieved

Our studies began with a performance evaluation of the GPU as a general-purpose computing device, providing some insights into the peculiarities of the CUDA programming model [Cecilia2009][Cecilia2010a]. In addition, we discussed alternative models of computation inspired by natural computing and their efficient design on GPUs. So far, we have worked on two of those models: P systems [Cecilia2010b] and Ant Colony Optimisation [Cecilia2011].
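In Ant Colony Optimisation, the step that dominates runtime and is the usual target for GPU parallelisation is tour construction: each ant repeatedly picks its next city by roulette-wheel selection over pheromone and heuristic values. The sketch below shows that standard selection step in plain Python; names, weights and the tiny example instance are illustrative, not taken from [Cecilia2011].

```python
# Standard ACO roulette-wheel next-city selection; on a GPU this is
# typically parallelised with, e.g., one ant per thread block.
import random

def choose_next_city(current, unvisited, pheromone, heuristic,
                     alpha=1.0, beta=2.0):
    # selection probability proportional to tau^alpha * eta^beta
    weights = [pheromone[current][j] ** alpha * heuristic[current][j] ** beta
               for j in unvisited]
    total = sum(weights)
    r = random.uniform(0, total)
    acc = 0.0
    for city, w in zip(unvisited, weights):
        acc += w
        if acc >= r:
            return city
    return unvisited[-1]  # guard against floating-point round-off

# tiny 3-city example with uniform pheromone and heuristic values
tau = [[1.0] * 3 for _ in range(3)]
eta = [[1.0] * 3 for _ in range(3)]
print(choose_next_city(0, [1, 2], tau, eta))  # 1 or 2, chosen at random
```

The inherently sequential prefix-sum inside the roulette wheel is one of the peculiarities that makes an efficient GPU mapping non-trivial, which is precisely the kind of issue these studies examine.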

Introduction

In biomedical research, experimentation can be unfeasible in relevant study cases due, among other factors, to the intrinsic complexity of nature. Theoretical and computational methods, together with biophysical and biochemical modeling, can overcome these limitations, providing new understanding and solutions for world health problems. Nevertheless, these methods need to process huge amounts of data with high accuracy, which can seriously limit the applicability of bioinformatics methods. GPUs and supercomputers can drastically speed up the required calculations and enable methodological enhancements that were not feasible in the past [Pérez-Sánchez2009][Pérez-Sánchez2010][Pérez-Sánchez2011a]. Our main objectives are the development, implementation and exploitation of bioinformatics applications on massively parallel architectures such as GPUs and supercomputers, their experimental validation, and their application to relevant world health problems.

Currently:

We are developing new biophysical simulation methodologies from scratch on massively parallel architectures [Sánchez-Linares2011c][Cepas-Quiñonero2011a][Sánchez-Linares2012].

We are adding new improvements to the biomedical applications.

We are applying the new methodological features of these bioinformatics applications, made possible by the huge increase in computational capability obtained by implementing the methods on massively parallel architectures, to biomedically relevant problems in collaboration with several experimental groups [Pérez-Sánchez2011b][Navarro-Fernández2012].

Main Results Achieved

We have worked on the implementation of the most computationally demanding kernels of the Virtual Screening program FlexScreen on GPUs [Guerrero2011][Sánchez-Linares2011a][Sánchez-Linares2011b], Grids/Clusters [Pérez-Sánchez2011c] and supercomputers [Pérez-Sánchez2011d][Guerrero2012].

[Pérez-Sánchez2010] K. Klenin, H. Pérez-Sánchez, W. Wenzel. Method and system for determining the solvent accessible surface area and its derivatives of a molecule. European Patent Application 10002203.7, 2010.