Upcoming Talks

Distributed decision and verification by mobile agents

We consider decision problems that are solved in a distributed fashion by synchronous
mobile agents operating in an unknown, anonymous network. Each agent has a unique
identifier and an input string and they have to decide collectively a property which may
involve their input strings, the graph on which they are operating, and their particular
starting positions. Building on recent work by Fraigniaud and Pelc [J. Parallel Distrib.
Comput, vol. 109, pp. 117–128], we introduce several natural new computability classes
allowing for a finer classification of problems below MAV or its complement class co - MAV,
the former being the class of problems that are verifiable when the agents are provided
with an appropriate certificate. We provide inclusion and separation results among all
these classes. We also determine their closure properties with respect to set-theoretic
operations. Our main technical tool, which may be of independent interest, is a new meta-
protocol that enables the execution of a possibly infinite number of mobile agent protocols
essentially in parallel, similarly to the well-known dovetailing technique from classical
computability theory.
(joint work with David Ilcinkas)

Past Talks

Redundancy in Distributed Proofs

Distributed proofs are mechanisms enabling the nodes of a network to collectively and efficiently check the correctness of Boolean predicates on the structure of the network (e.g. having a specific diameter), or on data-structures distributed over the nodes (e.g. a spanning tree). We show that many network predicates have distributed proofs offering a high level of redundancy, explicitly or implicitly. We use this remarkable property of distributed proofs for establishing perfect tradeoffs between the size of the certificate stored at every node, and the number of rounds of the verification protocol.
Joint work with Laurent Feuilloley, Pierre Fraigniaud, Juho Hirvone and Mor Perry.

Resilient PDE solving approaches for exascale computing

Resilience represents one of the many challenges to be anticipated and addressed before viable exascale systems can emerge.
In this context, I will present an approach to tackle resilience for the solving of deterministic and stochastic partial differential equations (PDEs).
The approach is based on a domain decomposition framework, together with a sampling and optimization approach. In particular, the occurence of silent data corruption (SDC), modeled by random bit flips in the 64-bits representation of double precision variables, is overcome by means of robust regression techniques involving l1-minimization.
In the case of an elliptic PDE with random coefficients, the stochastic quantities are approximated by a polynomial chaos (PC) surrogate and we then resort to a hybrid approach involving both sampling and Galerkin projection.
Finally, I will present discrete a priori bounds on the solution of elliptic PDEs, used to enhance the resilient capabilities of the approach. This a priori knowledge is used to check the admissibility of local PDE solutions and to filter out corrupted data, thus making the subsequent regression more robust.

Efforts in Machine Topology-Aware Global Scheduling for HPC Systems

Technological advancements in integrated circuits and computer organization have been leading to better high performance computing (HPC) systems at the cost of more complex designs including specialized networks, hierarchical and distributed shared memory, accelerators with their own complex subsystems, among others. In these systems, achieving high performance and reliability have been important objectives for a while, and energy efficiency has also become an issue in recent years. With these objectives in mind, I have been working with collaborators in different scenarios by trying to understand the behavior of applications and the architecture of the computing systems where they execute, and by combining this information to identify issues and propose new solutions. In this presentation, I will focus on some of my experience with the development of machine topology-aware global scheduling algorithms, and I will discuss future plans related to hybrid-memory systems.

Collaborative delivery by robots that can share energy

Unusual Day!
The seminar will take place on Wednesday
instead of Tuesday.

We consider the problem of delivery of a packet from a source node to a destination node in a graph, by a team of mobile robots that have very low battery capacity (two units of energy, with one unit spent each time an agent traverses an edge). The robots can collaborate to deliver the packet by exchanging energy or the packet itself whenever two of them are co-located. We study two variants of the problem: In the fixed placement variant, the robots are initially placed at some given nodes of the graph and the question is whether delivery is feasible. We prove that this problem is NP-complete. In the free placement variant, the initial positions of the robots are not fixed but a subset of nodes H of the graph is given as input together with an integer k, and the question is whether there exists a placement of k robots at nodes in H such that delivery is possible. We prove that this problem can be solved in polynomial time, by actually computing the minimum initial energy and the placement of robots at nodes in H that allows them to carry out the delivery.

(joint work with Shantanu Das, Dariusz Dereniowski, and Christina Karousatou)

March 27, 2018:
Oguz Kaya (Inria Bordeaux)

Algorithmic and HPC Challenges in Parallel Tensor Computations

Tensors, or multi-dimensional arrays, have been increasingly used in the recent past in many application domains including signal processing, quantum chemistry, data analysis and machine learning. Tensor decompositions, generalization of matrix decompositions such as SVD to higher dimensions, are employed to find a low-rank representation of data in these applications. This in turn enables finding feasible solutions to the problem at hand with a proper interpretation of this compact representation. Specifically, sparse tensor decompositions are used in data analysis involving recommender systems, graph analytics, and anomaly detection in order to predict missing entries in the tensor, while dense tensor decompositions are heavily employed in signal processing for the detection of unknown signal sources. Recently, another promising use case of tensors have arised in the solution of linear systems and eigenvalue problems in higher dimensional problems where matrices and vectors can be expressed using low-rank tensor decompositions, and all matrix-vector operations can be carried out in this compressed form with tremendous computational and memory gains. In all these applications, computing tensor decompositions efficiently is indispensible for rendering tensor methods practical when dealing with data of massive scale. The focus of this talk is challenges encountered in accelerating the computation of sparse and dense tensor decompositions using effective shared/distributed memory parallelization, partitioning, data structures, and algorithms.

January 9, 2018:
Julien Herrmann (Georgia Tech, Atlanta, USA)

Acyclic Partitioning of Large Directed Acyclic Graphs

Finding a good partition of a computational directed acyclic graph associated with an algorithm can help find an execution pattern improving data locality, conduct an analysis of data movement, and expose parallel steps. The partition is required to be acyclic , i.e., the inter-part edges between the vertices from different parts should preserve an acyclic dependency structure among the parts.
In this work, we adopt the multilevel approach with coarsening, initial partitioning,
and refinement phases for acyclic partitioning of directed acyclic graphs and develop a direct k -way partitioning scheme. To the best of our knowledge, no such scheme exists in the literature. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics.
The quality of the computed acyclic partitions is assessed by computing the edge cut, the total volume of communication between the parts, and the critical path
latencies. We use the solution returned by well-known undirected graph partitioners as a baseline to evaluate our acyclic partitioner, knowing that the space of solution is more restricted in our problem.

July 11, 2017:
Antonella Del Pozzo
(CEA)

Building Distributed Computing Abstractions in the Presence of Mobile Byzantine Failures

In a distributed system the nodes affected by arbitrarily failures are modeled
as Byzantine. In this model the set of Byzantine processes never changes during
the time contrarily to what happens in a long lasting execution, where nodes
can be affected and restored. The Mobile Byzantine Model aims to capture such
kind of scenario. In this talk we present the different Mobile Byzantine
Failures models that have been defined so far and the problems, Consensus,
Approximate Agreement and Distributed Registers, that have been solved in such
models. Besides that, we are going to underline the challenges and
impossibilities that arises when we move from the Byzantine to the Mobile
Byzantine Failures model.

Slow Molecule Revolution

Unusual Day!
The seminar will take place on Wednesday instead of Tuesday.

Suppose to function correctly, a chemical system requires a single molecule of a certain species L. Preparing a solution with just a single copy of L is a difficult task to achieve with imprecise pipettors. Could we engineer artificial reactions (a chemical election algorithm, so to speak) that whittle down an initially large count of L to 1? Yes, with the reaction L+L→L+F: whenever two candidate leaders encounter each other, one drops out of the race. In volume v convergence to a single L requires expected time proportional to v; the final reaction --- two lone L's seeking each other in the vast expanse of volume v --- dominates the whole expected time.

One might hope that more cleverly designed reactions could elect a leader more quickly. We dash this hope: L+L→L+F, despite its sloth, is the fastest chemical algorithm for leader election there is (subject to some reasonable constraints on the reactions). The techniques generalize to establish lower bounds on the time required to do other distributed computing tasks, such as computing which of two species X or Y holds an initial majority, whether the initial number of X's is odd, or cutting the total number of X's in half.

February 21, 2017:
Ludomir Oteski
(ONERA)

Optimisation de code pour le calcul scientifique sur carte Graphique

Unusual Place!
The seminar will take place in room 455 instead of room 465

June 7, 2016:
Ian Masliah
(Univ Paris-Sud)

An Overview of Integrated graphics processors

Unusual Place!
The seminar will take place in room 475 instead of room 465

Exploring the use of Graphic Processor Unit (GPGPU) as a general purpose
co-processor to increase the computational power has been an active
research field since the release of the CUDA language in 2007. GPUs have
now become an integrated component in devices like smartphones, games
console or even modern CPUs. Integrated GPUs are an essential part of these
chipsets as performance and energy efficiency are the main concerns. In
this talk, we will discuss the programming model of integrated graphics and
their advantages as we now have the newest NVIDIA Tegra system-on-chip
available for the ParSys team.

May 31, 2016:
Asma Farjallah
(Intel)

Overview of Intel Knights Landing Processors

The Knights Landing is the 2^nd generation of Intel Xeon Phi product.
This new processor provides significant enhancement in performance
compared to the previous generation. Among the new features of this
self-booting CPU, we can name the binary compatibility with Xeon,
power efficiency of the cores, new 512-bit vector instruction set,
on-package high-bandwidth memory and integrated fabric. For this
presentation, we will overview the main hardware characteristics of
Knights Landing processors and their impact on the way we write code.

February 16, 2016:
Frédéric Gruau
(Univ Paris Sud)

A computing medium for General purpose Computation

Because of locality of communication in space, a computing medium such
as a Cellular Automata (CA) can scale abitrarily in size, thus
providing a machine which power can grow arbitrarily.
This same locality , however, makes the programming difficult, and
-CA as a computing device- ... look more like a toy for
mathematicians fond of funny images. They are mostly known for the
game of life rule, which allthough universal, is light-years away from
real practical computations.
We will show that a CA rule can be real General Purpose, and even an
efficient computing device. We use two main ingredient:
1- we have develop a compiler and a simulation platform allowing to
program complex CA with radius of 20 and thousands of gates, using
layers, and object oriented software, and an efficient execution
exploiting the SIMD integer operations.
2- we obtain general purpose-ability by implementing SELF
DEVELOPING SYSTEM.

The GPCA rule is not finalised yet, allthough we will be able to
present several simulations, showing the feasibility.
First, we show a CA rule able to do homogeneization of agents
distributed on the CA medium. When the agents are hierachically
organized, the homogeneisation still work on the hierarchy.
Second we present the current status of our GPCA. It executes
instructions send by a host, which propage through the medium like a
wave.
The first kind of instructions are local instructions: the effect is
immediate as the instructions traverse a CA cell. Local instructions
allow to select agents, and set registers.
The second kind of instructions call collective instructions, define
regions on the CA, in which a sustained activy takes blace by a
succession of 4- 5 region-waves giving birth to each other, at
specific places. They are the equivalent of the 4 - 5 cycles, needed
to execute a machine instruction in a processor.
Those instruction are micro-programmed, the micro-instructions are
associated to each region wave.
For the moment being we have implemented only 3 collective
instructions, the easiest one: Election, Computation, and
Duplication.
We will show a development obtained by sending 8 Duplications to an
initially single agent. It generates 256 offsprings, in only 600
iterations of the CA rules, which illustrates the scalability.
The finalised GPCA will make use of 4 more elaborate instructions,
whose functionning is described in detail in a report, but not yet
implemented.
The GPCA cell will have around 64 bits of memory, and 20,000 gates.
For the moment, it has only 6000 gates.

We are used to think of computer as General-Purpose machine with
arbitrary large memory. The GPCA is a General-Purpose machine with
arbitrary large power.

Task-based multifrontal QR solver for heterogeneous architectures

In this work we investigate the design of task-based sparse direct
solvers which constitute extremely irregular workloads, with tasks of
different granularities and characteristics with variable memory
consumption on top of runtime systems. We prove the usability and
effectiveness of our approach with the implementation of a sparse
matrix multifrontal factorization based on a Sequential Task Flow
parallel programming model. Using this programming model, we developed
features such as the integration of dense 2D Communication Avoiding
algorithms in the multifrontal method allowing for better scalability
compared to the original approach used in the qr_mumps solver.
Following this approach, we address heterogeneous architectures where
task granularity and scheduling strategies are critical to achieve
performance. We present, for the multifrontal method, a hierarchical
strategy for data partitioning and a scheduling algorithm capable of
handling the heterogeneity of resources.

December 8, 2015:
Janna Burman
(Univ Paris Sud)

Space-Optimal Counting in Population Protocols

In this paper, we study the fundamental problem of counting,
which consists in computing the size of a system. We consider
the distributed communication model of population protocols of finite
state, anonymous and asynchronous mobile devices (agents) communicating
in pairs (according to a fairness condition). This work significantly
improves the previous results known for counting in this model, in terms
of (exact) space complexity. We present and prove correct the first space
optimal
protocols solving the problem for two classical types of fairness,
global and weak. Both protocols require no initialization of the counted
agents.

The protocol designed for global fairness, surprisingly, uses only one
bit of memory (two states) per counted agent. The protocol, functioning
under weak fairness, requires the necessary log P bits (P states, per
counted agent) to be able to count up to P agents. Interestingly, this protocol
exploits the intriguing Gros sequence of natural numbers, which is
also used in the solutions to the Chinese Rings and the Hanoi Towers
puzzles.

Computers are used in a wide range of applications, including
embedded control systems in cars and aircraft, cloud computing,
and the Internet. As we increasingly depend on the correct
operation of such systems, it becomes crucial to design them in a
way that ensures that they behave as expected. To do so, one has
to address two problem areas: on the one hand, partial failure
that is outside the control of a system designer (such as power
outages or bit-flips due to radiation), and on the other hand,
design faults (bugs). The former is classically addressed by
means of replication and fault-tolerant distributed algorithms,
while the latter is dealt with by rigorous software engineering
methods, such as model checking. In order to maximize the
reliability, one should deploy fault-tolerant distributed
algorithms that have been verified by model checking. However,
only very few distributed algorithms have been automatically
verified, because many aspects of distributed algorithms, such as
parameterization, faults, uncertain timing, etc. still pose
research challenges for model checking.

In this talk I will discuss several verification techniques that
we recently generalized and used for automatic verification of
distributed algorithms. These techniques include parametric
interval data and counter abstraction, offline partial order
reductions, and acceleration.

Based on joint work with Annu Gmeiner, Igor Konnov, Helmut Veith,
and Ulrich Schmid

November 24, 2015:
Hartwig Anzt
(University of Tennessee)

Sparse triangular solvers are typically parallelized using level-scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. This talk proposes an iterative approach for solving sparse triangular systems when an approximation is sufficient. Although not suitable for all problems, the approach can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. Significant performance gains can be achieved when using this approach for a preconditioned Krylov subspace method.

Time-Optimal Approximate Consensus in Highly Dynamic Networks

We investigate the approximate consensus problem in highly dynamic networks in
which topology may change continually and unpredictably. We prove that in both
synchronous and partially synchronous networks, approximate consensus is
solvable if and only if the communication graph in each round has a rooted
spanning tree. Interestingly, the class of averaging algorithms, which have the
benefit of being memoryless and requiring no process identifiers, entirely
captures the solvability issue of approximate consensus in that the problem is
solvable if and only if it can be solved using any averaging algorithm.

After this characterization of solvability of approximate consensus, we turn to
the question of time complexity.
We give an algorithm that solves approximate consensus in time linear in the
numer of nodes in the network and show the optimality of its time complexity up
to a multiplicative constant.
We also present its generalization to higher dimensional values and the case of
finite memory, in which rounding procedures will be necessary.

Unusual Time/Place!
The seminar will take place at 2:30 p.m. in Amphi Digiteo Moulon
(Bât 660).

Recent Progress in the Computation of Matrix Functions

Functions of matrices are widely used in science, engineering and the
social sciences, due to the succinct and insightful way they allow
problems to be formulated and solutions to be expressed. New
applications involving matrix functions are regularly being found,
ranging from small but difficult problems in medicine to huge, sparse
systems arising in the solution of partial differential equations.
In this talk I will outline the history of matrix functions,
summarize some applications,
and describe recent developments concerning computation of
the matrix exponential and logarithm and their Fréchet derivatives.

Biography: Nick Higham is Richardson Professor of Applied Mathematics in the School of Mathematics, University of Manchester. His degrees (BA 1982, MSc 1983, PhD 1985) are from the University of Manchester, and he has held visiting positions at Cornell University and the Institute for Mathematics and its Applications, University of Minnesota. He is Director of Research within the School of Mathematics, Head of the Numerical Analysis Group, and was Director of the Manchester Institute for Mathematical Sciences (MIMS) 2004-2010. He was elected Fellow of the Royal Society in 2007, is a SIAM Fellow, and held a Royal Society-Wolfson Research Merit Award (2003-2008).
He is well known for his research on the accuracy and stability of numerical algorithms, and the second edition of his 700-page monograph on this topic was published by SIAM in 2002. His most recent book, Functions of Matrices: Theory and Computation (SIAM, 2008), is the first research monograph devoted to this topic. He is the Editor of the Princeton Companion to Applied Mathematics (2015, circa 1000 pages).
He has more than 100 refereed publications on topics such as rounding error analysis, linear systems, least squares problems, matrix functions and nonlinear matrix equations, condition number estimation, and polynomial eigenvalue problems. His research has been supported by grants from EPSRC and by fellowships from the Nuffield Foundation, the Royal Society and the Leverhulme Trust. He currently holds a 2M euro ERC Advanced Grant (2011-2016) supporting his research on matrix functions.
Higham is a member of the editorial boards of the journals Acta Numerica, Forum of Mathematics, Foundations of Computational Mathematics. IMA Journal of Numerical Analysis, Linear Algebra and Its Applications, and Numerical Algorithms, He is also (Founding) Editor-in-Chief of the SIAM Fundamentals of Algorithms book series.
He was Vice President at Large (2010-2013) of SIAM and has served for over ten years on the SIAM Board of Trustees and the SIAM Council. He has also served on the Board of Directors of the International Linear Algebra Society, and as Chair of the SIAM Activity Group on Linear Algebra. He is a frequent invited speaker at international conferences, served for 17 years on the (permanent) organizing committee of the Householder Symposia on Numerical Linear Algebra, and was a member of the Scientific Program Committee for ICIAM 2011.
Higham has contributed software to LAPACK and the NAG library, and has written numerous M-files included in MATLAB. His algorithms are also included in Julia, SciPy, Mathematica and other packages.
Honours include the Alston S. Householder Award VI, 1987 (for the best Ph.D. thesis in numerical algebra 1984--1987), the 1988 Leslie Fox Prize in Numerical Analysis, a 1999 Junior Whitehead Prize from the London Mathematical Society, designation as a "Highly Cited Researcher" by Thomson/ISI in 2006, and the 2008 Fröhlich Prize of the London Mathematical Society.
Higham is also author of the best-selling SIAM books Handbook of Writing for the Mathematical Sciences (2nd edition, 1998) and MATLAB Guide (with D. J. Higham, 2nd edition, 2005), and is a contributor to the popular Penguin Dictionary of Mathematics (fourth edition, 2008).

With the increased failure rate expected in future extreme scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources that have to be used to execute the replicas of the application's processes. In this talk, I will introduce the intra-parallelization technique, a solution that avoids replicating all computation by introducing work-sharing between replicas. I will also present SDR-MPI, our optimized implementation of active replication for MPI applications. Experiments on a representative set of benchmarks show that intra-parallelization together with SDR-MPI allow achieving more than 50% efficiency without compromising fault tolerance.

Efficient abstractions for GPGPU programming

Currently, General purpose (GP)GPU programming is a popular solution to achieve high performance. It couples inexpensive highly parallel computing units with classic CPUs. These heterogenous systems lead to complex designs combining multiple paradigms and programming languages to manage each hardware architecture. This talk will present a set of tools to harness GPGPU programming through the OCaml programming language. It will describe the SPOC library, which handles GPGPU subprograms (kernels) and data transfers between devices. Then, It will show how SPOC expresses GPGPU kernels: through interoperability with common low-level extensions (from Cuda and OpenCL frameworks) but also via an embedded DSL forOCaml. Finally, I will present how to manage those kernels through parallel skeletons.

Equivalence of Iterated and Non-Iterated Models of Distributed Shared Memory

The Iterated Immediate Snapshot model (IIS), due to its elegant geometrical representation, has become standard for applying topological reasoning to distributed computing. Its modular structure makes it easier to analyze than the more realistic (non-iterated) read-write Atomic-Snapshot memory model (AS).

In this talk we show that AS and IIS are equivalent with respect to distributed computability: a distributed function is computable in AS if and only if it is computable in IIS. Therefore, the computability of any distributed function in AS can be equivalently investigated in IIS. This is a joint work with Petr Kuznetsov and Eli Gafni to be presented at OPODIS 2014.

Lazy Spilling for a Time-Predictable Stack Cache: Implementation and Analysis

The growing complexity of modern computer architectures increasingly complicates the prediction of the run-time behavior of software. For real-time systems, where a safe estimation of the program’s worst-case execution time is needed, time-predictable computer architectures promise to resolve this problem. A stack cache, for instance, allows the compiler to efficiently cache a program’s stack, while static analysis of its behavior remains easy. Likewise, its implementation requires little hardware overhead.

This work introduces an optimization of the standard stack cache to avoid redundant spilling of the cache content to main memory, if the content was not modified in the meantime. At first sight, this appears to be an average-case optimization. Indeed, measurements show that the number of cache blocks spilled is reduced to about 17% and 30% in the mean, depending on the stack cache size. Furthermore, we show that lazy spilling can be analyzed with little extra effort, which benefits the worst-case spilling behavior that is relevant for a real-time system.

Biography: Florian Brandner is an Assistant Professor at ENSTA ParisTech working on the intersection of time-predictable computer architecture, static program analysis, worst-case execution time analysis, and compilers. He received is PhD in 2009 from the Vienna University of Technology, working on processor description languages and the automatic generation of machine-code generators. Then spend two years as a post-doc at INRIA / ENS de Lyon, working on register allocation. Before joining ENSTA, he worked within the FP7-STREP project T-CREST at the Technical University of Denmark on time-predictable computer architectures, networks-on-chip, static program analyses, and compilers for real-time systems.