2:00pm

Most embedded devices are multicore, and we see concurrency becoming ubiquitous for machine learning, machine vision, and self-driving cars. Thus the age of concurrency is upon us, so whether you like it or not, concurrency is now just part of the job. It is therefore time to stop being concurrency cowards and start on the path towards producing high-quality high-performance highly scalable concurrent software artifacts. After all, there was a time when sequential programming was considered mind-crushingly hard: In fact, in the late 1970s, Paul attended a talk where none other than Edsger Dijkstra argued, and not without reason, that programmers could not be trusted to correctly code simple sequential loops. However, these long-past perilous programming pitfalls are now easily avoided with improved programming models, heuristics, and tools. We firmly believe that concurrent and parallel programming will make this same transition. This talk will help you do just that.

Besides, after more than a decade since the end of the hardware "free lunch", why should parallel programming still be hard?

Paul E. McKenney has been coding for almost four decades, more than half of that on parallel hardware, where his work has earned him a reputation among some as a flaming heretic. Over the past decade, Paul has been an IBM Distinguished Engineer at the IBM Linux Technology Center...

Maged Michael is a software engineer at Facebook. He is the inventor of hazard pointers, lock-free malloc and several algorithms for concurrent data structures. His work is included in several IBM products where he was a Research Staff Member at the IBM T.J. Watson Research Cente...

Michael Wong is VP of R&D at Codeplay Software. He is a current Director and VP of ISOCPP, and a senior member of the C++ Standards Committee with more than 15 years of experience. He chairs the WG21 SG5 Transactional Memory and SG14 Games Development/Low Latency/Financials C++ groups and is the co-author of a number of C++/OpenMP/Transactional Memory features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. He has published numerous research papers and is the author of a book on C++11. He has been an invited speaker and keynote at numerous conferences. He is currently the editor of SG1 Concurrency TS and SG5 Transactional Memory TS. He is also the Chair of the SYCL standard and all Programming Languages for Standards Council of Canada. Previously, he was CEO of OpenMP involved with taking OpenMP toward Accelerator support and the Technical Strategy Architect responsible for moving IBM's compilers to Clang/LLVM after leading...

3:15pm

Most embedded devices are multicore, and we see concurrency becoming ubiquitous for machine learning, machine vision, and self-driving cars. Thus the age of concurrency is upon us, so whether you like it or not, concurrency is now just part of the job. It is therefore time to stop being concurrency cowards and start on the path towards producing high-quality high-performance highly scalable concurrent software artifacts. After all, there was a time when sequential programming was considered mind-crushingly hard: In fact, in the late 1970s, Paul attended a talk where none other than Edsger Dijkstra argued, and not without reason, that programmers could not be trusted to correctly code simple sequential loops. However, these long-past perilous programming pitfalls are now easily avoided with improved programming models, heuristics, and tools. We firmly believe that concurrent and parallel programming will make this same transition. This talk will help you do just that.

Besides, after more than a decade since the end of the hardware "free lunch", why should parallel programming still be hard?

Paul E. McKenney has been coding for almost four decades, more than half of that on parallel hardware, where his work has earned him a reputation among some as a flaming heretic. Over the past decade, Paul has been an IBM Distinguished Engineer at the IBM Linux Technology Center...

Maged Michael is a software engineer at Facebook. He is the inventor of hazard pointers, lock-free malloc and several algorithms for concurrent data structures. His work is included in several IBM products where he was a Research Staff Member at the IBM T.J. Watson Research Cente...

Michael Wong is VP of R&D at Codeplay Software. He is a current Director and VP of ISOCPP, and a senior member of the C++ Standards Committee with more than 15 years of experience. He chairs the WG21 SG5 Transactional Memory and SG14 Games Development/Low Latency/Financials C++ groups and is the co-author of a number of C++/OpenMP/Transactional Memory features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. He has published numerous research papers and is the author of a book on C++11. He has been an invited speaker and keynote at numerous conferences. He is currently the editor of SG1 Concurrency TS and SG5 Transactional Memory TS. He is also the Chair of the SYCL standard and all Programming Languages for Standards Council of Canada. Previously, he was CEO of OpenMP involved with taking OpenMP toward Accelerator support and the Technical Strategy Architect responsible for moving IBM's compilers to Clang/LLVM after leading...

4:45pm

C++11 introduced atomic operations. They allowed C++ programmers to express a lot of control over how memory is used in concurrent programs and made portable lock-free concurrency possible. They also allowed programmers to ask a lot of questions about how memory is used in concurrent programs and made a lot of subtle bugs possible.

This talk analyzes C++ atomic features from two distinct points of view: What do they allow the programmer to express? What do they really do? The programmer always has two audiences: the people who will read the code, and the compilers and machines which will execute it. This distinction is, unfortunately, often missed. For lock-free programming, the difference between the two viewpoints is of particular importance: every time an explicit atomic operation is present, the programmer is saying to the reader of the program "pay attention, something very unusual is going on here." Do we have the tools in the language to precisely describe what is going on and in what way it is unusual? At the same time, the programmer is saying to the compiler and the hardware "this needs to be done exactly as I say, and with maximum efficiency since I went to all this trouble."

This talk starts from the basics, inasmuch as this term can be applied to lock-free programming. We then explore how the C++ lock-free constructs are used to express the programmer's intent clearly (and when they get in the way of clarity). Of course, there will be code to look at and to be confused by. At the same time, we never lose track of the fact that atomics are one of the last resorts of efficiency, and the question of what happens in hardware and how fast it happens is of paramount importance. Of course, the first rule of performance ("never guess about performance!") applies, and any claim about speed must be supported by benchmarks.

If you have never used C++ atomics but want to learn, this is the talk for you. If you think you know C++ atomics but are unclear on a few details, come to fill these few gaps in your knowledge. If you really do know C++ atomics, come to feel good (or to be surprised, and then feel even better).
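To make the "two audiences" point concrete, here is a minimal sketch (not taken from the talk) of a release/acquire handoff using C++11 atomics. The function name `publish_and_consume` is a hypothetical helper introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: hands one value from a producer thread to the caller.
int publish_and_consume() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    // Spin until the flag is set. The acquire load pairs with the release
    // store, so the write to payload is visible once ready reads true.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int seen = payload;

    producer.join();
    return seen;
}
```

The explicit memory orders tell the reader which accesses carry the synchronization, and tell the compiler and hardware the weakest ordering the programmer is willing to accept. Whether this is faster than the default sequentially consistent ordering on a given machine is something only a benchmark can answer.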

Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp. His earlier positions included a Senior Software Engineer at Google and a Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in...

9:00am

The most significant improvement in C++17 will be Parallel Algorithms in the STL. But it is meant only for CPUs, as C++ does not define heterogeneous devices yet (though SG14 is working on that). How would you like to learn how to run Parallel STL algorithms on both CPU and GPU?

Parallel STL is an implementation of the Technical Specification for C++ Extensions for Parallelism for both CPU and GPU, written in the SYCL heterogeneous C++ language. This technical specification describes a set of requirements for implementations of an interface that C++ programs may use to invoke algorithms with parallel execution. In practice, this specification allows users to pass execution policies to traditional STL algorithms, which enables the execution of those algorithms in parallel. The various policies specify different kinds of parallel execution.
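As an illustration of execution policies, the sketch below uses the policy names as they were standardized in C++17 (`std::execution::seq`, `par`, `par_unseq`). It targets the CPU through the standard library only and does not use the SYCL implementation described here; `sorted_sum` is a hypothetical helper. Some toolchains need an extra backend library for the parallel policies (for example, libstdc++ relies on TBB).

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical helper: sorts a copy of the input and returns its sum.
long sorted_sum(std::vector<int> values) {
    // Same algorithms as before; only the first argument is new:
    //   std::execution::seq       - sequential
    //   std::execution::par       - may run on multiple threads
    //   std::execution::par_unseq - may run on multiple threads and vectorize
    std::sort(std::execution::par, values.begin(), values.end());
    assert(std::is_sorted(values.begin(), values.end()));

    // std::reduce is the order-relaxed counterpart of std::accumulate.
    return std::reduce(std::execution::par, values.begin(), values.end(), 0L);
}
```

The policy is a permission, not a command: an implementation may still run the algorithm sequentially.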

So how does a Technical Specification become a Standard? As it turns out, in this case, not without harrowing twists and turns worthy of an Agatha Christie novel. This talk will also be the story behind the C++17 standardization process of the Parallelism TS and why we made so many changes. While it started life as a Technical Specification (TS), did you know all the changes we made to it before we added it to C++17 and why? For example, we changed the names of the execution policies, removed exception handling support, disabled dynamic execution, unified some of the numeric algorithm names, allowed copying arguments to function objects given to parallel algorithms, and addressed complexity and iterator concerns as we lived through it as a member of SG1 and the editor of several TSes.

The implementation is available here: https://github.com/KhronosGroup/SyclParallelSTL/blob/master/README.md

Gordon Brown is a senior software engineer at Codeplay Software specializing in heterogeneous programming models for C++. He has been involved in the standardization of the Khronos standard SYCL and the development of Codeplay's implementation of the standard from its inception...

Michael Wong is VP of R&D at Codeplay Software. He is a current Director and VP of ISOCPP, and a senior member of the C++ Standards Committee with more than 15 years of experience. He chairs the WG21 SG5 Transactional Memory and SG14 Games Development/Low Latency/Financials C++ groups and is the co-author of a number of C++/OpenMP/Transactional Memory features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. He has published numerous research papers and is the author of a book on C++11. He has been an invited speaker and keynote at numerous conferences. He is currently the editor of SG1 Concurrency TS and SG5 Transactional Memory TS. He is also the Chair of the SYCL standard and all Programming Languages for Standards Council of Canada. Previously, he was CEO of OpenMP involved with taking OpenMP toward Accelerator support and the Technical Strategy Architect responsible for moving IBM's compilers to Clang/LLVM after leading...

2:00pm

This session will cover the various kinds of problems which can be solved by using multithreaded concepts or techniques. I will discuss the challenges involved with designing and implementing a multithreaded application.

I will provide a brief introduction to multithreading terminology and an overview of the libGuarded library.

The discussion will include C++11 multithreading, C++17 concurrency TS, and new abstractions we can build on top of these features. Basic familiarity with the C++11 threading library will be helpful but is not required.

** Part II

The main focus of this talk will be the importance of lockless containers and RCU technology. I will explain the value of this approach and why it was added to libGuarded. I will also cover recent changes made to the RCU containers.

I will explain the importance of libGuarded and how it was used in the CsSignal library to prevent deadlocks.

Either basic familiarity with multithreading or attendance in Part I of this talk is suggested.

I have been working as a programmer for nearly twenty years. My degree is in Computer Science from Cal Poly San Luis Obispo. I have transitioned to independent consulting and I am currently working on a project for RealtyShares in San Francisco.

Co-founder of Copper...

3:15pm

This session will cover the various kinds of problems which can be solved by using multithreaded concepts or techniques. I will discuss the challenges involved with designing and implementing a multithreaded application.

I will provide a brief introduction to multithreading terminology and an overview of the libGuarded library.

The discussion will include C++11 multithreading, C++17 concurrency TS, and new abstractions we can build on top of these features. Basic familiarity with the C++11 threading library will be helpful but is not required.

** Part II

The main focus of this talk will be the importance of lockless containers and RCU technology. I will explain the value of this approach and why it was added to libGuarded. I will also cover recent changes made to the RCU containers.

I will explain the importance of libGuarded and how it was used in the CsSignal library to prevent deadlocks.

Either basic familiarity with multithreading or attendance in Part I of this talk is suggested.

I have been working as a programmer for nearly twenty years. My degree is in Computer Science from Cal Poly San Luis Obispo. I have transitioned to independent consulting and I am currently working on a project for RealtyShares in San Francisco.

Co-founder of Copper...

2:00pm

RCU (Read-Copy-Update) is often the highest-performing way to implement concurrent data structures. The differences in performance between an RCU implementation and the next best alternative can be striking. And yet, RCU algorithms have received little attention outside of the world of kernel programming. Largely, this is because the most common drawback of an RCU solution is complicated, and often wasteful, memory management. Kernel code has some advantages here, whereas a generic solution is much harder to design.

There are, however, cases when RCU is simple to use, offers very high performance, and the memory issues are easy to manage. In fact, you may already be using the RCU approach in your program without realizing it! Wouldn't that be cool? But careful now: you may be already using the RCU approach in your program in a subtly wrong way. I'm talking about the kind of way that makes your program pass every test you can throw at it and then crash in front of your most important customer (but only when they run their most critical job, not when you try to reproduce the problem).

In the more general case, we have to confront the problems of RCU memory management, but the reward of much higher performance can make it well worth the effort.

This talk will give you an understanding of how RCU works, what makes it so efficient, and what the conditions and restrictions are for a valid application of an RCU algorithm. We focus on using RCU outside of kernel space, so we will have to deal with the problems of memory management... and yes, there will be garbage collection.
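The read, copy, update pattern can be sketched in user space as follows. This is an illustration under stated assumptions, not the talk's implementation: memory reclamation is delegated to `std::shared_ptr` reference counting instead of the grace periods a real RCU uses, and readers pay for an atomic reference-count update that true RCU readers avoid. The class name `SnapshotList` is hypothetical. The `std::atomic_load` and `std::atomic_store` overloads for `shared_ptr` are C++11 facilities that C++20 deprecates in favor of `std::atomic<std::shared_ptr<T>>`.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical read-mostly container built on the read-copy-update idea.
class SnapshotList {
public:
    using Snapshot = std::shared_ptr<const std::vector<int>>;

    SnapshotList() : data_(std::make_shared<const std::vector<int>>()) {}

    // Read: grab the current version. It stays valid and unchanged for as
    // long as the caller holds the returned pointer.
    Snapshot read() const { return std::atomic_load(&data_); }

    // Copy, then update: build a modified copy and publish it atomically.
    void push_back(int value) {
        std::lock_guard<std::mutex> lock(writers_);  // serialize writers only
        Snapshot current = std::atomic_load(&data_);
        auto next = std::make_shared<std::vector<int>>(*current);
        next->push_back(value);
        std::atomic_store(&data_, Snapshot(std::move(next)));
        // The old version is freed when its last reader releases it.
    }

private:
    Snapshot data_;
    std::mutex writers_;
};
```

A reader that took a snapshot before an update keeps seeing the old contents, which is exactly the property that makes the memory management question (when may the old version be freed?) the hard part of RCU.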

Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp. His earlier positions included a Senior Software Engineer at Google and a Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in...

3:15pm

Mutexes have frequently been observed to outperform reader-writer locks in domains where, logically, reader-writer locks should dominate. I was recently given an opportunity to address this inconsistency and, to demonstrate my certainty of success, accepted a bet regarding outperforming a mutex for a high read, low write work task with short — but not extremely short — lock hold times.

I lost the bet.

I resolved to understand how I lost this bet and, in my mind at least, convert this "loss" to a "win". The bet focused on a Linux platform (the evaluations presented are multi-platform). This presentation will discuss design criteria for a reader-writer lock, the "losing" implementation, the performance results for the "losing" implementation, a possible explanation for the loss, the novel "winning" implementation, and the results supporting the value of the "winning" implementation.

A basic understanding of mutexes, reader-writer locks, and atomic operations is recommended for attendees.
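For readers who want a refresher on the distinction at stake, here is a minimal sketch of a reader-writer lock in use, based on the C++17 `std::shared_mutex`. It is neither the "losing" nor the "winning" implementation from the talk, and the `Settings` class is hypothetical.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical read-mostly table protected by a reader-writer lock.
class Settings {
public:
    // Readers take a shared lock, so several can hold it at once.
    std::string get(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = values_.find(key);
        return it == values_.end() ? std::string() : it->second;
    }

    // Writers take an exclusive lock and block readers and other writers.
    void set(const std::string& key, std::string value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        values_[key] = std::move(value);
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, std::string> values_;
};
```

Replacing `std::shared_mutex` with `std::mutex` (and both lock types with `std::lock_guard`) gives the plain-mutex variant; which one is faster for a given hold time and read/write mix has to be measured rather than assumed.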

4:45pm

This is the long awaited continuation of a previous CppCon talk ("Lock-free by Example") on an "interesting" lock-free queue. ("interesting"? Well, "multi-producer, multi-consumer, growing, shrinking, mostly contiguous, lock-free circular queue" is a bit long. Maybe "complicated" is a better word.)

Attendance at the previous talk is completely NOT required.

This time we will not just review where we left off, but attempt to "prove" that what we did is actually correct, and thus discuss how to prove correctness of lock-free algorithms, and discuss provability vs testing.

And then, with the first steps proven (or disproven! - and hopefully corrected!), we can continue to expand the features of the queue, and tackle the new challenges that arise.

Also, this is secretly a talk to convince you not to do lock-free programming. Shhh...

Tony has been coding for well over 25 years, and maybe coding well for some of that. Lots of pixel++, UX, threading, etc. Previously at Inscriber, Adobe, BlackBerry, he now enables Painting with Light at Christie. He is on the C++ Committee. He is a Ninja and a Jedi. Lock-free is...

2:00pm

With the advent of modern computer architectures characterized by, amongst other things, many-core nodes, deep and complex memory hierarchies, heterogeneous subsystems, and power-aware components, it is becoming increasingly difficult to achieve the best possible application scalability and satisfactory parallel efficiency. The community is experimenting with new programming models which are based on finer-grain parallelism and on flexible and lightweight synchronization, combined with work-queue-based, message-driven computation. Implementations of such a model are often based on a framework managing lightweight tasks, which allows flexible coordination of highly hierarchical parallel execution flows.

The recently growing interest in the C++ programming language in industry and in the wider community increases the demand for libraries implementing those programming models for the language. Developers of applications targeting high-performance computing resources would like to see libraries which provide higher-level programming interfaces shielding them from the lower-level details and complexities of modern computer architectures. At the same time, those APIs have to expose all necessary customization points such that power users can still fine-tune their applications enabling them to control data placement and execution, if necessary.

In this talk we present a new asynchronous C++ parallel programming model which is built around lightweight tasks and mechanisms to orchestrate massively parallel (and distributed) execution. This model uses the concept of (std) futures to make data dependencies explicit, employs explicit and implicit asynchrony to hide latencies and to improve utilization, and manages finer-grain parallelism with a work-stealing scheduling system enabling automatic load-balancing of tasks. As a result of combining those capabilities the programming model exposes auto-parallelization capabilities as emergent properties.

We have implemented this model as a C++ library exposing a higher-level parallelism API which is fully conforming to the existing C++11/14/17 standards and is aligned with the ongoing standardization work. This API and programming model has been shown to enable writing parallel and distributed applications for heterogeneous resources with excellent performance and scaling characteristics.
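The idea of using futures to make data dependencies explicit can be sketched with the standard library alone. The example below uses `std::async` and `std::future` from C++11 rather than the library presented in this talk, and it spawns a thread instead of a lightweight task; `parallel_sum` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums a vector in two halves that may run concurrently.
long parallel_sum(const std::vector<int>& values) {
    const auto middle =
        values.begin() + static_cast<std::ptrdiff_t>(values.size() / 2);

    // Launch the first half asynchronously; the future stands for its result.
    std::future<long> first_half =
        std::async(std::launch::async, [&values, middle] {
            return std::accumulate(values.begin(), middle, 0L);
        });

    // Meanwhile, compute the second half on the calling thread.
    const long second_half = std::accumulate(middle, values.end(), 0L);

    // get() is the explicit data dependency: it waits until the value exists.
    return first_half.get() + second_half;
}
```

The caller never names a thread or a lock. The only synchronization point is the place where the value is actually needed.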

Hartmut is a member of the faculty at the CS department at Louisiana State University (LSU) and a senior research scientist at LSU's Center for Computation and Technology (CCT). He received his doctorate from the Technical University of Chemnitz (Germany) in 1988. He is probably...

8:30pm

Attendees will learn what allows these architectures to use computational hardware accelerators such as GPUs, DSPs, and others with native C++, without resorting to proprietary APIs, programming libraries, or limited language features. The talk outlines the architectural pillars that make the accelerators peers of the host CPUs with support for full C++, and gives an overview of the open source AMD ROCm™ stack and the software ecosystem that provides the tools to use it on Intel- and AMD-based host platforms.

Paul Blinzer works on a wide variety of Platform System Software architecture projects and specifically on the Heterogeneous System Architecture (HSA) System Software at Advanced Micro Devices, Inc. (AMD) as a Fellow in the System Software group. Living in the Seattle, WA area, during his career he has worked in various roles on system level driver development, system software development, graphics architecture, graphics...

9:00am

The Shared-Nothing approach of "sharing memory by communicating" (instead of "communicating by sharing memory") is gaining more and more traction in the development world; moreover, message-passing Shared-Nothing architectures have always been a cornerstone of both game development and UI development. These days, more and more projects realize the inherent dangers of combining business logic and thread sync within the same piece of code, which leads to cognitive overload (pushing developers well over the 7±2 boundary) and results in poor developer productivity, poor program reliability, and very often subpar performance. In addition, message-passing programs make determinism easy to achieve, which in turn provides very significant benefits, including such beauties as production post-mortem analysis, replay-based regression testing, and low-latency fault tolerance.

Within the realm of message-passing programs, the problem of processing non-void returns from non-blocking calls is a particularly ugly one. Over time, approaches to solving it have progressed from simple message-sending to OO-based callbacks, and further to the lambda pyramids and futures. Still, programming non-blocking calls is a Big Headache(tm). In this talk, we'll discuss _eight_ different ways of handling returns from non-blocking calls in the context of message-passing architectures (using event-driven architectures as an all-popular example of message-passing). We'll start with a simplistic message exchange, and will progress to void RPCs, OO-style callbacks, lambda pyramids, single-threaded futures, lambda-based "code builder", coroutines/fibers, and co_await.

Last but not least, we'll compare these eight ways of handling non-blocking returns from a practical point of view and look at how they relate to current C++ standard proposals; in addition, I'll argue for two important things for standard writers and implementors to keep in mind.
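To show what "processing a non-void return from a non-blocking call" looks like in the callback style, here is a minimal single-threaded sketch. Every name in it (`EventLoop`, `async_square`, `square_twice`) is hypothetical and not taken from the talk; it only illustrates how chaining two such calls already starts a lambda pyramid.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical single-threaded event loop: a queue of pending events.
class EventLoop {
public:
    void post(std::function<void()> event) { events_.push(std::move(event)); }

    void run() {
        while (!events_.empty()) {
            auto event = std::move(events_.front());
            events_.pop();
            event();  // may post further events
        }
    }

private:
    std::queue<std::function<void()>> events_;
};

// Hypothetical non-blocking call: returns at once and delivers its non-void
// result later by posting the callback to the loop.
void async_square(EventLoop& loop, int x, std::function<void(int)> on_result) {
    loop.post([x, on_result = std::move(on_result)] { on_result(x * x); });
}

// Two chained calls: each extra step adds one more level of nesting.
int square_twice(int x) {
    EventLoop loop;
    int result = 0;
    async_square(loop, x, [&](int first) {
        async_square(loop, first, [&](int second) { result = second; });
    });
    loop.run();
    return result;
}
```

Because everything runs on one thread and events are processed in order, the same inputs always produce the same sequence of callbacks, which is the determinism property mentioned above.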

Sergey has 20+ years of software development experience, including 15+ years of experience in architectural positions. Among other things, he was a co-architect of a G20 online stock exchange, and a sole architect of a major online game with 400K+ simultaneous players. He's also...

10:30am

If you were to ask a C++ developer the question "what is execution?" you may get a different answer depending on who you ask. This is because execution means something different to the various users of C++ in areas such as multi-core parallelism, heterogeneity, distributed systems, and networking. Many commonalities can be drawn between these different use cases; however, each also has its own distinct requirements.

Now imagine if C++ could bring together all of these and form a single unified interface for execution, one which would allow a distinct separation of computations from their method of execution. This is the challenge which a C++ committee subgroup has undertaken.

A recent joint effort by a group of interested parties within the C++ committee has been working on a solution which will bring together the requirements of all of these use cases into a single unified interface for execution. This unified interface will provide a generalised way of describing execution that will serve as an abstraction underneath common C++ control structures such as async, task blocks and parallel STL, and above a wide range of resources capable of execution.

This talk takes a subjective look at the story so far; the original papers that paved the way to where we are now, the underlying design philosophy that will come to represent execution in C++, and the current state of the proposal in progress. It will also present the various use cases that influenced the proposal, how their requirements helped shape the design and what challenges are still to be overcome.
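The separation of a computation from its method of execution can be sketched as below. The `execute`/`wait` interface is an assumption made up for this illustration and is not the interface of the proposal; `InlineExecutor`, `ThreadExecutor`, and `squares` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical executor: runs work immediately on the calling thread.
struct InlineExecutor {
    template <typename F>
    void execute(F&& work) { std::forward<F>(work)(); }
    void wait() {}
};

// Hypothetical executor: runs each piece of work on its own thread.
class ThreadExecutor {
public:
    template <typename F>
    void execute(F&& work) { threads_.emplace_back(std::forward<F>(work)); }

    void wait() {
        for (auto& thread : threads_) thread.join();
        threads_.clear();
    }

private:
    std::vector<std::thread> threads_;
};

// The computation is written once; the executor decides how it runs.
template <typename Executor>
std::vector<int> squares(Executor& executor, int count) {
    std::vector<int> out(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        // Each task writes a distinct element, so no locking is needed.
        executor.execute([&out, i] { out[static_cast<std::size_t>(i)] = i * i; });
    }
    executor.wait();
    return out;
}
```

A real unified interface has to cover much more than this (GPUs, thread pools, networking, and the properties each resource guarantees), which is where the requirements of the different use cases come in.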

Gordon Brown is a senior software engineer at Codeplay Software specializing in heterogeneous programming models for C++. He has been involved in the standardization of the Khronos standard SYCL and the development of Codeplay's implementation of the standard from its inception...

Michael Wong is VP of R&D at Codeplay Software. He is a current Director and VP of ISOCPP, and a senior member of the C++ Standards Committee with more than 15 years of experience. He chairs the WG21 SG5 Transactional Memory and SG14 Games Development/Low Latency/Financials C++ groups and is the co-author of a number of C++/OpenMP/Transactional Memory features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. He has published numerous research papers and is the author of a book on C++11. He has been an invited speaker and keynote at numerous conferences. He is currently the editor of SG1 Concurrency TS and SG5 Transactional Memory TS. He is also the Chair of the SYCL standard and all Programming Languages for Standards Council of Canada. Previously, he was CEO of OpenMP involved with taking OpenMP toward Accelerator support and the Technical Strategy Architect responsible for moving IBM's compilers to Clang/LLVM after leading...

11:05am

Facebook has developed tooling to help quickly find and debug several classes of concurrency bugs in Facebook's large C++ codebase. In this talk, we will focus specifically on deadlocks and the tools we use to detect and prevent them. We will explore the various tools we use — some open source tools we have deployed and some we have developed — and how they work by walking through several examples of real-world bugs found by these tools in Facebook's large production systems.

Topics include:
* How we deploy and utilize ThreadSanitizer on Facebook's large codebase
* Linux eBPF tools to detect potential deadlocks on running binaries
* gdb extensions to examine mutex internals to detect deadlocks
* folly::Synchronized and other libraries that make it more difficult to introduce concurrency bugs
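The last topic, libraries that make concurrency bugs harder to write, rests on one idea: tie the data to its mutex so that unlocked access does not compile. The sketch below illustrates that idea with the standard library only. `Guarded` and `with_lock` are hypothetical names and deliberately not folly's API.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical wrapper (not folly's API): the data is private, so the only
// way to reach it is through with_lock(), which always holds the mutex.
template <typename T>
class Guarded {
public:
    explicit Guarded(T value = T()) : value_(std::move(value)) {}

    // Runs f with exclusive access to the value and returns f's result.
    template <typename F>
    auto with_lock(F&& f) {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::forward<F>(f)(value_);
    }

private:
    std::mutex mutex_;
    T value_;
};
```

With a wrapper of this kind, "forgot to take the lock" stops being a possible bug, although a callback can still leak a reference to the value, and lock-ordering deadlocks between several such objects still need tools like the ones listed above.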

Kenny Yu is a software engineer at Facebook. In his time there, he has focused on improving testing and developer experience for engineers at Facebook, working on things such as debugging tools and concurrency bug-finders. He currently works on Facebook's cluster manager and cont...

1:30pm

In High Performance Computing, data access has complex implications and requires concepts that are fundamentally different from those provided in the STL. Iterators as we know them just are not enough. The proposed range concepts for the standard library are a significant improvement but are designed for the mental model of iterating and mapping values, not hierarchical domain decomposition.

Even for a seemingly trivial array there are countless ways to partition and store its elements in distributed memory, and algorithms are required to behave and scale identically for all of them. It also does not help that most applications operate on multidimensional data structures where efficient access to neighborhood regions is crucial. Among HPC developers, it is therefore widely accepted that canonical iteration space and physical memory layout must be specified as separate concepts.

For this, we use views based on multidimensional index sets, inspired by the proposed range concepts.
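The separation of canonical iteration space from physical memory layout can be illustrated on a single node with a small two-dimensional view. This is a sketch with hypothetical names (`RowMajor`, `ColMajor`, `View2D`), not the library presented in the session, and it ignores distribution across processes entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layouts: each maps a logical (row, col) index to an offset.
struct RowMajor {
    static std::size_t offset(std::size_t row, std::size_t col,
                              std::size_t /*rows*/, std::size_t cols) {
        return row * cols + col;
    }
};

struct ColMajor {
    static std::size_t offset(std::size_t row, std::size_t col,
                              std::size_t rows, std::size_t /*cols*/) {
        return col * rows + row;
    }
};

// Hypothetical view: the index space (rows x cols) is the same for every
// layout; the Layout policy decides where each element lives in storage.
template <typename Layout>
class View2D {
public:
    View2D(std::vector<int>& storage, std::size_t rows, std::size_t cols)
        : storage_(storage), rows_(rows), cols_(cols) {}

    int& operator()(std::size_t row, std::size_t col) {
        return storage_[Layout::offset(row, col, rows_, cols_)];
    }

private:
    std::vector<int>& storage_;
    std::size_t rows_;
    std::size_t cols_;
};
```

An algorithm written against `view(row, col)` is unchanged when the layout changes, which is the property that has to survive when the storage is partitioned over many nodes.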

In this session, we will explain the challenges of distributing container elements across thousands of cores and how modern C++ makes portable efficiency achievable. As an HPC aficionado, you know you want this:

[ cpplang.slack.com : @fuchsto ]

Tobi has been a freelancer in embedded systems and real-time applications for over 10 years, mostly for medical devices, and went back to academia for PhD studies in High Performance Computing at LMU Munich.
He is the lead developer of the...