My Ph.D. research was in hardware-software interface design for parallel programs, with particular emphasis on performance portability from the angle of multicore memory hierarchy design.

LLVM is more than just a compiler toolkit, much the way the term "Java" has come to mean many things. The part this concerns is the IR, the intermediate representation, which is the form a program takes as it is progressively translated from a human-readable language (LLVM has many "frontends", so you can write for it in many languages) to machine language (assembly, before being fed to the assembler). The IR design is at the heart of LLVM (well, of any compiler system, really), because LLVM is composed of multiple independent but interacting programs that work together, following the "Unix philosophy" of reusable components. So the rules that determine what the IR should look like have a very big impact not just on LLVM's final performance (however you define it; I'll leave that undefined since it doesn't need to be for this particular point), but on all the tools that work on the IR, whether they strictly read it, write it, or both.

Introducing parallelism to a program is a fundamentally different process from, say, making a single-threaded program faster. It fundamentally changes the algorithm and, more often than not, you basically have to rewrite the whole thing (often with a new algorithm) to actually get good performance. Not every program or algorithm can be effectively parallelized. Finding new ways of parallelizing algorithms IS an active research topic in the CS community.

LLVM's IR does not include any concept of parallelism. This means anything the programmer may know (or could tell just by looking at the code) doesn't get translated into the IR, so the LLVM tools that read that IR are SOL even if they would've been able to compile the code better had this information been available. This work extends the IR to include some of that info.

There is some more background info needed to put this work in a proper light. The group this paper came out of, headed by Charles Leiserson (the L of the CLRS algorithms book), has been working on this topic for a long time, with a startup bearing its name (cilkplus.org) even having been sold to Intel (now part of the Intel C++ compiler). What they are doing is fundamental work that I honestly can't see being done by anyone else, if it's even possible. The hardware-software interface, the "coding" as we know it, is wholly inadequate to express parallelism, because we didn't have to initially. What is increasingly clear to researchers is that we may have irreparably crippled human programmers' mental models by teaching them serial programming first. An easy example is a simple loop that reads every element of a collection. Doing this serially introduces a serial dependency among the elements even if they could be read in parallel just fine. But we always teach programmers how to program serially first, so the more obvious version (just look at every element once, in any order) is simply inexpressible to begin with, and parallelizing the loop has to work against that.

I'm digressing here, but my point is that the idea of fork-join parallelism is not new, nor is their attempt to introduce it as a way to parallelize a vast class of algorithms, combined with a work-stealing scheduler (a runtime system). Their claim is that this vastly simplifies parallel programming (anyone who's written and performance-tuned any will KNOW the benefit of this) - a point I don't entirely agree with, but again, that's a different story.

The value of this paper is that, prior to it, they only had their own custom Cilk Plus compiler (also based on LLVM, but without the IR changes). And that product didn't actually touch the internal representations.

This work presents their latest incarnation, which does this in the LLVM framework and introduces actual changes to the IR to properly carry extra parallelism-related information - information that has always been lost in a traditional (serial) IR.

A few parting thoughts:

"better than any commercial or open-source compiler" - the only comparison was their previous incarnation. This phrasing has very little meaning.

"how does it compare to existing compilers?" - this is a qualitative improvement, not quantitative, so it's very hard to even compare. You're basically asking if this new fruit they invented tastes better than apples when compared. Even if it were a fruit people from other cultures were aware (oranges), they still would not be able to explain to people who have only experienced apples in any satisfactory manner.

"is it really faster? what's the final runtime?" - such numbers may seem useful but are wholly inadequate to explain anything substantial and misleading at best. The paper also clearly indicates and places its own value not just from "improvement" but the fact that this is done at all. Embedding parallelism into compiler IR has been a known hard problem, and this paper shows that it can be done and how they did it. There's certainly value in sharing this work.

Do you have some good reading materials about the nature of serial vs. parallel programming? I think your example of looping through elements in order vs. just reading every element once in any order is really thought-provoking.

As a lazy old comp sci grad with a BS from a mediocre state college, I've had my mind dulled by years of basic shitty business programming, and I'm inspired by your eloquent post to learn more.

Fork-join can sort of be thought of as a parallel for (or foreach) loop. How it's implemented in code is more complicated, but basically any work that is not dependent on other work can be done in parallel. For instance if you're implementing merge sort, it looks something like:
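(Sketching it in C++, with std::async standing in for the fork and .get() for the join - Cilk's cilk_spawn/cilk_sync express the same thing more directly, and a real implementation would stop spawning below some grain size rather than forking on every call:)

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Fork-join merge sort sketch: the two recursive calls touch disjoint halves,
// so they can be "forked" to run in parallel; the "join" waits for both.
void merge_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;
    std::size_t mid = lo + (hi - lo) / 2;

    // Fork: sort the left half, possibly on another thread.
    auto left = std::async(std::launch::async, [&] { merge_sort(v, lo, mid); });
    merge_sort(v, mid, hi);   // sort the right half here in the meantime

    left.get();               // Join: block until the left half is done.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

int main() {
    std::vector<int> v{5, 3, 8, 1, 9, 2};
    merge_sort(v, 0, v.size());
}
```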

Now the magic scheduler can run the two mergesorts in any order or in parallel, and will block until both are finished, regardless of which one finishes first. And once each half is done, they will be merged.

Similarly, all the functional programming models (map, filter, etc.) are easy to represent in this model. Map creates a (virtual) task for each element in the source list, converting it into the corresponding element of the destination list. You don't really care what order they get written in, just that by the time execution continues on the next line the destination is filled out.

There's a wiki page on OpenMP which has information on the model as well. The Task Parallel Library is what I'm familiar with from .NET, but it's pretty similar.

You typically have to invoke special libraries (OpenMP, MPI, etc.) to compile parallel code. With MPI, you even have to run compiled programs through a wrapper executable that handles the actual message passing between nodes.

They're bad even at simple parallelization they should be doing, like SIMD.

Even if you write C++ that hints that the code should get SIMD'ed, slap on pragmas to direct it to SIMD aggressively, and compile with aggressive optimization settings, looking at the assembly output is often disheartening (and you end up having to code it at a low level yourself).

I remember a few months ago I had a critical loop in an algorithm that was effectively just an FMA operation over two STL vectors of float32s, and no C++ compiler, on any setting (even ICC with the #pragma simd one), would compile the loop "correctly". Same with SIMD addition of two STL vectors of floats.
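To give a rough idea of what I mean (a hypothetical reconstruction, not my actual code - assumes an x86 machine with AVX2/FMA support, compiled with something like -mavx2 -mfma):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>

// What you'd hope the compiler turns into vfmadd instructions on its own:
// acc[i] += a[i] * b[i]
void fma_scalar(std::vector<float>& acc,
                const std::vector<float>& a,
                const std::vector<float>& b) {
    for (std::size_t i = 0; i < acc.size(); ++i)
        acc[i] += a[i] * b[i];
}

// What you end up writing by hand: AVX2/FMA intrinsics, 8 floats at a time,
// plus a scalar tail for the leftover elements.
void fma_avx2(std::vector<float>& acc,
              const std::vector<float>& a,
              const std::vector<float>& b) {
    std::size_t i = 0, n = acc.size();
    for (; i + 8 <= n; i += 8) {
        __m256 va   = _mm256_loadu_ps(&a[i]);
        __m256 vb   = _mm256_loadu_ps(&b[i]);
        __m256 vacc = _mm256_loadu_ps(&acc[i]);
        _mm256_storeu_ps(&acc[i], _mm256_fmadd_ps(va, vb, vacc)); // va*vb + vacc
    }
    for (; i < n; ++i)  // remainder
        acc[i] += a[i] * b[i];
}
```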

I'm not blaming them; like I said, compiler writing must be incredibly hard, but the adage that "the compiler is smarter than you" is simply not true if you need true low-level performance.

I went to the University of Kent where Occam-Pi is developed. Learned the language on their concurrency course.

It's a lovely language. A lot of the stuff I hear about from Go, Occam-Pi already does ... and more. Tbh, skimming over the article, it feels like I did all that already in my second-year course with Occam's PAR and SEQ blocks.

Sadly its lack of maturity is a real problem. I'd love to crack it out, but it's very much a language designed for producing research papers. Last time I checked they were rebuilding the compiler again, for at least the 4th time. Plus, whilst the concurrency model feels great, the rest of the language still feels like a language from the 80s (especially with its ALL CAPS syntax).

But I've always felt it would be awesome if a major company picked up the language and developed it for 5 years. Go was influenced by CSP, but it's just not the same.

Start programming something in Node.js, preferably something IO-bound (files, databases, networking, ...). The for-loop example would do just about exactly this! "For each file in a directory" would translate to parallel reading in Node. When your data flows through callbacks and promises, instead of return values, you are on the right track.

If you want to get more to the core of it all, try Erlang. It introduces the actor model, which is simple to understand yet very powerful in expressing parallelism.

If you want pure parallelism, choose Erlang. If you want to get your hands dirty and build something, choose Javascript/Node.js.

Why are you assuming on other people's behalf that they can only find usefulness in CPU-level parallelism, rather than being open-minded enough to think that for quite a few people parallel IO is already interesting, and an easy way to get their feet wet on top of that, so to speak?

It might be off-topic to the original article, but to the guy asking to get acquainted with parallelism I think it's a fair suggestion.

No, that's wrong. You can parallelize serial code that's not I/O bound and all of a sudden have an I/O bottleneck. It's an important consideration with any parallel program, even if it doesn't end up being relevant.

I am on mobile and just skimmed the article. I got the chance to attend Supercomputing 2016, where Intel presented a set of LLVM IR extension proposals for SIMD and parallelism. It seemed well received. Is this work a similar proposal of IR extensions? Do you happen to know how this work compares to Intel's? I wonder if these researchers plan to push to have their work adopted by the larger community as well. I would like to look into this more in the morning, but would appreciate any thoughts you have.

This is actually a much bigger thing than the topic of this thread. LLVM IR has a rather unfortunate flaw - there is no decent way to represent any pragmas at all, OpenMP or not, making it really hard to postpone pragma processing until after a few LLVM passes.

People are desperate, everywhere, not just at Intel - and they do stupid things out of desperation, like introducing special intrinsic calls to encode the pragmas, or, even worse in some cases, using special opaque type naming conventions (up until opaque types were eradicated); see the first SPIR versions.

I wish the LLVM community were less hostile to changes like this, but it is unlikely we'll ever see a decent pragma or directive encoding there.

What would a generic annotation representation in IR (I assume this is what you mean by pragmas) look like?

The problem is that if annotations are allowed to change semantics (which they do, otherwise they're kind of useless) in a way that's not known to the optimizer passes, then they have to be a strong optimization barrier. So if you want the upstream optimizer pipeline to treat a specific annotation as something other than an optimization barrier, you have to teach every transformation about that specific annotation. Which means every new "type" of annotation carries roughly the same level of technical debt as a new IR instruction.

You can read the discussion around the "IR regions" proposal that's happening right now on llvm-dev for the details.

Yes, I agree, it hurts. It's just that it's a really hard problem to solve within the context of upstream LLVM.

My preferred approach, though, is not regions but sticky metadata (unlike the current metadata, which can be discarded by passes) attached to instructions along with the sticking rules. Then passes do not have to know about individual annotations, only how to execute a fixed set of rules.

LLVM's IR does not include any concept of parallelism. This means anything the programmer may know (or could tell just by looking at the code) doesn't get translated into the IR, so the LLVM tools that read that IR are SOL even if they would've been able to compile the code better had this information been available. This work extends the IR to include some of that info.

So, their claim is that modifying LLVM to better anticipate parallel code yields faster compiled code than an unmodified LLVM that isn't designed to handle parallel code intelligently? Because that sounds... obvious.

But it sounds like they're comparing their modified LLVM to other compilers that don't use LLVM. The impressive point is:

the compiler “now optimizes parallel code better than any commercial or open-source compiler, and it also compiles where some of these other compilers don’t.”

That claim isn't in the paper or this draft, so it just seems to have come from the press release. So.... ¯\_(ツ)_/¯

edit: wow, these bots are really annoying. i figured out how to fix the shrug face before even seeing their replies... but thanks, i guess. In fact, they're not even right, it needs more backslashes than they claim. Probably because of the other ('s from the links on the same line.

The hardware-software interface, the "coding" as we know it, is wholly inadequate to express parallelism, because we didn't have to initially. What is increasingly clear to researchers is that we may have irreparably crippled human programmers' mental models by teaching them serial programming first. An easy example is a simple loop that reads every element of a collection. Doing this serially introduces a serial dependency among the elements even if they could be read in parallel just fine. But we always teach programmers how to program serially first, so the more obvious version (just look at every element once, in any order) is simply inexpressible to begin with, and parallelizing the loop has to work against that.

I believe the map function over a collection expresses parallelism. Especially in a pure functional language, where you know the function has no side effects and may be parallelized.

Of course, efficient parallelization is way more complicated than that, but I'm making a counterpoint to "coding as we know it, is wholly inadequate to express parallelism" and "the more obvious is simply inexpressible".
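For what it's worth, even C++ can express this now - a minimal sketch using the C++17 parallel algorithms (assumes a standard library with execution-policy support, e.g. libstdc++ built against TBB):

```cpp
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
    std::vector<double> in(1'000'000, 2.0), out(in.size());

    // A "parallel map": the lambda has no side effects and no element depends
    // on any other, so the library is free to run it across many threads.
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](double x) { return std::sqrt(x) * 3.0; });
}
```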

map is very limited. It implies there is no dependence between the elements of the tuple/list it is applied to. We already have such a construct in most parallel programming languages (although to be fair, they tend to be less powerful, as they're often implemented as forall types of loops).

There is work that shows how to generate parallel executions even when using map-like constructs with possible dependences, but it is far from trivial, and it still has limits as to how/when it can be applied with an improvement in performance.

This. The challenge is that many practical problems have kinda complex dependencies. You can go a bit further with primitives like scan/fold, but some problems aren't so simple (think about sorting algorithms, for a simple example). You really need a way to express arbitrary data dependencies, such as a dataflow graph.

If you're really unlucky and are coding something like numerical integration, you'll actually find that the parallel execution strategy affects the accuracy of results (floating-point arithmetic being neither fully commutative nor distributive!), in which case the programmer actually needs to retain a lot of control over exactly how the program is parallelized.

Right, sorry, I mixed that up. It is commutative in a mathematical sense, but because it isn't associative, the order of operations matters if you're, for example computing a sum of many elements.
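A tiny illustration of the order sensitivity (my own example, nothing to do with the paper):

```cpp
#include <cstdio>

// Floating-point addition is not associative, so the grouping that a parallel
// reduction happens to use can change the result.
int main() {
    float a = 1e20f, b = -1e20f, c = 1.0f;

    float left  = (a + b) + c;  // (1e20 - 1e20) + 1  ->  1
    float right = a + (b + c);  // -1e20 + 1 rounds back to -1e20, so this is 0

    std::printf("%g vs %g\n", left, right);  // prints "1 vs 0"
}
```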

The lack of associativity is a big problem though, particularly when using atomics (often important for good scaling on massively parallel architectures) because the order of operations isn't even necessarily deterministic.

I am very aware. The software/language counterpart of my hardware work was just that kind of language, where parallel constructs are baked into the language syntax and their semantics are understood by the compiler/runtime. This is still a very active research field, but the field as a whole has somewhat shifted its focus.

I guess everyone got enamoured with the polyhedral model, believing that explicit parallelism is not that necessary any longer. At least that's what it looks like judging by the volume of papers being published on it.

What is increasingly clear to researchers is that we may have irreparably crippled human programmers' mental models by teaching them serial programming first.

But computers do computation in a serial manner, not a parallel one. Even multicore/multiprocessor systems are essentially a bunch of tiny sub-computers doing stuff serially that happen to be able to access the same resources, much in the same way that you can have multiple computers on a network trying to access the same files in a shared folder. The hard part is figuring out how to take advantage of having many little computers, and how to do it in a way that they don't trample on each other while doing so - but each computer is still executing serial code.

Nobody uses "parallel program" to mean that every instruction happens at once. The problem GP points out is that we do teach beginners a mental model where every instruction happens in order, one at a time, and we do make them use languages that can only describe that mode of execution. All the while, the physical machine we make them use does not work that way -- even two assembly instructions placed in sequence may be executing simultaneously. The "hard part" is made much harder than it needs to be by treating parallelism as an esoteric topic to maybe show advanced students in an elective course.

The problem GP points out is that we do teach beginners a mental model where every instruction happens in order, one at a time, and we do make them use languages that can only describe that mode of execution.

But as i said, this is how computers work, so why not teach beginners that, when the machines and the vast majority of the languages they'll encounter work in sequence?

All the while, the physical machine we make them use does not work that way -- even two assembly instructions placed in sequence may be executing simultaneously.

This is not how the machine is programmed though - the assembly instructions executing simultaneously is an implementation optimization and as far as their semantics go, they happen in sequence. The only parallelization that exists in computers (with a potential exclusion of some research systems i am not aware of) that programmers have access to is through writing serial code for multicore/multiprocessor systems.

Everything the vast majority of programmers will encounter is done in sequence because this is how our systems work - including when doing parallelization. So it makes perfect sense for beginners to start with serial code because that is what they will actually deal with when writing programs.

The problem with parallelization isn't the serial code. It is managing access to shared resources and writing algorithms that can be split into multiple pieces so that they will run on the available cores/processors/computers/clusters/whatever - but those pieces will still be written in code that runs serially, so becoming familiar - and confident - with serial code is a prerequisite for doing so.

It is basically learning the alphabet before learning how to write essays.

Maybe i should have written "this is how computers are programmed". Well, i actually wrote that in the next paragraph. And yes, as i said, there are some CPUs that can run instructions in parallel, but this is an implementation optimization the CPUs can do; as far as their semantics go, they happen in sequence.

Of course if you are writing performance critical code you need to know these details, but this is not something that makes sense for beginners to deal with - which is what the top post was talking about.

No, that's your model of it. And it was a great model, until the world changed and the need for parallel programming arose. But it's far from the objective reality of today; even the fundamental hardware is different, even seen from just a single thread of execution. The compiler won't care much about your order either, as long as changing it won't change the results of the data flow.

But yes, that is the point! The point is that the semantics are exactly the problem with how it's taught, and the why behind the slow progress of natural multi-core programming.

this is not something that makes sense for beginners to deal with

The argument is that by not ruining beginners' minds with the sequential mindset, teaching parallel programming - as well as eventual progress on parallel programming language and runtime design as students turn into masters - won't be hampered in the first place!

Kind of like how you learnt your first language as a wee kid without even thinking about it, but learning your second and third probably was a total cunt of an experience. Because you had all these models, models that no longer apply, that have to be unlearned.

Haskell would probably be a perfect teaching language, since it kind of has no order except data flow and the monad (like the reality of your code, after it's gone through the compiler and CPU). I'd hate to be a student in that case, though :P

This isn't just "my model", it is what happens and how CPUs are programmed. All CPUs (that i know of anyway) run assembly in sequence (or behave as if this is how it is done). Individual instruction may do some things in parallel, but the semantics are serial: the code behaves as if executed in order, even if as an optimization the CPU can execute things out of order.

The point is that the semantics are exactly the problem with how it's taught, and the why behind the slow progress of natural multi-core programming.

The thread so far (from where i started replying, anyway) wasn't arguing about the semantics themselves but about what beginners are taught. My point is that beginners are taught that, and it makes perfect sense for them to be taught that, because computers (including all sorts of computers here, not just x86 but ARM, PowerPC and even microcontroller stuff) work and are programmed like that (doing some stuff in parallel is an optimization, and the assembly semantics are still sequential).

If the way computers work and are programmed changes in the future, then this part about beginners will need to change. But this isn't how things are now.

The argument is that by not ruining beginners' minds with the sequential mindset, teaching parallel programming - as well as eventual progress on parallel programming language and runtime design as students turn into masters - won't be hampered in the first place!

That would make sense in a world where computers worked and were programmed like that, though, and it might happen in the future. But the vast majority of computers today work (or appear to work) and are programmed using serial code, so starting beginners off with parallel code is teaching them about something they most likely won't use.

I mean, look at it from the opposite side: if you teach students about parallel code then you ruin their sequential thought ability (which is something a lot of people already find difficult), which is arguably more important considering it is how things work and are programmed in the computers they will use. And even parallelism these days is always exposed as a composition/layering of serial code (at least at a low level), not true parallel code.

Don't get me wrong, i don't say there is no merit in learning about parallel programming early. But i think it is still something that ought to be taught as a more advanced concept to reflect how computers work and behave in the real world.

(of course, all the above assumes we're talking about beginner programmers, not about teaching computer science, which for many people is more of a theoretical math discipline - but personally i don't know enough about that aspect of CS to judge)

Right, SIMD requires a shift in perspective, but it is still sequential. SPMD is a bit more complex: you cannot guarantee any particular order of accessing a shared resource, even when the mental model is "same control flow, different data".

I wonder if anyone tried to teach HDLs prior to any programming languages. It can be an interesting experiment.

Yes, but i wasn't talking about accessing shared resources - i already said that this is the hard part. I was talking about the code that runs there and accessing those shared resources is also handled via sequential code.

SQL is written declaratively and can give access to parallelism. But it's certainly an edge case in this context.

The issue is not that you can't give users access to parallel computation without them writing sequential code. It's that more often than not the code generated behind the scenes by these kinds of systems will suck.

In order to get good code out one has to specify enough things that explicit parallelism isn't much more difficult.

I'm talking about how the computers (CPUs) are programmed, not what you can build on top of them. SQL is a higher-level language; CPUs aren't programmed in SQL or anything remotely similar. SQL is implemented as an interpreter written in serial code, or as a compiler that emits serial code for the CPU to run.

They are not programmed in electric signals; you aren't wiring signals to make a program on any CPU. The lowest you can go is assembly language (or the machine code it represents); anything below (microcode, etc.) is an implementation detail for the CPU manufacturer and (usually) not available to the programmer, whereas anything above is built on the assembly/machine code that the CPU is programmed with.

So it makes perfect sense to use that as the model for how a computer is programmed and works.

I really don't mean for this to be taken the wrong way, but this already took me a couple of hours to bring it down for general consumption. ELI5 or ELI15 of my Ph.D. research in a space of a reddit comment is outside of what I am capable of, I'm afraid. If you would like to ask specific questions, I am more than willing to answer.

A primary difference between serial and parallel programming is the fact that in serial programming, you only need to think about one "actor" (thread) taking action at a time, while in parallel programming, you have to assume there's more than one. This fundamentally changes the mental model, since now you have to think about a variable having its value changed by more than one thread.

A variable being updated by more than one thread in an uncoordinated way is called a "data race", and it is the bane of parallel programming, the root of all evil, if you will. Ultimately, it is this data race that makes parallel programming hard, which is one of the reasons why serial programming is easier and has been taught (first).
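For a concrete picture, here's about the smallest data race you can write in C++ (just an illustration, not something from the paper):

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared, not atomic, no lock -- that's the data race

void work() {
    for (int i = 0; i < 1'000'000; ++i)
        ++counter;  // this read-modify-write interleaves with the other thread
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // Formally undefined behavior; in practice this usually prints something
    // well short of the 2000000 you'd expect.
    std::cout << counter << "\n";
}
```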

I did say "cripple" but I didn't mean to say we should've taught programmers how to program in parallel first. If anything, had there not been the kind of industry success like CPU makers saw in the 90s, parallel programming would've taken root a lot sooner. It's just that for over a decade, the industry improved in a way where the hardware industry basically pulled the rest with sheer computing horsepower (with more transistors), so people didn't have to rewrite software to get good performance (new chip, old software, just snappier).

That stopped in ~2004, and that's when we started seeing multicore chips, because the chip makers couldn't figure out any other way to use more transistors. This in effect drove the "complexity" back into the programming model, which forced even general programmers to revisit the parallel programming paradigm (HPC and other fields have always used parallel programming models).

When I was doing my comp sci undergrad two of the PhD students (I lived with one of them) worked on different projects in the same field, and their supervisor asked them to collaborate on a library. They both criticised each others' code.

I think the supervisor's idea was they'd open-source this library and they'd both use it for producing the projects they were working on. I remember them telling me about this, and the first guy going "but his code is terrible!" and the other one giving him this side-eye look, "um, your code is terrible" and then it was just accusations tennis from there on in, with no specific examples.

You do realize LLVM was created thanks to NSF funding, and is the product of an academic work, right?

... although to be fair, it was clearly funded as part of an "infrastructure for academia" kind of grant: gcc was terrible because the front and back ends were artificially coupled for ideological purposes (which were perfectly valid in the 80s and 90s, but not so much in the 00s), and that made it really more complicated than it should be to add a transformation pass to the compiler. As for the other compilers, many simply were not open source and as a result couldn't be used by academics who weren't part of the research group that created them.

Agreed, but Apple arrived after a solid code base had already been built, with multiple publications explaining how they used state-of-the-art techniques in both compilation and software engineering to make LLVM what it is. That, and gcc switched to GPL 3 starting with v4.3, I believe.

Yup. LLVM had a really solid start and good code (and documentation, which GCC still doesn't really have!), which is probably why Apple went with it instead of forking the old version of GCC.

Apple's contributions helped make the code really mature and full-featured, though, and prompted a bunch of other companies to jump ship as well. It went pretty rapidly from a promising research project to an industry workhorse.

The problem is that software engineers don't know the science well enough to quickly code what's needed, but the academics that do get it don't necessarily know how to code well. Also hiring software engineers to help the academics gets expensive. So yeah, you end up with academics writing these huge analysis/visualisation programs with no idea about how to structure them well.

In my experience, it's everything from high level things like bad class/function structure to minor things like bad variable naming, and all the classic "beginner C in C++" mistakes (like returning a pointer to a local array from a function). But they're doing this not on newbie exercises, but on HUGE data processing toolkits. It's shit code on shit code on shit code. And they don't have an interest in learning to write software properly, they just learn barely enough to do what they want and stop there.

Scientists have deadlines too! Journals and conferences happen at most once a year, and good venues are hard to get into (usually, around 20% of the submissions get accepted).
Now here's the problem: once you're done with this specific part of the research, you have to move on to the next part. There is never any kind of down time where you can say "I'll take a week and a half to refactor and clean up my code", because you're already being assigned new work toward your next publication.

If your whole dissertation relies on a single piece of software then yes, you can have quality code (LLVM is a good example, but i have others in mind). Otherwise, you keep hoping you'll have some time to clean up your code and maybe even open source it, but it only happens once in a while, if at all.

Well-written C++ and well-written Fortran will probably perform about the same given equivalent compilers.

C++ also makes it a bit easier to do clever optimizations. It's just that doing so requires quite a bit of in-depth computer architecture knowledge, which scientists often don't have the time or interest to acquire.

I will tell you as a computer engineer who writes software for scientists that I wish Fortran had better standardized foreign-function interfaces so I could write code that integrates with it. At least Fortran 2015 improves the situation, if someone would actually implement support for it!

Better at what? Register allocation and auto-vectorization? God no, that's simply not what this paper is about. Which is why they do that "weird" normalization of the results: to try to single out the one factor they are working on.

There have been several attempts to push parallel constructs into the LLVM intermediate representation, and there are, in fact, a couple such efforts ongoing now.

There are two major reasons this hasn't happened so far:

1) It's really scary. The press release has this quote:

“Everybody said it was going to be too hard, that you’d have to change the whole compiler. And these guys,” he says, referring to Tao B. Schardl, a postdoc in Leiserson’s group, and William S. Moses, an undergraduate double major in electrical engineering and computer science and physics, “basically showed that conventional wisdom to be flat-out wrong. The big surprise was that this didn’t require rewriting the 80-plus compiler passes that do either analysis or optimization. T.B. and Billy did it by modifying 6,000 lines of a 4-million-line code base.”

The problem is that this sort of modification can potentially introduce extremely subtle bugs in corner cases, where serial assumptions unexpectedly get broken. This is fine if all you're trying to do is a proof of concept for a paper. It's more problematic for a production-quality compiler. The effort required to actually audit the compiler and make sure nothing breaks is immense.

2) Nobody really knows the right way to do it. Making parallelism a first-class citizen in the IR means you need to nail down a specific model (e.g. fork-join). This is fine if all you care about is supporting one front-end - say, Cilk, or OpenMP. Maybe several similar frontends. It's less appealing if you want to be able to support different languages with different concepts of parallelism. The other option is to try to define a flexible IR model, that still has clear enough semantics that optimization passes can reason about. This, unsurprisingly, turned out to be rather hard.

This is exactly the reason I'm so enthusiastic about the idea of an abstract SSA IR. Keep as many of the low-level details out of the IR as possible and implement most of the passes at an abstract level - this way you do not depend on any assumptions, and you can plug the entire wealth of existing SSA-based passes into a new, higher-level IR.

Could they not just put the time each program took to execute? How fucking hard is that? Oh no, we're going to put the time it takes to execute in serial divided by the time it takes to run the parallel code on one processor. They get results below one. By that metric I can do better: just don't fucking do anything. I mean, come the fuck on. Comparisons between different compilers are pretty useless as well; as the authors point out, if they produce fast code the overhead will be a bigger proportion. Eh, okay.

Oh, you say, but they did provide numbers in the end. Right. On the last page, see. Comparing against a reference compiler. Hmm, let's see what the reference compiler is then... It's their own fucking thing. Come the fuck on, they're testing two of their different implementations against each other.

Now I suspect that this might actually not be complete mumbo jumbo, and I tried to find anything supporting that theory. But I've got nothing. So you know. That was totally worth spending an hour on.

They aren't trying to sell the implementation here. They are trying to explore the benefits of a particular strategy, and in order to keep all the other variables the same they needed to use a modified compiler as the Reference.

What they were measuring is how much benefit one can get by having the compiler possess intrinsic knowledge of parallelization primitives. They are not trying to claim "muhaha we made a compiler that is faster than your pitiful commercial compilers".

The research is interesting as well. The optimizations that they are doing automatically are ones that you should be aware of anyways if you are doing concurrent programming. That is, don't spin off threads for small tasks. Remove unnecessary synchronization, etc.

I think they probably could do more in languages with concurrency primitives. The fact that they target OpenMP style concurrency speaks to the fact that the story for concurrency with vanilla C/C++ kind of sucks right now.

Because thread-based concurrency is abhorrent. It's extremely difficult to write thread-safe code at scale, it can be just as easy to hurt performance as it is to gain any, & there are much better alternatives these days. std::thread was a good bare minimum 10 years ago. These days, task-based & message-based concurrency is easier to write correctly and ensures better utilization of multiple cores.

There's no thread pool nor multithreading data structures built into the STL, meaning you have to roll your own every time, which is error-prone & probably doesn't perform well. Also no lock-free/wait-free data structures & algorithms.

C++ does make a strong attempt at providing some core data structures & algorithms in the STL. However, it has nothing like this in the STL for multi-threading/parallel data structures & algorithms, although there is some progress to add them. It provides no primitives for IPC which makes cross-platform code more challenging. 3P libraries are fine but the same thing could have been said of C++ before it got std::thread & atomics since people managed to write multi-threaded code before C++11. Standardization commoditizes the work and provides a high-quality reference implementation the majority of developers can use correctly with minimal effort so there's very real value over 3P libraries.

Your argument also presumes that this can be addressed by libraries alone rather than altering the language to allow for friendlier syntax. For example, OpenMP makes it possible to annotate code to be auto-parallelized by the compiler by (ab)using pre-processor pragmas but that's not necessarily something a standard conforming compiler supports; case-in-point Clang until very recently. C++ doesn't provide for any equivalent standardized facility. Another example is vendor-specific ways of annotating thread-based code for lock-checking at compile-time & yet there's no standard way of expressing this.

I did mean to say "yet". That being say, there are no multi-threading data structures in the STL for C++17 AFAIK as that's in a parallelism TS that is still a WIP. The only enhancement C++17 is brining in in this space AFAIK is std::for_each can now take an execution policy object which is an algorithm tweak & a minor one at that. Don't get me wrong. It's great but it's a baby step compared to what is actually out there in this space. For example, Rust, which is a baby compared to C++, has very robust multi-threading capabilities baked into the language & standard library. So does Go. Even Java has better multi-threading mechanisms baked into the JDK.

I am not a C++ programmer, so I can't take a position with or against yours, but I'm just going to point out that having some sort of concurrency in a language's standard lib does not necessarily mean that that concurrency is well implemented, sufficient for all tasks, efficient in terms of lines of code compared to other languages/frameworks, etc.

Yup, in fact it is really bare bones when you get right down to it. C++ doesn't really provide almost anything beyond locks and threads which is a far cry from the useful concurrent data structures that exist.

It is a little sad that C++ lags so far behind here. Languages like Java have had what C++ has now from the start and in 2005 got what C++ needs (concurrent data structures and task based logic). The story continues to get better with recent releases.

C++, on the other hand, got threads in 2011... and... nothing. 2014 rolled around, nothing. 2017 rolled around, nothing. Don't get me wrong, the 2011 release was awesome, but man, the standards committees for C++ have brought very little to the language for the last 6 years.

In the current world where everything has multiple processors, it is crazy not to have higher support for concurrency.

It's at the library level, vs. languages like OpenMP which provide parallel constructs at the language level. In the latter case, the compiler itself can integrate the notion of parallelism and apply transformations/optimizations to the program, whereas in the former case the compiler cannot do much, and all relies on the programmer.

Modern languages have support for concurrency in the type system. Functional languages do parallelism automatically. Basically, older imperative languages still resemble assembly language and assume the programmer knows everything.

Haskell is pretty close. I'm not entirely sure if there are fully automated solutions now, but it's surely possible in at least domain specific applications, such as database languages. Graphics languages obviously do some abstracted parallelism.
I don't really see a drawback though if all you are adding is an annotation. I think the bulk of the convenience is having the same interface for both parallel and sequential functions. I think spinning off threads automatically may not be feasible since the compiler doesn't know if it should take up threads that could be used by other programs.

It is a very trivial and uninteresting form of parallelism, equivalent to what strict aliasing does - just assuming that nothing ever aliases and that read/write order never matters. You need far more than this for efficient concurrent programming.

Languages like Rust have typeclasses and memory management semantics that (theoretically) control and verify safe shared state in concurrent programs. In Rust, a typical race condition will likely not compile due to violating memory and type rules.

You need primitives that say that you do not care in which order the blocks of code are executed, and, even worse - that you do not care about a read/write order for a certain memory location. C++ does not have such primitives.

That actually does make sense, I guess I should have toned it down a notch or two. I do still think that comparing to a completely unmodified version would be interesting. Not providing more information probably does not warrant my rant. Pushing research forward is always good; the takeaway from this paper should be that giving the optimizer knowledge of parallelization primitives is most likely beneficial, which is not very controversial.

Your complaint should be directed towards the difficulty of measuring computer program performance in general, which is known to be very hard and essentially unsolvable.

The book Measuring Computer Performance: A Practitioner's Guide is a great introduction to the topic. Long story short, there is simply no way to say "this program is faster than that program" that is universally applicable. You can imagine a hypothetical stripped-down machine (the PRAM model) and talk about performance on it, which would be completely unrelated to real-world performance, or stick to a particular piece of hardware you run on so that you can compare.

Simply saying "this program runs faster than that" with no qualification right then and there IS the err. There is absolutely no way to do this - just look at the microarchitectural differences in CPU even within the same model, not to mention generation. Cache sizes, etc, all matters, so does the algorithm.

If you actually see a paper or a (fucking) piece of science journalism that claims this algorithm/code runs faster than the other, that alone should give you pause.

This is one of the main ways SO MANY software shops waste time/energy. If I had a dollar for every time I heard a C++ programmer claim "I can do better than the STL," I wouldn't be working. Intel Threading Building Blocks is another one. These are libraries that are optimized for a wide range of hardware. YOU WOULD BE A FUCKING IDIOT IF YOU COULDN'T OUTPERFORM THEM ON ONE PARTICULAR SETUP.

Of course you can't claim this runs faster on every machine ever. But you can claim that on the thirty different machines it was tested on, with these specs, we saw an average improvement of x. That wasn't my complaint though; they did publish numbers from running their benchmarks ten times and taking the minimum on one computer. My complaint, warranted or not, was that they didn't provide a reference benchmark from an unmodified compiler.

And as an aside on the STL, another reason why it's so easy to outperform the STL is that there's a lot of stuff in the standard that puts restrictions on how it's implemented. It isn't hard to write a hashtable that's faster than unordered_map, for example, because the standard specifies that the default max load factor needs to be 1, essentially mandating that the hashtable be chained. Can't do small-size optimizations on vectors for similar reasons, etc. etc.

OK, I'll bite: Why is the particular restriction mentioned in the GP required? It isn't clear to me that it's much more than red tape (or, more explicitly, guaranteeing particular performance characteristics even if they're not optimal or necessary for correctness).

The original proposal includes the reasoning for making them use chaining. If they were being added in C++17 it'd hopefully be a different story because of new features like std::variant that work around the issues mentioned (or more often, the objections are really non issues). In the end, I think it's really a case of the people who wrote the proposal being most comfortable with one type of hash table.

I should try writing a compatible version using some sort of open addressing... be a fun project.

Of course the only fair comparison is against the same compiler but without the modifications. Otherwise there is no way to measure the impact of a particular optimisation - everything else must stay the same.

First of all, you have to use Cilk to use this; the parallelism is explicit. They've gained a little bit of performance by adding fork/join to the LLVM IR. Long story short, many loop-oriented optimizations don't work for parallelized loops, because optimizations only work on sequential code; they can't look across threads. For example, afaik there isn't a general optimizer that can move loop-invariant code outside of a parallel for loop.
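To make the loop-invariant example concrete (my own sketch, not one from the paper - it assumes the traditional OpenMP-style lowering where the loop body gets outlined into a runtime call):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// In the serial version of this loop, the compiler is free to hoist
// std::sqrt(s) out of the loop, since it is loop-invariant. Once the body is
// outlined into a runtime call on behalf of the pragma (the traditional way
// OpenMP/Cilk front ends lowered parallel loops), the optimizer can no longer
// see across that call boundary, and the invariant work stays inside the loop.
void scale(std::vector<double>& v, double s) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        v[i] *= std::sqrt(s);  // loop-invariant computation
}
```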

The problem that they are trying to solve is real, but I think their approach is wrong, I don't want the LLVM IR to be polluted by this, there must be a more general way to solve this problem.

I need to do more than skim through the paper. Intel published a paper more than 10 years ago describing how they annotated their IR to account for OpenMP constructs. It's clearly not as detailed as this paper, but there seem to be a few common points.

I disagree with you, by the way: using the front end to build a "parallel AST" is great, because that's where you can do a lot of semantic analysis, but you simply can't avoid adding metadata in the middle end/IR if you want to be able to perform certain kinds of control and data flow analysis and eventually apply some optimizing transformations.

This is a rather hot topic, both at the theoretical level ("how do I propose an SSA notation with explicitly added parallelism?") and at the practical level ("how do I implement that in an actual compiler?"). I know IBM has done a lot of work on their (commercial) compiler in that direction. They do publish stuff, but they don't advertise it as much.

Yes but in this case you're tied to a specific execution strategy, fork-join with work stealing.

How about optimizing the chaining of futures/promises/observables across threads, or goroutine-like continuations, or parallel processing of large vectors? I agree that LLVM needs to do analysis across executing threads in certain cases, but fork-join as described in this paper is too specific.

You can also debate the place in the optimization pipeline; if there is no general solution, it would be best to run these optimizations before the regular LLVM IR is created.

All universities have been doing this for a long while. Dynamic programming, artificial intelligence, machine learning - the names of whole fields are not much more than publicity stunts (probably not the right word) that researchers came up with in the 50s.

The way I understand this article, the application of their system to LLVM was just as a demo. They could easily do the same thing to any compiler they wanted. It also implies that the actual code they injected wasn't that complex.

Writing a one-off compiler for a research paper that just happens to perform better on your hand-selected test cases is one thing. But writing a real compiler, used by millions of developers around the world, that performs better in every possible case, or even the majority of cases, is something else. When this result is reproduced by at least 5 other researchers then you can say it is something, but most research papers don't go anywhere, because they just forced the results to fit what they wanted in order to have a paper to write and get more grant money.

Sure, but it's not mainstream and it's general purpose. In the near future, I think domain-specific languages are the way to go. Halide, for example, is very functional. Managing global state with concurrency is strictly more difficult, so I can't see imperative languages surviving in the parallelism domain. My point was that imperative languages are at a disadvantage compared to other approaches as far as potential goes.