and if so, is Perl making any effort (maybe through Perl 6) in that direction?

Yes. Perl 6 is specified with concurrency in mind, there are some constructs that can be automatically parallelized, and many features are geared to make that painless. Sadly, only Pugs implements some parallelism so far, and only very basic constructs.

Will the functional programming languages become more relevant with the multi-core revolution?

Especially pure functional languages like Haskell. I can imagine it's much easier for the compiler to determine interdependencies there than in normal, imperative languages. But it remains to be seen how much that will matter in practice; I believe that both language families still have much room to improve with respect to concurrency.

Will it be worth it to start my project today in one of those concurrent programming languages?

That really depends on how easy or hard they are to parallelize. I'm currently doing some simulations that are embarrassingly simple to parallelize - I always have to cover a large parameter space, so I just start one process per parameter, let it run for an hour and collect the result in a file - there's really no point parallelizing that further unless somebody throws more than 200 CPUs at me, each equipped with at least 8GB RAM.

Which ones will survive, which will be abandoned and which will be a reference for concurrent programming?

I don't think that's clear yet. There are a few languages that I'm pretty sure will stay, which have proven to fit some problems very well (C and Haskell, for example; I'm also pretty sure that Erlang has a sufficiently large community to survive).

I know, and the other hackers know that as well. At the heart of Rakudo is Parrot, and that already implements concurrency - it "just" needs more testing, and the compiler needs to emit the corresponding code.

How often do you think it can be expected that your program will have the whole of the computer to (ab)use? How often do you think the computer doesn't have several other tasks to run?

IMnsHO for most applications, one core will be plenty and the multiple cores will be used by multiple processes. For most of the rest, splitting the task into a few parts and running several instances of the program will be both good enough and easiest. So no, I do not think "multicore" programming is something most developers will need to care about in the near to mid future.

There is one big problem with (even partially) automatic concurrency. If you split the task into too many, too-small pieces then you waste more than you gain. A perfect example would be the parallelization of something like @result = map {$_*2} @source;. The work you need to do on each item is simply too simple and inexpensive. Even if you split the (huge) @source into just a few parts and parallelized those, the overhead would swamp any gain. But once the chunks get sufficiently big, they usually also get too complex to parallelize automatically. So I'm a little skeptical.
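To make the overhead argument concrete, here is a sketch in Python (rather than Perl, purely so it stands alone) of the chunked version: the list is split into a handful of large pieces, one task per piece, rather than one task per element. (In CPython the GIL means threads add no real speedup for CPU-bound work anyway, which only reinforces the skepticism above.)

```python
from concurrent.futures import ThreadPoolExecutor

def double_all(source, workers=4):
    # Parallel analogue of @result = map { $_ * 2 } @source, but with
    # the work split into a few big chunks: dispatching one task per
    # element would cost far more in overhead than the trivial multiply.
    n = len(source)
    if n == 0:
        return []
    chunk = -(-n // workers)  # ceiling division: a handful of large pieces
    pieces = [source[i:i + chunk] for i in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        doubled = pool.map(lambda piece: [x * 2 for x in piece], pieces)
    return [x for piece in doubled for x in piece]
```

Even chunked like this, the dispatch and re-assembly costs only pay off when the per-item work is far heavier than a multiply.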

P.S.: I believe the concurrent stuff has been taken out of Clean. It used to be called "Concurrent Clean" originally, but the concurrent evaluation annotations were removed in version 2.0. Not sure what the current status is.

There are already optimizations made with respect to pipelining and with instruction sets themselves. To say the world will remain single-core is to assume that every instruction is the direct ancestral result of what came before, every time. I don't think this is a reality of computing so much as a perception of what we're used to. You need multiple cores for so many reasons these days: anything with Java, including Firefox (firehog, IMHO), Windows, daemons, indexing, databases, and so on. Within each of these lies the opportunity to parallelize.

How often do you think it can be expected that your program will have the whole of the computer to (ab)use? How often do you think the computer doesn't have several other tasks to run?

Programs should not bother about limiting their CPUs/CPU time usage. Actually it is quite the opposite, they should use as many (concurrent) CPU resources as possible in order to finish sooner.

It is the operating system that has the responsibility to ensure that a program does not degrade the performance of other applications running concurrently, whether by creating too many threads or by any other means.

I'm not talking about limiting themselves artificially. I'm talking about bending backwards to escape the single core limitation. I do believe that most often it's not needed. That most often there's plenty other processes to use the other cores. And that if all those processes attempt to use several cores, they will all finish later than if they did not bother.

I'd say multi-core programming isn't just the future, it's the present. It's been at least a couple years since anything I wrote ran on a machine with just one core. Usually the kind of parallelism I need is easy to get - Apache/mod_perl runs multiple copies of my code and on the backend MySQL runs multiple threads to service them. For slightly harder problems requiring some coordination between processes I use Parallel::ForkManager and pipes with IO::Select.

I'm very sweet on Erlang, but whether it's right for you is something only you can judge. You might get to market faster by leaning on the built-in parallel features or you might get bogged down in the learning curve of a new environment. I think a lot depends on just how much parallelism you need to exploit - if we're talking about dual- and quad-core machines then some fork()ing Perl code is probably good enough. If you're deploying on a cluster of several hundred 32-core machines, well, you might need more management infrastructure, and going with a pre-built system may be a better idea than rolling your own.

Concurrency is a bug-spawning PITA. Managing access to shared resources is hard, and hard to test well. Any programming discipline which delegates responsibility for managing a large pool of shared objects to the user is a dead end.

The easiest way to achieve reliable concurrency is to minimize the number of shared resources. Using OS processes as your concurrency model achieves this without imposing any constraints on choice of programming language.
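As an illustration of that model, here is a minimal Unix-only sketch in Python (the helper name is made up): each unit of work runs in a forked child with its own memory, so the only shared resource is the pipe carrying the result back.

```python
import os

def run_in_child(func, arg):
    # Run func(arg) in a forked OS process; the child's memory is its
    # own, so the only "shared state" is the pipe carrying the result.
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                        # child: compute, report, exit
        os.close(r)
        os.write(w, str(func(arg)).encode())
        os._exit(0)
    os.close(w)                         # parent: read result, reap child
    result = int(os.read(r, 4096))
    os.close(r)
    os.waitpid(pid, 0)
    return result

squares = [run_in_child(lambda p: p * p, p) for p in range(4)]
```

For brevity this runs the children one at a time; a real driver would fork them all and then collect, exactly the Parallel::ForkManager pattern mentioned below.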

Unfortunately for those of us trying to write cross-platform compatible open source code, the menu of IPC techniques that work across a wide breadth of operating systems is very limited. IMO, this is a problem which needs to be solved through OS innovation, not language innovation.

Does not compute. OS innovation != cross platform. Language innovation can help with cross platform coding if the language implementers are sufficiently clever. See, for example, Perl's fork() implementation on Windows. Not perfect, but better than anything MS is likely to give us in the next 10 years.

I think Erlang is about as close to "OS innovation" as you're likely to see in this space. The Erlang runtime is practically a little OS unto itself with its own process model, networking, storage, etc.

The following was started 2 days ago, but I've been unable to finish it. It was suggested to me I post it as-is anyway, and if there is sufficient interest, I'll try to complete it later.

My first response to almost every question you've asked, is: It depends.

It mostly depends upon the requirements of the algorithm and/or data that needs processing. There are essentially 3 basic types of parallelism (with many variations):

IO parallelism:

The simplest of them all. Mostly IO-bound.

Typified by webservers and the like. Each execution context is essentially independent of all others.

Each execution context spends most of its time waiting (on a device or communications partner) for something to do. And when it does get something to do, it does it relatively quickly and then goes back to waiting.

Task parallelism:

Potentially the most complex. A mixture of IO and cpu.

Typified by the Google Map-Reduce architecture. (Though that is (currently) applied at the macro (box) level).

Each execution context runs a different algorithm on independent data, but they need to coordinate and synchronise between each other. Here the algorithms (tasks) are overlapped by passing the stream of data from one stage to the next in smallish chunks. Think of a pipeline where one stream of data needs to be processed by several different algorithms, but serially--read; decode; transform; encode; write.

Data parallelism:

Potentially the biggest gains. CPU-bound.

Typified by image (X-ray; MRI; etc.) processing.

One large homogeneous dataset needs one or two, often simple, algorithms applied to the entire dataset. Small subsets of the dataset are assigned to each execution context, which performs the same algorithm(s) on them.

The problems arise when the process also needs to perform some cpu-intensive processing in addition to the IO-driven processing. This necessitates the programmer injecting artificial breaks into the cpu-intensive code in order to ensure timely servicing of the IO events.

Whilst not onerous for a single program with exclusive use of the hardware, it becomes necessary to re-tune the software for each CPU, OS, and even application mix that the code will run on, making for costly porting and on-going maintenance.

As mentioned above, this type is currently being dealt with at the macro-level.

Using clusters of commodity hardware and disk-based 'pipelines' for shared storage. Whilst reasonably effective for the last, current, and possibly next generations of 1, 2, and 4-core commodity boxes with 2 to 4 GB of ram, the cost of all the disk-IO between stages will start to make itself felt as we move to larger numbers of cores and ram per box.

Once you have cores ranging from 8 to 32; multiplied by simultaneous threading per core of 2 to 8; and anything from 64 to 512GB per box, it makes less and less sense to use processes and disk-based storage to handle the pipelines.

When the machine can store an entire partition of the dataset in memory, it is far quicker to load the dataset into memory once and have each stage of the pipeline operate on it in place. You could do this by running each stage over the dataset serially, but as with the current scheme, handing over smallish chunks from stage to stage allows you to overlap the stages and vastly reduce the end-to-end processing time. So, running all the stages as threads, bucket-chaining over a single, large shared dataset, is a natural evolution of the processes-and-disks model that brings huge efficiency savings for negligible extra infrastructure cost.

Just a single shared integer between each stage, incremented by the previous stage and read-only to the following stage, indicates how far the previous stage has made it through the dataset, and where the following stage continues on its next cycle. This utilises only simple, two-party condition signalling, with no possibility of deadlocks, priority inversions or any of those other scare-monger nasties.
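A runnable sketch of that single-integer handoff, in Python (the stage functions are stand-ins): the second stage only ever reads up to where the first has published, so the two threads chase each other over the same in-place dataset with nothing but simple condition signalling.

```python
import threading

data = list(range(100))     # the single, large, shared dataset
progress = [0, 0]           # progress[i]: how far stage i has got
cond = threading.Condition()

def stage(idx, func, upstream):
    # 'upstream' reports how far the previous stage has published;
    # we may only touch elements below that high-water mark.
    done = 0
    while done < len(data):
        with cond:
            cond.wait_for(lambda: upstream() > done)
            limit = upstream()
        for i in range(done, limit):
            data[i] = func(data[i])     # in place, no copying
        done = limit
        with cond:
            progress[idx] = done        # publish our own progress
            cond.notify_all()

# Stage 0 'decodes' (stand-in: +1) over the fully-loaded data;
# stage 1 'transforms' (stand-in: *2), trailing stage 0's counter.
t0 = threading.Thread(target=stage, args=(0, lambda x: x + 1, lambda: len(data)))
t1 = threading.Thread(target=stage, args=(1, lambda x: x * 2, lambda: progress[0]))
t0.start(); t1.start()
t0.join(); t1.join()
```

Because each stage writes only below its own counter and reads only the upstream counter, there is nothing to deadlock on.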

This type is already happening in games and Computer Generated Imagery.

It simply doesn't make sense to partition these kinds of datasets (images etc.) across processes, let alone boxes. The shared-memory model, with threading, is the only way to go. But again, the "big problems" of the shared memory model--deadlocking; priority inversion etc.--do not arise, because each thread operates exclusively on its own partition of the overall dataset.

By linearly or recursively partitioning the dataset, no locking is required and each thread is free to run full-speed on its part of the problem to completion. The only synchronisation involved is the master thread waiting for the workers to finish before it does whatever needs doing with the results.
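A minimal sketch of that lock-free partitioning (Python threads; the "filter" is a stand-in): each worker owns a disjoint slice, and the only synchronisation is the master joining them at the end.

```python
import threading

def filter_in_place(pixels, workers=4):
    # Each worker owns a disjoint slice of the dataset, so no locking
    # is needed: no two threads ever touch the same element.
    n = len(pixels)
    bounds = [(w * n // workers, (w + 1) * n // workers)
              for w in range(workers)]

    def work(lo, hi):
        for i in range(lo, hi):
            pixels[i] //= 2             # stand-in for a real filter

    threads = [threading.Thread(target=work, args=b) for b in bounds]
    for t in threads:
        t.start()
    for t in threads:                   # the only synchronisation:
        t.join()                        # master waits for the workers
    return pixels
```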

More and more, this kind of processing will be offloaded to specialist processors (DSPs; GPUs) (hereafter SPUs). However, with current setups this involves the transfer of data from CPU memory to SPU memory and back again. And with potentially multiple processes needing the SPU's services, and with memory access already the bottleneck on performance, it will make more sense in the near future for SPUs to do their thing by directly accessing the data in main memory, effectively making the SPU cores extensions of the main CPU. We're already seeing this with SIMD and similar instruction sets.

The future is threaded. We are just waiting for the software to catch up with the hardware. And I fear we are going to have to wait for the next generation of programmers before that problem will be properly fixed.

Just as many of my generation have problems with using SMS--and with the language forms it generates--whilst for the new generations it is simpler than tying shoelaces, so it is with threading. Many in my generation only see the problems involved--not the potential.

Just as it took the ubiquity of the web to drive the transition from the request-response model to the upcoming Web 2 era, so it will require the ubiquity of threads and cores to drive the transition from forks&pipes to threads&shared state. It may take until the retirement of the Luddite generation for it to happen. But it will happen.

As you've rightly indicated, one of the primary drivers of the transition will be the evolution of computer languages that give easy--and well abstracted--access to the potential. Whilst many of those you cite are adept at one or more of the types of concurrency, none of them are truly adept at all of them. And problems arise because real-world problems tend to require two or more of those types in the same application.

A secondary problem with most existing language abstractions of concurrency is that they tend to take one of two tacks:

A low-level approach--typified by C/C++ and libraries like Intel Threading Building Blocks--whereby the programmer is given access to, and required to exercise, full control over all aspects of the threading.

This not just enables, but forces the programmer to deal with all the issues of sharing state, not just for that data that needs to be shared, but all state! And that places the onus upon the programmer to ensure the segregation of per stage (per thread) state. A time consuming and rigorous task much better suited to automation by the compiler.

A high-level, encapsulated approach--typified by most of the concurrent FP languages like Haskell and Erlang--which removes most of the control from the programmer's hands.

The problem here is that the FP (and even const-correct procedural) languages prevent the programmer (at least without resorting to extraordinary procedures, often with "dangerous"-sounding names) from sharing state for those data that need to be shared. The result is that the data has to be needlessly and expensively copied, often many times: as it transitions through multi-stage pipelined processes; or as multiple workers concurrently do the same processing on their own small chunks of the total dataset, which are then 're-assembled' back into a whole.

This compiler-enforced (and needless) replication of state becomes a huge drain upon system resources via memory thrash, IO and communications protocol overheads. This can turn O(n) algorithms into O(n^2) algorithms (or worse), once those overheads are factored into the equations. That detracts from, and can even completely negate, the benefits of concurrency. Add in the extra costs of concurrent development, maintenance and testing, and it is easy to see why people see little to be gained from concurrency.

The solution is relatively simple, in concept at least. There needs to be a clear delineation between the state that needs to be shared--for efficient concurrency--and the state that mustn't be shared--for algorithmic safety. And the programmer must have explicit control over the former, whilst the compiler ensures and enforces the latter. In this way, thread procedures can be written without regard to the safety of local variables and state. Anything declared 'local' or 'non-shared'--or better, anything not explicitly marked as shared--is protected through compiler-enforced mechanisms from leaking between threads. At the same time, shared state can be passed around between threads as the needs of the algorithm dictate, without incurring the huge copying penalties of write-once variables.
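Python's threading.local gives a flavour of that delineation, albeit enforced by convention rather than by a compiler: per-thread state lives in a thread-local object and cannot leak between threads, while the one explicitly shared object is guarded deliberately. (The names here are illustrative.)

```python
import threading

local = threading.local()       # per-thread state: never leaks across threads
shared = {"total": 0}           # explicitly shared state...
lock = threading.Lock()         # ...and deliberately guarded

def worker(n):
    local.acc = 0               # each thread gets its own 'acc'
    for i in range(n):
        local.acc += i          # no locking needed for non-shared state
    with lock:                  # locking confined to the shared part
        shared["total"] += local.acc

threads = [threading.Thread(target=worker, args=(10,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```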

Another way of viewing the two concurrency abstractions is:

the former seeks to give the programmer a zillion tools for controlling access to shared state and synchronising control flows between threads.

The problem here, beyond the well-known difficulties of getting this stuff right, is that the exposure and predominance of these mechanisms actively encourages programmers--especially those new to concurrency--to design their algorithms around utilising locks and synchronisation.

whilst the latter seeks to hide all such control within the abstraction, beyond the programmer's reach.

With this approach, you end up with either weird and unintuitive programming constructs and paradigms; or huge, unwieldy and slow compile-time analysis engines; or belt&braces, copy-everything and lock-everything always--"whether it needs it or not"--infra-structure and library code.

There is a third answer to the problems of locking and synchronisation. Avoid them! Sounds too simple to be a point worth making, but it is a fact that most algorithms and applications that can benefit from concurrency can, with a little care, be architected such that they use little or no locking or synchronisation, despite the fact that they may manipulate large volumes of shared state, in place. The trick is to program the application so that only one thread attempts to access any given piece of data at any given time. I'm not talking about using mutexes to prevent concurrent access; more, mechanisms that don't allow the programmer to try--but without throwing the (procedural) baby out with the bathwater.

This can be done today--in any language. It just requires the programmer to be aware of the techniques and apply them. But it would be a whole lot easier if programmers didn't have to become aware of the implementation details or re-invent them every time. To that end, there are several new abstractions that can help:

Memory delegates: handles to (subsections of) real shared memory that can be passed between threads, but which can only be used by the thread holding the delegate. Old copies of the delegate become disabled. Runtime enforcement is fatal; compile-time detection easy.

Thread-controlled shared variables:

This variable is shared--by threads A & B (& F ...) ONLY!

Access type controls on shared variables:

This variable is shared between threads A & B, but only thread A can write to it and only thread B can read from it. And B cannot read from it until A has written it. And thread A cannot write to it again until thread B had read it.

Bi-directional latching shared variables.

Variable shared between threads A & B, readable and writable by both; but cannot read until A has written; and once B has read, neither can read (and A cannot write again) until B has written; now only A can read and neither can read again (and B cannot write again), until A has written.

And so you encapsulate the request-response communications protocols in a single variable.

Self limiting queues:

As with thread-controlled variables, which threads have access (and what type of access) is controlled, and both compile-time and run-time enforced. But in addition, the queues limit their own sizes such that any attempt to write when the queue is 'full' causes the writer to block, and any attempt to read when the queue is empty causes the reader to block.

You get self-regulating synchronisation with tunable buffering.

Declarative threading of functions for 'fire and forget' usage.

Delayed (lazy) results from threads (promises).

Instead of having to explicitly create a thread, passing it the thread procedure address; retrieve the handle; and then, later, explicitly call a blocking join to get the results--the above two combine to give something like:
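A sketch of how those two constructs might combine, using Python's futures (the decorator and function names are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def threaded(func):
    # Declarative threading: calling the decorated function fires it
    # off on a thread and immediately returns a promise for the result.
    def fire(*args):
        return pool.submit(func, *args)
    return fire

@threaded
def expensive(x):
    return x * x                # stand-in for real work

promise = expensive(12)         # fire and forget...
answer = promise.result()       # ...the blocking join is implicit, on demand
```

The explicit create/handle/join ceremony collapses into an ordinary-looking call and a lazy result.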

With a few simple control and data sharing constructs such as these, all the scary, difficult parts of threading disappear under the covers, and the programmer can get back to concentrating upon their application whilst knowing that it will benefit from whatever cores and threads are available.

You can imagine the pipeline example (read; decode; transform; encode; write) from above being written something along the lines of:
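For instance, as a sketch in Python (the stage bodies are trivial stand-ins), using bounded queues as the self-limiting links between threaded stages:

```python
import queue
import threading

DONE = object()                          # end-of-stream marker

def stage(func, inq, outq):
    # One pipeline stage: pull a chunk, process it, pass it on.
    while True:
        chunk = inq.get()
        if chunk is DONE:
            outq.put(DONE)
            break
        outq.put(func(chunk))

# Self-limiting queues: maxsize gives tunable buffering, and a fast
# producer simply blocks until the consumer catches up.
links = [queue.Queue(maxsize=4) for _ in range(4)]

decode = lambda b: int(b)                # stand-ins for the real
transform = lambda x: x * 2              # decode / transform / encode
encode = lambda x: str(x)                # stages

workers = [threading.Thread(target=stage, args=(f, links[i], links[i + 1]))
           for i, f in enumerate((decode, transform, encode))]
for w in workers:
    w.start()

for chunk in ["1", "2", "3"]:            # 'read' feeds the pipeline
    links[0].put(chunk)
links[0].put(DONE)

written = []
while (item := links[3].get()) is not DONE:   # 'write' drains it
    written.append(item)
for w in workers:
    w.join()
```

All the stages overlap, and the only "synchronisation" the programmer wrote is putting to and getting from the queues.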

This is a complicated problem. However I do not believe that multi-threading is a real solution for reasons that I explained on another forum some time ago.

The short version of what I said there is that Moore's law is looking to add cores faster than SMP can scale, at which point we'll go to NUMA inside of a CPU. And once we go to NUMA, multi-threaded programs don't scale. So while there are some very important niches which can benefit from serious multi-threading, I don't buy the dream of unlimited numbers of cores available to a single multi-threaded application.

But, like so many articles that set out to prove the point they want to make, rather than looking at the evidence and see where it leads, you spend much of the time setting up Straw Men, and then proceed to knock them down.

The first mistake you make (in the chronology of the article) is exactly the "big problem with threading" that my post above took special pains to refute--but you probably didn't read down that far--that of "synchronisation & locking".

The whole paragraph starting "There are some basic trade-offs to understand.", is just an elaborately constructed straw man. Synchronisation & locking is only a problem, if you design your algorithms to need synchronisation and locking. So, don't do that!

Without repeating all the carefully constructed explanations above: why would anyone architect an algorithm to use 32k cores such that they all needed to access the same piece of memory? It beggars belief to think that anyone would, and the simple truth is "they" wouldn't. There is no possible algorithm that would require 32k threads to contend for a single piece of memory. None.

Let's just say for now(*) that we have a 32K-core processor on our desktop, and we want to filter a modest 1024x1024 image. Do we apply a lock to the entire image and set 32k threads going to contend for it? Or do we create one lock for each individual pixel? The answer to both is a resounding: No! Why would we? We simply partition the 1M pixels into 32K lots of 32, and give one partition to each of the 32k threads.

Voila! The entire 1MP image filtered in the same time it takes one thread to filter a 32-pixel image. No locks. No contention. And the only "synchronisation" required is the master thread waiting for them all to finish. One load from disk. One store back to disk. As a certain annoying TV ad here in the UK has it: simples!
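The partition arithmetic can be checked in a few lines of Python: the slices are disjoint, cover the image exactly, and give each thread precisely 32 pixels.

```python
# 1024x1024 pixels shared among 32K threads: each gets exactly 32,
# the partitions are disjoint, and together they cover the image.
PIXELS, THREADS = 1024 * 1024, 32 * 1024
per_thread = PIXELS // THREADS
partitions = [(t * per_thread, (t + 1) * per_thread) for t in range(THREADS)]
```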

(*) I don't really believe that any of us will have anything like 32k cores on our desktops in the next 20 years, but I'm just going with your numbers. See below.

You simply cannot do that anywhere near as efficiently with processes; nor message passing models; nor write-once storage models.

The second mistake you make is that you mutate Moore's Law: "the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years", into: "we're doubling the number of cores.". And that simply isn't the case.

You can do many things with those transistors. Doubling the number of cores is only one possibility--and one that no manufacturer has yet taken in any two successive Moore's cycles! Other things you can do are:

Improve the interconnect between cores and memory: AMD did it a while back with HyperTransport. Intel has just played catch-up with Nehalem and their QuickPath Interconnect.

Add/increase the number of register files. Duplicate sets of registers that can be switched with a single instruction.

They've been doing this on embedded processors for decades.

Implement NUMA in hardware.

Look at the architecture of the Cell processor. By using one PPE to oversee 8 SPEs, and giving each SPE its own local memory as well as access to the PPE's larger on-chip caches, you effectively have an 8-way NUMA architecture on a chip.

In summary, relating modern CPU architectures to the OS techniques developed for the IO-bound supercomputer architectures of the 1980s and '90s just sets up another Straw Man that is easily knocked down. But it doesn't have any relevance to the short-term (2-5 years) or medium-term (5-10) futures. Much less the long-term 20 year time frame over which you projected your numbers game.

Making predictions for what will be in 20 years time in this industry is always fraught; but I'll have a go below(**).

Your third mistake, albeit that it is just a variation on 2 above, is when you say: "On a NUMA system there is a big distinction between remote and local memory.".

Your assumption is that NUMA has to be implemented in software, at the OS level, and sit above discrete cores.

As I've pointed out above, IBM et al. (STI) have already implemented (a form of) NUMA in hardware, and it sits subordinate to the OS. In this way, the independence (contention-free status) of the cores running under NUMA is achieved, whilst they each operate upon sub-divisions of memory that come from a single, threaded process at the OS level. This effectively inverts the "local" and "remote" status of memory as compared with conventional NUMA clusters.

Whilst the local memory has to be kept in synchronisation with the remote memory, the contention is nominal as the contents of each local memory is simply caching a copy of a small section of the remote (main) memory. Just as L3 caching does within current processors. And as each local cached copy is of an entirely different sub-division of the main memory, there is no contention.

(**) My prediction for the medium-term 10 year, future (I chickened out of the long-term :), is that each application will be a single, multi-threaded process running within its own virtualised OS image. And at the hypervisor level, the task will be to simply distribute the threads and physical memory amongst the cores; bringing more cores on-line, or switching them off, as the workloads vary. This means that the threads of a particular VM/process/application will tend to cluster over a limited number of the available cores, controlled by dynamic affinity algorithms, but with the potential to distribute any single task's threads over the full complement of available cores should the need arise, and appropriate algorithm be chosen.

But that's about as far as I'm prepared to go. Time will tell for sure :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Previous related applications (mostly written >5 years ago) were written in high performance programming languages like C/C++

I have to say in advance that I've never done any serious Haskell programming, but I don't think this statement is true. C is still the language of choice for these kinds of tools. If you are capable of writing a high-performance string matching tool, it will be quite trivial for you to make it multi-threaded. The guys in our department worked really hard for over a year on a very efficient suffix array tool; parallelizing it was a matter of weeks. Another reason to choose "high performance programming languages" when it comes to high performance computing is simply memory. I know RAM is cheap these days, but suffix arrays are demanding. For example, if you want to search the complete human genome, you'll need far more than 8 GB of RAM with the most efficient suffix array implementations available.

C is still the language of choice for these kinds of tools. If you are capable of writing a high-performance string matching tool ...

I know that, and that is why I asked the question. Is it worth spending >1 year coding a "very efficient suffix array tool" when (maybe) you could code it in less time, less efficiently, but in a language that supports parallel and concurrent computation (ideally) easily?

The point about memory is relevant, but I don't know if critical in this case (I'm not trying to build suffix arrays over strings of 3Gbs)

Well, in the case of suffix arrays, it's not the coding in C that makes it difficult or time consuming; the algorithms are quite complicated. The speedup you gain by using clever techniques and efficient data structures isn't in the range of 2, 4 or 8, it is in the thousands. For example, I know that if you used a simple qsort() for the suffix array sorting, you would need weeks for a human chromosome, compared to the minutes the guys need here.
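For a feel of that naive approach, here is a Python sketch (not the department's tool): sorting whole suffixes costs O(n) per comparison, i.e. O(n^2 log n) overall, and materialising every suffix as a string also illustrates the memory blow-up mentioned above.

```python
def naive_suffix_array(s):
    # The 'simple qsort()' approach: sort suffix start positions by
    # comparing whole suffixes.  Each comparison is O(n), and the key
    # function materialises O(n^2) characters of suffix copies --
    # fine for showing the idea, hopeless for a genome.
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = naive_suffix_array("banana")   # suffixes: a, ana, anana, banana, na, nana
```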

You could also see it the other way: if you use C and maybe OpenMP, then the simple to implement, naive algorithms might be already fast enough.
