On my PC it takes several times as long as that just to start node up.

Something is wrong then:

You're right, sorry. I last measured it some time ago.
I tried again, and the start time for hello, world is nearly identical.
This script "qc" is slightly faster than node with optimization off, and slightly slower than node with -O3:-

Well, that recipe works fine. Except the resulting profile output is about the same. I can't believe it.

Now, the argument is that when using gprof with pthreads it only reports on the activity of the main thread so the results are all wrong. But it seems to me that in this application all threads get to run the same code, in which case having a measure of only what the main thread is doing would be fine.

Beginners don't need the tedium of C++ compilation or the syntactic complexity of C/C++ etc. They need something simple and basic.

Did you say BASIC?

There are many who know only one language, especially in the States and China. Even in these countries there are plenty of polyglots. Some develop a desire to do something new and useful and thus apply their language skills to the task of writing poetry.

Imagine how sad if, after trying and failing, the reason for failure was use of the wrong language. What if the only modern languages suitable for writing poetry were Italian, Arabic and Hindi? Then any attempt to write poetry by half the people in the world would be futile due to using the wrong language. That would be discouraging.

It is arguable, due to the universal nature of the human condition, that every human language is suitable for writing poetry. One might think by analogy that every Turing-complete programming language is suitable for writing complex algorithms. If this were true, then there would be no need to even consider the question whether to avoid BASIC.

A child learns to speak, not because it is easy, but because being able to convey their feelings, desires and needs is a great reward. In fact, learning to speak is now considered one of the most effort-intensive accomplishments of nearly every person. As with human languages, many people will only master the first programming language they are exposed to. However, unlike human languages, not all programming languages are equally expressive. Not all have the same ability to convey the poetry of complicated algorithms to the computing hardware. Some are not readable enough, some are not productive enough and many are not efficient enough.

While one solution to the lack of expressivity is scripting lyrics for rap music rather than writing poetry, another solution might be to avoid certain languages at the outset. Just as being able to write poetry liberates the human spirit through the expression of a person's innermost thoughts and perceptions, so does being able to efficiently convey complicated algorithms to a computer. Such digital liberation frees the human imagination from built-in features and standard subroutine libraries. It motivates the creation of new algorithms by making their practical implementation and use possible. A new age of personal computing should allow the average person to avoid the servitude of digital feudalism and might also raise the GDP.

The question remains, which languages are to be avoided at the outset and which languages lead to such great rewards that the difficult task of learning them is worthwhile?

Last edited by ejolson on Tue Jan 15, 2019 5:20 pm, edited 3 times in total.

Can that result really be true?

Since the only routine parallelized is the multiply routine one would expect it to take less time.

From a work-span point of view, the recursive multiply immediately runs on two cores and after two additions on a third. Upon sync there are a few subtractions and additions to put the results together. Thus, the O(n^1.58) part of the code distributes to available processors quickly with good parallel scaling and the only thing left are the O(n) terms. Given the relatively small size of the problem the O(n) part becomes significant somewhere between four and eight cores.

I created a script to add volatile qualifiers to the C intermediate files so the gcc optimizer doesn't clobber things and create a segmentation fault in the resulting executable...

Ouch, what?!!

C/C++ optimizers do not "clobber" anything unless the code they are compiling makes use of some undefined behavior of the language (barring compiler bugs). Use of undefined behavior means that anything can happen.

What you are saying is that the FreeBASIC code generator is buggy and its generated C code is not guaranteed to run on all platforms with all standards-compliant compilers, even with optimizations off.

Ideally it would be better to have the FreeBASIC devs fix these bugs rather than hack it with workarounds.

Personally I'd rather skip the middleman and write in C directly. I don't need any more bugs than I have already introduced by a code generator, especially when it serves no useful purpose.

The gcc backend to FreeBASIC is broken. This, along with the workaround, was posted earlier. I've only followed through with timings for the Pi Zero.

No, I did not. I said "basic". Deliberately, because "simple and basic" does not imply BASIC; there are many other options.

I don't know much about human languages other than English, except a smattering of Finnish. Finnish is to English as Forth is to C.

I guess the peoples of the world get on very well with whatever language they have. I cannot believe that they don't have the means to express their inner being as well, or as poorly, as we do.

The question remains, which languages are to be avoided at the outset and which languages lead to such great rewards that the difficult task of learning them is worthwhile?

1) As with human languages, it's helpful to use the language of the community you live in. That includes the community of people you deal with as well as the community of computers you are using.

2) For a rank beginner a simple language is better, without the need to comprehend all kinds of complex syntax and semantics before one can get even the simplest things working. That's not to say the language cannot have complex syntax and grammar, but it should be something one can grow into. When kids start to learn their human language we are happy if they can string together a few simple words; we don't expect them to employ the vast complexity of the whole language. Similarly, those new to programming seem to get on fine with C++ in the Arduino environment; they don't need to be aware of the huge, fractal complexity of the whole language.

3) Related to 2): although the language should make doing simple things simple so that the beginner can get started, it should not be limited. BASIC is an example of simple but limited, ultimately a dead end. C++ can be simple (see Arduino) but has vast capabilities that one may not live long enough to explore fully.

I think that 99.9% of people who learn some programming are never going to dream up new algorithms, in the same way that we learn maths in school but don't invent new maths, and learn about literature in school but never write famous novels, etc. However, they have problems to solve and can make use of the tools (algorithms) software engineers can provide. They should not have to (re)create the tools. If those tools are built into a programming language they can understand, all the better. Think of the big-integer support and all those modules available for Python or JavaScript.

Since the only routine parallelized is the multiply routine one would expect it to take less time.

I don't follow you. Thinking about what goes on here gives me a headache.

Perhaps I would expect it to take less time but gprof is saying that using four cores for the multiplies has reduced their significance from 75% of the time to only 5%.

So far that does not seem credible to me.

I'm only getting a speed up of 2.8 on the Pi when going parallel.

parallel.c only speeds up by 2.9.

The scaling factor for my 8 hyper threads is even worse.

I'm beginning to think, via a hand waving argument in my mind, that the algorithms we have here are never going to scale past about 3, no matter how many processors one has. There is just too much stuff that has to be done sequentially, at all levels of recursion, in the fibo() and in the multiplies.

2) For a rank beginner a simple language is better, without the need to comprehend all kinds of complex syntax and semantics before one can get even the simplest things working. That's not to say the language cannot have complex syntax and grammar, but it should be something one can grow into. When kids start to learn their human language we are happy if they can string together a few simple words; we don't expect them to employ the vast complexity of the whole language. Similarly, those new to programming seem to get on fine with C++ in the Arduino environment; they don't need to be aware of the huge, fractal complexity of the whole language.

Well said.

There is always something new to learn in big languages like C++17, and D for that matter. Learning is fun.

Even simple C.
To my shame, despite their being in the language for 8 years now, I have never used _Generic or any of the atomic stuff.

You have sent all the multiplies to worker threads and then wait in the main thread for them to return. Instead try sending only two of them and perform the third in the main thread. Does something like that help?

The Cilk parallel programming extensions to gcc were deprecated about the same time OpenMP got a sophisticated enough scheduler to support the kind of recursive dynamic parallelism being used in the Karatsuba algorithm. Later versions of the 6.x series compiler seem to work fine as well as the 7.x and 8.x versions. What version of gcc are you using?

Last edited by ejolson on Sun Jan 13, 2019 8:30 pm, edited 1 time in total.

No noticeable difference when averaged over ten runs on the Pi 3 before and after.

This is gcc version 6.3.0

I conclude OMP is smart enough to understand what I'm saying. I did not say "run this on a thread, run this other thing on another thread, ..."; I said do this, that, and the other in parallel. OMP knows that only requires 3 threads.

No noticeable difference when averaged over ten runs on the Pi 3 before and after.

I think Intel Parallel Studio has a parallel performance profiling tool, but I've never used it. I've always taken the old-fashioned approach and tried to think my way through any parallel scaling problems.

Maybe the problem is what I said earlier: too many emulation layers. POSIX threads emulated by glibc using Linux native threads, which are in turn emulated by the Windows Subsystem for Linux using Windows threads.

Oh, wait, the problem is on the Pi as well? Do you have power supply or throttling problems?

I have tried parallel.c on an 8-core Cortex-A53 SBC running in 64-bit mode using taskset to limit the number of available cores to 4. With gcc version 6.4.0 I have the following results:

which indicates a speedup of 3.38 for the exact code in GitHub. Slightly better results can be achieved by over-provisioning the available cores with double the worker threads. Simply change the code in the work routine to read as

which is a 6.15-fold increase in speed. At this point the O(n) part of the algorithm, which was not parallelized, is contributing a significant percentage of the total run time. Although one could attempt to parallelize some of the O(n) routines such as, for example, bigcarry, from another point of view computing million-digit Fibonacci numbers is just too small a problem for more than eight cores.

Finally, I should mention that on this particular SBC (the NanoPC-T3) the controls in /sys/devices/system/cpu/cpufreq/policy0 function and were set so that

I thought of one more thing aside from lock contention, throttling and emulated operating systems that might be causing the C++ code to run more slowly. With OpenMP as well as Cilk there is additional overhead related to growing cactus stacks and saving state whenever a subroutine that might spawn parallel work is called whether or not any parallel calls are subsequently made. Since you've lumped the parallel and serial recursion into the same subroutine, the serial part may experience an additional slowdown that could be avoided. Therefore, breaking the parallel and serial recursions into different subroutines as done in parallel.c may increase the resulting efficiency when running on multiple cores.

Here is a useful tool for measurement: http://valgrind.org/docs/manual/cg-manual.html
While the output may not be all that meaningful, I suspect you could use it before and after any change you make to improve locality of reference and see the difference.

Looks like over-provisioning obtained another 10 percent for a 3-fold performance increase. Still, the numbers make me think your Pi is throttling when multiple cores are busy.

For the x86 computer, behind the scenes there are really only 4 cores, so scaling just under a factor of four is not bad. Historically, hyper-threads were a hardware workaround to increase the responsiveness of the Windows scheduler. They also helped with the long Pentium 4 pipelines that tended to stall. Today hardware threads are mostly a marketing scheme: stalls are much less frequent due to compiler technology and processor design. They do, however, lead to some interesting side-channel security issues.

Here is a scaling study of parallel.c using a dual Xeon E5-2620 twelve-core server.

Note that the scaling is not as good as with the eight-core ARM machine. Since the CPUs are faster it is possible that memory bandwidth becomes an issue sooner.

Last edited by ejolson on Tue Jan 15, 2019 7:46 pm, edited 1 time in total.

I was wondering how my 4-core, 8-hyper-thread machine scales when attacking an "embarrassingly parallel" problem (no memory sharing). So I tweaked my Google Benchmark code a bit. It now runs a simple big-integer power function multiple times, using OpenMP, C++'s std::async, and in serial. (There is no parallelism in the actual big-integer multiply here.) The power function looks like this: