Next step is an AI as our decision makers and leaders?
Currently I see that as a benefit not a harm.

There are many villages and townships where mayor is a part-time position. The thinking appears to be that it's better not to have too much of a good thing when it comes to politics. Along these lines, compared to a neural-network politician named Marvin that works 24 hours a day, 7 days a week without sleep, I suspect many would prefer a trash-sorting robot named Wall-E.

I've been working on a parallel MPI Fibonacci code for use on distributed-memory clusters. Rather than dividing the computation into thousands of pieces as done with OpenMP, it appears necessary to keep things more coarse grained. As the conquer and divide recursion in Karatsuba multiplication naturally parallelizes into three parts at a time, such coarse-grained parallelization should work best when the cluster has 3^n nodes.

The super-cheap cluster inconveniently has 5 computational nodes. However, over-provisioning 4 of the 5 nodes by using 9 ranks for the computation could plausibly result in an efficient distribution of work. In this case, it is important that the MPI library has been configured to avoid busy waiting, which fortunately is already true for the super-cheap cluster.

I don't know whether the final MPI parallel computation will actually run faster on the cluster or not. The networking fabric is made out of bridged USB Ethernet gadgets while the nodes are Pi Zero computers. Thus, even though communication between nodes has high latency and low bandwidth, the CPU speed of each node is so slow that increasing performance by parallel processing may still be possible.

The distributed-memory MPI parallel code is turning out as troublesome as line-numbered Basic. After 50 or so big numbers successfully passed back and forth between the ranks, I'm receiving "Fatal Error in MPI_Recv: Unknown error class."

What does that mean? Could it be caused by a stray pointer? Maybe I need a deep-learning dog detector for debugging. When C works like this, why avoid Basic?

The MPI program is now working. Rather than merely a stray pointer, there was a race in gathering the results between lower and higher levels of the recursion. The unexpected order of the messages resulted in memory being freed before the return value arrived. This caused the program to scribble on the stack of the MPI_Recv function. Surprisingly, the problem was fixed by a thirteen-character change in the sync_bigmul3 routine to explicitly specify the ranks from which to receive messages. As the saying goes, if contractors built houses like programmers built programs, then not one of the three little pigs would have been saved.

The parallelism is relatively coarse grained. The last return in the recursion uses 3^n ranks while fibowork runs with 2*3^n ranks. Both ways of dividing up the computation must be efficiently distributed among the nodes of a cluster for good performance. Theoretically, this can be achieved by over-provisioning 3^n nodes with 2*3^n ranks. Alternatively, TUNE3 can be set to an impossibly large number so the linear recurrence in fibowork is not parallelized. In that case only the Karatsuba multiplication in bigmul3 will be parallelized and 3^n ranks is all that's needed.

Sometimes over-provisioning is not practical because the MPI library employs busy waiting (almost all do by default) or takes a relatively long time to start each individual rank. In such situations setting TUNE3 large and using exactly 3^n nodes can result in better performance. When the number of nodes available is not a power of three then over-provisioning will almost always increase performance.

Since the super-cheap cluster has 5 single-processor nodes, one option would be to run the MPI program using 18 over-provisioned ranks. Alternatively, one could set TUNE3 very large and run the program using 9 ranks. In the case of the super-cheap cluster, the latter results in slightly better performance. In particular, we have the following results in seconds:

The distributed-memory MPI parallel code runs 2.718 times faster than the simple serial code on the super-cheap cluster. Further comparing the parallel code to the serial code suggests that at least 2.5 seconds of the 9-second runtime is used for initialization. If these initialization costs are neglected, then dividing the roughly 24.5-second serial time by the remaining 6.5 seconds estimates the actual parallel speedup to be over 3.7 times faster. Following this line of reasoning, it is reasonable to conjecture that a 4-fold speedup might be achieved when calculating larger Fibonacci numbers.

The MPI code is about 55 lines longer than the original parallel.c OpenMP code. For reference fibompi.c follows:

The second age of personal computing started with the introduction of a single-board computer called the Raspberry Pi that was cheap enough even a child could own one. This resulted in clusters of real computers housed in Lego structures along with more sensible clusters put together by adults. In this way was born the personal computing cluster, a cluster which took no more electricity than a light bulb and less space than a toaster.

The golden age of personal computing started with the introduction of 8-bit single-board computers such as the Commodore PET. Although it was theoretically possible to assemble these computers into clusters using the built-in IEEE 488 interface, such was not done, possibly because it took too much Lego. People did, of course, still create personal clusters. A representative system is the 16-processor Z80 cluster assembled by Duane Elscott

used for computing fractal art. In general, however, the personal computing cluster only recently became practical.

Even so, it would appear that computing clusters are still much less popular than multi-core processors. Cost is not a factor: The super-cheap cluster in its entirety costs less than US $100, while the cost of a single x86 multi-core CPU is significantly more. Space is not a factor: The super-cheap cluster takes less space than most desktop or notebook computers. What is the difference? Why are multi-core processors everywhere and multi-node clusters not?

It is interesting to note that the 32-core Ryzen Threadripper, which is geared towards mainstream users, internally consists of four active dies arranged in a cluster: Two dies in that arrangement have directly attached memory controllers and two do not. The high-speed interconnect between the CPUs inside the Threadripper package enjoys a 25 Gbit/sec bandwidth with a 250 ns latency. Even though the Threadripper is often treated as a symmetric multiprocessor, internally it is actually a cluster of processors with non-uniform but fast memory access.

For comparison, the USB Ethernet gadgets which form the networking fabric of the Pi-Zero cluster operate with a 0.1 Gbit/sec bandwidth and an 800000 ns latency. The bandwidth is 250 times less and the latency 3200 times more. Latency is often the biggest issue with high-performance computing and why the Cray-1 was arranged in that distinctive C shape. In particular, the high latency of the networking fabric in traditional clusters compared to NUMA means that, for efficient parallel operation, memory in a cluster must be viewed as distributed between processors rather than shared.

In either case, distributed-memory clusters and shared-memory multiprocessors both require parallel-processing techniques to expressively convey an algorithm to the computing hardware for execution. Although not written in line-numbered Basic, the MPI parallel program for computing Fibonacci numbers on a distributed-memory cluster was at least 55 lines of code more difficult to write than the OpenMP shared memory version. Could this additional programming complexity be the reason why multi-core notebooks, desktops and even phones are commodity items while computing clusters are not?

From a computer literacy point of view, it is important to know whether distributed-memory parallel processing, though difficult, is necessary to maintain the second age of personal computing and avoid the apocalypse that comes after digital feudalism. As anyone who has compared fat-free milk to one-percent knows, the one percent can make an important difference. Indeed, a continuation of the liberation provided by current technology is not at all certain. Consider, for example, the following graph of data obtained from the Wayback Machine showing the number of new posts to the Raspberry Pi forums per year over the last five years:

That looks terribly slow for a C-like language that compiles to native code. But once again we have a case where actually calculating fibo(4784969) takes 2 or 3 seconds, then printing the result takes a minute and ten seconds! In fact, as far as I can tell most of that time is spent freeing up all the memory it used.

Perhaps there are better ways to write this in D but I know nothing about it. There is an option to use GMP from D but I'm not sure if we can count that as a standard library for D.

So there I was tenaciously avoiding BASIC again, when I accidentally wrote a fibo(4784969) in the D language.

Something seems strange about the output. Usually head -c32 prints 32 digits while tail -c32 prints 31 digits and a newline.

I suspect the built-in library is computing in base 2 using a sensible algorithm but the base 2 to base 10 conversion for printing is unfortunately an O(n^2) algorithm. This may be the case with Python as well.

Like with any programming language, one is constrained by the vision of other developers when using standard subroutine libraries and built-in features. Maybe the multiply routine was micro-optimised while the printing routine was not. Could GMP be tricked into printing the results of the std.bigint calculation? If so, then it would be possible to have the algebraic notation of one library and the efficient printing of the other. If not, then one can really feel the all-or-nothing pain that results from using someone else's code. Alternatively, maybe the problem really is garbage collection.

The possibility of collecting garbage makes me want to go test the Go library that was linked earlier in this thread. From reading the code, that library appears to have an asymptotically reasonable base 2 to base 10 conversion used for printing performed by a conquer and divide application of division. However, it is impossible to know with certainty how expressively any code conveys an algorithm to the CPU without actually running it on the Raspberry Pi Zero.

I don't know; I promised myself I'd never learn another programming language in my life, as I've been through so many, most of which are now obsolete. But they just keep coming...

I got attracted to D because of the involvement of one of my software nerd heroes, Andrei Alexandrescu, plus a hint that work was going on to get it running on tiny micro-controllers, plus a hint that there was an effort to use it as a hardware description language. Sadly the latter two of those seem to have fizzled out.

Something seems strange about the output. Usually head -c32 prints 32 digits while tail -c32 prints 31 digits and a newline.

Well spotted. It did actually print a line announcing "Done." prior to printing the output so that I could get a feel for how long the computation took. I snipped that out when I posted the output here so it looks a bit odd.

From what I read, the dlang BigInt is using Karatsuba and is optimized for use only around 1000 digits or so. Bigger than that and they suggest using GMP.

Coincidentally I just found this presentation on YouTube telling me that the D language is now working for RISC V, using GCC, via Fedora Linux: "Fedora on RISC-V 64-bit Introduction, Brief Overview and Latest Developments": https://www.youtube.com/watch?v=yxdT9gsBF_M

But still some road blocks:

1) 64 bit only.

2) Requires Linux to run.

3) No use for "bare metal" 32 bit RISC V. Still stuck with C there.

I don't see D making it down to micro-controllers anytime soon. It can generate the code, like C, but somehow needs a huge runtime, even if you are not using garbage collection or threads.

While APL led to A and then the A+ programming language, BCPL led to B, C and then the D programming language. As a teacher I asked myself, what in the grading scale comes after D?

In order to avoid a Fortran Fibonacci code, or at least delay it, the second page of the comic begun in this post is provided instead.

Note that the B in the above comic is unrelated to the B programming language. No identification with actual programming languages (living or deceased), places, buildings, and products is intended or should be inferred. Basically, all characters are fictitious but have become increasingly difficult to avoid, at least in this thread.

Last edited by ejolson on Tue Feb 12, 2019 1:32 pm, edited 1 time in total.

Amazingly one can still get recently maintained BCPL from Martin Richards. Which is usable on the Raspberry Pi apparently:

"Martin Richards maintains a modern version of BCPL on his website, last updated in 2018. This can be set up to run on various systems including Linux, FreeBSD, Mac OS X and Raspberry Pi. The latest distribution includes Graphics and Sound libraries and there is a comprehensive manual in PDF format. He continues to program in it, including for his research on musical automated score following."

Amazingly one can still get recently maintained BCPL from Martin Richards. Which is usable on the Raspberry Pi apparently:

I installed BCPL on an Amdahl mainframe over 35 years ago and at the time I was impressed that it "just worked" - because it was a binary executable and IBM hardware had remained backwards compatible.

The only thing I can recall from then was that BCPL could return a value from a statement expression (like an Algol 68 serial clause).
You can do this in C now with the GCC extension ({ .... })

B and BCPL were type-less, which I would now regard as an absolute nightmare!

It seems I will need to do some studying to make an object-oriented operator-overloaded parallel coarray Fortran 2018 version of the Fibonacci code. I wonder how expressive that will be or whether it would have been better not to avoid BASIC. If one of the 50,000 views of this thread was made by Fortran Man, some help getting started would be appreciated.

In the meantime, here are some meaningless performance results comparing parallel.c using two different parallelization technologies on the same Xeon Gold 6126 server as before.

Note that MPICH was compiled to use the ch3:sock device which is likely slower than using shared memory to pass messages. It would be interesting to know how much of a performance penalty, if any, this incurs.

Has there been any progress adding the Visual Basic and BBC Basic programs to Github?