In a different thread appears a short self-contained C program which computes the first Fibonacci number with a million digits. This program implements big-number arithmetic using 64-bit integers as the underlying type. The Pi 3B+ running in 32-bit compatibility mode completes the computation in 15.43 seconds. Based on rescaling the clock speeds of a different ARM-based single-board computer, it was estimated that the Pi 3B+ running in 64-bit mode should complete this same computation in only 7.49 seconds. If true, that would be a two-fold increase in speed for a particular application just by switching operating systems.

It would be nice if someone running a 64-bit operating system on real 3B+ hardware could confirm that this estimate is correct. The program is available in this post. The above-mentioned performance results are discussed in subsequent posts of the same thread.

Not sure if anyone has posted results for this as requested, but here's a run on an RPi3B+, gcc 8.2.0, gentoo-on-rpi3-64bit image, with and without -ffast-math (as expected, on arm64 this flag makes essentially no difference), FLIRC case, on-demand governor:


Thanks for running the code on the Pi 3B+ in 64-bit mode. Compared to the timing of 15.47 seconds in 32-bit mode from this post, we have

15.47 / 7.740 = 1.999

which is nearly a 2-fold increase in performance. This confirms the similar result posted here.

From my point of view, the fibonacci.c program performs a real computation using an asymptotically reasonable algorithm. In particular, it uses Karatsuba multiplication along with the doubling formulas for the Fibonacci sequence to find the nth term. While some care has been taken with the code, it is definitely not hand-coded assembler tuned to a particular architecture. For these reasons this is not a synthetic benchmark, in my opinion, but rather a program which represents application-level performance that results from writing suitable code to solve a real problem in a high-level language.
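The doubling formulas the program relies on can be sketched in plain C. This is a minimal illustration using `uint64_t` (so it only works up to around F(93)), not the big-number code from fibonacci.c; the function names here are mine:

```c
#include <stdint.h>

/* Fibonacci by the doubling formulas:
 *   F(2k)   = F(k) * (2*F(k+1) - F(k))
 *   F(2k+1) = F(k)^2 + F(k+1)^2
 * This halves n at each step, so only O(log n) levels of recursion
 * are needed.  The real program replaces the multiplications below
 * with big-number arithmetic using Karatsuba multiplication. */
static void fib_pair(uint64_t n, uint64_t *fn, uint64_t *fn1)
{
    if (n == 0) { *fn = 0; *fn1 = 1; return; }

    uint64_t a, b;                  /* a = F(k), b = F(k+1), k = n/2 */
    fib_pair(n / 2, &a, &b);

    uint64_t c = a * (2 * b - a);   /* F(2k)   */
    uint64_t d = a * a + b * b;     /* F(2k+1) */

    if (n % 2 == 0) { *fn = c; *fn1 = d;     }
    else            { *fn = d; *fn1 = c + d; }
}

uint64_t fib(uint64_t n)
{
    uint64_t fn, fn1;
    fib_pair(n, &fn, &fn1);
    return fn;
}
```

With word-sized limbs, the 64-bit build gets twice the digits per limb operation, which is consistent with the roughly two-fold speedup reported above.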

It would be interesting to see an example of a reasonably written program which solves a real problem that runs 2-times slower on 64-bit compared to 32-bit. Are there any examples that can be quantitatively compared?


Interesting challenge!

I suspect the only thing that's slower might be a program reading/writing vast numbers of pointers to and from memory.

Pointers (and the related size_t and ptrdiff_t) are the only types that change size gratuitously. You could argue about long, but a reasonably written program should be using stdint.h. Perhaps off_t, but that can be set to 64 bits in 32-bit mode.

The 31 general purpose registers, the removal of the slow instructions, the regular opcode layout, the 32 floating-point registers, and so on, means 64-bit mode is usually going to be a bit faster, like it or not.

It would be interesting to see an example of a reasonably written program which solves a real problem that runs 2-times slower on 64-bit compared to 32-bit. Are there any examples that can be quantitatively compared?

I suspect the only thing that's slower might be a program reading/writing vast numbers of pointers to and from memory.

Pointers (and the related size_t and ptrdiff_t) are the only types that change size gratuitously. You could argue about long, but a reasonably written program should be using stdint.h.

The 31 general purpose registers, the removal of the slow instructions, the regular opcode layout, the 32 floating-point registers, and so on, means 64-bit mode is usually going to be a bit faster, like it or not.

I'm pretty sure it is possible to create a synthetic benchmark that runs 2 times slower by leveraging memory bandwidth constraints when reading 64-bit pointers.

Since development of most mainstream desktop applications now target 64-bit platforms, I suspect most code that showed performance regressions on 64-bit platforms has already been rewritten. For example, one could use 32-bit integer offsets to a 64-bit base pointer in code where the excessive use of 64-bit pointers resulted in slowdowns. While this sounds like a lot of trouble, someone else has already done the tuning. Therefore, finding real-world examples where the 32-bit version runs faster than the 64-bit version may be rather difficult.
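The 32-bit-offsets-from-a-base-pointer trick mentioned above can be sketched as follows. The node type and pool are hypothetical, but they show how replacing an 8-byte pointer with a 4-byte index keeps the structure compact on a 64-bit build:

```c
#include <stdint.h>

#define NIL UINT32_MAX  /* sentinel index marking end of list */

/* A linked structure stored in an array pool, with 32-bit indices
 * instead of full pointers.  On an LP64 build this keeps each node
 * at 8 bytes, where a real `struct node *next` member would push it
 * to 16 (8-byte pointer plus alignment padding), doubling the
 * memory traffic for pointer-heavy working sets. */
struct node {
    int32_t  value;
    uint32_t next;   /* index into the pool, not a pointer */
};

long sum_list(const struct node *pool, uint32_t head)
{
    long sum = 0;
    for (uint32_t i = head; i != NIL; i = pool[i].next)
        sum += pool[i].value;
    return sum;
}
```

A usage example: a pool of `{10, 1}, {20, 2}, {30, NIL}` traversed from index 0 sums to 60, and `sizeof(struct node)` stays 8 on both 32-bit and 64-bit builds.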


Has anyone checked the memory used 32 vs 64? Both in program size and memory used during the run?

Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

I think that is not likely. The bus on the RPi is 128 bits wide, which is why we can read four 32-bit registers at a time in 32-bit ARM without stalling the pipeline.

Since development of most mainstream desktop applications now target 64-bit platforms, I suspect most code that showed performance regressions on 64-bit platforms has already been rewritten. For example, one could use 32-bit integer offsets to a 64-bit base pointer in code where the excessive use of 64-bit pointers resulted in slowdowns. While this sounds like a lot of trouble, someone else has already done the tuning. Therefore, finding real-world examples where the 32-bit version runs faster than the 64-bit version may be rather difficult.

I think you need to take a look at real-world applications. Yes, the extreme Fibonacci example will perform better on a 64-bit system, but most applications will not push the limits of the architecture.

We still do not have a single-cycle 32-bit divide, and it takes longer in 64-bit; there are many more examples where 32-bit is faster than 64-bit. Also, as we can move 128 bits at a time to or from RAM on a cache miss in either 32-bit or 64-bit mode, there is no advantage there either.

With the 32-bit ARM and its MMU we have the ability to address a space more than 8000 times bigger than the memory available on any of these systems. So we do not need 64-bit for memory access.

There are a few examples that we all know of where 64-bit is faster; these are the exceptions, not the rule. Using the exceptions to make something sound faster and better does not really work out in the end.

There is a reason that 32-bit systems still persist on every platform for which 64-bit is available. Those that use the 64-bit versions do it more for the bragging value, or they do not know the truth of the performance. There is a reason that many who do know still flock to 32-bit x86 Linux even when their CPU supports AMD64 Long Mode. There is a reason that there is huge demand for 32-bit ReactOS, though not really anything pushing the 64-bit version along.

So I must disagree on this issue. 32-bit rules, and will until every advantage of the 32-bit ARM is matched on the 64-bit ARM, including the execution timing of any given instruction.

RPi = The best ARM based RISC OS computer around
More than 95% of posts made from RISC OS on RPi 1B/1B+ computers. Most of the rest from RISC OS on RPi 2B/3B/3B+ computers

We still do not have a single cycle 32-bit divide, and it takes longer in 64-bit, there are many more examples where 32-bit is faster than 64-bit.

Dividing large numbers is slower than dividing small numbers. If the numbers are the same size, then a 32-bit divide takes a similar time to a 64-bit divide; 42/12 will take the same time on both platforms. Obviously a 64-bit divide can deal with much larger numbers and so may potentially take longer, but that is not relevant here.
Divide will never take one cycle on any platform, even Intel's.

So I must disagree on this issue. 32-bit rules, and will until every advantage of the 32-bit ARM is matched on the 64-bit ARM, including the execution timing of any given instruction.

You should look at the conditional instructions: the 64-bit ones have one less dependency than the 32-bit ones, and work better with modern CPUs (CSET/CSEL/CINC/CNEG/CINV etc.). LDP/STP is much, much faster than LDM/STM.

Simple things like ADD take the same time even though the 64-bit version can handle much larger numbers.

We still do not have a single cycle 32-bit divide, and it takes longer in 64-bit, there are many more examples where 32-bit is faster than 64-bit.

Dividing large numbers is slower than dividing small numbers. If the numbers are the same size, then a 32-bit divide takes a similar time to a 64-bit divide; 42/12 will take the same time on both platforms. Obviously a 64-bit divide can deal with much larger numbers and so may potentially take longer, but that is not relevant here.
Divide will never take one cycle on any platform, even Intel's.

Not long ago we said the same thing about multiply: everyone believed that a single-cycle multiply was not possible without increasing propagation delay to an unacceptable level. That has been proven wrong, so I can see a time when the same is true of divide. As it stands, implementing a single-cycle divide introduces too much propagation delay, which is the same issue we had with multiply. The other solution, breaking a divide across multiple pipeline stages, is not acceptable because it would make the pipeline way too deep to manage performance in a sane way (optimization would be beyond even compilers of the highest caliber).

Though just because it is not done does not mean it cannot be done. And Intel is a poor example of anything, except for lackluster design.

So I must disagree on this issue. 32-bit rules, and will until every advantage of the 32-bit ARM is matched on the 64-bit ARM, including the execution timing of any given instruction.

You should look at the conditional instructions: the 64-bit ones have one less dependency than the 32-bit ones, and work better with modern CPUs (CSET/CSEL/CINC/CNEG/CINV etc.).

So you are saying that it is lower latency to not be able to have every instruction conditional?
I would argue that, big time. That is the one thing missing from AARCH64 that will forever kill potential performance.

There are a bunch of cases where there is a huge advantage to having every instruction conditional (I know that a few of the newer instructions are not), and to having the ability to specify which instructions set flags.

LDP/STP is much much faster than LDM/STM.

That is true. Though there are other ways around that issue, using NEON (OK, it is a coprocessor, but it is standard now), which is equally fast on both.

So not really an advantage in most situations, with very few exceptions.

Also, that is not an issue of the ISA but of the implementation; it would be fairly easy to make LDM/STM single-cycle for any load of up to 4 registers (128 bits), without adding much to the implementation and without increasing propagation delay in any stage of the pipeline.

Simple things like ADD take the same time even though the 64-bit version can handle much larger numbers.

That is a given; the propagation delay through the carry-lookahead gates is minimally different between the two widths when done correctly.

So I stand on my argument.


2) There is no "huge advantage to have every instruction conditional".
As evidenced by the fact that RISC-V does not do that. If it were advantageous, the RISC-V designers would have used it. They have been studying and experimenting with these things for decades; they know. Besides, actual RISC-V devices demonstrate it is not required.

3) There is nothing "lackluster" about what Intel has achieved. One can argue that the x86 is a mess, but Intel, bless 'em, has invested billions over the decades in efforts to get off it onto something else: the i432, i860, Itanium. It is their customers that continually demand more of the same, so they have obliged.

4) Real-world applications have demanded 64-bit computing. The likes of Google would not buy all that 64-bit hardware if it were less efficient.

So you are saying that it is lower latency to not be able to have every instruction conditional?

I was just saying the new conditional instructions in A64 have one less dependency than the A32 ones.
They work in a different way. They are always executed and therefore the destination register is not dependent on its previous value.

I suspect the new conditionals were chosen as being the most useful ones.
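The dependency-breaking point is visible from C: a simple conditional expression like the one below is typically lowered to CSEL on arm64. The codegen in the comment is an assumption about typical gcc/clang output, not a guarantee:

```c
/* On arm64 a compiler will typically lower this to
 *     cmp  w0, w1
 *     csel w0, w0, w1, gt
 * CSEL always writes its destination, so the result does not depend
 * on the register's previous contents.  The A32 route (a predicated
 * MOVGT into a live register) leaves the destination dependent on
 * its old value, adding one more input to the dependency chain.
 * (Codegen shown is typical, not guaranteed.) */
int max_int(int a, int b)
{
    return a > b ? a : b;
}
```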

That is the one thing missing from AARCH64 that will forever kill potential performance.

The exact opposite: it was to enable high performance on future ARM architectures. Pretty obviously, any out-of-order CPU will benefit. And it frees up four bits in the opcode, enabling 32 registers instead of 16 - a huge benefit.
It sounds like you think the ARM CPU designers are wrong - which I very much doubt.

LDP/STP is much much faster than LDM/STM. That is true.
Here is a cool thing!
I like LDP/STP because you can give the same register twice, which you can't with LDM/STM.
For example, I have a C structure that is 16 bytes in size and I want to zero it all.
The compiler changes "memset( &mystruct, 0, 16 )" into say "STP XZR, XZR, [X25]" (using register 31, the zero register)
You can't do that in one instruction with STM.
Edit: You can do it in two instructions with NEON - as you say!
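In C terms, the same zeroing looks like this (the struct name is mine; the STP codegen noted in the comment is typical gcc/clang behaviour with optimization on, not a guarantee):

```c
#include <string.h>
#include <stdint.h>

/* A 16-byte structure.  arm64 gcc/clang typically compile the
 * memset below into a single
 *     stp xzr, xzr, [x0]
 * rather than a libc call -- two 64-bit zero stores from the
 * always-zero register in one instruction.  (Typical codegen,
 * not guaranteed.) */
struct sixteen {
    int64_t a;
    int64_t b;
};

void zero_it(struct sixteen *s)
{
    memset(s, 0, sizeof *s);   /* -> stp xzr, xzr, [...] on arm64 */
}
```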

So you are saying that it is lower latency to not be able to have every instruction conditional?

I was just saying the new conditional instructions in A64 have one less dependency than the A32 ones.
They work in a different way. They are always executed and therefore the output register is not dependent on the previous value. That can break a dependency chain.

I suspect the new conditionals were chosen as being the most useful ones.

That is the one thing missing from AARCH64 that will forever kill potential performance.

The exact opposite: it was to enable high performance on future ARM architectures. Pretty obviously, any out-of-order CPU will benefit. You sound like you think the ARM CPU designers are wrong and/or stupid - which I very much doubt.

Not by a long shot. I think, rather, that the advantages one way or the other are unbalanced. The AARCH64 feels like an experimental ISA. As for the dependency chain, that is on the coder.

On a personal note, I still feel (because of the research we did while I was in university) that in-order multiple-issue architectures are a better choice than out-of-order multiple-issue architectures. Either way you are unlikely to execute more than 4 instructions per cycle in a single stream (the limits of dataflow, regardless of the number of registers), and either way you have about an equal chance of issuing more instructions in parallel in a single stream. In-order multiple issue has the advantage of being simpler to implement, and it reduces potential propagation issues by being able to issue instructions without any extra pipeline delays (unlike most out-of-order implementations). It uses fewer components, simplifies the pipeline, and at least equals potential performance. In either case there will need to be well-optimized code.

LDP/STP is much much faster than LDM/STM. That is true.
Here is a cool thing!
I like LDP/STP because you can give the same register twice, which you can't with LDM/STM.
For example, I have a C structure that is 16 bytes in size and I want to zero it all.
The compiler changes "memset( &mystruct, 0, 16 )" into say "STP XZR, XZR, [X25]" (using register 31, the zero register)
You can't do that in one instruction with STM.
Edit: You can do it in two instructions with NEON - as you say!

Yes, there are definite advantages to the LDP/STP instructions. Now if we could get our conditionals back, and have a way to execute normal ARM code without having to go through 3 state changes each way. Either that, or have the licensing cost on 32-bit ARM cores go way down so more companies are compelled to use them, if ARM really wants to push AARCH64 on the world in place of the ARM ISA.


So now you can run 64-bit software on 32-bit Raspbian without resorting to multiarch. LXC is not as user-friendly as Docker but gets the job done. With this proof-of-concept I don't see any fundamental reason that this shouldn't be possible with docker-ce:armhf, so I'll file a ticket with them.

Last edited by jdonald on Fri Dec 21, 2018 7:00 pm, edited 1 time in total.

My 2 cents is that while I'd love to see a 64-bit "official" Raspbian, I'd prioritize a 64-bit Raspberry Pi Desktop, because I think it would benefit the educational mission of the foundation to get teachers to switch lab computers and their own laptops/desktops to the environment their students are using (without, e.g., losing the benefit of >4 GB of memory, which is especially useful in the edu environment for stuff like media creation).

If kids NEED a 64-bit OS, then does the OS distribution NEED to be Raspbian?
I'm assuming serious stuff needs 64-bit, so anyone doing that sort of stuff could learn any OS.
In 10 years' time will kids be saying "What's Linux? I use Raspbian"?
Raspbian, then PiCore, then Gentoo64, and on my PCs Mint.
That's 4 Linux distributions I use to write OS-less code for Pis with Ultibo.

Blender for Artists works better on my Gentoo64 Pi box
Neddy Seagoon and Sakaki have made Gentoo64 work on Pi's.
It is almost at the stage where I can move Pi development 100% to Gentoo64.

Heater, don't have internet here for my Gentoo64 box, will get back to you on the answer.
WebGL uses OpenGL ES; this is something I want to do, and even extend it to glTF.
That's why I have moved to a 64-bit OS on Pis: to find answers to stuff like this.

If kids NEED a 64-bit OS, then does the OS distribution NEED to be Raspbian?

From what I understand, Raspbian is for the benefit of teachers and parents who just want the computer to be configured by default in a way suitable for children to learn computer science. Raspberry Pi Desktop on x86 compatible hardware provides a familiar programming environment for those already comfortable with Raspbian.

Given that common desktop computers have been 64-bit capable for more than a decade, I found it surprising that Raspberry Pi Desktop is 32-bit. While very few people are running Pentium III, original Athlon or earlier processors, there are some nested virtualization setups (e.g. a virtual machine running inside another virtual machine) for which 32-bit is required. Maybe that is why Raspberry Pi Desktop is 32-bit.

While very few people are running Pentium III, original Athlon or earlier processors,....

Hmmm... I think my Win98SE box has a P-III. I know that my SuSE system has dual Opteron-240 CPUs, though. I built that one in 2003 and picked SuSE because it was the only commercially available 64-bit Linux at the time.

I see value in a 64-bit Raspbian due to the performance implications of having 64-bit instructions for arbitrary-precision arithmetic and the potential to improve the speed of certain applications, e.g. Mathematica. I'd switch to Raspberry Pi Desktop if it were 64-bit. What I'm doing isn't serious enough to need multiple OSes, so I'd be happy to stick with one.

Heater, don't have internet here for my Gentoo64 box, will get back to you on the answer.

Heater - 19 fps on my Celeron Core Duo Mint box and 5 fps on my Gentoo64 Pi3B+.
Aquarium crashed Firefox on Gentoo64 and runs at 10 fps on Mint.
Gentoo64 Firefox is the Nightly 63.0.3 Developer's version.
The saga Neddy Seagoon went through to compile it is on the Gentoo forums.

If I remember right, Firefox is using the Servo engine; I'm not sure if that has VC4 hardware acceleration.
Yep, this WebGL stuff is pushing the Firefox browser in Gentoo64 over the edge.
Firefox might crash, but Gentoo64 is still running.

Mind you, the point of me getting a 64-bit OS was to learn how to code GL without an OS.
So a browser that is perfectly fine with many tabs of text/GL tutorials is OK for my stuff, at the moment.
Who knows, now that AArch64 Pis are doable, engines like Servo could be improved by someone from the Pi community.

For gaming, a Steam-type bare-metal OS running glTF models should be possible.
Probably not for commercial use, but certainly for research and computer-science coding.
This is not something that needs to be handicapped by doing it only in 32 bits.
I suspect all four of those 64-bit 1.4 GHz ARM cores, with 128-bit NEON, will be needed.