... there is still the requirement that Raspbian be compatible across the entire Raspberry Pi line.

This is the biggest problem. Do you create and maintain separate, model-specific images, or do you try to roll them into one version?

Raspbian images have become popular for their simplicity (just burn-n-boot), but people already get confused with just three images (Lite, Desktop and Full), and NOOBS vs plain images only adds to the confusion. This will only get worse with more images.

I think it's time to at least get rid of NOOBS in favor of PINN, so there can be one standardized installation method (burn an image to the card with Etcher) which works regardless of filesystem or card size.

Although we might need to come up with a new name, since the acronym won't make sense when NOOBS is no longer around.

As far as images go, I'd still like to have the 64-bit option, but I don't know how much more work will be required to maintain both 32- and 64-bit images.


Testing was done on the same 2GB Pi 4 with the same USB3 adapter and SSD, headless, with no GUI running. Only the SD card was swapped out between the 32-bit and 64-bit kernel runs. The SD cards were the same brand/model/size. The test was run three times for each config. Results were consistent, within 40 seconds best to worst for any particular config.

The pure Raspbian config being 26% slower than the 64/64 config was a big surprise. The job does not appear to be CPU-constrained under Raspbian, with the "busy core" rarely exceeding 65% and the other three cores mostly idle. Under 64/64 the busy core approached 80% utilization. Temperature was never an issue and the CPU never throttled in any of the tests.

There you have it. 64-bit does have advantages even for some rather normal tasks (not just the edge cases a few people think of). You would probably see an even bigger difference if that backup were encrypted, since the AArch64 instruction set copes better with AES.

There you have it. 64-bit does have advantages even for some rather normal tasks (not just the edge cases a few people think of).

Is this really a "normal task" though?

No one has disputed that 64-bit can be faster, and for some things there will be more gains than in others, and the longer those things take the more notable the differences may be.

When what one is doing only takes a short time, performance gains are far less noticeable. A web page which takes 'no time at all' to render would seem just as fast as one which takes even less.

And likewise when one isn't particularly worried about the time taken, say one were creating a 100GB backup rather than restoring one and waiting on that completing, an extra 26% probably wouldn't even be noticed.

This is just like any other benchmarking which shows particular things are better for one thing than another, but that doesn't always or necessarily translate to being noticeably better for all one does.

My solution to finding all the anagrams in the 650,000-odd words of the british-english-insane dictionary file, as provided by the Raspbian package wbritish-insane, is substantially slower on a Pi 3 running 64-bit Debian than on 32-bit Raspbian:

My solution to finding all the anagrams ... is substantially slower on a Pi 3 running 64-bit Debian than on 32-bit Raspbian

I guess that goes to show that 64-bit is better if one has code which suits 64-bit but things may be worse when one doesn't.

But I would also suggest that, for many things, being somewhat slower is no more noticeable than being somewhat faster. The overall gain or loss for an individual would all depend on what they were doing.

I'm not saying the opposite either, that 64-bit is always better. It isn't, but crypto, compression and similar things which are part of day-to-day use (under the hood, since a lot of software relies on them) see better results. Yes, there is a disadvantage in that memory usage increases a bit because of 64-bit pointers, and you can see that it hurts performance a little, especially on the Pi 2 and 3 (much lower memory bandwidth), but on the Pi 4 I see no reason not to use 64-bit, at least for the 2 and 4 GB models.

And as has been said countless times, software is slowly moving to 64-bit. I would like to use Firefox, for example, because I trust Mozilla more than Google, and Chromium is no option if Google eventually makes ad blockers unusable. Then there are emulators, a few of which would surely see a benefit. And Dolphin doesn't work at all, since it requires 64-bit (and has good reasons to only support 64-bit, as you can read in a blog post linked earlier).
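An easy way to check the crypto part for yourself is to run the same OpenSSL benchmark under a 32-bit and a 64-bit userland and compare the throughput columns (just a sketch; aes-256-cbc is an arbitrary example cipher):

```bash
# Rough check of the crypto claim: run this under a 32-bit userland and
# again under a 64-bit one, then compare the kB/s figures it prints.
# (aes-256-cbc is just an arbitrary example cipher.)
openssl speed -evp aes-256-cbc
```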

And before anyone says "if you want all of this the Pi is not for you": I have a roughly five-year-old notebook with 16 GB RAM, an i7 and an SSD which is perfectly able to do all of it, but at several times the power usage and heat, because x86(_64) is a bloated and inefficient arch. I'm actually impressed how close the Pi 4 is to this machine, and the advertisement as a desktop computer is justified, but then it should be able to run most of the desktop software (minus commercial software, which is not coming, no chance). I also understand the RPiF has other priorities, and I'm not getting my hopes up for an official 64-bit option any time soon, but hopefully they consider it in the future. Even a 64-bit kernel with a 32-bit userland would be OK (since we can add a 64-bit userland easily).

OpenMW: Looks like @TotalJustice got it running on a Pi 4: https://www.youtube.com/watch?v=bSxf9afRRyo
Not sure if they used an older source snapshot, ran a 64-bit OS, or just managed to make the latest 32-bit build run, but @bomblord, you'll probably want to hit them up for the binaries before going through the onerous build procedure yourself.

Firefox armhf is supported on Buster though. See installation instructions. What happened with Stretch is that we had two years of darkness due to an armhf-specific regression (AAPCS) in gcc 5. Things are hopefully going to be okay on every FF release for the next two years until Debian Bullseye (fingers crossed).

Firefox armhf is supported on Buster though. See installation instructions. What happened with Stretch is that we had two years of darkness due to an armhf-specific regression (AAPCS) in gcc 5. Things are hopefully going to be okay on every FF release for the next two years until Debian Bullseye (fingers crossed).

That seems rather "workaroundish" to me, adding Ubuntu repos to Raspbian. If this was really just a compiler bug preventing it from working, shouldn't we see builds in the default repos soon?

Chromium and Firefox both work in Gentoo64 on the Pi 3B+.
They kept swapping positions as the fastest browser over the 6+ months I used them.

The latest Raspbian Buster Chromium seems to have caught up and passed them.
I expect that to change soon.

Arch and Gentoo are both rolling distributions; Debian is a bit less bleeding edge.
Use 32-bit Raspbian Buster if you don't like experimenting or rolling OSes.
It is now a pretty good OS, even on a Pi 3B+.

It is probably going to take six months before we get a choice of 64-bit OSes.
Try them all and find out which is "best" for your application.

Choice is good, diversity is good.
Thank goodness MS did not take Pis seriously, else we would all be using Win10.

The above three cases are Debian aarch64 chroot, Debian armhf chroot, and Raspbian armhf on the metal. It's all running on the same Pi, same SD card, in the same boot session.

There's some variation (as if something occasionally interferes even though I have lightdm turned off), but most of the time the aarch64 run is 6% faster than 32-bit, as shown above. For whatever reason, running with Raspbian system libraries is consistently 1% slower than the Debian armhf container.
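For anyone wanting to reproduce that kind of side-by-side setup, here is a rough sketch (it assumes a 64-bit kernel so the aarch64 chroot can run natively, and the paths are just examples):

```bash
# Sketch: build Debian Buster root filesystems for both architectures,
# then chroot into each from the same running system. Needs a 64-bit
# kernel so the aarch64 binaries can execute; paths are just examples.
sudo apt install debootstrap
sudo debootstrap --arch=arm64 buster /srv/deb-arm64 http://deb.debian.org/debian
sudo debootstrap --arch=armhf buster /srv/deb-armhf http://deb.debian.org/debian

# Then run the same benchmark binary inside each environment:
sudo chroot /srv/deb-arm64 /bin/bash
sudo chroot /srv/deb-armhf /bin/bash
```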

On top of different system releases it doesn't sound like a well-controlled experiment if those are two different kernels. Reminds me of a year ago somebody on reddit convincingly wrote that they got dramatic speedups with 64-bit on elliptic curve cryptography, then it turned out the author was also comparing across different distros.

The above three cases are Debian aarch64 chroot, Debian armhf chroot, and Raspbian armhf on the metal. It's all running on the same Pi, same SD card, in the same boot session.

There's some variation (as if something occasionally interferes even though I have lightdm turned off), but most of the time the aarch64 run is 6% faster than 32-bit, as shown above. For whatever reason, running with Raspbian system libraries is consistently 1% slower than the Debian armhf container.

Did you compile with "-march=armv8-a+crc+simd -mtune=cortex-a72 -mfloat-abi=hard" for the 32-bit builds? All the libs and everything on Raspbian are built for ARMv6, which means they don't even come close to using the chip's full potential.
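For reference, that would look roughly like this (just a sketch; anagrams.c is a placeholder file name, and the Rust line is the closest equivalent for a cargo project):

```bash
# Sketch: build with the Pi 4-friendly 32-bit flags from the question,
# instead of Raspbian's ARMv6 defaults. anagrams.c is a placeholder name.
gcc -O2 -march=armv8-a+crc+simd -mtune=cortex-a72 -mfloat-abi=hard \
    -o anagrams anagrams.c

# Rough Rust equivalent:
RUSTFLAGS="-C target-cpu=cortex-a72" cargo build --release
```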

The above three cases are Debian aarch64 chroot, Debian armhf chroot, and Raspbian armhf on the metal. It's all running on the same Pi, same SD card, in the same boot session.

There's some variation (as if something occasionally interferes even though I have lightdm turned off), but most of the time the aarch64 run is 6% faster than 32-bit, as shown above. For whatever reason, running with Raspbian system libraries is consistently 1% slower than the Debian armhf container.

On top of different system releases it doesn't sound like a well-controlled experiment if those are two different kernels. Reminds me of a year ago somebody on reddit convincingly wrote that they got dramatic speedups with 64-bit on elliptic curve cryptography, then it turned out the author was also comparing across different distros.

Still, it's too small a difference for the human eye to see (if I understood the numbers correctly). We're dealing with hundredths of a second of difference in some cases.

Still, it's too small a difference for the human eye to see (if I understood the numbers correctly). We're dealing with hundredths of a second of difference in some cases.

You may not care about these hundredths of a second, but it does add up, especially since there is more running than just a single program/process. Of course they don't all benefit, or benefit in the same way, but you get the idea. It all contributes to system performance.

That's right, running the anagram program as a benchmark like that is pretty hopeless. Using "time" is not a good way to do such benchmarks, and the execution time is getting so small that getting a stable result is impossible.

When I first wrote that algorithm, in JavaScript, it was taking about 10 seconds, and we have versions in BASIC that take many minutes and even hours! So using "time" then was reasonable.

Still, having run the thing many times on 32- and 64-bit ARM and 64-bit Intel, there are definitely big differences, not just a few percent. Different variants of that code perform best on different machines. I have yet to figure out what the actual differences in the code are that account for that. They are all very similar.

At the end of the day I don't care about differences of hundredths of a second, or even whole seconds, running that program. It's just that it's a new language to me and I'm trying to get a feel for how to write efficient code with it. If it's a language with a future on our servers it will have to perform well.


I am ever hopeful that ARM64 Neverware Cloudready will materialise.......

Thanks pica200 for pointing out the ARMv6 vs ARMv7 catch. Running rustup show revealed I was using the stable-arm-unknown-linux-gnueabihf toolchain & corresponding target but the correct one to try here is stable-armv7-unknown-linux-gnueabihf.
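Roughly what that switch looks like (a sketch; rustup show will report the exact toolchain names on your system):

```bash
# See which host toolchain and target rustup is currently using
rustup show

# Install and switch to the ARMv7 host toolchain instead of the ARMv6 one
rustup toolchain install stable-armv7-unknown-linux-gnueabihf
rustup default stable-armv7-unknown-linux-gnueabihf

# Rebuild so the benchmark actually uses the ARMv7 code generation
cargo clean && cargo build --release
```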

That's right, running the anagram program as a benchmark like that is pretty hopeless. Using "time" is not a good way to do such benchmarks, and the execution time is getting so small that getting a stable result is impossible.

Yeah, for small runtimes we can get large swings from all the unknowns of network traffic or my USB keyboard interrupts. To mitigate that, here's a script to run it 50 times and get a histogram:
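A minimal sketch of such a runner (assuming the binary is ./anagrams and its output goes to anagrams.txt, as mentioned below):

```bash
#!/bin/bash
# Run the program 50 times and print a crude histogram of the elapsed times.
# Assumptions: the binary is ./anagrams and its output goes to anagrams.txt.
TIMEFORMAT='%2R'               # bash `time`: real time only, 0.01 s resolution
> runtimes.txt                 # start with a clean log
for i in {1..50}; do
    { time ./anagrams > anagrams.txt; } 2>> runtimes.txt
done
sort -n runtimes.txt | uniq -c # count of runs per distinct runtime
```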

That way I can always get a consistent top-ranking runtime. Unfortunately the additional script process or additional redirect seems to add 45 ms to each run, but at least the relative runtime ratios remain consistent.


I think 64-bit is mostly worth it: you get double the NEON registers, which can improve performance drastically if the software is written to take advantage of them.
However, an app that is constrained by memory bandwidth and does zillions of pointer manipulations will probably suffer.

That way I can always get a consistent top-ranking runtime. Unfortunately the additional script process or additional redirect seems to add 45 ms to each run, but at least the relative runtime ratios remain consistent.

Maybe put the loop in the Rust program itself to minimize the system overhead?

Looks like Heater is starting to port the anagrams program to cargo bench which will take care of that.

Technically there's even more overhead from the existing redirect, and that can be removed by changing > anagrams.txt to > /dev/null or giving the program a --quiet option.

For a more widely known benchmark, today I ran the cpu test from sysbench (sudo apt install sysbench). Out of the box on Debian Buster, the 64-bit version is more than 10x faster than 32-bit for that prime number sieve implementation.

It turns out that on Debian armhf, by default gcc never generates the udiv/sdiv instructions, instead resorting to a conventional slow integer division sequence. Those instructions aren't available on Cortex-A9 processors so Debian or any "armhf" Linux flavor errs on the side of compatibility.

This highlights how important it is to generate the 32-bit code with the -mcpu arg (see jahboater's comment on the first page) for these sorts of comparison experiments. -mcpu makes more significant changes than -mtune because it's allowed to break compatibility. Once I added -mcpu=cortex-a72, sysbench got 10x faster in the 32-bit case, becoming on par with aarch64.

So even when we're really careful with our controls, it's easy to miss these things in 32-bit vs 64-bit experiments. This particular case is even more sinister than the effects of Raspbian's ARMv6 binaries. One way to describe it is that it's using ARMv7 while limiting itself to the instruction set of an older ARMv7!
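A quick way to see the udiv/sdiv effect in isolation (just a sketch; the file names are made up):

```bash
# Sketch: show what gcc emits for a plain integer division with and
# without -mcpu. File names here are made up for illustration.
cat > div.c << 'EOF'
int div(int a, int b) { return a / b; }
EOF

# Debian armhf default: division becomes a call to the __aeabi_idiv helper
gcc -O2 -S -o default.s div.c

# With -mcpu=cortex-a72, gcc is free to emit the hardware sdiv instruction
gcc -O2 -mcpu=cortex-a72 -S -o a72.s div.c

grep -E 'sdiv|__aeabi_idiv' default.s a72.s
```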

This highlights how important it is to generate the 32-bit code with the -mcpu arg (see jahboater's comment on the first page) for these sorts of comparison experiments. -mcpu makes more significant changes than -mtune because it's allowed to break compatibility. Once I added -mcpu=cortex-a72, sysbench got 10x faster in the 32-bit case, becoming on par with aarch64.

So even when we're really careful with our controls, it's easy to miss these things in 32-bit vs 64-bit experiments. This particular case is even more sinister than the effects of Raspbian's ARMv6 binaries. One way to describe it is that it's using ARMv7 while limiting itself to the instruction set of an older ARMv7!

-mcpu has been deprecated on x86 at least, and other archs will probably follow in the future. -march + -mtune together are exactly the same as using -mcpu: -march specifies the minimum architecture the code should run on, and -mtune tells gcc to optimize for a specific CPU within the range of CPUs for that architecture.
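So on the Pi 4 the following two invocations should be interchangeable (prog.c is just a placeholder):

```bash
# The shorthand form:
gcc -O2 -mcpu=cortex-a72 -o prog prog.c

# ...and the explicit -march/-mtune spelling of the same thing:
gcc -O2 -march=armv8-a+crc+simd -mtune=cortex-a72 -o prog prog.c
```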