Why you should run a 64 bit OS on your Raspberry Pi4
One of the cool things about working for a software company is that you often get new hardware prototypes to test.
But that is not the case here: I bought the Rpi4 because it’s extremely cheap!
The Rpi4 comes with a quad core ARM Cortex-A72, up to 4 GB of RAM and a gigabit Ethernet port, at a very low starting price of $35.
Raspberry provides Raspbian (a Debian derivative), a ready-made distro for their products, so I put it on an SD card to boot the board quickly.
I was looking at the syslog and I noticed that, uh, both the kernel and the whole userland are compiled as armv7, which means 32 bit ARM.
I know for sure that the RPi4 is 64 bit capable, so I refused to run a 32 bit OS on it. I got another SD card and installed Debian on it. A lean and mean Debian compiled as aarch64, which means 64 bit ARM.
As soon as the 64 bit OS booted, I was curious to know how much better it performs than the 32 bit one, so I did some tests.
EDIT: by popular demand, I’m publishing the Debian image.
The two partitions (boot and root) are compressed in a .tar.xz file, and there is a convenient script, mksd, which partitions an SD card and extracts the above.
I’ve kept it simple, so it’s a very minimal distribution: you have to install your preferred tools by hand.
The kernel is not the vanilla one I used in the tests, but the stable 4.19 by Raspberry, because it supports a whole range of devices that my build doesn’t.
The system is configured to get an IP via DHCP on the Ethernet interface. Log in via SSH with the credentials user/user and then gain root with
I’ve put the whole thing in a zip archive here:
Feedback is welcome.
Raspberry just started selling the Raspberry Pi4 with 8 GB RAM.
As you can imagine, this is another good reason to use a 64 bit kernel, otherwise the usable memory will be limited to a mere 3 GB.
The first test that came to my mind was the old Dhrystone benchmark, which has existed since the dawn of time.
Dhrystone is a program written in 1988 which does some math calculations.
It’s unlikely to resemble any modern workload; the only reason we still use it is to keep some consistency with past architectures and software.
A modern number crunching application could be some hash calculation, so I wanted to do a SHA1 test. Unfortunately the Debian sha1sum utility was compiled without libssl or kernel crypto support, so I had to compile it from source.
To avoid an I/O bottleneck, I calculated the hash of a 2 GB sparse file created with truncate -s 2GB, so the I/O from the SD card was zero:
SHA1 hashing is a more real-life benchmark than Dhrystone, as this algorithm is used in a lot of applications, e.g. torrent, git, etc.
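For readers who want to try something similar without rebuilding sha1sum, here is a minimal sketch of the same idea in Python. It is not the tool used for the numbers in this post: it relies on Python's hashlib instead of a hand-compiled sha1sum, and the function names (make_sparse_file, sha1_of_file, time_sparse_sha1) are made up for this example.

```python
import hashlib
import os
import tempfile
import time


def make_sparse_file(path, size):
    """Create a file of `size` bytes without writing any data."""
    with open(path, "wb") as f:
        # Extends the file with zeros; on filesystems that support holes
        # no data blocks are allocated, like `truncate -s`.
        f.truncate(size)


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so memory use stays flat."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def time_sparse_sha1(size):
    """Return (hex digest, elapsed seconds) for a sparse file of `size` bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse.bin")
        make_sparse_file(path, size)
        start = time.perf_counter()
        digest = sha1_of_file(path)
        return digest, time.perf_counter() - start


if __name__ == "__main__":
    size = 2 * 1000**3  # 2GB, the way `truncate -s 2GB` counts it
    digest, seconds = time_sparse_sha1(size)
    print(f"{digest}  {seconds:.2f} s  {size / seconds / 1e6:.1f} MB/s")
```

Run it once under each OS and compare the MB/s figures. Since the file has no data blocks, the time measured is almost entirely hashing, not SD card reads.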
On a 64 bit system, RAM can be accessed with 8 byte reads and writes per instruction.
I wrote a simple tool which allocates a big buffer, writes it and then reads it back. To be sure that the RAM was really allocated I used mlock() on the whole buffer. In this test the buffer is 2 GB; a 3 GB buffer worked in 64 bit mode but gave an out-of-memory error in 32 bit.
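The shape of such a test can be sketched in a few lines of Python. This is only an illustration, not the tool used here: memory_roundtrip and its parameters are hypothetical names, and the sketch does not call mlock(), so it forces the pages to be allocated by writing to all of them but cannot stop the kernel from swapping them out.

```python
import time


def memory_roundtrip(size, pattern=0xA5, chunk_size=1 << 20):
    """Allocate `size` bytes, write every byte, then read every byte back.

    Returns (write_seconds, read_seconds, ok). No mlock() is used here
    (assumption: the system is not swapping during the run).
    """
    buf = bytearray(size)                    # zero-filled allocation
    view = memoryview(buf)
    chunk = bytes([pattern]) * chunk_size    # small reusable source block

    start = time.perf_counter()
    for off in range(0, size, chunk_size):   # write pass: touches every page
        n = min(chunk_size, size - off)
        view[off:off + n] = chunk[:n]
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    ok = buf.count(pattern) == size          # read pass: scans the whole buffer
    read_s = time.perf_counter() - start

    return write_s, read_s, ok


if __name__ == "__main__":
    size = 256 * 1024 * 1024                 # 256 MiB, raise it if RAM allows
    w, r, ok = memory_roundtrip(size)
    print(f"write {size / w / 1e6:.0f} MB/s, read {size / r / 1e6:.0f} MB/s, ok={ok}")
```

The write pass copies a 1 MiB block over and over so that the source stays in cache and only the destination buffer hits RAM. Expect absolute numbers to differ from a C tool; the interesting part is the ratio between the 32 and 64 bit runs.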
I noticed that many Rpi users use the board as a media center, so I did an audio encoding test with the two most used codecs.
I encoded “Echoes” by Pink Floyd because it’s a very long track, long enough to obtain measurable values. To avoid I/O, both the source and the destination file were on a ramfs:
Another use of the Raspberry boards is to act as a simple VPN or firewall.
I don’t endorse the usage of such systems for this purpose, but many people still have slow <100 Mbit links, so they can turn a blind eye to the poor Rpi performance.
The first question is: how much traffic can the Rpi4 handle?
First we need to measure the pure networking power of the board, without the limitations of the physical interface, so I ran an iperf3 session between two containers.
Beware: containers usually communicate via a veth pair, and veth is known to accelerate traffic via a lot of fake offloads.
IP checksum offload is done by just skipping the checksum calculation, while TCP segmentation offload is done by never segmenting or reassembling the traffic: big 64k chunks of data are just passed in memory as is.
To work around this, I disabled the offloads with
ethtool -K veth0 tx off rx off tso off gro off gso off
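To give an idea of the work that comes back once the offloads are off, here is a small sketch of the two things veth was skipping. The checksum follows the 16 bit one's-complement algorithm described in RFC 1071. The segment count assumes that "64k" means 65536 bytes and that the MSS is 1460 bytes (a 1500 byte MTU with plain IPv4 and TCP headers); both are assumptions of this example, not measurements from the test.

```python
def internet_checksum(data: bytes) -> int:
    """16 bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # big-endian 16 bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


def segments_needed(payload_bytes: int, mss: int = 1460) -> int:
    """Number of TCP segments needed to carry the payload (ceiling division)."""
    return -(-payload_bytes // mss)


if __name__ == "__main__":
    sample = bytes.fromhex("0001f203f4f5f6f7")    # example bytes from RFC 1071
    print(hex(internet_checksum(sample)))         # 0x220d
    print(segments_needed(65536))                 # 45
```

So with the offloads disabled, every 64k chunk becomes about 45 separate packets, each one checksummed byte by byte by the CPU, which is much closer to what happens with real traffic.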
The fastest thing that a network appliance can do is to drop traffic, and the fastest way to drop traffic is via a TC drop rule. To stress the packet rate rather than the link bandwidth, I used the minimum Ethernet frame size, 64 bytes.
This is a drop rate test.
Although both systems were unable to reach the line rate (which is about 1.5 Mpps), the 64 bit kernel scored a bit higher than the 32 bit one. If you want to use the Rpi4 as a firewall, a 64 bit kernel is definitely a must have.
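The line rate figure comes from simple arithmetic: on the wire, every Ethernet frame is preceded by 8 bytes of preamble and start-of-frame delimiter and followed by a 12 byte inter-frame gap. A quick sketch of the calculation (the function name is made up for this example):

```python
def max_frames_per_second(link_bps, frame_bytes, preamble_bytes=8, ifg_bytes=12):
    """Upper bound on frames per second for a given link speed and frame size."""
    wire_bits = (frame_bytes + preamble_bytes + ifg_bytes) * 8
    return link_bps / wire_bits


if __name__ == "__main__":
    pps = max_frames_per_second(1_000_000_000, 64)
    print(f"{pps / 1e6:.3f} Mpps")   # 1.488 Mpps
```

A 64 byte frame therefore occupies 84 bytes, or 672 bits, on a gigabit link, which gives roughly 1.49 million packets per second.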
As expected, OpenVPN is 10x slower than WireGuard. A less expected result is that OpenVPN performs the same in both 32 and 64 bit mode.
WireGuard, instead, almost saturates the gigabit port in both versions. Indeed, we get the same results with both kernels, so we probably hit the NIC limit.
To check if WireGuard could go even faster, I did another VPN test using two containers, so as to skip the physical Ethernet.
The only drawback with this container test is that both the iperf3 client and server were running on the Rpi4, keeping two cores busy.
As expected, OpenVPN and 32 bit WireGuard, which were CPU limited, performed worse, while 64 bit WireGuard performed better.
Often I read statements like “It’s not worth it”, “you will gain a few milliseconds”, etc. just because the Rpi is not that powerful.
That’s not true! As any embedded guy knows, with slow hardware, having well-optimized software is even more important than with powerful hardware.
I already knew that a 64 bit OS would perform better on the Rpi4; what I didn’t know was how much.
This is why I did this test series. I hope that you enjoyed reading it!