Enthusiasm never stops

Tag Archives: Linux

Many years ago I wrote the library popen_noshell which improves the speed of the popen() call significantly. It seems that now there is a standard and very efficient way to achieve the same. Use the posix_spawn() call. Its interface is a bit grumpy and complicated, but it can’t be very simple after all, because posix_spawn() provides both great efficiency and lots of flexibility.

Let us first examine the different ways of spawning a process on Linux 4.10. Here are the different implementations of the following functions:

In the latest versions of the GNU libc, posix_spawn() uses a clone() call which is equivalent to the vfork() arguments of clone(). Therefore, a logical question pops up – why not use vfork() directly. “The problem are the atfork handlers which can be registered. In the child process they can modify the address space.”

Of course, it would be best if posix_spawn() was implemented as a system call in the Linux kernel. Then we wouldn’t need to depend on the GNU libc implementations, which by the way differ with the different versions of glibc. Additionally, the Linux kernel could spawn processes even faster.

The current implementation of posix_spawn() in the GNU libc is basically a vfork() with a limited, safe set of functions which can be executed inside the vfork()’ed child. When using vfork(), the child shares the memory and the stack of the parent process, so we need to be extra careful indeed. There are plenty of warnings in the man pages about the usage of vfork().

I am glad that my implementation and this of the GNU libc guys is very similar. They did a better job though, because they handle a few corner cases like custom signal handlers in the parent, etc. It’s worth to review the comments and the source code of the patch which introduces the new, very efficient posix_spawn() implementation in the GNU libc.

Since OpenSSH version 6.7 the default set of ciphers and MACs has been altered to remove unsafe algorithms. In particular, CBC ciphers and arcfour* are disabled by default. This has been adopted in Debian “Jessie”.

I tested five different platforms having CPUs with and without AES hardware acceleration, different OpenSSL versions, and running on different platforms including dedicated servers, OpenVZ and AWS.

Since the processing power of each platform is different, I had to choose a criteria to normalize results, in order to be able to compare them. This was a rather confusing decision, and I hope that my conclusion is right. I chose to normalize against the “arcfour*”, “blowfish-cbc”, and “3des-cbc” speeds, because I doubt it that their implementation changed over time. They should run equally fast on each platform because they don’t benefit from the AES acceleration, nor anyone bothered to make them faster, because those ciphers are meant to be marked as obsolete for a long time.

A summary chart with the results follow:

You can download the raw data as an Excel file. Here is the command which was run on each server:

Servers which run a newer CPU with AES hardware acceleration can enjoy the benefit of (1) a lot faster AES encryption using the recommended OpenSSH ciphers, and (2) some AES ciphers are now even two-times faster than the old speed champion, namely “arcfour”. I could get those great speeds only using OpenSSL 1.0.1f or newer, but this may need more testing.

Servers having a CPU without AES hardware acceleration still get two-times faster AES encryption with the newest OpenSSH 6.7 using OpenSSL 1.0.1k, as tested on Debian “Jessie”. Maybe they optimized something in the library.

While working on my latest pet project which involved 10 GigE transfers, I noticed a significant difference between the results shown by “iperf” and “iftop“. A fellow blogger also noticed this discrepancy. In order to get to the bottom of this, I did some additional tests using different MTU sizes, and observing the output of “iperf”, “iftop”, “iptraf”, and the raw Linux network device counters as seen by “ifconfig”.

iperf – this tool measures the TCP performance, as per documentation; therefore it counts the useful payload in a TCP/IP transfer; this is layer4 in the OSI model

iftop – this tool counts all IP packets, as per documentation; my tests show that it also operates on layer4, just as “iperf”, because ARP traffic (on layer3) is not counted at all; the fact that “iftop” cares about connections+ports also suggests that it operates at layer4

iptraf – this tool seems to be too old now, and its results were off by a multiple of 4 to 5

ifconfig – shows the most low-level statistics, namely bytes that passed as RX or TX through the network device; the most trusted source of performance data

We notice that both “iperf” and “iftop” measure the useful payload data that we can transfer per second. Since all OSI layers have some overhead, let’s take a look at what theory says about bandwidth efficiency in Ethernet:

with a standard MTU frame of 1500 bytes, we get 94.93% efficiency (5.07% overhead)

with a jumbo MTU frame of 9000 bytes, we get 99.14% efficiency (0.86% overhead)

Those numbers correspond very closely with the results shown by “iperf”.

It’s only “iftop” which differs a lot. Analysis of its source code reveals the reason for this and how we must interpret the displayed results:

The authors of “iftop” decided to round to Gigibit (multiple of 1024), instead of the more common Gigabit (multiple of 1000). This makes the difference by “iftop” bigger as the transfer rate gets higher. For Gigabit the difference is 7%.

Once the “iftop” values are converted from Gigibit to Gigabit, they also match the results by “iperf” and the raw Linux network device counters.

I wanted to build a proof of concept for a scalable, highly available, fault tolerant, distributed block storage, which utilizes commodity hardware, runs on a 10 Gigabit Ethernet network, and uses well-tested open-source technologies. This is a simplified version of Ceph. The only single point of failure in this cluster is the client itself, which is inevitable in any solution.

to achieve a 3x replication factor, the block devices from each storage server are grouped into 5x mdadm software RAID-1 (mirror) devices; each RAID-1 device “md1” to “md5” contains three disks from a different storage server, so that if one or two of the storage servers fail, this won’t affect the operation of the whole RAID-1 device

all RAID-1 devices “md1” to “md5” are grouped into a single RAID-0 (stripe), in order to utilize the full bandwidth of all devices into a single block device, namely the “md99” RAID-0 device, which also combines the size capacity of all “md1” to “md5” devices and it equals to 75 GB

the storage servers and the client machine were limited on boot to 4 CPUs and 2 GB RAM, in order to minimize the effect of the Linux disk cache

only sequential and random reading were benchmarked

Linux md RAID-1 (mirror) does not read from all underlying disks by default, so I had to create a RAID-1E (mirror) configuration; more info here and here; the “mdadm create” options follow: --level=10 --raid-devices=3 --layout=o3 Continue reading →

In this article we will demonstrate the use of the “network” namespace which enables a process to have independent IPv4 and IPv6 stacks, network interfaces, IP routing tables, iptables firewall rules, the /proc/net and /sys/class/net directory trees, sockets, etc.

One of those interfaces will be used as a communication point from the side of the original default network namespace. We will assign “10.200.1.1” for IP address:

ifconfig v-eth1 10.200.1.1 netmask 255.255.255.0 up

It is time to enter the new network namespace. Once we have created the new namespace, we will associate the second interface “v-peer1” with it, then we will configure an IP address “10.200.1.2” and add a default route through the first interface which will act as a router:

The original default namespace, our original Linux installation, must be configured to act as a router. Otherwise the processes inside the new network namespace won’t have any Internet access. Configuring a Linux network router is a straightforward task:

Finally, you can enable inbound connections to the processes in the confined new network namespace. Let’s assume that you have a daemon listening on TCP port 10105. Here is how you can forward any new incoming connections to the processes inside the new network namespace:

Pros: Using separate network namespaces gives us full network isolation and control over a group of processes. Additionally, we can match incoming packets against a process which is not possible in a standard “iptables” setup using the “-m owner” match extension. These are huge security benefits.

Cons: The technical implications are that the Linux host has to do (a lot) more work because of the DNAT/SNAT operations and their related connection tracking overhead. If you are running a high traffic server, you should plan and test accordingly. Furthermore, one additional network interfaces pair is created for each new network namespace. Linux can handle hundreds of network devices but still this is a factor to be considered.

The better security features outweigh the drawbacks in most use-cases though. Last but not least, it is very easily to run a process with completely detached network and this won’t cost us anything on the Linux host.

pid namespace (new) — children will have a distinct set of PID to process mappings from their parent

user namespace (new) — the process will have a distinct set of UIDs, GIDs and capabilities

In this article we will demonstrate the use of the “mount” namespace which lets us mount a filesystem per-process without affecting the rest of the system. Using such a private mount for “/tmp” has mainly security but also usability benefits.

Here are all the commands which you need, in order to start a process with a private “/tmp” directory: