Ubuntu, LXD, Samba and the dreaded “sys_setgroups failed” error

Sometimes, errors produced by Samba can be really annoying. In such cases, they sure as hell depend on your actual configuration, environment, kernel and Samba versions and so on. If you are unlucky enough to have a comparatively rare configuration, you might find only outdated forum threads, forum posts with no replies, or bug reports that have been marked as “Closed” with a comment that “it is impossible to reproduce the problem”. And yet, in your particular case, the problem is there and you have no clue what you have done wrong to be punished so hard.

Just like in this case.

First of all, the particular environment needs describing. We have a host, let’s call it “server01”, which runs Ubuntu 14.04 LTS and LXD, and an LXD container, let’s call it “dev1”, which runs Ubuntu 16.04 LTS. We have Active Directory provisioned in our network, with RFC2307 (Unix extensions). For users and groups that have no Unix UIDs and GIDs defined, we use the TDB backend of idmap. For “normal” users and groups we use the AD backend with RFC2307 to get their UIDs and GIDs, so that they are consistent across all the *nix machines on the network. Having consistent UIDs and GIDs is good and helps with file transfers, migration of complicated access lists and other issues.

With Samba installed and configured in the container, dev1, it turned out to be impossible to connect to it using any client, be it the cifs filesystem driver on Linux, smbclient or a Windows workstation. We would get this error with smbclient, for instance:

Some outdated forum posts and bug reports could be found, but they were all related to Samba 3.x and Linux 3.5. When kernel 3.5 came out, it changed how the sys_setgroups() function worked: it now returned an error if GID -1 was passed to it, and Samba happened to do just that. Yet, this issue is old (dating back to 2012) and has long been patched, so it shouldn’t appear in modern versions.

Some “widow posts” were also found that were more or less recent, but they had no replies or replies like “Yeah, I also have this problem. Somebody help, please!”

So, it seemed that this old issue could not be the cause. Still, having read about past issues with Samba not willing to cooperate with the kernel, we decided to upgrade the host’s kernel to the latest 16.04 LTS hardware enablement stack. That was easy and is a good thing to do in any case, but it turned out not to help at all. Should you wish to do the same, you can read about it on Ubuntu’s Wiki page.

Then, some brainstorming was done about what exactly fails here and why. The sys_setgroups function is Samba’s wrapper around the setgroups() system call, which sets the list of supplementary group IDs of the calling process: smbd passes it the GIDs of all the groups the connecting user belongs to. So, it would seem that some of these group IDs look “illegal” to the kernel, which returns an error instead, causing the smbd process to crash.

While there are no practical limits on UID and GID use in a “normal” Linux system, we have to remember that we are running in an LXD container. For security reasons, as the filesystem is shared between the host and the containers, each user who launches containers has its own namespace, that is, a base value from which its “internal” UIDs and GIDs start. For instance, the UID of the root user is normally 0. Should a particular user’s namespace start at, say, 100000, the root user within the container would still have UID 0, but from the host’s point of view it would have UID 100000. The container’s UID 1000 would be 101000 on the host, and so on. These values are configured in two files on the host, /etc/subuid and /etc/subgid. A look at these two files showed the following:

/etc/subuid:lxd:231072:65536

/etc/subgid:lxd:231072:65536
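The mapping between container IDs and host IDs implied by such an entry can be sketched in a few lines of Python. This is only an illustration: the names `SubIdEntry`, `parse_subid_line` and `to_host_id` are made up here, and the sketch assumes that container ID 0 maps to the base of the range, as described above.

```python
from typing import NamedTuple, Optional


class SubIdEntry(NamedTuple):
    owner: str  # user the range is delegated to, e.g. "lxd"
    base: int   # first host-side ID of the range
    count: int  # number of IDs in the range


def parse_subid_line(line: str) -> SubIdEntry:
    """Parse one 'owner:base:count' entry as found in /etc/subuid or /etc/subgid."""
    owner, base, count = line.strip().split(":")
    return SubIdEntry(owner, int(base), int(count))


def to_host_id(entry: SubIdEntry, container_id: int) -> Optional[int]:
    """Return the host-side ID for a container ID, or None if it is not mapped."""
    if 0 <= container_id < entry.count:
        return entry.base + container_id
    return None


entry = parse_subid_line("lxd:231072:65536")
print(to_host_id(entry, 0))      # container root -> 231072
print(to_host_id(entry, 1000))   # -> 232072
print(to_host_id(entry, 65536))  # one past the end of the range -> None
```

Any container ID for which the function returns None simply does not exist from the kernel's point of view.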

That is, containers launched by the “lxd” user have a base of 231072 and are allowed at most 65536 UIDs and GIDs each. This could be easily tested within the container:
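One way to run such a test, sketched here in Python rather than as shell commands, is shown below. It assumes it is executed as root inside the container; `try_chown` is a hypothetical helper name, and the scratch directory stands in for wherever the test file is created.

```python
import errno
import os
import tempfile


def try_chown(path: str, uid: int) -> str:
    """Try to set the owner of path to uid and report what happened."""
    try:
        os.chown(path, uid, -1)  # -1 leaves the group unchanged
        return "ok"
    except OSError as exc:
        # Inside a user namespace an unmapped UID is expected to yield EINVAL;
        # EPERM would simply mean that the script is not running as root.
        return errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        test_file = os.path.join(scratch, "test")
        open(test_file, "w").close()  # create an empty file
        for uid in (65535, 65536):
            print(uid, try_chown(test_file, uid))
```

With a range of 65536 IDs, the first call should succeed and the second should be rejected by the kernel.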

That is, we first create an empty file called “test”, then set its owner to UID 65535. Remember that we can have 65536 UIDs and GIDs, numbered from 0 to 65535. We could successfully set the UID to 65535, but not to 65536, which in our case is “out of range”. Hmm, this is a clue, but what UIDs and GIDs could our beloved Samba want to use? This can easily be spotted in the /etc/samba/smb.conf file; the relevant part is here:

We see that we have a range of 500-4000000 reserved for the AD backend, and 4000001 to 5000000 reserved for the TDB backend. While users with their Unix attributes set certainly have UIDs less than 65536, groups such as BUILTIN\administrators or BUILTIN\users have their GIDs allocated by the TDB backend, that is, at 4000001 or above, far outside the 65536 IDs available in the container. This means that we simply need to extend the allowed UID and GID range for our container.
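The comparison between the idmap ranges and the container's allocation can be written down as a small Python sketch. The range strings are the ones quoted above; the function names and the dictionary keys "ad" and "tdb" are made up for illustration and are not part of Samba or LXD.

```python
from typing import Dict, List, Tuple


def parse_range(text: str) -> Tuple[int, int]:
    """Parse an smb.conf style idmap range such as '500-4000000'."""
    low, high = text.split("-")
    return int(low), int(high)


def ranges_beyond(ranges: Dict[str, str], id_count: int) -> List[str]:
    """Names of ranges that reach beyond container IDs 0..id_count-1."""
    return [name for name, text in ranges.items()
            if parse_range(text)[1] >= id_count]


def required_count(ranges: Dict[str, str]) -> int:
    """Smallest ID count that covers every ID of every range (IDs start at 0)."""
    return max(parse_range(text)[1] for text in ranges.values()) + 1


idmap_ranges = {"ad": "500-4000000", "tdb": "4000001-5000000"}
print(ranges_beyond(idmap_ranges, 65536))  # ['ad', 'tdb']
print(required_count(idmap_ranges))        # 5000001
```

Both ranges nominally reach past the default allocation, although in this setup only the TDB-allocated IDs were actually in use beyond it. Note also that, because IDs start at 0, covering ID 5000000 itself strictly requires a count of 5000001; a count of 5000000 maps IDs up to 4999999, which leaves only the very last ID of the TDB range uncovered.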

The container was shut down and the /etc/subuid and /etc/subgid files were changed accordingly:

/etc/subuid:lxd:231072:5000000

/etc/subgid:lxd:231072:5000000

Then, the LXD daemon was restarted to take the new configuration into account, after which the container was fired up:

To summarize all of this, the problem of that bizarre Samba crash was caused by several factors:

Active Directory environment

The particular Samba configuration with large UID/GID ranges

Running in an LXD container which had the default range for UIDs/GIDs, i.e. 65536, while Samba wanted a range of 5000000.

As can be seen, such cases require deeper analysis of the issue at hand, and the solution is not as obvious as it might seem at the start. Yet, with enough knowledge about the topic and some brainstorming, they can be solved. This article was written to help others who might run into this issue. Feel free to link to this post.