Problem

We want to secure BookKeeper clusters. At a minimum, we have to cover these main issues:

Use ACLs while writing to ZooKeeper

Provide an extensible Authentication framework (in BookKeeper 4.4.0 there is already partial support for this)

Provide an out-of-the-box plugin that implements standard SASL authentication, with support for at least the GSSAPI/Kerberos and DIGEST-MD5 mechanisms

Provide initial support for Authorization

Support SSL/TLS for encrypting communications and/or implementing client authentication

Proposals

The major concerns are protocol compatibility, data encryption, configuration on both the bookie side and the client side, rolling upgrades, and ZooKeeper security.

Secure ZooKeeper data structures

To implement a secured ZooKeeper layout, we have to set at least the ACLs that prevent unknown users from modifying and/or accessing ledger metadata and bookie metadata.

The idea is to have a global "secure ZooKeeper nodes" flag (zkEnableSecurity), to be set on both bookies and clients, since BookKeeper clients actually do most of the work.

The initial proposal is to apply the ZooKeeper ACL ZooDefs.Ids.CREATOR_ALL_ACL to every znode created by BookKeeper clients and bookies.

This works only if every client and bookie authenticates to ZooKeeper as the same "user", because CREATOR_ALL_ACL grants all permissions only to the identity that created the znode.

This is a very common practice in the Hadoop ecosystem (see Kafka, HBase...) and in Kerberos-based setups.

Usually, in a Kerberized ZooKeeper setup, each machine logs in using a principal of the form zkclient/HOSTNAME@REALM, and ZooKeeper is configured to strip out the HOSTNAME and REALM parts, so each client is authenticated simply as the "zkclient" user.
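The stripping described above can be illustrated with a minimal sketch. This is not ZooKeeper's implementation: the class PrincipalNames and the method shortName are hypothetical names, and the two boolean flags only mirror the intent of the host and realm removal options.

```java
/** Hypothetical helper; it only illustrates how primary/HOSTNAME@REALM can be shortened. */
final class PrincipalNames {
    private PrincipalNames() {}

    /**
     * Shortens a Kerberos-style principal of the form primary/HOSTNAME@REALM.
     * The flags are assumptions that mimic "remove host" and "remove realm" options.
     */
    static String shortName(String principal, boolean removeHost, boolean removeRealm) {
        String realm = "";
        String rest = principal;
        int at = principal.lastIndexOf('@');
        if (at >= 0) {
            realm = principal.substring(at);      // keeps the leading '@'
            rest = principal.substring(0, at);
        }
        String host = "";
        String primary = rest;
        int slash = rest.indexOf('/');
        if (slash >= 0) {
            host = rest.substring(slash);         // keeps the leading '/'
            primary = rest.substring(0, slash);
        }
        return primary + (removeHost ? "" : host) + (removeRealm ? "" : realm);
    }
}
```

With both flags set, every machine that logs in as zkclient/HOSTNAME@REALM maps to the same short name, which is what makes a creator-only ACL usable across the whole cluster.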

Protocol changes for the AuthMessage message:

In 4.4.0 there is an issue with the BookieAuthProvider and ClientAuthProvider classes: they depend directly on the AuthMessage class, which in turn depends on protobuf, a shaded dependency.

The idea is to remove that class from the public API and have implementors use a standard AuthToken class, which is a simple wrapper for a byte array.

Common authentication frameworks, such as SASL, only need to exchange a sequence of byte arrays, so this new API would be suitable for most use cases.
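A minimal sketch of what such a wrapper could look like follows. The proposal names only the AuthToken class; the factory method wrap, the accessor getData, and the defensive copies are assumptions made for illustration.

```java
import java.util.Arrays;

/** Sketch of a protobuf-free token: an opaque byte array exchanged during authentication. */
final class AuthToken {
    private final byte[] data;

    private AuthToken(byte[] data) {
        this.data = data;
    }

    /** Wraps a copy of the given bytes (assumed factory name). */
    static AuthToken wrap(byte[] data) {
        return new AuthToken(Arrays.copyOf(data, data.length));
    }

    /** Returns a copy so that callers cannot mutate the token's content. */
    byte[] getData() {
        return Arrays.copyOf(data, data.length);
    }
}
```

Because the type exposes only bytes, an auth plugin never needs to see the shaded protobuf classes; the transport layer alone converts between AuthToken and the wire message.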

In 4.4.0 the AuthMessage message is defined to be extensible; that feature will no longer be needed.

ISSUE: can we drop the 'extensions' line from the definition of AuthMessage in BookkeeperProtocol.proto?

Rolling upgrade to switch between authentication types:

The proposal for supporting the switch from "no-auth" to any plugin is the following:

1) when no auth is configured on the bookie side, the default "AuthDisabledPlugin" plugin comes into play (in 4.4.0 we call it "NULLPlugin")

2) when no auth is configured on bookies, the bookie enters a permissive mode in which any client is allowed to connect and use bookie functions; any 'auth' message will be answered with an OK status code by the AuthDisabledPlugin

3) when no auth is configured on bookies, each authentication attempt by the client will receive the OK status code from the AuthDisabledPlugin, so the client will understand that the bookie is not performing authentication and will move to the authenticated state

A rolling upgrade from AuthDisabledPlugin to any other plugin can be implemented by first restarting clients with the new plugin: bookies will still answer through the AuthDisabledPlugin, and clients will move to the authenticated state anyway.

Then restart the bookies with the new plugin, and clients will perform real authentication.

Bookie-to-bookie communications will follow the same flow during the rolling upgrade, because each bookie can be a client of the other bookies and will authenticate through either the AuthDisabledPlugin flow or the requested plugin flow.
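The permissive flow used during the rolling upgrade can be sketched as follows. All types here (AuthStatus, BookieSideAuth, AuthDisabledBookieAuth, ClientHandshake) are hypothetical stand-ins for the real plugin interfaces; only the behaviour "answer every auth message with OK" comes from the proposal.

```java
/** Hypothetical status codes returned by the bookie for an auth message. */
enum AuthStatus { OK, FAILED }

/** Hypothetical bookie-side view of an auth plugin. */
interface BookieSideAuth {
    AuthStatus onAuthMessage(byte[] token);
}

/** Permissive default: every auth message is answered with OK, whatever it contains. */
final class AuthDisabledBookieAuth implements BookieSideAuth {
    @Override
    public AuthStatus onAuthMessage(byte[] token) {
        return AuthStatus.OK;
    }
}

/** Hypothetical client-side state: an OK reply moves the client to the authenticated state. */
final class ClientHandshake {
    private boolean authenticated;

    /** Sends the first token of the client's plugin and records the outcome. */
    boolean start(BookieSideAuth bookie, byte[] firstToken) {
        AuthStatus reply = bookie.onAuthMessage(firstToken);
        authenticated = (reply == AuthStatus.OK);
        return authenticated;
    }

    boolean isAuthenticated() {
        return authenticated;
    }
}
```

A client that already runs the new plugin sends its first token, receives OK from a bookie that has not been upgraded yet, and proceeds as authenticated. Once the bookie is restarted with the real plugin, the same call path carries the real exchange.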

Attaching the authenticated user to the server-side peer:

Another issue is attaching the user ID (principal) of the authenticated client to the server-side view of the connection. There is no specific class for this; it is the BookiePipeline instance created for each client. The principal will be the basis for ACLs, quotas, auditing, and further security-related features.

A change to the AuthProvider interface will be needed: at least the AuthHandshakeCompleteCallback needs to receive the user ID upon successful authentication.
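One possible shape for such a change is sketched below. The names PrincipalAwareCallback and ConnectionPeer, and both method signatures, are assumptions for illustration; they are not the existing BookKeeper interfaces.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical callback that carries the authenticated user ID to the server side. */
interface PrincipalAwareCallback {
    void onSuccess(String authorizedId);

    void onFailure(String reason);
}

/** Hypothetical server-side view of one client connection. */
final class ConnectionPeer {
    // Set once the handshake completes; null means "not authenticated".
    private final AtomicReference<String> authorizedId = new AtomicReference<>();

    /** Creates the callback handed to the auth plugin for this connection. */
    PrincipalAwareCallback newCallback() {
        return new PrincipalAwareCallback() {
            @Override
            public void onSuccess(String id) {
                authorizedId.set(id);
            }

            @Override
            public void onFailure(String reason) {
                authorizedId.set(null);
            }
        };
    }

    String getAuthorizedId() {
        return authorizedId.get();
    }
}
```

Later features such as ACL checks or auditing would read getAuthorizedId() from the peer instead of asking the plugin again.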

ZooKeeper allows JAAS to be used to configure SASL-based DIGEST-MD5 authentication, which is very useful for small deployments and for testing purposes. See the ZooKeeper documentation for a fuller explanation. Kafka supports a similar configuration for simple JAAS/SASL setups.

Following these conventions, the "Bookie" principal will be bookkeeper/HOSTNAME@REALM (for instance, in ZooKeeper it has to be zookeeper/HOSTNAME@REALM, and in Kafka it is kafka/HOSTNAME@REALM).

The authorized ID (user) will be the Principal attached to the JAAS subject, that is, bookkeeper/HOSTNAME@REALM. In ZooKeeper there are options to strip out the HOSTNAME part and the REALM part (see the kerberos.removeHostFromPrincipal and kerberos.removeRealmFromPrincipal properties).

ISSUE: To support a distinct JAAS configuration entry for the Auditor (bookie-to-bookie communications), the special BookKeeper instance used by the Auditor must be configured not to use the standard "BookieClient" entry. Maybe we can use an internal configuration option handled by the BookKeeper client, such as 'auditorMode'. In this mode the entry will be chosen following the Auditor conventions.

ISSUE: In ZooKeeper the SASL mechanism is decided by the type of the JAAS Subject, which is very simple for an administrator to configure. We should make the configuration more explicit, with something like a configuration property sasl.mech=GSSAPI|DIGEST-MD5...

SSL/TLS Support

The initial proposal is to enable TLS support using the StartTLS mechanism; this way a bookie will continue to advertise only one network endpoint, which is effectively used as the bookie ID.

A bookie with more than one endpoint would need more metadata refactoring; the StartTLS proposal lets us skip this change for this version.

To support TLS we have to implement TLS communications on Netty 3 (maybe on Netty 4) and add a new protocol message to implement StartTLS.

We can implement an AuthPlugin that uses the certificate sent by the client and attaches it to the server-side connection peer.

We can add an optional scheduled task that checks certificate validity. This can be done inside the AuthPlugin, but the AuthPlugin will need to hold a reference to a "handle" to the underlying connection, in order to shut it down when the certificate expires.

To support such a connection handle, the PeerInfo structure passed to the AuthPlugin should be changed to become an active object.
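The scheduled validity check could be sketched as follows, using only JDK classes. ConnectionHandle and CertificateExpiryCheck are hypothetical names; the proposal does not define how the handle is exposed through PeerInfo, so the interface below is an assumption.

```java
import java.security.cert.X509Certificate;
import java.time.Instant;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical handle that lets an auth plugin close the underlying connection. */
interface ConnectionHandle {
    void disconnect();   // assumed to be idempotent
}

final class CertificateExpiryCheck {
    private CertificateExpiryCheck() {}

    /** True when the current time is past the certificate's notAfter instant. */
    static boolean isExpired(Instant notAfter, Instant now) {
        return now.isAfter(notAfter);
    }

    /**
     * Periodically checks the client certificate and disconnects the peer once it has expired.
     * The caller is expected to cancel the returned future when the connection closes.
     */
    static ScheduledFuture<?> schedule(ScheduledExecutorService scheduler,
                                       X509Certificate clientCertificate,
                                       ConnectionHandle handle,
                                       long periodSeconds) {
        Instant notAfter = clientCertificate.getNotAfter().toInstant();
        return scheduler.scheduleAtFixedRate(() -> {
            if (isExpired(notAfter, Instant.now())) {
                handle.disconnect();
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```

Keeping the handle as a one-method interface means the plugin can only close the connection and never touches the Netty channel directly.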

Enrico Olivelli thanks for driving this. I am fine with splitting this. However, please keep in mind that proposals are intended for big feature changes/improvements; they are not a replacement for JIRA tickets. A proposal is a place to host the major design discussion; for implementation, a proposal will have to be divided into small JIRAs to track its development.

Enrico Olivelli - I checked all the JIRAs linked in this proposal. It seems they are all resolved. Is this done? If so, do you mind helping with the documentation about the security support? Then let's try to write a blog post about the BookKeeper security feature.