Secured NiFi cluster with Terraform on the Google Cloud Platform

This story is a follow-up to my previous story about deploying a single secured NiFi instance, configured with OIDC, using Terraform on the Google Cloud Platform. This time, it’s about deploying a secured NiFi cluster.

In this story, we’ll use Terraform to quickly:

deploy a NiFi CA server as a convenient way to generate TLS certificates

deploy an external ZooKeeper instance to manage cluster coordination and state across the nodes

deploy X secured NiFi instances clustered together

configure NiFi to use OpenID Connect for authentication

configure an HTTPS load balancer with Client IP affinity in front of the NiFi cluster

Note — I assume you have a domain that you own (you can get one with Google). It will be used to map a domain to the web interface exposed by the NiFi cluster. In this post, I use my own domain: pierrevillard.com and will map nifi.pierrevillard.com to my NiFi cluster.

Disclaimer: the steps below should not be used as-is for a production deployment. They can definitely get you started, but I’m only using them to start a secured cluster: there is none of the configuration one would expect for a production setup, such as a clustered ZooKeeper, dedicated disks for the repositories, etc.

If you don’t want to read the story and want to get straight into the code, it’s right here!

What is Terraform?

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

Configuration files describe to Terraform the components needed to run a single application or your entire datacenter. Terraform generates an execution plan describing what it will do to reach the desired state, and then executes it to build the described infrastructure. As the configuration changes, Terraform is able to determine what changed and create incremental execution plans which can be applied.

The infrastructure Terraform can manage includes low-level components such as compute instances, storage, and networking, as well as high-level components such as DNS entries, SaaS features, etc.

What is NiFi?

Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data. Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. In simpler words, Apache NiFi is a great tool to collect and move data around, process it, clean it and integrate it with other systems. As soon as you need to bring data in, you want to use Apache NiFi.

Why ZooKeeper?

[Figure: Apache NiFi clustering]

It’s best to refer to the documentation but, in short: NiFi employs a Zero-Master Clustering paradigm. Each node in the cluster performs the same tasks on the data, but each operates on a different set of data. One of the nodes is automatically elected (via Apache ZooKeeper) as the Cluster Coordinator. All nodes in the cluster then send heartbeat/status information to this node, and this node is responsible for disconnecting nodes that do not report any heartbeat status for some amount of time. Additionally, when a new node elects to join the cluster, it must first connect to the currently elected Cluster Coordinator in order to obtain the most up-to-date flow.

OAuth Credentials

The first step is to create the OAuth credentials (at the time of writing, this cannot be done using Terraform).

Once the credentials are created, you will get a client ID and a client secret that you will need in the Terraform variables.

When you create the credentials, your domain is automatically added to the list of “Authorized domains” in the OAuth consent screen configuration. This protects you and your users by ensuring that OAuth authentication only comes from authorized domains.

Store the NiFi binaries in Google Cloud Storage

In your GCP project, create a bucket in Google Cloud Storage. We are going to use the bucket to store the Apache NiFi & ZooKeeper binaries (instead of downloading directly from the Apache repositories at each deployment), and also as a way to retrieve the certificates that we’ll use for the HTTPS load balancer.
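As a rough sketch of this step, the bucket can be created and filled from a shell with gsutil. The helper name upload_binaries, the bucket name and the archive names are placeholders and are not defined by the deployment code; use the file names that the Terraform templates actually expect.

```shell
#!/usr/bin/env bash
# Sketch only: upload_binaries is a hypothetical helper, not part of the deployment code.
# Usage: upload_binaries <bucket-name> <archive> [<archive>...]
upload_binaries() {
  local bucket="$1"
  shift
  # Create the bucket (this fails if the name is already taken).
  gsutil mb "gs://${bucket}" || return 1
  # Copy each locally downloaded archive into the bucket.
  local archive
  for archive in "$@"; do
    gsutil cp "${archive}" "gs://${bucket}/" || return 1
  done
}
```

For example, `upload_binaries my-nifi-bucket nifi-bin.tar.gz zookeeper-bin.tar.gz`, where the bucket and both archive names stand in for your own bucket and the files you downloaded from the Apache repositories.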

Deploy NiFi with Terraform

Once you have completed the above prerequisites, installing your NiFi cluster will only take a few minutes. Open your Google Cloud Console in your GCP project and run:

[Embedded snippet: deploy script]

If you execute the above commands, you’ll be prompted for the information below. However, if you don’t want to be prompted, you can directly update the variables.tf file with your values to deploy everything.

Variables to update:

project // GCP Project ID

nifi-admin // Google mail address for the user that will be the initial admin in NiFi

san // FQDN of the DNS mapping that will be used to access NiFi. Example: nifi.example.com

proxyhost // FQDN:port that will be used to access NiFi. Example: nifi.example.com:8443

ca_token // The token to use to prevent MITM between the NiFi CA client and the NiFi CA server (must be at least 16 bytes long)

oauth_clientid // OAuth Client ID

oauth_secret // OAuth Client secret

instance_count // Number of NiFi instances to create

nifi_bucket // Google Cloud Storage bucket containing the binaries
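The ca_token value only needs to be a shared secret of at least 16 bytes, so one option is to generate it randomly instead of inventing one. The helper below is a minimal sketch; generate_ca_token is a hypothetical name and is not part of the deployment code.

```shell
#!/usr/bin/env bash
# Sketch only: generate_ca_token is a hypothetical helper.
# Prints 32 hexadecimal characters (16 random bytes, hex-encoded),
# which is longer than the 16-byte minimum required for ca_token.
generate_ca_token() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}
```

Terraform also reads input variables from environment variables named TF_VAR_<name>, so running `export TF_VAR_ca_token="$(generate_ca_token)"` before the deployment should avoid the prompt for that variable, assuming the deploy commands run Terraform from the same shell environment.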

Here is what it looks like on my side (after updating the variables.tf file):

[Screenshot: execution of the deploy script]

Explanations

The first step is to deploy the NiFi Toolkit on a single VM to run the CA server that is used to generate certificates for the nodes and the load balancer. Once the CA server is deployed, a certificate is generated for the load balancer and pushed to the Google Cloud Storage bucket.

The script you started waits until the load balancer certificate files are available on GCS. Once they are, the files are retrieved locally to execute the remaining parts of the Terraform template, which deploys the ZooKeeper instance as well as the NiFi instances and the load balancer in front of the cluster. All the configuration on the NiFi instances is done for you. Once the script execution is completed, the certificate files are removed (locally and on GCS).
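The waiting step described above boils down to polling until a check succeeds. Here is a minimal sketch of such a loop; wait_for is a hypothetical helper, and the actual deploy script may implement the wait differently.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for is a hypothetical helper, not taken from the deploy script.
# Retries a command until it succeeds or the attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # give up: the command never succeeded
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

With gsutil, the check could be `wait_for 60 5 gsutil -q stat gs://my-nifi-bucket/lb-certificate.pem`, since `gsutil -q stat` exits with status 0 only when the object exists. The bucket and object names here are placeholders.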

After 5 minutes or so…

The load balancer has been created and you can retrieve the public IP of the load balancer:

[Embedded snippet: retrieve the external public IP of the HTTPS load balancer]

You can now update the DNS records of your domain and add an A record pointing nifi.pierrevillard.com to the load balancer IP.
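DNS changes can take a little while to propagate, so it may be worth checking that the name resolves to the load balancer before opening the UI. The sketch below assumes dig is installed; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: has_address and dns_points_to are hypothetical helpers.

# Succeeds if the expected address is one of the lines read from stdin.
has_address() {
  local expected="$1"
  grep -Fxq -- "$expected"
}

# Resolves the A records of a host name and checks them against an IP.
# Requires dig and network access.
dns_points_to() {
  local host="$1"
  local ip="$2"
  dig +short "$host" A | has_address "$ip"
}
```

For instance, `dns_points_to nifi.example.com 203.0.113.10 && echo "DNS is ready"`, where the IP is a documentation address standing in for your load balancer IP.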

I can now access the NiFi cluster using https://nifi.pierrevillard.com and authenticate on the cluster using the admin account email address I configured during the deployment.

Here is my 6-node secured NiFi cluster up and running:

[Screenshot: 6-node secured NiFi cluster]

[Screenshot: 6 nodes with the elected primary and coordinator nodes]

I can now update the authorizations and add additional users/groups.

Note — you could use Google certificates instead of the ones generated with the CA server to remove the warnings about untrusted certificate authority.

Cleaning

To destroy all the resources you created, you just need to run:

terraform destroy -auto-approve

As usual, thanks for reading. Feel free to ask questions or comment on this post.

Hello,
Thanks a lot for this post. I’m using Google Cloud Platform and it looks like a very interesting way to deploy NiFi!
But I’m wondering whether there is a way to access the interface of a “chosen” node with this configuration?
And could I interact with the REST API behind the load balancer?

You can interact with the REST API behind the load balancer because the LB is defined with a sticky session (the same user will always reach the same node, as long as the node is available). You can also access a particular node if you want, but the OIDC authentication might need a little tweaking (to be tested).