Sunday, November 4, 2018

The Challenge

When using the Elastic Stack, I've found that Elasticsearch and Beats handle load-balancing well, but Logstash... not so much, since it does not support clustering. The issue arises when you have end devices that cannot run a Beats agent (which can natively send to two or more Logstash servers). To get around this, you would typically:

Set up any one of the Logstash servers as the syslog/event destination

Pro: Only one copy of the data to maintain

Con: What if that server or Logstash input goes down?

Set up multiple Logstash servers as the syslog/event destinations

Pro: More likely to receive the logs during a Logstash server or input outage

Con: Duplicate copies of the logs to deal with

A third option that I've developed and laid out below contains all of the pros and none of the cons of the above options to provide a highly-available and load-balanced Logstash implementation. This solution is highly scalable as well. Let's get started.

Prerequisites

To create this proof-of-concept solution, I began with a very minimal configuration:

Log Server Configuration

OS install

For this, I simply created a small VMware Fusion virtual machine using the CentOS 7 Minimal ISO as my installation source (this one in particular). The rest of the machine creation is pretty straight-forward. (Note: I did change from NAT to Wi-Fi networking as I was having very strange issues with NAT networking)

After starting the virtual machine, the install process will begin. This is where you can just do a basic install, but I chose a few options that hit close to home with my day job:

Partition the disk manually if you intend to use a security policy (automatic partitioning would otherwise cause a security policy violation that keeps us from proceeding)

Logstash configuration

There would be no way to show off all of the possible Logstash configurations (that's some research for you :) ), so I'll just set up a simple one for testing our highly-available Logstash cluster:

This is a bit different, but the API will need to be exposed outside the localhost:

sudo vi /etc/logstash/logstash.yml

Uncomment http.host and set to the server's IP address

Uncomment http.port and set it to just 9600 (rather than the default 9600-9700 range)
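With those two changes made, the relevant portion of logstash.yml would look something like this (I'm assuming stash1's IP is 192.168.1.210 here; substitute your own):

```
# /etc/logstash/logstash.yml (excerpt)
http.host: "192.168.1.210"
http.port: 9600
```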

The input and output configuration for Logstash is next (you can change the filename to something else... unless you agree). For this testing, I'm just setting up a raw UDP listener on port 5514 and writing to a file in /tmp.
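As a sketch, the pipeline file could look like the following. The "udp-5514" id is just my own label (pick whatever you like, but note it down, because it's what the keepalived health-check script will look for later):

```
# /etc/logstash/conf.d/ryanisawesome.conf
input {
  udp {
    id   => "udp-5514"        # label used by the keepalived health check
    host => "192.168.1.210"   # this server's IP address
    port => 5514
  }
}

output {
  file {
    path => "/tmp/itworked.txt"
  }
}
```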

If you chose the DISA STIG Policy during the VM build, comment out "net.ipv4.ip_forward = 0" (yes... this is a finding if this system is not a router. But once ipvsadm is running it IS a router. So we're all good ;) )

sudo sysctl -p
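For reference, the relevant line ends up commented out (or explicitly set to 1, which is what IPVS ultimately needs to forward packets):

```
# /etc/sysctl.conf (excerpt)
# net.ipv4.ip_forward = 0    # commented out per the note above
net.ipv4.ip_forward = 1
```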

Keepalived

Here's where the real bread-and-butter of this setup lies: keepalived. This application is typically used to provide a virtual IP shared between two or more servers. If the primary server were to go down, the backup would pick up the IP to avoid any substantial downtime. This is not a bad solution with regard to high availability, but it means only one server is online processing our logs at any given time. We can do better.

Another feature of keepalived is virtual_servers. With this, you can configure a listening port on the virtual IP; when data is received, keepalived forwards it to a pool of real servers via a load-balancing method of your choosing. The configuration would look something like this:
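This is a sketch of /etc/keepalived/keepalived.conf for stash1 (192.168.1.210), with stash2 at 192.168.1.220 and the virtual IP at 192.168.1.230. The interface name, auth_pass, lb_kind, and the "udp-5514" input id are my own assumptions; adjust them for your environment:

```
global_defs {
    router_id stash1
}

vrrp_instance VI_1 {
    state MASTER
    interface ens33               # your interface name will differ
    virtual_router_id 51
    priority 100
    advert_int 1
    unicast_src_ip 192.168.1.210  # this server
    unicast_peer {
        192.168.1.220             # the other Logstash server
    }
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.168.1.230             # the virtual IP
    }
}

virtual_server 192.168.1.230 5514 {
    delay_loop 5                  # run the health checks every 5 seconds
    lb_algo rr                    # round-robin across the real servers
    lb_kind NAT
    protocol UDP

    real_server 192.168.1.210 5514 {
        MISC_CHECK {
            misc_path "/bin/python /etc/keepalived/inputstatus.py 192.168.1.210 udp-5514"
            misc_timeout 3
        }
    }
    real_server 192.168.1.220 5514 {
        MISC_CHECK {
            misc_path "/bin/python /etc/keepalived/inputstatus.py 192.168.1.220 udp-5514"
            misc_timeout 3
        }
    }
}
```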

Logstash Health Checks

You'll probably notice a reference to inputstatus.py in the above configuration. Keepalived will need to run an external script to determine whether or not the configured "real server" is eligible to receive the data. This is typically pretty easy to do with TCP: if a SYN, SYN/ACK, ACK handshake succeeds, we can assume the service is listening. That is not an option with a Logstash UDP input, as nothing is sent back to confirm the service is listening. What can be used instead is the API. The following script simply makes an API call to list the node's stats, parses the resulting list of inputs, and, if the input we're looking for is up, exits normally.
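Here's a sketch of what that script can look like. The node stats JSON layout can vary a bit between Logstash versions, so verify the structure against your own API output before trusting it:

```python
#!/bin/python
# inputstatus.py -- exit 0 if the given Logstash input is up, non-zero otherwise.
# Usage: /bin/python /etc/keepalived/inputstatus.py <IP> <input-id>
import json
import sys

try:
    from urllib2 import urlopen          # Python 2 (CentOS 7's /bin/python)
except ImportError:
    from urllib.request import urlopen   # Python 3

def input_is_up(stats, input_id):
    """Return True if an input plugin with the given id appears in the
    node's pipeline stats (i.e., the input is loaded and running)."""
    for pipeline in stats.get("pipelines", {}).values():
        for plugin in pipeline.get("plugins", {}).get("inputs", []):
            if plugin.get("id") == input_id:
                return True
    return False

def main():
    ip, input_id = sys.argv[1], sys.argv[2]
    try:
        resp = urlopen("http://%s:9600/_node/stats/pipelines" % ip, timeout=2)
        stats = json.load(resp)
    except Exception:
        sys.exit(1)  # API unreachable: keepalived should pull this real server
    sys.exit(0 if input_is_up(stats, input_id) else 1)

if __name__ == "__main__":
    main()
```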

Keepalived will add this server to the list of real servers if the exit code of our script is 0 and remove it from the list if it is anything except 0. The aforementioned keepalived configuration is set up to check this script every 5 seconds for minimal log loss if one goes down. Adjust as you see fit here (i.e., how much loss can you acceptably handle).

Of course, you would have to create several of these if you have Logstash listening on multiple ports, but cut and paste is easy. Just look at /var/log/messages to ensure that these scripts are exiting properly. If you see a line like "Oct 30 09:44:58 stash1 Keepalived_healthcheckers[16141]: pid 16925 exited with status 1", either the script failed or a particular input is not up. Since this error message isn't the most descriptive, you'll have to manually test or view each input on each host to see which one it is. You can manually test the Logstash inputs (once that service is running) by issuing:

/bin/python /etc/keepalived/inputstatus.py <IP> <input-id>

Firewall Rules

Sure, we could just disable firewalld... but we did just expose our API to anything that can reach this machine, so we need to lock this down a bit better. Don't worry, the rules are pretty straight-forward. (Note: replace '192.168.1.111' with your host which is sending logs to Logstash and '192.168.1.210', '192.168.1.220', and '192.168.1.230' with the two Logstash servers and virtual IP address, in that order).
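Something like the following covers it; these rich rules are one way to do it, so adapt as needed (remember: 192.168.1.111 is the log source, 192.168.1.210 and 192.168.1.220 are the Logstash servers, and 192.168.1.230 is the virtual IP):

```shell
# Allow syslog traffic from the log source to UDP 5514 (real and virtual IPs)
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.111" port port="5514" protocol="udp" accept'

# Allow the Logstash servers to query each other's API for the health checks
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.210" port port="9600" protocol="tcp" accept'
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.220" port port="9600" protocol="tcp" accept'

# Allow the VRRP protocol between the keepalived peers
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'

sudo firewall-cmd --reload
```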

The Second Logstash server

Shut down the Logstash server virtual machine since it's much easier to just clone this one and make a few configuration changes instead of stepping through this process all over again.

Now that it's shut down...

Boot the second one up (leaving the first powered off for now) and make the following changes in the VM console:

Set hostname

sudo hostnamectl set-hostname stash2

Set IP address

sudo vi /etc/sysconfig/network-scripts/ifcfg-<interface>

Change IPADDR to appropriate IP address

sudo systemctl restart network

Change Logstash listening IPs

sudo vi /etc/logstash/logstash.yml

Change http.host to stash2's IP address

sudo vi /etc/logstash/conf.d/ryanisawesome.conf

Change host to stash2's IP address

Swap the keepalived unicast IP addresses

sudo vi /etc/keepalived/keepalived.conf

Swap the unicast_src_ip and unicast_peer IP addresses

Reboot

sudo reboot now

Now you should be able to start the original virtual machine (in my case, stash1).

Putting It All Together

We've finally reached the point to fire up all the services and test out the HA Logstash configuration. On each Logstash VM:

sudo systemctl enable logstash

sudo systemctl start logstash

sudo systemctl enable keepalived

sudo systemctl start keepalived

You can monitor that Logstash is up by viewing the output of:

sudo ss -nltp | grep 9600

If you have no output, it's not up yet. If it doesn't come up after a few minutes, check /var/log/logstash/logstash-plain.log for any error messages. Personally, I like to "tail -f" this file right after starting Logstash to ensure everything is working properly (plus it looks cool to those who look over your shoulder as all that nerdy text flies by).

On each machine, you can now check that ipvsadm and keepalived are configured properly and playing nice together. You should be able to run the following command and get similar output (your IPs may be different, but you should see TWO real servers):

sudo ipvsadm -ln

You can also check which server currently holds the virtual IP:

ip a

Only ONE of the two servers should have the virtual IP assigned (by default, the one with the higher IP address, since the priorities are equal and the IP is the tie-breaker in VRRP)

To test that load balancing is happening, the sample log source (in my case, my host operating system) will need to send some data over UDP 5514 to the virtual IP address. To do this, I'm going to use netcat (but really anything that can send data manually over UDP will work... including PowerShell).
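With netcat, sending a few quick test messages to the virtual IP looks something like this (192.168.1.230 being my virtual IP):

```shell
# Send four test messages over UDP to the virtual IP on port 5514
for i in 1 2 3 4; do
  echo "test message $i" | nc -u -w1 192.168.1.230 5514
done
```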

What I just did was send four test messages to the virtual IP. If everything worked properly, the virtual server will have received the messages and load-balanced, in a round-robin fashion, to each server's /tmp/itworked.txt file. On each server, let's check it out.