I have a cluster of three servers that are all part of an ELK (Elasticsearch Logstash Kibana) cluster receiving netflow/sflow/ipfix data. Everything appears to be working fine and without using netdata one would assume it was working perfectly but I'm seeing the following issue:

I've been researching this for most of my time over the last few days and I'm not making any progress whatsoever. I've tried tuning things with sysctl with absolutely no effect. The same graph pattern continues relentlessly, and the RcvBufErrors and InErrors peak at about 700 events per second. Occasionally I'll see a spike or a dip while making changes in a controlled manner, but the same pattern always prevails with the same peak values.
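For anyone following along, the RcvBufErrors and InErrors counters that netdata graphs come straight from the kernel, so they can be watched without netdata at all. A minimal sketch, assuming a typical Linux /proc layout:

```shell
# The "Udp:" lines in /proc/net/snmp carry the cumulative InErrors and
# RcvBufErrors counters; take two snapshots a few seconds apart and
# diff them to get an events-per-second rate.
awk '/^Udp:/ { print }' /proc/net/snmp
```

If the RcvBufErrors delta keeps climbing between snapshots, the socket receive buffer is still overflowing regardless of what the application reports.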

The values I've tried increasing with sysctl and their current values are:

Note that I'm also getting the netdata "10min netdev budget ran outs" alarm (5929 events), but this is less of a concern. That's why I've increased net.core.netdev_budget and net.core.netdev_max_backlog as described above.
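For reference, each of these sysctls is backed by a file under /proc/sys, so the live values can be checked directly. A sketch of the reads only; the right numbers are workload-dependent and not something I can give here:

```shell
# Read the live values of the knobs mentioned above; writing to these
# files as root (or using sysctl -w) changes them immediately, with no
# reboot needed. Persist them in /etc/sysctl.conf or /etc/sysctl.d/.
cat /proc/sys/net/core/netdev_budget
cat /proc/sys/net/core/netdev_max_backlog
cat /proc/sys/net/core/rmem_max
```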

Since I'm using ElastiFlow on top of Logstash, I've also tried raising the number of workers (from 4 to 8), the queue size (from 2048 to 4096), and the receive buffer (from 32 MB to 64 MB) for each of the Logstash inputs, but I'm not seeing any difference either. I've given Logstash plenty of time after the restart to reflect the new settings, but the issue remains the same, although the patterns on the graphs did change somewhat. I see more RAM being used by UDP, etc., but no change in the packet loss situation.
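For context, those three knobs all live on the Logstash udp input. A sketch of what the tuned input would look like; the port and codec here are illustrative, not taken from the original post:

```conf
input {
  udp {
    port                 => 2055      # illustrative; use your collector port
    codec                => netflow
    workers              => 8         # raised from 4
    queue_size           => 4096      # raised from 2048
    receive_buffer_bytes => 67108864  # 64 MB, raised from 32 MB
  }
}
```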

Any ideas on what I can do to find out what I need to change and how to actually determine what they should be set to would be appreciated.

Edit the systemd service file for Logstash, which should be /etc/systemd/system/logstash.service. Change Nice=19 to Nice=0 and restart Logstash.
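Rather than editing the packaged unit file (which a package upgrade can overwrite), the same change can be made with a systemd drop-in override. A sketch; the file name is arbitrary:

```ini
# /etc/systemd/system/logstash.service.d/override.conf
# Apply with: systemctl daemon-reload && systemctl restart logstash
[Service]
Nice=0
```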

At a CPU nice level of 19, Logstash runs at the lowest priority, and just about any other process will bump it off the CPU. Changing it to a nice level of 0 (the default when no nice level is specified) should significantly increase the throughput of Logstash and reduce UDP packet loss.
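The effect of the nice level is easy to see with the nice utility itself, which prints the inherited niceness when run without a command. A sketch, assuming the shell is running at the default level 0:

```shell
# nice with no command prints the current niceness; with -n it runs a
# command at an adjusted level. Unprivileged users can only raise it.
nice              # prints the inherited level (0 by default)
nice -n 19 nice   # prints 19: the level Logstash was running at
```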

The fact is that the UDP input for Logstash is not designed to scale well across multiple CPUs. There is a well-documented rate of diminishing returns, largely related to kernel buffer contention and the fact that Java doesn't pin the worker threads to specific cores. IMO, 4 workers is the sweet spot between throughput and operational overhead. In benchmarking the Logstash Netflow codec, there is almost no gain above 16 cores.

If I had a 56-core machine and needed to maximize throughput, I would run multiple instances of Logstash, fronted by NGINX as a load balancer.
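Since version 1.9.13, NGINX's stream module can proxy UDP, so the fan-out might look something like this. A sketch; the ports and the two-instance layout are assumptions, not part of the answer:

```nginx
stream {
    upstream logstash_netflow {
        # one Logstash instance per group of cores,
        # each with its own udp input port
        server 127.0.0.1:2055;
        server 127.0.0.1:2056;
    }
    server {
        listen 9995 udp;              # exporters send flows here
        proxy_pass logstash_netflow;
        proxy_responses 0;            # flow export is one-way; expect no replies
    }
}
```

One caveat to be aware of: behind a plain UDP proxy, Logstash sees the datagrams arriving from the proxy's address rather than the exporter's, so anything keyed on the sending host needs to rely on the exporter fields inside the flow records themselves.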