Additionally, there is extreme memory usage on boxes affected by the lack of registry updates: the filebeat process can grow to several dozen gigabytes of RES memory (until it is eventually killed by the OOM killer).

UPDATE: So it looks like changing tail_files from true to false isn't a hard requirement to trigger this; I already see some nodes that were left with "true" and also failed to update the registry after a restart. It just happens less often on them.

There is nothing meaningful in the FB logs.

It's CentOS 6.

All affected FB instances are still delivering events properly despite the growing memory consumption (they will eventually be killed by the OOM killer). My biggest filebeat instance at the moment is at 22 GB of memory used:

It's probably also not an open-file ulimit problem (the limit was bumped to 50k, and according to filebeat stats there are usually ~0.5k files open).
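To rule out the fd-limit theory on a box like this, a quick spot-check via /proc is enough (a sketch for Linux; pass the filebeat PID as the first argument, it defaults to the current shell just for illustration):

```shell
#!/bin/sh
# Compare a process's open-fd count against its per-process limit.
# Hypothetical helper: pid defaults to the current shell if no PID is given.
pid="${1:-$$}"
echo "open fds: $(ls "/proc/$pid/fd" | wc -l)"
grep 'Max open files' "/proc/$pid/limits"
```

Run as e.g. `sh check_fds.sh "$(pgrep -o filebeat)"`; if the open-fd count stays far below the "Max open files" soft limit, the ulimit is not the bottleneck.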

ruflin:

After starting filebeat with tail_files: true, did you add events and check if the registry had the correct entries?

Did you add events during the time filebeat was down?

Yes, I'm running Filebeat on syslog receiver boxes, so logs are constantly updated (with some newly created files every once in a while), including the period when FB is down. It looks like the registry file wasn't updated at all after the first restart.

One more UPDATE: Apparently even a restart isn't required to trigger this bug. I've already had more than one occurrence of Filebeat instances that suddenly stopped updating the registry file without any restart or second instance being spawned.

So the example timeline was as follows:
18:00 Filebeat started.
00:00 Registry updates stopped; Filebeat began to consume more memory from that point.
Apart from the lack of registry updates, the nodes remained operational and were able to deliver events to logstash.
00:00 was also the time when a number of files were created, but it definitely didn't hit the open file limit.

So I'm wondering: is it possible that in some cases all of the registry updates are accumulated in memory instead of being written to disk? That would explain both issues.

For ~10 minutes after restarting Filebeat with publish_async: false, zero events are delivered, but Filebeat is constantly updating the registry file. In addition, the time needed to fully stop filebeat (waiting until the process shuts down) is incredibly long (3-5 minutes?). Worth noting that rc1 never behaved that way.

We currently have the suspicion that the above is related to publish_async and the large number of logstash instances you have listed in your config. It would be interesting to hear if you see the same behaviour if you just use 1 LS instance.

I would expect in the above that published_but_not_acked_events is > 0. I will check here if perhaps something with the counting goes wrong.

Could you share your full log files from startup until you stop filebeat?

I can confirm that disabling publish_async helped and I'm no longer seeing:

- increased memory usage,
- any problems with registry updates (both while running and after restarts).
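For reference, the workaround above amounts to a filebeat.yml fragment like the following (the exact key placement is an assumption; check the reference config for your filebeat version):

```yaml
filebeat:
  # Workaround: disable asynchronous publishing so batches are
  # published, and the registry updated, strictly in order.
  publish_async: false
```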

ruflin:

We currently have the suspicion that the above is related to publish_async and the large number of logstash instances you have listed in your config. It would be interesting to hear if you see the same behaviour if you just use 1 LS instance. I would expect in the above that published_but_not_acked_events is > 0. I will check here if perhaps something with the counting goes wrong.

The smallest setup I've tested so far had only 3 LS nodes, and it was affected by this issue too. I'll test with one.

Surprisingly, (for now) I've spotted no hard evidence of published_but_not_acked_events being the culprit: I had nodes with 0 non-acked events that were still affected by the registry problems, and I also have a lot of nodes that show non-acked events with publish_async disabled yet behave just fine.

publish_async + load balancing might not play well together. The problem with async + load balancing is that batches can be transferred out of order. If one batch gets 'stuck' between retries while others can still be sent, all batches might pile up in memory. Normally the internal queues are bounded, so beats should stop at some point, but some combination of settings, speed differences, and spurious network issues might cause the internal queue to become effectively unbounded. I have a potential fix in mind, but still want to understand exactly what is going on. I wonder if the problem is reproducible with publish_async and exactly one LS instance.

published_but_not_acked_events indicates that some events could not be published (not ACKed by LS). That is, these events must be resent (re-enqueued and load-balanced once more).

The thing about the registry is that it requires in-order updates. That is, all published batches (ACKed or not) need to be cached in memory in order to guarantee that registry updates happen in order. If one batch gets 'stuck' for whatever reason (e.g. sent to a very slow logstash instance), batches can pile up. The upper bound enforced by the bounded send queues might not trigger due to load balancing.
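The in-order requirement described above can be sketched as a toy model (not actual beats code, and the names are invented for illustration): ACKs that arrive out of order must be buffered until the oldest outstanding batch completes, so a single stuck batch makes the in-memory buffer grow without bound.

```python
class RegistryGate:
    """Toy model of in-order registry updates: batch states can only be
    flushed in publish order, so out-of-order ACKs are buffered in memory."""

    def __init__(self):
        self.next_to_flush = 0  # sequence number of the oldest unACKed batch
        self.pending = {}       # seq -> registry state, held until contiguous

    def ack(self, seq, state):
        """Record an ACK; flush every batch that is now contiguous."""
        self.pending[seq] = state
        flushed = []
        while self.next_to_flush in self.pending:
            flushed.append(self.pending.pop(self.next_to_flush))
            self.next_to_flush += 1
        return flushed  # states that may now be written to the registry file


gate = RegistryGate()
# Batches 1..4 are ACKed quickly, but batch 0 is stuck on a slow LS instance:
for seq in (1, 2, 3, 4):
    gate.ack(seq, f"state-{seq}")
assert len(gate.pending) == 4  # everything piles up; nothing hits the registry
# Once the stuck batch finally ACKs, all buffered updates flush in order:
assert gate.ack(0, "state-0") == [f"state-{s}" for s in range(5)]
assert not gate.pending
```

With load balancing, the "stuck" batch and the fast batches sit on different connections, which is why the bounded per-connection send queues do not necessarily cap this buffer.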

I guess issues could be triggered by load balancing alone if the instances handle batches at different speeds. A non-zero published_but_not_acked_events is often a sign that LS did not respond in time and the publish request either timed out or logstash closed the connection (logstash being overloaded). That is, published_but_not_acked_events can be a good indicator of some LS instances either being overloaded or having to deal with back-pressure themselves.

With this many LS instances, have you checked that the LS outputs can actually deal with the load?

Can you post your filebeat and LS configs? I have no idea about your settings.