Hi,
We have a v6.3 cluster (Java 1.8), and from time to time we see a problem with GC on the data nodes.
The rate drops to zero and the JVM heap usage goes up to 100% (which causes the cluster to stop serving).
The data nodes have 30 GB of RAM, and half of it is allocated to the heap.

First, we need to understand the high heap usage in this case. GC failures are expected once heap usage approaches the 100% mark. For more details, I recommend going through this article.

I have a few questions to help understand your problem.

What is the size of the cluster and its topology (number of master & data nodes)?

How many shards is the cluster hosting, and how are they distributed per node?

@Itay_Bittan Slow GC will impact all other operations on the cluster. I see you have changed some of the cache settings: indices.memory.index_buffer_size and indices.queries.cache.size are both increased to 30% (each defaults to 10%), and indices.requests.cache.size is increased to 5% (the default is 1%).

Can you check the sizes of these individual caches? You can fetch them using _nodes/stats/indices?pretty. You may be over-allocating the caches.
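To compare nodes quickly, the per-node cache figures can be pulled out of that response with a few lines of Python. This is only a sketch: it assumes the response has already been fetched and parsed into a dict, that the field names match what 6.x node stats report (query_cache, request_cache, fielddata, segments), and the node name and byte values in the sample are made up for illustration.

```python
def cache_sizes_per_node(stats):
    """Return {node_name: {cache_name: bytes}} from a parsed
    _nodes/stats/indices response (a plain dict)."""
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        indices = node.get("indices", {})
        result[node.get("name", node_id)] = {
            "query_cache": indices.get("query_cache", {}).get("memory_size_in_bytes", 0),
            "request_cache": indices.get("request_cache", {}).get("memory_size_in_bytes", 0),
            "fielddata": indices.get("fielddata", {}).get("memory_size_in_bytes", 0),
            "segments": indices.get("segments", {}).get("memory_in_bytes", 0),
        }
    return result


# Hypothetical, heavily trimmed response used only to show the shape.
sample = {
    "nodes": {
        "abc123": {
            "name": "data-1",
            "indices": {
                "query_cache": {"memory_size_in_bytes": 1000},
                "request_cache": {"memory_size_in_bytes": 200},
                "fielddata": {"memory_size_in_bytes": 30},
                "segments": {"memory_in_bytes": 4000},
            },
        }
    }
}

for name, sizes in cache_sizes_per_node(sample).items():
    print(name, sizes)
```

Missing sections are reported as 0 rather than raising, so the same function can be run against nodes that do not expose every field.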

Also, how big is the cluster_state? The cluster_state size is proportional to the mapping size.

What is the average shard size in your ES cluster? Do you have any shards larger than 40G?

@Amit_Michelson @Itay_Bittan It looks like you are over-allocating shards. You have 15G of heap available on each node, which means there are (700/15) ≈ 47 shards per 1G of heap. The recommendation is 20-25 shards per 1G of heap, so for 15G the shard count should be 300-375 per node. More details can be found at this link. Also, the more heap per node you allocate to caching and other things, the fewer shards that node can accommodate.
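The arithmetic above can be checked with a short Python sketch. The function names are made up for illustration, and the 20-25 shards per 1G figure is simply the guideline quoted in this post, not a hard limit.

```python
def shards_per_gb(shards_on_node, heap_gb):
    """Shard density: how many shards each GB of heap has to carry."""
    return shards_on_node / heap_gb


def recommended_shard_range(heap_gb, low_per_gb=20, high_per_gb=25):
    """Shard count range for a node, using the 20-25 shards per GB guideline."""
    return heap_gb * low_per_gb, heap_gb * high_per_gb


print(round(shards_per_gb(700, 15)))   # 47
print(recommended_shard_range(15))     # (300, 375)
```

With roughly 700 shards on a 15G heap, the node carries about twice the suggested upper bound.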

For the current situation, you need to either reduce the number of shards or increase the number of nodes to accommodate the existing shard count.

As you seem to have indices that are read-only, I would recommend force merging them down to a single segment if you are not already doing so. As discussed in this webinar, this has the potential to reduce your heap pressure.
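For reference, a force merge down to one segment is a POST to the index's _forcemerge endpoint with max_num_segments=1. The sketch below only builds the request with the Python standard library and does not send it; the host http://localhost:9200 and the index name serving-index-example are placeholders, not values from this thread.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def force_merge_request(base_url, index, max_num_segments=1):
    """Build (but do not send) a POST request for the _forcemerge API."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index)}/_forcemerge?{query}"
    return Request(url, method="POST")


# Placeholder host and index name, for illustration only.
req = force_merge_request("http://localhost:9200", "serving-index-example")
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would actually issue the call, and on a large index that call can take a long time to return.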

Thanks guys.
To be more accurate:
Total cluster size: 750GB
Each of the serving data nodes holds ~710 shards.
Each of the indexing data nodes holds ~85 shards.
Serving indices replication_factor = 6
Serving indices are being replaced once (or twice) a day.
Our index sizes vary, from a few MB to 15 GB.
@Christian_Dahlqvist We do force merge before relocating an index from the index zone to the serving zone. We have an interesting issue there, BTW: the size of the index seems to increase after this operation (while we expected the opposite, since we have a lot of deleted docs).
We are doing it with only_expunge_deletes=False.