Hey guys, it could be something silly I did on my end, but we have been having a lot of “hot sharding” problems lately.

I have had several people look at the stack (autoscaling group, thread pool size, config options, EC2 compute/network/RAM usage, etc.), and we still seem to be getting hot shards. That leads us to believe the random UUID partition key may not be distributing evenly across the shards.
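For what it's worth, Kinesis maps each record to a shard by MD5-hashing its partition key into a 128-bit integer and matching it against each shard's hash key range, so truly random UUIDs should spread evenly. A minimal sketch to sanity-check that assumption (`shard_for_key` is my own toy version, assuming evenly sized hash ranges, not the real Kinesis internals):

```python
import hashlib
import uuid
from collections import Counter

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Mimic Kinesis: MD5 the partition key into a 128-bit integer,
    then find which evenly sized hash range it falls into."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = (2 ** 128) // num_shards
    return min(hash_key // range_size, num_shards - 1)

# Simulate 100k random-UUID partition keys over 32 shards.
counts = Counter(shard_for_key(str(uuid.uuid4()), 32) for _ in range(100_000))
# With truly random UUIDs, each shard should see roughly 100000/32 = 3125 records.
```

If a simulation like this comes out flat but the real stream is still skewed, the problem is more likely the keys actually being sent (or per-record size differences) than the hashing itself.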

Unfortunately, it's only really visible at scale given our current provisioning, but under heavy load we noticed that about 10% of our shards start falling further and further behind.

Has anyone else seen this, or have any insight? I am going to set up Lambda logging of partition keys to DynamoDB to do some basic analysis and look for patterned output.
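Once the keys are logged, the analysis I have in mind is roughly this: count how often each key repeats, since a constant or low-cardinality key is the classic cause of one hot shard. A minimal sketch, assuming the logged keys have been exported as a plain list of strings (names here are hypothetical):

```python
from collections import Counter

def key_skew_report(partition_keys, top_n=5):
    """Given a sample of logged partition keys, report the most
    repeated keys and what fraction of traffic each accounts for."""
    counts = Counter(partition_keys)
    total = sum(counts.values())
    return [(key, n, n / total) for key, n in counts.most_common(top_n)]

# Hypothetical sample: one repeated key would pin a single shard.
sample = ["uuid-%d" % i for i in range(90)] + ["constant-key"] * 10
report = key_skew_report(sample, top_n=1)
# report[0] -> ("constant-key", 10, 0.1)
```

If the top keys each account for a tiny fraction of traffic, the keys are fine and I'd look at record sizes per shard instead.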

Our collector-to-raw stream runs perfectly with no visible hot sharding, even under the same loads that hot-shard our good/bad streams.

Looking forward to hearing back from you guys with suggestions on where to look. Thanks!

Also, if it helps/is related:
we run ip_lookups using MaxMind, event_fingerprint, and user-agent-utils (3 enrichments total)

The bad stream had content that was failing schema validation because a null was being passed to a required field.
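To illustrate the failure mode (this is a toy check, not the actual validator): the field is present in the payload, but its value is null, so validation rejects the event and it gets routed to the bad stream.

```python
def find_null_required(event: dict, required_fields) -> list:
    """Return required fields that are missing or explicitly null;
    either way the event fails validation and goes to the bad stream."""
    return [f for f in required_fields if event.get(f) is None]

# Hypothetical event with a null in a required field:
bad_event = {"event_id": "abc-123", "user_id": None}
find_null_required(bad_event, ["event_id", "user_id"])  # -> ["user_id"]
```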

It is typically our bad bucket that starts throttling on provisioned write throughput, and that is what drags our enriched stream behind: since the stream-enrich app processes both good and bad, when it's flooded with bad records, good pays the price too.
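For sizing the bad stream against a flood, the published per-shard write limits for provisioned Kinesis streams are 1,000 records/s and 1 MB/s. A back-of-envelope helper (my own sketch, numbers are the documented limits but the function name is made up):

```python
import math

# Published per-shard Kinesis write limits (provisioned mode).
RECORDS_PER_SHARD = 1000
MB_PER_SHARD = 1.0

def shards_needed(records_per_sec: float, mb_per_sec: float) -> int:
    """Minimum shard count to absorb a given write rate; the binding
    constraint is whichever per-shard limit is hit first."""
    return max(math.ceil(records_per_sec / RECORDS_PER_SHARD),
               math.ceil(mb_per_sec / MB_PER_SHARD),
               1)

# e.g. a bad-event flood of 4,500 records/s at roughly 2 MB/s:
shards_needed(4500, 2.0)  # -> 5
```

Worth noting: more shards only buys aggregate throughput. If the skew comes from one dominant partition key, that key's shard stays hot regardless of the total shard count.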

I tried scaling the number of shards, to no avail.
I added more stream-enrich boxes through our ASG; that did not seem to help (still falling behind).

I tried splitting hot shards, which helped somewhat but did not fix the root problem.
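For reference, when I split, I aimed for the midpoint of the parent shard's hash key range, which is the NewStartingHashKey the Kinesis SplitShard API expects for an even split. A minimal sketch of that arithmetic (the helper name is my own):

```python
def split_point(starting_hash_key: int, ending_hash_key: int) -> int:
    """Midpoint of a shard's hash-key range: the NewStartingHashKey
    you would pass to Kinesis SplitShard for an even 50/50 split."""
    return (starting_hash_key + ending_hash_key + 1) // 2

# Splitting the full 128-bit hash range in half:
split_point(0, 2 ** 128 - 1)  # -> 2**127
```

Which may explain the "helped somewhat" result: splitting only helps if the hot shard's traffic spans multiple hash keys. A single dominant key still hashes entirely into one child shard.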