Kamil Mrzygłód's personal blog

For some reason I've had a hard time finding proper use cases for the Event Hub Capture feature. At first glance it seems like reasonable functionality, which can be used for many different purposes:

in-built archive of events

seamless integration with Event Grid

input for batch processing of events

On the other hand, I haven't seen any use cases (I mean besides the documentation) using Capture in a real scenario. What's more, the price of this feature (roughly $70 per TU) could be a real blocker in some projects (for 10 TUs you pay ~$200 monthly for processing of events, now add an additional ~$700 - my heart bleeds...). So what is Capture really for?
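To make the numbers above concrete, here is a minimal sketch of the arithmetic. The per-TU throughput price is an assumption derived from the rough "~$200 for 10 TUs" figure, not an official rate, and the function name is made up for illustration.

```python
# Rough monthly cost estimate using the approximate figures quoted above.
THROUGHPUT_COST_PER_TU = 20.0  # assumption: ~$200 / 10 TUs
CAPTURE_COST_PER_TU = 70.0     # roughly $70 per TU, as quoted above


def monthly_cost(throughput_units: int, capture_enabled: bool) -> float:
    """Return an approximate monthly bill in USD (event-count charges ignored)."""
    cost = throughput_units * THROUGHPUT_COST_PER_TU
    if capture_enabled:
        cost += throughput_units * CAPTURE_COST_PER_TU
    return cost


print(monthly_cost(10, capture_enabled=False))  # 200.0
print(monthly_cost(10, capture_enabled=True))   # 900.0
```

With these rough rates, turning Capture on multiplies the bill by about 4.5.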

Compare

Experience shows that you should never disqualify a service until you compare it with other cloud components. Recently my team has been struggling with choosing the right tool to process messages in our data ingestion platform. To be honest, it's not as obvious a scenario as it looks. You could choose:

EH Capture with Azure Data Lake Analytics

Direct processing by Azure Functions

Azure Batch with event processors

Stream Analytics

VM with event processors

EH Capture + Azure Batch

Really, we can easily imagine several solutions, each one having pros and cons. Some solutions seem to be more suited to real-time processing (like Stream Analytics or a VM with event processors), some require delicate precision when designing (Functions), and some seem to be fun, but when you calculate the cost, it's 10x bigger than the other choices (ADLA). All are more or less justified. But what about functionality?

IT'S JUST NOT THERE

Now imagine you'd like to distribute your events between different directories (e.g. in Data Lake Store) using a dynamic parameter (e.g. a parameter from an event). This simple requirement easily kills some of the solutions listed above:

ADLA has this feature in private preview

Stream Analytics doesn't have this even in the backlog

On the other hand, even if this feature were available, I'd consider a different path. If in the end I expect to have my events catalogued, Azure Batch + EH Capture seems like a good idea (especially since it allows batch processing). This doubles the amount of storage needed but greatly simplifies the solution (and gives me plenty of flexibility).
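The routing step itself is simple once a batch job has the events in hand. Below is a minimal sketch, assuming the captured events have already been decoded into dictionaries (Capture itself writes Avro files, so a real job would need to read those first). The `deviceId` key, the `/events` root and the `route_events` function are hypothetical names used only for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def route_events(
    events: Iterable[dict],
    key: str = "deviceId",   # hypothetical event property used for routing
    root: str = "/events",   # hypothetical root directory in the target store
) -> Dict[str, List[str]]:
    """Group serialized events by the directory they should be written to."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        # Events without the routing property land in a fallback directory.
        value = str(event.get(key, "unknown"))
        buckets[f"{root}/{value}"].append(json.dumps(event))
    return dict(buckets)


batch = [
    {"deviceId": "a", "value": 1},
    {"deviceId": "b", "value": 2},
    {"deviceId": "a", "value": 3},
]
for directory, lines in route_events(batch).items():
    print(directory, len(lines))
```

Each directory's lines could then be appended to a file in that directory by whatever storage client the job uses.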

There's one big flaw in such a design, however - if we're considering dropping events directly into a Data Lake Store instance, we have to have them in the same resource group (which doesn't always work). In such a scenario you have to use Blob Storage as a staging area (which could be an advantage with the recent addition of soft delete for blobs).

What about money?

Still, Capture is more expensive than a simple solution using Azure Functions. But is that always the case? I find the pricing of Azure Functions better if you're able to process events in batches. If for some reason you're unable to do that (or the batches are really small), the price goes up and up. That's why I said that this requires delicate precision when designing - if you think about all the problems upfront, you'll be able to use the easiest and simplest solution.
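A quick sketch of why batch size matters so much here, assuming the cost grows with the number of function executions. The event volume is a made-up figure and no real prices are used, only execution counts.

```python
def executions_needed(events: int, batch_size: int) -> int:
    """Number of function executions needed to process `events` in batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Integer ceiling division: a partial last batch still costs one execution.
    return (events + batch_size - 1) // batch_size


daily_events = 1_000_000  # hypothetical volume
for size in (1, 10, 100):
    print(size, executions_needed(daily_events, size))
```

Going from single events to batches of 100 cuts the execution count by two orders of magnitude, which is where the pricing advantage comes from.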

Conclusion

I find Capture useful in some of the listed scenarios, but the competition is strong here. It's hard to compete with well-designed serverless services, which offer better pricing and often perform with comparable results. Remember to always choose what you need, not what you're told to. Each architecture has different requirements, and in most cases the many available guides won't cover your solution.

This is a perfect test of whether you understand the service or not. Pardon me if this was/is obvious to you - apparently for me it was not. What's more, some people still seem to be confused. And we're here to avoid confusion. Let's start!

Hit the limit

Event Hub's model (both from a technical and a pricing point of view) is one of the easiest to understand and really, really straightforward. Let's perform a quick calculation:

In the overall price we can exclude the cost related to the number of events processed (since it's something like 5% of the money spent on this service). Now there's one more thing worth mentioning - TU limits come as a pair - 1TU gives you 1MB (or 1K events) per second of ingress and 2MB (or 4096 events) per second of egress. The question is:

"How to kill Event Hub's egress?"

Let's focus a little and try to find a scenario. We cannot easily exceed 1MB of egress with a max ingress of 1MB. Maybe it'd be doable by loading lots of data into EH and then introducing a consumer which is able to fetch and process 2MB of events per second. Still, this doesn't allow us to exceed the maximum of 2MB of egress. We're safe.

But what if you introduce another *consumer group*? Since there's no filtering in Event Hub, each consumer group gets the same events (in other words - when you have N consumer groups, you will read the stream N times). Now consider the following scenario:

1MB of ingress -> Consumer1 (Consumer Group A), Consumer2 (Consumer Group B)

You've just hit the limit of 1TU(since you have 2MB of egress). Now let's try to scale and extend this solution. Let's introduce another consumer group:

1MB of ingress -> Consumer1 (Consumer Group A), Consumer2 (Consumer Group B), Consumer3 (Consumer Group C)

Now 1TU is not sufficient. By scaling out our Event Hub to 2TUs we can handle up to 4MB of egress. At the same time we can handle 2MB of ingress. So if for some reason throttling was your friend and kept the load under some limit, you can quickly face problems and need to scale out once more.
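The reasoning above boils down to a small formula. Here is a sketch, assuming every consumer group reads the full stream at the same rate as it is ingested, and using the per-TU limits of 1MB/s ingress and 2MB/s egress (event-count limits are ignored). The `required_tus` function is a made-up helper for illustration.

```python
import math

INGRESS_MB_PER_TU = 1.0  # MB/s of ingress per throughput unit
EGRESS_MB_PER_TU = 2.0   # MB/s of egress per throughput unit


def required_tus(ingress_mb_per_s: float, consumer_groups: int) -> int:
    """Estimate the TUs needed when each consumer group reads the whole stream."""
    egress_mb_per_s = ingress_mb_per_s * consumer_groups
    by_ingress = math.ceil(ingress_mb_per_s / INGRESS_MB_PER_TU)
    by_egress = math.ceil(egress_mb_per_s / EGRESS_MB_PER_TU)
    return max(by_ingress, by_egress, 1)


print(required_tus(1.0, 2))  # 1 -> two groups just fit in 1TU
print(required_tus(1.0, 3))  # 2 -> a third group forces a scale-out
print(required_tus(2.0, 3))  # 3 -> once ingress grows to 2MB/s, 2TUs are not enough
```

The last line is the "scale out once more" case: with three consumer groups, egress becomes the bottleneck as soon as ingress actually uses the extra capacity.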

Be smarter

As you can see, mistakes can be made, and relying on consumer groups to filter (or orchestrate) events is not the way to go. In such a scenario it'd be much better to post events directly to e.g. Event Grid or to use topics from Service Bus, so we can easily route messages. You have to understand that the main purpose of Event Hub is to act as a really big pipe for data, which can be easily digested - misusing it could give you serious headaches.