Scaling Azure Functions to Make 500,000 Requests to Weather.com in Under 3 Minutes

One of the best parts of my job is having the opportunity to work with some amazing customers who are putting together innovative solutions in the cloud. In this post, I’d like to share some details about a recent engagement that taught me a lot about Azure Functions, scale, and other related topics.

The Challenge

Retrieve the most recent, up-to-date forecast information for 500,000 locations in the United States by calling an API hosted on weather.com.

Complete the process in under 5 minutes.

Repeat every 15 minutes.
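Taken together, these goals imply a substantial sustained request rate. A quick back-of-the-envelope calculation (simple arithmetic, not from the original engagement notes):

```python
# Rough throughput the goals imply.
requests = 500_000

five_minute_rate = requests / (5 * 60)    # ~1,667 requests/second to hit the 5-minute goal
three_minute_rate = requests / (3 * 60)   # ~2,778 requests/second at the 3-minute mark
```

Sustaining thousands of outbound HTTP requests per second is well beyond what a single compute instance should attempt, which is what drives the fan-out design described below.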

In the end, with the help of a great team, we were able to accomplish this in under 3 minutes! Here is how we put it together.

Overall Design

At a very high level, this diagram represents the overall flow:

Coordinates are stored in Azure Table Storage – it’s cheap, easy to work with, and fast (as long as you partition efficiently). Since the records are not going to change, this step is only done once.

When the process begins, Durable Functions are used to populate a group of Storage Queues with the coordinates from the table. Each message represents a pending request to the weather service.

An Azure Function is invoked for each message in the queue.

Requests are made to the 3rd party service to get the forecast details.

The results of the forecast requests are published to Event Hubs for further processing.

What about batching calls to weather.com?

One of the first challenges we encountered was that the weather.com API provided to the customer did not support batching requests. This means that each location (latitude and longitude coordinates) requires its own request to the API.

Load-Balancing

The initial design leveraged a single storage queue and Function App. After some testing, it quickly became apparent that the single Function App processing all the messages was the bottleneck. Even though the queue-triggered function scaled fairly quickly, it didn’t meet the goal of completing all 500,000 requests in under 5 minutes.

To address this, the load had to be balanced across multiple queues and Function Apps. Each queue has a dedicated Function App and only a subset of the records to process. For example, the first queue is responsible for records 1-100,000, the second queue is populated with the next 100,000 records from the table, and so on. This basic load-balancing exercise distributes the work across multiple services that can now run in parallel.
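The record-to-queue mapping described above can be sketched in a few lines. This is an illustration, not the team's actual code, and the queue naming convention here is hypothetical:

```python
# Minimal sketch of the load-balancing split: 500,000 records spread
# evenly across 5 queues of 100,000 records each.
RECORD_COUNT = 500_000
QUEUE_COUNT = 5
RECORDS_PER_QUEUE = RECORD_COUNT // QUEUE_COUNT  # 100,000

def queue_for_record(record_index: int) -> str:
    """Map a 1-based record index to its dedicated queue name (name is illustrative)."""
    queue_number = (record_index - 1) // RECORDS_PER_QUEUE + 1
    return f"weather-requests-{queue_number}"
```

With this split, records 1-100,000 land in queue 1, records 100,001-200,000 in queue 2, and so on, so each Function App sees only its own fifth of the load.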

Ultimately, this approach significantly increases the throughput and rate of requests made to the weather service.

Calling the Weather.com API

The Weather Company has a set of premium APIs for their enterprise customers that allow them to retrieve forecast information in a variety of ways. Obtaining an API key for this advanced analytics service requires a content licensing agreement with The Weather Company.

The function that processes each message is extremely simple: nothing more than a queue-triggered function with an output binding to an Event Hub. Even though we can only make one request to the weather API at a time, the function expects the message payload to contain a collection of coordinates, just in case batching is supported in the future.
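The shape of that handler can be sketched as follows. This is a stand-in for the real queue-triggered function – the payload field names, the API call, and the Event Hub sink are all stubbed for illustration:

```python
# Sketch of the message handler's shape. The real function is
# queue-triggered with an Event Hub output binding; here the forecast
# API and event sink are passed in as plain callables.
import json

def handle_message(message_body: str, fetch_forecast, publish_event) -> int:
    """Process one queue message: one API request per coordinate pair."""
    payload = json.loads(message_body)
    for coord in payload["coordinates"]:
        # The weather.com API does not support batching, so each
        # coordinate pair gets its own request.
        forecast = fetch_forecast(coord["lat"], coord["lon"])
        publish_event(forecast)
    return len(payload["coordinates"])

# Stubbed collaborators, for illustration only:
events = []
count = handle_message(
    json.dumps({"coordinates": [{"lat": 47.6, "lon": -122.3}]}),
    fetch_forecast=lambda lat, lon: {"lat": lat, "lon": lon, "temp": None},
    publish_event=events.append,
)
```

Because the handler loops over a list of coordinates rather than assuming a single pair, a future batch-capable API would only require changing how messages are populated, not the function itself.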

Generating the Load

Now comes the most challenging part – generating the load for the Function Apps to process. Remember, the goal is to distribute 500,000 records evenly across five queues, every 15 minutes. We want this done as quickly as possible – that’s where Durable Functions and the Fan-out/Fan-in pattern come in.

A quick breakdown of the load generator:

It begins with a normal HTTP-triggered function that we can call manually or on a recurring schedule.

An orchestrator function is invoked to begin the workflow.

The orchestrator function calls 5 sub-orchestrator functions, one for each queue.

The sub-orchestrators spin up a large set of activity functions that will retrieve records from the table and send messages to their respective queues.
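The fan-out/fan-in shape of the steps above can be simulated with plain `asyncio`. To be clear, this is not Durable Functions code – orchestrators there have their own replay semantics and SDK – but the concurrency structure is the same, and the counts match the post (five queues, 100,000 records each, in pages of 1,000):

```python
# Simulation of the fan-out/fan-in shape: one orchestrator fans out to
# 5 sub-orchestrators, each of which fans out to activity tasks and
# then gathers (fans in) their results.
import asyncio

async def activity_fill_queue(queue_number: int, page: int) -> int:
    # In the real solution this reads 1,000 records from Table Storage
    # and enqueues them; here it just reports the page size.
    return 1_000

async def sub_orchestrator(queue_number: int) -> int:
    pages = 100_000 // 1_000  # 100 activity invocations per queue
    results = await asyncio.gather(
        *(activity_fill_queue(queue_number, p) for p in range(pages))
    )
    return sum(results)  # fan-in: total records sent to this queue

async def orchestrator() -> int:
    per_queue = await asyncio.gather(*(sub_orchestrator(q) for q in range(1, 6)))
    return sum(per_queue)

total = asyncio.run(orchestrator())
```

The key idea in both this sketch and the Durable Functions version is that the orchestrator awaits all of its children at once rather than sequentially, so the 500 activity invocations run concurrently instead of one after another.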

When putting this together, I didn’t realize that you can only retrieve 1,000 records at a time from a table. This forced me (in a good way) to come up with a simple partitioning strategy. In short, every set of 1,000 records was given a unique partition key with the following naming convention: {queue-number}-{partition-count}.

So the first 1,000 records for queue #1 would have the partition key 1-0. The next 1,000 records would be assigned the key 1-1. The following screenshot highlights some of these records in Azure Storage Explorer:

Now, when the coordinates are retrieved by the activity functions, they return 1,000 records at a time.
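The partition-key scheme described above can be expressed directly. This is a sketch of the naming convention from the post, mapping a 1-based record index to its key (queue numbers start at 1, partition counts at 0, matching the examples):

```python
# {queue-number}-{partition-count} partition keys: each queue owns
# 100,000 records, and each partition holds one 1,000-record page
# (the maximum a single Table Storage query segment returns).
RECORDS_PER_QUEUE = 100_000
PAGE_SIZE = 1_000

def partition_key(record_index: int) -> str:
    queue_number = (record_index - 1) // RECORDS_PER_QUEUE + 1
    offset_in_queue = (record_index - 1) % RECORDS_PER_QUEUE
    partition_count = offset_in_queue // PAGE_SIZE
    return f"{queue_number}-{partition_count}"
```

Because each partition holds exactly one page, a single query per partition key returns a full 1,000-record batch with no continuation tokens to follow.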

Azure Service Limits and Design Considerations

Looking back on this project, one of the key takeaways was the importance of researching and understanding the limits for each of the services we used.

This goes beyond just learning about quotas within a subscription. It is about considerations such as service level agreements (SLAs), throughput options, costs, and a plethora of other variables that must influence how you design and architect solutions in the cloud. These limits, strengths, and weaknesses are magnified when scale and performance are principal requirements of a solution.

Why Queues Instead of Event Hubs?

We went with queues because they are a natural fit for this competing-consumer model, and queue-triggered functions also scale very quickly. Event Hubs didn’t seem like the right fit for the load-generation part – we don’t need ordering or message retention, and it isn’t intended to act like a queue. The scale needs to build organically each time the process kicks off (every 15 minutes); with Event Hubs, the messages would be retained and the scale-out on the Functions side wouldn’t come close.