How Per-Second AWS Billing Helps With Data Processing

Because of the hourly increments in billing, users spend a lot of time playing a giant game of Tetris with their big data workloads — figuring out how to pack jobs to use every minute of the compute hour. Examples:

If a job could be run on 10 nodes and finished in 20 minutes, it was better to run it on fewer nodes so that it takes around 50 minutes. As a result, you would pay less.

Running two 10 node jobs that took 20 minutes in parallel would cost two times more than running them sequentially.

The above problem got compounded if there were many such jobs to be run. To handle this challenge, many organizations turned to a resource scheduler like YARN. Organizations were following the traditional on-premises model of setting up one or more big multi-tenant cluster on the cloud and running YARN to bin-pack the different jobs.

Read on to see how this has changed as a result of per-second billing.

Related Posts

Tristan Robinson looks at what’s currently available in terms of security on Azure Databricks: You’ll notice that as part of this I’m retrieving the secrets/GUIDS I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again this would not […]

Melissa Coates has some great advice for people working with data lakes: Q: Partitioning by date is common. Where should the dates go in the folder hierarchy? Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder […]