“In general, the number of records per partition should not exceed 20 million. In addition, the size of a partition should not exceed 250 MB.”

This guidance is reasonable, but it implies a strict size limit when in fact no such hard-and-fast limit exists. Query performance remains good even when partitions grow beyond 20 million rows or 250 MB, so query performance is not the primary concern here. Rather, the main criteria are manageability and processing performance. Partitions can be processed in parallel, so the more there are, the more can be processed at once (assuming I/O can keep up). Furthermore, the smaller a partition is, the less memory the ProcessIndex step requires to build its indexes and aggregations, so smaller partitions reduce the likelihood of spilling temporary files to disk during processing (and the performance hit that comes with it).
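To make the spilling argument concrete, here is a toy back-of-envelope sketch. All the numbers and the 1.5x memory-overhead factor are hypothetical assumptions for illustration only, not measured ProcessIndex behavior: the point is simply that several partitions are indexed concurrently, so their combined working sets must fit in the available memory budget.

```python
def will_spill(partition_mb: float, parallel_jobs: int,
               memory_budget_mb: float, overhead: float = 1.5) -> bool:
    """Assume (hypothetically) that indexing a partition needs roughly
    `overhead` x its size in memory; spilling starts once the concurrent
    jobs together exceed the memory budget."""
    return partition_mb * overhead * parallel_jobs > memory_budget_mb

# With 8 concurrent jobs and a 4 GB budget, 250 MB partitions fit
# (250 * 1.5 * 8 = 3000 MB), while 500 MB partitions do not
# (500 * 1.5 * 8 = 6000 MB).
print(will_spill(250, 8, 4096))   # False
print(will_spill(500, 8, 4096))   # True
```

Halving the partition size either halves the concurrent memory pressure or lets twice as many partitions process in parallel within the same budget, which is exactly the trade-off the paragraph above describes.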

So lots of partitions is good, right? Well, not always. The more partitions you have the more things you have to manage. That carries a cost. Also, partitions have some metadata and when the number exceeds several thousand, the cost of managing this metadata becomes apparent.

There is much more detail on best practices around partition sizing in the updated performance guide, but there are some simple rules of thumb to keep in mind. These are (in order of priority):

a) Keep the number of partitions fewer than a few thousand

b) Get the most out of your hardware and process as much data as possible by processing multiple partitions in parallel

c) Avoid spilling to disk when processing indexes and aggregations (the ProcessIndex step of the processing stage)

d) Target 20 million rows or 250 MB per partition
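The precedence among these rules can be sketched as a small helper. The 250 MB target and the "few thousand" cap come from the rules above; the function name and the exact cap of 2,000 are illustrative assumptions:

```python
MAX_PARTITIONS = 2000        # rule (a): stay under "a few thousand" (assumed cap)
DEFAULT_TARGET_MB = 250      # rule (d): 250 MB per partition

def suggest_partition_size_mb(total_size_mb: float) -> float:
    """Start from the 250 MB target, but let rule (a) take precedence:
    if that target would produce too many partitions, grow the
    per-partition size until the count fits under the cap."""
    size = DEFAULT_TARGET_MB
    if total_size_mb / size > MAX_PARTITIONS:
        size = total_size_mb / MAX_PARTITIONS
    return size

# A 1 TB cube (1,048,576 MB) at 250 MB each would need ~4,194 partitions,
# so the helper grows the partition size to ~524 MB instead.
print(suggest_partition_size_mb(1_048_576))
```

This mirrors the first example below: for a large cube, the partition-count cap wins and the per-partition size drifts well above the 250 MB suggestion.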

Some simple examples:

- If you’re building a 1 TB cube, the first rule takes precedence and you should exceed the 250 MB suggestion to keep the partition count down. In this case, sizing partitions between 500 MB and 1 GB is reasonable.

- If records arrive at the end of each day and history does not change, you can get the quickest processing with daily partitions for recent data and monthly or annual partitions for historical data.

- If you have to reprocess a large portion of your cube on a regular basis, partition size can be driven by how many partitions can be processed in parallel while keeping CPU and/or I/O utilization as high as possible.
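The last example can also be sketched numerically. This is an illustrative calculation, not an SSAS API: the idea is to round the partition count up to a multiple of the parallel-processing slots so that every processing wave keeps all slots busy.

```python
import math

def partition_count_for_parallelism(total_mb: float,
                                    target_mb: float,
                                    parallel_slots: int) -> int:
    """Round the naive partition count up to the next multiple of
    parallel_slots, so no wave runs with idle capacity."""
    naive = math.ceil(total_mb / target_mb)
    waves = math.ceil(naive / parallel_slots)
    return waves * parallel_slots

# Reprocessing 100 GB (102,400 MB) in ~500 MB partitions, 8 at a time:
# the naive count of 205 is rounded up to 208, i.e. 26 full waves of 8.
print(partition_count_for_parallelism(102_400, 500, 8))
```

A count that divides evenly by the parallelism degree avoids a final wave that leaves most CPUs or the I/O subsystem idle, which is the whole point of sizing partitions around throughput rather than around a fixed per-partition target.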