Build a Cube to Perform Scenario-Based Risk Analysis

When a client, who I’ll refer to as Big Green Cola, asked my team to help them “run scenarios” using a cube, we were skeptical as to whether a cube was the right technology to use. We wanted to make sure that we would be able to meet their requirements, and we feared that the client was hoping to run extremely complex simulations that wouldn’t be possible using cube technology. After several meetings, during which we discussed the client’s needs in more detail, we agreed to attempt to design and develop a cube to run the scenarios.

Big Green Cola manufactures and distributes soft drinks. The company contracts with independent distributors, who in turn distribute to retail outlets. Given the recent upheaval in the economy, the corporate officers at Big Green Cola became concerned that some of their independent distributors might not be able to continue distribution.

Big Green Cola wanted to do scenario-based risk analyses that would simulate the potential failure of some of their distributors to see what effect it might have on their operations. Big Green Cola wanted to be able to plan for the possible reassignments of retail outlets to competing independent distributors on short notice in the event that a distributor declared bankruptcy or otherwise went out of business. The cube would let them see which geographic areas would be affected by the failure of a particular distributor and which other Big Green Cola distributors were already operating in that same geographic area or nearby.

A major user requirement was that the Scenarios Cube show the original structure of the route and its distributor, even after a distributor and its route were cast as “unhealthy” in a given scenario. In other words, the change in status would be reflected only in the measures and calculated members, not in the hierarchies or pivot table layouts. This would make actual versus scenario distribution patterns more obvious, and thus instructive, for the users. Let’s take a look at how we designed and built a cube that let our client use a cube to run simple simulations of changing conditions. (You can download the sample cube by clicking Downloads at the top of the article page.)

Distribution Scenarios Cube Star Schema

We had already provided Big Green Cola with a general usage cube that included dimensions for Date, Time, Product, Distributor, Route, Retail Location, and Geography. The Distribution Scenarios cube reused four of those dimensions: Distributor, Retail Location, Route, and Geography, as shown in Figure 1.

TheoriginalDistributor dimension contained a distributor ID, distributor code, and distributor name. The RetailLocation dimension included only RetailLocationID and RetailLocationName, although there are plans to add groups and categories to this dimension in the future.

The Route dimension included information from each distributor regarding how that distributor groups and runs its delivery logistics. Big Green Cola was considering using additional data provided by the distributors for even better short-term contingency planning using geocoded data, but the company decided to table that enhancement for a future release. In lieu of the geocoded data, the Routes dimension simply provided a recognizable grouping of the locations that Distributors actually use in their planning and delivery and that the Big Green Cola users found helpful when developing contingency plans.

The original Geography dimension included the States, Regions, and Divisions of the United States used by the Census Bureau. For more granularity, we added zip code data provided by the US Postal Service. The originalGeography dimension, therefore, contained the Region, Division, State, City, and Zip Code levels. For the purposes of this article, however, we reduced the dimension so that it shows only StateName, City, and ZipCode.

Scenario Facts

The single measure in our fact table was based on the Cases field and represented the average case volume. The average case volume was derived by rolling up the total number of cases of all varieties of soft drink products delivered to a single retail outlet during a seven-day calendar week. These weekly numbers were averaged for data going back three months. The measure was simply called Cases in the fact table and was renamed Total Cases in the cube.

Once the fact table was loaded we created a view that triples the amount of data stored within it. Our reason for tripling the data was that it was quite simply the only way we could come up with to run the scenarios. In our particular data set, triplication didn’t present a performance problem. In the event that you’re working with a very large set of facts, and you still want to use triplication, I recommend looking for opportunities to roll upthe data to smaller sets prior to triplication.

To understand the purpose of data triplication, begin by considering the fact that dimensions, in common usage, act either as bucketing mechanisms (for slicing) or as filters (for query results or data outcomes). When dimensions are placed on the column or row axes of pivot tables, they act as slicers. When members of a dimension are selected or deselected, they’re acting as filters. Our interest was in letting users select members as a way of selecting a scenario, which, in a sense, filters two out of every three versions of a record, but leaves the third behind and still viewable.

For example, imagine a Colors dimension that includesRed, Orange, and Yellow members. If you use that dimension in the page filter of a pivot table and you select only Red and Orange, you’re effectively removing Yellow from view. But what if instead of removing Yellow, we wanted to change Yellow to Black? Or, more specifically, we wanted to mark the Yellow records as Removed but leave them visible and show their compromised state of Removed. This isn’t typically possible in a cube, which is why we tripled the data as our first step.

To continue with our Colors example and demonstrate the purpose of triplication, begin by creating three records to represent each original record—an actual record (Actual Record), a scenario state record (Scenario Record), and a record to mask the scenario state record until the scenario is run (Scenario Reversing Record), as shown in Table 1.

Roll up the Yellow records by using SQL to group on the Item, Quantity, and Record Count attributes, as Table 2 shows. If you then put a filter onto the data set shown in Table 2 and excluded all Yellow rows, you’d have fact data that would appear not to change, except in an indicator field, as shown in Table 3 as Record Count.

Once an indicator field has been developed that will change according to the user’s selection, that indicator field can be used to drive the calculation of other fields. Note that in the Color dimension example, if the user were to select Black as the compromised Color value, we would have a strange and messy result. This would be equivalent to choosing Big Green Cola as the compromised distributor in our cube. We don’t want to simulate the failure of Big Green Cola; we want to simulate the effect of compromised distributors on Big Green Cola. Later in this article, I’ll describe how we got around this issue in a way we felt wasn’t ideal but certainly sufficient and acceptable to our client.

Rather than tripling the rows in our fact table, we created a view that would use each row in the fact table three times. The view can be created in the SQL Server data engine or it can be implemented as a Named Query in the cube’s Data Source View. We thought it would be best implemented as a SQL Server Analysis Services (SSAS) Named Query; however, we would have lost our comments if we had implemented it there, so in the sample code we implemented the view in the data engine, as shown in Web Listing 1.

Web Listing 1: Code to Create Three Versions of Each Record from a Single Original Record

CREATE VIEW \\[dbo\\].\\[vw_fact_CasesByLocation\\] AS SELECT 'A' as RecordType -- Actual Record ,RetailLocationID ,DistributorID --Actual Distributor ,GeographyID ,RouteID ,Cases ,0 as ScenarioCases FROM dbo.fact_CasesByLocation UNION SELECT 'S' as RecordType /* Scenario Coverage record Big Green Cola Covers the Delivery if a scenario is in effect*/ ,RetailLocationID ,0 as DistributorID /*Distributor 0 indicates that in the current scenario the record is compromised, and in a sense doesn't have a distributor */ ,GeographyID ,RouteID ,0 as Cases ,Cases as ScenarioCases -- under scen conditions all the cases for this record will be scen cases FROM dbo.fact_CasesByLocation UNION SELECT 'R' as RecordType -- Reversing Record. When no scenario is in effect, this cancels out parts of A and parts of S ,RetailLocationID ,DistributorID -- Also the Actual Distributor ,GeographyID ,RouteID ,0 as Cases ,(Cases * -1) as ScenarioCases -- to mask Scen Cases of Scen record when no Scen is in effect FROM dbo.fact_CasesByLocation GO

Understanding the Triplicated Records Using SQL

Before moving on to acube structure, it’s best to first perform some SQL queries on our actual relational data that show the effect that filtering would have on the data because this is what the cube will do to create the scenario. If you query the Cases by Location data and group it by the original records using RetailLocationID as the marker for an individual entry and sum the Cases and Scenario Cases fields, you’ll see that your data will appear with a Scenario Cases total of zero in all the rows, as shown in Web Figure 1. This indicates that none of the entries have an unhealthy distributor.

Next, remove a particular distributor from the original data set (or simulateitsunhealthy state) using a Where clause, as shown in Web Figure 2. The resulting data set will show that some rows now have a value of zero for their number of Cases, and the value formerly in the Cases column now shows up in the Scenario Cases column. In Web Figure 2, the data indicates that RetailLocationsIDs 1 through 5 are all serviced by Distributor 2because the only distributor compromised in the WHERE clause is Distributor 2 and the Cases for locations 1–5 now have Scenario Cases instead of Cases. You can then use the Scenario Cases value as the indicator field. Any row with \\[ScenarioCases > 0\\] indicates a retail location that has been affected by the current scenario selected by the user. This example simulates how the cube will remove distributors yet retain the original records to display information about their new states.

Building the Scenarios Cube

As previously stated, we built the Scenarios cube using dimensions we already had in place for more traditional cubes. This made the development time very short. Our original Distributors dimension was modified only slightly to create what we called the Scenario Healthy Distributors dimension. We created it using a Named Query in the Data Source View, as shown in Figure 2. We used the data from the standard Distributor dimension table and unioned on a row for the non-distributor that would represent any route that Big Green Cola would need to find coverage for during a particular scenario. See Listing 1 for the code we use in the Named Query of our Data Source View, and see Figure 2 for a snapshot of that same code in the context of the Named Query.

Our original plan was to have the user select unhealthy distributors one-by-one from the dimension, but after deciding we would give users access to the cube via an Excel workbook, we realized we wouldn’t be able to use a NOT EQUALS operator because of limitations in Excel 2007. We included a note to the users telling them to select all and then deselect the carriers they want to simulate as “unhealthy.”

For this reason we called the dimension Scenario Healthy Distributors, as all items that were selected would be represented as healthy, and all that were not selected would be represented as unhealthy. We let the users know that this dimension and the wording might initially cause some confusion, but that after a bit of practice we expected users to feel comfortable using the dimension.

In my earlier example using the colors yellow and black, we would have used the Named Query in the Data Source View to rename Black to DO NOT SELECT.We recognized that users might also be confused if we allowed them to use the Healthy Distributors dimension as a slicer instead of as a filter. We again approached our users about this potential problem. They assured us that they were willing to overcome the issues with whatever level of training was required to help them understand the proper use of the dimension. To help reduce confusion, we named the dimension FILTER ONLY - NOT FOR USE AS SLICER, as shown in Web Figure 3. All the other dimensions were built using standard methods.

Adding Measures

We created three measures by pulling in the Cases, Scenario Cases, and RetailLocationID fields from our fact table. Then we renamed these measures to Non-Affected Cases, Scenario Affected Cases, and Retail Location Count, respectively, as shown in Table 4.

Note that when you change the aggregation on the Retail Location Count to Distinct Count, SSAS will warn you that it’s better to “break distinct count measures into separate measure groups.” Indeed, it would be a better-performing design to use the RetailerLocationID from the dim_RetailLocation table. However, for the purposes of this article we will accept this less-than-ideal design for now to focus on the main topic of this article. Also, note that in the more ideal design, the Count function can be used instead of Distinct Count, which will be as accurate, but better performing, than Distinct Count as we have used it.

Designing the Cube’s Calculated Members

After adding the three measures we created several Calculated Members using Listing 2.

We start with a Text field that will clearly mark those retail locations (or routes or zip codes) that have an unhealthy distributor, as shown in Figure 3. The \\[Distributor Health\\] Member will display the word “Unhealthy” in red for any entry in which the original entry and reversing entry have been removed. When those records are filtered out, the Non-Affected Cases value will be zero and the Scenario Affected Cases valuewill be greater than zero. The Big Green Cola Distribution Risk cube actually includes many other calculated members that users can use to perform additional analyses on the effects of the impaired distributor. These other calculated members are either proprietary or are too complex to describe here. The Distributor Health measure is the lynchpin of the design, and once you’ve built an indicator field, such as our Distributor Health field, you’ll find that you’re able to use the value in this member to drive many other members of your own design.

Using the Cube to Run Scenarios

In our sample cube we included only a few records. We provided just enough sample rows to display a variety of routes, locations, and distributors. With this small record set we are able to illustrate how the scenario cube works.

We begin with the cube showing all distributors in a healthy state. Recent averages show that the Bryan, Texas, area stores get about 61 cases a week. Austin, Texas, storesget about 33 cases a week, as shown in Web Figure 4.

If we add the Healthy Distributors dimension to the Filter area of the cube, like Figure 4 shows, and choose the NOT EQUALS operator, we can then select the Sure Distribution Company for an Unhealthy status. In most dimensions, a choice like this would result in the removal of entries—the selected item with a NOT EQUALS operator would be filtered out by the user’s selection. In this cube, the user selection changes the status of those rows, instead of removing them. If we then add more of our calculated members and drill the Geography dimension down to Zip Code, as shown in Figure 5, we can see the effects of the unhealthy distributor on particular zip codes.

The percentage of product and the percentage of delivery locations could be rolled up to higher levels for each successive group of managers, culminating in a single number that would represent to a CEO the effect of a distributor’s failureor success. Conversely, with a few reversing adjustments, a tactical user could drill down to identify exact locations and alternative distributors within the same area, as shown in Figure 6. In the event that the company opted to begin contingency planning, the tactical user could then start contacting distributors based on the scenario they had just run.

For example, imagine the Big Green Cola COO hears that the Sure Distribution Company is expected to announce it has filed for bankruptcy in the next few days. She asks her director of logistics to identify a percentage effect on operations. Her director of logistics can begin at the Region level, giving percentages there and instructing his state or county directors to drill all the way down to the zip code and develop their contingency plans complete with cases affected and alternativee distributors.

Reports that might have taken many days to develop and compile before this cube was developed will now take under an hour. During our Big Green Cola project, we were able to identify a way to give the user an interface to select the distributors they wanted to “impair” and have those selections affect certain items represented in the cube without having the standard, expected effect of removing the items from view. After changing the items, but not removing them from view, we provided a way to easily access and display the impaired status of an item (as changed by the user selection) and to make use of that recognized status change in further calculations. The example used in this article is a simple one, but the implications for letting a user change fact row data on the fly are numerous.

From the Blogs

Duplicate records clutter databases and render the data within them unclear. This kind of problem is very common, and it’s the main reason that deduping software exists. But there’s another benefit to deduplication software: the ability to infer connections between individual records from various data sets....More

Companies looking to grow and extract value from their data are increasingly turning to Chief Data Officers (CDOs) to execute their data strategy. The role is new, and a playbook is necessary to address the many challenges CDOs face....More

After spending 20 years building analytics, BI and database solutions, I've focused on Cloud data solutions over the past 2 years. I've chosen 5 common challenges that I face every day with Cloud migrations and that you'll face in your Cloud BI projects....More