Pre-Aggregated Data and Changing Dimensions

Normally when working with data we assume that we have it on the lowest possible grain and we are challenged by the need to aggregate it along various hierarchies. By the nature of BI, we get to work with large sets of detailed data collected by an existing system.

But what happens when we are given the aggregations and we need to work with these instead? Recently Nick Barclay and I were involved in such an implementation. Nick designed a Health and Safety Dashboard and after that I built it with Reporting Services, PerformancePoint, Analysis Services and SQL Server 2005 relational databases. We were told in the beginning of the project that the aggregation business rules are so complicated that rebuilding them for the purposes of the dashboard was way out of scope. I had not had experience with pre-aggregated data and I did not foresee a major problem, which became apparent after a few months of development when the Business Unit hierarchy changed significantly.

Aggregation Rules

In contrast with typical business scenarios when an SSAS function like Sum and some custom roll-ups works perfectly well, the Health and Safety data needed to be aggregated in a complex way and the organisation we built the dashboard for had already invested in a system managing these aggregations. In example, a simple rule would be – if we have more than 1 major incidents in a Business Unit, it gets an Amber score for Safety Compliance. If it has more than 4, it becomes Red. In turn, its parent is the same – if the sum of all major incidents for its descendants is greater than 1, it becomes Amber and with more than 4 – Red. There were also quite a few dependencies between various dimensions and reluctantly we agreed to work with the data the way it was. The following diagram shows the way an aggregate for Major Incidents works:

Major Incidents Aggregation 

Japan doing badly on Major Incidents also makes Asia and Global look bad.

The problem

The actual problem comes from the fact that our dimension hierarchy can change and if it does the aggregations do not make sense historically. We could have a change in our hierarchy:

path2709

and we may need to generate the following simple report for before and after the change:

majors_incidents_sample

The historical data suggests that something was wrong with China as the only child of Asia before the change in hierarchy, while in fact, the culprit then (and now) was Japan. As we would not be able to see how the hierarchy looked before, we would not be able to analyse the data we have accordingly. Also, we cannot compare data for Asia after the change to data for Asia from before the change along the new hierarchy.

Possible solutions

In my project, the client actually agreed that it is alright to have this handicap in the system and a possible solution was a complete historical rebuild of the aggregations followed by a complete reload of the system data after every hierarchy change. Lucky for us, the data set was not too big and was not expected to grow too much.

Another solution would be to use a Slowly Changing Dimension, so we can show the hierarchy as it was when the aggregations were made – this would improve the analytical value of the data because business users would be able to see why a Business Unit like Asia was Red in the context of an outdated hierarchy.

The best solution would be to build our own aggregations and work with those, as then we are gaining the ultimate flexibility to represent data in the exact way the business users need it, but unfortunately sometimes it is very hard to convince them that spending twice the funds on the same problem is a good thing.

Advertisements