Combining Slowly Changing Dimensions and Current Dimension Versions

When we need to see historical changes of a dimension in our OLAP cube, the common practice is to implement it as an SCD – a Slowly Changing Dimension. There are a few ways to do this, and a really good definition of the different types of SCDs can be found in Wikipedia: Slowly Changing Dimension. There are also quite a few articles on implementing SCD ETLs in SSIS, two of which are:
  • SCD Wizard Demo – SSIS Junkie blog example of a package using the Slowly Changing Dimension transformation in SSIS
  • MSDN Article on the Slowly Changing Dimension transformation in SSIS

Since SQL Server Integration Services 2005 and 2008 include an SCD transformation, it is not too hard to implement such dimensions.

Here I am discussing a typical requirement – to be able to have both an SCD and a Current version of the dimension.

First, it is important to note that an SCD should have two dimension keys: a unique surrogate key identifying every version of a dimension member, and a non-unique code, which is common to all versions of a dimension member. This is also very important if we want to be able to determine the current version of a dimension member. An example of a very simple dimension table utilising this design is:
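(The SKey values and dates below are just illustrative.)

SKey  Code  Description   FromDate    ToDate
1     1     Member1Ver1   2009-01-01  2009-06-30
2     1     Member1Ver2   2009-07-01  9999-12-31
3     2     Member2Ver1   2009-01-01  2009-03-31
4     2     Member2Ver2   2009-04-01  2009-08-31
5     2     Member2Ver3   2009-09-01  9999-12-31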

Here we have two distinct dimension members with Codes of 1 and 2. Member1 has two versions and Member2 has three. The SKeys (surrogate keys) for these versions are unique, but the Code stays the same for each member. Also, notice the From and To dates, which allow us to distinguish the periods for the member versions. We can have an IsActive or IsCurrent bit column, which shows us the latest version of a member, but we can also just filter on a ToDate of 9999-12-31, which gives us the same result.

Assuming the described design, I will move on to discuss the ways to build the dimension in SSAS.

First, the standard way to link the dimension table to our fact table is through the surrogate key. We can have a regular relationship between the two tables. As the fact data is usually also linked to a Time dimension, fact records falling in the period between the From and To dates of an SCD version will be linked to that version's SKey. An example of a fact table with a few rows, which can be linked to the dimension table above, is:
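(Continuing the illustrative values from the dimension table above.)

FactSKey  DimSKey  TimeKey     Amount
1         1        2009-05-15  100
2         2        2009-08-10  150
3         3        2009-02-20  200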

The row with FactSKey 1 will be linked against Member1Ver1, while the row with FactSKey 2 will go against Member1Ver2. Therefore, when we slice our cube by Time and our dimension, we will see:
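(With the illustrative data above, slicing Time by month:)

Time     Member       Amount
2009-02  Member2Ver1  200
2009-05  Member1Ver1  100
2009-08  Member1Ver2  150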

This is the standard way to implement our SCD and these are the results we would expect. Now, we get a new requirement. We want to be able to see both this and an aggregation against the current version of our dimension. We have a few ways to implement it. One obvious way is to create another dimension containing only the current dimension members. This can be easily achieved if we add a Named Query in our DSV, which shows only the current dimension members:

SELECT SKey
, Code
, Description
FROM DimTable
WHERE ToDate = '9999-12-31'

The result will be:
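(With the illustrative data above:)

SKey  Code  Description
2     1     Member1Ver2
5     2     Member2Ver3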

Then we need to replace our fact table with a Named Query, which shows the DimSKeys for current version dimension members:

SELECT ft.FactSKey
, dt_current.SKey AS DimSKey
, ft.TimeKey
, ft.Amount
FROM FactTable ft
INNER JOIN DimTable dt
ON ft.DimSKey = dt.SKey
INNER JOIN DimTable dt_current
ON dt.Code = dt_current.Code
WHERE dt_current.ToDate = '9999-12-31'

This will give us the following result:
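(With the illustrative data above, every fact row now points to the SKey of the current version:)

FactSKey  DimSKey  TimeKey     Amount
1         2        2009-05-15  100
2         2        2009-08-10  150
3         5        2009-02-20  200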

When we slice our cube, all records for Member1 will be against the latest version:
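(Continuing the illustrative data:)

Time     Member       Amount
2009-02  Member2Ver3  200
2009-05  Member1Ver2  100
2009-08  Member1Ver2  150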

Implementing this, we can have two dimensions in our cube, so our users can use the one that makes more sense for their needs:

  • Dimension and
  • Dimension (Historical), where the Historical designation denotes, in technical terms, the SCD

However, we can also implement this in a different way, which allows us to avoid building such logic in a view or in our DSV. The trade-off is some space on our disks and one more column in our fact table. Instead of deriving the current-version key through SQL, we can simply add the dimension Code to the fact table. Then, we build our dimension again from the latest versions, but instead of using the SKey as the dimension key, we use the Code. The Code is of course unique across all dimension members, as long as we filter out the non-current versions. The query for doing this is exactly the same as the one we used before; however, we need to change our fact table design and add a DimCode column:
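(The same illustrative fact rows, with the added DimCode column:)

FactSKey  DimSKey  DimCode  TimeKey     Amount
1         1        1        2009-05-15  100
2         2        1        2009-08-10  150
3         3        2        2009-02-20  200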

Then, we create two dimensions again, but we link the Historical dimension with the DimSKey column and the Current one with the DimCode column. The result of slicing the cube by the current version is exactly the same as before. The trade-off is space vs. processing time and CPU usage. It is up to the developer to choose the more appropriate way to build the solution.

So far I have discussed two ways of having our SCD and Current Version dimension as separate dimensions in our cubes. There is, however, a way to combine both in the same dimension. To do this, we need two levels in the dimension: a parent level, which contains the current versions of the dimension members, and a child level, which contains the historical versions. For example:

Member1Ver2
  Member1Ver1
  Member1Ver2
Member2Ver3
  Member2Ver1
  Member2Ver2
  Member2Ver3

This way the historical versions aggregate up to the current version and we can use either level, depending on what we want to achieve. To build this, we can use our current dimension table and add a parent level through SQL. This way, we do not need to update all records when a new version comes:

SELECT dt.SKey
, dt.Code
, dt.Description
, dt_p.SKey AS ParentSKey
FROM DimTable dt
INNER JOIN DimTable dt_p
ON dt.Code = dt_p.Code
WHERE dt_p.ToDate = '9999-12-31'

The result is:
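(With the illustrative data above, every version points to the SKey of the current version of its member:)

SKey  Code  Description   ParentSKey
1     1     Member1Ver1   2
2     1     Member1Ver2   2
3     2     Member2Ver1   5
4     2     Member2Ver2   5
5     2     Member2Ver3   5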

Then, we can build our Parent-Child dimension, using the Parent level if we want the current versions and the Child level for the historical ones.

This approach allows us to combine the two dimensions into one. It is also possible to implement it in a non-parent-child fashion, because the hierarchy is not ragged.

It is always advisable to make sure we actually need an SCD, and to avoid one whenever possible, because it is not always intuitive for users. Splitting fact data across multiple rows can be surprising, and understanding how the historical dimension works, with the multiple member versions it consists of, can be a problem. However, an SCD lets us satisfy a common requirement, and it is therefore quite important to know how to build one.

Improving performance and maintainability by keeping logic out of our reports

As Reporting Services allows us to implement some quite complex logic, report authors are often tempted to build a lot of it in their reports. Having some simple calculations there does not generally cause harm, as simple + and – operations specific to a report should probably be built in it. However, I have seen aggregations, complex queries and code hacks built in Reporting Services which would normally be performed fairly easily in Analysis Services or SQL Server.

There are a few distinct benefits we can get from keeping logic out of our reports.

  • Firstly, having our MDX scripts and SQL queries (stored procedures) in the back-end allows us to use them in different reports without having to painfully synchronise each copy with every change. It also allows business users who do not know MDX to use complex measures, increasing the value of the BI solution for them.
  • Secondly, our database server is usually maintained by people who have skills in SQL or MDX and can analyse and optimise the scripts there, instead of relying on report authors to do so. Even if we have BI professionals maintaining the whole solution, it is far more convenient to have most of the logic stored on the same server. Whether it is a performance bottleneck or a bug, it is far easier to fix by tracing cube calculations or stored procedures than logic stored in a report.
  • Sending large amounts of data between the back-end server and SSRS increases the stress on the network, which can be critical when multiple requests returning millions of rows are passed to it.
  • Caching is important when different reports use the same stored procedures or MDX calculations.
  • And a last point: a database/OLAP server is just faster in calculations than SSRS.

In general, using stored procedures as SQL Server data sources instead of building queries in SSRS, or keeping MDX calculations in the OLAP cubes, is the first step towards avoiding excessive code accumulation in SSRS. Then, avoiding anything more complex than simple sums and groupings in the reports, and moving such logic to stored procedures or MDX calculations, is another desirable way to structure our solutions. Finally, even simple things like an Actual – Budget variance measure should be built in MDX or SQL rather than in the reports. Following these simple rules and keeping ourselves disciplined when building reporting solutions improves their performance, maintainability and even usability.
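As a minimal sketch of this last point, an Actual – Budget variance can live in a stored procedure that the report simply executes (the table and column names here are hypothetical):

CREATE PROCEDURE dbo.GetActualBudgetVariance
AS
BEGIN
    -- Aggregate and calculate the variance on the database server,
    -- so the report only has to render the result set
    SELECT Department
    , SUM(ActualAmount) AS Actual
    , SUM(BudgetAmount) AS Budget
    , SUM(ActualAmount) - SUM(BudgetAmount) AS Variance
    FROM dbo.FactFinance
    GROUP BY Department;
END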