PowerPivot Data Modelling for Performance

PowerPivot allows us more flexibility than SSAS UDM when it comes to data modelling. We are no longer constrained to dimensions and facts. As Thomas Ivarsson showed us in his PowerPivot Schema Flexibility post, we can avoid certain data manipulations when building a model in PowerPivot. The same holds true for relational database scenarios, as well. We can write SQL queries and analyse data without complying with any particular model. However, we have the concepts of a data mart and a star schema, which are not based on SSAS UDM but rather the other way around. In this post I will compare a normalised and a denormalised model in PowerPivot and reason around the relevance of data modelling and denormalisation with Microsoft in-memory technologies.

I prepared two SQL Server databases: PTest_Normalised and PTest_Denormalised. First I populated the normalised model with random data and then I moved that into the denormalised one. The two data sets are identical with the small exception of the date key, which in the normalised one is datetime, while in the denormalised model I converted it to a meaningful integer.

The specifics of each database are as follows:

PTest_Normalised

PTest_Denormalised

The largest table in the models is the table containing Order Details. It contains exactly 50 million rows. The next largest is Order Header (or Order in the second model), which contains 20 million rows. The rest are significantly smaller with the next largest being the Customer table with 1 million rows.

After setting the databases, I imported them into PowerPivot. Since I have implemented foreign keys and consistent names of my columns, PowerPivot detected and created all relationships for me as they should be (see the diagrams above). Some applicable measurements of the two models in PowerPivot:

PTest_Normalised

5.2 Gb RAM
3.3 Gb File

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

What this tells us is that there is not much of a difference between the two when stored in memory and on disk. The denormalised model is slightly larger, but the difference is not even close to what we have in the relational database. This is no doubt due to the fact that with columnar compression in memory the dimension key columns in the fact table are so well compressed that the advantages of normalisation are mostly imperceptible. Interestingly, if we want to save disk space and memory, normalising the model will do little to help us. Existing data marts not fitting in memory will not get much better if normalised, and based on this case study I would advise against normalising them as a solution to the problem (while buying more RAM would definitely help).

Once the two models are in PowerPivot we can start the fun and build some pivot tables.

First I added my only measure SalesAmount to the Values area and placed YearName in Slicers Vertical, MonthName and DateName on rows:

When I clicked on a particular year in the slicer the execution times were:

PTest_Normalised

Cold (first time click): 14s

Warm: (subsequent clicks on the same item): 3s

PTest_Denormalised

Cold: 2s

Warm: <1s

There is a large intermediary table (OrderHeader) in the normalised model, which would have influenced these results. To avoid this problem, I tried slicing the Sales Amount measure by Products. I modified the pivot table for my next test by substituting date for product:

The results:

PTest_Normalised

Cold: 4s

Warm: 3s

PTest_Denormalised

Cold: 4s

Warm: 3s

Clearly, avoiding the large intermediary table helped with performance. Here, however we see that if we have small chained relationships performance is not all that bad. Therefore, snowflake schemas seem to be quite alright with PowerPivot. In my observations most of the 3/4s responses went on retrieving the long lists of Subcategories and Product Names. If I eliminate this:

PTest_Normalised

Cold: <1s

Warm: <1s

PTest_Denormalised

Cold: <1s

Warm: <1s

Chains seemingly do not matter much, as long as the tables in them do not contain too many rows. How many is too many – well, it depends on your case, but I would be cautious with more than 1 million rows, unless your hardware allows it. Of course, if you examine my normalised model because a lot of relationships go through the Order Header table, slicing by many of the “dimension” attributes results in very sluggish query responses. Mentioning hardware, all my testing was done on a laptop with 8Gb RAM, i5 460M 2.53Ghz CPU and a 7200 RPM hard disk.

In conclusion, I think that the results from my little test case show that data marts are not irrelevant even when considering the speeds of in-memory technologies. Reading Kimball and understanding the concepts behind data marts will be important as data is data, and an optimal data model does play a significant (if not the most significant) part in a PowerPivot model just like in a relational database. While the memory required for importing the data in PowerPivot is almost the same, performance can be greatly reduced when modelling is not taken seriously. The flexibility which PowerPivot offers us is great, but in some scenarios it could potentially be a problem – especially when the users decide to serve themselves with too much data.

Advertisements

Microsoft BI TechCenters: Learning and Resources

Apart from subscribing to the blogs and reading the books in the Sites and Blogs and Books sections of this blog you can also visit the following TechCenters on TechNet and MSDN:

SSAS – Multidimensional Data

http://msdn.microsoft.com/en-us/sqlserver/cc510300.aspx

SSAS – Data Mining

http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx

PowerPivot for Excel

http://technet.microsoft.com/en-us/bi/ff604673.aspx

You can find a good collection of whitepapers, articles and links to various resources.

Avoiding NULLs in SSAS Measures

In a recent discussion I made a statement that many data marts allow and contain NULLs in their fact table measure columns. From the poll responses I am getting here, I can see that more than half of everyone voting does have NULLs in their measures. I thought the ratio would be smaller because in many models we can avoid NULLs and it could be a best modelling practise to attempt to do so. There are a few techniques which lead to a reduction of the need for NULL-able measures and possibly even to their complete elimination. While converting NULLs to zeros has a performance benefit because of certain optimisations, it also loses it in many cases because calculations operate over a larger set of values; in addition the performance hit of retrieving and rendering zeros in reports and PivotTables may be unnecessary. Therefore, it is advisable to test the performance and usability benefits of Preserving or converting NULLs to BlankOrZero. Thomas Ivarsson has written an interesting article about the NullProcessing measure property on his blog.

I am offering two approaches, which should be always considered when dealing with measure columns containing NULLs.

Adding Extra Dimensions

Let’s assume that we have a fact table with a structure similar to this:

Date        Product    Actual Amount    Forecast Amount
201101      Bikes      100              NULL
201101      Bikes      NULL             105

By adding a Version dimension we can achieve a structure like this:

Date        Product    Version        Amount
201101      Bikes      Actual         100
201101      Bikes      Forecast       105

This way we eliminate the NULL measure values and we get maintainability and arguably a usability boost as well.

Splitting Fact Tables

Another way to improve the data mart model is to ensure that we have more fact tables containing a smaller number of measures. Let’s illustrate this concept with a simple example:

Date        Product    Amount    Exception Amount
201101      Bikes      100       NULL
201102      Bikes      120       10
201103      Bikes      110       NULL
201104      Bikes      130       NULL

In this case, Exception Amount could be an extra cost which is relevant in rare cases. Thus, it may be better moved to a separate table which contains less data and may lead to some space savings on large data volumes. Additionally, it would also allow for the same analytics in SSAS if we implement it as a separate Measure Group. In example, we could rebuild out model like this:

Date        Product    Amount
201101      Bikes      100
201102      Bikes      120
201103      Bikes      110
201104      Bikes      130

Date        Product    Exception Amount
201102      Bikes      10

Even though we have two tables and one extra row in this case, depending on how rare the Exception Amount is, the savings of an extra column may be worth it.

Sometimes it is not possible, or correct to apply these modelling transformations. In example, we could have a valid design:

Date        Product    Sales Amount    Returns Count
201101      Bikes      100             10
201102      Bikes      120             30
201103      Bikes      110             NULL
201104      Bikes      130             20

Here adding Measure Type dimension would help us with eliminating the NULL, but would also mean that we would have to store two different data types – dollar amounts and counts in the same measure, which we should definitely avoid. Also, since we may have a very few months with NULLs in the Returns Count column and the ratio of non-NULLs to NULLs may be low, we would actually lose the benefits of the second approach.

From Analysis Services point of view, it is better to use NULL, again depending on the non-NULL-to-NULL ratio we have. The more NULLs we have, the better it is to not convert the NULLs to zeros as calculations which utilise NONEMPTY will have to work over a larger set of tuples. Also, here we may notice the difference between using NONEMPTY and EXISTS – the second one will not filter out tuples which have associated rows in the fact table even when they are NULL. It also depends on the usage scenarios – if a user filters out non-empty cells in Excel, the Returns Count for 201103 will appear as 0 if we store zeros (because of the way NON EMPTY, which Excel uses, works), or if we have set the measure to convert NULLs to zeros. In the opposite case (when we have NULLs and the NullProcessing property is set to Preserve) Excel will filter out the empty tuples.

Regardless of the scenario, it is always a good idea to try to remove NULLs by creating a good dimensional model. If we are in a situation where NULLs are better left in the measure column of the relational fact table, we should consider whether we need the default zeros in our measures, or if we would be better off avoiding them altogether.

I would be interested to see if and why you do allow NULLs in your measure groups if the reason is not in this article. Also, it would be interesting to see if there are other good ways to redesign a model in order to avoid storing NULL measure values.

New Blog

I just saw that John Simon (my boss at Avanade Melbourne) has started his own blog at:

http://jsimonbi.wordpress.com

The first four posts offer a description of various approaches to hierarchies. I am looking forward to more good content, which based on John’s career and knowledge in BI is not far away.

SSAS SCDs: Showing Active Dimension Members With No Data

Type 2 dimensions play well with SSAS because when we use them SSAS determines which members are relevant to which periods based on the data in the measure group we have. This is always the case – if we slice by a particular date, then all members from a SCD which do not link up to the data in the slice will yield empty values, which we can easily hide in most clients (or by using an MDX function like NON EMPTY). This works well in most cases. However, there is a small problem in a small subset of all usage scenarios – that is when we have no data against a member relevant for the same period. Because it still yields nothing when we go through the data in the cube, it gets hidden, too.

Let’s explore a simple case. If we have a SCD called Employee Type, which has two values: Internal and External for 2010, but three members Internal, Non-Internal and Contractor for 2008 and 2009; and then we have a measure called Employee Count, if we slice by Employee Type we get something like:

Internal          150
External           20
Non-Internal       15
Contractor         10

When we slice by a year we may get something like:

                 2008    2009    2010
Internal           50      50      50
External                           20
Non-Internal               15
Contractor          5       5

Note that if we slice by 2008 and we hide empty cells we would get:

                 2008
Internal           50
Contractor          5

However, what we may want to see on our report may be:

                 2008
Internal           50
Non-Internal
Contractor          5

Un-Hiding empties would actually show:

                 2008
Internal           50
External
Non-Internal
Contractor          5

Well, in SQL we would just use the EffectiveFrom and EffectiveTo dates in the dimension table to determine the correct results. In SSAS we are not that lucky. I would like to show a possible solution.

Firstly, for an SCD we would typically have the dates I just mentioned – EffectiveFrom and EffectiveTo for each row. I would typically exclude them from the dimension in SSAS, but in our specific case we need them. Therefore, we can add them and just hide them instead of excluding them completely. Once we have them set up in this way we could write a bit of MDX, using LinkMember:

WITH
MEMBER [Measures].[ActiveType]
AS
  IIF({{LinkMember([Type].[EffectiveFrom].CurrentMember,
                   [Date].[Date]):
        LinkMember([Type].[EffectiveTo].CurrentMember,
                   [Date].[Date])}*
       [Date].[Year].CurrentMember}.Count = 0,
      NULL,
      1)
SELECT
{
  [Date].[Year].[Year].&[2008]*
  [Measures].[Employee Count]} ON 0,
{
  NONEMPTY([Type].[Type].[Type],
           [Measures].[ActiveType])
} ON 1
FROM [MyCube]

Now we get exactly what we want (Non-Internal shown in 2008 but with no data):

                 2008
Internal           50
Non-Internal
Contractor          5

Please note that for large dimensions this is a very bad approach from performance point of view and should be avoided. Surprisingly enough, users rarely consider performance in their top 10 priorities, while functionality somehow always makes it there, so someone might find this technique useful in extreme user cases.

It would be also interesting if there is another approach to this scenario, which I may be unaware of – I am sure developers have hit (or have been hit) by this problem in the past and there must be other solutions, as well.

Note: If you try to replicate this in Adventure Works as I did, you will find that the keys in the Date dimension(s) and the keys for the Start and End Date in the Type 2 SCDs are not the same. The Date dimension uses integer keys, while the Start and End Date attributes use datetime. Therefore, LinkMember will not be able to match these cross-dimensional attributes and the above approach will not work. As a lesson from this exercise – Effective From and Effective To columns should be of the same data type as your date attribute key in the Date dimension table.

Default Measure in SSAS Cubes

When writing MDX it is always a good idea to know what the context which the query executes in is. If we do not explicitly specify a hierarchy member in a tuple SSAS replaces it with the default one, which way more often than not is the All member for dimensions. Because the All member is the default default member, it is easy to think that the default is always the All. This is untrue – the default can be easily changed through BIDS. A sinister (not really, but the word is cool) consequence of this is very apparent when we consider the Measure dimension.

Measures are in a dimension of sorts. It does not have an All member, so we can say it is non-aggregatable; however it has a default member, which is used whenever we do not explicitly specify a [Measures].[<member>] to be used. The most confusing part is when we omit the Measures member in a function call, and a prime example is the NonEmpty() function call which goes like this:

NonEmpty(<set>,<set>) -> NonEmpty([Customer].[Customer].[Customer], [Date].[Calendar].[CY 2007])

What we get here is a slight problem. NonEmpty still uses a measure – that is the default measure. And if we are not careful, we end up getting incorrect, or even invalid results (e.g. if the default measure is not related to one of the dimensions we may get the set of all members from the first set). However, if we do:

NonEmpty([Customer].[Customer].[Customer], ([Date].[Calendar].[CY 2007], [Measures].[Internet Sales Amount]))

We will get exactly what we want – the function uses the correct, or at least a known measure.

Often we do not know what is the default Measure member. To find it out we can just write this little query:

WITH
MEMBER [Measures].[dMemberName] AS
[Measures].DefaultMember.Member_Name
SELECT
{
[Measures].[dMemberName]
} ON 0
FROM [Adventure Works]

The result is the default Measure member name.