PowerPivot Data Modelling for Performance

PowerPivot allows us more flexibility than SSAS UDM when it comes to data modelling. We are no longer constrained to dimensions and facts. As Thomas Ivarsson showed us in his PowerPivot Schema Flexibility post, we can avoid certain data manipulations when building a model in PowerPivot. The same holds true for relational database scenarios, as well. We can write SQL queries and analyse data without complying with any particular model. However, we have the concepts of a data mart and a star schema, which are not based on SSAS UDM but rather the other way around. In this post I will compare a normalised and a denormalised model in PowerPivot and reason around the relevance of data modelling and denormalisation with Microsoft in-memory technologies.

I prepared two SQL Server databases: PTest_Normalised and PTest_Denormalised. First I populated the normalised model with random data and then I moved that into the denormalised one. The two data sets are identical with the small exception of the date key, which in the normalised one is datetime, while in the denormalised model I converted it to a meaningful integer.

The specifics of each database are as follows:

PTest_Normalised

PTest_Denormalised

The largest table in the models is the table containing Order Details. It contains exactly 50 million rows. The next largest is Order Header (or Order in the second model), which contains 20 million rows. The rest are significantly smaller with the next largest being the Customer table with 1 million rows.

After setting the databases, I imported them into PowerPivot. Since I have implemented foreign keys and consistent names of my columns, PowerPivot detected and created all relationships for me as they should be (see the diagrams above). Some applicable measurements of the two models in PowerPivot:

PTest_Normalised

5.2 Gb RAM
3.3 Gb File

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

What this tells us is that there is not much of a difference between the two when stored in memory and on disk. The denormalised model is slightly larger, but the difference is not even close to what we have in the relational database. This is no doubt due to the fact that with columnar compression in memory the dimension key columns in the fact table are so well compressed that the advantages of normalisation are mostly imperceptible. Interestingly, if we want to save disk space and memory, normalising the model will do little to help us. Existing data marts not fitting in memory will not get much better if normalised, and based on this case study I would advise against normalising them as a solution to the problem (while buying more RAM would definitely help).

Once the two models are in PowerPivot we can start the fun and build some pivot tables.

First I added my only measure SalesAmount to the Values area and placed YearName in Slicers Vertical, MonthName and DateName on rows:

When I clicked on a particular year in the slicer the execution times were:

PTest_Normalised

Cold (first time click): 14s

Warm: (subsequent clicks on the same item): 3s

PTest_Denormalised

Cold: 2s

Warm: <1s

There is a large intermediary table (OrderHeader) in the normalised model, which would have influenced these results. To avoid this problem, I tried slicing the Sales Amount measure by Products. I modified the pivot table for my next test by substituting date for product:

The results:

PTest_Normalised

Cold: 4s

Warm: 3s

PTest_Denormalised

Cold: 4s

Warm: 3s

Clearly, avoiding the large intermediary table helped with performance. Here, however we see that if we have small chained relationships performance is not all that bad. Therefore, snowflake schemas seem to be quite alright with PowerPivot. In my observations most of the 3/4s responses went on retrieving the long lists of Subcategories and Product Names. If I eliminate this:

PTest_Normalised

Cold: <1s

Warm: <1s

PTest_Denormalised

Cold: <1s

Warm: <1s

Chains seemingly do not matter much, as long as the tables in them do not contain too many rows. How many is too many – well, it depends on your case, but I would be cautious with more than 1 million rows, unless your hardware allows it. Of course, if you examine my normalised model because a lot of relationships go through the Order Header table, slicing by many of the “dimension” attributes results in very sluggish query responses. Mentioning hardware, all my testing was done on a laptop with 8Gb RAM, i5 460M 2.53Ghz CPU and a 7200 RPM hard disk.

In conclusion, I think that the results from my little test case show that data marts are not irrelevant even when considering the speeds of in-memory technologies. Reading Kimball and understanding the concepts behind data marts will be important as data is data, and an optimal data model does play a significant (if not the most significant) part in a PowerPivot model just like in a relational database. While the memory required for importing the data in PowerPivot is almost the same, performance can be greatly reduced when modelling is not taken seriously. The flexibility which PowerPivot offers us is great, but in some scenarios it could potentially be a problem – especially when the users decide to serve themselves with too much data.

Advertisements

12 thoughts on “PowerPivot Data Modelling for Performance”

  1. Very nice benchmark. I did something similar one year ago with a beta version but I used a smaller set of data.
    It is interesting that denormalization is still ìmportant, but star schema is no longer the only reasonable data model behind a multidimensional analysis. To complete this analysis, we should test the case where OrderHeader contains important information that cannot be denormalized at OrdersDetail level. Is it better to denormalize OrderHeader dimensions down to OrderDetail fact table, or is it affordable to keep a link between two OrdersHeader and OrdersDetail fact tables?

    Marco

    Like

  2. Hi Claire,

    PowerPivot will handle much more than 1 million rows in a table. In my case I have 50 million in one table and 20 million in another with total row count of more than 71 million rows. PowerPivot is doing just fine – I get less than 3-4 seconds responses when browsing the data.

    I am describing a specific case in PowerPivot when altering the model yields better results. It is, however, quite specific.

    Like

  3. An outrageously geeky and interesting post Boyan 🙂

    But interesting to note the decreasing relevance of good database design in the traditional star schema sense.

    James

    Like

  4. Thanks James 🙂

    Yes, with PowerPivot we can analyse data without worrying too much about models. In fact, I am testing this same model in one table only and I will post about how it goes soon. PowerPivot is very tolerant with modelling, which hopefully will increase its adoption.

    Like

  5. Nice post! If you do a follow-up, it would be interesting to see the same sorts of results against some basic & complex calculated measures. I wonder how much things will slow down in a 3NF schema if you use functions like RELATED, RElATEDTABLE, FILTER, etc.?

    Like

  6. Hi Boyan, nice blog.

    I hate to be pedantic but as a data modelling geek I had to comment on a couple of points on your analysis:

    1. The difference in the models is not really between a normalised model and a denormalised model but more an business process based model and a business object based model. In the second model, only 3 of the tables are denormalised and these are only small tables so there is little wonder why there is not much difference in size between these particular models. For the record, a star schema is only a partially denormalised model as the fact table is generally normalised (although gender in your example is an exception since it has a relationship to another foreign key) and it is just the dimension tables that aren’t. When you start looking at large Type 2 dimension with many attributes you will see massive differences in sizing over using a full 3NF model.

    2. The main purpose of normalisation is not to reduce storage, but for data manageability – 1 piece of information stored once. Star schemas are great for performance but terrible for managing data. E.g. ever tried to add a new attribute to a large Type 2 dimension with back dated history – a nightmare because you have to process all the records to merge in the new information. With full 3NF this is easy because you just add a new table for the attribute so it is independent from the other history. This is why many people now are promoting architectures with both 3NF for data integration, management etc purposes, and star schemas (highly focused, type 1 dimensions) for performance and useability.

    Like

  7. Hi Ian,

    Thanks for your feedback. On your points:

    1. Absolutely! It is not a completely denormalised model. But starts are inherently less normalised than a 3NF model and that is the point of the comparison. Effectively, as you may have noticed I am comparing a normalised schema vs a less-normalised schema (which I am calling denormalised) because when using PPVT we will often need to decide whether we should go for the star at all or not…no argument here about what is normalised and denormalised…

    2. The use of 3NF + start is nothing new actually. You both reduce storage and enhance manageability with normalisation. The comparison in terms of data size is here because it is very important when it comes to in-memory BI. In fact there is a large difference in the size on disk but not so much in RAM. This is not because denormalised database schemas are the same as normalised ones in terms of data size (otherwise we would see no difference in database size as well) but because PowerPivot’s columnar compression kicks in for the fact table.

    Like

Comments are closed.