PowerPivot allows us more flexibility than SSAS UDM when it comes to data modelling. We are no longer constrained to dimensions and facts. As Thomas Ivarsson showed us in his PowerPivot Schema Flexibility post, we can avoid certain data manipulations when building a model in PowerPivot. The same holds true for relational database scenarios, as well. We can write SQL queries and analyse data without complying with any particular model. However, we have the concepts of a data mart and a star schema, which are not based on SSAS UDM but rather the other way around. In this post I will compare a normalised and a denormalised model in PowerPivot and reason around the relevance of data modelling and denormalisation with Microsoft in-memory technologies.
I prepared two SQL Server databases: PTest_Normalised and PTest_Denormalised. First I populated the normalised model with random data and then I moved that into the denormalised one. The two data sets are identical with the small exception of the date key, which in the normalised one is datetime, while in the denormalised model I converted it to a meaningful integer.
The specifics of each database are as follows:
PTest_Normalised
PTest_Denormalised
The largest table in the models is the table containing Order Details. It contains exactly 50 million rows. The next largest is Order Header (or Order in the second model), which contains 20 million rows. The rest are significantly smaller with the next largest being the Customer table with 1 million rows.
After setting the databases, I imported them into PowerPivot. Since I have implemented foreign keys and consistent names of my columns, PowerPivot detected and created all relationships for me as they should be (see the diagrams above). Some applicable measurements of the two models in PowerPivot:
PTest_Normalised
5.2 Gb RAM
3.3 Gb File
PTest_Denormalised
5.5 Gb RAM
3.5 Gb File
What this tells us is that there is not much of a difference between the two when stored in memory and on disk. The denormalised model is slightly larger, but the difference is not even close to what we have in the relational database. This is no doubt due to the fact that with columnar compression in memory the dimension key columns in the fact table are so well compressed that the advantages of normalisation are mostly imperceptible. Interestingly, if we want to save disk space and memory, normalising the model will do little to help us. Existing data marts not fitting in memory will not get much better if normalised, and based on this case study I would advise against normalising them as a solution to the problem (while buying more RAM would definitely help).
Once the two models are in PowerPivot we can start the fun and build some pivot tables.
First I added my only measure SalesAmount to the Values area and placed YearName in Slicers Vertical, MonthName and DateName on rows:
When I clicked on a particular year in the slicer the execution times were:
PTest_Normalised
Cold (first time click): 14s
Warm: (subsequent clicks on the same item): 3s
PTest_Denormalised
Cold: 2s
Warm: <1s
There is a large intermediary table (OrderHeader) in the normalised model, which would have influenced these results. To avoid this problem, I tried slicing the Sales Amount measure by Products. I modified the pivot table for my next test by substituting date for product:
The results:
PTest_Normalised
Cold: 4s
Warm: 3s
PTest_Denormalised
Cold: 4s
Warm: 3s
Clearly, avoiding the large intermediary table helped with performance. Here, however we see that if we have small chained relationships performance is not all that bad. Therefore, snowflake schemas seem to be quite alright with PowerPivot. In my observations most of the 3/4s responses went on retrieving the long lists of Subcategories and Product Names. If I eliminate this:
PTest_Normalised
Cold: <1s
Warm: <1s
PTest_Denormalised
Cold: <1s
Warm: <1s
Chains seemingly do not matter much, as long as the tables in them do not contain too many rows. How many is too many – well, it depends on your case, but I would be cautious with more than 1 million rows, unless your hardware allows it. Of course, if you examine my normalised model because a lot of relationships go through the Order Header table, slicing by many of the “dimension” attributes results in very sluggish query responses. Mentioning hardware, all my testing was done on a laptop with 8Gb RAM, i5 460M 2.53Ghz CPU and a 7200 RPM hard disk.
In conclusion, I think that the results from my little test case show that data marts are not irrelevant even when considering the speeds of in-memory technologies. Reading Kimball and understanding the concepts behind data marts will be important as data is data, and an optimal data model does play a significant (if not the most significant) part in a PowerPivot model just like in a relational database. While the memory required for importing the data in PowerPivot is almost the same, performance can be greatly reduced when modelling is not taken seriously. The flexibility which PowerPivot offers us is great, but in some scenarios it could potentially be a problem – especially when the users decide to serve themselves with too much data.