In my last post I examined a normalised vs a denormalised model in PowerPivot. In many cases, though, users will simply avoid this de/normalisation “stuff” and import a single table into PowerPivot. After all, PowerPivot targets Excel users, and Excel users are accustomed to working with large workbooks – in most cases a large extract provided by their friendly DBA or database developers. This is why the scenario where PowerPivot becomes a tool to overcome the one-million-row limit in Excel will be quite common. But how does it perform, and when (if ever) is it wise to do this? I will try to answer some of these questions in this post.
Performance
To test performance I mashed up the data in my PTest environment and put it in a single table. Comparing the space on disk of the normalised, denormalised and single-table approaches showed a massive difference: the single table was by far the largest, followed by the denormalised model, with the normalised model being the smallest. This is what we would expect, and it is one of the reasons for normalising in the first place.
In a database the single table would be very inefficient, since it would lead to a lot of IO in many scenarios. However, when imported into PowerPivot the sizes compare like this (variation from the denormalised model given in brackets):
PTest_Denormalised: 5.5 GB RAM, 3.5 GB file
PTest_Normalised: 5.2 GB RAM (-0.3 GB), 3.3 GB file (-0.2 GB)
PTest_SingleTable: 4.3 GB RAM (-1.2 GB), 2.8 GB file (-0.7 GB)
Obviously, eliminating the key columns with their many distinct values leads to significant space savings. In fact, the single table approach is by far the most efficient in terms of both memory utilisation and disk space (when saving the Excel file).
After I did my standard slicer testing, I got extremely good performance out of my single table. Having no relationships whatsoever seems to make for a fast PowerPivot model. So, if there were no other considerations, we could jump to the conclusion that a single table is the best possible option for PowerPivot. The next few sections of this post will show why this is actually not the case.
Usability
Let’s quickly add another hypothetical table to our model. What if we need some more data and we ask our friendly DBA for another extract with more of the same? Now we have two massive tables and we want to analyse the data from both of them. We hit a serious problem – a slicer built from one of the tables does not filter the other one. This is a familiar scenario for SSAS developers: we need a relationship between the tables we use for slicing. In other words, if we want to slice both tables by Date, we need a common table on which to base the slicer. Our model should look like the following diagram:
We do not have this relationship, and trying to add one directly between the two big tables fails because neither has a column with distinct values. This is the most obvious showstopper I can see when considering a single large table.
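As a minimal sketch of the pattern (the table, column and measure names below are invented for illustration and are not part of the PTest model): once a small Date table is related to both extracts in the PowerPivot window, a simple measure defined on each extract will respond to a slicer built on the shared table.

    Extract1 Amount := SUM ( Extract1[SalesAmount] )
    Extract2 Amount := SUM ( Extract2[SalesAmount] )

With Extract1[OrderDate] and Extract2[OrderDate] each related to DimDate[Date], a single slicer on DimDate[Date] filters both measures at once – which is exactly what the two unrelated big tables cannot do on their own.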
Size
With one table, memory utilisation seems very good. However, when we have more than one of these behemoths in memory, attribute data (e.g. our Product and Customer names) has to be stored a number of times, and we pay for that duplication. This is especially true for high-cardinality attributes – those with many distinct values, like the Order Number attribute in my sample (20+ million distinct values). In such cases, separating these attributes into their own “dimension table” saves memory and disk space.
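As a quick diagnostic (again with assumed table and column names), a DISTINCTCOUNT measure shows which columns have high cardinality and are therefore the best candidates for being moved out into their own lookup table:

    Order Number Cardinality := DISTINCTCOUNT ( SalesExtract[OrderNumber] )
    Customer Name Cardinality := DISTINCTCOUNT ( SalesExtract[CustomerName] )

Columns returning tens of millions of distinct values, like the Order Number above, are the ones whose repetition across multiple big tables hurts the most.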
DAX
How these models compare when building DAX calculations is a topic of its own, and I will soon show some comparisons which should answer two questions: which model is the most convenient and intuitive to work with, and which one is the fastest. For now it suffices to say that always working over large sets of values spread across different tables could be expected to be the slowest approach (however, I have not done sufficient testing to confirm this yet).
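To make the contrast concrete, here is a hedged sketch with invented table and column names (not taken from the PTest model). In the single-table model a calculation has to filter an attribute column inside the big table itself, whereas in a related model the same filter is placed on a small lookup table and flows to the large table through the relationship:

    Bike Sales (single table) :=
    CALCULATE (
        SUM ( SalesExtract[SalesAmount] ),
        SalesExtract[ProductCategory] = "Bikes"
    )

    Bike Sales (related tables) :=
    CALCULATE (
        SUM ( FactSales[SalesAmount] ),
        DimProduct[Category] = "Bikes"
    )

Both measures are valid DAX; which one evaluates faster on a given model is exactly the kind of question the upcoming comparison should answer.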
In conclusion, when doing ad-hoc analytics over an extract which will definitely not need to be mixed with other data, the single table approach works very well with PowerPivot, and it certainly extends the functionality Excel offers natively. However, if we are building extensible models which are to be shared and enhanced, and which will form the heart of our Team BI, we should avoid the single table because of the other considerations listed above.
It is a great article! I understand that your discussion is restricted to PowerPivot technology. However, it may lead readers to think that the one-table approach is by far the best way to model their data marts. Is that true? If you could compare the PowerPivot single-table approach vs an SSAS UDM model (with partitions), that would be great.
Hi D Fang,
It is not true when it comes to usability. For simple analysis it is arguably a useful approach, and in some cases it may be preferable to using massive native Excel workbooks. I do not think that a PowerPivot single table vs a partitioned SSAS UDM is a fair comparison anyway because of the functional differences. Maybe in the future we can compare an equivalent of a UDM model in PowerPivot…