In a recent discussion I made a statement that many data marts allow and contain NULLs in their fact table measure columns. From the poll responses I am getting here, I can see that more than half of everyone voting does have NULLs in their measures. I thought the ratio would be smaller because in many models we can avoid NULLs and it could be a best modelling practise to attempt to do so. There are a few techniques which lead to a reduction of the need for NULL-able measures and possibly even to their complete elimination. While converting NULLs to zeros has a performance benefit because of certain optimisations, it also loses it in many cases because calculations operate over a larger set of values; in addition the performance hit of retrieving and rendering zeros in reports and PivotTables may be unnecessary. Therefore, it is advisable to test the performance and usability benefits of Preserving or converting NULLs to BlankOrZero. Thomas Ivarsson has written an interesting article about the NullProcessing measure property on his blog.
I am offering two approaches, which should be always considered when dealing with measure columns containing NULLs.
Adding Extra Dimensions
Let’s assume that we have a fact table with a structure similar to this:
Date Product Actual Amount Forecast Amount
201101 Bikes 100 NULL
201101 Bikes NULL 105
By adding a Version dimension we can achieve a structure like this:
Date Product Version Amount
201101 Bikes Actual 100
201101 Bikes Forecast 105
This way we eliminate the NULL measure values and we get maintainability and arguably a usability boost as well.
Splitting Fact Tables
Another way to improve the data mart model is to ensure that we have more fact tables containing a smaller number of measures. Let’s illustrate this concept with a simple example:
Date Product Amount Exception Amount
201101 Bikes 100 NULL
201102 Bikes 120 10
201103 Bikes 110 NULL
201104 Bikes 130 NULL
In this case, Exception Amount could be an extra cost which is relevant in rare cases. Thus, it may be better moved to a separate table which contains less data and may lead to some space savings on large data volumes. Additionally, it would also allow for the same analytics in SSAS if we implement it as a separate Measure Group. In example, we could rebuild out model like this:
Date Product Amount
201101 Bikes 100
201102 Bikes 120
201103 Bikes 110
201104 Bikes 130
Date Product Exception Amount
201102 Bikes 10
Even though we have two tables and one extra row in this case, depending on how rare the Exception Amount is, the savings of an extra column may be worth it.
Sometimes it is not possible, or correct to apply these modelling transformations. In example, we could have a valid design:
Date Product Sales Amount Returns Count
201101 Bikes 100 10
201102 Bikes 120 30
201103 Bikes 110 NULL
201104 Bikes 130 20
Here adding Measure Type dimension would help us with eliminating the NULL, but would also mean that we would have to store two different data types – dollar amounts and counts in the same measure, which we should definitely avoid. Also, since we may have a very few months with NULLs in the Returns Count column and the ratio of non-NULLs to NULLs may be low, we would actually lose the benefits of the second approach.
From Analysis Services point of view, it is better to use NULL, again depending on the non-NULL-to-NULL ratio we have. The more NULLs we have, the better it is to not convert the NULLs to zeros as calculations which utilise NONEMPTY will have to work over a larger set of tuples. Also, here we may notice the difference between using NONEMPTY and EXISTS – the second one will not filter out tuples which have associated rows in the fact table even when they are NULL. It also depends on the usage scenarios – if a user filters out non-empty cells in Excel, the Returns Count for 201103 will appear as 0 if we store zeros (because of the way NON EMPTY, which Excel uses, works), or if we have set the measure to convert NULLs to zeros. In the opposite case (when we have NULLs and the NullProcessing property is set to Preserve) Excel will filter out the empty tuples.
Regardless of the scenario, it is always a good idea to try to remove NULLs by creating a good dimensional model. If we are in a situation where NULLs are better left in the measure column of the relational fact table, we should consider whether we need the default zeros in our measures, or if we would be better off avoiding them altogether.
I would be interested to see if and why you do allow NULLs in your measure groups if the reason is not in this article. Also, it would be interesting to see if there are other good ways to redesign a model in order to avoid storing NULL measure values.
Boyan – I certainly take the approach that NULL values, where possible, should not be included in a measure. However a couple of projects I have worked on sees the NULL as a valid response. My dataset is based on Student Results for exams … so for this purpose a student can score a 100% or a 0%, but could also be exempt/deferred from sitting the exam for one reason or another, this introduces the NULL. We need to record who should have sat an exam, and we need to know what their score was, so a fact table with a student_id, date_id, exam_id, grade_id and score(%) is the most logical approach for us.
LikeLike