Avoiding Multiple Role-Playing Date Dimensions

If we investigate the Adventure Works SSAS solution, we’ll notice that whoever built it has chosen to use a role-playing Date dimension for various dates (e.g. Date, Ship Date, Delivery Date, etc.). I am not saying that instead, we should have multiple identical date dimensions – no, this would be a very bad idea. However, we could simplify things a bit by providing one Date dimension and then build a separate “Event” dimension, which represents the various events during the lifetime of an order. Possible dimension members would be Ordered, Shipped, Delivered, etc.

Let’s see how this can benefit us as developers, as well as our end-users. Firstly, often enough we want our cube to be able to answer a question like: “For this date, I want to see how many orders got ordered, shipped and delivered”. If we do use a Role-Playing date dimension like in Adventure Works, these are not very easy to answer. If we assume that our users use an ad-hoc query tool like an Excel pivot table, this sort of functionality is not possible out of the box and they have to separately pick each date dimension and sequentially replace it with the others to get different counts of orders – there is no way to get the data in one pivot table at the same time, as one order will usually get Ordered and then Shipped on different dates. Therefore, if we slice by both dimensions and choose the same date in both, we will get no results. To allow this, we can build a specific measure – in example “Orders Shipped” like this:

CREATE MEMBER CURRENTCUBE.[Measures].[Orders Shipped]
AS
    (LinkMember([Date].[Calendar].CurrentMember, [Ship Date].[Calendar]),
     [Measures].[Internet Order Count]);

This way our users can slice only by Date and the Orders Shipped measure will show the count of internet orders for the selected Date. This is however a bit hard on the users, as they would need to get a similar measure for each role-playing date dimension which they need to query like this, plus they would need to know which instance of it they should use for slicing. Furthermore, LinkMember is slow and better avoided. I am considering this sort of measures a “hack”, which should not be in a final implementation unless absolutely necessary. I have, so far, found that in most cases multiple Date dimensions lead to some sort of a scenario where someone asks for something which leads to LinkMember.

Another point against using multiple Date dimensions is the fact that an average date dimension would have approximately 2000-5000 (I will use 3000 for convenience) members. If we have 3 such dimensions, the complexity of our cube grows and the theoretical cube space expands to 3000^3 = 27,000,000,000, or 27 billion cells. You can see how it would grow exponentially with each incarnation of our Date dimensions. Of course, this is mostly sparse, however it still impacts performance. In general, adding dimensions increases complexity, which reduces performance.

There is another shortcoming of this approach. This is the case when an order is Ordered but not yet Shipped or Delivered. More often than not, developers choose to leave the dimension key in the fact table for the Shipped and Delivered dates as NULL, which in SSAS will get either converted to Unknown, or will be excluded altogether. I hate NULL dimension keys. Therefore, I always prefer to replace the NULL with something meaningful – in most cases with -1, but in the case of a Date dimension this becomes a bit different. The common approaches would be to either replace the date with an outlier value (e.g. 1900010 or 99991231), or with the first/last values in the dimension (e.g. 20000101 or 20200101). Querying this type of a fact table can also cause headaches. Asking “How many orders are Ordered but not yet shipped?” in MDX has to be replaced with asking “How many orders are Ordered but shipped on 99991231?” For tricky developers like us this is ok, but I would hate to explain this to Excel users.

Lastly, I want to discuss the issue of adding an additional “event” for the orders. In example, if we want to now record when an order gets Delivered, we have to add one more dimension to our fact table, change our cube to include that and potentially we would need to change some of our MDX code. This is a major inconvenience which is better avoided.

I mentioned an alternative. This is one Date dimension and an Event dimension (it could be better to call it Status in many cases). The Event dimension can contain only a few members – in our case Ordered, Shipped and Delivered. In the fact table we record each event. Therefore, when an order gets placed, we insert one row with the applicable Date and Event (in my example – Ordered). Then, when an order gets Shipped, we insert another row for the same order but with a different date and with an Event of “Shipped”.

Let’s see if this eliminates the problems above (hint: yes, it does):

Now users can simply drop our count of orders measure in the data area in a pivot table, then filter by a date and place the Event dimension on rows – the pivot table displays the relevant counts for the selected date. Thus, we eliminated the ambiguous bit.

There is also no need to use unusual dimension members for events which have not happened – we simply have no fact rows for them. Therefore, when we want to query the cube for orders which are ordered but not yet shipped, we can just ask for ones which are empty when sliced by the Shipped member of the Event dimension:

CREATE MEMBER CURRENTCUBE.[Measures].[Ordered but not Shipped]
AS
    FILTER([Order].[Order].[Order],
                     ([Event].[Event].[Event].[Ordered], [Measures].[Order Count]) > 0 AND
                     ([Event].[Event].[Event].[Shipped], [Measures].[Order Count]) = 0).COUNT;

We also completely avoid using LinkMember and the performance issue around the cube complexity, since we have only one Date dimension and a few Event dimension members. Instead of exponentially, the potential cube space grows linearly with each Event (instead of 3000^3, we get 3000*3 for three Events).

As for the last issue – when we want to add a new event, we can do this by simply adding one more Event dimension member and start inserting rows in the fact table with its key. As simple as that.

Therefore, yes, we have eliminated the problems I discussed.

Potential drawbacks (there are always some):

  1. The fact table will grow faster as we would be storing multiple rows per order. Well, this should not normally be a huge issue, but in some cases it could be. From a pure SSAS point of view, this is a non-issue but it should be considered in some scenarios. A mitigating factor is the decrease in row size in result of the elimination of the multiple Date keys but this is not likely to offset the extra rows in most cases.

  2. Storing amounts against orders becomes a bit more difficult in this case, because we may have to duplicate the amounts for each event and use something like LastNonEmpty to not sum the amounts over time, as well as make the Event dimension non-aggregatable, so amounts do not get aggregated for two different events. Also, we could possibly move the amounts to their own fact table.

  3. If we query the fact table in T-SQL as well, this becomes a very inconvenient model as simple tasks as finding the number of dates between an order is Ordered and Shipped cannot be done with a simple DATEDIFF() call. This is not really a problem in SSAS and we are talking about SSAS in this post but you should keep it in mind.

Stefan Riedel, a colleague of mine helped me with identifying some problems – thanks to him this post is not as incomplete as it could have been. There could be more, and I would welcome you to comment on this post with possible problems and solutions, which could help other readers implement this sort of a scenario better.

17 thoughts on “Avoiding Multiple Role-Playing Date Dimensions”

  1. Thanks for this post. I’ve been dealing with this issue for a few days on my first cube and I did arrive at the linkmember() solution. While it works I do see the performance suffering slightly. I think I’ll try this new idea today. I can already think of some advantages and how to deal with some possible problems (such as the inventory counts / aggregations and so on).

    What do you think about a fact dimension with 3 dates that currently has 2.5 million rows (1 for each item). Can create a named query to turn it into 3 rows for each item based on each of the 3 dates (may eventually change the ETL to do this at the DW if it works properly). The query i’m thinking of is essentially a union all with some extra fields such as date (only one) and a description of the event (received, processed, completed). To solve the aggregation issue I could use separate fields as a count (one for each date) and make the fields null for the mismatched dates (date received count would be null for the date shipped row for that item).

    Of course I’ll have to deal with a PK (current PK & date stamp & 1, 2, 3, etc for each date row maybe).

    I think this may just work. Thanks for letting me brainstorm in your comments and for your blog. Very helpful.

    Like

  2. Just a quick note to let you know that it worked perfectly. Now I can show received items on the same date dimension with completed items and any other date related item. Certainly helps to clean up many calculations and makes for an easier to navigate cube as well. Thanks for the tip!

    Like

    1. Good news 🙂 Thanks for the feedback. Also, since now you may have quite a few NULLs in your fact table, youmay benefit from changing the way SSAS handles them – which may give you a performance boost (for your calcs) and could also help you in not showing them as 0s. Have a look at this post:

      http://thomasivarssonmalmo.wordpress.com/2008/06/27/null-processing-of-measures-in-ssas2005/

      Nothing has changed in SSAS 2008 and R2, so the post is still relevant.

      Like

  3. Interesting idea. Our cube has used both approaches but not for any particular reason 🙂

    We have several date dimensions (role playing) joined to the fact table for things like date of service, invoice date, invoice payment date, etc.

    Separately we measure and track some event times for a workflow (patient treatment in hospitals) and this is done using your approach mainly because we cannot hardcode the types of events as each customer will slightly vary the types.

    Both work reasonably well although the user’s (and my) frame of mind are different when analysing the wo types of data. I’m going to experiment with this in my huge amount of spare time () and report back – might be some time away though 🙂

    Like

  4. Jason,

    Maybe in some cases it could be a good idea – especially when the same dimension is used more than 2-3 times. However, it might not be a such a great approach when considering unrelated dimensions with the same/similar structure as it could be confusing for end-users. Still, it is a good thing to have in mind when designing a dm.

    Like

  5. This is a SQL query from John Simon (my manager), which can be useful for building a view on top of a table containing multiple date dimensions. The good thing here is that if we use such a view, we get the best of both worlds – we can use SQL to easily query the relational table, and SSAS will happily use the output of the view as described in the post:

    ;WITH cte AS (
    SELECT ProductKey, SalesOrderNumber, OrderDateKey, DueDateKey, ShipDateKey
    FROM AdventureWorksDW2008R2.dbo.FactInternetSales
    WHERE SalesOrderNumber = ‘SO43953’)
    SELECT SalesOrderNumber, ProductKey, DateKey, EventType
    FROM cte
    UNPIVOT(DateKey FOR EventType IN (OrderDateKey, DueDateKey, ShipDateKey)
    ) u

    Like

  6. Great post! In switching to the event based approach I see how you can easily get counts of orders based on the event dimension. Would it also easily allow you to calculate durations/avg durations of the various events?

    I’ve taken the adventureworks approach and have a single fact row to track progress of patient care. Currently I’m calculating the durations in my ETL and storing them on my fact row. Seems to work fine but I do end up with empty durations. You’re approach looks a lot cleaner but I’m trying to get my head around how the durations between the event rows could be calculated (or maybe you would still store the durations with the events?)

    Like

  7. Hi Dan,

    Thanks for your comment!

    You can calculate durations per patient easily with end-start*patient. If you are doing it in the ETL now, you can probably continue doing it in the ETL and store them as 0 on the start event and a progression over the type of events. In example, if a duration of 1 day is encountered, in the etl against Start you’d have 0 and when next one comes in, you can update the start to 1, with 0 for the next one, etc. Then you have an easy way to calculate avg/sums since you can work with the event dimension and the duration measure…

    I guess it could be different in your case, but I think that many scenarios are well covered with this event-based approach.

    Like

  8. Hi Boyan,

    Great post. I’ve been struggling with the same problem, and was thinking around the same idea, but it just got clarified much better with this post. However, I have one more issue which this doesn’t exactly solve (I think). So in this example, when you have a separate event for orders (shipped, delivered, refunded). The way you mentioned allows you to see (based on a date filter) all the different types of events for a given date range. But how do you see related facts, such as, for e.g. for the month of april (so my date range is april 1 – april 30), I have 100 orders that were placed. And I want to see how many of those orders were refunded (regardless of the date that was refunded). So, even if the order was refunded in May, I want to see that order XXX was refunded when I see the purchased and refunded counts of April…. is there a different model for this, or an MDX query would get this data for me?

    Like

  9. Hi Mickey,

    You can achieve what you want by either: A. Adding a IsRefunded (naming is your preference, of course) flag to the Order dimension and updating it in the ETL. This will require some ETL work, but will be fast at query-time; B. Writing MDX to check if there is a Refunded event in your fact table for all dates (something like Exists(Existing ,) may help to illustrate the concept). This will be much slower at query-execution time but won’t require any ETL work.

    Like

  10. Thanks Boyan 🙂 I ended up doing a combination of the two, which is basically making a ‘named calculation’ field to represent the ‘IsRefunded’ field – the expression for this just looks up the fact table based on the order number to see if it was refunded. Do you think this will also be slow – I assume it will definitely be slower than having the physical column – but I’m thinking it may be a good compromise…. I will end up having close to a few million rows in the table…

    Like

  11. Well, the MDX queries should be fast, but you will probably suffer a bit while processing. Still – if it works for you (in terms of processing time) it is a good solution imho.

    Like

  12. Pingback: individual day calculations ,in MDX, with Start and End date in fact table - Programmers Goodies
  13. Working through this for the first time myself; what about using multiple Measure Groups instead? For example, create a new View in your warehouse called “Shipped” and select only those orders that have a Ship Date. Bring this view through to your SSAS dsv and into your cube, linking this “Ship Date” to your already existing Date dimension (which you may give a friendly name of “Reporting Date”). Now when you slice by Reporting Date you can pull the Ordered total from your original measure group, and the Shipped total from the new measure group and it should give the correct figures for each period. I think it is fairly user friendly on the client (Excel) as well. Comments?

    Like

Comments are closed.