Database Compression Effects on SSAS Processing

Just recently I was optimising the processing time for a rather large cube on my laptop. The hardware I used was not tremendously powerful (a mobile i5 CPU, 8 GB RAM and one 7200 RPM hard disk), and a 500M-row fact table proved to be somewhat of a challenge for it.

The setup was fairly typical – a large fact table with a few dimensions in a star schema. The goal was to get it to process as fast as possible.

After ensuring all best-practice advice from Microsoft (and in particular SQL CAT) was followed, I managed to get the processing down from 7381.075 seconds to 5801.536 seconds (approximately a 21% improvement). However, one thing still did not look quite right – the CPU utilisation during the partition processing phase.

Considering my hardware configuration it was easy to deduce that the HDD would be more of a bottleneck than the CPU. There was only one HDD and everything, including the source database and the SSAS files, was on it. During partition processing, reading and writing happen simultaneously, which is not good for performance. At the same time, in my case the CPU utilisation sat at around 75%, with 25% to spare, which was not great. Ideally, I would like to see the CPU at 100%, or very close to that, to ensure that I am utilising all of its power. However, I had no way to add more disks.

Luckily, in SQL Server 2008 (Enterprise Edition) we have the option to compress database tables or partitions. We do take a performance hit on the CPU when we (de)compress the data, but the data takes less space on disk and thus requires fewer IOPS to read. It sounded like a perfect fit for my problem. After estimating the size benefits that row and page compression would give me, I decided to go with the heavier option, since even its marginally better (5%) reduction in size over row-level compression was desirable. After waiting patiently for the compression operation to finish, the overall database size did drop to around 30% of the original, as predicted.
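
For reference, this is roughly what the two steps look like in T-SQL – a minimal sketch with a hypothetical fact table name (dbo.FactSales):

-- Estimate the space savings first
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Then rebuild the table (or individual partitions) with page compression
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = PAGE);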

I ran my SSAS processing script again and it completed in 4805.408 seconds – approximately 17% better than the previous run, which is a very significant improvement. The CPU utilisation went up to around 97%, which also meant that my blunder of picking page over row compression still produced, on the whole, good results.

One thing we must always consider is whether the bottleneck is in the IO. A few months later I replaced my hard disk with a new SSD (OCZ Vertex 2), which increased the IO performance of my laptop dramatically. When I tested SSAS processing tasks I could almost always see my CPU steady at 100% without any compression. If I were to compress my fact tables I would expect to get worse performance, because my CPU would struggle to both decompress and process the data at the same time.

In conclusion, you must know what your bottleneck is before you opt for a solution like compression. It comes at a price, but the price may well be worth paying if your system is not well balanced.

Debugging SSAS Cubes – a Few Tips

There are a few techniques which can help here, and I will list some tips which seem to elude many people experiencing problems with their implementations.

1. Do not use the Cube Browser and Excel for troubleshooting

Yes, it is convenient to just drag and drop some hierarchies and measures in the Cube Browser, or in Excel, but when you have trouble with your results these tools will obscure the problem. This happens because every client tool has to generate MDX, and in doing so they often generalise the approach and add “nice” language constructs like VisualTotals, or always use a subselect clause. There is nothing wrong with checking the end result of your work with tools showing what your users see, but when your calcs don’t behave the way you expect (especially when filtering things), I would definitely stay away from GUI clients. Instead you can write some MDX in SSMS and pull exactly what you need in a clean and straightforward way. We can play with swapping WHERE clauses for subselects, add calculations directly in the queries, etc. – in brief, we get the full calculation power of MDX in our hands for troubleshooting. Additionally, next time you ask someone else a question you can start with “I have this MDX query…” instead of “I have this pivot grid in Product X…” – it is much more likely that you will get a meaningful response.

2. Use SQL Server Profiler

If your MDX works perfectly fine against your cube, but when you use Excel 20xx you get funny results, there is no better way to find what the problem is than to run a trace with SQL Server Profiler, capture the MDX the tool has generated and then work with that to debug – again in SSMS. Different versions of Excel generate different MDX and the difference can be very important. When using other tools the differences become even larger. The only way to see what your tool does (other than talking to the people who built it) is to capture its SSAS queries.

3. Ask for help

Once you have arrived at the conclusion that you need help (e.g. after exhaustively searching for the problem online), you can quickly get an answer at a few forums. My preferred one is MSDN, which has a dedicated SSAS section. Also, you can ask at the www.ssas-info.com forum section, or even at StackOverflow. Either way, you will need to know which versions of each product you are using (especially the SQL Server version, service packs and updates applied) and, ideally, post your MDX (if you have 10 pages of queries it would help to somehow abbreviate the repeating sections instead of copy-pasting the lot). Also, do not forget that the people reading your questions do not know your dimensional model or which features of SSAS you are using (e.g. LastNonEmpty, unary operators, etc.) – make sure that you describe any specifics; otherwise everyone will assume a typical star schema with a straightforward implementation in SSAS. These approaches, together with being courteous and careful with spelling and grammar, will help attract fellow developers who may know the little SSAS quirk or MDX feature which resolves your issue.

4. Use Adventure Works

Replicating your problem in Adventure Works is another great way to ensure you get an accurate answer. Because practically everyone on the forums has access to Adventure Works, showing some MDX which produces an outcome you did not expect is the best way to pinpoint the problem, eliminating all the question marks which your specific model and data could be introducing. Additionally, you will probably get an answer which includes some sample code, and you will be able to run it immediately and see the results.

SSAS to BISM – Recent Developments

There was a fair bit of FUD around the future of SSAS and just now it got officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead of SSAS. If you haven’t read these I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

Now that the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS”. I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we are not getting a tremendously successful product replaced; instead we get a new and exciting addition to it. Rumour has it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? A unique mix of on-disk and in-memory analytics. Edges may have to be smoothed, but at the end of the day we get more, not less.

What caused the confusion? Well, in my opinion the SSAS team may be the greatest in analytical tools, but in this particular case I think that the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us they did not accurately draw a roadmap, which led to the rumours. I hope that in the future this won’t happen, and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.

Load Test Your SSAS Instance

If you are wondering how your SSAS box will perform with multiple users executing various queries – presented in an extremely cool visual way – you now have the option to go to CodePlex and download the new AS Performance Workbench by Rob Kerr from BlueGranite. It is awesome! And to prove it, have a look at the following YouTube demo (6m 24s): http://www.youtube.com/watch?v=Kn4-3JTyDXM.

For any comments, requests and loud cheering you can also visit the project discussion page: http://asperfwb.codeplex.com/discussions.

Enjoy!

SSAS SCDs: Showing Active Dimension Members With No Data

Type 2 dimensions play well with SSAS because, when we use them, SSAS determines which members are relevant to which periods based on the data in the measure group we have. This is always the case – if we slice by a particular date, then all members from an SCD which do not link up to the data in the slice will yield empty values, which we can easily hide in most clients (or with NON EMPTY in MDX). This works well in most cases. However, there is a small problem in a small subset of usage scenarios – when we have no data against a member which is nevertheless active in the selected period. Because it still yields nothing when we go through the data in the cube, it gets hidden, too.

Let’s explore a simple case. We have an SCD called Employee Type, which has two members, Internal and External, for 2010, but three members, Internal, Non-Internal and Contractor, for 2008 and 2009; we also have a measure called Employee Count. If we slice by Employee Type we get something like:

Internal          150
External           20
Non-Internal       15
Contractor         10

When we slice by a year we may get something like:

                 2008    2009    2010
Internal           50      50      50
External                           20
Non-Internal               15
Contractor          5       5

Note that if we slice by 2008 and we hide empty cells we would get:

                 2008
Internal           50
Contractor          5

However, what we may want to see on our report may be:

                 2008
Internal           50
Non-Internal
Contractor          5

Un-hiding empties would actually show:

                 2008
Internal           50
External
Non-Internal
Contractor          5

Well, in SQL we would just use the EffectiveFrom and EffectiveTo dates in the dimension table to determine the correct results. In SSAS we are not that lucky. I would like to show a possible solution.

Firstly, for an SCD we would typically have the dates I just mentioned – EffectiveFrom and EffectiveTo for each row. I would typically exclude them from the dimension in SSAS, but in our specific case we need them. Therefore, we can add them and just hide them instead of excluding them completely. Once we have them set up in this way we could write a bit of MDX, using LinkMember:

WITH
MEMBER [Measures].[ActiveType]
AS
  // 1 if the member's EffectiveFrom-EffectiveTo range overlaps the current Year, NULL otherwise
  IIF({{LinkMember([Type].[EffectiveFrom].CurrentMember,
                   [Date].[Date]):
        LinkMember([Type].[EffectiveTo].CurrentMember,
                   [Date].[Date])}*
       [Date].[Year].CurrentMember}.Count = 0,
      NULL,
      1)
SELECT
{
  [Date].[Year].[Year].&[2008]*
  [Measures].[Employee Count]} ON 0,
{
  NONEMPTY([Type].[Type].[Type],
           [Measures].[ActiveType])
} ON 1
FROM [MyCube]

Now we get exactly what we want (Non-Internal shown in 2008 but with no data):

                 2008
Internal           50
Non-Internal
Contractor          5

Please note that for large dimensions this is a very bad approach from a performance point of view and should be avoided. Surprisingly enough, users rarely put performance in their top 10 priorities, while functionality somehow always makes it there, so someone might find this technique useful in extreme cases.

It would also be interesting to know whether there is another approach to this scenario which I am unaware of – I am sure developers have hit (or have been hit by) this problem in the past and there must be other solutions as well.

Note: If you try to replicate this in Adventure Works as I did, you will find that the keys in the Date dimension(s) and the keys for the Start and End Date in the Type 2 SCDs are not the same. The Date dimension uses integer keys, while the Start and End Date attributes use datetime. Therefore, LinkMember will not be able to match these cross-dimensional attributes and the above approach will not work. As a lesson from this exercise – Effective From and Effective To columns should be of the same data type as your date attribute key in the Date dimension table.

Default Measure in SSAS Cubes

When writing MDX it is always a good idea to know the context in which the query executes. If we do not explicitly specify a hierarchy member in a tuple, SSAS replaces it with the default one, which more often than not is the All member of the dimension. Because the All member is the default default member, it is easy to think that the default is always the All. This is untrue – the default can easily be changed through BIDS. A sinister (not really, but the word is cool) consequence of this is very apparent when we consider the Measures dimension.

Measures sit in a dimension of sorts. This dimension does not have an All member, so we can say it is non-aggregatable; however, it does have a default member, which is used whenever we do not explicitly specify a [Measures].[<member>]. The most confusing part is when we omit the Measures member in a function call, and a prime example is the NonEmpty() function, which goes like this:

NonEmpty(<set>,<set>) -> NonEmpty([Customer].[Customer].[Customer], [Date].[Calendar].[CY 2007])

What we get here is a slight problem. NonEmpty still uses a measure – that is the default measure. And if we are not careful, we end up getting incorrect, or even invalid results (e.g. if the default measure is not related to one of the dimensions we may get the set of all members from the first set). However, if we do:

NonEmpty([Customer].[Customer].[Customer], ([Date].[Calendar].[CY 2007], [Measures].[Internet Sales Amount]))

We will get exactly what we want – the function uses the correct, or at least a known measure.

Often we do not know what the default Measure member is. To find out, we can just write this little query:

WITH
MEMBER [Measures].[dMemberName] AS
  [Measures].DefaultMember.Member_Name
SELECT
{
  [Measures].[dMemberName]
} ON 0
FROM [Adventure Works]

The result is the default Measure member name.

3×3 Worst Development Practices 2010 with SSIS, SSAS and SSRS

There are some practices which are outright bad and some which are only sometimes bad. I will try to make a list of the top three outright worst practices for each of SSIS, SSAS and SSRS which have annoyed me most throughout 2010 (and previous years). I am sure there are many more, but these are the ones which I have witnessed often enough to deserve some attention.

SSIS

Using default names for all components in a control flow and a data flow

I really dislike opening projects and seeing only generic names. I have built some like this in the past because of insufficient time to document everything, but I hated creating them and later leaving them like that. I have also seen developers with time on their hands carelessly omitting proper names. The worst variation of this practice is renaming the components to something equally unintelligible, even if marginally better. Many out-of-the-box ETL tools actually do that and make subsequent maintenance much harder.

Performing joins with Merge Join

Chaining five Merge Joins with Sort components between them is something SQL Server does faster and with far fewer resources as a single set-based query. Therefore doing joins in the database is a no-brainer, even if your preference lies with a fully SSIS-idised process.
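
As a rough illustration – with hypothetical table names – the whole chain of Sort and Merge Join components can usually be replaced by a single set-based query in the data flow source:

SELECT  f.OrderID,
        f.OrderDate,
        c.CustomerName,
        p.ProductName
FROM    dbo.FactOrders  AS f
JOIN    dbo.DimCustomer AS c ON c.CustomerKey = f.CustomerKey
JOIN    dbo.DimProduct  AS p ON p.ProductKey  = f.ProductKey;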

Not using configurations

Luckily, in Denali we are getting a completely revamped project configuration approach. Even if imperfect, Package Configurations are a must for every SSIS project. Being able to migrate projects, change databases and set variables without editing the packages is just too good to miss.

SSAS

Using MDX for everything

MDX has an aura of a language which allows you to do everything much faster than SQL. Well, this is not entirely true, and in the majority of cases a change in the design will let you avoid MDX altogether, which in turn removes a whole layer of complexity and allows tremendous improvements over any programmatic approach.

Importing too much dimensional data

Often developers underestimate the impact of unnecessarily large dimensions with many useless attributes. Here the best approach is to add more attributes as you need them, rather than importing everything because we can. Cubes are smaller and less complex with fewer attributes, processing is faster, and most importantly the users actually like having less clutter.

Leaving NULLs in the relational database

SSAS does not like NULLs. I have not seen a case where NULLs are necessary in dimension tables. While I would leave a NULL for a measure when there is no data for it, too many of these can be quite harmful, and it may be a sign of a bloated measure group as well.
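
A minimal sketch of what I mean for dimension loads, with hypothetical table and column names – replace the NULLs with explicit “unknown” values before they ever reach SSAS:

SELECT  CustomerKey,
        ISNULL(CustomerName, 'Unknown') AS CustomerName,
        ISNULL(CountryCode,  'N/A')     AS CountryCode
FROM    staging.Customer;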

SSRS

Storing business logic in report queries

While we can, we should never store SQL and MDX logic in the reports when it can be implemented in stored procedures or calculated measures. Maintaining SSRS queries is much harder and less transparent. There are performance implications around caching, too.

Merging data sets in SSRS

This is hard to do, and when it is done, it is always bad. Data sets should be merged in the queries – not in the reports.

Building massive reports

In many organisations reports are seen as a transport mechanism to Excel. Users generate reports of hundreds of thousands of rows with a large number of columns, and sometimes even basic calculations. All this can be done simply and easily at the database level without going through SSRS.

Analysing Twitter Trends

Recently I built a pilot for a demo at Avanade based on Twitter trends, entirely using Microsoft SQL Server 2008 R2 and Microsoft Office. Since we keep hearing about “Social BI” and other “social” aspects of software, this may help introduce an approach to the topic from a Microsoft BI point of view.

The basic idea is: 

  1. Connect to search.twitter.com and get a feed which contains the results from a search
  2. Do #1 every now and then (in my case – every 10 seconds)
  3. Pick the last 15 feed items (tweets) and compare them to the 15 items from the previous pull to determine which tweets are new
  4. Store all new tweets in SQL Server
  5. Build a few dimensions (author, time, etc.)
  6. Build a categorisation dimension. In my case I am using Full-Text search to find certain keywords around my search strings (e.g. Windows – bad/good/excellent/terrible), so I can analyse the feedback the search term is receiving
  7. Build a SSAS cube on top of these dimensions and the feed history
  8. Build an Excel pivot-based report on top of the SSAS cube

Now, for the first 4 points, we can use SSIS. We cannot natively connect to RSS/ATOM feeds in SSIS, but we can easily build a Script Transform as a data source, which connects and pulls the feed items in a data flow. I found a very useful C# reference here: 

http://blogs.msdn.com/b/mattm/archive/2009/02/19/read-an-rss-feed-from-ssis.aspx 

I customised it a bit (99% is the same as in the example above) so that it works with Twitter and does not include irrelevant info. Of course, depending on what we need to achieve, we can change the code to support all RSS/ATOM features. A sample of my script is below:

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.ServiceModel.Web;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Text;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private string url = string.Empty;
    private SyndicationFeed feed = null;
    private XmlReader reader = null;
    public override void PreExecute()
    {
        //base.PreExecute();
        //here we are using an SSIS variable for the feed address, which we can populate
        //from a database table, thus providing an easily customisable interface to our
        //package
        reader = XmlReader.Create(Variables.Feed1);
        feed = SyndicationFeed.Load(reader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
        reader.Close();
    }
    public override void CreateNewOutputRows()
    {
        if (feed != null)
        {
            foreach (var item in feed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.id = item.Id;
                Output0Buffer.published = item.PublishDate.DateTime;
                Output0Buffer.title = item.Title.Text;
                Output0Buffer.authorname = item.Authors[0].Name;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

Once we can connect to our source, everything becomes pretty much run-of-the-mill ETL. My package at present supports only 5 feeds, and they get pulled simultaneously. This is of course just a demo and is easy to extend as required. A natural way to improve this solution is to implement some sort of looping logic (i.e. a looping sequence container), so you can harvest more feeds. A particularly convenient feature of the feed items is their ID, which is very useful for comparing items. Another important property of our tweets is that they may contain one or more of the keywords we are searching for. Therefore, we would like to count them for each feed (sometimes resulting in the same feed item getting counted 3 or more times). A good example of this is if we are searching for Microsoft, Vista and Windows7, and we receive something like: “Microsoft is doing well with Windows7 – it is so much better than Vista” – here we want to know that all three terms got hit, and we want to show that there was a tweet for each (multiplying the total count by 3).

The following is an example of my SSIS items:

Control Flow: (screenshot)

Data Flow: (screenshot)

I am doing a little trick here to compare the previous to the present feed items – tagging them as old at the start of the control flow and then deleting the old ones at the end. This is a sort of snapshot processing logic and works just fine when only this package is run, many times in a row (which I am doing through SQL Server Agent). The output of this package is one table with all feed items.

We don’t need much more for a simple implementation such as WardyIT’s #sqlpass Trends:

http://www.wardyit.com/sqlpass/

However, once we decide to progress a bit further, we can achieve much more. As I mentioned earlier, we can use something like Full-Text Search to analyse the content of the feeds. Utilising CONTAINS with the NEAR operator, we can find keywords around the search terms. This can be very useful if we want to make a little bit of sense of the raw tweets. I prepared a table which categorises 20 words into Positive/Negative and, one level below, Strongly/Moderately/Slightly Positive or Negative, and I am matching these to the tweets. This allows me to see trends in the like/dislike logic for each search term (a rough sketch of the matching query is shown below). For ad-hoc analysis and easy consumption in Excel, I also created an SSAS solution on top of these tables.
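
Going back to the keyword matching, here is a minimal sketch of that step. The table and column names are hypothetical, a full-text index on the tweet text is assumed, and the sentiment keyword is hard-coded for brevity – in the real package it would come from the categorisation table (e.g. via a variable or dynamic SQL):

-- Find tweets where the search term appears near a sentiment keyword
SELECT  TweetID,
        Title
FROM    dbo.Tweet
WHERE   CONTAINS(Title, '"Windows7" NEAR "terrible"');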

 

I am processing my cube every hour, which allows me to see fairly up-to-date results from it. Alternatively, depending on requirements, you can easily switch it to ROLAP (considering the very low complexity of the cube and the relatively low amount of fact data) and get real-time analytics. Of course, the result matters the most, and this is a little (un)useful dashboard in Excel which showcases a few of the analytical options we get from this very simple implementation.

 

Please don’t slam me for the choice of search terms and some of the visualisation items (e.g. the line chart x-axis) – I know some are not quite relevant, but it is just a quick demo, right? :)

Thoughts on BISM, SSAS and MDX

I missed PASS this year but I am addicted to all blog posts coming from the attendees. I am reading, sending links to colleagues and mostly thinking about the future – both with disappointment (rarely), and excitement (99% of the time). There were two very significant blog posts by Chris Webb and Teo Lachev in the last couple of days and I would like to share my thoughts on the future of SSAS.

Chris and Teo are both moderately unhappy (that is how I read the mood from what they wrote) with the future of SSAS because of the shift of focus from “traditional” M/H/R OLAP to the in-memory Vertipaq engine, as well as the switch from MDX to DAX. My initial feeling was similar – I like MDX and the way SSAS works. After all, SSAS is the best OLAP tool on the market right now, and it got there after a long evolution. As Donald Farmer mentioned at Tech Ed in Australia this year, SSAS is the best-selling OLAP tool for Oracle. It is also the best-selling OLAP tool in general. Leaving a tool like that and moving on to a completely new technology is a stunning move, in my opinion.

There are a few things which I think are important to consider before declaring it a wrong move. First of all, in today’s affordable 64-bit world there are few hard limits for in-memory OLAP. Also, MDX, while powerful, different, interesting and sometimes elegant, is HARD. I know (personally) literally 3-4 developers in Australia who can write good MDX and grasp the concepts behind it. The vast majority of feedback on MDX is that it is “too hard”. Therefore, the two main selling points of SSAS – ease of building analytics and performance – are hard to improve on without radical action. The only way to make Microsoft OLAP easier to use is to move on from MDX, and the easiest way to improve performance is to go in-memory. This is why I am very optimistic about the future of SSAS and BISM. If a small competitor like QlikView can build a good product (technologically), which can in some cases outperform SSAS, I am very confident that the SSAS team, with its vast expertise in building the most successful OLAP tool in the world, can beat them over a few years. It is better to start earlier rather than later.

On the flipside, the handling of the whole matter is puzzling and disorienting. Chris had a few really good points about the problems that may arise from such a swift transition. How do we sell traditional pre-Vertipaq SSAS when it is getting replaced with a new technology which will require a couple of releases (at least) to beat everyone else and become Enterprise-ready? This is a question I would like to get an answer to, as I am an advocate of Microsoft BI, and apart from the minor inconvenience of weakening my base for arguments with advocates of other BI suites/products, this will no doubt cause a major headache when talking to prospective clients. This is in fact my only concern.

To recap – I am very happy about the ongoing evolution of SSAS, and I will be very happy when Vertipaq becomes the SSAS of in-memory BI, when we have an easier language to query it, and when the stack gets better integration between its components. However, the transition is too abrupt for my taste. The “security through obscurity” approach to marketing I am observing may not be able to convince enterprise customers with complex needs and strategic plans to adopt a tool in transition. I hope that Microsoft puts more effort into a seamless migration path from SSAS to Vertipaq, so that the #1 spot of Microsoft’s OLAP technologies is preserved through this drastic change.

I am jumping straight to Amazon now to buy a book or two on DAX – a great opportunity for me to get a head start in DAX and PowerPivot before we get the technology in SSAS :) My previous concerns about PowerPivot adoption rates because of a few little things are now replaced with excitement over the fact that we will get this new technology on a full-featured client-server architecture (if not fully featured in Denali, then soon after that).

UPDATE: Please note Amir Netz’s response to Chris’ post in the comments – it explains the intent behind BISM and paints a brighter future for MOLAP and MDX.

UPDATE #2: TK Anand has a blog post in regard to this topic here.

UPDATE #3: Chris Webb’s follow-up with a more optimistic tone is here.

Avoiding Multiple Role-Playing Date Dimensions

If we investigate the Adventure Works SSAS solution, we’ll notice that whoever built it chose to use a role-playing Date dimension for the various dates (e.g. Date, Ship Date, Delivery Date, etc.). I am not saying that we should instead have multiple identical Date dimensions – no, that would be a very bad idea. However, we could simplify things a bit by providing one Date dimension and then building a separate “Event” dimension, which represents the various events during the lifetime of an order. Possible dimension members would be Ordered, Shipped, Delivered, etc.

Let’s see how this can benefit us as developers, as well as our end users. Firstly, often enough we want our cube to be able to answer a question like: “For this date, I want to see how many orders got ordered, shipped and delivered”. If we use role-playing date dimensions as in Adventure Works, such questions are not very easy to answer. If we assume that our users use an ad-hoc query tool like an Excel pivot table, this sort of functionality is not possible out of the box; they have to pick each date dimension separately and sequentially replace it with the others to get the different counts of orders – there is no way to get the data in one pivot table at the same time, as one order will usually get Ordered and then Shipped on different dates. Therefore, if we slice by both dimensions and choose the same date in both, we will get no results. To allow this, we can build a specific measure – for example “Orders Shipped” – like this:

CREATE MEMBER CURRENTCUBE.[Measures].[Orders Shipped]
AS
    (LinkMember([Date].[Calendar].CurrentMember, [Ship Date].[Calendar]),
     [Measures].[Internet Order Count]);

This way our users can slice only by Date and the Orders Shipped measure will show the count of internet orders shipped on the selected Date. This is, however, a bit hard on the users, as they would need a similar measure for each role-playing date dimension they want to query like this, plus they would need to know which instance of the dimension they should use for slicing. Furthermore, LinkMember is slow and better avoided. I consider this sort of measure a “hack”, which should not be in a final implementation unless absolutely necessary. I have, so far, found that in most cases multiple Date dimensions lead to some sort of scenario where someone asks for something which leads to LinkMember.

Another point against using multiple Date dimensions is the fact that an average date dimension would have approximately 2000-5000 (I will use 3000 for convenience) members. If we have 3 such dimensions, the complexity of our cube grows and the theoretical cube space expands to 3000^3 = 27,000,000,000, or 27 billion cells. You can see how it would grow exponentially with each incarnation of our Date dimensions. Of course, this is mostly sparse, however it still impacts performance. In general, adding dimensions increases complexity, which reduces performance.

There is another shortcoming of this approach. This is the case when an order is Ordered but not yet Shipped or Delivered. More often than not, developers choose to leave the dimension keys in the fact table for the Shipped and Delivered dates as NULL, which in SSAS will either get converted to Unknown or will be excluded altogether. I hate NULL dimension keys. Therefore, I always prefer to replace the NULL with something meaningful – in most cases with -1, but in the case of a Date dimension this becomes a bit different. The common approaches are to replace the date with either an outlier value (e.g. 19000101 or 99991231), or with the first/last values in the dimension (e.g. 20000101 or 20200101), as sketched below. Querying this type of fact table can also cause headaches. Asking “How many orders are Ordered but not yet Shipped?” in MDX has to be replaced with asking “How many orders are Ordered but Shipped on 99991231?” For tricky developers like us this is OK, but I would hate to explain it to Excel users.
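
A minimal sketch of that replacement during the fact load, with hypothetical table and column names:

-- Substitute an outlier date key for NULLs so SSAS never sees NULL dimension keys
SELECT  s.OrderID,
        s.OrderDateKey,
        ISNULL(s.ShipDateKey,     99991231) AS ShipDateKey,
        ISNULL(s.DeliveryDateKey, 99991231) AS DeliveryDateKey
FROM    staging.[Order] AS s;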

Lastly, I want to discuss the issue of adding an additional “event” for the orders. For example, if we now want to record when an order gets Delivered, we have to add one more date key to our fact table, change our cube to include it, and potentially change some of our MDX code. This is a major inconvenience which is better avoided.

I mentioned an alternative: one Date dimension and an Event dimension (it may be better to call it Status in many cases). The Event dimension contains only a few members – in our case Ordered, Shipped and Delivered. In the fact table we record each event. Therefore, when an order gets placed, we insert one row with the applicable Date and Event (in my example – Ordered). Then, when the order gets Shipped, we insert another row for the same order but with a different date and with an Event of “Shipped”.
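
In relational terms, the fact table could look something like this – a sketch with hypothetical names and keys:

CREATE TABLE dbo.FactOrderEvent
(
    OrderKey   int NOT NULL,
    EventKey   int NOT NULL,  -- e.g. 1 = Ordered, 2 = Shipped, 3 = Delivered
    DateKey    int NOT NULL,
    OrderCount int NOT NULL DEFAULT (1)
);

-- Order 1001 is placed on 20 Jan 2010 and shipped on 22 Jan 2010; no Delivered row yet
INSERT INTO dbo.FactOrderEvent (OrderKey, EventKey, DateKey, OrderCount)
VALUES (1001, 1, 20100120, 1),
       (1001, 2, 20100122, 1);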

Let’s see if this eliminates the problems above (hint: yes, it does):

Now users can simply drop our count of orders measure in the data area in a pivot table, then filter by a date and place the Event dimension on rows – the pivot table displays the relevant counts for the selected date. Thus, we eliminated the ambiguous bit.

There is also no need to use unusual dimension members for events which have not happened – we simply have no fact rows for them. Therefore, when we want to query the cube for orders which are ordered but not yet shipped, we can just ask for ones which are empty when sliced by the Shipped member of the Event dimension:

CREATE MEMBER CURRENTCUBE.[Measures].[Ordered but not Shipped]
AS
    FILTER([Order].[Order].[Order],
                     ([Event].[Event].[Event].[Ordered], [Measures].[Order Count]) > 0 AND
                     ([Event].[Event].[Event].[Shipped], [Measures].[Order Count]) = 0).COUNT;

We also completely avoid using LinkMember and the performance issue around the cube complexity, since we have only one Date dimension and a few Event dimension members. Instead of exponentially, the potential cube space grows linearly with each Event (instead of 3000^3, we get 3000*3 for three Events).

As for the last issue – when we want to add a new event, we can do this by simply adding one more Event dimension member and start inserting rows in the fact table with its key. As simple as that.

Therefore, yes, we have eliminated the problems I discussed.

Potential drawbacks (there are always some):

  1. The fact table will grow faster, as we will be storing multiple rows per order. This should not normally be a huge issue, but in some cases it could be. From a pure SSAS point of view this is a non-issue, but it should be considered in some scenarios. A mitigating factor is the decrease in row size resulting from the elimination of the multiple Date keys, but this is not likely to offset the extra rows in most cases.

  2. Storing amounts against orders becomes a bit more difficult in this case, because we may have to duplicate the amounts for each event and use something like LastNonEmpty to not sum the amounts over time, as well as make the Event dimension non-aggregatable, so amounts do not get aggregated for two different events. Also, we could possibly move the amounts to their own fact table.

  3. If we also query the fact table in T-SQL, this becomes a rather inconvenient model, as simple tasks such as finding the number of days between when an order is Ordered and when it is Shipped cannot be done with a simple DATEDIFF() call (see the sketch after this list). This is not really a problem in SSAS, and we are talking about SSAS in this post, but you should keep it in mind.
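
To illustrate the last point, here is a rough sketch – again with hypothetical names, reusing the event-grain fact table from the earlier sketch – of what a days-to-ship calculation would look like; the events have to be pivoted back onto one row per order first:

SELECT  f.OrderKey,
        DATEDIFF(DAY,
                 MIN(CASE WHEN e.EventName = 'Ordered' THEN d.[Date] END),
                 MIN(CASE WHEN e.EventName = 'Shipped' THEN d.[Date] END)) AS DaysToShip
FROM    dbo.FactOrderEvent AS f
JOIN    dbo.DimEvent       AS e ON e.EventKey = f.EventKey
JOIN    dbo.DimDate        AS d ON d.DateKey  = f.DateKey
GROUP BY f.OrderKey;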

Stefan Riedel, a colleague of mine, helped me identify some of these problems – thanks to him this post is not as incomplete as it could have been. There could be more, and I welcome you to comment on this post with possible problems and solutions, which could help other readers implement this sort of scenario better.