MDX Subselects – Some Insight

Recently I answered a question on StackOverflow which describes an interesting and fairly common case, and understanding it correctly helps with understanding subselects in general.

Let’s examine the following MDX query:
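
A sketch of such a query against the Adventure Works sample cube could look like this (the exact hierarchy and member names, and the descending sort direction, are assumptions on my part rather than the original statement):

SELECT
    { [Customer].[Education].[All Customers] } ON COLUMNS,
    ORDER(
        [Customer].[Customer Geography].[Country].MEMBERS,
        [Measures].[Internet Order Count],
        BDESC
    ) ON ROWS
FROM
(
    -- the subselect restricts the cube space to Partial High School customers
    SELECT { [Customer].[Education].&[Partial High School] } ON COLUMNS
    FROM [Adventure Works]
)
-- the slicer places the measure in context
WHERE ( [Measures].[Internet Order Count] )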

Here we effectively place the All member of the Customer.Education hierarchy on columns, and an ordered set of the countries in the Customer dimension on rows. We have a subselect and a slicer. The slicer contains the Internet Order Count measure (which places it in context), while the subselect restricts the query to look at customers with Partial High School education only.

When we execute the query we get exactly what we expect.

Now, let’s change this to:
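
With the same assumed names as in the sketch above, only the second argument of the Order call changes:

    ORDER(
        [Customer].[Customer Geography].[Country].MEMBERS,
        -- explicit tuple including the Education All member
        ( [Measures].[Internet Order Count],
          [Customer].[Education].[All Customers] ),
        BDESC
    ) ON ROWS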

The difference here is the All member in the tuple used as the second argument of the Order function. What we get is a different order (note that Germany and France are in different positions on rows). It seems like the set is ordered by the total amount for all customers without taking their education into consideration – the subselect does not make a difference. However, the measure value is the same as in the first query. So, what causes this?

To answer, we need first to understand what a subselect does. As Mosha posted a while ago, a subselect (which is really the same as a subcube) does implicit exists with the sets or set expressions on each axis and also applies visual totals “even within expressions if there are no coordinate overwrites”.

Instead of repeating what Mosha has already taken the time to explain, I will advise you to read his post thoroughly; it will make clear what the “implicit exists” does. In our case it is not as important as the visual totals part, so I will concentrate on the latter.

The reason why the set gets ordered by orders made by customers with Partial High School education is the visual totals logic, which takes place within the Order expression. The visual totals also restrict what we see in the query results, as they apply to the measure on the slicer axis as well. However, in this case the catch is in the “if there are no coordinate overwrites” part of the story. The visual totals within Order do not get applied because we explicitly overwrite the Customer.Education hierarchy member within the tuple in the Order function.
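
In the sketch above, that overwriting tuple is the second argument of Order:

-- the explicit All member overwrites the Education coordinate, so visual totals are not applied inside Order
( [Measures].[Internet Order Count], [Customer].[Education].[All Customers] )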

Therefore, the set of countries gets ordered by the Internet Order Count for All Customers without taking into account the subselect. However, the measure on the slicer axis still gets the visual total logic applied to it, and this causes the result set to be for those customers who have Partial High School education.
If we re-write the second statement to:
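
Sticking with the assumed names from the sketches above, the NON VISUAL keyword goes in front of the parenthesised subselect:

SELECT
    { [Customer].[Education].[All Customers] } ON COLUMNS,
    ORDER(
        [Customer].[Customer Geography].[Country].MEMBERS,
        ( [Measures].[Internet Order Count],
          [Customer].[Education].[All Customers] ),
        BDESC
    ) ON ROWS
-- NON VISUAL switches off the visual totals behaviour of the subselect
FROM NON VISUAL
(
    SELECT { [Customer].[Education].&[Partial High School] } ON COLUMNS
    FROM [Adventure Works]
)
WHERE ( [Measures].[Internet Order Count] )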

We can see that the NON VISUAL keyword (SSMS does not recognise this syntax and underlines it with a red squiggly line) changes the way the measure is displayed: we now see the total amount (for customers with any education) for the measure on the slicer axis as well.

Similarly, we can re-write the first query with NON VISUAL.
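
In the sketch this again only changes the FROM clause:

FROM NON VISUAL
(
    SELECT { [Customer].[Education].&[Partial High School] } ON COLUMNS
    FROM [Adventure Works]
)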

Here visual totals are excluded altogether, and both the ordering and the result set are calculated for customers with any education.

Hopefully this will clarify the logic applied when we use subselects to some extent.

Colour Codes in SSAS

When we apply colour formatting to measures in SSAS we get a “strange” colour code for our expressions from the BIDS colour-picker. For example, if we want to colour a certain measure in Red, we can write:

CREATE MEMBER CURRENTCUBE.[Measures].[Red Measure] AS
    1,
    FORE_COLOR = 255;

If we use the colour picker we can get more complex numbers for more complex colours. If we want to apply a colour with an RGB of (128,128,192), the colour picker generates the code 12615808 for use with the FORE_COLOR property.

Constructing our own colour code is not all that hard. In fact, all we have to do is the following:

1. Multiply the Blue decimal code by 65536 (which is 2^16)
2. Multiply the Green decimal code by 256 (yes, 2^8)
3. Add the numbers from step 1 and 2 and the decimal code for Red

In other words, we can use the following formula:

c = r + 256*g + 65536*b

Where c is the colour code we need and r,g,b are the decimal codes (between 0 and 255) for each colour channel.

If we apply this to our (128,128,192) colour, we get:

c = 128 + 256*128 + 65536*192 = 128 + 32768 + 12582912 = 12615808

And this is exactly what we get from the colour picker control in BIDS.
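
Putting it all together, a hypothetical measure using this code (the member name here is purely for illustration) would look along these lines:

CREATE MEMBER CURRENTCUBE.[Measures].[Blue-Grey Measure] AS
    1,
    FORE_COLOR = 12615808;   -- 128 + 256*128 + 65536*192, i.e. RGB (128,128,192)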

As Chris Webb points out in the comments section below, we can also write FORE_COLOR=RGB(128,128,192) instead of calculating the colour code manually, which is way more practical. 🙂

I wish such little bits and pieces were better documented on MSDN, as the current page regarding the BACK_COLOR and FORE_COLOR properties is fairly basic: http://msdn.microsoft.com/en-us/library/ms145991.aspx.

UPDATE (01/07/2011):

Considering that RGB() is a VBA function, I tested the performance we get from using it. I have a simple measure, which I am formatting with an IIF expression. I tested three scenarios – with no FORE_COLOR formatting, with FORE_COLOR and a numeric code, and with FORE_COLOR and the RGB function. Each query was run 3 times on a warm cache and returned a result set of 500,000+ cells. The results were as follows:

No FORE_COLOR: 2.3 seconds

FORE_COLOR=12615808: 5.3 seconds

FORE_COLOR=RGB(128,128,192): 17.7 seconds

In other words, applying a VBA function for colour formatting severely impacts the performance of our queries. Thus, if you really need to format the colour of your measures, you may be better off doing it the hard way by converting the RGB values to the colour codes the same way the colour-picker control does in BIDS.
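
For reference, the kind of test query I have in mind is sketched below; the measure, the set on rows and the threshold are illustrative assumptions rather than my actual test, and the RGB variant is obtained by simply swapping the numeric codes for RGB(255,0,0) and RGB(128,128,192):

WITH MEMBER [Measures].[Coloured Amount] AS
    [Measures].[Internet Sales Amount],
    -- numeric colour codes: red below the threshold, the (128,128,192) blue-grey otherwise
    FORE_COLOR = IIF( [Measures].[Internet Sales Amount] < 10000, 255, 12615808 )
SELECT
    { [Measures].[Coloured Amount] } ON COLUMNS,
    [Customer].[Customer].[Customer].MEMBERS ON ROWS
FROM [Adventure Works]
CELL PROPERTIES VALUE, FORE_COLOR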

UPDATE2 (04/07/2011):

As Valentino Vranken commented below, you can also use hex codes for the colours. However, there is a small twist: the codes need to be in BGR, not RGB format. So, Red becomes 0000FF. The actual format of the hex expression is FORE_COLOR="&Hxxxxxx". For Red we can use FORE_COLOR="&H0000FF". There is no performance difference between hex and decimal codes, so whether you go with one or the other is up to your preference. Converting 0000FF to decimal gives the same number (255) as performing the math above, so if you already know the BGR hex code you can simply convert it to decimal.

Database Compression Effects on SSAS Processing

Just recently I was optimising the processing time for a rather large cube on my laptop. The hardware I used was not tremendously powerful (a mobile i5 CPU, 8GB RAM and one 7200 RPM hard disk) and a 500M-row fact table proved to be somewhat of a challenge for it.

The set up was fairly typical – a large fact table, with a few dimensions in a typical star schema. The goal was to get it to process as fast as possible.

After ensuring all best-practice advice from Microsoft (and in particular SQL CAT) was followed, I managed to get the processing time down from 7381.075 seconds to 5801.536 seconds (approximately a 21% improvement). However, one thing still did not look quite right – the CPU utilisation during the partition processing phase.

Considering my hardware configuration it was easy to deduce that the HDD would be more of a bottleneck than the CPU. There was only one HDD and everything, including the source database and the SSAS files, was on it. During partition processing, reading and writing happen simultaneously, which is not good for performance. At the same time the CPU utilisation in my case sat at around 75%, with 25% to spare, which was not great. Ideally, I would like to see my CPU at 100%, or very close to that, to ensure that I am utilising all of its power. However, I had no way to add more disks.

Luckily, in SQL Server 2008 (Enterprise Edition) we have the option to compress database tables or partitions. We do get a performance hit on the CPU when we (de)compress the data, but the data takes less space on disk and thus requires fewer IO operations to read. It sounded like a perfect fit for my problem. After estimating the size reduction that row and page compression would each give me, I decided to go with the heavier option (page compression), since even its marginally better (5%) reduction in size over row-level compression was desirable. After waiting patiently for the compression operation to finish, the overall database size did drop to around 30% of its original size, as predicted.

I ran my SSAS processing script again and it completed in 4805.408 seconds – approximately 17% better than the previous run, which is a very significant improvement. The CPU utilisation went up to around 97%, which also meant that my gamble with page vs. row compression produced, generally, good results.

One thing we must always consider is whether the bottleneck is in the IO. A few months later I replaced my hard disk with a new SSD (OCZ Vertex 2), which increased the IO performance of my laptop dramatically. When I tested SSAS processing tasks afterwards I could almost always see my CPU steady at 100% without any compression. If I were to compress my fact tables now I would expect to get worse performance, because my CPU would struggle to both decompress and process the data at the same time.

In conclusion, you must know what your bottleneck is before you opt for a solution like compression. It comes at a price, but the price may well be worth paying if your system is not well balanced.

Debugging SSAS Cubes – a Few Tips

There are a few techniques which can help here, and I will list some tips which seem to elude many people experiencing problems with their implementations.

1. Do not use the Cube Browser and Excel for troubleshooting

Yes, it is convenient to just drag and drop some hierarchies and measures in the Cube Browser, or in Excel, but when you have trouble with your results these tools will obscure what is actually going on. This happens because every client tool has to generate MDX, and in doing so they often generalise the approach and add “nice” language constructs like VisualTotals, or always use a subselect clause. There is nothing wrong with checking the end result of your work with tools showing what your users see, but when your calcs don’t behave the way you expect (especially when filtering things), I would definitely stay away from GUI clients. Instead, you can write some MDX in SSMS and pull exactly what you need in a clean and straightforward way. We can play with swapping WHERE clauses for subselects, add calculations directly in the queries, etc. – in brief, we get the full calculation power of MDX in our hands for troubleshooting. Additionally, next time you ask someone else a question you can start with “I have this MDX query…” instead of “I have this pivot grid in Product X…” – it is much more likely that you will get a meaningful response.
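
As an example of the kind of experimenting I mean, here is the same restriction expressed first as a slicer and then as a subselect (a sketch against Adventure Works; run the two statements one at a time in SSMS):

-- restriction in the slicer (WHERE clause)
SELECT { [Measures].[Internet Sales Amount] } ON COLUMNS
FROM [Adventure Works]
WHERE ( [Customer].[Education].&[Partial High School] )

-- the same restriction expressed as a subselect
SELECT { [Measures].[Internet Sales Amount] } ON COLUMNS
FROM
(
    SELECT { [Customer].[Education].&[Partial High School] } ON COLUMNS
    FROM [Adventure Works]
)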

2. Use SQL Server Profiler

If your MDX works perfectly fine against your cube, but when you use Excel 20xx you get funny results, there is no better way to find out what the problem is than to run a trace with SQL Server Profiler, capture the MDX the tool has generated and then work with that to debug – again in SSMS. Different versions of Excel generate different MDX and the differences can be very important. When using other tools the differences become even larger. The only way to see what your tool does (other than talking to the people who built it) is to capture its SSAS queries.

3. Ask for help

Once you have arrived at the conclusion that you need help (e.g. after exhaustively searching for the problem online), you can quickly get an answer at a few forums. My preferred one is MSDN, which has a dedicated SSAS section. You can also ask at the www.ssas-info.com forum section, or even at StackOverflow. Either way, you will need to know the versions of each product you are using (especially the SQL Server version, service packs and updates applied) and, ideally, post your MDX (if you have 10 pages of queries it would help to somehow abbreviate the repeating sections instead of copy-pasting the lot). Also, do not forget that the people reading your questions do not know your dimensional model or which features of SSAS you are using (e.g. LastNonEmpty, unary operators, etc.) – make sure that you describe any specifics, otherwise everyone will assume a typical star schema with a straightforward implementation in SSAS. These approaches, together with being courteous and good with spelling and grammar, will help attract fellow developers who may know the little SSAS quirk or MDX feature which resolves your issue.

4. Use Adventure Works

Replicating your problem in Adventure Works is another great way to ensure you get an accurate answer. Because practically everyone on the forums has access to Adventure Works, showing some MDX which produces results you do not expect is the best way to pinpoint the problem, eliminating all the question marks which your specific model and data could be introducing. Additionally, you will probably get an answer which includes some sample code, and you will be able to run it immediately and see the results.

Tip: Deployment Options in BIDS

The default options in BIDS do not always make perfect sense, and one of the annoying defaults for me is the configuration of deployment options for an SSAS project. If we create a new project and then check its properties we see the following under the Deployment tab:

By default the Processing Option is Default (surprise!). We also have two other choices: Do Not Process and Full.

In most cases we do not want to process the deployed database after a change has been made, but by default BIDS will send a process command to SSAS whenever processing is required to bring the modified objects up to date. If we change Default to Do Not Process we avoid this and the project only gets deployed.

The relevant MSDN page states:

“By default, Analysis Services will determine the type of processing required when changes to objects are deployed. This generally results in the fastest deployment time. However, you can also choose to have either full processing or no processing performed with each deployment.”

I beg to disagree – I do not believe that it results in the fastest deployment time. In fact, in terms of deployment time alone, it sits between Do Not Process and Full.

The default can be particularly annoying when we really want to only deploy: we right-click on the project and click Deploy while developing, but we almost always get a partial process after a structural change. I have seen developers clicking Process instead, waiting for the deployment to finish (no automatic processing happens if we do this) and then quickly pressing <Esc> to cancel the processing afterwards. Not a major breakthrough, but it may optimise the way we work a bit, so I thought it would be a good idea to put this little issue in the spotlight for a moment.

SSAS to BISM – Recent Developments

There was a fair bit of FUD around the future of SSAS, and it has just been officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead for SSAS. If you haven’t read them I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

Now that the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS”. I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we are not getting a tremendously successful product replaced; instead we are getting a new and exciting addition to it. Rumour has it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? A unique mix of on-disk and in-memory analytics. Edges may have to be smoothed, but at the end of the day we get more, not less.

What caused the confusion? Well, in my opinion the SSAS team may be the greatest in analytical tools, but in this particular case I think that the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us they did not accurately draw a roadmap, which led to the rumours. I hope that this won’t happen again, and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.

Load Test Your SSAS Instance

If you are wondering how performant your SSAS box will be with multiple users executing various queries, you now have the option to go to CodePlex and download the new AS Performance Workbench by Rob Kerr from BlueGranite: an extremely cool, visual way to load test your instance. It is awesome! And to prove it, have a look at the following YouTube demo (6m 24s): http://www.youtube.com/watch?v=Kn4-3JTyDXM.

For any comments, requests and loud cheering you can also visit the project discussion page: http://asperfwb.codeplex.com/discussions.

Enjoy!

MDX Gotchas Presentation Materials

I had the pleasure to present in front of one of the biggest and best SQL Server User Groups in Australia and, as Greg Linwood proudly (and justifiably) said, in the world. As I promised, I have uploaded the presentation materials to this blog and you can find links to the slides, the MDX scripts I ran and some useful links below. As a side announcement, Darren Gosbell is now officially in charge of the Melbourne SQL Server User Group. I am sure that his passion for SQL Server will lead to great results and we will be able to enjoy a strong stream of speakers in the coming years.

To everyone who attended tonight – Thank You for coming and for giving me great reviews!

Useful Resources

The Many-To-Many Revolution by Marco Russo

SQL Server 2008 Performance Guide by SQL CAT and the SSAS Product Team

VBA Functions in SSAS by Irina Gorbach

Performance Improvements in Analysis Services on MSDN

Measure expressions: how performant are they? By Christian Wade

MSDN SQL Server Analysis Services Forum

TechNet Tech Center: Analysis Services – Multidimensional Data

SSAS-Info

Books

Microsoft SQL Server 2008 MDX Step by Step – Beginner/Intermediate

MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase – Beginner/Intermediate

The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2 and the Microsoft Business Intelligence Toolset – Beginner/Intermediate

Expert Cube Development with Microsoft SQL Server 2008 Analysis Services – Intermediate/Advanced

Microsoft SQL Server 2008 Analysis Services Unleashed – Expert

Slides and Scripts

[listyofiles folder=”wp-content/MDX Gotchas”]

Please let me know if you need more reference materials or if you think that some that I have missed should go in this post.

Microsoft BI TechCenters: Learning and Resources

Apart from subscribing to the blogs and reading the books in the Sites and Blogs and the Books sections of this blog, you can also visit the following TechCenters on TechNet and MSDN:

SSAS – Multidimensional Data

http://msdn.microsoft.com/en-us/sqlserver/cc510300.aspx

SSAS – Data Mining

http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx

PowerPivot for Excel

http://technet.microsoft.com/en-us/bi/ff604673.aspx

There you can find a good collection of whitepapers, articles and links to various resources.

Avoiding NULLs in SSAS Measures

In a recent discussion I made a statement that many data marts allow and contain NULLs in their fact table measure columns. From the poll responses I am getting here, I can see that more than half of everyone voting does have NULLs in their measures. I thought the ratio would be smaller, because in many models we can avoid NULLs and it could be considered a good modelling practice to attempt to do so. There are a few techniques which reduce the need for NULL-able measures and can possibly even eliminate them completely. While converting NULLs to zeros can have a performance benefit because of certain optimisations, it also loses out in many cases because calculations have to operate over a larger set of values; in addition, the performance hit of retrieving and rendering zeros in reports and PivotTables may be unnecessary. Therefore, it is advisable to test the performance and usability of both preserving NULLs (Preserve) and converting them to zeros (BlankOrZero). Thomas Ivarsson has written an interesting article about the NullProcessing measure property on his blog.

I am offering two approaches which should always be considered when dealing with measure columns containing NULLs.

Adding Extra Dimensions

Let’s assume that we have a fact table with a structure similar to this:

Date        Product    Actual Amount    Forecast Amount
201101      Bikes      100              NULL
201101      Bikes      NULL             105

By adding a Version dimension we can achieve a structure like this:

Date        Product    Version        Amount
201101      Bikes      Actual         100
201101      Bikes      Forecast       105

This way we eliminate the NULL measure values and we get a maintainability and arguably a usability boost as well.

Splitting Fact Tables

Another way to improve the data mart model is to ensure that we have more fact tables containing a smaller number of measures. Let’s illustrate this concept with a simple example:

Date        Product    Amount    Exception Amount
201101      Bikes      100       NULL
201102      Bikes      120       10
201103      Bikes      110       NULL
201104      Bikes      130       NULL

In this case, Exception Amount could be an extra cost which is relevant only in rare cases. Thus, it may be better to move it to a separate table, which would contain less data and may lead to some space savings on large data volumes. Additionally, it would still allow for the same analytics in SSAS if we implement it as a separate measure group. For example, we could rebuild our model like this:

Date        Product    Amount
201101      Bikes      100
201102      Bikes      120
201103      Bikes      110
201104      Bikes      130

Date        Product    Exception Amount
201102      Bikes      10

Even though we have two tables and one extra row in this case, depending on how rare the Exception Amount is, the savings from eliminating the extra column may be worth it.

Sometimes it is not possible, or correct, to apply these modelling transformations. For example, we could have a perfectly valid design like this:

Date        Product    Sales Amount    Returns Count
201101      Bikes      100             10
201102      Bikes      120             30
201103      Bikes      110             NULL
201104      Bikes      130             20

Here, adding a Measure Type dimension would help us eliminate the NULL, but it would also mean that we would have to store two different data types (dollar amounts and counts) in the same measure, which we should definitely avoid. Also, since we may have very few months with NULLs in the Returns Count column, and the ratio of NULLs to non-NULLs may therefore be low, we would actually lose the benefits of the second approach.

From an Analysis Services point of view it is better to use NULLs, again depending on the ratio of NULLs to non-NULLs we have. The more NULLs we have, the better it is not to convert them to zeros, as otherwise calculations which utilise NONEMPTY will have to work over a larger set of tuples. Also, here we may notice the difference between using NONEMPTY and EXISTS – the second one will not filter out tuples which have associated rows in the fact table, even when the measure value is NULL. It also depends on the usage scenarios – if a user filters out empty cells in Excel, the Returns Count for 201103 will still appear as 0 if we store zeros (because of the way NON EMPTY, which Excel uses, works), or if we have set the measure to convert NULLs to zeros. In the opposite case (when we store NULLs and the NullProcessing property is set to Preserve) Excel will filter out the empty tuples.
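
To make the NONEMPTY vs. EXISTS distinction concrete, here is a sketch against a hypothetical cube built from the table above; the [Sales] cube, the [Date].[Month] hierarchy and the "Sales" measure group name are made up for illustration:

-- NONEMPTY removes the months where Returns Count is NULL (e.g. 201103)
SELECT { [Measures].[Returns Count] } ON COLUMNS,
    NONEMPTY( [Date].[Month].[Month].MEMBERS,
              [Measures].[Returns Count] ) ON ROWS
FROM [Sales]

-- EXISTS keeps every month that has a row in the measure group,
-- even when the Returns Count value itself is NULL
SELECT { [Measures].[Returns Count] } ON COLUMNS,
    EXISTS( [Date].[Month].[Month].MEMBERS, , "Sales" ) ON ROWS
FROM [Sales]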

Regardless of the scenario, it is always a good idea to try to remove NULLs by creating a good dimensional model. If we are in a situation where NULLs are better left in the measure column of the relational fact table, we should consider whether we need the default zeros in our measures, or if we would be better off avoiding them altogether.

I would be interested to hear whether, and why, you allow NULLs in your measure groups if the reason is not covered in this article. Also, it would be interesting to see if there are other good ways to redesign a model in order to avoid storing NULL measure values.