Database Compression Effects on SSAS Processing

Just recently I was optimising the processing time for a rather large cube on my laptop. The hardware I used was not tremendously powerful (a mobile i5 CPU, 8Gb RAM and one 7200 RPM hard disk) and having a 500M rows fact table proved to be somewhat of a challenge for it.

The set up was fairly typical – a large fact table, with a few dimensions in a typical star schema. The goal was to get it to process as fast as possible.

After ensuring all best-practice advice from Microsoft (and in particular SQL CAT) was followed, I managed to get the processing down from 7381.075 seconds to 5801.536 s (approximately 21% improvement). However, one thing still did not look quite right – the CPU utilisation during the partition processing phase.

Considering my hardware configuration it was easy to deduce that the HDD would be more of a bottleneck than the CPU. There was only one HDD and everything, including the source database and the SSAS files, was on it. During the partition processing reading and writing is happening simultaneously, which is not good for performance. At the same time in my case the CPU utilisation was on around 75% with 25% to spare, which was not great. Ideally, I would like to see my CPU on 100%, or very close to that to ensure that I am utilising all of its power. However, I had no way to add more disks.

Luckily, in SQL Server 2008 (Enterprise Edition) we have the option to compress database tables, or partitions. We do get a performance hit on the CPU when we (de)compress the data, but it takes less space on the disk and thus requires less IOPS to read it. It sounded like a perfect fit for my problem. After estimating the benefits in terms of size row and page compression gave me, I decided to go with the heavier option since even its marginally better (5%) reduction of size over row-level compression was desirable. After waiting patiently for the compression operation to finish, the overall database size did drop to around 30% as predicted.

I run my SSAS processing script again and it completed in 4805.408 seconds – approximately 17% better than the previous run, which is a very significant improvement. The CPU utilisation went up to around 97%, which also meant that my blunder with the page vs. row compression produced, generally, good results.

One thing we must always consider is whether the bottleneck is in the IO. A few months later I replaced my hard disk with a new SSD (OSZ Vertex 2), which increased the IO performance of my laptop dramatically. When I tested SSAS processing tasks I could almost always see my CPU steady at 100% without any compression. If I were to compress my fact tables I would expect to get worse performance because my CPU would struggle to both decompress and process the data at the same time.

In conclusion, you must know what your bottleneck is before you opt for a solution like compression. It comes at a price, but the price may well be worth paying if your system is not well balanced.

Debugging SSAS Cubes – a Few Tips

There are a few techniques which can contribute to this topic and I will list some tips which seem to elude many people experiencing problems with their implementations.

1. Do not use the Cube Browser and Excel for troubleshooting

Yes, it is convenient to just drag-drop some hierarchies and measures in the Cube Browser, or in Excel, but when you have troubles with your results these tools will obscure the results. This is happening because every client tool has to generate MDX and by doing so they often generalise the approach and add “nice” language constructs like VisualTotals, or always use a subselect clause. There is nothing wrong with checking the end-result of your work with tools showing what your users see, but when your calcs don’t behave the way you expect (especially when filtering things), I would definitely stay away from GUI clients. Instead you can write some MDX in SSMS and pull exactly what you need in a clean and straight-forward way. We can play with swapping WHERE clauses for subselects, add calculations directly in the queries, etc – in brief, we get the full calculation power of MDX in our hands for troubleshooting. Additionally, next time you ask a question someone else you can start with “I have this MDX query…” instead of: “I have this pivot grid in Product X…” – it is much more likely that you will get a meaningful response.

2. Use SQL Server Profiler

If your MDX works perfectly fine against your cube, but when you use Excel 20xx you get funny results, there is no better way to find what the problem is than to run a trace with SQL Server Profiler, capture the MDX the tool has generated and then work with that to debug – again in SSMS. Different versions of Excel generate different MDX and the difference can be very important. When using other tools the differences become even larger. The only way to see what your tools does (other than talk to the people who have built it) is to capture its SSAS queries.

3. Ask for help

Once you have arrived at the conclusion that you need help (e.g. after exhaustively searching for the problem online), you can quickly get an answer at a few forums. My preferred one is MDSN, which has a dedicated SSAS section. Also, you can ask at the www.ssas-info.com forum section, or even at StackOverflow. Either way, you will need to know what the versions of each product you are using are (especially SQL Server version, service packs and updates applied) and ideally – post your MDX (if you have 10 pages of queries it would help to somehow abbreviate the repeating sections instead of copy-pasting the lot). Do not also forget that the people who are reading your questions do not know your dimensional model and what features of SSAS you are using (e.g. LastNonEmpty, unary operators, etc.) – make sure that you describe any specifics – otherwise everyone would assume a typical star with a straight-forward implementation in SSAS. These approaches, together with being courteous and good with spelling and grammar, will help attract fellow developers, who may know a little SSAS quirk, or MDX feature which resolves your issue.

4. Use Adventure Works

Replicating you problems in Adventure Works is another great way to ensure you get an accurate answer. Because typically everyone on the forums has access to Adventure Works, showing some MDX which results in an unexpected for you outcomes is the best way to pinpoint the problem eliminating all question marks which your specific model and data could be introducing. Additionally, you will probably get an answer which includes some sample code and you will be able to run it immediately and see the results.

The Case for an Azure DataMarket Date Table

Since the release of PowerPivot Excel pros and power users have been encouraged to learn and “play” with the add-in. There is one little thing from the world of BI which we (as long-standing BI professionals) are used to but apparently troubles our new friends from the Excel world – the Date table. In the SSAS Multidimensional world we have BIDS which can generate a date table in a variety of formats. Still, much more common is the custom Date table, which we build through a SQL script as it enables us to dynamically generate it for a range of dates. I have also used Excel in the past for quick and dirty solutions. Although all these scenarios are very “workable” for database professionals, when it comes to Excel power-users fiddling with databases is far from ideal. Luckily, there is a better way.

Azure DataMarket

The answer I am proposing is the new DataMarket, which as an added bonus (in cases when we cannot use third-party products) is Microsoft-owned. It is integrated very well within PowerPivot since the last update and allows selecting subsets of the data (e.g. we do not need to import everything available in the data set). Because the data is in a feed format we can connect and pull data we want anytime.

The DataMarket is the vehicle but it needs a good data set to transport. With the Microsoft SSAS team being busy with new releases and unable to chase this up with the Azure DataMarket team, I tried contacting the latter directly to no avail. It would be very simple to create a sample, test and if all goes well – we could easily expand the feed to include lots of necessary columns which could simplify any PowerPivot implementation. As a start, a minimum of a calendar hierarchy with 4-5 levels should suffice, but the possibilities are very exciting. We could have multiple financial/fiscal calendars, public holidays, weekends, leap years taken care of and many other Date properties built right into the feed. If customisation is required (as it probably will be in many cases), PowerPivot developers have the Excel and DAX to play with the data through formulas and change various properties like member names and formats.

In my opinion with a miniscule development effort Microsoft can win on both new fronts – PowerPivot and the Azure DataMarket. After all, both products need more exposure and a popular Date feed will definitely help in this direction (not to mention how much easier it would be for developers to “get into” PowerPivot-based BI implementations).

PS: I have emailed the DataMarket team a few months ago and I received no reply. This remains so even after Julie Strauss from the SSAS team followed up on this – it seems like someone is sleeping on the job…

PPS: I just created a Microsoft Connect suggestion – please vote if you feel like this is a good idea. Also, if you feel even more inclined to act you can email the DataMarket guys directly asking them to pay attention through the links provided on their Contact Us page.

7 Tools You Want on Your BI Dev Workstation

Without promising gimmicks like 80% reduction in development effort and time to deliver (as if…), the following tools are either free or relatively cheap and offer a productivity boost while allowing (and actually in some cases promoting) good development practises. Of course there are many others available but the following are my personal preferences, which I have found to be particularly valuable while developing (in no particular order).


BIDS Helper

When it comes to free tools for Microsoft BI development BIDS Helper usually tops the list. It is a very powerful tool making many advanced development tasks easy.

Who makes it: Darren Gosbell, Greg Galloway, John Welch, Darren Green, Scott Currie
License: Free
Link:
http://bidshelper.codeplex.com/


SQL Compare and SQL Data Compare

Red Gate is renowned for their great products and the top two in my opinion are SQL Compare and SQL Data Compare. These two allow robust and simplified migration between environments and help with creating transactional scripts for deploying to servers guarded by zealous DBAs.

Who makes it: Red Gate
License (1 Developer): SQL Compare (Standard) – $395; SQL Data Compare (Standard) – $395
Trial: Yes
Links:
http://www.red-gate.com/products/sql-development/sql-compare/
http://www.red-gate.com/products/sql-development/sql-data-compare/


BI Documenter

For those who do not like writing lengthy documents describing their work Pragmatic Works offers a great tool which extracts metadata from the various BI components in a solution and organises it very neatly in a cross-referenced (think HTML links between pages) fashion. It does it so well that I have had clients completely satisfied in terms of documentation by its output.

Who makes it: Pragmatic Works
License (1 Developer): $395
Trial: Yes
Link:
http://pragmaticworks.com/Products/Business-Intelligence/BIDocumenter/Default.aspx


AS Performance Workbench

For load testing of SSAS instances and cubes the free AS Performance Workbench provides a simple, easy and visual interface and eliminates the need to create a complex testing framework.

Who makes it: Rob Kerr
License: Free
Link:
http://asperfwb.codeplex.com/


ASCMD

ASCMD is for SSAS what SQLCMD is for SQL Server – a command-line utility allowing execution of XMLA and MDX scripts. It provides a rich set of functionality for performing maintenance and testing tasks.

Who makes it: Microsoft
License: Free
Link:
http://msftasprodsamples.codeplex.com/


MDX Studio

Started by Mosha Pasumansky, MDX Studio is the most powerful MDX analysis tool available. It provides best practise advice and includes many MDX tuning features not available in the toolset included in SQL Server and Windows.

Who makes it: Microsoft
License: Free
Link:
http://cid-74f04d1ea28ece4e.skydrive.live.com/browse.aspx/MDXStudio/v0.4.14


Team Foundation Server

TFS is a must if development is done by more than one person. While merging changes in BI projects is not as useful (or even possible) as in .NET coding projects, maintaining versions, branching, etc is still a must. TFS also includes bug tracking, requirements management, and a bunch of other features well worth exploring.

Who makes it: Microsoft
License: 5 Developers – $499
Trial: Yes
Link:
http://www.microsoft.com/visualstudio/en-us/products/2010-editions/team-foundation-server/overview

SSIS Balanced Data Distributor – Comparison

Microsoft just released a new Data Flow transformation for SSIS – Balanced Data Distributor. Reading the installation pages, we can see that there are a few interesting things about it, among which – it is a multithreaded transform, which uniformly splits a data stream in multiple outputs. I decided to give it a quick test and see how it performs in comparison to a straight insert and a Script Component which splits the input stream in two, as well.

First the tests show performance with input of a SQL Server table (OLE DB Input) and output to a raw file. The input is a TOP 20000000 * select from a large table. The results are as follow:

1. Straight insert: 32 seconds

2. Balanced Data Distributor: 36 seconds

3. Script Component: 57 seconds

4. Conditional Split: 42 seconds



Note that the Conditional Split divides the stream based on the remainder of a division of an integer field, while the Script Component does it based on the row number. The Conditional Split may or may not be useful in a number of cases – when we have non-numeric data only, or when the range of the numeric data is not wide enough to split in the number of streams we would like to (e.g. Gender Key can be 1 or 2, while we may want to split in 10 parallel data streams). Therefore, its functionality is not equivalent to the BDD transform (thanks to Vidas Matelis for asking me to include it in the case study).

The second tests show how fast it is with reversed input and output (raw file to SQL Server table) and everything else identical.

  1. Straight insert: 56 seconds
  2. Balanced Data Distributor: 1 minute and 47 seconds
  3. Script Component: 1 minute and 57 seconds
  4. Conditional Split: 1 minute and 54 seconds

Over 40M rows the difference between the BDD transform and the Script Component – 1:16 vs 2:13 (vs 1:24 for the equivalent Conditional Split), or 57 seconds difference when inserting in a raw file. In general, the overall performance improvement seems to be around 35-45% in that direction. Since I am inserting in my local SQL Server in the same table the parallel split does not seem to be beneficial even though the destination is slower than the source. In fact the straight insert outperforms the parallel data flows in both cases. If we were to insert into a partitioned table over different filegroups hosted on separate drives the results could be quite different. Either way, in my opinion it is a nice and welcome addition to our arsenal of SSIS data flow components.

Edit: Len Wyatt from the SQL Performance Team at Microsoft has provided a link to his post with great bit of detail about the BDD transform in the comments below. Please take a minute and have a look if interested.

Tip: Deployment Options in BIDS

The default options in BIDS do not always make perfect sense and one of the annoying defaults for me is the configuration of deployment options for a SSAS project. If we create a new project and then check its properties we see the following under the deployment tab:

By default the Processing Option is Default (surprise!). We also have two other choices:

In most cases we do not want to process the deployed database after a change has been made but by default BIDS will send a process command to SSAS whenever a process is required to bring the objects which are modified up. If we change the Default to Do Not Process we avoid this and the project gets deployed only.

The relevant MSDN page states:

By default, Analysis Services will determine the type of processing required when changes to objects are deployed. This generally results in the fastest deployment time. However, you can also choose to have either full processing or no processing performed with each deployment.”

I beg to disagree – I do not believe that it results in fastest deployment time. In fact it is just between Do Not Process and Full Process when it comes to deployment time only.

The default can be particularly annoying when we really want to only deploy and we right-click on the project and click Deploy while developing but we almost always get a partial process after a structural change. I have seen developers clicking Process instead, waiting for the deployment to finish (no automatic process if we do this) and then quickly pressing <Esc> to cancel the processing afterwards. Not a major breakthrough, but may optimise a bit the way we work, so I thought it may be a good idea to put this little issue in the spotlight for a moment.

SSAS to BISM – Recent Developments

There was a fair bit of FUD around the future of SSAS and just now it got officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead of SSAS. If you haven’t read these I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

After the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS“. I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we don’t get a tremendously successful product replaced – instead we get a new and exciting addition to it. Rumours have it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? – a unique mix of on-disk and in-memory analytics. Edges may have to be smoothened, and in the end of the day we get more, not less.

What caused the confusion – well, in my opinion the SSAS team may be the greatest in analytical tools but in this particular case I think that the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us they did not accurately draw a roadmap, which lead to the rumours. I hope that in the future this won’t happen and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.