Introducing Project DateStream (CodePlex)

I recently blogged about The Case for an Azure DataMarket Date Table. I finished that post with a bit of a critique of the DataMarket team at Microsoft, for which I can now wholeheartedly apologise. Since my last post I was contacted by Max Uritsky, a Group Program Manager on the Windows Azure Marketplace DataMarket team (a long name for a team, yes). He and Belinda Tiberio helped me create and host a new Date feed. Not only did they help make it available for free on the DataMarket website, but they also gave me a free 1Gb SQL Azure database for the project. A big “thank you” also goes to Julie Strauss from the SSAS team for making the contact.

To summarise, the DateStream project is a free date table available as a feed and intended to be used by PowerPivot BI users. As most, if not all, BI projects include a date table, the goal is to provide an easy-to-use, correct and simple mechanism for creating such tables.

After some deliberation on the format of the feed we decided that it would be best to split it into a number of localised Basic tables and one Extended table. Currently we have only two Basic versions (US and EN) and a fairly straightforward Extended one. However, the plan is to include more and more columns in the Extended table and to provide extra localised (Italian, French, Spanish, German, Russian, etc.) tables. When I say “we” I mean the fellow SSAS professionals with whom I had the pleasure of discussing this idea – among them Marco Russo, Thomas Kejser, Willfried Faerber and Grant Paisley.

The CodePlex page of the project contains more details about the feed and the tables and also allows for commenting on existing features, as well as requesting new ones. It can be found at http://datestream.codeplex.com/.

The actual feed can be used directly from https://datamarket.azure.com/dataset/1542c52d-d466-4094-a801-9ef8fd7c6485.

Note the logo – it was created by Daniele Perilli from SQL BI with assistance from Marco Russo. Thanks to them we did not have to resort to my graphic design skills, which is definitely a win for the DataMarket website.

One note – please let me know if the performance you get from the feed is not satisfactory (remember that once pulled, the feed does not need to be refreshed, as the data will remain valid forever). If many people agree that it is too slow, I could potentially host it in an alternative location as well. As a workaround, it is also possible to download a CSV version from the DataMarket, which additionally allows removing unnecessary date ranges.

The Case for an Azure DataMarket Date Table

Since the release of PowerPivot, Excel pros and power users have been encouraged to learn and “play” with the add-in. There is one little thing from the world of BI which we (as long-standing BI professionals) are used to but which apparently troubles our new friends from the Excel world – the Date table. In the SSAS Multidimensional world we have BIDS, which can generate a date table in a variety of formats. Still, much more common is the custom Date table, which we build with a SQL script, as that lets us generate it dynamically for a range of dates. I have also used Excel in the past for quick and dirty solutions. Although all these scenarios are very “workable” for database professionals, when it comes to Excel power users, fiddling with databases is far from ideal. Luckily, there is a better way.
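For context, the “SQL script” approach I mentioned usually boils down to something like the following T-SQL sketch (the table and column names here are purely illustrative, not a prescription) – exactly the kind of thing we do not want to ask Excel users to maintain:

```sql
-- A minimal, illustrative date table generator; real scripts usually add
-- many more attributes (fiscal periods, holidays, week numbers, etc.).
DECLARE @StartDate date = '20000101';
DECLARE @EndDate   date = '20201231';

WITH Dates AS
(
    SELECT @StartDate AS DateKey
    UNION ALL
    SELECT DATEADD(DAY, 1, DateKey)
    FROM Dates
    WHERE DateKey < @EndDate
)
SELECT
    DateKey,
    YEAR(DateKey)              AS CalendarYear,
    DATEPART(QUARTER, DateKey) AS CalendarQuarter,
    MONTH(DateKey)             AS MonthNumber,
    DATENAME(MONTH, DateKey)   AS MonthName,
    DAY(DateKey)               AS DayOfMonth
INTO dbo.DimDate
FROM Dates
OPTION (MAXRECURSION 0);
```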

Azure DataMarket

The answer I am proposing is the new DataMarket, which as an added bonus (in cases when we cannot use third-party products) is Microsoft-owned. Since the last update it is integrated very well with PowerPivot and allows selecting subsets of the data (i.e. we do not need to import everything available in the data set). Because the data comes as a feed, we can connect and pull the data we want at any time.

The DataMarket is the vehicle, but it needs a good data set to transport. With the Microsoft SSAS team busy with new releases and unable to chase this up with the Azure DataMarket team, I tried contacting the latter directly, to no avail. It would be very simple to create a sample, test it and, if all goes well, expand the feed to include the many columns that could simplify any PowerPivot implementation. As a start, a calendar hierarchy with 4-5 levels should suffice, but the possibilities are very exciting. We could have multiple financial/fiscal calendars, public holidays, weekends and leap years taken care of, and many other Date properties built right into the feed. If customisation is required (as it probably will be in many cases), PowerPivot developers have Excel and DAX to play with the data through formulas and change various properties like member names and formats.

In my opinion, with a minuscule development effort Microsoft can win on both new fronts – PowerPivot and the Azure DataMarket. After all, both products need more exposure, and a popular Date feed would definitely help in this direction (not to mention how much easier it would be for developers to “get into” PowerPivot-based BI implementations).

PS: I emailed the DataMarket team a few months ago and received no reply. This remains so even after Julie Strauss from the SSAS team followed up on it – it seems like someone is sleeping on the job…

PPS: I just created a Microsoft Connect suggestion – please vote if you feel this is a good idea. Also, if you feel even more inclined to act, you can email the DataMarket team directly through the links provided on their Contact Us page and ask them to pay attention.

SSAS to BISM – Recent Developments

There was a fair bit of FUD around the future of SSAS, and it has just been officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead for SSAS. If you haven’t read these I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

Now that the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS.” I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we are not getting a tremendously successful product replaced; instead we get a new and exciting addition to it. Rumour has it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? A unique mix of on-disk and in-memory analytics. Edges may have to be smoothed, and at the end of the day we get more, not less.

What caused the confusion? Well, in my opinion the SSAS team may be the greatest in analytical tools, but in this particular case I think the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us, they did not accurately draw a roadmap, which led to the rumours. I hope that this won’t happen in the future, and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.

Many-To-Many Relationship Modelling in PowerPivot: Workaround

So far PowerPivot does not support many-to-many relationships. It is still early days and I am sure that things will change soon. In the meantime, there are a few ways to implement many-to-many relationships for measures in DAX. Two notable posts on the subject can be found on Marco Russo’s and Alberto Ferrari’s blogs:

http://sqlblog.com/blogs/marco_russo/archive/2009/12/07/many-to-many-relationships-in-powerpivot.aspx

http://sqlblog.com/blogs/alberto_ferrari/archive/2010/10/19/powerpivot-and-many-to-many-relationships.aspx

Unfortunately, if we have multiple measures the approach is somewhat cumbersome, as we need to build numerous DAX measures. I have been playing around with modelling options and I just came across an idea which, while not perfect, can help mitigate to some extent the lack of this crucial bit of functionality. In fact, “not perfect” is quite an understatement – the more appropriate expression would be “a hack” – but it could still be useful in some scenarios, hence this post!

To give some background, let’s assume we have this simplistic model:

An account table

A customer table

And a fact table

Here we have two accounts, A with an Amount of 100 and B with 50.

Now, let’s assume we have the following mapping table between Accounts and Customers:

Account A maps to customers X and Y; Account B to X only.

Now, instead of trying to directly map the relationships here, we add the Amount to the mapping (bridge) table:

And we then create our relationships in PowerPivot. The problem here is that when we slice by Account A (with a key of 1), we would get the 100 Amount doubled. If we had more than two customers mapped to the account, it could have been tripled, or in fact multiplied by N, where N is the number of mapped customers.

The trick I am offering is adding a negative amount to this impromptu fact table against the same account (A) and against a Z customer with a key of -1, which acts as an unknown/adjustment bucket. Therefore, the Customer table would become:

And our fact/bridge table:
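In text form, and with assumed keys (Account A = 1, B = 2; Customer X = 1, Y = 2 and the adjustment customer Z = -1), the adjusted fact/bridge table could be loaded like this:

```sql
-- Illustrative only: the table name and key values are assumptions based on the example above.
CREATE TABLE dbo.BridgeAccountCustomer
(
    AccountKey  int            NOT NULL,
    CustomerKey int            NOT NULL,
    Amount      decimal(18, 2) NOT NULL
);

INSERT INTO dbo.BridgeAccountCustomer (AccountKey, CustomerKey, Amount)
VALUES
    (1,  1, 100),   -- Account A -> Customer X
    (1,  2, 100),   -- Account A -> Customer Y
    (1, -1, -100),  -- Account A -> Z: negates the duplication of A's amount
    (2,  1,  50);   -- Account B -> Customer X (a single mapping needs no adjustment)
```

Slicing by Account A now gives 100 + 100 - 100 = 100, Account B gives 50, and the grand total is 150, as expected.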

Now, if we create relationships between this table and the account and customer tables, we would see the following:

And

Well, we do get one extra row with negative amounts, but all else seems just fine. The total is correct and the amounts per account and customer are also correct. If we have more than one measure, we will have to negate each of the additional measures as well if we want to achieve the same behaviour.

I also created a slightly more complicated model, which included years:

Here we have to compensate with two rows of negative amounts, but the experience of using this model is not much different from the previous one:

Some other common scenarios also work quite well. I have tried with two M2M relationships, and while the number of rows gets progressively larger, the model still seems to work. Another common scenario would be to have more than two attributes in either of the tables (whoa!), and that seems to work very well, too. The key is to remember to negate/adjust for each mapping after the first one. If we had a third customer in our mapping table mapped to Account A, we would have two additional rows in the fact table – one with a positive and one with a negative amount.

Unfortunately I have not come up with a way to simplify the data modelling through DAX, so at this stage I would do it in the source. However, it should not be too difficult to achieve this in SQL, for example.
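As a rough sketch of that source-side step (all table and column names here are assumptions, not part of the original model): starting from the real fact table and a plain mapping table, the positive rows plus one negative adjustment row per mapping after the first can be generated like this:

```sql
-- Assumed sources: dbo.FactAccount(AccountKey, Amount) holds the real figures and
-- dbo.MapAccountCustomer(AccountKey, CustomerKey) holds the many-to-many mapping.
WITH Ranked AS
(
    SELECT
        m.AccountKey,
        m.CustomerKey,
        f.Amount,
        ROW_NUMBER() OVER (PARTITION BY m.AccountKey
                           ORDER BY m.CustomerKey) AS MappingNo
    FROM dbo.MapAccountCustomer AS m
    INNER JOIN dbo.FactAccount  AS f ON f.AccountKey = m.AccountKey
)
INSERT INTO dbo.BridgeAccountCustomer (AccountKey, CustomerKey, Amount)
SELECT AccountKey, CustomerKey, Amount            -- one positive row per mapping
FROM Ranked
UNION ALL
SELECT AccountKey, -1 AS CustomerKey, -Amount     -- one negative row per extra mapping
FROM Ranked
WHERE MappingNo > 1;
```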

I would be interested to see if anyone comes up with scenarios where this could not work and cases in DAX where the additional rows are a problem (beyond the “too much data”, of course).

The Single Table Model in PowerPivot

In my last post I examined a normalised vs a denormalised model in PowerPivot. In some cases, though, users will invariably avoid this de/normalisation “stuff” and import a single table into PowerPivot. After all, PowerPivot targets Excel users, and Excel users are used to working with large workbooks – in most cases a large extract provided by their friendly DBA or database developer. This is why the scenario where PowerPivot becomes a tool to overcome the 1-million-row limitation in Excel will be quite common. But how does it perform, is it wise to do this, and when? I will try to answer some of these questions in this post.

Performance

To test performance I mashed up the data in my PTest environment and put it in a single table. Of course, comparing the space on disk between the normalised, denormalised and single-table approaches, there was a massive difference, with the single table being by far the largest, followed by the denormalised model, and the normalised one being the smallest. This is what we would expect and is one of the reasons for normalising in the first place.

In a database the single table would be very inefficient, since it would lead to lots of IO in many scenarios. However, when imported into PowerPivot the sizes compare like this (variation from the denormalised model given in brackets):

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

PTest_Normalised

5.2 Gb RAM (-0.3Gb)
3.3 Gb File (-0.2Gb)

PTest_SingleTable

4.3 Gb RAM (-1.2Gb)
2.8 Gb File (-0.7Gb)

Obviously the elimination of many distinct keys leads to significant space savings. In fact, the single table approach is by far the most efficient when comparing memory utilisation and disk space (when saving the Excel file).

After I did my standard slicer testing, I got extremely good performance out of my single table. Having no relationships whatsoever seems to be a fast approach in PowerPivot. So, if there were no other considerations, we could jump to the conclusion that a single table is the best possible option for PowerPivot. Well, the next few sections of this post will show why this is actually not the case.

Usability

Let’s quickly add another hypothetical table to our model. What if we need some more data and we decide to ask our friendly DBA for another extract with more of the same? Now we have two massive tables and we want to analyse the data from both of them. We hit a serious problem – a slicer built from one of the tables does not slice the other one. This is an obvious scenario for SSAS developers: we need a relationship between the tables we use for slicing. In other words, if we want to slice both tables by Date, we need a common table on which we base the slicer. Our model should look like the following diagram:

We do not have this relationship, and trying to add one directly between the two big tables fails because we do not have a column with distinct values. This is the most obvious showstopper I can see when considering a single large table.

Size

When we have one table, memory utilisation seems very good. However, when we have more than one of these behemoths in memory, we can reasonably expect a problem – unnecessary duplication of attribute data (e.g. we have to store our Product and Customer names a number of times in memory). This is especially true for high-cardinality attributes – that is, where we have many distinct values, like my sample Order Number attribute (20+ million distinct values). In such cases, separating these into their own “dimension table” would save memory and disk space.
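As a sketch of what that back-end split might look like for the Order Number example (the table names are hypothetical):

```sql
-- Give the high-cardinality attribute its own table with a small integer surrogate key.
SELECT
    ROW_NUMBER() OVER (ORDER BY OrderNumber) AS OrderKey,
    OrderNumber
INTO dbo.DimOrder
FROM (SELECT DISTINCT OrderNumber FROM dbo.BigExtract) AS src;

-- The wide extracts would then carry only OrderKey instead of the long OrderNumber
-- values and relate to dbo.DimOrder inside PowerPivot.
```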

DAX

How these models compare when building DAX calculations is a topic of its own, and I will soon show some comparisons which should answer two questions – which is the most convenient and intuitive model to work with, and which one is the fastest. For now, suffice it to say that always working over a large set of values in different tables can be expected to be the slowest (however, I have not done sufficient testing to confirm this yet).

In conclusion, when doing ad-hoc analytics over some extracts which would definitely not need to be mixed up with others, the single table approach works very well with PowerPivot. It certainly extends the functionality Excel offers natively. However, if we are building extensible models, which are to be shared, enhanced and would form the heart of our Team BI, we should avoid the single table because of the other considerations listed above.

Why Choose PowerPivot?

If you are in the market for self-service or in-memory BI tools, you have some options. You have to consider functionality, cost and the future. If you are a SQL Server and/or SharePoint and/or Microsoft Office user, PowerPivot should be a top-of-the-list contender. I will discuss a few points to show why.

Cost

First and foremost, it is cheap. To utilise PowerPivot you need Excel and possibly SharePoint. If you need to empower your users with the capability to extend Excel and do heavy analytics on a workstation PC, PowerPivot is essentially free for Excel. If you want to let them collaborate and share their work, then SharePoint comes into the mix. If you already have SharePoint Enterprise in your organisation, then PowerPivot is, again, free. A free self-service BI platform – not a bad option, is it? Surely the cheapest.

Functionality

Functionality-wise, the outstanding “feature” is the integration with Excel. How many other self-service BI tools out there offer seamless integration with Excel? When Excel users become PowerPivot users, they have all the capabilities of Excel, plus PowerPivot. They can pick Top/Bottom 10, flick to a percentage representation of values and use the charting functionality of Excel the way they are used to; they can also utilise the rest of the Excel functionality they love. With a little bit of DAX knowledge they can build new calculations on top of massive data sets. Writing [Quantity]*[Price] gives us [Sales Amount]. Simple, isn’t it? Furthermore, if you prepare a nice, well-referenced data mart for them, you do not need to worry about their lack of SQL knowledge – all modelling gets done in Excel, in an environment very familiar to users: spreadsheets with rows and columns. Data can be previewed, filtered and ordered; new columns can be added with Excel-like syntax – a paradise for moderately Excel-savvy users.

Once the model is ready, if users want to share their work they can simply publish it to SharePoint. From there, other users can either browse the workbooks (if they have a browser – right…), or, if they are interested in more on-the-fly analysis, connect to the workbooks through Excel and slice/dice the data just the way they do with SSAS cubes. No need for client installations and no need for powerful workstations. In fact, to connect to published models they only need Excel 2003 and Windows XP. The minimum hardware requirements for those are, well… minimum.

As for IT Services departments – they can still manage the situation. They can monitor, advise and service – precisely their purpose. While a standalone, isolated and incompatible server could be a problem, the sort of manageability and visibility BISM and PowerPivot offer will, no doubt, appeal to ITS.

The Future

Let’s jump to the future. Microsoft has made a strong commitment to a multitude of future enhancements. The models will be available in SQL Server Analysis Services and DAX will be massively enhanced. There will be numerous enhancements on the modelling side, the querying side, the engine side, etc. – all in all, BISM and PowerPivot have a great future. In fact, from what I can sense, BISM and DAX will become more and more powerful, and if you commit to spending your money on another product I can guarantee that you will look back and regret the step, especially if you like Excel and SQL Server. Just think back to where SSAS was in 2000 and where it is now. By the look of it at this moment, we’ll have a similar situation after Denali and beyond. The integration between the components of the Microsoft BI stack will also be better than ever. With Crescent and SSRS reading BISM models more easily, the pieces will fit together even more seamlessly.

Am I advertising Microsoft BI? Yes. Am I objective? Maybe not, but allow me to be excited about it. From the poll at the top right of this blog, where I asked how you feel about the recent BISM announcements, I noticed that many people are either “Angry” or “Excited” about the new developments around SSAS. I wonder what the reactions will be when Denali ships and more people get hands-on experience with BISM. Will there be as many “Angry” people out there? I doubt it. Will everyone get ecstatic? Well, maybe not, but I believe a lot of users will become more Excited/Happy about it.

PowerPivot Tips and Tricks – First Findings

I just finished a little PowerPivot project together with James Beresford, and I feel compelled to share some of the lessons I learned during these exciting, if slightly stressful, few days. There are a few things I have heard in the past regarding good PowerPivot model design, but I’d like to re-iterate some and maybe add to what other bloggers have already mentioned. As usual with my posts, the following is not an exhaustive list, but it may resonate well with newbie PowerPivot developers like me (who at this stage are probably 99% of the total).

Integer Keys

I found out the importance of having integer dimension keys. The sample model I was given to work with had floats as primary keys for all dimensions. This has always been a pretty bad choice; however, with PowerPivot avoiding it becomes even more important, as the keys get loaded into memory, and unless your workstation has heaps of it, keeping the size of the solution small is extremely important. Therefore, even though PowerPivot should work just fine with other data types, I would strongly recommend choosing integers for your keys.

Previewing and Filtering

Always have a look at the tables you are importing into PowerPivot and make sure you deselect all the columns you do not actually need. They can always be selected later on if needed. As with SSAS, starting from a small model and extending it later is much better than pulling in every possible column and then removing some. This is even more true of PowerPivot, as you will use up a lot of memory retrieving everything from the back-end database, and unless you really don’t care about memory, this is not a good thing.

Co-existence with SQL Server

I know for a fact that PowerPivot and SQL Server can co-exist peacefully. However, since SQL Server was running on the same box, I had to limit its memory usage to 150Mb so it would not cause issues with PowerPivot.
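For reference, one common way to cap the instance is via sp_configure – note that the 150Mb figure is simply what I used on this particular box, not a general recommendation:

```sql
-- Limit SQL Server's memory so PowerPivot has room to work.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 150;
RECONFIGURE;
```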

Moving Logic to the Database

If you are designing the solution and/or building the model, instead of trying to do more in PowerPivot, make sure you move join logic and the like to the back-end database. There is no point in importing tables into PowerPivot when the relationships between them are one-to-one. Instead, you can join them in a view and then import them as one piece, thus avoiding some relationships in VertiPaq, which will improve performance and make it easier to work with the model. For example, I had a 1:1 relationship through a large intermediate table (3+ million rows), which I managed to get rid of completely in the database back-end by applying a simple join.
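To illustrate the idea, here is the kind of view I mean, with hypothetical names standing in for the fact table, the large intermediate table and the detail table:

```sql
-- Collapse a 1:1 chain (fact -> intermediate -> detail) into a single object,
-- so PowerPivot imports one table instead of three related ones.
CREATE VIEW dbo.vw_FactWithDetail
AS
SELECT
    f.FactKey,
    f.Amount,
    d.SomeAttribute
FROM dbo.FactTable               AS f
INNER JOIN dbo.IntermediateTable AS i ON i.FactKey   = f.FactKey
INNER JOIN dbo.DetailTable       AS d ON d.DetailKey = i.DetailKey;
```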

Many To Many Relationships

If you can avoid them – do it. However, if you can’t, you can read Marco Russo’s, Alberto Ferrari’s and the PowerPivot FAQ posts to get more info on how to implement them. Unfortunately, we have to create multiple DAX measures to work around the problem, and if we have many measures this can become a bit tedious.

Saving Regularly

Yes, you should save your work, especially when you notice that memory is getting scarce. Maybe not every 5 minutes, but at least every 30 minutes should be the norm. Once memory is very low, PowerPivot starts having issues (i.e. becomes a bit unstable) and you can easily lose some work. It does take some time to save a large model to disk, but it may well be worth it. This may sound like a no-brainer, but it is easy to stop saving your work because of the time required to do so. It is then possible (as it was in my case) to get a funny message asking you to Continue or Cancel, and not to be too sure which is the better choice after a couple of hours of work…

Ordering Dimensions in PowerPivot

UPDATE (2011-07-20): Indeed, with the release of CTP3 of SQL Server Denali we can order dimension members by another column, effectively making this post obsolete. Please ignore the advice below if you are working with a version of PowerPivot and BISM >= Denali CTP3.

I am sure that this post will get outdated very soon, but until that happens it is a neat little trick that PowerPivot users may need to know.

While in Analysis Services we can order our attribute hierarchies any way we want, in PowerPivot we are stuck with the default sort order, which can be annoying in some cases. The other day, while building a PowerPivot model (which performed GREAT, by the way), I had to order a band dimension which had members like “Band 1-4”, “Band 5-10”, “Band 11-14”, “Band 15-19”, …, “Band 100-109”, etc. Alphabetically this got ordered very wrong, as all bands starting with “Band 1…” got bunched up at the beginning, followed by “Band 2…”. The users were very unhappy, and after searching online I got a bit disheartened because I could not find a solution anywhere.

The following is Adventure Works 2008 in Excel 2010 (through PowerPivot) and the default ordering of the English Month Name attribute in the Date dimension:

What I noticed was that the actual order which I wanted was reflected by the primary key of the table, which went 1,2,3,4,… So, I un-hid the key from the Pivot Table, renamed it to Band Order and showed users how they can nest it in front of the Band attribute. Because I had one key per band I did not have to worry about having multiple unordered bands under the order attribute.

Furthermore, if placed on columns, the Order attribute row can be hidden for sheets which remain static (it’s ok if they have slicers):

Otherwise, if it is on rows, we can use the Classic Pivot Table view and then hide the order column, which gives a different, but equally useful result:

As a side note, when ordering date attributes like in my example above, you can use the Sort Options -> More Options functionality in Excel, as described in a post by the PowerPivot team. I used the month only as an example; the technique described here is for when you have trouble with other attributes. You can also sort members manually by dragging and dropping them in the pivot table, as Dan English explains here.

Importing Azure DataMarket feeds in PowerPivot

I just had a quick play with the new PowerPivot add-in for Excel and tested the import with a few data feeds from the new Azure Marketplace. Straight away, I have a few lessons learned which may be good to share:

  1. The instructions for importing the feeds through PowerPivot are slightly incorrect. Instead of “Getting Started with this Dataset”, you should click on “Explore this Dataset”. Then, on the bottom left there is a drop-down which allows you to select PowerPivot. From there onwards it’s straightforward.

  2. Wait for a feed to finish before importing the next one. It may take a while with a large feed and I was impatient (on a 64-bit version, for reference), which crashed my Excel.

  3. Importing the dataset through the Service Root URI (by pasting it into PowerPivot) will result in a 400 Bad Request even though the connection tests succeed. Marco Russo blogged about adding a slash at the end of the URI here.

And a quick suggestion – it would be good to see the size of the feed we will be importing on the Explore this Dataset page. Since it displays only the top 100 items, there is no way of telling whether there are 101 or 1,000,000,000 feed items – which in turn doesn’t allow us to prepare for a long wait.

Apart from these, I had no other issues and I am happily browsing the datasets in PowerPivot, to the amazement of my .NET colleagues.