Is Excel 2013 Power View hurting SSAS?

Since the release of Office 2013 Preview, there has been an awful lot of commentary both in the BI blogosphere and in the media at large, mostly praising the new look and features Microsoft are giving us. I agree that the new suite is getting a much-needed face-lift (after all, 2007 -> 2010 did not deliver much in this regard), along with the “touch-enabled” interface that seems mandatory for any post-2010 release. As a BI guy, the most interesting thing for me in the new Office is Excel, and within Excel, Power View (not completely ignoring the new Pivot Table options like adding new calcs natively, without using the OLAP PivotTable Extensions plug-in).

There are deficiencies in Power View as it stands now (maybe that is why it is an add-in, not enabled by default), but it is a nice data exploration tool which even Stephen Few told me was going in the right direction. It still doesn’t allow us to do things like Top 10, % of Total, etc., which I love in Excel Pivot Tables, but I am hoping that we’ll get all these and more at some point. Together with maps, small multiples and highlighting, Power View does seem like a potential killer application, one which brings the likes of Tableau within reach.

However (here we go), in my opinion it has one drawback which really spoils the good things: it doesn’t work on SSAS Multidimensional cubes. I am sure that if someone from the Microsoft SSAS or SSRS teams reads this they won’t be surprised. After all, they have heard it numerous times since the release of Power View (even when it was Crescent). Chris Webb wrote about it and caused some stir; it was discussed, and the issue was parked as “will be fixed at some point”. I don’t want to resurrect it in its previous format since we now know that a fix is coming (as publicly stated by Microsoft at PASS and other forums), but I would like to point out that last week, after demoing Excel 2013 in front of some decision-makers in the organisation I currently work at, I had to answer questions like “So, we can’t use Excel with SSAS at all?” and “Do we have to upgrade all our cubes to in-memory models now?”. This made me think: we can’t cover the whole feature set of SSAS MD in SSAS Tabular, yet we can’t use the new end-user tools on SSAS MD either. Basically, I am left with only one option – to recommend SSAS MD and Excel Pivot Tables as the only corporate BI strategy competitive with other vendors like IBM and SAP. Furthermore, since we can’t use DirectQuery on Teradata, I also can’t say that SSAS Tabular is better than SSAS MD as an analytical platform on top of Teradata. Yes, one day when we get it working it will trump all other options, and the Teradata guys I work with are genuinely excited about the future, but it is unfortunately the future, while other options are there now. And the future in IT always looks bright, while the present is a different story altogether.

So, is Excel 2013 Power View hurting SSAS MD? Maybe not directly, but the more we promote Power View (hint: as it’s getting deployed on every workstation we hardly need to promote it), the more we also highlight the deficiencies in SSAS Tabular and make SSAS MD look like the neglected sibling. As for my demonstrations of Microsoft BI, I would probably treat the two components of SSAS – MD and Tabular – as two separate stacks: MD – the best out there for corporate BI, and Tabular – the best for team/personal BI, not to be mixed together. To those who can’t wait to get their hands on a powerful, modern analytical tool working on the iPad and use it on a powerful, robust, widely used analytical engine which does not require terabytes of memory (let me spell it out: Power View + mobile on SSAS MD), we have to say: “at some point in the future, hopefully”.

PS: Yes, it does sound like yet another whiny post, but we have been waiting for a very long time (in IT terms) since the inception of Crescent to get this bloody issue fixed, and it is hard to imagine mighty Microsoft struggling to catch up with Cognos (yes, they have mobile – a terrible implementation, but it works), Tableau (last time I checked they were a tiny company with overpriced products) and QlikView (ugly and slow from what I’ve seen). It’s not like Microsoft weren’t aware of this, and it’s not like it’s happening for the first time – remember how well PerformancePoint did, being unable to connect to anything but SSAS MD and having a lacklustre feature set, and SSRS to SSAS integration has always been a pain point. The new, quicker release cycle at Microsoft seems to be delivering the same amount of features in less-complete releases, which doesn’t help with inspiring confidence in its vision and ability to execute. However great these new ingredients are, the dish doesn’t taste that good when some are missing…and it tends to get cold quickly too.

Range Queries with Azure DataMarket Feeds

By default the Azure DataMarket does not allow range queries. In fact, the only way we can filter a data feed is by specifying one or more values for the “queryable” fields defined for it. There is no technical reason for not allowing range queries, as both the back-end (presumably SQL Azure, or SQL Server) and the OData protocol support them. Fortunately, there is a way to consume a range of the values in a column of a data feed in PowerPivot. It is not straightforward, and I do not think that the target audience of both self-service BI/PowerPivot and the DataMarket itself would appreciate the complexity, but it could be useful anyway.

If we want to pull all three tables from the DataMarket we can simply use https://api.datamarket.azure.com/BoyanPenev/DateStream/ as the URL in PowerPivot:

Otherwise, we can pick each one with a URL like the following (for the BasicCalendarEnglish table):

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish

If we filter the data set on the DataMarket website to request only the data for 2010 we get the following URL:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20eq%202010

Note the last bit:

?$filter=YearKey%20eq%202010

This is simply the URL/OData-encoded form of ?$filter=YearKey = 2010

In OData we can also use operators other than = (or eq). For ranges these are gt (greater than), ge (greater than or equal to), lt (less than) and le (less than or equal to). We can also use the and and or operators to combine different predicates. For a more thorough list, please refer to http://www.odata.org/developers/protocols/uri-conventions. If we replace the “ = 2010” with “ < 2010” and then encode the URL, we do indeed get all years prior to 2010. Things get slightly more complicated when we have a more complex scenario. For example, when building a date table we may want to include all years between 2000 and 2030. To do that, we would have to write something like:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey >= 2000 and YearKey <= 2030

Encoded, the same URL looks like this:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20ge%202000%20and%20YearKey%20le%202030

Here a space is encoded as %20, and the comparison operators have been replaced with their OData equivalents (ge and le).
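
For readers who think in SQL, the OData range filter above is logically equivalent to the following T-SQL predicate. This is only an illustration against a hypothetical local copy of the table – the DataMarket itself accepts OData filters, not SQL:

-- Illustration only: the OData $filter above expressed as a T-SQL predicate
-- against a hypothetical local copy of the feed's table.
SELECT *
FROM dbo.BasicCalendarEnglish
WHERE YearKey >= 2000   -- ge 2000
  AND YearKey <= 2030;  -- le 2030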

If we paste this in PowerPivot and hit “Next”:

…we get exactly what we expect – a table spanning the years 2000 to 2030.

Things get more complicated if we include the datetime DateKey in the URL. For a single date (e.g. 1900-01-01), we have to use:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime’1900-01-01T00:00:00′

After applying URL encoding we get:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27

Here %27 is an apostrophe and %3a is a colon (for a list of ASCII characters and their URL-encoded form we can refer to http://www.w3schools.com/tags/ref_urlencode.asp).

Now, to combine the two we would need to write:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime’1900-01-01T00:00:00′ or (YearKey >= 2000 and YearKey <= 2030)

Encoded this becomes:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27%20or%20%28YearKey%20ge%202000%20and%20YearKey%20le%202030%29

This monstrous-to-write URL string returns the years 2000 to 2030 plus one extra day.

I suppose this approach can be classified as a workaround, as I have not seen any documentation on PowerPivot referring to any options for filtering data from the Azure DataMarket. However, in my opinion, this should be a feature of the DataMarket itself as it would make it easier/possible for users with any tool to get just the data they need and even possibly reduce the load on the site service since it will no longer be necessary to export everything and then attempt to apply a filter.

Introducing Project DateStream (CodePlex)

I recently blogged about The Case for an Azure DataMarket Date Table. I finished that post with a bit of a critique of the DataMarket team at Microsoft, for which I can now wholeheartedly apologise. This is because since my last post I was contacted by Max Uritsky, who is a Group Program Manager on the Windows Azure Marketplace DataMarket team (a long name for a team, yes). He and Belinda Tiberio managed to help me with creating and hosting a new Date feed. Not only did they help with making it available for free on the DataMarket website, but they also gave me a 1Gb free SQL Azure database for the project. A big “thank you” goes to Julie Strauss from the SSAS team for making the contact, as well.

To summarise, the DateStream project is a free date table available as a feed and intended to be used by PowerPivot BI users. As most, if not all, BI projects include a date table, the goal is to provide an easy-to-use, correct and simple mechanism for creating such tables.

After some deliberation on the format of the feed we decided that it would be best to split it into a number of localised Basic tables and one Extended table. Currently we have only two Basic versions (US and EN) and a fairly straightforward Extended one. However, the plan is to include more and more columns in the Extended one and provide extra localised (Italian, French, Spanish, German, Russian, etc.) tables. When I say “we” I mean the fellow SSAS professionals with whom I had the pleasure of discussing this idea (among them Marco Russo, Thomas Kejser, Willfried Faerber and Grant Paisley).

The CodePlex page of the project contains more details about the feed and the tables and also allows for commenting on existing features, as well as requesting new ones. It can be found at http://datestream.codeplex.com/.

The actual feed can be used directly from https://datamarket.azure.com/dataset/1542c52d-d466-4094-a801-9ef8fd7c6485.

Note the logo – it was created by Daniele Perilli from SQL BI with the assistance provided by Marco Russo. Thanks to them we did not have to resort to my graphic design skills, which definitely is a win for the DataMarket website.

One note – please let me know if the performance you get from the feed is not satisfactory (please remember that once pulled, the feed does not need to be refreshed as the data will remain valid forever). If many people agree that it is too slow I could potentially host it on an alternative location as well. It is possible to download a CSV version from the DataMarket as a workaround, which also allows removing unnecessary date ranges.

Building a Date Table

Date tables in themselves are nothing new. Every data warehouse and business intelligence project includes one of those and typically it is one of the first tables which we find on our list of implementation tasks. Surprisingly, however, I am finding that in many implementations the omnipresent date table cannot support some analytics. In this post I will outline a few considerations which we need to take into account when building date tables in general, and some which apply specifically to SSAS.

UPDATE (9th October 2011): If you are a PowerPivot user (or would not mind using an OData feed) there is a simple way to get a date/calendar table as there is a free one available on the Azure DataMarket. I recently blogged about this new project in my Introducing Project DateStream (CodePlex) post.

UPDATE 2 (16 October 2011): John Simon recently published a nice T-SQL script which can be used to create and populate a date table on his blog. Have a look if you need one!

Grain, Key and Format

In general, date tables are on a daily/date grain. This is because the lowest level of reporting we need to do is very often on full days. Even if we need to analyse on time intervals smaller than a day, we should still build a Date and a Time table to support this. Why? Because if we pack it all in the same table we end up with too many rows, and this is something we typically want to avoid in any dimension. As we have a row per day, one of our columns is inevitably based on the calendar date. It is also typically the dimension key. As with all other dimension tables, we need to be able to join the table to our fact tables and this is done on our key. To keep our fact tables as small as possible, we need the smallest possible data type for our keys, and in the case of a Date this is usually an integer. However, unlike with other dimensions, we do not have to have a meaningless integer for our surrogate key because it does not give us any advantage. It is very easy to convert between a datetime and an int, and in SQL Server datetime is 8 bytes, while int is 4. In SQL 2008+ we have another type suitable for our date table key – date, which is 3 bytes. However, for compatibility reasons, it is frequently better to stick to an int. Luckily, we also have the very convenient ANSI/ISO 8601 standard to follow in order to make our integer dates easier to work with. If we encode them in the YYYYMMDD format (thus the 15th of January 2011 becomes 20110115), we can easily sort on the column and compare two dates with the standard > and < operators. Additionally, we escape the ambiguities around date formats in the USA and the rest of the world.
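
As a quick, minimal sketch of how cheap this conversion is, the following T-SQL shows a round trip between datetime and a YYYYMMDD integer key (the variable names are made up for the example):

-- A minimal sketch: converting between datetime and an ISO 8601-style integer key.
DECLARE @Date    datetime = '20110115';                                 -- 15th of January 2011
DECLARE @DateKey int      = CONVERT(int, CONVERT(char(8), @Date, 112)); -- style 112 = yyyymmdd -> 20110115

SELECT @DateKey                                           AS DateKey,    -- 20110115
       CONVERT(datetime, CONVERT(char(8), @DateKey), 112) AS BackToDate; -- 2011-01-15 00:00:00.000

-- Because of the YYYYMMDD layout, plain integer comparisons behave like date comparisons:
-- 20110115 < 20110116, and ORDER BY DateKey sorts chronologically.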

Attributes

Building the date key column is a good start, but to allow for better analytical capabilities we need to add other attributes. As in most cases we need to analyse our data not only on days, but also on larger intervals (e.g. Months, Quarters and Years), we need to add columns for each of them. A typical requirement is to support both Financial and Calendar date hierarchies. As it is OK for dimensions to have many columns, we can add one for each combination of calendar type and period (e.g. FinancialMonth and CalendarMonth, FinancialQuarter and CalendarQuarter, etc.). Other interesting attributes we may want to have materialised in the table are the likes of Weekends, Public Holidays (even though we need to maintain these), Solstice/Equinox and PhaseOfMoon (thanks to Thomas Kejser for mentioning these in an online conversation recently). Basically, for a fully functional date table we need to consider anything which could be of business value without going to extremes.

Of course, when working with SSAS we also want to have an integer key for each attribute and possibly a common name for it. This multiplies the number of columns we need by two – one Id and one Name column for each of the attributes we have decided to implement. Some attention needs to be paid when determining the formats of each attribute Id. If we are building a hierarchy out of a collection of attributes, we want those to form a nice natural hierarchy. That is – each child member should not have two parents with different Ids. To illustrate the concept, let’s consider two different months – January 2011 and January 2012. While each is the first month of its calendar year, they represent different entities when reporting. We do not aggregate data to January only, but to January in each respective year. Therefore, if we were to write a SQL query to get data for January 2011 and our Ids for both of these members were 1, we would also need to use the Year attribute to get the values we need. Effectively, our hierarchy would have two children with identical Ids of 1 under different parents (with Ids of 2011 and 2012). This is a particular problem when working with SSAS, as it definitely prefers unique Ids for each separate attribute member. There is a very easy solution – we can include both the Year and the Month number in the MonthId. In our case, January 2011 gets an Id of 201101 and January 2012 – 201201. Now we don’t have the same issue. Similarly, we must pay attention when we construct the Ids of other attributes which participate in our hierarchies, like Quarter (YYYYQQ) and Half Year (YYYYHH).
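
A minimal T-SQL sketch of deriving such hierarchy-friendly Ids follows; the table and column names are hypothetical, and the point is simply that the year is embedded in every Id which participates in a hierarchy:

-- A minimal sketch: attribute Ids that form a natural hierarchy (table/column names are hypothetical).
SELECT
    CONVERT(int, CONVERT(char(8), CalendarDate, 112))                               AS DateKey,    -- YYYYMMDD
    YEAR(CalendarDate) * 100 + MONTH(CalendarDate)                                  AS MonthId,    -- YYYYMM, e.g. 201101
    YEAR(CalendarDate) * 100 + DATEPART(quarter, CalendarDate)                      AS QuarterId,  -- YYYYQQ
    YEAR(CalendarDate) * 100 + CASE WHEN MONTH(CalendarDate) <= 6 THEN 1 ELSE 2 END AS HalfYearId, -- YYYYHH
    YEAR(CalendarDate)                                                              AS YearId      -- YYYY
FROM dbo.DimDate;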

Weeks

The week is a special case in the calendar. While we can say that Days roll up to Months, which roll up to Quarters, which in turn roll up to Half Years and Years, Weeks are a different story. If we are not using a special calendar like a 4-4-5 calendar, we are dealing with a real-world entity which does not naturally belong to only one larger period. In most cases we do not have to worry about the week falling between Day and Month, as the business understands that this is not an easy hierarchy to work with. However, business users are very often unaware that weeks do not roll up nicely to years either. We can again turn to ISO 8601, which also deals with the Week->Year problem. Basically, the ISO has come up with a simple solution – if we have a week which has dates in two years, we count it towards the year which contains more of its dates (or, equivalently, the year which contains the Thursday of the week). Why is this important? Well, we can simply split a week in two; however, this means that certain weeks in our table contain fewer than 7 days. If we compare such weeks with a Weekly Growth calculation we will notice a problem – the amounts aggregated to them are smaller than usual. Similarly, on a graph showing weekly amounts, we have a dip as the two parts of the same week are plotted as separate points. If the users do not care, then it is not a problem, but it is something we should consider and ask about when building the week attribute in our date table.
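
For SQL Server 2008 and later, a minimal sketch of the ISO week logic could look like the one below. The “26 - ISO_WEEK” expression for deriving the ISO year is a common workaround, since there is no built-in ISO year function:

-- A minimal sketch of ISO 8601 week attributes (SQL Server 2008+).
DECLARE @d date = '20110101';  -- a Saturday; by the ISO rules it belongs to the last week of 2010

SELECT
    DATEPART(ISO_WEEK, @d)                              AS IsoWeekNumber,  -- 52
    YEAR(DATEADD(day, 26 - DATEPART(ISO_WEEK, @d), @d)) AS IsoYear,        -- 2010, the year which owns the week
    -- A naive WeekId based on the calendar year (YEAR * 100 + ISO_WEEK) would label this date 201152,
    -- splitting the week across two Ids; the ISO-aware Id keeps the whole week together:
    YEAR(DATEADD(day, 26 - DATEPART(ISO_WEEK, @d), @d)) * 100 + DATEPART(ISO_WEEK, @d) AS IsoWeekId;  -- 201052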

Ranges

The date table has a limited number of rows and we have to make a decision on how many are enough. Some developers build date tables with ranges all the way from 1900 to 2100, or even further. As we have roughly 365.25 days per year (note how 1900 is not a leap year, nor is 2100 – something Excel doesn’t know), for 10 years of data we end up with ~3652 rows, and with 100 years we have ~36525. It is quite doubtful that a data warehouse will contain data for all these years. In most cases it contains a bit of historical data and is designed to keep data for another 20-50 years. To optimise the performance of the queries using the date table it is a good idea to have a dynamic range of dates. In our ETL we can easily keep expanding the range as needed, or collapse it if we ever purge fact data. One thing which is often overlooked when picking the date range is the fact that many queries depend on the completeness of the range. Major mistakes we must avoid are basing the date table on fact data and allowing gaps in the range, and having incomplete top-level periods.

The first problem is easy to explain. Zealously trying to keep the date dimension as small as possible by reducing the number of days (i.e. rows) to only the ones applicable to our fact data can introduce gaps in the overall table. For example, if we have no data for the 26th of January because our business was closed for Australia Day, and we “optimise” the date table to exclude this day, all analytics which depend on counting the number of dates in January will be skewed. An average calculation doing Sales/NumberOfDaysPerMonth will divide by 30, not 31. One of the reasons for having a date table in the first place is to avoid such problems. Hence, gaps in the date table must be avoided.

Secondly, we must also ensure that all top-level periods (i.e. Year is usually the top level in the Calendar hierarchy) are complete. It is not acceptable to cut off the date range at an arbitrary month just so that we have a constant number of dates. Even if our fact data is implemented over a fixed window of 24 months (rolling), we should not do the same with the date table. For example, if we are in September 2011 and we have fact data for the 24 months prior to this (e.g. September 2009 – September 2011), the date table should also include the months prior to that (e.g. January 2009 – September 2011). The reason is the same as before – all calculations doing Amount/NumberOfPeriodsWithinAPeriod would be wrong for 2009. Furthermore, if we use something like window functions in SQL, partitioning by year and counting the months within each period in order to compare equivalent ones will be incorrect: the first month of 2009 would be September, while the first in 2010 would be January. Because of such issues, it is best to keep all top-level periods complete. A solution could be removing data from the date table once we are no longer interested in the complete top-level period. In the case of the 24 rolling months, we can remove all of 2009, with all its periods/rows, once we move to 2012 and we are interested in January 2010 – January 2012.
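
A minimal sketch of generating such a gap-free, complete-year range in T-SQL follows. The boundary years are hard-coded here for brevity; in a real ETL we would derive them from the fact data and always round out to full years:

-- A minimal sketch: a contiguous calendar which always spans complete years.
DECLARE @FirstYear int = 2009, @LastYear int = 2011;
DECLARE @StartDate datetime = CONVERT(datetime, CAST(@FirstYear AS char(4)) + '0101', 112),
        @EndDate   datetime = CONVERT(datetime, CAST(@LastYear  AS char(4)) + '1231', 112);

WITH Dates AS
(
    SELECT @StartDate AS CalendarDate
    UNION ALL
    SELECT DATEADD(day, 1, CalendarDate)
    FROM Dates
    WHERE CalendarDate < @EndDate
)
SELECT CalendarDate
FROM Dates
OPTION (MAXRECURSION 0);  -- the default limit of 100 recursions is far too low for years of dates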

In SSAS all date functions like ParallelPeriod and ClosingPeriod work with relative periods. To get the ParallelPeriod for January 2010 in 2009, SSAS will determine that January 2010 is the first month in 2010, go back to 2009 and pick the first month there. If the first month is September 2009, we will get that returned.

Unknowns

As with other dimensions, we should also have an Unknown member in the date dimension. While with all other dimensions we usually pick a value like -1 for the Unknown member Id and a Name of “Unknown”, with a date table things are slightly different, as we often have columns of some date/time data type and we cannot cast -1 to a date in a meaningful way. Therefore, in many cases we can pick a complete outlier, like 19000101, which we use for storing Unknowns. Depending on the scenario we may not need to do this, but if we do, there is nothing wrong with it as long as we make it clear that it is a special case. If the data starts in 2001 and we pick 20000101 as the unknown value, many users will be wondering why there is some data against that date.
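
A minimal sketch of seeding such an Unknown member (the table and column names are hypothetical and would follow whatever structure the date table actually has):

-- A minimal sketch: an Unknown member keyed on an outlier date (hypothetical table and columns).
INSERT INTO dbo.DimDate (DateKey, CalendarDate, MonthId, MonthName, YearId, YearName)
VALUES (19000101, '19000101', 190001, N'Unknown', 1900, N'Unknown');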

Conclusion

The best date table is the one we don’t notice and take for granted. Given that it barely ever changes, the structure of the date table is common across many different implementations, and whether in SQL Server, SSAS or PowerPivot, the fundamentals stay the same. It can be shared and is part of the base of virtually every BI solution. Unfortunately, it is very often built wrong, severely impairing the capability of the system it is a part of. With the arrival of self-service BI I can imagine an increasing need for practical advice on this specific data modelling technique, and I hope this article helps with delivering a bit of it.

The Case for an Azure DataMarket Date Table

Since the release of PowerPivot, Excel pros and power users have been encouraged to learn and “play” with the add-in. There is one little thing from the world of BI which we (as long-standing BI professionals) are used to but which apparently troubles our new friends from the Excel world – the Date table. In the SSAS Multidimensional world we have BIDS, which can generate a date table in a variety of formats. Still, much more common is the custom Date table, which we build through a SQL script as it enables us to dynamically generate it for a range of dates. I have also used Excel in the past for quick and dirty solutions. Although all these scenarios are very “workable” for database professionals, for Excel power users fiddling with databases is far from ideal. Luckily, there is a better way.

Azure DataMarket

The answer I am proposing is the new DataMarket, which as an added bonus (in cases when we cannot use third-party products) is Microsoft-owned. Since the last update it has been integrated very well with PowerPivot and allows selecting subsets of the data (i.e. we do not need to import everything available in the data set). Because the data is in a feed format we can connect and pull the data we want anytime.

The DataMarket is the vehicle, but it needs a good data set to transport. With the Microsoft SSAS team being busy with new releases and unable to chase this up with the Azure DataMarket team, I tried contacting the latter directly, to no avail. It would be very simple to create a sample, test it, and if all goes well, expand the feed to include lots of necessary columns which could simplify any PowerPivot implementation. As a start, a minimum of a calendar hierarchy with 4-5 levels should suffice, but the possibilities are very exciting. We could have multiple financial/fiscal calendars, public holidays, weekends and leap years taken care of, and many other Date properties built right into the feed. If customisation is required (as it probably will be in many cases), PowerPivot developers have Excel and DAX to play with the data through formulas and change various properties like member names and formats.

In my opinion, with a minuscule development effort Microsoft can win on both new fronts – PowerPivot and the Azure DataMarket. After all, both products need more exposure, and a popular Date feed will definitely help in this direction (not to mention how much easier it would be for developers to “get into” PowerPivot-based BI implementations).

PS: I emailed the DataMarket team a few months ago and received no reply. This remains so even after Julie Strauss from the SSAS team followed up on it – it seems like someone is sleeping on the job…

PPS: I just created a Microsoft Connect suggestion – please vote if you feel like this is a good idea. Also, if you feel even more inclined to act you can email the DataMarket guys directly asking them to pay attention through the links provided on their Contact Us page.

SSAS to BISM – Recent Developments

There was a fair bit of FUD around the future of SSAS and just now it got officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead of SSAS. If you haven’t read these I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

After the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS“. I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we don’t get a tremendously successful product replaced; instead we get a new and exciting addition to it. Rumours have it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? A unique mix of on-disk and in-memory analytics. Edges may have to be smoothed, but at the end of the day we get more, not less.

What caused the confusion? Well, in my opinion the SSAS team may be the greatest in analytical tools, but in this particular case I think that the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us, they did not accurately draw a roadmap, which led to the rumours. I hope that in the future this won’t happen, and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.

Many-To-Many Relationship Modelling in PowerPivot: Workaround

So far PowerPivot does not support many-to-many relationships. It is still early days and I am sure that things will change soon. Moreover, there are a few ways to implement many-to-many relationships for measures in DAX. Two notable posts on the subject can be found on Marco Russo’s and Alberto Ferrari’s blogs:

http://sqlblog.com/blogs/marco_russo/archive/2009/12/07/many-to-many-relationships-in-powerpivot.aspx

http://sqlblog.com/blogs/alberto_ferrari/archive/2010/10/19/powerpivot-and-many-to-many-relationships.aspx

Unfortunately, if we have multiple measures, the approach is somewhat cumbersome as we need to build numerous DAX measures. I have been playing around with modelling options and I just came across an idea which, while not perfect, can help us to mitigate to some extent the lack of this crucial bit of functionality. In fact, “not perfect” is quite an understatement – the more appropriate expression would be “a hack”, but it could still be useful in some scenarios – hence this post!

To put some background, let’s assume we have this simplistic model:

An account table

A customer table

And a fact table

Here we have two accounts, A with an Amount of 100 and B with 50.

Now, let’s assume we have the following mapping table between Accounts and Customers:

Account A maps to customers X and Y; Account B to X only.

Now, instead of trying to directly map the relationships here, we add the Amount to the mapping (bridge) table:

And we then create our relationships in PowerPivot. The problem here is that when we slice by Account A (with a key of 1), we would get the 100 Amount doubled. If we had more than two customers mapped to the account, it could have been tripled, or in fact multiplied by N, where N is the number of mapped customers.

The trick I am offering is adding a negative amount to this impromptu fact table against the same account (A) and against a Z customer with a key of -1, which acts as an unknown/adjustment bucket. Therefore, the Customer table would become:

And our fact/bridge table:

Now, if we create relationships between this table, the account and customer tables we would see the following:

And

Well, we do get one extra row with negative amounts, but all else seems just fine. The total is correct and the amounts per account and customer are also correct. If we have more than one measure, we will have to negate each of the additional measures as well if we want to achieve the same behaviour.

I also created a slightly more complicated model, which included years:

Here we have to compensate for two rows with negative amounts but the experience of using this model is not much different than the previous:

Some other common scenarios could also work quite well. I have tried with two M2M relationships and, while the number of rows gets progressively larger, the model still seems to work. Another common scenario would be to have more than 2 attributes in either of the tables (whoa!), and that seems to work very well, too. The key is to remember to negate/adjust for each mapping after the first one. If we had a third customer mapped to A (in addition to X and Y), we would have two additional rows in the fact table – one with a positive and one with a negative amount.

Unfortunately I have not come up with a way to simplify the data modelling through DAX, so at this stage I would do it in the source. However, it should not be too difficult to achieve this in SQL, for example.
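
Here is a minimal sketch of how the adjusted bridge/fact table could be produced in T-SQL. All table and column names are hypothetical and simply mirror the simplistic model above – one Amount per account, a mapping between accounts and customers, and a -1 adjustment customer:

-- A minimal sketch (hypothetical names): build the bridge/fact table with adjustment rows.
-- Positive rows: repeat the account amount for every mapped customer.
SELECT  m.AccountKey,
        m.CustomerKey,
        f.Amount
FROM    dbo.AccountCustomerMap AS m
JOIN    dbo.FactAccount        AS f ON f.AccountKey = m.AccountKey

UNION ALL

-- Adjustment rows: one negative amount against the -1 customer for every mapping after the first,
-- so that each account still totals to its original amount.
SELECT  m.AccountKey,
        -1 AS CustomerKey,
        -f.Amount
FROM    dbo.AccountCustomerMap AS m
JOIN    dbo.FactAccount        AS f ON f.AccountKey = m.AccountKey
WHERE   m.CustomerKey <> (SELECT MIN(CustomerKey)
                          FROM   dbo.AccountCustomerMap
                          WHERE  AccountKey = m.AccountKey);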

I would be interested to see if anyone comes up with scenarios where this could not work and cases in DAX where the additional rows are a problem (beyond the “too much data”, of course).

The Single Table Model in PowerPivot

In my last post I examined a normalised vs a denormalised model in PowerPivot. In some cases, though, users will invariably avoid this de/normalisation “stuff” and import a single table in PowerPivot. After all, PowerPivot targets Excel users, and Excel users are used to working with large workbooks – in most cases a large extract provided by their friendly DBA or database developers. This is why the scenario where PowerPivot becomes a tool to overcome the 1 million row limitation in Excel will be quite common. But how does it perform? Is it wise to do this, and when? I will try to answer some of these questions in this post.

Performance

To test performance I mashed up the data in my PTest environment and put it in a single table. Of course, comparing the space on disk between the normalised, denormalised and single-table approaches, there was a massive difference: the single table was by far the largest, followed by the denormalised model, with the normalised one being the smallest. This is what we would expect and is one of the reasons for normalising in the first place.

In a database the single table would be very inefficient since it will lead to lots of IO in many scenarios. However, when imported in PowerPivot the sizes compare like this (variation from denormalised given in brackets):

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

PTest_Normalised

5.2 Gb RAM (-0.3Gb)
3.3 Gb File (-0.2Gb)

PTest_SingleTable

4.3 Gb RAM (-1.2Gb)
2.8 Gb File (-0.7Gb)

Obviously the elimination of many distinct keys leads to significant space savings. In fact, the single table approach is by far the most efficient when comparing memory utilisation and disk space (when saving the Excel file).

After I did my standard slicer testing, I got extremely good performance out of my single table. Having no relationships whatsoever seems to be a fast approach in PowerPivot. So, if there were no other considerations, we could jump to the conclusion that a single table is the best possible option for PowerPivot. Well, the next few sections of this post will show why this is actually not the case.

Usability

Let’s quickly add another hypothetical table to our model. What if we need to add some more data and we decide to ask our friendly DBA to give us another extract with more of the same? Now we have two massive tables and we want to analyse the data from both of them. We hit a serious problem – slicing by a slicer built from one of the tables does not work with the other one. This is an obvious scenario for SSAS developers. We need to have a relationship between the tables we use for slicing. In other words, if we want to slice both tables by Date, we need to have a common table which we base the slicer on. Our model should look like the following diagram:

We do not have this relationship, and trying to add one directly between the two big tables fails because we do not have a column with distinct values. This is the most obvious showstopper I can see when considering a single large table.

Size

When we have one table, memory utilisation seems very good. However, when we have more than one of these behemoths in memory, we can reasonably expect a problem: unnecessary duplication of attribute data, as we have to store the likes of our Product and Customer Names a number of times in memory. This is especially true for high-cardinality attributes – that is, where we have many distinct values, like in my sample Order Number attribute (20+ million distinct values). In such cases separating these into their own “dimension table” would save memory and disk space.

DAX

How these models compare when building DAX calculations is a topic on its own and I will soon show some comparisons which should answer two questions – which is the most convenient and intuitive model to work with and which one is the fastest. For now it would suffice to say that always working over a large set of values in different tables could be expected to be the slowest (however, I have not done sufficient testing to confirm this yet).

In conclusion, when doing ad-hoc analytics over some extracts which would definitely not need to be mixed up with others, the single table approach works very well with PowerPivot. It certainly extends the functionality Excel offers natively. However, if we are building extensible models, which are to be shared, enhanced and would form the heart of our Team BI, we should avoid the single table because of the other considerations listed above.

PowerPivot Data Modelling for Performance

PowerPivot allows us more flexibility than SSAS UDM when it comes to data modelling. We are no longer constrained to dimensions and facts. As Thomas Ivarsson showed us in his PowerPivot Schema Flexibility post, we can avoid certain data manipulations when building a model in PowerPivot. The same holds true for relational database scenarios, as well. We can write SQL queries and analyse data without complying with any particular model. However, we have the concepts of a data mart and a star schema, which are not based on SSAS UDM but rather the other way around. In this post I will compare a normalised and a denormalised model in PowerPivot and reason around the relevance of data modelling and denormalisation with Microsoft in-memory technologies.

I prepared two SQL Server databases: PTest_Normalised and PTest_Denormalised. First I populated the normalised model with random data and then I moved that into the denormalised one. The two data sets are identical with the small exception of the date key, which in the normalised one is datetime, while in the denormalised model I converted it to a meaningful integer.

The specifics of each database are as follows:

PTest_Normalised

PTest_Denormalised

The largest table in the models is the table containing Order Details. It contains exactly 50 million rows. The next largest is Order Header (or Order in the second model), which contains 20 million rows. The rest are significantly smaller with the next largest being the Customer table with 1 million rows.

After setting up the databases, I imported them into PowerPivot. Since I have implemented foreign keys and consistent names for my columns, PowerPivot detected and created all relationships for me as they should be (see the diagrams above). Some applicable measurements of the two models in PowerPivot:

PTest_Normalised

5.2 Gb RAM
3.3 Gb File

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

What this tells us is that there is not much of a difference between the two when stored in memory and on disk. The denormalised model is slightly larger, but the difference is not even close to what we have in the relational database. This is no doubt due to the fact that with columnar compression in memory the dimension key columns in the fact table are so well compressed that the advantages of normalisation are mostly imperceptible. Interestingly, if we want to save disk space and memory, normalising the model will do little to help us. Existing data marts not fitting in memory will not get much better if normalised, and based on this case study I would advise against normalising them as a solution to the problem (while buying more RAM would definitely help).

Once the two models are in PowerPivot we can start the fun and build some pivot tables.

First I added my only measure SalesAmount to the Values area and placed YearName in Slicers Vertical, MonthName and DateName on rows:

When I clicked on a particular year in the slicer the execution times were:

PTest_Normalised

Cold (first time click): 14s

Warm (subsequent clicks on the same item): 3s

PTest_Denormalised

Cold: 2s

Warm: <1s

There is a large intermediary table (OrderHeader) in the normalised model, which would have influenced these results. To avoid this problem, I tried slicing the Sales Amount measure by Products. I modified the pivot table for my next test by substituting product for date:

The results:

PTest_Normalised

Cold: 4s

Warm: 3s

PTest_Denormalised

Cold: 4s

Warm: 3s

Clearly, avoiding the large intermediary table helped with performance. Here, however, we see that if we have small chained relationships, performance is not all that bad. Therefore, snowflake schemas seem to be quite alright with PowerPivot. In my observations, most of the 3-4s response time went on retrieving the long lists of Subcategories and Product Names. If I eliminate this:

PTest_Normalised

Cold: <1s

Warm: <1s

PTest_Denormalised

Cold: <1s

Warm: <1s

Chains seemingly do not matter much, as long as the tables in them do not contain too many rows. How many is too many? Well, it depends on your case, but I would be cautious with more than 1 million rows, unless your hardware allows it. Of course, if you examine my normalised model, you will see that because a lot of relationships go through the Order Header table, slicing by many of the “dimension” attributes results in very sluggish query responses. Speaking of hardware, all my testing was done on a laptop with 8Gb RAM, an i5 460M 2.53GHz CPU and a 7200 RPM hard disk.

In conclusion, I think that the results from my little test case show that data marts are not irrelevant even when considering the speeds of in-memory technologies. Reading Kimball and understanding the concepts behind data marts will be important as data is data, and an optimal data model does play a significant (if not the most significant) part in a PowerPivot model just like in a relational database. While the memory required for importing the data in PowerPivot is almost the same, performance can be greatly reduced when modelling is not taken seriously. The flexibility which PowerPivot offers us is great, but in some scenarios it could potentially be a problem – especially when the users decide to serve themselves with too much data.

Microsoft BI TechCenters: Learning and Resources

Apart from subscribing to the blogs and reading the books in the Sites and Blogs and Books sections of this blog you can also visit the following TechCenters on TechNet and MSDN:

SSAS – Multidimensional Data

http://msdn.microsoft.com/en-us/sqlserver/cc510300.aspx

SSAS – Data Mining

http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx

PowerPivot for Excel

http://technet.microsoft.com/en-us/bi/ff604673.aspx

There you can find a good collection of whitepapers, articles and links to various resources.