Load Testing BI Solutions – When?

This year I came across two very different BI projects which had the common non-functional requirement to prove that they would handle an expected spike in the report generation load. Funnily enough, in both cases the project teams got very concerned and came up with wildly inaccurate predictions of how many concurrent users we should be testing for. In the first case the problem was with the perception of “thousands of users”, while in the second, the team interpreted “monthly users” as “concurrent users”. The annoying part was that in the first case the team planned on building an ultra-massively overcomplicated queuing system to handle those spikes, and in the second case they were thinking of completely scrapping the ad-hoc functionality in the solution and resorting to report extracts distributed by email. These unreasonable load expectations led to bad design choices – this is why it is important to remain calm and first check whether there is a problem at all.

Firstly, let’s agree that we are measuring report requests. To begin, we need to know how many requests we get per period of time (e.g. a month) and how long it takes to generate a report. A typical scenario would be:

  • 1,000,000 report requests per month
  • 2 seconds to generate a report on average

What we need to do now is apply a bit of math:

1,000,000 / 20 = 50,000 requests per working day on average (assuming 20 working days in a month)

50,000 / 8 = 6,250 requests per hour (8 hours in a working day)

Since a report takes 2 seconds to generate, one concurrent report stream can produce 1,800 reports in an hour (3,600 / 2). Therefore, to serve 6,250 requests per hour we need 6,250 / 1,800 ≈ 3.47 concurrent users on average. Of course, this would only be the case if the load were spread perfectly evenly. In reality it is not – we will have peaks and dips in usage. A moderate peak is typically around 3x the average, while a heavy one is around 6x the average. To ensure that we can handle such peak periods, we should multiply our average concurrent users by 3 or by 6, depending on our load analysis. Let’s assume a very high peak load of 3.47 * 6 = 20.82, or approximately 21 concurrent users. This is the number we need to test for in our case. Note that we started with 1,000,000 report requests per month, yet even in our highest peak we expect only 21 concurrent users. I have not actually had a project where we expected such a load (in both cases which prompted me to write this post we had between 2,000 and 10,000 users per month).
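
Written as a single formula (this simply restates the arithmetic above, with 20 working days, 8 working hours per day, 1,800 reports per hour and a peak factor of 6):

$$\text{peak concurrent users} = \frac{\text{requests per month}}{\text{working days} \times \text{hours per day} \times \text{reports per hour}} \times \text{peak factor} = \frac{1{,}000{,}000}{20 \times 8 \times 1{,}800} \times 6 \approx 21$$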

The moral of the story – don’t panic. In most reporting projects the user load is not high enough to warrant a full-scale load testing exercise; next time you hear talk of one, instead of rushing to cover unreasonable scenarios, do the math first and confirm that there is actually a problem to solve.

DataMarket Updates: Speed, Portal and DateStream

It has been an eventful week for the Azure DataMarket. We had three new and exciting (for geeks like me) things happening in that corner of the Microsoft universe:

1. Speed!

There was an update to the Azure DataMarket a few days ago. It was, in my opinion, the best thing Microsoft could have done to their offering – tremendously increasing its performance. While the DataMarket was previously plagued by unacceptably slow download speeds, it is now, by data feed standards, blazingly fast. For comparison’s sake, prior to the update I used to wait more than 40 minutes to download a feed of approximately 70k rows. Now it lands on my machine in around 5 minutes – an 8-fold increase in performance! Rumour has it that on faster-than-my-home-ADSL2+ networks we will be seeing up to 20x better performance. It would be good to hear whether this is actually the case for developers on such networks (please comment).

Next, range queries, hopefully…

2. Portal

Until a couple of days ago, anyone who wanted to publish data on the DataMarket had to contact the Microsoft team via email and ask how to get it done. We have now moved into the self-service space with a new portal that allows publishers to create and manage their feeds. The link to the new portal is:

https://publish.marketplace.windowsazure.com/

And, you can find some very helpful documentation about it here:

http://msdn.microsoft.com/en-us/library/windowsazure/hh563871.aspx

3. DateStream

Finally, I am proud to announce that the great DateStream feed has been translated into four more languages:

Hebrew and Danish – thanks to Rafi Asraf

German

Bulgarian

The Italian translation (thanks to Marco Russo) is coming soon too, but unfortunately missed this release.

Feel free to explore them and let me know if anything needs to be changed to make them more correct/useful.

SSAS: Multiple SQL Queries in ROLAP Mode

Just recently I was working on a project where I had to build an SSAS ROLAP cube on top of a badly built data mart – one with multiple referential integrity (RI) issues. Most importantly, the designers ignored the very basic principle that every dimension key in the fact table must be present in the respective dimension table. In MOLAP mode, SSAS catches such mismatches during processing. However, when a partition is in ROLAP storage mode, we get no notification that anything is wrong and cube processing succeeds. This has consequences at query time, which I will illustrate in this post before showing a solution. Before I begin, I must say that if it wasn’t for the help of Akshai Mirchandani (from the Microsoft SSAS dev team) and Greg Galloway, I would probably have spent quite some time figuring out what was happening. Thanks to them the problem got solved quickly and I got to understand the reason behind the behaviour.

In terms of set-up, I created two tables in SQL Server: Dim and Fact. The Dim table contained two members, A and B, with keys of 1 and 2. Initially, the Fact table had two rows referencing the Dim table – Dim keys of 1 and 2, and a measure column called Amount with 1.0 and 2.0 as the amounts corresponding to A and B. No issues here. After that I created an SSAS solution corresponding to this simple dimensional model, switched the partition storage for the cube to ROLAP and processed the SSAS database. I then ran the following query, which I used for all subsequent examples.
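
A query of this general shape is enough to reproduce the behaviour (the cube name Test and the attribute naming below are placeholders for the simple Dim/Fact model described above):

SELECT
    {[Measures].[Amount]} ON COLUMNS,
    [Dim].[Dim].MEMBERS ON ROWS
FROM [Test]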

The result was as expected – one row for each of the two dimension members, with amounts of 1.00 for A and 2.00 for B.

At the same time I had a SQL Server Profiler trace running, which showed that SSAS executed a single SQL query to retrieve the data from the fact table. Nothing unusual thus far.

To spoil the party, I added one more row to the fact table with a dimension key of 3 and an Amount of 3. Since I did not add a matching row to the dimension table, this broke referential integrity – with a foreign key constraint between the fact and the dimension tables in place I would not have been able to do it. After clearing the SSAS cache, I ran my query again. This time the query came back with an error instead of clean results.

The actual error was, of course, a missing key. I was not surprised when I saw this on my original project. However, looking at Profiler we see a “weird” sequence of events:

SSAS runs multiple queries which all result in errors. In this particular case we can see four ExecuteSQL events, each followed by an error in a ReadData event. In the real world this scenario can get many times worse (on my original project we saw 4,667 queries run against the relational database in a few minutes), leading to a really significant drop in performance.

So, what is happening? According to Akshai, SSAS encounters an error while processing the results from the initial SQL query and tries to recover by sending more queries. In some cases this can result in the error appearing in the result set for only some of the cells.

Luckily, there is an easy way out of this situation (thanks to Greg for providing the tips). SSAS can automatically create an “unknown bucket” for each dimension and assign to it all measure values which do not correspond to an existing dimension member. To get this behaviour, we must adjust the error configuration of each affected partition.

The important setting is KeyErrorAction, which must be ConvertToUnknown rather than DiscardRecord (the alternative). This must be coupled with setting up each “incomplete” dimension to include an Unknown member by changing its UnknownMember property.

It does not matter whether the UnknownMember is Visible or Hidden, as long as it is not None.

Back to our scenario. After setting these properties on the dimension and the partition, I processed the SSAS database again and executed the query. This time the orphaned amount showed up against the dimension’s Unknown member, and the Profiler trace contained just a single SQL query – the multiple queries were gone.

If we do not want to see the Unknown amount in the cube, we can suppress it with a scope assignment.
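
A minimal sketch of such an assignment (assuming the measure is Amount, the attribute is [Dim].[Dim] and the unknown member keeps its default Unknown name) would be:

SCOPE([Measures].[Amount], [Dim].[Dim].[Unknown]);
    THIS = NULL;
END SCOPE;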

Coupled with making the UnknownMember Hidden, we can completely obliterate the traces of our underlying RI issues – unless our users check the numbers, but then we can always blame whoever designed the data mart! 🙂

Melbourne SQL Server Social Event: Short Notice!

A short notice for SQL Server enthusiasts/professionals in Melbourne. Stephen Few is in town and will attend the SQL Server Social Event tomorrow (17 Nov) at the Sherlock Holmes Inn in the CBD (415 Collins Street). Feel free to come and meet the community, talk about SQL Server and information visualisation, and possibly get your Stephen Few books signed.

Link for the event: http://sqlserversocial.eventbrite.com/

See you there!

Alternate Ordering of Attributes in SSAS

Sometimes we need to display attribute members in SSAS in an order different from the one given by their own name or key. For this purpose we have the option of ordering an attribute by the name or key of another attribute. However, in some cases changing the order may break calculation logic which depends on the initial order. The new ordering may also make the MDX harder to write, as some functions of the language are easier to use (at least conceptually) when thinking of sets in ascending order. The best example to illustrate this problem is the Date dimension. While in most, if not all, cases the Date dimension is ordered in ascending order, sometimes users prefer to see the most recent date first and ask us to change the order to descending. Doing so invalidates many time intelligence calculations like rolling and parallel periods, etc. Furthermore, fixing those requires inverting numbers to negative, or avoiding the use of functions like ClosingPeriod. All in all, a “small” change can lead to a big problem. We can, however, accommodate our ignorant users (who unknowingly get the benefit of reading default time series charts backwards – from right to left – when dragging and dropping descending dates in Excel, for example) without changing too much in our scripts. A little trick in the modelling can help, and it is the reason for writing this post.

Let’s have a look at a simple Date dimension with one attribute – Date. Nothing unusual: the Date attribute is ordered by its key (an integer in this case), with its name coming from another column in the Date table – DateName. When we create a simple slice in Excel, the dates appear in ascending order, as expected.

Now we create a measure Rolling 3 Days Amount, which sums the amounts for the last three days (the current date and the two days before it).

The MDX for this calculation uses the Lag function to reach back two days from the current date.
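
A calculated member of roughly this shape (using the Date attribute and the Amount measure from the example model) implements it:

CREATE MEMBER CURRENTCUBE.[Measures].[Rolling 3 Days Amount] AS
    Sum(
        [Date].[Date].CurrentMember.Lag(2) : [Date].[Date].CurrentMember,
        [Measures].[Amount]
    );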

If we simply invert the order of the Date attribute by ordering it by another column in our Date table which contains DateKey * -1, and then refresh the Excel pivot table, the rolling measure returns different numbers.

This is simply incorrect. A relatively small change in the MDX script can help with this issue (e.g. changing the Lag to Lead); however, in many cases we do not want to rebuild all the measures. Luckily, we can employ a different tactic. Instead of changing the script, we can change the structure of our dimension by adding an additional attribute which is not exposed to the users (i.e. it is hidden). This attribute is based on the same column we use for our Date attribute, but is not ordered by the descending column. We can rename the original attribute (the one exposed to the users) to something like Date Desc, or a more user-friendly option, while the new, hidden attribute takes the original Date name and keeps the original ascending order.

Everything else stays the same – our cube script does not need to be adjusted and its logic remains correct: the pivot table shows the dates in descending order, while the rolling measure again returns the right amounts.

A different approach could be to leave the old attribute named Date, so that no change is necessary for reports which depend on the naming. This, however, requires a change to the cube script, which can easily be performed using the BIDS Replace functionality (e.g. Ctrl+H).

Note that for this approach to work we need to make sure that the attribute exposed to the users is the dimension key attribute, as changing its current member results in the (infamous) attribute overwrite, where its related attributes (which are above it in the relationship chain) also change. If we expose a non-key date attribute instead, our MDX logic will break, as changes to its current member will not affect the attributes below it in the chain (in fact, it will set them to their All member).
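
A quick way to see the overwrite in action (the cube name and the member key format here are only illustrative) is to slice by the exposed Date Desc attribute and check the current member of the hidden Date attribute:

WITH MEMBER [Measures].[Current Hidden Date] AS
    [Date].[Date].CurrentMember.Name
SELECT
    {[Measures].[Current Hidden Date]} ON COLUMNS
FROM [Test]
WHERE ([Date].[Date Desc].&[20110101])

With Date Desc as the key attribute, this returns the sliced date rather than the All member, which is exactly what the rolling calculation relies on.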

Range Queries with Azure DataMarket Feeds

By default the Azure DataMarket does not allow range queries. In fact, the only way we can filter a data feed is by specifying one or more values for its “queryable” fields. There is no technical reason for not allowing range queries, as both the back-end (presumably SQL Azure, or SQL Server) and the OData protocol support them. Fortunately, there is a way to consume a range of the values in a column of a data feed in PowerPivot. It is not straightforward, and I do not think the target audience of either self-service BI/PowerPivot or the DataMarket itself would appreciate the complexity, but it could be useful anyway.

If we want to pull all three tables from the DataMarket, we can simply use https://api.datamarket.azure.com/BoyanPenev/DateStream/ as the URL in PowerPivot.

Otherwise, we can pick each one individually with a URL like the following (for the BasicCalendarEnglish table):

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish

If we filter the data set on the DataMarket website to request only the data for 2010 we get the following URL:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20eq%202010

Note the last bit:

?$filter=YearKey%20eq%202010

This is simply the URL-encoded form of the OData filter ?$filter=YearKey eq 2010 – in other words, a filter for YearKey = 2010, with eq standing for “equals” and %20 for a space.

In OData we can also use operators other than = (eq). For ranges these are gt (greater than), ge (greater than or equal to), lt (less than) and le (less than or equal to). We can also use the and and or logical operators to combine predicates. For a more thorough list, please refer to http://www.odata.org/developers/protocols/uri-conventions. If we replace the “= 2010” with “< 2010” (lt instead of eq) and then encode the URL, we do indeed get all years prior to 2010. Things get slightly more complicated in a more complex scenario. For example, when building a date table we may want to include all years between 2000 and 2030. To do that, we would have to write something like:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey >= 2000 and YearKey <= 2030

Encoded, the same URL looks like this:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20ge%202000%20and%20YearKey%20le%202030

Here each space is encoded as %20 and the mathematical comparison operators >= and <= have been replaced with the OData operators ge and le.

If we paste this in PowerPivot and hit “Next”, we get exactly what we expect – a table containing the dates for the years 2000 through 2030.

Things get more complicated if we include the datetime DateKey in the URL. For a single date (e.g. 1900-01-01), we have to use:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime'1900-01-01T00:00:00'

After applying URL encoding we get:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27

Here %27 is an apostrophe and %3a is a colon (for a list of ASCII characters and their URL-encoded form we can refer to http://www.w3schools.com/tags/ref_urlencode.asp).

Now, to combine the two we would need to write:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime'1900-01-01T00:00:00' or (YearKey >= 2000 and YearKey <= 2030)

Encoded this becomes:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27%20or%20%28YearKey%20ge%202000%20and%20YearKey%20le%202030%29

This monstrous-to-write URL string returns the full 2000–2030 range of data plus one extra day (1900-01-01).

I suppose this approach can be classified as a workaround, as I have not seen any PowerPivot documentation referring to options for filtering data from the Azure DataMarket. In my opinion, however, this should be a feature of the DataMarket itself: it would make it easier (or even just possible) for users of any tool to get just the data they need, and it could reduce the load on the service, since it would no longer be necessary to export everything and then attempt to apply a filter.

Introducing Project DateStream (CodePlex)

I recently blogged about The Case for an Azure DataMarket Date Table. I finished that post with a bit of a critique of the DataMarket team at Microsoft, for which I can now wholeheartedly apologise. Since then I have been contacted by Max Uritsky, a Group Program Manager on the Windows Azure Marketplace DataMarket team (a long name for a team, yes). He and Belinda Tiberio helped me create and host a new Date feed. Not only did they help to make it available for free on the DataMarket website, they also gave me a free 1 GB SQL Azure database for the project. A big “thank you” also goes to Julie Strauss from the SSAS team for making the contact.

To summarise, the DateStream project is a free date table available as a feed and intended to be used by PowerPivot BI users. As most, if not all, BI projects include a date table, the goal is to provide an easy-to-use, correct and simple mechanism for creating such tables.

After some deliberation on the format of the feed, we decided that it would be best to split it into a number of localised Basic tables and one Extended table. Currently we have only two Basic versions (US and EN) and a fairly straightforward Extended one. However, the plan is to include more and more columns in the Extended table and to provide extra localised (Italian, French, Spanish, German, Russian, etc.) tables. When I say “we” I mean the fellow SSAS professionals with whom I had the pleasure of discussing this idea (among them Marco Russo, Thomas Kejser, Willfried Faerber and Grant Paisley).

The CodePlex page of the project contains more details about the feed and the tables and also allows for commenting on existing features, as well as requesting new ones. It can be found at http://datestream.codeplex.com/.

The actual feed can be used directly from https://datamarket.azure.com/dataset/1542c52d-d466-4094-a801-9ef8fd7c6485.

Note the logo – it was created by Daniele Perilli from SQL BI with assistance from Marco Russo. Thanks to them we did not have to resort to my graphic design skills, which is definitely a win for the DataMarket website.

One note – please let me know if the performance you get from the feed is not satisfactory (and remember that once pulled, the feed does not need to be refreshed, as the data remains valid forever). If many people find it too slow I could potentially host it in an alternative location as well. As a workaround, it is also possible to download a CSV version from the DataMarket, which additionally allows removing unnecessary date ranges.