Improving performance and maintainability by keeping logic out of our reports

As Reporting Services allows us to implement quite complex logic, report authors are often tempted to build a lot of it into their reports. Having some simple calculations there does not generally cause harm; simple additions and subtractions specific to a report can reasonably live in it. However, I have seen aggregations, complex queries and code hacks built in Reporting Services that could have been implemented far more easily in Analysis Services or SQL Server.

There are a few distinct benefits we can get from keeping logic out of our reports.

  • Firstly, keeping our MDX scripts and SQL queries (stored procedures) in the back-end allows us to reuse them in different reports without having to painfully synchronise every copy with each change. It also lets business users who do not know MDX work with complex measures, increasing the value of the BI solution for them.
  • Secondly, the database server is usually maintained by people with SQL or MDX skills, who can analyse and optimise the scripts there instead of relying on report authors to do so. Even when BI professionals maintain the whole solution, it is far more convenient to have most of the logic stored on the same server. Whether we are chasing a performance bottleneck or a bug, it is far easier to fix by tracing cube calculations or stored procedures than logic buried in a report.
  • Sending large amounts of data between the back-end server and SSRS increases the stress on the network, which can become critical when multiple requests returning millions of rows are passed to it.
  • Caching works in our favour when different reports reuse the same stored procedures or MDX calculations.
  • And a last point: a database or OLAP server is simply faster at calculations than SSRS.

In general, using stored procedures as SQL Server data sources instead of building queries in SSRS, or keeping MDX calculations in the OLAP cubes, is the first step towards avoiding excessive code accumulation in SSRS. Avoiding anything more complex than simple sums and groupings in the reports, and moving that logic to stored procedures or MDX calculations, is the next. Finally, even something as simple as an Actual – Budget variance measure should be built in MDX or SQL rather than in the reports. Following these simple rules and staying disciplined when building reporting solutions improves their performance, maintainability and even usability.
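As a minimal illustration (the table and column names here are hypothetical), even an Actual – Budget variance can live in a view or stored procedure that the report simply consumes:

-- A sketch only: the variance is computed once in the database, so every report
-- selecting from this view shows the same figure without report-level expressions.
CREATE VIEW vw_Budget_Variance
AS
SELECT Business_Unit_Id
     , Date_Id
     , Actual_Amount
     , Budget_Amount
     , Actual_Amount - Budget_Amount AS Variance_Amount
FROM Fact_Finance

The report then only displays Variance_Amount instead of re-implementing the subtraction in an expression.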


Post-mortem thoughts on PerformancePoint Planning and the future of Microsoft BI planning and forecasting

As everyone must be well aware by now, Microsoft has discontinued PerformancePoint Planning and has decided to merge Monitoring & Analytics into SharePoint as PerformancePoint Services. Chris Webb and Nick Barclay have already blogged about this and given us some valuable thoughts and explanations on the subject.

In addition to what has already been said, I would like to add that dumping Planning may not be such a great loss to anyone. Its current market penetration is marginal and the successful implementations are not that many anyway. I have seen companies consider PP Planning and then abandon it because of the complexities involved, which translate directly into a high implementation cost comparable to a custom .NET build.

From a simplistic technological point of view, planning and forecasting means allowing users to input data, manipulating that data according to business rules, and then adding it to the BI system so that the past can be analysed and compared with the future. Currently, we can do this either by building a custom application which handles all of this, or by using a third-party application that handles it for us. I have had the chance to be involved in each scenario (once with a team of .NET developers and a few times with Calumo, which allows cube write-back or stored procedure write-back from Excel). The difficulties always come from the knowledge needed to accurately gather requirements, which is obscured by layers of business logic, and from the completely different nature of planning and forecasting in comparison with analytics.

Analytics, or core BI, is based on the presumption that we already have reliable data which our business clients want to make sense of, thus gaining insight into their business. Planning and forecasting, in contrast, also involves allowing these same users to record their thoughts about the business and their projections about the future, and then analyse those just like their historical data. Therefore, planning and forecasting is more complex than pure analytics.

There is no tool in the Microsoft BI stack which completely covers the requirements of a typical business planning scenario. PerformancePoint Planning tried to encapsulate the planning logic in a set of models and rules, but it was too complex for both users and developers to understand, implement and then maintain. I have seen a number of successful planning and forecasting implementations with a third-party tool – Calumo. It is a fairly simple application (at least in comparison to PP Planning) which, apart from some quite handy improvements over Excel for report authoring, has the very powerful ability to let users input data straight back to their OLAP source (cube write-back) or to their data warehouse (stored procedure write-back). That is all that is needed for a planning and forecasting project, and Microsoft should really have a look at what their partners are offering as planning solutions instead of developing monstrosities like PerformancePoint Planning.

Building on top of SQL Server stored procedures and Analysis Services write-back, Microsoft can easily enhance their BI offering. All we need is a way to access this functionality through a front-end tool like Excel, SharePoint or Reporting Services.
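For example, a minimal sketch (with hypothetical table and parameter names) of the stored procedure write-back approach could be as simple as:

-- A sketch only: a procedure a front-end tool could call to write a planning
-- figure back to the warehouse. Each change is stored as a new row so that it
-- remains traceable.
CREATE PROCEDURE usp_Writeback_Planned_Amount
      @Scenario_Id int
    , @Account_Id int
    , @Date_Id datetime
    , @Amount money
    , @User_Name nvarchar(255)
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO Fact_Planned_Amount
        (Scenario_Id, Account_Id, Date_Id, Amount, Entered_By, Entered_On)
    VALUES
        (@Scenario_Id, @Account_Id, @Date_Id, @Amount, @User_Name, GETDATE());
END

Everything else – validation, business rules and security – can then be layered on top of procedures like this one.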

Note: A couple of techniques which may be useful for planning and forecasting implementations are discussed in these posts:
Spreading Non-Transactional Data Along Time
Moving writeback data in the Fact tables and avoiding problems with changing column names

Speed up MDX query preparation in Reporting Services

We often need to build some advanced functionality into our reports, and since the Reporting Services query designer does not provide a very advanced way to create MDX statements, our only option is to get our hands dirty and write some code ourselves. One of the things we must consider is the time the query designer needs to validate our syntax, execute the query and determine which fields to create based on the data set it receives from Analysis Services.

When we specify a number of query parameters, we have the option to give them default values:

[Image: setting default values for the query parameters in the MDX query designer]

If we do not do this, the query will be executed with the default dimension member instead of a specific parameter value, which can be slow and unnecessary in the usual case where the default member is the (All) member. Instead, we should always specify a value to limit the sample data set. Ideally, it should be small but not empty: if the parameter is too limiting and the MDX query does not return any rows, the data fields will not be populated.

Specifying default values speeds up development, especially against complex cubes, where Reporting Services queries can be slow because the dimension cross joins they generate can be very large.

Chris Webb has also blogged about this issue. I strongly recommend you read his post.

Moving writeback data in the Fact tables and avoiding problems with changing column names

While writeback functionality has changed significantly in SQL Server Analysis Services 2008, where writeback data can be stored in MOLAP within the cube itself, in SSAS 2005 the writeback values are stored in a relational table on the same server as the fact tables. When writeback is enabled for a partition, a new table is automatically created with a WriteTable prefix. Its structure is fairly simple: it contains a column for each measure, a column for each dimension key, and two audit fields.

The ROLAP nature of the writeback table makes it inefficient for storing a large number of writeback records, and it is sometimes necessary to consolidate the data it contains into the fact table.

Normally we can write a stored procedure to do this for us. Because the values in the WriteTable are deltas, there is a new row for each user change. For example, if we change 0 to 5, there will be one row in the writeback table showing 5 as the measure value. If we then change the new value of 5 to 2, there will be a new row with a measure value of -3. Therefore, it can be more efficient to aggregate the values in the WriteTable while moving them into the fact table. This may, however, be contrary to our requirements if we want to be able to trace all data changes.
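As a minimal sketch (using illustrative column names and ignoring for a moment the numeric suffixes discussed below), such an aggregation could look like this:

-- A sketch only: collapse the per-change delta rows into one net row per
-- dimension key combination before moving them into the fact table.
SELECT Indicator_Id
     , Region_Id
     , Scenario_Id
     , Date_Id
     , SUM([High]) AS [High]
     , SUM([Low]) AS [Low]
     , SUM([Amount]) AS [Amount]
FROM [WriteTable_Indicator Amount]
GROUP BY Indicator_Id
       , Region_Id
       , Scenario_Id
       , Date_Id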

In either case, we end up with a number of new rows which we can insert into our fact table, after which we can truncate the WriteTable and process our cube. There is a potential pitfall here. If we do not set up the processing settings properly, we could destroy the WriteTable and have it re-created, which in turn introduces another pitfall – SSAS may change our column suffixes. For example, if we have a fact table with the following definition:

CREATE TABLE [Fact_IndicatorAmount](
[Fact_IndicatorAmount_Id] [int],
[ETL_Date] [timestamp],
[Indicator_Id] [int],
[Region_Id] [int],
[Scenario_Id] [int],
[Date_Id] [datetime],
[High] [float],
[Low] [float],
[Amount] [float]
)

The WriteTable may be created like this:

CREATE TABLE [WriteTable_Indicator Amount](
[High_0] [float],
[Low_1] [float],
[Amount_2] [float],
[Indicator_Id_3] [int],
[Region_Id_4] [int],
[Scenario_Id_5] [int],
[Date_Id_6] [datetime],
[MS_AUDIT_TIME_8] [datetime],
[MS_AUDIT_USER_9] [nvarchar](255)
)

Note how the column names match the fact table column names but are suffixed with _0, _1 and so on. Unfortunately, these suffixes may change when the WriteTable is re-created; SSAS tends to assign them randomly. If that happens, our consolidation stored procedures will break.

The obvious step to avoid this is to set up our cube processing correctly, making sure that the WriteTable does not get re-created. To do this, we can select Use Existing writeback table in the Change Settings… dialog, which allows us to change cube processing settings:

[Screenshot: the Change Settings… dialog with the Use existing writeback table option selected]

We can also script this action and use it in our automated cube processing SQL Server job.

Even though this is a relatively intuitive and simple solution, I have always had problems with it because of manual cube processing performed by power users, which destroys the writeback data together with the WriteTable structure and, consequently, breaks the code in my stored procedures.

Using some dynamic SQL and the SQL Server system tables, we can write a stored procedure which does not depend on the suffixes of the column names in the writeback table:

CREATE PROCEDURE [usp_Consolidate_WriteBack_to_Facts]
AS
BEGIN
SET NOCOUNT ON;

-- Variables to hold the actual (suffixed) column names of the writeback table
DECLARE @Column_High nvarchar(50),
@Column_Low nvarchar(50),
@Column_Amount nvarchar(50),
@Column_Indicator nvarchar(50),
@Column_Region nvarchar(50),
@Column_Scenario nvarchar(50),
@Column_Time nvarchar(50)

-- Look up each column name in the system tables, so the procedure keeps working
-- even if SSAS re-creates the WriteTable with different suffixes
SET @Column_High = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'High%'
)

SET @Column_Low = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'Low%'
)

SET @Column_Amount = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'Amount%'
)

SET @Column_Indicator = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'Indicator%'
)

SET @Column_Region = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'Region%'
)

SET @Column_Scenario = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'Scenario%'
)

SET @Column_Time = (
SELECT syscolumns.name
FROM sysobjects
INNER JOIN syscolumns
ON sysobjects.id = syscolumns.id
WHERE sysobjects.xtype = 'U'
AND sysobjects.name like 'Write%'
AND syscolumns.name like 'Date%'
)

-- Build the INSERT dynamically, using whatever the writeback columns are currently called
DECLARE @SQL_Command nvarchar(4000)
SET @SQL_Command = ('
INSERT INTO [Fact_IndicatorAmount]
([High]
,[Low]
,[Amount]
,[Indicator_Id]
,[Region_Id]
,[Scenario_Id]
,[Date_Id])
SELECT ' + @Column_High + '
,' + @Column_Low + '
,' + @Column_Amount + '
,' + @Column_Indicator + '
,' + @Column_Region + '
,' + @Column_Scenario + '
,' + @Column_Time + '
FROM [WriteTable_Indicator Amount]')

EXEC (@SQL_Command)

-- Once the rows are moved, empty the writeback table
TRUNCATE TABLE [WriteTable_Indicator Amount]
END

What we are effectively doing here is getting the column names from the WriteTable and then constructing an INSERT statement based on them. It would be dangerous to automate this further with a loop over all columns, because the actual column names in the WriteTable can differ from those in the fact table. This can happen when the dimension table key names differ from the fact table key names.

Moving writeback rows through this stored procedure ensures that even if the WriteTable for a partition is re-created for some reason, our code can handle it.

Check your spelling in SSAS 2008 MDX queries

I just found out that because of a wrong or inconsistent spelling of a dimension or attribute name in SSAS 2008, the server may return the following message:
The set must have a single hierarchy to be used with the complement operator.

To fix the issue, we must ensure that all the dimensions in our queries are spelled exactly the way they are named in the cube (surprise?)…

I encountered this message after first building a Reporting Services report with data coming from SSAS. Then a colleague misspelt a dimension name: instead of Natural Code he typed Natrual Code. The next time I needed to run the query in Query Designer I got the message. The query indeed used a complement operator on that dimension, excluding a particular account by its Natural Code. The query was something like:

SELECT {
[Measures].[Amount]
} ON COLUMNS,
{
([Account].[Major Account].ALLMEMBERS *
[Employee].[Employee Name].ALLMEMBERS)
} ON ROWS
FROM (
SELECT (-{[Account].[Natural Code].&[990]}) ON COLUMNS
FROM [Finance]
)

Reporting Designer in BIDS 2008 Column Visibility Bug

A small issue with BIDS I just found out about: if you want to modify the visibility expression of a column in Business Intelligence Development Studio 2008, you may find that the Fields list is empty and that Report Designer cannot see a dataset binding for the column. Despite that, if you type in a Fields!<field_name> expression and get a red curvy line indicating something is wrong with it, you will still be able to deploy the report, and the visibility of the column will be toggled by the expression. So it actually works, but it can be annoying.

Note (2009-02-07): This appears if you have a tablix within a tablix and the outer tablix does not have a dataset associated with it.

Changes in SQL Server 2008 sysadmin group

There are some noteworthy changes in the way SQL Server 2008 handles security, apart from the single major improvement – the replacement of the Surface Area Configuration tool by Policy-Based Management.

One thing that surprised me today was that even though I was a Domain Admin and a member of the local Administrators group, SQL Server 2008 refused to let me log in. A login had to be explicitly created before I could access the instance. After some research, I found out that in SQL Server 2008 local Windows administrators are no longer mapped to the sysadmin role automatically. Therefore, it is possible to get locked out of a server instance if there are no sysadmins on it. This is a feature which separates SQL Server admins from Windows admins more clearly.
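If we want an administrators group mapped explicitly, a quick sketch (with a hypothetical group name) of the SQL Server 2008 commands is:

-- A sketch only: create a login for a Windows group and add it to the sysadmin
-- role, since local Administrators are no longer members by default.
CREATE LOGIN [MYDOMAIN\SQL Admins] FROM WINDOWS;
EXEC sp_addsrvrolemember @loginame = N'MYDOMAIN\SQL Admins', @rolename = N'sysadmin';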

A further note on this topic: I would not have lost a small SQL Server war on a Dev environment recently if we had been using SQL Server 2008 instead of SQL Server 2005. Now, being a Domain Admin does not necessarily win the battle for SQL Server permissions.

There is a TechNet page describing SQL Server 2008 Security Changes for further reference.

And another one, helping in case all system administrators are locked out.

Spreading Non-Transactional Data Along Time

In some cases we need to be able to analyse non-transactional data for discrete periods along a time dimension. An example of such a case is a collection of invoices which have start and end dates for a period but are not otherwise connected to a time axis. Such invoices may have these properties:

Invoice Id
Start Date
End Date
Amount

One of the invoices may be:

Invoice Id: 34821432
Start Date: 2008-10-15
End Date: 2009-03-14
Amount: 15,000.00

and another one:

Invoice Id: 34934221
Start Date: 2008-12-01
End Date: 2009-05-30
Amount: 6,500.00

If the company we are building this for deducts a daily fee for its services (e.g. funds management or software maintenance), we may need to spread the amount over smaller periods, such as months or days, and then aggregate the smaller amounts along a time dimension.

To do this, we first have to store the data in a relational table and then write some SQL to do the trick for us.

First, we create a table-valued function which returns all the dates at a specified granularity (days in our case) between the Start and End dates, together with the count of all the periods in between (here, a count of days):

CREATE FUNCTION udf_Create_Daily_Date_Spread
(
      @Start_Date datetime
    , @End_Date datetime
)
RETURNS @Daily_Spread TABLE (
      Date_Id datetime
    , Count_Of_Days int
)
AS
BEGIN
    DECLARE @Count int
    SET @Count = 0

    -- Nothing to spread if the period is not valid
    IF @Start_Date >= @End_Date
        RETURN

    -- One row for each day between the start and end dates (inclusive)
    WHILE @Start_Date <= @End_Date
    BEGIN
        INSERT INTO @Daily_Spread(Date_Id)
        SELECT @Start_Date

        SET @Start_Date = DATEADD(d, 1, @Start_Date)
        SET @Count = @Count + 1
    END

    -- Store the total number of days so the caller can divide the amount evenly
    UPDATE @Daily_Spread
    SET   Count_Of_Days = @Count

    RETURN
END

After creating this function, we can use CROSS APPLY to produce the even spread:

SELECT inv.Invoice_Id
    , inv.Start_Date
    , inv.End_Date
    , cdds.Date_Id
    , inv.Amount / cdds.Count_Of_Days AS Daily_Amount
FROM Invoice_Source inv
CROSS APPLY udf_Create_Daily_Date_Spread(inv.Start_Date, inv.End_Date) cdds

After running the sample data through this code, we will get an even spread for both invoices and we will be able to attach a time dimension to them.

Even though the data size may explode after such a manipulation, Analysis Services handles even the largest data sets very well. If storage is a problem, we can always choose to break the data down into fewer, coarser periods – weeks or months instead of days.
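For example, a sketch reusing the daily function above to collapse the spread to calendar months before loading it:

-- A sketch only: aggregate the daily spread up to month level to reduce row counts.
SELECT inv.Invoice_Id
     , DATEADD(month, DATEDIFF(month, 0, cdds.Date_Id), 0) AS Month_Start
     , SUM(inv.Amount / cdds.Count_Of_Days) AS Monthly_Amount
FROM Invoice_Source inv
CROSS APPLY udf_Create_Daily_Date_Spread(inv.Start_Date, inv.End_Date) cdds
GROUP BY inv.Invoice_Id
       , DATEADD(month, DATEDIFF(month, 0, cdds.Date_Id), 0)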

Pre-Aggregated Data and Changing Dimensions

Normally when working with data we assume that we have it at the lowest possible grain and we are challenged by the need to aggregate it along various hierarchies. By the nature of BI, we get to work with large sets of detailed data collected by an existing system.

But what happens when we are given the aggregations and need to work with those instead? Recently Nick Barclay and I were involved in such an implementation. Nick designed a Health and Safety Dashboard and I then built it with Reporting Services, PerformancePoint, Analysis Services and SQL Server 2005 relational databases. We were told at the beginning of the project that the aggregation business rules were so complicated that rebuilding them for the purposes of the dashboard was way out of scope. I had no previous experience with pre-aggregated data and did not foresee a major problem, which became apparent after a few months of development when the Business Unit hierarchy changed significantly.

Aggregation Rules

In contrast with typical business scenarios, where an SSAS aggregation function like Sum plus some custom roll-ups works perfectly well, the Health and Safety data needed to be aggregated in a complex way, and the organisation we built the dashboard for had already invested in a system managing these aggregations. For example, a simple rule would be: if a Business Unit has more than 1 major incident, it gets an Amber score for Safety Compliance; if it has more than 4, it becomes Red. Its parent follows the same rule – if the sum of major incidents across its descendants is greater than 1 it becomes Amber, and with more than 4, Red. There were also quite a few dependencies between various dimensions, and reluctantly we agreed to work with the data the way it was. The following diagram shows the way an aggregate for Major Incidents works:

[Diagram: Major Incidents Aggregation]

Japan doing badly on Major Incidents also makes Asia and Global look bad.

The problem

The actual problem comes from the fact that our dimension hierarchy can change, and if it does, the aggregations no longer make sense historically. Suppose we have a change in our hierarchy:

[Diagram: the Business Unit hierarchy before and after the change]

and we may need to generate the following simple report for before and after the change:

[Report: sample Major Incidents figures before and after the hierarchy change]

The historical data suggests that something was wrong with China as the only child of Asia before the change in hierarchy, while in fact, the culprit then (and now) was Japan. As we would not be able to see how the hierarchy looked before, we would not be able to analyse the data we have accordingly. Also, we cannot compare data for Asia after the change to data for Asia from before the change along the new hierarchy.

Possible solutions

In my project, the client actually agreed that it was alright to have this handicap in the system; a possible solution was a complete historical rebuild of the aggregations, followed by a complete reload of the system data, after every hierarchy change. Luckily for us, the data set was not too big and was not expected to grow much.

Another solution would be to use a Slowly Changing Dimension, so we can show the hierarchy as it was when the aggregations were made – this would improve the analytical value of the data because business users would be able to see why a Business Unit like Asia was Red in the context of an outdated hierarchy.
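A minimal sketch of such a dimension (the column names are hypothetical) simply adds validity dates to each Business Unit row, so facts and aggregates can be tied to the version of the hierarchy that was current when they were produced:

-- A sketch only: a type-2 slowly changing Business Unit dimension. A hierarchy
-- change closes the old row (Valid_To) and inserts a new one, preserving the
-- historical parent.
CREATE TABLE Dim_Business_Unit (
      Business_Unit_Skey int IDENTITY(1,1) PRIMARY KEY
    , Business_Unit_Code nvarchar(50)
    , Parent_Business_Unit_Code nvarchar(50)
    , Valid_From datetime
    , Valid_To datetime NULL -- NULL marks the current version of the row
)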

The best solution would be to build our own aggregations and work with those, as that gives us the ultimate flexibility to represent the data exactly the way the business users need it; unfortunately, it is sometimes very hard to convince them that spending twice the funds on the same problem is a good thing.

Ordering Dimensions: Recursion vs Loops in T-SQL

Recently I had to build a function, which returned dimension members from a parent-child relational table. I needed to be able to order the members hierarchically and I decided to use a recursive CTE call.
Usually, we have a parent-child dimension table with a structure similar to this:
Dim_Business
Business_Skey int
Parent_Business_Skey int
 
with a root node with Skey of -3 and Parent_Business_Skey of -3.
Let’s assume we have a hierarchy like this:
All (Id = -3, P_ID = -3)
- GLOBAL (Id = 1, P_ID = -3)
-- Europe (Id = 2, P_ID = 1)
--- UK (Id = 3, P_ID = 2)
--- France (Id = 4, P_ID = 2)
--- Spain (Id = 5, P_ID = 2)
-- North America (Id = 6, P_ID = 1)
--- USA (Id = 7, P_ID = 6)
--- Canada (Id = 8, P_ID = 6)
--- Mexico (Id = 9, P_ID = 6)
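To make the example reproducible, the table and hierarchy above could be created with a script like this (a sketch containing only the two key columns the ordering logic needs):

-- A sketch only: the minimal Dim_Business table and the sample hierarchy used below.
CREATE TABLE Dim_Business (
      Business_Skey int NOT NULL
    , Parent_Business_Skey int NOT NULL
)

INSERT INTO Dim_Business (Business_Skey, Parent_Business_Skey)
SELECT -3, -3 UNION ALL -- All
SELECT  1, -3 UNION ALL -- GLOBAL
SELECT  2,  1 UNION ALL -- Europe
SELECT  3,  2 UNION ALL -- UK
SELECT  4,  2 UNION ALL -- France
SELECT  5,  2 UNION ALL -- Spain
SELECT  6,  1 UNION ALL -- North America
SELECT  7,  6 UNION ALL -- USA
SELECT  8,  6 UNION ALL -- Canada
SELECT  9,  6           -- Mexico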
 
If we do something like:
SELECT Business_Skey
FROM Dim_Business
ORDER BY Parent_Business_Skey ASC, Business_Skey ASC
 
We will get:
-3 (All)
1 (GLOBAL)
2 (Europe)
6 (North America)
3 (UK)
4 (France)
5 (Spain)
7 (USA)
8 (Canada)
9 (Mexico)
Obviously, this ordering is incorrect, because we want to see the leaf nodes under their respective parents.
We can recursively build an ordering key which concatenates the parent ids:
WITH b_ord(bid, bord)
AS
(
SELECT  Business_Skey AS bid
 , CONVERT(nvarchar(1000), Business_Skey) AS bord
FROM Dim_Business
WHERE Business_Skey = -3

UNION ALL

SELECT  Business_Skey AS bid
 , CONVERT(nvarchar(1000), bord + '|' + CONVERT(nvarchar, Business_Skey))
FROM Dim_Business db
 INNER JOIN b_ord bo
 ON db.Parent_Business_Skey = bo.bid
WHERE db.Business_Skey <> bo.bid
)
SELECT *
FROM b_ord
ORDER BY bord ASC
 
The result of the CTE query is:
-3 -3 (All)
1 -3|1 (GLOBAL)
2 -3|1|2 (Europe)
3 -3|1|2|3 (UK)
4 -3|1|2|4 (France)
5 -3|1|2|5 (Spain)
6 -3|1|6 (North America)
7 -3|1|6|7 (USA)
8 -3|1|6|8 (Canada)
9 -3|1|6|9 (Mexico)
 
and the order is correct.
 
Because the code needed to go into a function invoked by a number of stored procedures, a .NET application and various reports, I needed it to be quick and light. As some dimensions had a large number of members (50,000+), which could grow over time, the code had to be implemented carefully. So, I decided to compare the recursive CTE function to a WHILE loop with a temporary table implementation:
DECLARE @c int
DECLARE @num_of_nodes int

SET @num_of_nodes = (SELECT  COUNT(*) FROM Dim_Business)

CREATE TABLE #order(
  skey int
 , ord nvarchar(1000)
 , lvl int
)

INSERT INTO #order
SELECT  Business_Skey
 , CONVERT(nvarchar(1000), Business_Skey)
 , 1
FROM Dim_Business
WHERE Business_Skey = -3

SET @c = 2

WHILE @c > 0
BEGIN
 INSERT INTO #order
 SELECT  Business_Skey
 , CONVERT(nvarchar(1000), ord + '|' + CONVERT(nvarchar, Business_Skey))
 , @c
 FROM Dim_Business db
 INNER JOIN #order o
 ON db.Parent_Business_Skey = o.skey
 AND o.lvl = @c - 1
 WHERE db.Business_Skey <> o.skey

 SET @c = @c + 1
 
 IF (SELECT COUNT(*) FROM #order) = @num_of_nodes
 SET @c = 0
END

SELECT  skey AS bid
 , ord AS bord
FROM #order
ORDER BY ord ASC

DROP TABLE #order
 
Comparing the results in Client Statistics and the IO reads, the recursive CTE query performed more than 20% worse than the WHILE loop query. It also trailed the non-recursive query in the counts of logical reads, read-ahead reads and scan counts.
It seems that in SQL Server 2005 a recursive CTE does not perform as well as processing the hierarchy level by level with set-based inserts inside a WHILE loop.