Using the DSV to its Full Potential

The Data Source View (DSV) in Analysis Services is a powerful abstraction of the data source, and it can help us handle a number of scenarios in an easy and clean way. We often look for MDX or programmatic solutions to problems that are best tackled in the data itself. While for complex tasks we are better off extending the ETL process, some simple ones can and should be implemented in the DSV.

As an introduction to the topic, I would like to explain briefly what the DSV actually is. It can be conceptualised as a database view on top of the data source. By default, all tables we need for building the Analysis Services database (typically dimensions and facts) appear in the DSV as table bindings (exactly as if we did a SELECT * FROM Table). If we have no foreign keys defined in our database, SSAS will not show any relationships in the DSV. However, we can define logical relationships in the DSV, connecting the tables on related columns, which are then used to automatically determine the dimension relationships to the measure groups.

There are two important ways to modify the DSV, which allow us to add more columns to the existing tables and to modify the way the existing columns are shown:

Named Queries

If we right-click on a table in the DSV, we can choose to replace the table with a Named Query. A Named Query is essentially a T-SQL statement, equivalent to a database view definition. By utilising Named Queries we can alter the way we see the tables and their columns in SSAS – for example, we could concatenate columns, implement CASE logic, and so on.
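
As a minimal sketch (the table and column names below are hypothetical), a Named Query replacing a product dimension table could look like this:

-- Hypothetical Named Query replacing the DimProduct table binding:
-- it concatenates code and name, and derives a size band with CASE.
SELECT
    ProductKey,
    ProductCode,
    ProductName,
    ProductCode + ' - ' + ProductName AS ProductLongName,
    CASE
        WHEN Size < 10 THEN 'Small'
        WHEN Size < 50 THEN 'Medium'
        ELSE 'Large'
    END AS SizeBand
FROM dbo.DimProduct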

 

Named Calculations

A Named Calculation is a SQL expression which adds a column to a table without modifying the table binding. It gives us an easy way to define a new column without replacing the whole query. The expression defining the column is written in T-SQL and behaves the same way as a derived column in a Named Query (or a SELECT statement). If we just want to add one more column (e.g. a Display Order, a Code+Description concatenation, etc.), we can simply define a Named Calculation. Also, as the name suggests, Named Calculations are commonly used for defining a leaf-level calculation without moving a large fact table's SELECT statement into a Named Query.
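
For instance, assuming a hypothetical FactSales table with Quantity and UnitPrice columns, the Named Calculation itself is just the expression for the new column – SSAS effectively appends it to the table's SELECT:

-- Named Calculation [SalesAmount] on FactSales (only the expression is entered):
Quantity * UnitPrice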

 

The column we define here appears both in the DSV table and in the Dimension Designer window.

These two DSV features can be used in many scenarios. Most importantly, there are a few in which they yield better performance, faster development and easier maintenance:

Leaf-level calculations

If we have the common requirement to perform leaf-level calculations and then aggregate the results up the hierarchy, as opposed to aggregating first and then calculating, the best place to do this is a SQL statement on the fact table. Alternatively, we can do it with an MDX expression:

 SUM(DESCENDANTS(Dim.CurrentMember,,LEAVES), MeasureCalc)

However, this comes at a price. Since SSAS has to perform the calculation for each leaf and then sum the results up the hierarchy, queries can take a long time. SSAS is also unable to use pre-computed aggregations for such a calculation, so all the work happens at query time. To avoid this we can add a new column to the fact table and do the calculation there (in SQL), using the column as a new measure in the cube, which SSAS can then aggregate like any other measure. The performance gain is usually substantial, and using a Named Query or a Named Calculation should be the preferred option.
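
As a sketch, assuming a hypothetical FactSales table with Quantity, UnitPrice and UnitCost columns, the leaf-level calculation can be done per row in the Named Query and exposed as a regular Sum measure:

-- Hypothetical Named Query for the fact table: the margin is computed
-- per row in SQL, so SSAS can aggregate it like any other measure.
SELECT
    DateKey,
    ProductKey,
    SalesAmount,
    Quantity * (UnitPrice - UnitCost) AS MarginAmount
FROM dbo.FactSales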

Description Attributes

Often we need to concatenate different dimension attributes into a Description attribute, which we can use while slicing the cube or when providing reports from the SSAS database. A very easy way to achieve this is to concatenate the columns we need into a new column in the dimension table in the DSV, and then expose it as a new attribute of the dimension. A task such as concatenating an Account Code and an Account Description into an Account Long Description (i.e. [Account Code] + '-' + [Account Description]) becomes very easy to implement within the DSV, without modifying the ETL or any tables.
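
A minimal sketch, assuming a DimAccount table with [Account Code] and [Account Description] columns – either as a Named Calculation expression or inside a Named Query:

-- As a Named Calculation on DimAccount (expression only):
[Account Code] + ' - ' + [Account Description]

-- Or as part of a Named Query replacing DimAccount:
SELECT
    AccountKey,
    [Account Code],
    [Account Description],
    [Account Code] + ' - ' + [Account Description] AS [Account Long Description]
FROM dbo.DimAccount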

Composite Keys

Sometimes we need to build unique keys for attribute columns in a dimension. A good example is a Date dimension, which does not have unique keys for non-leaf levels such as Month. Developers often use a Month Key of 1 to 12. This does not make a good Month key in SSAS, because it is not unique across higher levels such as Year and Quarter. There are a number of ways to tackle this common scenario. While the recommended approach is to build a Year-Quarter-Month concatenation as the Month Key in the dimension table, we can also select all of the relevant columns as KeyColumns for the attribute in the dimension attribute properties. However, this gives us a composite key in MDX, which can sometimes be undesirable. A simpler and cleaner solution is to concatenate the relevant columns in the DSV by using a Named Query. Instead of the typical

SELECT col1, col2, ..., MonthKey, colx, coly, colz FROM DimDate

we can write

SELECT col1, col2, ..., CAST(YearKey AS varchar(4)) + CAST(QuarterKey AS varchar(1)) + CAST(MonthKey AS varchar(2)) AS MonthKey, colx, coly, colz FROM DimDate

This way we can use the MonthKey column directly as a key for our Month attribute.

While this example uses a Date dimension, the same approach works for any other composite key definition in our dimensions.

Other possible applications of DSV Named Queries and Named Calculations include the implementation of:

  • A Sort Order attribute, in cases when we need a custom sort of the dimension attributes (see the sketch after this list)
  • Restricting the data which comes into the cube dynamically, based on a certain condition (think of a Date dimension which includes only relevant periods)
  • Combining tables – by a SQL join
  • Replacing 0s with NULLs for our measures (the opposite can be done automatically in SSAS)
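
A minimal sketch of the first two points and the last one, with hypothetical DimDate and FactFinance columns (the names and the restriction rule are made up for illustration):

-- Custom sort order and a dynamic restriction in a DimDate Named Query
-- (assumes an integer DateKey in YYYYMMDD format):
SELECT
    DateKey,
    MonthName,
    MonthNumberOfYear AS MonthSortOrder          -- Sort Order attribute
FROM dbo.DimDate
WHERE DateKey <= CAST(CONVERT(char(8), GETDATE(), 112) AS int)  -- only relevant periods

-- Replacing 0s with NULLs for a measure in a fact table Named Query:
SELECT
    DateKey,
    NULLIF(Amount, 0) AS Amount
FROM dbo.FactFinance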

Basically, in a DSV we can “correct” our data to make it suitable for our cube without changing the ETL.

Last but not least, we can also transform tables to conform to a star-schema-like design. If we want to show a proof of concept on top of a normalised OLTP database, we can avoid the ETL complexities, as well as building a data mart, and instead use SQL to join or split tables into dimension and fact tables suitable for cube development. While this could also work beyond the POC stage, it is better to take a cautious approach, as there are many scenarios in which it would either not work or would be too slow.
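
As an example of the idea only – the table and column names below are hypothetical – a Named Query could flatten a few normalised tables into a dimension-shaped result:

-- Hypothetical: flatten normalised Customer/City/Country tables
-- into a denormalised customer dimension for the POC cube.
SELECT
    c.CustomerID  AS CustomerKey,
    c.CustomerName,
    ci.CityName,
    co.CountryName
FROM dbo.Customer AS c
JOIN dbo.City     AS ci ON ci.CityID = c.CityID
JOIN dbo.Country  AS co ON co.CountryID = ci.CountryID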

And a word of warning – the DSV can become slow through over-use of complex Named Queries. This can be painful when minimising cube processing time is crucial, or when the DSV starts timing out and queries take hours to execute. Luckily, in most cases we can simply push these large queries back upstream – into the ETL, where we have more time and better tools (e.g. SSIS).


Average Aggregation in Analysis Services

In SSAS we do not have an Average aggregation type for measures. We do have AverageOfChildren (average over time), however it is semi-additive and works only along a Time dimension. Fortunately, we have Sum and Count, and since Average = Sum / Count, we can build our own Average aggregation when we need one.

To do that: 

  1. Create a measure using the Sum aggregation type (which is also the default). In our example, let’s call it Amount.
  2. Create a Count of Non-Empty Values (or Count of Rows) measure – in our example, [Measure Count].
  3. Create the actual calculation – [Measures].[Amount]/[Measures].[Measure Count]

We can either create a calculated measure, which performs the MDX calculation above: 

CREATE MEMBER CURRENTCUBE.[Measures].[Average Amount]
AS
[Measures].[Amount]/[Measures].[Measure Count]
,NON_EMPTY_BEHAVIOR = {[Measures].[Measure Count]}
,VISIBLE=1; 

Or, if we do not really need the base Sum measure on its own, we can replace it with the calculation by using a SCOPE statement:

SCOPE([Measures].[Amount]);
  This = [Measures].[Amount]/[Measures].[Measure Count];
NON_EMPTY_BEHAVIOR(This) = [Measures].[Measure Count];
END SCOPE; 

Voila! We have created a measure simulating an Average aggregation type. We can then hide the Count helper measure, and from the users' point of view there is no evidence of our effort.

Since the count, whenever it is not empty, can never be 0, we do not need an explicit "if not 0 then divide, else do not" check, and the NON_EMPTY_BEHAVIOR property may in fact improve performance, since the calculation is not performed at all when the Count measure is NULL (instead of resulting in NULL/NULL = NULL).

Mosha has previously blogged about NON_EMPTY_BEHAVIOR and division by zero and I strongly recommend reading his post. 

Another important consideration, which depends on the business scenario, is the type of the Count aggregate: it can be a count of rows (Row Bound) or a count of non-empty values (Column Bound). The difference is whether we want to include or exclude the empty values from our aggregate. Either way, the technique described above works equally well.
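
In relational terms the distinction is roughly COUNT(*) versus COUNT(column) in T-SQL (the fact table and column below are illustrative):

-- Count of rows vs count of non-empty values:
SELECT
    COUNT(*)      AS CountOfRows,      -- Row Bound: counts every row
    COUNT(Amount) AS CountOfNonEmpty   -- Column Bound: ignores NULL Amounts
FROM dbo.FactSales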

I realise that this is a well-known approach, but since it is hard to find the solution online I thought it may be interesting for some less-experienced developers.

Pre-Aggregated Data and Changing Dimensions

Normally when working with data we assume that we have it at the lowest possible grain, and we are challenged by the need to aggregate it along various hierarchies. By the nature of BI, we get to work with large sets of detailed data collected by an existing system.

But what happens when we are given the aggregations and need to work with those instead? Recently Nick Barclay and I were involved in such an implementation. Nick designed a Health and Safety Dashboard, and I then built it with Reporting Services, PerformancePoint, Analysis Services and SQL Server 2005 relational databases. We were told at the beginning of the project that the aggregation business rules were so complicated that rebuilding them for the purposes of the dashboard was well out of scope. I had no prior experience with pre-aggregated data and did not foresee a major problem – one which became apparent after a few months of development, when the Business Unit hierarchy changed significantly.

Aggregation Rules

In contrast with typical business scenarios, where an SSAS aggregation function like Sum, possibly with some custom roll-ups, works perfectly well, the Health and Safety data needed to be aggregated in a complex way, and the organisation we built the dashboard for had already invested in a system managing these aggregations. For example, a simple rule would be: if a Business Unit has more than 1 major incident, it gets an Amber score for Safety Compliance; if it has more than 4, it becomes Red. Its parent follows the same rule – if the sum of major incidents across its descendants is greater than 1 it becomes Amber, and with more than 4 it becomes Red. There were also quite a few dependencies between various dimensions, and reluctantly we agreed to work with the data the way it was. The following diagram shows the way an aggregate for Major Incidents works:

[Diagram: Major Incidents aggregation]

Japan doing badly on Major Incidents also makes Asia and Global look bad.
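
As a rough illustration of the simplified rule above (the table and column names are hypothetical), the roll-up to a parent Business Unit could be expressed along these lines:

-- Hypothetical roll-up of Major Incidents to a parent Business Unit:
-- more than 1 incident = Amber, more than 4 = Red.
SELECT
    bu.ParentBusinessUnitKey,
    CASE
        WHEN SUM(f.MajorIncidents) > 4 THEN 'Red'
        WHEN SUM(f.MajorIncidents) > 1 THEN 'Amber'
        ELSE 'Green'   -- assumed default; the rule above only defines Amber and Red
    END AS SafetyComplianceScore
FROM dbo.FactIncidents AS f
JOIN dbo.DimBusinessUnit AS bu ON bu.BusinessUnitKey = f.BusinessUnitKey
GROUP BY bu.ParentBusinessUnitKey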

The problem

The actual problem comes from the fact that our dimension hierarchy can change, and if it does, the aggregations no longer make sense historically. Suppose we have a change in our hierarchy:

[Diagram: a change in the Business Unit hierarchy]

and we may need to generate the following simple report for before and after the change:

[Figure: a sample Major Incidents report before and after the change]

The historical data suggests that something was wrong with China as the only child of Asia before the change in hierarchy, while in fact the culprit then (and now) was Japan. As we cannot see how the hierarchy looked before, we cannot analyse the data we have accordingly. Nor can we compare data for Asia after the change with data for Asia from before the change along the new hierarchy.

Possible solutions

In my project the client actually agreed that it was acceptable to have this handicap in the system, and a possible solution was a complete historical rebuild of the aggregations, followed by a complete reload of the system data, after every hierarchy change. Luckily for us, the data set was not too big and was not expected to grow much.

Another solution would be to use a Slowly Changing Dimension, so we can show the hierarchy as it was when the aggregations were made – this would improve the analytical value of the data because business users would be able to see why a Business Unit like Asia was Red in the context of an outdated hierarchy.
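
A minimal sketch of what a Type 2 (slowly changing) Business Unit dimension might look like, with hypothetical columns:

-- Hypothetical Type 2 Business Unit dimension: each hierarchy version
-- gets its own row, so aggregations can be tied to the hierarchy
-- that was in effect when they were produced.
CREATE TABLE dbo.DimBusinessUnit
(
    BusinessUnitKey       int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key per version
    BusinessUnitCode      varchar(20)  NOT NULL,          -- business key
    BusinessUnitName      varchar(100) NOT NULL,
    ParentBusinessUnitKey int          NULL,              -- parent in the hierarchy
    ValidFrom             datetime     NOT NULL,
    ValidTo               datetime     NULL               -- NULL = current version
);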

The best solution would be to build our own aggregations and work with those, as that gives us the ultimate flexibility to represent the data in exactly the way the business users need it; unfortunately, it is sometimes very hard to convince them that spending twice the funds on the same problem is a good thing.