Inferred Members Implementation Best Practices

Just last week a colleague of mine and I did some very thorough research into the best implementation practices for Inferred Members. We went through a few web sites and also looked at previous SSIS implementations. There are quite a few resources on the subject, and the two best methods we found were based on Project REAL and a SQL CAT recommendation. In the end we agreed that we can recommend three approaches, each with its own advantages and disadvantages, all of which can be used successfully when we need Inferred Member (late arriving dimensions = early arriving facts) support in a solution.

Our findings can be summarised in the following table:

| Method | Pros | Cons |
| --- | --- | --- |
| Stored procedures | No need to use SSIS | Slow; complicated development; need to link source and target data server instances |
| Script Component (SSIS) | Fast; high reusability | .NET skills required |
| Double Lookup (SSIS) | Fast; high reusability | Performance issues with multiple dimensions |

I will go through each of these methods and provide some more details about them.

1. Stored Procedures

If SSIS data flow tasks are not an option for our ETL, we can implement inferred member support through SQL code. The basic idea behind this method is (a T-SQL sketch follows the steps below):

  1. Find new dimension members in the fact data and add them to the dimension
  2. Move the fact data from source to target (i.e. staging to datamart) with surrogate keys from the dimension table
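
A minimal T-SQL sketch of these two steps might look like the following. All table and column names (stg.FactSales, dw.FactSales, dbo.DimState, StateCode, StateSK, IsInferred) are illustrative, not prescribed:

-- Step 1: add fact members missing from the dimension as inferred members
INSERT INTO dbo.DimState (StateCode, StateName, IsInferred)
SELECT DISTINCT f.StateCode, N'Unknown', 1
FROM stg.FactSales AS f
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimState AS d WHERE d.StateCode = f.StateCode);

-- Step 2: move the fact data across, swapping natural keys for surrogate keys
INSERT INTO dw.FactSales (StateSK, SalesAmount)
SELECT d.StateSK, f.SalesAmount
FROM stg.FactSales AS f
INNER JOIN dbo.DimState AS d ON d.StateCode = f.StateCode;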

If we have our source and target data on separate instances of SQL Server, it becomes quite inconvenient to utilise this approach. We need to link the two instances in order to use tables from both of them in the same stored procedure. This is a major issue, and it is easily avoided by using SSIS.

2. SQL Server Integration Services

A general design of SSIS solutions can be represented with the following diagram:

[Diagram: general design of the SSIS inferred member solution]

We can implement this in the following ways:

2.1. Script Component (Project REAL)

The Data Flow task for this approach is:

[Diagram: Script Component data flow]

After looking up the incoming states against the dimension, we pass the misses to a Script Component, which caches them in a hash table and hits the database only when a genuinely new miss occurs. We have a stored procedure in place which simply adds the new dimension member to the dimension table and returns its new surrogate key, which then gets sent to the fact table.
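
A minimal sketch of what such a procedure might look like, assuming an illustrative DimState table with a StateCode natural key, an identity StateSK surrogate key and an IsInferred flag (the names are ours, not Project REAL's):

CREATE PROCEDURE dbo.AddInferredState
    @StateCode nvarchar(10),
    @StateSK   int OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Return the existing key if the member has already been added
    SELECT @StateSK = StateSK FROM dbo.DimState WHERE StateCode = @StateCode;

    IF @StateSK IS NULL
    BEGIN
        -- Add the inferred member with placeholder attributes
        INSERT INTO dbo.DimState (StateCode, StateName, IsInferred)
        VALUES (@StateCode, N'Unknown', 1);

        SET @StateSK = SCOPE_IDENTITY();
    END
END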

Using .NET we can efficiently hash all values in an object collection, and we can also handle both character and numeric data.

Because we are using a fully cached lookup, this data flow item is case-sensitive, so we should make sure we equalise the case in both the fact and the reference table data before we compare the two. We should also make sure that the Script Component itself is case-insensitive, because if it is not we will end up with multiple rows in our dimension table for each case variation of our inferred members.
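
One simple way to equalise the case (purely illustrative; applying UPPER on the join column in both queries) is to do it in the data flow's source query and in the lookup's reference query:

-- Source query feeding the data flow (illustrative names)
SELECT UPPER(StateCode) AS StateCode, SalesAmount
FROM stg.FactSales;

-- Reference query used by the fully cached lookup
SELECT StateSK, UPPER(StateCode) AS StateCode
FROM dbo.DimState;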

Additionally, the Script Component should either accept a parameter for the dimension name, or (as in our implementation) read its own task name and take the first word of it to determine the dimension it is used for. In the diagram above, the Script Component is handling the State dimension, therefore its name starts with "State – ". This makes implementing multiple Script Components for multiple dimensions very easy – all we need to do is change the task name and it just works. We can also add some .NET code for auditing purposes. This is fairly easy to implement, and entry-level .NET skills should be sufficient for development and maintenance. A sample of this approach can be found in Project REAL, and it is thoroughly discussed in the SSIS paper produced with it.

2.2. Double Lookup (SQL CAT)

[Diagram: Double Lookup data flow]

Here we are doing the same as before – matching fact rows against a State dimension. If a mismatch is found in the first (fully cached) lookup, we pass the row to the second one, New StateSK, which is a partially cached lookup. Its cache works much like a hash table – it holds the new rows it has already seen, and when a brand new mismatch arrives it adds it to the database by executing a stored procedure. Then we Union All our inferred members with the rest of the fact data.

Because the second lookup uses partial caching, it is case-insensitive, and case variations of the same character term will not lead to multiple dimension rows for the same member. However, the first, fully cached lookup is case-sensitive, so we should still make sure that the source and reference data are in the same case; otherwise rows that differ only by case will unnecessarily fall through to the second lookup. Furthermore, in SQL Server 2005 partial caching must have a specified memory limit, and if we have a large number of late arriving dimension members we may run out of memory for them. In that case SQL Server 2005 will start discarding the least used values from its cache, which may have performance implications. This problem is addressed in SQL Server 2008.

Another problem with this approach is the use of Union All tasks. These are semi-blocking and may impact performance when used multiple times in our ETL package.

For a much more detailed description, including samples, you can go to:

SQL CAT – Assigning surrogate keys to early arriving facts using Integration Services

From our testing and research we reached the conclusion that using a Script Component is the best approach, closely followed by the Double Lookup. The stored procedure approach is slow, hard to maintain and may be impossible to implement in a production environment. Using SSIS with .NET proves to be efficient, convenient and fast. If avoiding .NET is preferable, the Double Lookup in SSIS is a very good alternative.

Custom Rounding and Truncation of Numbers in MDX

Article published in SQL Server Central on 2009/03/26

In some scenarios we need to be able to round or truncate decimals to achieve correct calculation results. In SQL we have ROUND, which can do either of these: it rounds the way we are used to (0.5 rounds up to 1) and it can either round or truncate, so we rarely get a project where implementing our own rounding or truncation algorithm is part of the requirements.
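
For example, in T-SQL the third argument of ROUND, when non-zero, truncates instead of rounding:

SELECT ROUND(2.25, 1);     -- 2.30: arithmetic rounding, .5 rounds away from zero
SELECT ROUND(2.25, 1, 1);  -- 2.20: truncation
SELECT ROUND(-1.9, 0);     -- -2.0: rounding
SELECT ROUND(-1.9, 0, 1);  -- -1.0: truncation toward zero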

However, in MDX we have Round(), which performs a “strange” operation – bankers’ rounding, which our business users have usually not been exposed to; and if we decide to truncate to an integer by casting with Int or cInt, we also get some strange results. To illustrate the problem in MDX, please consider the values of these expressions:

Round(2.15, 1) = 2.2
Round(2.25, 1) = 2.2
Round(2.35, 1) = 2.4
Round(2.45, 1) = 2.4

Int(1.9) = 1
Int(-1.9) = -2
cInt(1.9) = 1
cInt(-1.9)= -2

These results are usually considered wrong simply because they are not the obvious ones. Even though they are mathematically well founded, if we round 2.25 to 2.2 our users will come back to us with wrong numbers on their reports. The same goes for “trimming” -1.9 to -2.

To resolve the first problem with rounding, we can use our own math formula:

Fix([Measures].[???] * Factor + 0.5 * Sgn([Measures].[???])) / Factor

Where Factor is the rounding factor – 1 for 0 decimal places, 10 for 1 decimal place and so on (the precision is 1/Factor). Of course, a Factor of 0 will give us a division by zero error. (Reference: http://support.microsoft.com/kb/196652)
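
As a quick check, rounding 2.25 to one decimal place (Factor = 10) with this formula gives the result business users expect, rather than the bankers’ rounding shown earlier, and it stays symmetric for negative values:

Fix(2.25 * 10 + 0.5 * Sgn(2.25)) / 10 = Fix(23) / 10 = 2.3
Fix(-2.25 * 10 + 0.5 * Sgn(-2.25)) / 10 = Fix(-23) / 10 = -2.3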

If we have the Excel function libraries installed on our server, we can also simply use Excel!Round(), as Chris Webb advises.

As for the Int and cInt in MDX, we can use the Fix() VBA function to remove decimal places:

Fix(1.9) = 1
Fix(-1.9) = -1

Also, for truncation of values to a certain decimal place in MDX, we can use the following formula:

Fix(<value>*10^1)/10^1

All we need to adjust in order to change the number of decimal places is the power of 10 – for example, to truncate 3.156 to 3.15 we can use: Fix(3.156*10^2)/10^2. To put it simply, in all our formulas the power of 10 determines how many decimal places we round or truncate to; negative powers will give us rounding to tens, hundreds and so on, and ^0 will round to whole numbers.

Using these techniques we can avoid bankers’ rounding and the strange results of converting to integers in MDX.