An ETL Framework from BI Monkey

Today I found out about a cool ETL framework being developed by my colleague in Sydney – James Beresford, a.k.a BI Monkey. He is also writing a blog, which I have seen and read in the past.

Check out his take on what an ETL should look like. He told me that the current version (v1) is suitable for smaller projects and that a new one v2 is coming out soon – so if you like v1 definitely stay posted for v2! And in case you have suggestions – he is pretty cool to chat to and very keen to get the framework further, so don’t hesitate to contact him!

Do Business Analysts make good dimensional modellers??

Recently I had the (dis)pleasure of working with Business Analysts, who also thought that they are good in dimensional modelling. so, I had to implement BI solutions (including cubes) on top of their database design. I will show an example (about 95% the same as the actual design), where the idea of letting BAs go into dev territory does not yield the best results:

 

This “dimensional model” was created by an experienced BA. Some “features” are missing here:
1. The fact table had EffectiveFrom and EffectiveTo dates
2. The relationships between some Dim Tables were 1-1 ?!
3. The Time dim (the only one properly implemented on its own – on the bottom of my example) had columns like: DateTimeName nvarchar(100), DateTimeKey nvarchar(100), YearName nvarchar(100), etc..
4. The Some Tables on the top had nothing to do with the rest (in fact a colleague of mine reckons they are there to fill in the white space on the top of the A3 printout)

Another design, which is better, but still pretty bad showed up after my training on Dimensional Modelling (1hr to go through EVERYTHING, including M2M relationships, Parent-Child hierarchies, Type 2 dimensions, etc):

Obviously, the designer (a developer actually) did grasp some concepts. However, my explanation of a star schema must have been not too clear..

Hope that you had some fun with these two diagrams..and I am sure many developers get in a similar situation, especially when someone else designs their databases. But two points:

1. Ask the BAs to analyse the business and their requirements – not to design the database
2. 1 hour of training on dimensional modelling will not make you an expert

Re: Re: MDX ORDER Riddle

Firstly, a Thank You to Tomislav Piasevoli for his answer. Tomislav, I was surprised not to find you on my blogroll (a bad miss on my side) – something which I corrected a few minutes ago.

 I usually prefer not to think in SQL terms when it comes to MDX but I will agree that the first part of Tomislav’s answer makes sense and does illustrate the concept quite well. I was looking for a more “programatical” explanation, which can be found in the second bit. I think that it is worth re-iterating on the concept of context. For the unaware – if you do not explicitly mention an attribute member in an intersection, and there is no reference to a member anywhere in your query, it will get included with its “default member” in that intersection. In most cases this actually defaults to the All member.

 However, I think that the actual behaviour of ORDER is inconsistent in this case. We should not need to include explicitly the current member from the set being ordered, and the current member being evaluated in the SSAS internal loops should be part of the query context in all cases. This seems to be taken care of when the attributes are not directly related. I would imagine that when they are it is easier, and in most cases more efficient to just pull the value for the aggregate Subcategory (in our example), which gets summed up to the related attribute upwards (Category) anyway, however, there is no guarantee that this would be correct (again – as we can see in our example). I think that the ORDER function should either get better documented, or it should get changed to always include the currently evaluated member in the value tuple we are ordering by (seems natural to sort a set by the values for that set members).

 I am still a bit unconvinced that this “by design” behaviour is actually correct, but my understanding of the SSAS ways increased a bit, so I guess I would be able to tackle similar problems a bit easier in the future. Also, Tomislav, the last paragraph sounded mysteriously interesting and I am looking forward to your next post on the subject.

Update: If you follow the comments on this post you will find a very good article (link provided by Darren Gosbell) on Attribute Relationships , which explains this issue in depth. Also, Chris Webb just posted an article on his blog about a similar issue – similar in its cause rather than its symptoms.

MDX ORDER Riddle

I just found this question on MSDN forums:

I wonder if anyone has run into this before.

I’m trying to produce a report in which all rows are sorted by a specific
column. For example the following works fine:

WITH
SET [RowSet] AS [Product].[Subcategory].Levels(1).Members
SET [ColumnSet] AS [Date].[Calendar].Levels(1).Members
Select
Order([RowSet], [ColumnSet].Item(0), BDESC) ON ROWS,
[ColumnSet] ON COLUMNS
FROM [Adventure Works]

All product sub-categories are sorted by CY 2001.

However when the rows and columns of the report contain related non-leaf
attributes of the same hierarchy then the sort does nothing, as illustrated
in this example:

WITH
SET [RowSet] AS [Product].[Subcategory].Levels(1).Members
SET [ColumnSet] AS [Product].[Category].Levels(1).Members
Select
Order([RowSet], [ColumnSet].Item(0), BDESC) ON ROWS,
[ColumnSet] ON COLUMNS
FROM [Adventure Works]

Is such sort possible at all, may be using different MDX functions, or is
this a known limitation for related attributes?

Interestingly enough it works when the leaf-level attribute
([Product].[Product]) is mixed with a non-leaf level attribute (e.g.
[Product].[Category]).

If you think about it for a while, you’ll see that it is not easy to comprehend what is happening here at a first glance. Some light is shed on it by Deepak Puri on MSDN forums – please mark his answer as useful if you think it is:

http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/thread/cde6f082-6dbd-4368-a20a-e64ea21de2bc

You could also write:

 WITH
SET [RowSet] AS [Product].[Subcategory].Levels(1).Members
SET [ColumnSet] AS [Product].[Category].Levels(1).Members
Select
Order([Product].[Subcategory].Levels(1).Members AS a, ([ColumnSet].Item(0), a.CurrentMember), BDESC) ON ROWS,
[ColumnSet] ON COLUMNS
FROM [Adventure Works]

So, the ordering tuple is defaulting to the All member of the Subcategories – therefore, we are trying to order the RowSet by the same value. If we do that on a non-directly related attribute, the ordering expression is in fact getting evaluated within the current RowSet member..which seems natural. So all in all, it looks like a peculiarity in the ORDER function – most likely because it would make some sense in terms of performance. Please let me know if you know why it happens.

Star-Join Optimisation – Prerequisites

A colleague of mine asked recently: Do we need to have foreign keys between our fact tables and dim tables in order to take advantage of the new SQL Server 2008 star-join optimisation. I decided to ask the question at the MSDN Forums. Just now I got a reply and I thought it may be good to share with everyone:

Charles Wang – MSFT

To use star join, indexes are required on tables. FK constraints are not necessary, but it is recommended that you have FK constraints defined since without it SQL Server must depend on heuristics to detect star schema query patterns. There are some restrictions for heuristics to determine a start join query. And it may not pick up a correct start join query plan under some cases. You can find the detailed information in the section “Star Join Heuristics” of this article, Data Warehouse Query Performance. The following is an extraction of the content:

Many physical designs for data warehouses follow the star schema but do not completely specify the relationships between the fact and dimension tables, as mentioned earlier for foreign key constraints, for instance. Without the foreign key constraints explicitly specified, SQL Server must depend on heuristics to detect star schema query patterns. The following heuristics are applied to detect star join query patterns

1. The largest of the tables participating in the n-ary join is considered the fact table. There are additional restrictions on the minimum size of the fact table. For instance, if even the largest table is not beyond a specific size, the n-ary join is not considered a star join.

2.All join conditions of the binary joins in a star join query have to be single column equality predicates. The joins have to be inner joins. While this might sound restrictive, it covers the vast majority of joins between the fact table and dimension tables on the surrogate key in typical star schemas. If a join has a more complex join condition that doesn’t fit the pattern described above, the join is excluded from the star join. A five-way join, for example, can lead to a three-way star join (with two additional joins later on), if two of the joins have more complex join predicates.

There we go – so we need indexes, while FK constraints would be good to have but not necessary. I also wrote to the CSS SQL Server Team, so if they come back with more information, I will update this post.