Do Business Analysts make good dimensional modellers??

Recently I had the (dis)pleasure of working with Business Analysts who also thought they were good at dimensional modelling, so I had to implement BI solutions (including cubes) on top of their database design. I will show an example (about 95% the same as the actual design) of why letting BAs venture into dev territory does not yield the best results:


This “dimensional model” was created by an experienced BA. A few of its “features” are worth pointing out:
1. The fact table had EffectiveFrom and EffectiveTo dates
2. The relationships between some of the dim tables were 1:1?!
3. The Time dim (the only one properly implemented on its own – at the bottom of my example) had columns like: DateTimeName nvarchar(100), DateTimeKey nvarchar(100), YearName nvarchar(100), etc.
4. Some tables at the top had nothing to do with the rest (in fact, a colleague of mine reckons they were there to fill in the white space at the top of the A3 printout)

Another design – better, but still pretty bad – showed up after my training on dimensional modelling (one hour to go through EVERYTHING, including many-to-many relationships, parent-child hierarchies, Type 2 dimensions, etc.):

Obviously, the designer (a developer, actually) did grasp some concepts. However, my explanation of a star schema must not have been too clear…

I hope you had some fun with these two diagrams… I am sure many developers find themselves in a similar situation, especially when someone else designs their databases. Two points:

1. Ask the BAs to analyse the business and their requirements – not to design the database
2. One hour of training on dimensional modelling will not make you an expert

Re: Re: MDX ORDER Riddle

First, a thank you to Tomislav Piasevoli for his answer. Tomislav, I was surprised not to find you on my blogroll (a bad miss on my side) – something I corrected a few minutes ago.

I usually prefer not to think in SQL terms when it comes to MDX, but I agree that the first part of Tomislav's answer makes sense and illustrates the concept quite well. I was looking for a more “programmatic” explanation, which can be found in the second part. I think it is worth reiterating the concept of context. For the unaware: if you do not explicitly mention an attribute member in an intersection, and there is no reference to a member of that attribute anywhere in your query, the attribute gets included in that intersection with its default member, which in most cases is the All member.
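
To illustrate with a minimal sketch against the Adventure Works cube (the measure and member names are my assumptions), the two queries below return the same value, because in the first one [Product].[Category] silently enters the tuple with its default All member:

// [Product].[Category] is not mentioned, so it is included with its default member
SELECT {[Measures].[Internet Sales Amount]} ON COLUMNS
FROM [Adventure Works]

// ...which is equivalent to naming the All member explicitly
SELECT {([Measures].[Internet Sales Amount], [Product].[Category].[All Products])} ON COLUMNS
FROM [Adventure Works]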

However, I think that the actual behaviour of ORDER is inconsistent in this case. We should not need to explicitly include the current member from the set being ordered; the member currently being evaluated in the SSAS internal loops should be part of the query context in all cases. This seems to be taken care of when the attributes are not directly related. I would imagine that when they are related, it is easier, and in most cases more efficient, to just pull the value for the aggregate Subcategory (in our example), which gets summed up to the related attribute above it (Category) anyway; however, there is no guarantee that this is correct (again, as we can see in our example). I think that the ORDER function should either be documented better, or be changed to always include the currently evaluated member in the value tuple we are ordering by (it seems natural to sort a set by the values for that set's members).

I am still a bit unconvinced that this “by design” behaviour is actually correct, but my understanding of the ways of SSAS has increased a bit, so I guess I will be able to tackle similar problems more easily in the future. Also, Tomislav, the last paragraph sounded mysteriously interesting and I am looking forward to your next post on the subject.

Update: If you follow the comments on this post you will find a very good article on attribute relationships (link provided by Darren Gosbell), which explains this issue in depth. Also, Chris Webb just posted an article on his blog about a similar issue – similar in its cause rather than its symptoms.

MDX ORDER Riddle

I just found this question on MSDN forums:

I wonder if anyone has run into this before.

I’m trying to produce a report in which all rows are sorted by a specific
column. For example the following works fine:

WITH
SET [RowSet] AS [Product].[Subcategory].Levels(1).Members
SET [ColumnSet] AS [Date].[Calendar].Levels(1).Members
Select
Order([RowSet], [ColumnSet].Item(0), BDESC) ON ROWS,
[ColumnSet] ON COLUMNS
FROM [Adventure Works]

All product sub-categories are sorted by CY 2001.

However when the rows and columns of the report contain related non-leaf
attributes of the same hierarchy then the sort does nothing, as illustrated
in this example:

WITH
SET [RowSet] AS [Product].[Subcategory].Levels(1).Members
SET [ColumnSet] AS [Product].[Category].Levels(1).Members
Select
Order([RowSet], [ColumnSet].Item(0), BDESC) ON ROWS,
[ColumnSet] ON COLUMNS
FROM [Adventure Works]

Is such a sort possible at all, maybe using different MDX functions, or is this a known limitation for related attributes?

Interestingly enough it works when the leaf-level attribute
([Product].[Product]) is mixed with a non-leaf level attribute (e.g.
[Product].[Category]).

If you think about it for a while, you'll see that it is not easy to comprehend what is happening here at first glance. Some light is shed on it by Deepak Puri on the MSDN forums – please mark his answer as helpful if you think it is:

http://social.msdn.microsoft.com/Forums/en-US/sqlanalysisservices/thread/cde6f082-6dbd-4368-a20a-e64ea21de2bc

You could also write:

WITH
SET [RowSet] AS [Product].[Subcategory].Levels(1).Members
SET [ColumnSet] AS [Product].[Category].Levels(1).Members
Select
Order([Product].[Subcategory].Levels(1).Members AS a, ([ColumnSet].Item(0), a.CurrentMember), BDESC) ON ROWS,
[ColumnSet] ON COLUMNS
FROM [Adventure Works]

So, the ordering tuple is defaulting to the All member of Subcategory – therefore, we are trying to order the RowSet by one and the same value. If we order by a non-directly related attribute instead, the ordering expression is in fact evaluated within the current RowSet member… which seems natural. All in all, it looks like a peculiarity of the ORDER function – most likely because it makes some sense in terms of performance. Please let me know if you know why it happens.

Star-Join Optimisation – Prerequisites

A colleague of mine asked recently: do we need to have foreign keys between our fact tables and dim tables in order to take advantage of the new SQL Server 2008 star-join optimisation? I decided to ask the question on the MSDN forums. Just now I got a reply, and I thought it would be good to share it with everyone:

Charles Wang – MSFT

To use star join, indexes are required on the tables. FK constraints are not necessary, but it is recommended that you have FK constraints defined, since without them SQL Server must depend on heuristics to detect star schema query patterns. There are some restrictions on the heuristics used to determine a star join query, and they may not pick a correct star join query plan in some cases. You can find the detailed information in the section “Star Join Heuristics” of the article Data Warehouse Query Performance. The following is an extract of the content:

Many physical designs for data warehouses follow the star schema but do not completely specify the relationships between the fact and dimension tables, as mentioned earlier for foreign key constraints, for instance. Without the foreign key constraints explicitly specified, SQL Server must depend on heuristics to detect star schema query patterns. The following heuristics are applied to detect star join query patterns:

1. The largest of the tables participating in the n-ary join is considered the fact table. There are additional restrictions on the minimum size of the fact table. For instance, if even the largest table is not beyond a specific size, the n-ary join is not considered a star join.

2. All join conditions of the binary joins in a star join query have to be single column equality predicates. The joins have to be inner joins. While this might sound restrictive, it covers the vast majority of joins between the fact table and dimension tables on the surrogate key in typical star schemas. If a join has a more complex join condition that doesn’t fit the pattern described above, the join is excluded from the star join. A five-way join, for example, can lead to a three-way star join (with two additional joins later on), if two of the joins have more complex join predicates.

There we go – we need indexes, while FK constraints are good to have but not necessary. I also wrote to the CSS SQL Server Team, so if they come back with more information, I will update this post.
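
As a hypothetical sketch (the table and column names are made up for illustration), declaring the relationship explicitly means the optimizer does not have to fall back on the heuristics quoted above:

-- Dimension table with a primary key on the surrogate key
CREATE TABLE dbo.DimProduct (
    ProductKey  int           NOT NULL PRIMARY KEY,
    ProductName nvarchar(100) NOT NULL
);

-- Fact table with an explicit FK from the surrogate key to the dimension
CREATE TABLE dbo.FactSales (
    DateKey     int   NOT NULL,
    ProductKey  int   NOT NULL
        CONSTRAINT FK_FactSales_DimProduct
        REFERENCES dbo.DimProduct (ProductKey),
    SalesAmount money NOT NULL
);

-- Indexes remain the hard prerequisite for the star-join optimisation
CREATE INDEX IX_FactSales_ProductKey ON dbo.FactSales (ProductKey);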

Is 0.5 = 0.5 in SSIS?

Today I had an issue which was causing some peculiar results in my ETL. After approximately an hour of trying to find the exact problem, I managed to narrow it down to a Sort task. The Data Viewer I put immediately before that task showed me:

ColumnA              ColumnB
BBHI                     0.5
LLHR                     0.5

The Sort did:

  1. Sort by ColumnB in Descending order
  2. Sort by ColumnA in Ascending order

I was surprised to see that the output was:

ColumnA              ColumnB
LLHR                     0.5
BBHI                     0.5

After reversing the sort order of the two columns (first ColumnA, then ColumnB), I noticed that it was ColumnB that changed the order of the two rows. The type of that column was double-precision float (DT_R8) in SSIS and float in SQL Server. Since SQL Server also showed 0.5 for each row, I did not expect floating-point precision to be the reason for my wrong results. And I was wrong. Clearly, the 0.5 was not exactly 0.5 in one of the two rows. Luckily, my client advised me that if I multiplied the number by 10,000,000,000 and then rounded to a whole number, I would not lose any information. My fix was simple – a derived column, ColumnB_Int, which did exactly that:

(DT_I8)Round((ColumnB * 10000000000.0), 0)

Then I used that column in the sort instead of the original one. This did the trick.
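
The underlying behaviour is easy to reproduce in T-SQL (the tiny offset below is made up for illustration): two float values can display identically at the default precision and still compare as unequal:

-- Both variables display as 0.5, yet they are two different float values
DECLARE @a float = 0.5;
DECLARE @b float = 0.5 + 1e-15;

SELECT
    CONVERT(varchar(30), @a) AS a_displayed,  -- 0.5
    CONVERT(varchar(30), @b) AS b_displayed,  -- 0.5 (the default conversion shows at most 6 digits)
    CASE WHEN @a = @b THEN 'equal' ELSE 'not equal' END AS comparison;  -- not equal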

Quick Wins and Quick Losses

I have been wondering of late what a “Quick Win” actually implies. In my experience with BI projects, very often – because of fairly uneducated target consumers and fierce competition – companies deliver quick and dirty solutions, hoping to attract attention and then sell more services. This practice is often referred to as a “Quick Win”. Of course, the actual intention is not bad, but when poorly executed it first wastes clients’ money and time, and then also discourages them from pursuing a BI solution any further. In the case of a failure another term – a “Quick Loss” – is more appropriate, but never used.

 So, what determines the outcome?

 1. Scope

Managing the scope is absolutely essential in a Quick Win scenario. We must convince the client that all the advanced functionality can safely be pushed back to the next full-blown release, when we will have the time and money to build it properly. If we extend our Quick Win to build Dynamic Dimension Security, partition our cube, clean up the data, build dimension-management capabilities (MDS comes to mind), and so on, we will most likely fail, or at least jeopardise our chances of success. In this first crucial phase we need to concentrate on the core – building a simple and robust system. Instead of the usual scope creep, we should actually push for the opposite – scope cuts. Of course, this has to be carefully balanced against the actual needs, as cutting too much will leave us with an unusable result.

 2. Quality

In my opinion, if we deliver a poor-quality solution it will fail, and no attempts to resuscitate it later will have any decent chance of success. So, when we are scoping out our project, we must make sure we have time to build it well. Shortcuts would quite likely force us to scrap the solution altogether at a later point in time and rebuild it properly. Also, if we build an OLAP solution which is slow and buggy, we will hardly be able to convince our client that the next phase of the project will be any better.

 3. Analysis and Design

Yes, it is a Quick Win, and yes, it is a BI solution, but even these (contrary to some opinions) do need analysis and design. Spending a bit of time with the business users, the source system and the server engineers can greatly improve the development experience. Without a design phase, it is hard to maintain a strict scope and attain high quality. A brief design document helps us remember why we have done something the way we have, and decouples us (as developers) from the solution.

 4. Task Management

I am not a project manager. However, when alone on a small project, I find it very useful to track my progress and objectives by building a basic spreadsheet showing tasks, descriptions, time allocated, etc. This way I can easily comprehend and explain how my development is going, and ask for more time before I hit a deadline if required. Also, a task sheet helps me switch between tasks, or allocate them to other developers.

 5. Managing Client Expectations

I have heard this phrase many times before, and it has usually been misused. Managing client expectations does not actually mean lying to the clients, nor does it mean promising too much. In my opinion, managing client expectations means exactly what it sounds like – do not make your client excited about what you cannot deliver, and make them expect exactly what you can. It is good to keep clients happy and optimistic about the future, but making them enthusiastic and then crushing their enthusiasm with a dud solution is unprofessional.

This issue has been haunting me for a while. I have definitely not exhausted the topic, and I am sure many developers can add their own thoughts to this list, but I just hope I can spare some trouble, or offer some hints, for the less experienced readers of this blog.

Passing unCONSTRAINED Set and Member parameters between reports in Reporting Services

By default, SSRS MDX queries wrap their parameters in the StrToMember or StrToSet functions with a CONSTRAINED flag. However, many developers do not quite know why the flag is there or what it actually does. Books Online contains these statements:

StrToMember

  • When the CONSTRAINED flag is used, the member name must be directly resolvable to a qualified or unqualified member name. This flag is used to reduce the risk of injection attacks via the specified string. If a string is provided that is not directly resolvable to a qualified or unqualified member name, the following error appears: “The restrictions imposed by the CONSTRAINED flag in the STRTOMEMBER function were violated.”
  • When the CONSTRAINED flag is not used, the specified member can resolve either directly to a member name or can resolve to an MDX expression that resolves to a name.

StrToSet

  • When the CONSTRAINED flag is used, the set specification must contain qualified or unqualified member names or a set of tuples containing qualified or unqualified member names enclosed by braces {}. This flag is used to reduce the risk of injection attacks via the specified string. If a string is provided that is not directly resolvable to qualified or unqualified member names, the following error appears: “The restrictions imposed by the CONSTRAINED flag in the STRTOSET function were violated.”
  • When the CONSTRAINED flag is not used, the specified set specification can resolve to a valid Multidimensional Expressions (MDX) expression that returns a set.

Therefore, if you have a CONSTRAINED flag, you have to specify exact members or sets (e.g. [Date].[Year].[2009], or {[Date].[Year].[2009],[Date].[Year].[2010]}). If you omit the flag, you can pass StrToMember an expression which evaluates to a member (e.g. [Date].[Year].[Year].Members.Item(0)), and StrToSet an expression which evaluates to a set (e.g. NONEMPTY([Date].[Year].[Year].Members, [Measures].[Amount])).
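
For instance (the member names are assumed to exist in Adventure Works), the first WHERE clause below satisfies CONSTRAINED, while the second one only works if the flag is omitted, because the string is an expression rather than a member name:

SELECT {[Measures].[Internet Sales Amount]} ON COLUMNS
FROM [Adventure Works]
WHERE (StrToMember("[Date].[Calendar Year].&[2007]", CONSTRAINED))

// Violates CONSTRAINED; valid only without the flag:
// WHERE (StrToMember("[Date].[Calendar Year].[Calendar Year].Members.Item(0)"))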

The flexibility which removing CONSTRAINED offers can be quite powerful when passing parameters between reports. For example, we may want to pass a parameter to a drill-through report from two different summary reports, where each of them uses a different subset of dimension members, which in turn can be derived by different set expressions.
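
As a sketch (the @ProductSet parameter name and the member names are my assumptions), the drill-through dataset could resolve whatever set string a summary report passes to it:

// Drill-through dataset query; @ProductSet is the report parameter
SELECT {[Measures].[Internet Sales Amount]} ON COLUMNS,
[Date].[Calendar Year].[Calendar Year].Members ON ROWS
FROM [Adventure Works]
WHERE (StrToSet(@ProductSet))

// One summary report might pass "{[Product].[Category].&[1]}", while another passes
// "NONEMPTY([Product].[Category].[Category].Members, [Measures].[Internet Sales Amount])"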

The major drawbacks of this approach are the performance hit it can lead to, as well as a possible “MDX injection” vulnerability. Since in most cases we would be using the passed parameters in a subcube expression or on the slicer axis (the WHERE clause), the performance should not suffer as badly as it would if we used them inside a calculation. However, when we need to use a parameter directly in a calculated measure, we are better off avoiding an unCONSTRAINED function.

Therefore, we may instead use SetToStr on the summary reports and pass a String parameter to a CONSTRAINED StrToSet function in the drill-through report. This way we resolve the set expression once and pass it on to the underlying report as a string. We could do that in a calculated measure returning a String, which is then passed as a Field to the drill-through parameter. However, in the rare case where many rows travel from the SSAS server to the SSRS server, this could be slow.
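
A minimal sketch of this alternative (names again assumed): a calculated measure on the summary report resolves the set once and returns it as a string, which can then be mapped to the drill-through parameter:

// On the summary report: the set expression is resolved once, into a string of member names
WITH MEMBER [Measures].[Product Set As String] AS
SetToStr(NONEMPTY([Product].[Subcategory].[Subcategory].Members,
[Measures].[Internet Sales Amount]))
SELECT {[Measures].[Internet Sales Amount],
[Measures].[Product Set As String]} ON COLUMNS,
[Product].[Category].[Category].Members ON ROWS
FROM [Adventure Works]

// On the drill-through report the string contains only explicit member names,
// so it satisfies a CONSTRAINED StrToSet:
// WHERE (StrToSet(@ProductSet, CONSTRAINED))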

So, whether we use StrToSet without the CONSTRAINED flag, or a String parameter constructed by a SetToStr function, depends on the actual scenario, but it is good to have both options in our arsenal of tools and techniques when we need to implement some not-quite-standard piece of functionality.