Custom Dates for an SSIS SCD Task

Just last weekend I implemented a number of Slowly Changing Dimensions in a SQL Server 2005 based project. For the large dimensions I wrote some SQL code, but for the smaller ones I decided to use the SSIS SCD task. Since the wizard does most of the work, I have rarely had to go beyond it in the past. This time, though, we decided to have a custom default EffectiveTo date for the dimensions – 9999-12-31 instead of the SCD task's default of NULL. The wizard, however, is not customisable and some manual tweaks need to be done before it can handle custom dates. So, I decided to share these since there is not much around on this topic (or at least I could not find any particular references). There is a customisable component on CodePlex – the Kimball Method SCD Component – however, I could not use it as no custom tools could be used on the project for an unknown reason.

I created a quick mock-up of a dimension table for demonstration purposes:
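
A minimal sketch of what such a table could look like (all column names and data types here are assumptions for illustration; if the date columns were integer date keys, the 9999-12-31 literal would become something like 99991231):

CREATE TABLE dbo.DimCustomerMock
(
    CustomerSK          int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerCode        nvarchar(20)      NOT NULL,             -- business key
    City                nvarchar(50)      NULL,                 -- historical attribute
    PhoneNumber         nvarchar(30)      NULL,                 -- changing attribute
    EffectiveFromDateId datetime          NOT NULL,
    -- an assumption: a default on the end date column is one way to get
    -- 9999-12-31 onto newly inserted rows
    EffectiveToDateId   datetime          NOT NULL
        CONSTRAINT DF_DimCustomerMock_EffectiveTo DEFAULT ('9999-12-31')
);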

 

Then, I created an SCD task in SSIS with one historical and one changing attribute. For the Start and End dates I used my EffectiveFromDateId and EffectiveToDateId columns and had them populated with [System::StartTime]. Unfortunately, the SCD task does not allow specifying a custom value for the default EffectiveTo date and always uses NULL. To change this, we have to modify three dataflow components: the Slowly Changing Dimension transformation itself and the two OLE DB Command components generated for the Changing Attribute Updates and Historical Attribute Inserts outputs.

We can modify all three through the Advanced Editor (right-click on each component). For the actual Slowly Changing Dimension transformation we have to make the following change:
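
As a rough sketch (assuming the transformation identifies current rows through its CurrentRowWhere custom property, and using my mock-up column name), the filter that selects the current row changes from the wizard default to the custom end date:

CurrentRowWhere (wizard default):   [EffectiveToDateId] IS NULL
CurrentRowWhere (custom end date):  [EffectiveToDateId] = '9999-12-31'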

Then we also have to modify the SQL statement for each of the two OLE DB Commands (again through the Advanced Editor). For the Changing Attribute Updates Output:
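
The wizard-generated statement will differ from dimension to dimension, but with my mock-up columns it looks roughly like the sketch below – the only edit is swapping the IS NULL test for the custom date:

UPDATE dbo.DimCustomerMock
SET [PhoneNumber] = ?                          -- the changing attribute
WHERE [CustomerCode] = ?
    AND [EffectiveToDateId] = '9999-12-31'     -- was: [EffectiveToDateId] IS NULL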

And a similar change for the Historical Attribute Inserts Output:
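
Again a sketch over the mock-up columns – this command expires the previous version of the row, so the end date appears both in the SET clause and in the filter, but only the filter needs changing:

UPDATE dbo.DimCustomerMock
SET [EffectiveToDateId] = ?                    -- expiry date supplied by the dataflow
WHERE [CustomerCode] = ?
    AND [EffectiveToDateId] = '9999-12-31'     -- was: [EffectiveToDateId] IS NULL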

After applying these three changes, we are ready to run the task.

As we have three new rows coming from the source, they get inserted into the target dimension table. Since they are all active, their EffectiveToDateId values are set to the custom default of 9999-12-31.

Of course, if we later decide to change anything through the SSIS SCD wizard, all of these modifications will be lost and we will have to redo them once again…


When Not To Write MDX and When Not To Use Analysis Services

MDX is a great way to achieve some objectives easily and efficiently. However, some things are better done in other ways. I will first discuss three distinct mistakes that designers and developers tend to make when working on a typical BI solution.

1. Leaf-Level Calculations

This is by far the most common one. Both on the MSDN SQL Server forums and in practice, developers try building calculations in MDX at the leaf level of some dimensions and usually hit severe performance problems. While it is usually possible to build an MDX expression that achieves the goal, it is typically much simpler and far better for performance to do the calculation either in the ETL or in the DSV (either as a Named Calculation or as part of a Named Query). This avoids the need for the query engine to perform a large number of calculations every time we request an aggregate.
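
As a simple illustration (the table and column names below are made up), a typical leaf-level calculation such as a line amount can be produced once per row in the ETL or in a DSV Named Query and then simply aggregated by the cube:

SELECT
    f.OrderKey,
    f.ProductKey,
    f.OrderQuantity,
    f.UnitPrice,
    f.OrderQuantity * f.UnitPrice AS SalesAmount  -- computed once per row here, not per cell in MDX
FROM dbo.FactSales AS f;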

2. Mocking Joins

I have discussed this in a previous post, where I explain how we can access a measure group through a dimension which is not directly related to it (but rather related through another measure group and another dimension). Instead of doing this, we can simply build a many-to-many relationship between the two and avoid the MDX workaround altogether.

3. Business Logic over a Large Dimension

MDX is brilliant for implementing business logic – unless it needs to operate over millions of dimension members every time a calculation is requested. For example, I recently tried building a piece of business logic which needed to order a dimension by a measure and get the member with the largest value for each member of another dimension with hundreds of thousands of members. On top of that there were other calculations doing similar work, and the end result was not what was expected performance-wise. Even though the MDX was really neat and achieved the purpose in 3-4 lines, I moved the logic back to the ETL (where it was quite a bit more complex) because of performance. So, in general, I would not advise using MDX when, to retrieve the result, the query engine needs to go through a very large number of cells (in my case quite a few million), especially when ordering is required.
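
For illustration only – all object names below are invented – the ETL counterpart of “the member with the largest value for each member of another dimension” is a windowed query along these lines (the real logic was more involved, but this captures the idea):

WITH Ranked AS
(
    SELECT
        s.CustomerKey,
        s.ProductKey,
        s.SalesAmount,
        ROW_NUMBER() OVER (PARTITION BY s.CustomerKey
                           ORDER BY s.SalesAmount DESC) AS rn
    FROM dbo.FactSales AS s
)
SELECT CustomerKey, ProductKey, SalesAmount
FROM Ranked
WHERE rn = 1;  -- the top product per customer, resolved once during the ETL run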

A more fundamental mistake is using Analysis Services in a solution that does not really need it. Two severe and, in my opinion, common mistakes are:

1. Data Dumps

Why build a cube when the sole purpose of the project is to allow users to access the underlying data? Unfortunately, Analysis Services is sometimes seen as a silver bullet for everything. If the end report contains a massive amount of data and a key requirement is for it to export nicely to CSV, why bother – just export the data to CSV, zip it up and let the users download it. As far as I know, this can be achieved very easily in a number of other ways, especially considering the amount of effort and skill needed to build and maintain an SSAS solution.

2. No Aggregations

Another way SSAS gets misused is when a lot of textual data is stored in a large number of big dimensions, which then get linked through a “fact table”. I have previously worked on a solution where there were no measure columns in the fact table at all, and the cube was used to retrieve information about members of the largest dimension, called “Member”, which contained 4-5 million customers. The rest were dimensions like “Sign Up Date”, “Date Suspended”, “Country of Birth”, “Age Band”, etc. In the end, the main report consisted of information about the members. No data was aggregated apart from a simple count. The entire OLAP solution could have been replaced by a SQL query with a WHERE clause and an index.
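
To illustrate the point (the table and column names are hypothetical), the main report effectively boiled down to something like this:

SELECT COUNT(*) AS MemberCount
FROM dbo.Member
WHERE CountryOfBirth = 'Australia'
    AND AgeBand = '25-34';

-- supported by a suitable index:
CREATE INDEX IX_Member_CountryOfBirth_AgeBand
    ON dbo.Member (CountryOfBirth, AgeBand);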

I am sure that there are many other cases where SSAS gets misused. A solution utilising SSAS and MDX properly can be very powerful and elegant; however, sometimes poor design choices get it discarded as inadequate. Don’t blame the tools and the technology if your cube is slow – it is most likely a problem with either your design or the way you have developed your solution.

Obtaining Microsoft BI Certification

I have been a little busy recently getting certified with Microsoft. I am quite happy to announce that I have passed both the 70-448 and 70-452 exams in the last couple of weeks without reading a single page of preparation material. I am now an MCP, MCTS and MCITP in SQL Server 2008 BI.

I have always wondered how important and relevant these certificates are. The common opinion among industry professionals seems to be that experience is what really counts, and that certifications are for people without experience who are trying to get into the profession. I did it the other way around and I am not the least bit sorry for it. Here is an overview of my experience with the two certifications:

MCTS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
Exam: 70-448 Microsoft SQL Server 2008, Business Intelligence Development and Maintenance

I passed this one with 857 (out of 1000). Since it was my first attempt at certification, I did go through a few sample questions from a prep book based on MeasureUp, and a friend also flicked my way some really poor quality braindumps, which contained questionable terminology and a fair bit of nonsense. On the MeasureUp tests I scored between 65% and 85%, so I decided to get a Free Second Shot voucher from Prometric and just go with no further preparation. As expected, my Data Mining skills did not quite meet the expectations and I had a few glitches around SSIS maintenance, but I got more than 90% on all the development components and around 80% on the administration/maintenance ones. After all, I am a developer and I am not that experienced with administering BI solutions. In general, my impression was that the test can be passed by fairly inexperienced developers who have read some preparation materials.

MCITP: Business Intelligence Developer 2008
Exam: 70-452 PRO Designing a Business Intelligence Infrastructure Using Microsoft SQL Server 2008

This one was harder, but I passed it with 92% after doing a single sample test of around 30 questions before the attempt. I had another Prometric Free Second Shot voucher, so I was not stressing about whether I would pass or fail. The test had twice as many questions as the MCTS one and it took me a fair while to go through all of them. Again, Data Mining was my weakest area (less than 50% correct), while I managed to score 100% on the SSRS, SSAS and “Designing the BI Architecture” parts. SSIS was almost perfect too. Now, even if a fairly inexperienced developer passes the MCTS, this test will definitely cause a lot more headaches. The questions are much more practical, and a much higher degree of relevant experience is required to pass (or much more reading).

In general, neither of the tests is easy, and I do believe that someone who can pass them without preparation (purely from experience) is quite prepared to tackle the design, implementation and administration of Microsoft BI solutions. The Free Second Shot vouchers are also great for eliminating some of the stress. On the not-so-good side, however, the tests can be passed with no experience, and because the questions follow a certain pattern and do not allow much freedom in choosing the correct answer (short-answer questions could be better), I think there is a moderate chance that inexperienced and not particularly knowledgeable people will study, pass and then immediately forget the subject matter.

Still, I have no idea how much my career will benefit from these certifications; however, I did get a few PDF certificates signed by Steven A. Ballmer to wave at suitable occasions 🙂

To PowerPivot or Not

Just last week I attended a Microsoft presentation entitled “SQL Server 2008 R2 for Developers”. The presenter went through most of the new features in R2 and in SQL Server 2008, but the largest chunk of the presentation was dedicated to PowerPivot. Unsurprisingly, the 50+ developers in the room had not heard anything about it and did not seem very interested in what the presenter had to say and show. I was not surprised – not because the feature is insignificant, but mostly because .NET developers are naturally not the best audience for database presentations, let alone for new BI functionality which is tightly coupled with Excel.

However, my eyes and ears were wide open for what the Technology Specialist had to say, and he did alright – he showed us how we can use Web Slices from SSRS 2008 R2 as a data source for PowerPivot, how we can publish reports through Excel Services, and how PowerPivot can sort and filter 101 million rows in less than a second, among other things. It was quite impressive.

After the presentation a senior developer approached me (since I am the BI go-to guy) and asked me, among other easy questions, the following: “Who will be using all this?” and “How will users upload a spreadsheet with 101 million rows to SharePoint?”

Since I had no straight answer for him, I thought it might be a good idea to ask the Microsoft national technology specialist. The replies (not literal quotes) we got were:

“How will users upload a spreadsheet with 101 million rows to SharePoint?”

Because PowerPivot uses column compression, 101 million rows actually take between 10 and 15 MB, so uploading them over a network connection to SharePoint is not a problem.

“Who will be using all this?”

I understand that the potential use scenarios for PowerPivot are quite limited, but we (Microsoft) believe that there will be some cases where power-users will need and want this sort of functionality.

Fair enough. However, neither of the answers was convincing. The problems with them are:

1. What happens when we have a large number (say 1-2 million) of distinct dimension values and a large number of facts as a source for PowerPivot? Obviously, column compression will not work nearly as well in this case, which could be quite common.

2. If PowerPivot is mainly for power-users, how powerful do they need to be to take advantage of it? It seems like only a tiny subset of power-users with expert Excel knowledge will be able to operate this new Excel functionality. I will be quite surprised to see users managing live feeds, Analysis Services sources, DAX, etc.

3. If BI developers first have to build a PowerPivot environment for the end-users, this does not seem like self-service BI at all, and I would be reluctant to do that when I can achieve the same and more through Analysis Services.

What I did like, and what I think will be quite useful, are the numerous improvements to Excel 2010 as a data analysis and report generation tool. The product seems much improved and I believe it will be adopted quickly by heavy Excel users. However, I am somewhat perplexed by the buzz around self-service BI and PowerPivot in particular – especially since we (BI developers) still need some extra functionality implemented to make our and our users’ everyday experience smoother and easier.

Please feel free to let me know if you have better answers to my questions. For the time being I consider myself not an opponent, but an unconvinced and cautious observer of the recent developments in the BI space, and from this point of view self-service BI seems a bit overhyped. Hopefully it is me, not the product, that needs refinement.

All Member Properties – Name, Key and Level

I just tried to find some more information about the All member in SSAS dimension hierarchies and, since it was not readily available, I had to experiment a bit, so I thought I might as well share my findings. For some these may be obvious, but others may find them interesting.

In brief, I will explore the Name, Key and Level of the All member in a dimension hierarchy. The hierarchy of choice is the Customer Geography hierarchy of the Customer dimension in Adventure Works, which has an All member called All Customers. As expected, .PROPERTIES("MEMBER_NAME") gives us “All Customers”:

WITH
MEMBER [Measures].[test] AS
  [Customer].[Customer Geography].CurrentMember.PROPERTIES("MEMBER_NAME")
SELECT
{
  [Measures].[test]
} ON 0,
{
  [Customer].[Customer Geography].Members
} ON 1
FROM [Adventure Works]

The first row shows us: All Customers.

Now, let’s see what its key is:

WITH
MEMBER [Measures].[test] AS
  [Customer].[Customer Geography].CurrentMember.PROPERTIES("KEY")
SELECT
{
  [Measures].[test]
} ON 0,
{
  [Customer].[Customer Geography].Members
} ON 1
FROM [Adventure Works]

This gives us 0.

And its level:

WITH
MEMBER [Measures].[test] AS
  [Customer].[Customer Geography].CurrentMember.Level.Name
SELECT
{
  [Measures].[test]
} ON 0,
{
  [Customer].[Customer Geography].Members
} ON 1
FROM [Adventure Works]

The result this time is: (All).

So far so good. Now let’s try using these to get only the All member:

SELECT
{
  [Customer].[Customer Geography].[All Customers]
} ON 0
FROM [Adventure Works]

This works. Now if we try the Key:

SELECT
{
  [Customer].[Customer Geography].&[0]
} ON 0
FROM [Adventure Works]

Interestingly, since the All member is generated by Analysis Services and has no physical key, trying to use the key it reports does not actually work – we get nothing on Axis 0.

Using the level works:

SELECT
{
  [Customer].[Customer Geography].[(All)].Item(0)
} ON 0
FROM [Adventure Works]

Also, after experimenting a bit further:

SELECT
{
  [Customer].[Customer Geography].[All]
} ON 0
FROM [Adventure Works]

This query also works even though the All member name is [All Customers], not just [All]. However, Analysis Services does recognise [All].

In summary, the most robust options for referencing the All member in a dimension hierarchy that I have found are:

1. [Dimension].[Hierarchy].[(All)].Item(0)
2. [Dimension].[Hierarchy].[All]

These will always work – regardless of the dimension and hierarchy names.

Another option is using [Dimension].[Hierarchy].[All Member Name] – e.g. [Customer].[Customer Geography].[All Customers].

And, one that does not work – referencing the All member through its alleged key: [Customer].[Customer Geography].&[0]

Please let me know if there are any better alternatives, or if you can explain why SSAS reports a key of 0 for the All member yet does not return it when that key is actually used.

An Open-Source Approach to Business Intelligence

Since Microsoft discontinued PerformancePoint Planning, it is slightly unclear what their strategy in this field is. I have made a few suggestions in the past, among them:

  • Stored-Procedure write-back from Excel
  • SSAS write-back from Excel
  • Better support for dimensional and measure group write-back

Whether, when and how these may get implemented is an open question. I firmly believe that we should continue learning from any possible source – including our competitors. This is why I am quite interested in any possibility to extend my knowledge in the BI space regardless of who is presenting the ideas and solutions.

Lucky as I am, I received an open invite to the PALO Australia Roadshow 2010, presented by a company I recently posted about – Naked Data. In brief, what you can expect from PALO and the presentation is:

  • In-memory MOLAP
  • Rule-based forecasting and planning
  • Some sci-fi GPU parallel processing
  • Integration with Excel (PALO formulas and Pivot tables)
  • .NET API
  • Free breakfast

I will be there trying to get some ideas for my future forecasting and planning implementations. Seats are limited, so make sure you register soon!

Average Aggregation in Analysis Services

In SSAS we do not have an Average aggregation type for measures. We do have AverageOfChildren (or average over time), but it is semi-additive and works only along a Time dimension. Fortunately, we have Sum and Count, and since Average = Sum / Count, we can build our own Average aggregation when we need one.

To do that: 

  1. Create a measure using the Sum aggregation type (which is also the default). In our example, let’s call it Amount.
  2. Create a Count of Non-Empty Values (or Count of Rows) measure – in our example, [Measure Count].
  3. Create the actual calculation – [Measures].[Amount]/[Measures].[Measure Count]

We can either create a calculated measure, which performs the MDX calculation above: 

CREATE MEMBER CURRENTCUBE.[Measures].[Average Amount]
AS
[Measures].[Amount]/[Measures].[Measure Count]
,NON_EMPTY_BEHAVIOR = {[Measures].[Measure Count]}
,VISIBLE=1; 

, or, if we really do not need the Sum base measure, we can replace it with the calculation through a SCOPE statement: 

SCOPE([Measures].[Amount]);
  This = [Measures].[Amount]/[Measures].[Measure Count];
NON_EMPTY_BEHAVIOR(This) = [Measures].[Measure Count];
END SCOPE; 

Voila! We have created a measure simulating an Average aggregation type. We can then hide the Count helper measure, and from a user's point of view there is no evidence of our effort.

Since the count will never be 0, we do not have to write “if not 0, divide; else – do not”, and the NON_EMPTY_BEHAVIOR property may in fact improve performance, since the calculation will not be performed when the Count measure is NULL (instead of resulting in NULL/NULL = NULL).

Mosha has previously blogged about NON_EMPTY_BEHAVIOR and division by zero and I strongly recommend reading his post. 

Another important consideration, which depends on the business scenario, is the type of the Count aggregate. It could be a count of rows (Row Bound) or a count of non-empty values (Column Bound). The difference is whether we want to include or exclude the empty values from our aggregate; either way, the described technique works equally well.

I realise that this is a well-known approach, but since the solution is hard to find online, I thought it might be interesting for less-experienced developers.