SSIS Balanced Data Distributor – Comparison

Microsoft has just released a new Data Flow transformation for SSIS – the Balanced Data Distributor. Reading the installation pages, we can see a few interesting things about it, the most notable being that it is a multithreaded transform which splits a data stream uniformly across multiple outputs. I decided to give it a quick test and see how it performs in comparison to a straight insert, as well as to a Script Component which also splits the input stream in two.

First, the tests show performance with a SQL Server table as the input (OLE DB Source) and a raw file as the output. The input is a SELECT TOP 20000000 * from a large table. The results are as follows:

1. Straight insert: 32 seconds

2. Balanced Data Distributor: 36 seconds

3. Script Component: 57 seconds

4. Conditional Split: 42 seconds



Note that the Conditional Split divides the stream based on the remainder of dividing an integer field, while the Script Component does it based on the row number. The Conditional Split approach is not always usable – for example, when we have only non-numeric data, or when the range of the numeric data is not wide enough to split into the number of streams we would like (e.g. a Gender Key can be 1 or 2, while we may want to split into 10 parallel data streams). Therefore, its functionality is not equivalent to the BDD transform (thanks to Vidas Matelis for asking me to include it in the case study).

The second set of tests reverses the input and output (raw file to SQL Server table), with everything else identical.

  1. Straight insert: 56 seconds
  2. Balanced Data Distributor: 1 minute and 47 seconds
  3. Script Component: 1 minute and 57 seconds
  4. Conditional Split: 1 minute and 54 seconds

Over 40M rows the difference between the BDD transform and the Script Component grows to 1:16 vs 2:13 (vs 1:24 for the equivalent Conditional Split), or a 57-second difference when writing to a raw file. In general, the overall performance improvement in that direction seems to be around 35-45%. Since I am inserting into the same table on my local SQL Server, the parallel split does not seem to be beneficial even though the destination is slower than the source; in fact, the straight insert outperforms the parallel data flows in both cases. If we were to insert into a partitioned table spread over different filegroups hosted on separate drives, the results could be quite different. Either way, in my opinion it is a nice and welcome addition to our arsenal of SSIS data flow components.

Edit: Len Wyatt from the SQL Performance Team at Microsoft has provided a link to his post, which covers the BDD transform in great detail, in the comments below. Please take a minute and have a look if interested.

3×3 Worst Development Practices 2010 with SSIS, SSAS and SSRS

There are some practices which are outright bad and some which are only sometimes bad. I will try to list the top three outright worst practices for each of the main components – SSIS, SSAS and SSRS – which have annoyed me most throughout 2010 (and previous years). I am sure there are many more, but these are the ones I have witnessed often enough to deserve some attention.

SSIS

Using default names for all components in a control flow and a data flow

I really dislike opening projects and seeing nothing but generic names. I have left defaults in place in the past because of insufficient time to document everything, but I hated creating packages like that and leaving them in that state. I have also seen developers with time on their hands carelessly skipping the renaming altogether. The worst variation is renaming components to something equally unintelligible, even if marginally better – many out-of-the-box ETL tools do exactly that, and it makes subsequent maintenance much harder.

Performing joins with Merge Join

Chaining five Merge Joins with Sort components between them accomplishes something SQL Server can do faster and with far fewer resources. Doing joins in the database is therefore a no-brainer, even if your preference lies with a fully SSIS-based process.
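As a minimal illustration of what that means in practice, a multi-way join can simply be pushed into the source query instead of being assembled from Sorts and Merge Joins in the data flow (the table and column names below are made up):

SELECT  s.SaleId,
        s.Amount,
        c.CustomerName,
        p.ProductName
FROM    dbo.StagingSale     AS s
JOIN    dbo.StagingCustomer AS c ON c.CustomerId = s.CustomerId
JOIN    dbo.StagingProduct  AS p ON p.ProductId  = s.ProductId;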

Not using configurations

Luckily, in Denali we are getting a completely revamped approach to project configurations. Even if imperfect, Package Configurations are a must for every SSIS project – migrating projects, changing databases and setting variables through them is just too good to miss.

SSAS

Using MDX for everything

MDX has the aura of a language which lets you do everything much faster than SQL. This is not entirely true, and in the majority of cases a change in the design will allow us to avoid MDX altogether, which removes a whole layer of complexity and brings tremendous improvements over any programmatic approach.

Importing too much dimensional data

Developers often underestimate the impact of unnecessarily large dimensions with many useless attributes. The best approach here is to add attributes as they are needed rather than import everything because we can. Cubes are smaller and less complex with fewer attributes, processing is faster, and, most importantly, users actually like having less clutter.

Leaving NULLs in the relational database

SSAS does not like NULLs. I have not seen a case where NULLs are necessary in dimension tables. While I would leave a NULL for a measure when there is genuinely no data for it, too many of these can be quite harmful and may also be a sign of a bloated measure group.
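One simple way to keep NULLs out of a dimension is to substitute an explicit Unknown member during the load – for example (the table and column names are just placeholders):

INSERT INTO dbo.DimCustomer (CustomerBK, CustomerName, City)
SELECT  src.CustomerBK,
        ISNULL(src.CustomerName, N'Unknown'),
        ISNULL(src.City,         N'Unknown')
FROM    staging.Customer AS src;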

SSRS

Storing business logic in report queries

Even though we can, we should never embed SQL and MDX logic in report queries when it can be implemented in stored procedures or calculated measures. Maintaining SSRS queries is much harder and less transparent, and there are performance implications around caching, too.

Merging data sets in SSRS

This is hard to do and, when it is done, the result is always bad. Data sets should be merged in the queries – not in the reports.

Building massive reports

In many organisations reports are seen as a transport mechanism to Excel. Users generate reports with hundreds of thousands of rows, a large number of columns and sometimes even basic calculations. All of this can be done simply and easily at the database level without going through SSRS.

Analysing Twitter Trends

Recently, I built a pilot for a demo at Avanade based on Twitter trends, entirely using Microsoft SQL Server 2008 R2 and Microsoft Office. Since we keep hearing about “Social BI” and other “Social” aspects of software, this may help introduce an approach to the topic from a Microsoft BI point of view.

The basic idea is: 

  1. Connect to search.twitter.com and get a feed which contains the results from a search
  2. Do #1 every now and then (in my case – every 10 seconds)
  3. Pick the last 15 feed items (tweets) and compare them to the previously retrieved items to determine which tweets are new
  4. Store all new tweets in SQL Server
  5. Build a few dimensions (author, time, etc.)
  6. Build a categorisation dimension. In my case I am using Full-Text search to find certain keywords around my search strings (e.g. Windows – bad/good/excellent/terrible), so I can analyse the feedback the search term is receiving
  7. Build a SSAS cube on top of these dimensions and the feed history
  8. Build an Excel pivot-based report on top of the SSAS cube

Now, for the first 4 points, we can use SSIS. We cannot natively connect to RSS/ATOM feeds in SSIS, but we can easily build a Script Component acting as a data source, which connects to the feed and pulls its items into a data flow. I found a very useful C# reference here: 

http://blogs.msdn.com/b/mattm/archive/2009/02/19/read-an-rss-feed-from-ssis.aspx 

I customised it a bit (99% is the same as in the example above) so that it works with Twitter and does not include irrelevant info. Of course, depending on what we need to achieve, we can change the code to support all RSS/ATOM features. A sample of my script is below: 

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.ServiceModel.Web;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Text;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private string url = string.Empty;
    private SyndicationFeed feed = null;
    private XmlReader reader = null;
    public override void PreExecute()
    {
        //base.PreExecute();
        //here we are using a SSIS variable for the feed address, which we can populate
        //from a database table, thus providing an easily customisable interface to our
        //package
        reader = XmlReader.Create(Variables.Feed1);
        feed = SyndicationFeed.Load(reader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
        reader.Close();
    }
    public override void CreateNewOutputRows()
    {
        if (feed != null)
        {
            foreach (var item in feed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.id = item.Id;
                Output0Buffer.published = item.PublishDate.DateTime;
                Output0Buffer.title = item.Title.Text;
                Output0Buffer.authorname = item.Authors[0].Name;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

Once we can connect to our source, everything becomes pretty much run-of-the-mill ETL. My package at present supports only 5 feeds and they get pulled simultaneously. This is of course just a demo and is easy to extend as required. A natural way to improve the solution is to implement some looping logic (i.e. a looping sequence container) so more feeds can be harvested. A particularly convenient feature of the feeds is the item ID, which is very useful for comparing items. Another important property of our tweets is that they may contain one or more of the keywords we are searching for. Therefore, we want to count a hit for each keyword (sometimes resulting in the same feed item being counted 3 or more times). A good example of this is if we are searching for Microsoft, Vista and Windows7 and we receive something like: “Microsoft is doing well with Windows7 – it is so much better than Vista” – here we want to know that all three terms were hit and we want to show a tweet for each of them (multiplying the total count by 3). 
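For the counting itself, a simple set-based query can produce one row per tweet and matched search term, so a tweet mentioning three terms is counted three times. The sketch below is only an illustration – the table and column names are assumptions:

SELECT  s.SearchTerm,
        COUNT(*) AS TweetCount
FROM    dbo.Tweet      AS t
JOIN    dbo.SearchTerm AS s
        ON t.Title LIKE N'%' + s.SearchTerm + N'%'
GROUP BY s.SearchTerm;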

The following is an example of my SSIS items: 

 Control Flow: 

 

Data Flow:

 

I am doing a little trick here to compare the previous feed items to the present ones – tagging them as old at the start of the control flow and then deleting the old ones at the end. It is a sort of snapshot-processing logic and works just fine as long as only this package is run repeatedly (which I am doing through SQL Server Agent). The output of this package is one table with all feed items.

We don’t need much more for a simple implementation such as WardyIT’s #sqlpass Trends:

http://www.wardyit.com/sqlpass/

However, once we decide to progress a bit further, we can achieve much more. As I mentioned earlier, we can use Full-Text Search to analyse the content of the feeds. Utilising CONTAINS with the NEAR operator, we can find keywords appearing close to our search terms, which is very useful if we want to make a little bit of sense of the raw tweets. I prepared a table which categorises 20 words as Positive/Negative, with one level below that (Strongly/Moderately/Slightly Positive or Negative), and I am matching these to the tweets. This allows me to see like/dislike trends for each search term.
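As a rough sketch of the matching (the table and column names are made up, and a full-text index on the tweet text column is assumed), a query like the following flags tweets where a search term appears near a sentiment keyword:

SELECT  t.TweetId,
        t.Title,
        N'Strongly Negative' AS SentimentCategory
FROM    dbo.Tweet AS t
WHERE   CONTAINS(t.Title, N'Windows7 NEAR terrible');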

 

For ad-hoc analysis and easy consumption in Excel, I also created a simple SSAS solution on top of this data. I am processing the cube every hour, which gives me fairly up-to-date results. Alternatively, depending on requirements, you can easily switch it to ROLAP (considering the very low complexity of the cube and the relatively small amount of fact data) and get real-time analytics. Of course, the result matters the most, and this is a little (un)useful dashboard in Excel which showcases a few of the analytical options available from this very simple implementation:

 

Please don’t slam me for the choice of search terms and some of the visualisation items (e.g. the line chart x-axis) – I know some are not quite relevant, but it is just a quick demo, right? :)

Passing database names to SSIS stored procedures

In the rare cases when we use dynamic SQL and need a database name in our code, we are better off avoiding hard-coding it. Unfortunately, I could not find an easy way to access a connection manager’s database name, and on my current project the catalog name is not in the SSIS configurations XML file. Therefore, I had to resort to a little trick to pull the database name out and pass it to a stored procedure. In brief, we can do the following:

1. Create a user variable database_name

2. Create an Execute SQL Task using the connection manager we want to get the database name from, which does:

 SELECT db_name() AS database_name
 
3. Map the Single Row result set to our database_name variable 

4. Place the task created in the previous step before any components which would be using the variable.

5. Pass the variable to our dynamic SQL stored procedure
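Step 5 might look something like the following if the stored procedure is called from an Execute SQL Task over an OLE DB connection, with the User::database_name variable mapped to the single ? parameter. The procedure name and body are just placeholders for illustration:

-- Statement in the Execute SQL Task (the ? is mapped to User::database_name)
EXEC dbo.usp_DoSomething ?;

-- A hypothetical procedure using the passed-in database name in dynamic SQL
CREATE PROCEDURE dbo.usp_DoSomething
    @database_name sysname
AS
BEGIN
    DECLARE @sql nvarchar(max);
    SET @sql = N'SELECT COUNT(*) FROM ' + QUOTENAME(@database_name) + N'.dbo.SomeTable;';
    EXEC sys.sp_executesql @sql;
END;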

 

 

There we go – a stored procedure driven by the SSIS package configurations – a bit better than just hard-coding the name.

An ETL Framework from BI Monkey

Today I found out about a cool ETL framework being developed by my colleague in Sydney – James Beresford, a.k.a. BI Monkey. He also writes a blog, which I have read in the past.

Check out his take on what an ETL framework should look like. He told me that the current version (v1) is suitable for smaller projects and that a new one, v2, is coming out soon – so if you like v1, definitely stay tuned for v2! And in case you have suggestions – he is a pretty cool guy to chat to and very keen to take the framework further, so don’t hesitate to contact him!

Is 0.5 = 0.5 in SSIS?

Today I had an issue which was causing some peculiar results in my ETL. After approximately 1 hour of trying to find the exact problem, I managed to narrow it down to a Sort task. The Data Viewer which I put immediately before that task showed me:

ColumnA              ColumnB
BBHI                     0.5
LLHR                     0.5

The Sort did:

  1. Sort by ColumnB in Descending order
  2. Sort by ColumnA in Ascending order

I was surprised to see that the output was:

ColumnA              ColumnB
LLHR                     0.5
BBHI                     0.5

After reversing the sort order of the two columns (first ColumnA and then ColumnB), I noticed that ColumnB was what changed the order of the two rows. The type of that column was Double Precision Float in SSIS and float in SQL Server. Since SQL Server also showed 0.5 for both rows, I did not expect floating-point precision to be the reason for my wrong results. I was wrong – clearly the 0.5 was not exactly 0.5 in one of the two rows. Luckily, my client advised me that multiplying the number by 10,000,000,000 and then rounding to a whole number would not lose any information. My fix was simple – a derived column, ColumnB_Int, which did exactly that:

 (DT_I8)ROUND(ColumnB * 10000000000.0, 0)

Then I used that column in the sort instead of the original one. This did the trick.
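The underlying behaviour is easy to reproduce. The values below are not the ones from my data – just an illustration of the same class of problem using 0.3 – but they show how two float values that look identical once rounded for display can still compare (and therefore sort) differently, and how the scale-and-round trick normalises them:

DECLARE @a float;
DECLARE @b float;
SET @a = 0.3;
SET @b = CAST(0.1 AS float) + CAST(0.2 AS float);   -- accumulates a binary rounding error

SELECT  CAST(@a AS decimal(18, 10))                         AS a_displayed,    -- 0.3000000000
        CAST(@b AS decimal(18, 10))                         AS b_displayed,    -- 0.3000000000
        CASE WHEN @a = @b THEN 'equal' ELSE 'not equal' END AS raw_comparison, -- not equal
        CAST(ROUND(@a * 10000000000.0, 0) AS bigint)        AS a_normalised,   -- 3000000000
        CAST(ROUND(@b * 10000000000.0, 0) AS bigint)        AS b_normalised;   -- 3000000000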

Custom Dates for an SSIS SCD Task

Just last weekend I implemented a number of Slowly Changing Dimensions in a SQL Server 2005 based project. For the large ones I wrote some SQL code, but for the smaller dimensions I decided to use the SSIS SCD task. Since the wizard does most of the work, I had not needed to go much beyond it in the past with that component. This time, though, we decided to use a custom default EffectiveTo date for the dimensions – 9999-12-31 instead of the SCD task's default of NULL. The wizard, however, is not customisable in this respect, and some manual tweaks need to be done before it can handle custom dates. So, I decided to share these since there is not much around on this topic (or at least I could not find any particular references). There is a customisable component on CodePlex – the Kimball Method SCD Component – however I could not use it, as custom components were not an option in this case.

I created a quick mock up of a dimension table for demonstration purposes:

 

Then, I created an SCD task in SSIS with one historical and one changing attribute. For the Start and End dates I used my EffectiveFromDateId and EffectiveToDateId columns and got them populated with [System::StartTime]. Unfortunately, the SSIS task does not allow specifying a custom value for the default To date and uses NULL. To change this, we have to modify three of the generated data flow components: the Slowly Changing Dimension transform itself and the two OLE DB Commands it creates.

We can modify these through the Advanced Editor (right-click on each component). For the actual Slowly Changing Dimension task we have to make the following change:

Then we also have to modify the SQL statement for the two OLE DB Commands (again through the Advanced Editor). For the Changing Attribute Updates Output:

And a similar change for the Historical Attribute Inserts Output:
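In all three components the change follows the same pattern: wherever the generated logic identifies the current row with an IS NULL test on the end-date column (including the CurrentRowWhere property of the SCD transform itself), it should compare against the custom default instead. As a rough sketch for one of the OLE DB Commands – the table and column names are placeholders, and an integer date key is assumed for 9999-12-31:

-- Typical shape of the wizard-generated statement:
UPDATE [dbo].[DimCustomer]
SET    [EffectiveToDateId] = ?
WHERE  [CustomerBK] = ?
  AND  [EffectiveToDateId] IS NULL;

-- Modified to work with the custom default end date instead of NULL:
UPDATE [dbo].[DimCustomer]
SET    [EffectiveToDateId] = ?
WHERE  [CustomerBK] = ?
  AND  [EffectiveToDateId] = 99991231;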

After applying these three changes, we are ready to run the task:

As we have three new rows, they get inserted into the target dimension table. Since they are all active, their EffectiveToDateId values are set to the default of 9999-12-31:

Of course, if we decide to change anything through the SSIS SCD wizard, all of these changes will be lost and we will have to redo them once again…

7 Ways to Process Analysis Services Objects

Since I get asked a bit too often how we can process Analysis Services databases (or cubes and dimensions), here is a list of 7 different methods:

1. Through the GUI

This one is obvious. We can do it through both SSMS and BIDS.

2. XMLA Script

To generate the script we can use the handy Script button in SSMS. Simply configure the processing settings and then, instead of clicking the all too usual OK, click the little Script button in the top left corner of the Process window:


Then, we can just execute the generated query.

3. SSIS Analysis Services Processing Task

This Control Flow task allows us to configure any settings and then add it to our ETL process. Quite handy.


4. SQL Server Agent Job

This one is really an automation of Method #2 – XMLA Script. We can encapsulate it into a job of SQL Server Analysis Services Command type:


5. .NET Code

This allows us to process cubes as part of an application – nice if we want to let our users process cubes on demand. It is of course better left to application developers, but still a good trick to know, especially if we want to seem all-knowing when it comes to databases of any type. To achieve this we use AMO (Analysis Management Objects). The API reference can be found here:

http://technet.microsoft.com/en-us/library/microsoft.analysisservices(SQL.90).aspx

6. Command Line – ascmd

The command line utility can do a lot – including processing SSAS objects. For a full readme you can go here:

http://msdn.microsoft.com/en-us/library/ms365187.aspx

7. Command Line – PowerShell

This PowerShell script will perform a Full Process of Adventure Works DW 2008 on localhost:

# Load the AMO assembly
[Reflection.Assembly]::LoadWithPartialName("Microsoft.AnalysisServices")
# Connect to the SSAS instance
$servername = New-Object Microsoft.AnalysisServices.Server
$servername.Connect("localhost")
# Get the database and fully process it
$databasename = $servername.Databases.GetByName("Adventure Works DW 2008")
$databasename.Process("ProcessFull")

Using AMO we can perform any maintenance task through PowerShell, including processing objects.

This is probably not a fully exhaustive list, but I hope it gives developers some options when it comes to this routine yet crucial part of the development and deployment process.

Inferred Members Implementation Best Practice

Just last week a colleague of mine and I did some very thorough research on the best implementation practices for Inferred Members. We went through a few web sites and also had a look at previous SSIS implementations. There are quite a few resources on the subject, and the two best methods we found were based on Project REAL and a SQL CAT recommendation. In the end we agreed that we can recommend three approaches, each with its advantages and disadvantages, which can all be used successfully when we need Inferred Member (late arriving dimensions = early arriving facts) support in a solution.

Our findings can be summarised in the following table:

Method                   Pros                    Cons
Stored procedures        No need to use SSIS     Slow; complicated development; need to link the source and target server instances
Script Component (SSIS)  Fast; high reusability  .NET skills required
Double Lookup (SSIS)     Fast; high reusability  Performance issues with multiple dimensions

I will go through each of these methods and provide some more details about them.

1. Stored Procedures

If SSIS data flow tasks are not an option for our ETL we can implement inferred member support through SQL code. The basic idea behind this method is:

  1. Find new dimension members in the fact data and add them to the dimension
  2. Move the fact data from source to target (i.e. staging to datamart) with surrogate keys from the dimension table
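Sketched in T-SQL, the two steps above might look something like this (the table and column names are assumptions, with State used as the example dimension):

-- Step 1: add business keys present in the facts but missing from the dimension
-- as inferred members
INSERT INTO dm.DimState (StateCode, StateName, IsInferred)
SELECT DISTINCT f.StateCode, N'Unknown', 1
FROM   staging.FactSale AS f
WHERE  NOT EXISTS (SELECT 1 FROM dm.DimState AS d WHERE d.StateCode = f.StateCode);

-- Step 2: load the facts, resolving surrogate keys against the now complete dimension
INSERT INTO dm.FactSale (StateSK, SaleAmount)
SELECT d.StateSK, f.SaleAmount
FROM   staging.FactSale AS f
JOIN   dm.DimState      AS d ON d.StateCode = f.StateCode;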

If our source and target data are on separate instances of SQL Server, it becomes quite inconvenient to use this approach. We need to link the two instances in order to reference tables from both in the same stored procedure. This is a major issue and it is easily avoided by using SSIS.

2. SQL Server Integration Services

A general design of SSIS solutions can be represented with the following diagram:


We can implement this in the following ways:

2.1. Script Component (Project REAL)

The Data Flow task for this approach is:


After looking up missing states, we pass them to a Script Component, which hashes the misses and hits the database only when a genuinely new miss occurs. A stored procedure then simply adds the new dimension member to the dimension table and returns the new surrogate key, which gets sent on to the fact table.

Using .NET we can efficiently hash all values in an object collection, and we can handle both character and numeric data.

Because we are using a fully cached lookup, this data flow item is case-sensitive, therefore we should make sure we equalise the case in both the fact and the reference table data before we compare the two. We should also make sure that the Script Component hashes case-insensitively, because if it does not we will end up with multiple rows in our dimension table – one for each case variation of our inferred members.

Additionally, the Script Component should be built to either accept a parameter for the dimension name or, as in our implementation, read its own component name and take the first word of it to determine which dimension it serves. In the diagram above the Script Component handles the State dimension, therefore its name starts with "State – ". This makes implementing multiple Script Components for multiple dimensions very easy – all we need to do is change the name and it just works. Some .NET code for auditing purposes can be added as well. All of this is fairly easy to implement, and entry-level .NET skills should be sufficient for development and maintenance. A sample of this approach can be found in Project REAL and it is thoroughly discussed in the SSIS paper produced with it.

2.2. Double Lookup (SQL CAT)


Here we are doing the same as before – matching fact rows against a State dimension. If a mismatch is found in the first (fully cached) lookup, we pass the row to the second one. The New StateSK lookup is partially cached; its purpose is similar to a hash table – it caches new rows, and when a brand new mismatch occurs it adds it to the database by executing a stored procedure. Then we Union All our inferred members with the rest of the fact data.
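The stored procedure behind the second lookup can be as simple as an insert-if-missing that returns the surrogate key for the lookup to cache – something along these lines (the names are illustrative):

CREATE PROCEDURE dm.usp_GetOrAddInferredState
    @StateCode nvarchar(10)
AS
BEGIN
    SET NOCOUNT ON;

    -- Add the member as an inferred row if it is not there yet
    IF NOT EXISTS (SELECT 1 FROM dm.DimState WHERE StateCode = @StateCode)
        INSERT INTO dm.DimState (StateCode, StateName, IsInferred)
        VALUES (@StateCode, N'Unknown', 1);

    -- Return the surrogate key so the lookup can cache and reuse it
    SELECT StateSK
    FROM   dm.DimState
    WHERE  StateCode = @StateCode;
END;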

Because the second lookup uses partial caching, it is case-insensitive, and case variations of the same character value will not lead to multiple dimension rows for the same member. The first, fully cached lookup is case-sensitive, however, so we should still make sure that the source and reference data are in the same case, as that is more efficient. Furthermore, in SQL Server 2005 partial caching must have a specified memory limit, and if we have a large number of late arriving dimension members we may run out of memory for them. In that case SQL Server 2005 will start discarding the least used values from its cache, which may have performance implications. The latter problem is overcome in SQL Server 2008.

Another problem with this approach is the use of Union All tasks. These are semi-blocking and may impact performance when used multiple times in our ETL package.

For a much more detailed description, including samples, you can go to:

SQL CAT – Assigning surrogate keys to early arriving facts using Integration Services

From our testing and research we reached the conclusion that the Script Component is the best approach, closely followed by the Double Lookup. The stored procedure approach is slow, hard to maintain and may be impossible to implement in a production environment. Using SSIS with .NET proves to be efficient, convenient and fast. If avoiding .NET is preferable, the Double Lookup in SSIS is a very good alternative.