Microsoft just released a new Data Flow transformation for SSIS – Balanced Data Distributor. Reading the installation pages, we can see that there are a few interesting things about it, among which – it is a multithreaded transform, which uniformly splits a data stream in multiple outputs. I decided to give it a quick test and see how it performs in comparison to a straight insert and a Script Component which splits the input stream in two, as well.
First the tests show performance with input of a SQL Server table (OLE DB Input) and output to a raw file. The input is a TOP 20000000 * select from a large table. The results are as follow:
1. Straight insert: 32 seconds
2. Balanced Data Distributor: 36 seconds
3. Script Component: 57 seconds
4. Conditional Split: 42 seconds
Note that the Conditional Split divides the stream based on the remainder of a division of an integer field, while the Script Component does it based on the row number. The Conditional Split may or may not be useful in a number of cases – when we have non-numeric data only, or when the range of the numeric data is not wide enough to split in the number of streams we would like to (e.g. Gender Key can be 1 or 2, while we may want to split in 10 parallel data streams). Therefore, its functionality is not equivalent to the BDD transform (thanks to Vidas Matelis for asking me to include it in the case study).
The second tests show how fast it is with reversed input and output (raw file to SQL Server table) and everything else identical.
- Straight insert: 56 seconds
- Balanced Data Distributor: 1 minute and 47 seconds
- Script Component: 1 minute and 57 seconds
- Conditional Split: 1 minute and 54 seconds
Over 40M rows the difference between the BDD transform and the Script Component – 1:16 vs 2:13 (vs 1:24 for the equivalent Conditional Split), or 57 seconds difference when inserting in a raw file. In general, the overall performance improvement seems to be around 35-45% in that direction. Since I am inserting in my local SQL Server in the same table the parallel split does not seem to be beneficial even though the destination is slower than the source. In fact the straight insert outperforms the parallel data flows in both cases. If we were to insert into a partitioned table over different filegroups hosted on separate drives the results could be quite different. Either way, in my opinion it is a nice and welcome addition to our arsenal of SSIS data flow components.
Edit: Len Wyatt from the SQL Performance Team at Microsoft has provided a link to his post with great bit of detail about the BDD transform in the comments below. Please take a minute and have a look if interested.