November 2010 – Boyan Penev on Microsoft BI

PowerPivot Books

Teo Lachev just published a review of Marco Russo’s and Alberto Ferrari’s book PowerPivot for Excel 2010: Give Your Data Meaning. Instead of a full-blown review, I will add a visual one:

Yes, I ordered three copies from Amazon. Ok, they are not all mine, but you can see what I think about this book…

Recently, I built a pilot for a demo in Avanade based on Twitter trends entirely using Microsoft SQL Server 2008 R2 and Microsoft Office. Since we are hearing about “Social BI”, and other “Social” aspects of software, this may help introduce an approach to this topic from Microsoft BI point of view.

The basic idea is:

Connect to search.twitter.com and get a feed which contains the results from a search
Do #1 every now and then (in my case – every 10 seconds)
Pick the last 15 feed items (tweets), and compare them to the last 15 items to determine any new tweets
Store all new tweets in SQL Server
Build a few dimensions (author, time, etc.)
Build a categorisation dimension. In my case I am using Full-Text search to find certain keywords around my search strings (e.g. Windows – bad/good/excellent/terrible), so I can analyse the feedback the search term is receiving
Build a SSAS cube on top of these dimensions and the feed history
Build an Excel pivot-based report on top of the SSAS cube

Now, for the first 4 points, we can use SSIS. We cannot natively connect to RSS/ATOM feeds in SSIS, but we can easily build a Script Transform as a data source, which connects and pulls the feed items in a data flow. I found a very useful C# reference here:

http://blogs.msdn.com/b/mattm/archive/2009/02/19/read-an-rss-feed-from-ssis.aspx

And, I customised it a bit (99% is the same as in the example above), so it works with Twitter and also, so it does not include irrelevant info. Of course, depending on what we need to achieve we can change the code to support all RSS/ATOM features. A sample of my script is below:

/* Microsoft SQL Server Integration Services Script Component
* Write scripts using Microsoft Visual C# 2008.
* ScriptMain is the entry point class of the script.*/
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.ServiceModel.Web;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Text;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private string url = string.Empty;
    private SyndicationFeed feed = null;
    private XmlReader reader = null;
   public override void PreExecute()
    {
     //base.PreExecute();
   //here we are using a SSIS variable for the feed address, which we can populate
   //from a database table, thus providing an easily customisable interface to our
   //package
        reader = XmlReader.Create(Variables.Feed1);
        feed = SyndicationFeed.Load(reader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
        reader.Close();
    }
    public override void CreateNewOutputRows()
    {
        if (feed != null)
        {
            foreach (var item in feed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.id = item.Id;
                Output0Buffer.published = item.PublishDate.DateTime;
                Output0Buffer.title = item.Title.Text;
                Output0Buffer.authorname = item.Authors[0].Name;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

Once we can connect to our source, everything becomes pretty much a run-of-the-mill standard ETL. My package at present supports only 5 feeds and they get pulls simultaneously. This is of course just a demo and is easy to extend as required. A natural way to improve this solution is to implement some sort of logic (i.e. a looping sequence container), so you harvest more feeds. The importantly convenient feature of the feeds is their ID, which is very useful for comparing items. Apart from it, another important property of our tweets is that they may contain one or more keyword, which we are searching for. Therefore, we would like to count them for each feed (sometimes resulting in the same feed item getting counted 3 or more times). A good example of this is if we are searching for Microsoft, Vista and Windows7, and we receive something like: “Microsoft is doing well with Windows7 – it is so much better than Vista” – here we want to know that all three terms got hit and we want to show that there was a Tweet for each (multiplying the total count by 3).

The following is an example of my SSIS items:

Control Flow:

Data Flow:

I am doing a little trick here to compare my last to present feed items – tagging them as old in the start of the control flow, and then deleting old ones in the end. This is sort of a snapshot processing logic and works just fine when running only this package many times in a row (which I am doing in SQL Server Agent).The output of this package is one table with all feed items.

We don’t need much more for a simple implementation such as WardyIT’s #sqlpass Trends:

http://www.wardyit.com/sqlpass/

However, once we decide to progress a bit further, we can achieve much more. As I mentioned earlier, we can use something like Free-Text Search to analyse the content of the feeds. Utilising CONTAINS(‘something’ NEAR something), we can find keywords around the search terms. This can be very useful if we want to make a little bit of sense from the raw tweets. I prepared a table, which categorises 20 words in Positive/Negative and one level below (Strongly/Moderately/Slightly Positive or Negative) and I am matching these to the tweets. This allows me to see trends in like/dislike logic for each search term. For ad-hoc analysis and easy consumption in Excel, I also created the following SSAS solution:

I am processing my cube every hour, which allows me to see fairly up-to-date results from it. Alternatively, depending on requirements you can easily switch it to ROLAP (considering the very low complexity of the cube and the relatively low amount of fact data) and get real-time analytics.Of course, the result matters the most and this is a little (un)useful dashboard in Excel, which showcases a few of the available analytical options we have from this very simple implementation:

Please don’t slam me for the choice of search terms and some visualisation items (e.g. the line chart x-axis) – I know that some are not quite relevant, but it is just a quick demo, right? J

PowerPivot Tips and Tricks – First Findings

I just finished a little project with PowerPivot together with James Beresford, and I feel compelled to share some of the lessons I learned during these exciting if a little bit stressful few days. There are a few things I have heard in the past in regards to good PowerPivot model design, but I’d like to re-iterate some, and maybe add some more to what other bloggers have already mentioned. As it is usual with my posts – the following is not an exhaustive list, but may resonate well with newbie PowerPivot developers like me (which at this stage is probably 99% of Total).

Integer Keys

I found out the importance of having integer dimension keys. The sample model I was given to work with had floats as primary keys for all dimensions. This has always been a pretty bad choice, however, with PowerPivot avoiding this becomes even more important as these get loaded in memory and unless your workstation has heaps of it, keeping the size of the solution smaller is extremely important. Therefore, even though PowerPivot should work just fine with other data types, I would strongly recommend choosing integers for your keys.

Previewing and Filtering

Always have a look at the table you are importing into PowerPivot and make sure you deselect all the columns from your tables which you do not actually need. They can always be selected later on if needed. As it also is with SSAS – starting from a small model and extending it later is much better than starting from getting every possible column in and then removing them. This is even more valid with PowerPivot as you will use up a lot of memory to retrieve everything from the back-end database and unless you really don’t care about memory this is not a good thing.

Co-existence with SQL Server

I know for a fact that PowerPivot and SQL Server can co-exist peacefully. However, since SQL Server was running on the same box, I had to limit its memory usage to 150Mb so it doesn’t cause issues with PowerPivot.

Moving logic to the database

Instead of trying to do more in PowerPivot, if you are designing the solution and/or building the model – make sure that you move join logic, or similar in the back-end database. There is no point in importing tables in PowerPivot when the relationships between them are one-to-one. Instead – you can join them in a view, and then import them as one piece, thus avoiding some relationships in Vertipaq, which will improve performance and will make it easier to work with the model. In example, I had a 1-1 relationship through a large intermediate table (3+mil rows), which I managed to get rid of completely in the database back-end by applying a simple join.

Many To Many Relationships

If you can avoid them – do it. However, if you can’t, you can read Marco Russo’s, Alberto Ferrari’s and this PowerPivot FAQ posts to get more info on how to implement them. Unfortunately, we have to create multiple calculated members in DAX to avoid the problem and if we have many measures, this can become a bit tedious.

Saving Regularly

Yes, you should save your work, especially when you notice that memory gets scarce. Maybe not every 5 minutes, but at least every 30min should be a norm. Since once memory is very low PowerPivot starts having issues with itself (i.e. becomes a bit unstable), and you can easily lose some work. It does take some time to save a large model on disk, but it may be well worth it. This may sound as a no-brainer, but it is easy to stop saving our work because of the time required to do so. Therefore, it is possible (as it was in my case) to get a funny message asking you to Continue or Cancel and you are not too sure what’s the better choice after a couple of hours of work…

Ordering Dimensions in PowerPivot

UPDATE (2011-07-20): Indeed, with the release of CTP3 of SQL Server Denali we can order the dimension members by another column effectively making this post obsolete. Please ignore the advice below if you are working with a version of PowerPivot and BISM >= Denali CTP3.

I am sure that this post will get outdated very soon, but until that happens it is a neat little trick that PowerPivot users may need to know.

While in Analysis Services we can order our attribute hierarchies any way we want, in PowerPivot we are stuck with the default sort order, which can be annoying in some cases. The other day while building a PowerPivot model (which performed GREAT by the way), I had to order a band dimension, which had members like “Band 1-4”, “Band 5-10”, “Band 11-14”, “Band 15-19”,…, “Band 100-109”, etc. Alphabetically this got ordered very wrong as all bands starting with “Band 1…” got bunched up in the beginning followed by “Band 2…” The users were very unhappy and after searching for a solution I got a bit disheartened because I could not find a solution anywhere online.

The following is Adventure Works 2008 in Excel 2010 (through PowerPivot) and the default ordering of the English Month Name attribute in the Date dimension:

What I noticed was that the actual order which I wanted was reflected by the primary key of the table, which went 1,2,3,4,… So, I un-hid the key from the Pivot Table, renamed it to Band Order and showed users how they can nest it in front of the Band attribute. Because I had one key per band I did not have to worry about having multiple unordered bands under the order attribute.

Furthermore, if placed on columns, the Order attribute row can be hidden for sheets which remain static (it’s ok if they have slicers):

Otherwise, if it is on rows, we can use the Classic Pivot Table view and then hide the order column, which gives a different, but equally useful result:

As a side note, when ordering date attributes like in my examples above, you can use the Sort Options -> More Options functionality in Excel like described in a PowerPivot Team’s post. I used the month as an example and the technique described here should be used when having trouble with other attributes. You can also sort them manually by dragging and dropping them in the pivot table as Dan English explains here.

Thoughts on BISM, SSAS and MDX

I missed PASS this year but I am addicted to all blog posts coming from the attendees. I am reading, sending links to colleagues and mostly thinking about the future – both with disappointment (rarely), and excitement (99% of the time). There were two very significant blog posts by Chris Webb and Teo Lachev in the last couple of days and I would like to share my thoughts on the future of SSAS.

Chris and Teo are both moderately unhappy (that’s how I got the mood from what they wrote) with the future of SSAS because of the shift of focus from “traditional” M/H/R OLAP to the in-memory Vertipaq engine, as well as the switch from MDX to DAX. My initial feeling was similar – I like MDX and the way SSAS works. Furthermore, after all, SSAS is the best OLAP tool on the market right now and it got there after a long evolution. As Donald Farmer mentioned on Tech Ed in Australia this year, SSAS is the best-selling OLAP tool for Oracle. It also is the best-selling OLAP tool in general. Leaving a tool like that and moving on to a completely new technology is a stunning move in my opinion.

There are a few things, which I think are important to consider before declaring it as a wrong move. First of all, in today’s affordable 64bit world, there are a few limits for in-memory OLAP. Also, MDX, if powerful, different, interesting and sometime elegant, is HARD. I know (personally) literally 3-4 developers in Australia who know how to write good MDX and grasp the concepts behind it. The vast majority of feedback on MDX is that it is “too hard”. Therefore, the two main selling points of SSAS – ease of building analytics and performance are hard to improve on without radical action. The only way to make Microsoft OLAP easier to use is to move on from MDX and the easiest way to improve performance is to go in-memory. This is why I am very optimistic about the future of SSAS and BISM. If small-size competitors like QlikView can build a good product (technologically), which can in some cases outperform SSAS, I am very confident that the SSAS team with its vast expertise on building the most successful OLAP tool in the world can beat them over a few years. It is better to start earlier rather than later.

On the flipside, the handling of the whole matter is puzzling and disorienting. Chris had a few really good points about the problems that may arise from such a swift transition. How do we sell traditional pre-Vertipaq SSAS when it is getting replaced with a new technology, which will require a couple (at least) releases to beat everyone else and become Enterprise-ready? This is a question I would like to get an answer for, as I am an advocate of Microsoft BI and apart from the minor inconvenience to my base for arguments with advocates of other BI suites/products will no doubt cause a major headache when talking to prospective clients. This is in fact my only concern.

To recap – I am very happy about the ongoing evolution of SSAS and I would be very happy when Vertipaq becomes the SSAS of in-memory BI, when we have an easier language to query it and when the stack gets better integration between its components. However, the transition is too abrupt for my taste. The “security through obscurity” approach to marketing I am observing may not be able to convince enterprise customers with complex needs and strategic plans to adopt a tool in transition. I hope that Microsoft does put more effort in a seamless migration path from SSAS to Vertipaq, so that the #1 spot of Microsoft OLAP technologies gets preserved through this drastic change.

Jumping straight to Amazon now, to buy a book or two on DAX – a great opportunity for me to get a head start in DAX and PowerPivot before we get the technology in SSAS J My previous concerns about the PowerPivot adoption rates because of a few little things is now replaced with excitement over the fact that we will get this new technology on a full-featured client-server architecture (if not fully-featured in Denali, then soon after that).

UPDATE: Please note Amir Netz’s response to Chris’ post in the comments – it explains the intent behind BISM and paints a brighter future for MOLAP and MDX.

UPDATE #2:TK Anand has a blog post in regards to this topic here.

UPDATE #3:Chris Webb’s follow up with a more optimistic tone here.