Analysing Twitter Trends

Recently, I built a pilot for a demo in Avanade based on Twitter trends entirely using Microsoft SQL Server 2008 R2 and Microsoft Office. Since we are hearing about “Social BI”, and other “Social” aspects of software, this may help introduce an approach to this topic from Microsoft BI point of view. 

The basic idea is: 

  1. Connect to search.twitter.com and get a feed which contains the results from a search
  2. Do #1 every now and then (in my case – every 10 seconds)
  3. Pick the last 15 feed items (tweets), and compare them to the last 15 items to determine any new tweets
  4. Store all new tweets in SQL Server
  5. Build a few dimensions (author, time, etc.)
  6. Build a categorisation dimension. In my case I am using Full-Text search to find certain keywords around my search strings (e.g. Windows – bad/good/excellent/terrible), so I can analyse the feedback the search term is receiving
  7. Build a SSAS cube on top of these dimensions and the feed history
  8. Build an Excel pivot-based report on top of the SSAS cube

Now, for the first 4 points, we can use SSIS. We cannot natively connect to RSS/ATOM feeds in SSIS, but we can easily build a Script Transform as a data source, which connects and pulls the feed items in a data flow. I found a very useful C# reference here: 

http://blogs.msdn.com/b/mattm/archive/2009/02/19/read-an-rss-feed-from-ssis.aspx 

And, I customised it a bit (99% is the same as in the example above), so it works with Twitter and also, so it does not include irrelevant info. Of course, depending on what we need to achieve we can change the code to support all RSS/ATOM features. A sample of my script is below: 

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.ServiceModel.Web;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Text;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private string url = string.Empty;
    private SyndicationFeed feed = null;
    private XmlReader reader = null;
   public override void PreExecute()
    {
     //base.PreExecute();
   //here we are using a SSIS variable for the feed address, which we can populate
   //from a database table, thus providing an easily customisable interface to our
   //package
        reader = XmlReader.Create(Variables.Feed1);
        feed = SyndicationFeed.Load(reader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
        reader.Close();
    }
    public override void CreateNewOutputRows()
    {
        if (feed != null)
        {
            foreach (var item in feed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.id = item.Id;
                Output0Buffer.published = item.PublishDate.DateTime;
                Output0Buffer.title = item.Title.Text;
                Output0Buffer.authorname = item.Authors[0].Name;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

Once we can connect to our source, everything becomes pretty much a run-of-the-mill standard ETL. My package at present supports only 5 feeds and they get pulls simultaneously. This is of course just a demo and is easy to extend as required. A natural way to improve this solution is to implement some sort of logic (i.e. a looping sequence container), so you harvest more feeds. The importantly convenient feature of the feeds is their ID, which is very useful for comparing items. Apart from it, another important property of our tweets is that they may contain one or more keyword, which we are searching for. Therefore, we would like to count them for each feed (sometimes resulting in the same feed item getting counted 3 or more times). A good example of this is if we are searching for Microsoft, Vista and Windows7, and we receive something like: “Microsoft is doing well with Windows7 – it is so much better than Vista” – here we want to know that all three terms got hit and we want to show that there was a Tweet for each (multiplying the total count by 3). 

The following is an example of my SSIS items: 

 Control Flow: 

 

Data Flow:

 

I am doing a little trick here to compare my last to present feed items – tagging them as old in the start of the control flow, and then deleting old ones in the end. This is sort of a snapshot processing logic and works just fine when running only this package many times in a row (which I am doing in SQL Server Agent).The output of this package is one table with all feed items.

We don’t need much more for a simple implementation such as WardyIT’s #sqlpass Trends:

http://www.wardyit.com/sqlpass/

However, once we decide to progress a bit further, we can achieve much more. As I mentioned earlier, we can use something like Free-Text Search to analyse the content of the feeds. Utilising CONTAINS(‘something’ NEAR something), we can find keywords around the search terms. This can be very useful if we want to make a little bit of sense from the raw tweets. I prepared a table, which categorises 20 words in Positive/Negative and one level below (Strongly/Moderately/Slightly Positive or Negative) and I am matching these to the tweets. This allows me to see trends in like/dislike logic for each search term. For ad-hoc analysis and easy consumption in Excel, I also created the following SSAS solution:

 

I am processing my cube every hour, which allows me to see fairly up-to-date results from it. Alternatively, depending on requirements you can easily switch it to ROLAP (considering the very low complexity of the cube and the relatively low amount of fact data) and get real-time analytics.Of course, the result matters the most and this is a little (un)useful dashboard in Excel, which showcases a few of the available analytical options we have from this very simple implementation:

 

Please don’t slam me for the choice of search terms and some visualisation items (e.g. the line chart x-axis) – I know that some are not quite relevant, but it is just a quick demo, right? J