Range Queries with Azure DataMarket Feeds

By default the Azure DataMarket does not allow range queries. In fact, the only way we can filter a data feed is by specifying one or more values for its "queryable" fields. There is no technical reason for not allowing range queries, as both the back end (presumably SQL Azure or SQL Server) and the OData protocol support them. Fortunately, there is a way to consume a range of the values in a column of a data feed in PowerPivot. It is not straightforward, and I do not think the target audience of self-service BI/PowerPivot and of the DataMarket itself would appreciate the complexity, but it could be useful anyway.

If we want to pull all three tables from the DataMarket, we can simply use https://api.datamarket.azure.com/BoyanPenev/DateStream/ as the URL in PowerPivot.

Otherwise, we can pick each one individually with a URL like the following (for the BasicCalendarEnglish table):

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish

If we filter the data set on the DataMarket website to request only the data for 2010 we get the following URL:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20eq%202010

Note the last bit:

?$filter=YearKey%20eq%202010

This is simply the URL-encoded form of ?$filter=YearKey = 2010, with eq being the OData operator for =.

In OData we can also use operators other than = (or eq). For ranges these are gt (greater than), ge (greater than or equal to), lt (less than) and le (less than or equal to). We can also use the and and or operators to combine different predicates. For a more thorough list, please refer to http://www.odata.org/developers/protocols/uri-conventions. If we replace the "= 2010" with "< 2010" (lt in OData) and then encode the URL, we do indeed get all years prior to 2010.

Things get slightly more complicated in more complex scenarios. For example, when building a date table we may want to include all years between 2000 and 2030. To do that, we would have to write something like:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey >= 2000 and YearKey <= 2030

Encoded, the same URL looks like this:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20ge%202000%20and%20YearKey%20le%202030

Here a space becomes %20 and the mathematical comparison operators have been replaced with their OData counterparts, ge and le.
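
If we ever need to build such a URL programmatically rather than by hand, a minimal C# sketch (just an illustration, not something PowerPivot requires) could look like this:

using System;

class DateStreamFilterExample
{
    static void Main()
    {
        // The DateStream table used throughout this post
        string baseUrl = "https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish";

        // The year-range predicate written with the OData operators
        string filter = "YearKey ge 2000 and YearKey le 2030";

        // Uri.EscapeDataString turns the spaces into %20, producing the encoded URL shown above
        string url = baseUrl + "?$filter=" + Uri.EscapeDataString(filter);
        Console.WriteLine(url);
    }
}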

If we paste this in PowerPivot and hit "Next", we get exactly what we expect – a table with 31 years (2000 to 2030 inclusive).

Things get more complicated if we include the datetime column DateKey in the URL. For a single date (e.g. 1900-01-01), we have to use:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime'1900-01-01T00:00:00'

After applying URL encoding we get:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27

Here %27 is an apostrophe and %3a is a colon (for a list of ASCII characters and their URL-encoded forms we can refer to http://www.w3schools.com/tags/ref_urlencode.asp).

Now, to combine the two we would need to write:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime'1900-01-01T00:00:00' or (YearKey >= 2000 and YearKey <= 2030)

Encoded this becomes:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27%20or%20%28YearKey%20ge%202000%20and%20YearKey%20le%202030%29

This monstrous-to-write URL string returns 31 years of data plus one extra day.
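
The earlier C# sketch can be extended to build this combined predicate as well; the only new part is formatting the datetime literal (again, only an illustration under the same assumptions):

using System;

class CombinedFilterExample
{
    static void Main()
    {
        string baseUrl = "https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish";

        // The "s" format gives the sortable ISO pattern OData expects: 1900-01-01T00:00:00
        DateTime date = new DateTime(1900, 1, 1);
        string filter = "DateKey eq datetime'" + date.ToString("s") +
                        "' or (YearKey ge 2000 and YearKey le 2030)";

        // Uri.EscapeDataString encodes the spaces and colons; depending on the .NET version
        // it may also encode the apostrophes and parentheses, as in the hand-encoded URL above
        string url = baseUrl + "?$filter=" + Uri.EscapeDataString(filter);
        Console.WriteLine(url);
    }
}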

I suppose this approach can be classified as a workaround, as I have not seen any documentation on PowerPivot referring to options for filtering data from the Azure DataMarket. However, in my opinion, this should be a feature of the DataMarket itself, as it would make it easier (or at all possible) for users of any tool to get just the data they need, and it could even reduce the load on the service, since it would no longer be necessary to export everything and then attempt to apply a filter.

Feeds in SSRS Reports

The XML data source in SSRS can be very powerful – not only for querying web services, but also for showing feed data on our reports. In this post I will show you how we can display the contents of my favourite blog site, SQLBlog.com, in an SSRS report.

First we need to get the feed URL. In this case it is: http://sqlblog.com/blogs/MainFeed.aspx. Once we have this, we can use it as-is in a new XML data source in SSRS.

Then, we can use it in a data set in our report. The little tricky part here is that SSRS expects us to send a query to the data source. We do not need to do this for an RSS feed and instead we can leave the query blank.

However, this results in an empty field list, and hitting Refresh Fields does not help. To overcome this problem we need to know what the fields are in the first place. We can open the Query Designer and click the Execute button (!). The result is a number of rows corresponding to the feed items and columns containing various types of information about the items.

In the SQLBlog case we get one line for each category the author has placed the post in, while in other cases we may get one line per feed item or some other variation. Because feed contents vary, it is always necessary to inspect them and make sure we understand what the data source returns, so we can deal with the feed effectively later on. Many columns are irrelevant and we need to select only the ones we need. From the sample feed I wanted to show just the title, link, pubDate, creator and description on my report. Since the field list was blank, I added these fields manually as Query Fields.
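
As an alternative to leaving the query blank and typing the Query Fields by hand, the SSRS XML data provider also accepts an explicit ElementPath query. Something along these lines should return the same fields for a standard RSS 2.0 feed (this is a sketch of mine, not taken from the original report, and the exact element names depend on the feed):

<Query>
  <ElementPath IgnoreNamespaces="true">
    rss {} /channel {} /item {title, link, pubDate, creator, description}
  </ElementPath>
</Query>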

Then, I placed these fields in a table to check the result.

I could see all items as expected, but the formatting was off. Additionally, there were multiple items repeating because of the multiple categories for each article. With a bit of work I managed to group/format/lay out the items appropriately, add report actions for the links and generally transform the table into a nice report, which I could use as my default feed reader.

And the rendered report:

By the way, the second item is very interesting – Alberto Ferrari has just been awarded the Microsoft MVP award for 2011!

One problem that you may find is the default authentication SSRS uses for an XML data source – "Do not use credentials". This will cause a problem once the report is deployed, so I would recommend changing it to something else (for example, Windows Authentication).

There we go – we have a fully functional blog reader. We can enhance it by providing a list of authors (like a table of contents with linked reports), report parameters filtering the table, and other nice-to-have things. Of course, there are many feed readers out there and you may wonder why you would build your own in SSRS. Well, think not only of RSS but also of other content which comes through XML – for example web service content which we can query through SOAP, Twitter feeds, etc. Additionally, incorporating such content within an SSRS dashboard or report could come in handy if we get our hands on a feed showing financial information. One other source of content could be Microsoft's Azure Data Marketplace – after all, the data there comes in feeds and it could be handy to show it directly on our reports with a little bit of formatting.

Download the sample report here:

Analysing Twitter Trends

Recently, I built a pilot for a demo at Avanade based on Twitter trends, entirely using Microsoft SQL Server 2008 R2 and Microsoft Office. Since we keep hearing about "Social BI" and other "Social" aspects of software, this may help introduce an approach to the topic from a Microsoft BI point of view.

The basic idea is: 

  1. Connect to search.twitter.com and get a feed which contains the results from a search
  2. Do #1 every now and then (in my case – every 10 seconds)
  3. Pick the last 15 feed items (tweets) and compare them to the 15 items from the previous pull to determine which tweets are new
  4. Store all new tweets in SQL Server
  5. Build a few dimensions (author, time, etc.)
  6. Build a categorisation dimension. In my case I am using Full-Text search to find certain keywords around my search strings (e.g. Windows – bad/good/excellent/terrible), so I can analyse the feedback the search term is receiving
  7. Build a SSAS cube on top of these dimensions and the feed history
  8. Build an Excel pivot-based report on top of the SSAS cube

Now, for the first four points, we can use SSIS. We cannot natively connect to RSS/ATOM feeds in SSIS, but we can easily build a Script Component acting as a data source, which connects to the feed and pulls its items into the data flow. I found a very useful C# reference here:

http://blogs.msdn.com/b/mattm/archive/2009/02/19/read-an-rss-feed-from-ssis.aspx 

I customised it a bit (99% is the same as in the example above) so that it works with Twitter and does not include irrelevant info. Of course, depending on what we need to achieve, we can change the code to support all RSS/ATOM features. A sample of my script is below:

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.ServiceModel.Web;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Text;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private string url = string.Empty;
    private SyndicationFeed feed = null;
    private XmlReader reader = null;
    public override void PreExecute()
    {
        //base.PreExecute();
        // Here we are using an SSIS variable for the feed address, which we can populate
        // from a database table, thus providing an easily customisable interface to our
        // package.
        reader = XmlReader.Create(Variables.Feed1);
        feed = SyndicationFeed.Load(reader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
        reader.Close();
    }
    public override void CreateNewOutputRows()
    {
        if (feed != null)
        {
            foreach (var item in feed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.id = item.Id;
                Output0Buffer.published = item.PublishDate.DateTime;
                Output0Buffer.title = item.Title.Text;
                Output0Buffer.authorname = item.Authors[0].Name;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

Once we can connect to our source, everything becomes a pretty much run-of-the-mill ETL exercise. My package at present supports only 5 feeds and they get pulled simultaneously. This is of course just a demo and it is easy to extend as required; a natural way to improve the solution is to implement some sort of looping logic (e.g. a looping sequence container), so more feeds can be harvested. A particularly convenient feature of the feed items is their ID, which is very useful for comparing items. Apart from that, another important property of our tweets is that they may contain one or more of the keywords we are searching for. Therefore, we want to count them once for each matched term (sometimes resulting in the same feed item getting counted 3 or more times). A good example of this is if we are searching for Microsoft, Vista and Windows7 and we receive something like: "Microsoft is doing well with Windows7 – it is so much better than Vista" – here all three terms got hit and we want to show that there was a tweet for each (multiplying the total count by 3).
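
To illustrate that logic outside SSIS, here is a rough C# sketch (the names and values are made up for the example, not taken from the package). It keeps only the items with unseen IDs and counts one hit per matched search term:

using System;
using System.Collections.Generic;
using System.Linq;

class TweetMatchingSketch
{
    static void Main()
    {
        // IDs we have already stored in SQL Server (hypothetical values)
        var storedIds = new HashSet<string> { "tag:search.twitter.com,2005:1" };
        var searchTerms = new[] { "Microsoft", "Vista", "Windows7" };

        // The latest items pulled from the feed (only one shown here)
        var latestItems = new[]
        {
            new { Id = "tag:search.twitter.com,2005:2",
                  Text = "Microsoft is doing well with Windows7 - it is so much better than Vista" }
        };

        // 1. Only items whose ID we have not seen before are new tweets
        var newTweets = latestItems.Where(i => !storedIds.Contains(i.Id)).ToList();

        // 2. Count one hit per search term found in the tweet text,
        //    so a single tweet can be counted two or three times
        var hits = newTweets.SelectMany(t => searchTerms
            .Where(term => t.Text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            .Select(term => new { t.Id, Term = term }));

        foreach (var hit in hits)
            Console.WriteLine("{0} -> {1}", hit.Term, hit.Id);
    }
}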

The following is an example of my SSIS items: 

Control Flow:

Data Flow:

I am doing a little trick here to compare the last and the present feed items – tagging them as old at the start of the control flow and then deleting the old ones at the end. This is a sort of snapshot processing logic and it works just fine when only this package is run many times in a row (which I am doing through SQL Server Agent). The output of this package is one table with all feed items.

We don’t need much more for a simple implementation such as WardyIT’s #sqlpass Trends:

http://www.wardyit.com/sqlpass/

However, once we decide to progress a bit further, we can achieve much more. As I mentioned earlier, we can use something like Full-Text Search to analyse the content of the feeds. Utilising CONTAINS with the NEAR operator (for example, a predicate along the lines of CONTAINS(TweetText, '"Windows7" NEAR "excellent"'), where TweetText is a hypothetical full-text indexed column), we can find keywords around the search terms. This can be very useful if we want to make a little bit of sense of the raw tweets. I prepared a table which categorises 20 words as Positive/Negative and, one level below, as Strongly/Moderately/Slightly Positive or Negative, and I am matching these to the tweets. This allows me to see trends in like/dislike logic for each search term. For ad-hoc analysis and easy consumption in Excel, I also created the following SSAS solution:

I am processing my cube every hour, which allows me to see fairly up-to-date results from it. Alternatively, depending on requirements, you can easily switch it to ROLAP (considering the very low complexity of the cube and the relatively small amount of fact data) and get real-time analytics. Of course, the result matters the most, and this is a little (un)useful dashboard in Excel which showcases a few of the analytical options available from this very simple implementation:

Please don't slam me for the choice of search terms and some visualisation items (e.g. the line chart x-axis) – I know that some are not quite relevant, but it is just a quick demo, right? :)