
Dynamic date ranges in MDX: what happens if there’s no data?

Imagine the following scenario: a user wants a query that returns the sales value for the past N years (let's say 5). It's quite an easy query; your ROWS clause would look like this:

SELECT {[Measures].[Sales]} ON COLUMNS,
([Time].[2005].Lag(4) : [Time].[2005]) ON ROWS
FROM [SteelWheelsSales]

(As we're using our good old friend SteelWheelsSales, let's pretend we're back in 2005; if you'd like to have SteelWheelsSales updated, say to cover 2013-2015, please vote on the JIRA we created.)

Here lies the problem: SteelWheelsSales only has data from 2003 to 2005, so the member [Time].[2005].Lag(4), which would be 2001, doesn’t exist. As such, the first member of that range doesn’t exist and the row set is empty.
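To double-check, we can count the members of that range in a calculated measure (a quick sketch; Range Size is a made-up helper name, and we rely on the engine resolving the out-of-range Lag to an empty range rather than raising an error):

WITH
-- counts how many members the lagged range actually contains
Member [Measures].[Range Size] as Count([Time].[2005].Lag(4) : [Time].[2005])
SELECT {[Measures].[Range Size]} ON COLUMNS
FROM [SteelWheelsSales]

Against SteelWheelsSales this should return 0: the range never gets a valid starting member.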

And that's not nice. It would be good to have a query that goes back N years but still displays something when there isn't enough data.

What we need is a dynamic date range: one that goes as far back as 4 years or, if there aren't that many years in the database, shows all the years available.

Here’s what such a query would look like:

WITH
Set YEARS as [Time].[Years].Members
Member [Measures].[Year Number] as Rank([Time].[2005], YEARS) - 1
Member [Measures].[Lag] as IIf([Measures].[Year Number] < 4, [Measures].[Year Number], 4)
SELECT
{[Measures].[Sales]} ON COLUMNS,
([Time].[2005].Lag([Measures].[Lag]) : [Time].[2005]) ON ROWS
FROM [SteelWheelsSales]

Here's how it works:

  • The Year Number measure determines the rank of the last year of the date range; we subtract 1 to make it zero-based;
  • The Lag measure checks whether that zero-based rank is at least 4; if it isn't, it uses the rank itself, which makes the range start at the first available year;
  • The date range on ROWS then uses the Lag measure to set the beginning of the range.

This query returns all three years available in SteelWheelsSales; if more periods were available, it would display up to 5 years of data.
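To see what the helper measures evaluate to, we can project them on COLUMNS instead (a sketch reusing the definitions above):

WITH
Set YEARS as [Time].[Years].Members
Member [Measures].[Year Number] as Rank([Time].[2005], YEARS) - 1
Member [Measures].[Lag] as IIf([Measures].[Year Number] < 4, [Measures].[Year Number], 4)
-- show the intermediate values instead of Sales
SELECT {[Measures].[Year Number], [Measures].[Lag]} ON COLUMNS
FROM [SteelWheelsSales]

With only 2003, 2004 and 2005 in the cube, Rank([Time].[2005], YEARS) is 3, so Year Number is 2 and Lag stays at 2: the range becomes [Time].[2005].Lag(2) : [Time].[2005], i.e. 2003 to 2005.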

And we can parametrize the query to take the year as a parameter:

WITH
Set YEARS as [Time].[Years].Members
Member [Measures].[Year Number] as Rank([Time].[${year}], YEARS) - 1
Member [Measures].[Lag] as IIf([Measures].[Year Number] < 4, [Measures].[Year Number], 4)
SELECT
{[Measures].[Sales]} ON COLUMNS,
([Time].[${year}].Lag([Measures].[Lag]) : [Time].[${year}]) ON ROWS
FROM [SteelWheelsSales]
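The ${year} token is the usual placeholder syntax of Pentaho tools such as CDA. Assuming plain text substitution, passing year = 2004 means the engine actually receives:

WITH
Set YEARS as [Time].[Years].Members
-- ${year} replaced with 2004 before the query reaches the engine
Member [Measures].[Year Number] as Rank([Time].[2004], YEARS) - 1
Member [Measures].[Lag] as IIf([Measures].[Year Number] < 4, [Measures].[Year Number], 4)
SELECT
{[Measures].[Sales]} ON COLUMNS,
([Time].[2004].Lag([Measures].[Lag]) : [Time].[2004]) ON ROWS
FROM [SteelWheelsSales]

which returns 2003 and 2004 only: 2004's zero-based rank is 1, so the range goes just one year back.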

Bonus points: make it smarter, that is, make the query dynamically return a range of members of the time dimension, regardless of the level of the parameter:

WITH
Set MEMBERS as ${timeParameter}.Level.Members
Member [Measures].[Member Number] as Rank(${timeParameter}, MEMBERS) - 1
Member [Measures].[Lag] as IIf([Measures].[Member Number] < 4, [Measures].[Member Number], 4)
SELECT
{[Measures].[Sales]} ON COLUMNS,
(${timeParameter}.Lag([Measures].[Lag]) : ${timeParameter}) ON ROWS
FROM [SteelWheelsSales]

This query takes a parameter ${timeParameter} as a fully qualified MDX member, e.g. [Time].[2004], [Time].[2004].[QTR1] or [Time].[2004].[QTR1].[Jan], and returns a row set with 5 time periods, or a smaller range if there isn't that much data in the past.
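For instance, with ${timeParameter} = [Time].[2004].[QTR1].[Jan] and, again, plain text substitution, the engine evaluates:

WITH
-- .Level.Members now yields all months, because the parameter is a month
Set MEMBERS as [Time].[2004].[QTR1].[Jan].Level.Members
Member [Measures].[Member Number] as Rank([Time].[2004].[QTR1].[Jan], MEMBERS) - 1
Member [Measures].[Lag] as IIf([Measures].[Member Number] < 4, [Measures].[Member Number], 4)
SELECT
{[Measures].[Sales]} ON COLUMNS,
([Time].[2004].[QTR1].[Jan].Lag([Measures].[Lag]) : [Time].[2004].[QTR1].[Jan]) ON ROWS
FROM [SteelWheelsSales]

Assuming the Time hierarchy contains all twelve months of 2003, Jan 2004 has a zero-based rank of 12 at the month level, so Lag caps at 4 and the rows cover Sep 2003 through Jan 2004: five months.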

GEM, a Generic ETL Machine

It's finally here. After a few months of work, a lot of bug fixing and a last-minute refactor of all database connections, we are proud to release the first version of GEM, the Generic ETL Machine.

It's an ETL framework based on Pentaho Data Integration that automates most of the common tasks that take time to develop and are essential for a properly maintainable ETL, but are hard to justify when the project's stakeholders just want to see their data. Things like:

  • a solid logging mechanism with email alerts;
  • data lineage that lets us trace the data from the reports and dashboards to the cube, then the fact table, then the staging tables, then the source files or databases, complete with line numbers, the date each record was extracted, etc.;
  • a way (still very archaic) to define the order in which several tasks should run;
  • a way to quickly deploy from the development laptop to a server and to switch environments, from dev to test to prod;
  • and all of this, of course, configurable.

This is still a first version and a lot of work is left to do. For the time being there's no user interface to display the status or results of each ETL run, all rollbacks must be performed manually in the database, and there's no way to schedule different tasks at different intervals. That's all on the roadmap, time permitting, together with more samples (CSV files and web services, to name a couple), support for PDI clusters, and input from and output to Hadoop in its multiple shapes and forms.

But it's usable as it is (well, sort of: we only support a couple of databases at each stage for the time being) and it's been allowing us to spend a significantly larger amount of time on what really matters when we develop an ETL from scratch: the data analysis and the design of the data model. Not enabling file or DB logging, or setting up the email step for the 500th time.

Clone it, fork it, commit to it, enjoy!

Source code on GitHub