PDI: getting the previous value of a field, but only sometimes

In PDI we can easily retrieve the previous (or next) value of a field by using the Analytical Query transformation step. It allows us to fetch the value of a given field (or fields) N rows back or N rows forward. Quite useful for things like cumulative sums (although for that we could just use a Group by step), or for getting the previous attribute values of a type II SCD in order to map it to a type III SCD.

However, this is a rather limited step:

  • The number of rows back or forth is fixed; it can be a variable, but can’t be dynamically changed;
  • We can only get a field’s value, not anything more complex, like operations on fields;
  • We can’t get values based on a condition (e.g., if customer number is null, take the previous customer number; otherwise, build it using some algorithm)

We were facing a problem where the answer seemed to be the Analytical Query step… only that it wasn’t. Here’s the problem: on an event tracking database we have a category_raw field; as a general rule, this is the value we want to use as the true category of the object, let’s call it category_new. However, if the category_raw field value is ‘X’, we should ignore it and instead use the previous category_new value. Sounds simple, right?

Here’s a sample of the data and the desired output:

category_raw;category_new
A;A
B;B
X;B
D;D
E;E

From here it seems quite obvious: we use the Analytical Query step, fetch the category_raw value of the previous row and with a simple condition we evaluate

category_raw == "X" ? prev_category_raw : category_raw

However, if we have several consecutive exceptions,

category_raw;category_new
A;A
X;A
X;A
D;D
E;E

This approach doesn’t work: the 3rd row would fetch “X”, the value of the previous row, not “A” which occurred 2 rows before.

We tried to trick the Analytical Query step into fetching the value of the same field it’s creating, but that doesn’t work either. The step has to read values from one field and write a new field, but they must be different.

In the end we decided to go with a simple JavaScript step, not because it’s the only way to go (you can quite easily fetch the values you need using a Java class, a Java expression, etc.), but because it’s simple.

Here’s the JavaScript code:

var category_new;
if ( category_raw == "X" ) {
  // Exception: reuse the last good category we stored
  category_new = getVariable( "prev_category_new", "" );
} else {
  category_new = category_raw;
}
// Remember this row's category so the next row can pick it up
setVariable( "prev_category_new", category_new, "s" );

We know we’re setting a variable and using it in the same transformation, which is something we were always told is wrong. And usually, it is. However, as setting and getting the variable both happen inside the same step, it’s actually harmless: the variable is set for row 1 before it’s called on row 2, and so on…

Here’s a sample of the result:

category_raw;category_new
A;A
B;B
X;B
X;B
X;B
F;F
X;F
H;H

Each row picks up the correct category_new value, taking into account the handling of the exceptional “X” values.

Hello world, again

Well, we finally got round to upgrading our Pentaho installation to 5.2 and decided to revamp our World Population dashboard in the process.

First thing you may notice: it’s called World Data now, not World Population. This is because we plugged in data available from Gapminder and added a few selectors to allow you to choose different metrics. Anything from overall population to GDP per capita in inflation-adjusted PPP dollars to the percentage of renewable energy: you name it, Gapminder probably has it.

The best part: there’s time series data, sometimes dating as far back as the Middle Ages, which allows us to show how those indicators evolved for any particular country. Just click on a country and you’ll see.

We also added a switch to allow you to change the color scale from linear to log (the algorithm decides automatically which one to use, but it sometimes chooses unwisely).

Here, enjoy it: World Data demo dashboard.

Reading/Writing files from a Kettle CDA datasource in Pentaho 5.x

Kettle datasources in CDA are the most versatile way of getting data from disparate systems into a dashboard or report. Both C-tools and PRD expose Kettle as a datasource, so we can call a transformation that queries several different systems (databases, web services, etc.), do some Kettle magic on the results and then ship the final resultset back to the server.

However… there’s no way out of the box to reference sub-transformations in Pentaho 5.x. Solution files are now hosted on Jackrabbit and not on the file system and ${Internal.Transformation.Filename.Directory} just doesn’t work. If you have a transformation calling a sub-transformation and upload both to the repository, when you run it you get an error message saying that it cannot find the file.

Although this usually isn’t a problem, as most Kettle transformations used by dashboards are relatively simple, for more complex tasks you may want to use mappings (sub-transformations), call a Job or Transformation Executor or, perhaps the best use case of all, use metadata injection. In these scenarios you _really_ need to be able to call ktr files from the BA server context, and as it happens you can’t.

Well, sort of. As it turns out, you can.

The problem: let’s say you have your server installed in /opt/pentaho/biserver-ce (or biserver-ee, we at Ubiquis don’t discriminate), and your Kettle file is in your repository in /public/myDashboard/files/kettle/myTransformation.ktr. You want to use metadata injection on a transformation called myTemplate.ktr, so you reference it as ${Internal.Transformation.Filename.Directory}/myTemplate.ktr in your Metadata Injection step. When you run your datasource, the error message will say something like

Cannot read from file /opt/pentaho/biserver-ce/pentaho-solutions/public/myDashboard/files/kettle/myTemplate.ktr because it’s not a file.

So what seems to be happening is that Kettle is using your pentaho-solutions folder as the root folder and it just appends the JCR path to it. Which obviously fails, as there’s no physical counterpart in your filesystem for the JCR files.

But, if you use the Pentaho Repository Synchronizer plugin to copy the contents of JCR to the filesystem and vice-versa (which is perhaps the simplest way to have your solution files under GIT or SVN, at least until the Ivy GS plugin becomes fully functional), then the synchronisation process will create a folder repositorySynchronizer inside your pentaho-solutions folder and use it to create the copies of every JCR file.

So, after synchronising JCR with the file system, your template transformation can be found in /opt/pentaho/biserver-ce/pentaho-solutions/repositorySynchronizer/public/myDashboard/files/kettle.

And now, the funny bit: to get it working, just create a symbolic link to remove the extra repositorySynchronizer element from your path: go to your pentaho-solutions folder and just type ln -s repositorySynchronizer/public public.
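
In other words, using the install path from the example above (adjust it to your own setup):

$ cd /opt/pentaho/biserver-ce/pentaho-solutions
$ ln -s repositorySynchronizer/public public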

Now the template transformation is found and you can reference it from a transformation called by a dashboard.

Credit for the discovery goes to Miguel Cunhal, our junior consultant, who had the tough task of getting our metadata injection transformation running inside a demo dashboard. But more on that in another post.

Dynamic date ranges in MDX: what happens if there’s no data?

Imagine the following scenario: a user wants a query to return the sales value for the past N years (let’s say 5). It’s quite an easy query and would look like this:

SELECT {[Measures].[Sales]} ON COLUMNS,
([Time].[2005].Lag(4) : [Time].[2005]) ON ROWS
FROM [SteelWheelsSales]

(As we’re using our good old friend SteelWheelsSales, let’s pretend we’re back in 2005; if you’d like to have SteelWheelsSales updated, say to cover 2013-2015, please vote on the JIRA we created.)

Here lies the problem: SteelWheelsSales only has data from 2003 to 2005, so the member [Time].[2005].Lag(4), which would be 2001, doesn’t exist. As such, the first member of that range doesn’t exist and the row set is empty.

And that’s not nice. It would be good to have a query that goes back N years in time but, if there isn’t enough data, still displays something.

What we need is a dynamic date range: one that goes as far back as 4 years ago or, if there aren’t enough years in the database, shows all the years available.

Here’s what such a query would look like:

WITH
Set YEARS as [Time].[Years].Members
Member [Measures].[Year Number] as Rank([Time].[2005], YEARS ) - 1
Member [Measures].[Lag] as Iif( [Measures].[Year Number] < 4, [Measures].[Year Number], 4 )
SELECT ([Time].[2005].Lag( [Measures].[Lag] ) : [Time].[2005]) ON ROWS,
{[Measures].[Sales]} ON COLUMNS
FROM [SteelWheelsSales]

Here's how it works:

  • The Year Number measure determines the rank of the last year of the date range; we subtract 1 to make it zero-based;
  • The Lag measure checks whether the last year's rank is at least 4. If it isn't, it takes the zero-based rank instead (which means the first available year becomes the start of the date range);
  • The date range on Rows uses this Lag measure to set the beginning of the range.

This query returns all three years available in SteelWheelsSales. However, if more periods are available, it'll display up to 5 years of data.

And we can parametrize the query to take a year number:

WITH
Set YEARS as [Time].[Years].Members
Member [Measures].[Year Number] as Rank([Time].[${year}], YEARS ) - 1
Member [Measures].[Lag] as Iif( [Measures].[Year Number] < 4, [Measures].[Year Number], 4 )
SELECT ([Time].[${year}].Lag( [Measures].[Lag] ) : [Time].[${year}]) ON ROWS,
{[Measures].[Sales]} ON COLUMNS
FROM [SteelWheelsSales]

Bonus points: make it smarter, that is, make the query dynamically return a range of members of the time dimension, regardless of the level of the parameter:

WITH
Set MEMBERS as ${timeParameter}.Level.Members
Member [Measures].[Member Number] as (Rank(${timeParameter}, MEMBERS) - 1)
Member [Measures].[Lag] as IIf( ([Measures].[Member Number] < 4), [Measures].[Member Number], 4 )
SELECT {[Measures].[Sales]} ON COLUMNS,
(${timeParameter}.Lag([Measures].[Lag]) : ${timeParameter}) ON ROWS
FROM [SteelWheelsSales]

This query takes a parameter ${timeParameter} as a fully qualified MDX member, e.g. [Time].[2004] or [Time].[2004].[QTR1] or [Time].[2004].[QTR1].[Jan] and returns a row set with 5 time periods, or a smaller range if we don't have that much data in the past.

GEM, a Generic ETL Machine

It’s finally here. After a few months of work, a lot of bugs fixed and a last minute refactor of all database connections, we are proud to release the first version of GEM, Generic ETL Machine.

It’s an ETL framework based on Pentaho Data Integration that automates most of the common tasks that take time to develop and are essential for a proper, maintainable ETL, but that are hard to justify when the project’s stakeholders just want to see their data. Things like:

  • a solid logging mechanism with email alerts;
  • data lineage that lets us trace the data from the reports and dashboards to the cube, then the fact table, then the staging tables, then the source files or databases, complete with line numbers, the date each record was extracted, etc.;
  • a way (still very archaic) to define the order in which several tasks should run;
  • a way to quickly deploy from the development laptop to a server and to switch environments, from dev to test to prod;
  • and all this, of course, configurable.

This is still the first version and a lot of work is left to do. For the time being there’s no user interface to display the status or results of each ETL run, all rollbacks must be performed manually in the database and there’s no way to schedule different tasks at different intervals. That’s all on the roadmap, time permitting, together with increasing the number of available samples (CSV files and web services, to name a couple), support for PDI clusters, and input and output directly to Hadoop in its multiple shapes and forms.

But it’s usable as it is (well, sort of – we only support a couple of databases on each stage for the time being) and it’s been allowing us to spend a significantly larger amount of time dealing with what really matters when we have to develop an ETL from scratch: the data analysis and the design of the data model. Not enabling file or DB logging, or setting up the email step for the 500th time.

Clone it, fork it, commit to it, enjoy!

source code on Github

Pentaho Community Meeting 2014, Antwerp

Wow, what a ride!

Last weekend (part of) the Pentaho community gathered for the 7th (8th?) annual Pentaho Community Meeting. This year the location was Antwerp, Belgium, hosted by our friends at Know BI and sponsored by Pentaho.

And it was unanimously considered the best PCM ever, if for nothing else, for the sheer quality of the presentations. We had tons of cool stuff being presented, a lot of new Sparkl plugins and PDI plugins and, all in all, interesting tricks learnt from people across a wide range of specialities and roles in the Pentaho world.

In case you missed it, it’s rumoured next year’s PCM will be held in London (and Ubiquis is keen to help organise it) but, most importantly, you should check out the Pentaho London User Group, which will meet again on 2 December at Skills Matter. Among other interesting topics, Dan Keeley (@codek1) will be presenting a talk covering the last PCM.

See you there!

MonetDB bulk loader and PDI

It’s pretty much common knowledge that you shouldn’t use a Table Output step with a column oriented database. All those insert statements are painstakingly slow. That’s why we have bulk loaders. Not all DBMSs have a bulk load step, but the most common ones do (either as a Job entry or a Transformation step). There’s even a Vertica bulk loader, though it’s not included with the release (note to self: create a JIRA requesting that).

This post is about PDI’s MonetDB bulk loader step. You have to define the DB connection, set the target schema and table, set a buffer size (number of rows to be sent each time, default is 100k), and insert the field mappings. So, it’s relatively easy to set up and test.

pdi_bulk_loader

It works quite nicely and… oh, wait, it doesn’t work anymore. Apparently something is broken with the bulk load step. It does get the rows through up to whatever you defined as the buffer size, but then it blows up. So, in this case it loaded 100k rows and threw an exception after that. If you increase the buffer size to 500k rows it’ll load 500k rows and blow up on row 500001. And so on.

Obviously this is a problem. If your target DB is MonetDB you have two options to load the data via PDI: the slow Table Output or the much faster MonetDB bulk loader, which is broken.

After some googling I found a JIRA issue describing exactly the same problem (PDI-12278), and it also suggested a code change to fix it. The issue was reported in May and it’s on the backlog, so it’s unlikely to make it into a release anytime soon. I needed the step working now, so I decided to test the fix. As it happens, it works very nicely.

Now, I’m not a Java dev. In fact, I can’t even read Java code, let alone write it. So I decided to write this to explain to the non-devs around what you need to do to get the bulk loader step working.

I assume you have a Linux/Unix/Mac machine to do this. If you’re running Windows you should: a) switch to a decent operating system; b) failing that, install Cygwin and all the packages required for this to work (e.g., you’ll need xmltasks at least, but the last time I used Cygwin on a Windows box to build anything was some 4 years ago, so don’t take my word for it).

Step 1: Get the code

From your shell just clone PDI from the Github repo:

$ git clone https://github.com/pentaho/pentaho-kettle

Step 2: Build it

$ cd pentaho-kettle
$ ant dist

While this is building, grab yourself a cup of coffee and wait. It may take up to 1 hour to download all the required dependencies.

Check that you get Build successful in the end.

Step 3: See if it works

$ cd dist
$ ./spoon.sh

If it doesn’t run, most likely something is wrong with your environment, or the build failed somewhere. Go back to step 2 and fix whatever the problem is.

Step 4: Change the code, build again

$ cd engine/src/org/pentaho/di/trans/steps/monetdbbulkloader
$ nano MonetDBBulkLoader.java

Comment out lines 505-511 (WARNING: obviously, as more code is committed to the repository, the line numbers may change. Below is what the code should look like, with a few lines before and after for context. An easy way to find the snippet you need to change is to search for “again…”; there’s a single occurrence.)

// again...
error = data.in.waitForPrompt();
if ( error != null ) {
  throw new KettleException( "Error loading data: " + error );
}

// COMMENT FROM HERE
// data.out.writeLine( "" );
// and again, making sure we commit all the records
// error = data.in.waitForPrompt();

// if ( error != null ) {
//   throw new KettleException( "Error loading data: " + error );
// }
// TO HERE

if ( log.isRowLevel() ) {
  logRowlevel( Const.CR );
}

Save the file and build it again.
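
That is, back at the root of the pentaho-kettle checkout, run the same build as before (the path below assumes you cloned the repository into your home folder; adjust as needed):

$ cd ~/pentaho-kettle
$ ant dist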

Step 5: Test to see that it works

pdi_bulk_loader_fixed

It’s working nicely and loaded all 10M rows in about 1’15” (the upload was done to a remote server some 3 countries away, not localhost, hence the relatively slow load speed).

Step 6: Apply the patch to a release version of PDI, instead of using trunk-snapshot

The previous steps allowed us to test the patch on the trunk-snapshot version of PDI. Most likely, this is NOT the version of PDI you want to run; you may want it, for example, in PDI 5.2. What we need to do now is unpack the JAR file that contains the MonetDBBulkLoader class, copy the class to the same location in the release version’s JAR and then repackage it.

For those who don’t know the least bit about Java, all those JAR files inside an application folder (e.g., the lib folder of PDI) are just archive files. You can unzip them to see what’s inside.

So first, let’s copy the trunk-snapshot file to somewhere nicer and unpack it. Let’s assume you have a ~/temp folder you want to use:

$ cd lib
$ cp kettle-engine-TRUNK-SNAPSHOT.jar ~/temp/

Now do the same for your release PDI kettle engine (just replace <version> with your PDI version and build number):

$ cd <path to your PDI release>/lib
$ cp kettle-engine-<version>.jar ~/temp/

Unzip both files:

$ cd ~/temp
$ mkdir trunk-snapshot
$ mkdir release
$ unzip kettle-engine-TRUNK-SNAPSHOT.jar -d trunk-snapshot
$ unzip kettle-engine-<version>.jar -d release

Now copy the class you want from the snapshot folder to the release folder:

$ cp trunk-snapshot/org/pentaho/di/trans/steps/monetdbbulkloader/MonetDBBulkLoader.class release/org/pentaho/di/trans/steps/monetdbbulkloader/MonetDBBulkLoader.class

Repack the release jar file:

$ cd release
$ jar cmf META-INF/MANIFEST.MF ../kettle-engine-<version>.jar *

Finally, grab your brand new kettle-engine JAR file and copy it to the lib folder of your PDI (remember to backup the original one, just in case).
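
If you want a quick sanity check that the patched class actually made it into the archive before you deploy it (just a convenience, not part of the original procedure):

$ unzip -l kettle-engine-<version>.jar | grep MonetDBBulkLoader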

Step 7: Test that it works

Your MonetDB bulk loader step should now work in your 5.2 PDI version.

WARNING: This step by step guide is provided as-is. It was tested on PDI CE 5.2 and worked for me. That does not assure it will work for everybody, nor across all versions or platforms. Use it at your own risk and remember to back up your original files first. If you spot an error or a typo in this post, please leave a comment and we’ll change it.

1 year gone by

Ubiquis Consulting turned 1 year old last Saturday. It’s been a great year, full of the normal ups and downs of starting a business, but overall a very positive experience. And we got to bake a cake to celebrate the anniversary, which is always a plus.

IMG_1503

For our second year, we plan to keep this blog a bit busier, so stay tuned.

So, what else?… Oh, yes, please join me in welcoming Miguel Cunhal to the Ubiquis Consulting family. Miguel is a very bright young consultant and he is looking forward to learning all things BI and Pentaho. We’re sure he’ll be a very valuable addition to our team.

Date dimensions

As this is the first post about data warehousing stuff I thought we might start with something simple: the date dimension.

In a time when a lot of organisations are talking about big data, predictive analytics, NoSQL data stores and huge Hadoop clusters, one might think the basic stuff is well covered. It’s known territory; there’s nothing new to discover there.

However, even the most basic of things are sometimes weirdly wrong in a lot of places. Case in point: the date dimension.

I’ve seen quite a few data warehouses and all of them have a date dimension. It’s probably the most used dimension in any data warehouse. And yet, in almost all of them things are just… wrong.

How can a date dimension be wrong? It’s just a set of columns with years, month names and numbers, days of the month; anybody can calculate those for any date. It’s quite trivial, isn’t it? Well, yes and no. The problem usually lies with weeks.

If the typical year-quarter-month-day hierarchy has no big secrets around it (although I’ve seen a few date dimensions with no mention of month numbers, so months end up sorted April-August-December-…), weeks cause all sorts of odd things in reports.

Case 1: weeks as children of Months

I’ve seen a few examples where a date hierarchy would look like this: Year-Quarter-Month-Week-Day. Obviously, as months are very badly behaved and don’t have an integer number of weeks, this causes some weeks to break around month boundaries. You get Week 31, child of July, with 4 days (28, 29, 30 and 31 of July) and Week 31, child of August, with 3 days (August 1, 2 and 3). Call me old fashioned, but I like weeks to be 7 days long.

This type of hierarchy is obviously wrong, but it’s also easy to fix. Just use two different hierarchies, one Year-Quarter-Month-Day and another Year-Week-Day of Week-Day. Weeks do not fit into months and as such should not be in the same hierarchy. And most date dimensions I’ve seen, apart from a few exceptions, do have this in mind.

Case 2: getting the years wrong

However, the most common error I’ve encountered is more subtle. Let’s first use the naive approach and see where it takes us. A simple PDI transformation generates 10 rows and calculates dates between 2013-12-23 and 2014-01-11. I chose these limits because they’re both more than 1 week away from the year change. The Calculator step calculates the year, week number and day of week of the date. I also calculate the day of the week in English because I want to know what DOW=1 means and I never remember.

naive_approach

When we preview the data we notice immediately two things: 1) weeks start on Sunday (ok, it’s a convention I can live with) and 2) 2013-12-29, 2013-12-30 and 2013-12-31 are in week 1… of 2013?

naive_data

This can’t be right, can it? If 2013-12-31 belongs to week 1, as does 2013-01-01, that means we have this weird week 1 of 2013 that has some days in January and some more in December. Which doesn’t make any sense. Even worse, if we repeat the exercise for the end of 2012 and beginning of 2013, we see that there are 5 days in January 2013 that belong to week 1. These 5 days, together with 3 days in December, give us a very weird W1 of 2013 with 8 days, divided into two groups very far apart from each other.

But, believe it or not, I’ve seen something like this in a lot of production dim_date tables, and the users’ reaction was more a “meh, we know our data never makes sense around the end of the year, we live with it” than a “this is obviously wrong, fix it!”. This was a production data warehouse used every day by senior level executives to get numbers out and evaluate the company’s performance, and there’s a 2 week period every year for which the data is utterly useless and completely wrong!

Call me old fashioned, but for me a week should have the following two basic properties:
1) it must have 7 days. Always.
2) they must be 7 consecutive days.

On top of that, if possible, I’d like it to:
3) always start on the same day of the week;
and 4) sort within a week in a way that matches the chronological sorting of dates.

So clearly we have a problem here. Looking at the calculator step and the options available (btw, someone should sort those, it’s getting harder to find a specific calculation with every new version of PDI), there’s an interesting option immediately after the Week of Date A: the ISO8601 Week of Year of date A. And also ISO8601 Year of date A.

calculator_options

The Wikipedia article explains quite well what the definition is and how to implement it (my favourite formulation, because it’s also the easiest to actually implement in an algorithm: week 1 is the week of January 4th). So, there’s a standard for calculating week numbers, let’s use it! But, while reading it, it becomes clear that the year cannot be the calendar year. It may happen that January 1st belongs to week 52 of the previous year (e.g., 2012-01-01 belongs to week 52 of 2011) or even to week 53 of the previous year (2010-01-01 belongs to week 53 of 2009). Likewise, December 31 may belong to week 1 of the following year, as was the case with 2013-12-31.
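
To make the definition concrete, here’s a minimal JavaScript sketch of the calculation (for illustration only; the function and variable names are mine, and in PDI we’ll simply use the Calculator’s built-in options, as shown next):

// Minimal sketch: ISO 8601 week number and week-based year for a given date.
// Week 1 is the week containing January 4th (equivalently, the week containing
// the year's first Thursday); weeks start on Monday.
function isoWeekAndYear( d ) {
  var date = new Date( d.getTime() );
  var dow = ( date.getDay() + 6 ) % 7;        // Monday = 0 ... Sunday = 6
  date.setDate( date.getDate() - dow + 3 );   // jump to the Thursday of this week
  var isoYear = date.getFullYear();           // that Thursday always falls in the ISO year
  var jan4 = new Date( isoYear, 0, 4 );       // January 4th is always in week 1
  var jan4dow = ( jan4.getDay() + 6 ) % 7;
  var week1Thu = new Date( isoYear, 0, 4 - jan4dow + 3 );
  var week = 1 + Math.round( ( date - week1Thu ) / ( 7 * 24 * 3600 * 1000 ) );
  return { week: week, year: isoYear };
}

// e.g. isoWeekAndYear( new Date( 2013, 11, 31 ) ) gives week 1 of 2014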

So, let’s give it another go and calculate week numbers and years by the ISO 8601 standard.

correct_calculator

correct_data

Ok, now it makes sense. Just one thing, though: ISO weeks start on Mondays, but the Day of week calculation returns 1 for Sundays. So we can’t use it as-is to give us an ordinal column.

Two options there:
a) Map Mondays to 1, Tuesdays to 2, …, Sundays to 7 (as in the sketch below);
b) just replace the 1 for Sundays with 8 if you’re feeling lazy.
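
For option a), a one-liner in a Modified Java Script Value step does the trick (a sketch that assumes the Calculator’s day-of-week field is called day_of_week, with 1 = Sunday):

// Map 1 = Sunday ... 7 = Saturday onto the ISO ordering 1 = Monday ... 7 = Sunday
var day_of_week_ordinal = ( day_of_week == 1 ) ? 7 : day_of_week - 1;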

Now we can calculate a proper date dimension using PDI, one that doesn’t blow up when users query the week-over-week growth of sales around New Year.

transformation

You can download the transformation here: dim_date.ktr.

And this is what the Mondrian hierarchy looks like:


<Dimension type="TimeDimension" visible="true" highCardinality="false" name="Dates">
  <Hierarchy name="By Month" visible="true" hasAll="true" allMemberName="All" primaryKey="date_id">
    <Table name="dim_date"/>
    <Level name="Year" visible="true" column="year" type="Integer" uniqueMembers="true" levelType="TimeYears"/>
    <Level name="Month" visible="true" column="month_name_full" ordinalColumn="month_number" type="String" uniqueMembers="false" levelType="TimeMonths"/>
    <Level name="Day" visible="true" column="day_of_month" type="Integer" uniqueMembers="false" levelType="TimeDays"/>
    <Level name="Date" visible="true" column="date_string" type="String" uniqueMembers="true" levelType="TimeDays"/>
  </Hierarchy>
  <Hierarchy name="By Week" visible="true" hasAll="true" allMemberName="All" primaryKey="date_id">
    <Table name="dim_date"/>
    <Level name="Year" visible="true" column="iso_year_of_week" type="Integer" uniqueMembers="true" levelType="TimeYears"/>
    <Level name="Week" visible="true" column="iso_week_of_year" type="Integer" uniqueMembers="false" levelType="TimeWeeks"/>
    <Level name="Day of Week" visible="true" column="day_of_week_full" type="String" ordinalColumn="day_of_week_ordinal" uniqueMembers="false" levelType="TimeDays"/>
    <Level name="Date" visible="true" column="date_string" type="String" uniqueMembers="true" levelType="TimeDays"/>
  </Hierarchy>
</Dimension>

Of course a lot more tweaks can be added to this date dimension; suggestions on how to improve it are welcome.

Hello world, again

We got some very good feedback on our World Population demo, thanks everyone.

And in response to your opinions, we made a few modifications:
– Moving the globe no longer brings up the country details overlay;
– Clicking on a country in the table will bring up the details overlay;
– When a country is selected, it’s highlighted in a different color.

The new version is updated in the demo site.