PDI: getting the previous value of a field, but only sometimes

In PDI we can easily retrieve the previous (or next) value of a field by using the Analytical Query transformation step. It lets us fetch the value of one or more fields N rows back or N rows forward. This is quite useful for things like cumulative sums (although for that we could just use a Group by step) or getting the previous attribute values of an SCD, for example to map a type II SCD to a type III SCD.
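To make the "N rows back or forward" idea concrete, here is a minimal sketch in plain JavaScript, run outside PDI. The helper shiftField is hypothetical (it is not a PDI function), and returning null when the offset falls outside the data is an assumption made for this illustration.

```javascript
// Hypothetical helper, not part of PDI: returns, for every row, the value of
// `field` found `offset` rows away (negative = back, positive = forward).
// Rows without a neighbour at that offset get null (an assumption here).
function shiftField(rows, field, offset) {
  return rows.map((_, i) => {
    const j = i + offset;
    return j >= 0 && j < rows.length ? rows[j][field] : null;
  });
}

const sampleRows = [
  { category_raw: "A" },
  { category_raw: "B" },
  { category_raw: "D" },
];

// One row back and one row forward
const prevValues = shiftField(sampleRows, "category_raw", -1);
const nextValues = shiftField(sampleRows, "category_raw", 1);
console.log(prevValues, nextValues);
```

The offset is a single fixed number for the whole data set, which is exactly the limitation discussed next.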

However, this is a rather limited step:

  • The number of rows back or forth is fixed; it can be a variable, but can’t be dynamically changed;
  • We can only get a field’s value, not anything more complex, like operations on fields;
  • We can’t get values based on a condition (e.g., if customer number is null, take the previous customer number; otherwise, build it using some algorithm).

We were facing a problem where the answer seemed to be the Analytical Query step… except that it wasn’t. Here’s the problem: in an event-tracking database we have a category_raw field. As a general rule, this is the value we want to use as the true category of the object; let’s call it category_new. However, if the category_raw value is ‘X’, we should ignore it and use the previous category_new value instead. Sounds simple, right?

Here’s a sample of the data and the desired output:

category_raw;category_new
A;A
B;B
X;B
D;D
E;E

From here it seems quite obvious: we use the Analytical Query step to fetch the category_raw value of the previous row into a new field, prev_category_raw, and then evaluate a simple condition:

category_raw == "X" ? prev_category_raw : category_raw

However, if we have several consecutive exceptions,

category_raw;category_new
A;A
X;A
X;A
D;D
E;E

this approach doesn’t work: the 3rd row would fetch “X”, the value of the previous row, not “A”, which occurred 2 rows before.
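The failure is easy to reproduce outside PDI. The sketch below simulates the one-row lookback in plain JavaScript; naiveCategoryNew is a hypothetical name used only for this illustration.

```javascript
// Naive rule: when the raw value is "X", take the raw value one row back.
// Hypothetical function, simulating the lookback-plus-condition approach.
function naiveCategoryNew(rawValues) {
  return rawValues.map((raw, i) => {
    const prevRaw = i > 0 ? rawValues[i - 1] : null;
    return raw === "X" ? prevRaw : raw;
  });
}

// A single exception is handled correctly...
const singleException = naiveCategoryNew(["A", "B", "X", "D", "E"]);
// singleException is ["A", "B", "B", "D", "E"]

// ...but with two in a row, the 3rd row picks up the "X" from the 2nd row.
const consecutiveExceptions = naiveCategoryNew(["A", "X", "X", "D", "E"]);
// consecutiveExceptions is ["A", "A", "X", "D", "E"]
console.log(singleException, consecutiveExceptions);
```

Looking back a fixed number of rows cannot fix this, because the number of consecutive “X” rows is not known in advance.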

We tried to trick the Analytical Query step into fetching the value of the same field it’s creating, but that doesn’t work either. The step has to read values from one field and write a new field, but they must be different.

In the end we decided to go with a simple JavaScript step, not because it’s the only way to go (you can quite easily fetch the objects you need using a Java class, a Java expression, etc.), but because it’s simple.

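Here’s the JavaScript code:

var category_new;
if (category_raw == "X") {
  // Exception value: reuse the category_new stored by the previous row
  category_new = getVariable("prev_category_new", "");
} else {
  category_new = category_raw;
}
// Remember this row's result so the next row can read it
setVariable("prev_category_new", category_new, "s");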

We know we’re setting a variable and using it in the same transformation, which is something we were always told is wrong. And usually, it is. However, as setting and getting the variable both happen inside the same step, it’s actually harmless: the variable is set for row 1 before it’s read on row 2, and so on…

Here’s a sample of the result:

category_raw;category_new
A;A
B;B
X;B
X;B
X;B
F;F
X;F
H;H

Each row picks up the correct category_new value, taking into account the handling of the exceptional “X” values.
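For readers who want to check the logic without opening Spoon, here is a minimal sketch that runs the same per-row script in plain JavaScript. The getVariable and setVariable functions below are stand-ins backed by a Map, written only for this simulation; they imitate the calls used in the step but are not PDI's implementation, and the scope argument is ignored.

```javascript
// Stand-in for PDI's variable space (an assumption for running outside PDI).
const variableSpace = new Map();

function getVariable(name, defaultValue) {
  return variableSpace.has(name) ? variableSpace.get(name) : defaultValue;
}

function setVariable(name, value, scope) {
  variableSpace.set(name, value); // scope is ignored in this stand-in
}

// Same logic as the step's script, applied to one row at a time.
function processRow(category_raw) {
  let category_new;
  if (category_raw === "X") {
    category_new = getVariable("prev_category_new", "");
  } else {
    category_new = category_raw;
  }
  setVariable("prev_category_new", category_new, "s");
  return category_new;
}

// Rows are processed strictly in order, as they are inside a single step.
const inputRows = ["A", "B", "X", "X", "X", "F", "X", "H"];
const outputRows = inputRows.map((raw) => processRow(raw));
console.log(outputRows.join(","));
```

With this stand-in, an “X” on the very first row would yield the default empty string, since no previous category_new has been stored yet.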
