Filling in the gaps

Today I saw Harris Ward asking a question I’ve seen pop up over and over again. And somehow the solution is never obvious.

The problem is how to fill in the gaps when we have nulls in our data stream. For example, our data set looks like:

And we want to replace all those nulls by the last non-null value we’ve seen in the data set: all non null values stay as is, but Mar, Apr and May get the value 20, Jul gets 50 and Sep, Oct and Nov get 40.

If we’re ingesting the data from a CSV, Excel or XML file, it’s easy to achieve this (ever wondered what the “Repeat” flag means in the fields tab of the Text File input step?).

But suppose our data doesn’t come from files. We need to implement the equivalente of the repeat flag inside a PDI transformation.

Invariably, when any of us encounters this requirement we all go through the same attempts. Let’s use the Analytical query step! Or the Group by with a “pass all rows option”. Or the NVL operation of the calculator step. And then we learn these all fail.

This is perhaps one of the best examples where a Javascript step is the best option. Because it lets us persist variables between rows, and we can control their values at will.

So, to achieve the desired result, here’s our proposed solution: add a Javascript step and use the following code:

Variable tmp is only initialized with the first row of data. From then on it’ll keep the last non null value of sales (or 0).

Sure, we’ve all heard the motto “Don’t use Javascript, it kills PDI performance”. But the occasional simple JS script doesn’t hurt that much in most cases, you’re likely to have a slower step elsewhere in your transformation (a DB read or write operation, for example). And, if performance is vital, you can always re-code this as a User defined Java class. But for all but the most extreme applications Javascript will work just fine, though.

Don’t forget to specify on the configuration of the output fields that you want to overwrite the value of the sales field, though:

And here’s the transformed data:

Exactly as intended.

Download the sample transformation here.

Back to blog

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

  • Graph Databases: Loading Data with Neo4j

    Graph databases are becoming more popular as a way of storing and analysing large connected datasets. Neo4j is a...

    Read more
  • Date Dimension Revisited: Bank Holidays

    Everyone familiar with data warehousing knows about the date dimension. It’s one of the first steps in the creation...

    Read more
  • Setup Pentaho BA Server to use SSL Certificates

    SSL Certificate Java Truststore Setup SSL, or Secure Socket Layer, is a technology which allows web browsers and web...

    Read more