Status change fact table – Part 4 (A PDI implementation)

(the previous articles in this series can be found here, here and here)

In this series of posts we’ve been discussing a data model that handles status changes of objects, together with a way to track those changes and perform analytics on them.

The previous article described the data model; in this article we show an implementation using PDI (Pentaho Data Integration).

The idea is as follows: after reading a data slice and sorting it by order_id (our primary key) and event_date, we determine which event is the first of the batch and which is the last. We then clone every event except the last one and apply some logic (a code sketch of this row-level logic follows the list):

  • The clone of any event is an “Out” event, which means its event_count is set to -1;
  • The “Out” event takes as its timestamp the timestamp of the next “In” event (the moment at which the event is in fact closed);
  • On the “Out” events we determine the event’s age (which is, by definition, 0 on the “In” events);
  • We calculate the cumulative sum of all event ages to determine the age of the order itself (this value is present on both “In” and “Out” events; beware of this when using the measure in reports);
  • All “last” events (per order) are assigned a different output partition. The output partitions are called “Open” for all “In” events that don’t yet have a matching “Out” event, and “Closed” for all matched “In/Out” pairs.
  • The output is appended to the “Closed” partition, but it overwrites the contents of the “Open” partition.
  • On each subsequent run, all new input data plus all previously processed but still unmatched “In” events are fed into the ETL. Those “In” events that eventually get matched with an “Out” event move to the “Closed” partition.
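
To make this row-level logic concrete, here is a minimal Python sketch of the same processing. The Event class and field names (event_count, event_age, order_age) are illustrative assumptions on my part; the actual transformation implements these steps graphically with PDI steps rather than code.

```python
from dataclasses import dataclass, replace
from datetime import datetime

@dataclass
class Event:
    # Illustrative row layout; the field names are assumptions, not the .ktr's.
    order_id: int
    event_date: datetime
    status: str
    event_count: int = 1    # +1 on "In" rows, -1 on "Out" clones
    event_age: float = 0.0  # 0 on "In" rows, elapsed seconds on "Out" rows
    order_age: float = 0.0  # cumulative sum of event ages for the order

def process(events):
    """Split a batch into matched In/Out pairs ("Closed") and the
    still-unmatched last event of each order ("Open")."""
    closed, open_ = [], []
    # Sort by the composite key used in the transformation.
    events = sorted(events, key=lambda e: (e.order_id, e.event_date))
    orders = {}
    for e in events:
        orders.setdefault(e.order_id, []).append(e)
    for order_events in orders.values():
        cumulative = 0.0
        # Clone every event except the last one into an "Out" event.
        for current, nxt in zip(order_events, order_events[1:]):
            current.order_age = cumulative
            closed.append(current)                      # matched "In" row
            age = (nxt.event_date - current.event_date).total_seconds()
            cumulative += age
            out = replace(current,
                          event_date=nxt.event_date,    # closed when the next "In" arrives
                          event_count=-1,
                          event_age=age,
                          order_age=cumulative)
            closed.append(out)                          # matching "Out" row
        last = order_events[-1]
        last.order_age = cumulative
        open_.append(last)                              # still waiting for its "Out"
    return closed, open_
```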

The ETL can thus run incrementally against an input data set that is fed new data periodically. If at any time the ETL runs without new input data, nothing changes: no new data is appended to the “Closed” partition, and the contents of the “Open” partition are read, left untouched (as there are no other events to process) and written back to the “Open” partition.
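
Continuing the sketch above, one incremental run could look roughly like this. Here read_events and write_events are hypothetical CSV helpers, and process() is the function shown earlier; the real work is of course done by the PDI transformation.

```python
import glob
import os

def run_incrementally(inbox_dir, output_dir):
    """One incremental run: new inbox files plus the previously unmatched
    "Open" events are processed together; matched pairs are appended to the
    "Closed" partition and the "Open" partition is rewritten from scratch.
    read_events/write_events are hypothetical CSV helpers."""
    open_path = os.path.join(output_dir, "open", "data.csv")
    closed_path = os.path.join(output_dir, "closed", "data.csv")

    rows = read_events(glob.glob(os.path.join(inbox_dir, "*.csv")))
    if os.path.exists(open_path):
        rows += read_events([open_path])          # re-feed unmatched "In" events

    closed, still_open = process(rows)

    write_events(closed_path, closed, mode="a")   # append: "Closed" only ever grows
    write_events(open_path, still_open, mode="w") # overwrite: "Open" reflects current state
```

With an empty inbox, process() receives only the old “Open” rows, matches nothing, and writes them back unchanged, which is exactly the behaviour described above.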

At the end of this article you’ll find a download link to a ZIP file that has all necessary files to run the ETL described above.

Inside the ZIP file you’ll find the following folders:

  • data: all the input data files;
  • inbox: the folder that the transformation will process;
  • output: the destination of the processed data;
  • kettle: the location of the actual PDI transformation.

To run the transformation, move the first few files from data into inbox (note: you need to respect the dates, so copy the older files first) and run the transformation. The output folder will then contain two files, output/open/data.csv and output/closed/data.csv; these constitute our output. To run it again, remove the files from the inbox folder and move the next batch of files from data. The ETL will run incrementally.
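
If you prefer to script these steps, here is a small driver sketch that moves a batch of files and launches the transformation with PDI’s command-line runner (Pan). The batch size, the assumption that file names sort chronologically, and the .ktr file name are all guesses; adjust them to the contents of the ZIP and run it from (or point it at) your PDI installation.

```python
import shutil
import subprocess
from pathlib import Path

data_dir, inbox_dir = Path("data"), Path("inbox")

# Empty the inbox, then move the next (oldest) batch of input files into it.
# Sorting by file name to respect the dates is an assumption; it only works
# if the names sort chronologically.
for old in inbox_dir.glob("*"):
    old.unlink()
for f in sorted(data_dir.glob("*"))[:3]:          # batch size of 3 is arbitrary
    shutil.move(str(f), str(inbox_dir / f.name))

# Launch the transformation with Pan (pan.sh on Linux/macOS, Pan.bat on
# Windows). The transformation file name is a hypothetical example.
subprocess.run(["./pan.sh", "-file=kettle/status_change_fact.ktr"], check=True)
```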

Download PDI Implementation

Read the last part of this series of articles here
