PDI and XML: Beyond the native steps


In day to day development we are sometimes faced with having to not only parse data delivered in XML, but a. lso generate complex XML files containing said data. It is nowadays common to exchange database extracts in XML format, and, in many other cases, results to API calls are provided in XML. Moreover, in many occasions one needs to construct either simple or complex XML, either to transmit data or to notify other systems that an action is required.

All of this requires a data engineer to create processes that receive, parse, and construct XML in near real time.

Parsing XML in PDI

Constructing a process to parse XML in PDI can be a frustrating ordeal. The developer creates a transformation to read XML using the native steps, optimises it for large throughput to a database only to discover that, once the source files increase in size, the real bottleneck is in the reading process. Moreover, when splitting the original file into several streams, issues with recursion or complex structures, memory and CPU resources increase almost exponentially.

To overcome this problem we suggest reading XML using the StAX parser step and processing its output. All of the parsing needs to be done manually, of course, but now all of the data is being streamed before being interpreted. We also suggest staging data using one or several “Serialize to file” steps before doing any merges or joins before finally saving to a database. Storing the data locally during processing is considerably faster than storing it in a staging database over the network, especially for large files.

Constructing XML

Creating an XML to represent a complex relational database structure can be a daunting task, especially when several hierarchical levels need to be represented. We proposed that the best way to do this is to template the output corresponding to the data in the leaves of the hierarchies.

This, in turn, will become the data to be represented by the higher levels. This recursion can be easily constructed using PDI transformations.

Ordering requirements, such as XML sequences, can easily be catered for and joins between parts of the XML are now reduced to a sorting of snippets according to a pre-defined priority.

Not only does this reduce the memory requirements when generating large XML files, it also significantly reduces the number of queries one makes to the database where the data is stored, thus considerably speeding up XML generation.


PDI is a generic tool that allows not only the integration of data sources but also its export. By its visual nature, it allows the creation of complex pipelines without forcing the developers to do any type of advanced coding. XML, however, has traditionally been in the domain of library utilization and custom code. We present a number of patterns that bring both the parsing and construction of complex and large XML files to the domain of visual programming used in PDI.

You can download the code and example data here.

Setup Pentaho BA Server to use SSL Certificates

SSL Certificate Java Truststore Setup

SSL, or Secure Socket Layer, is a technology which allows web browsers and web servers to communicate over a secured connection. This means that during our initial attempt to communicate with a web server it will present our web browser with a set of credentials, in the form of a “Certificate”, as proof the site is who and what it claims to be. In certain cases, the server may also request a Certificate from our web browser, asking for proof that we are who we claim to be.

This post will go through the process required to have our Pentaho Server using SSL certificates.

Create your SSL key/certificate

First thing to do is to create the SSL certificate key for our tomcat server to use. And then we will need to tell our tomcat server to allow HTTPS connections. To do this in the Pentaho Server we edit a file named “server.xml”, referenced here:

And here :

For this exercise we will be using the Java KeyStore (JKS). A keystore manages the provision of the client private keys/certificates we use when we try to access a server, and a truststore (cacerts in Java) manages the verification of those keys/certificates. 

In order to do create the key we need to create a new JKS keystore, containing a single self-signed key/certificate.  Execute the following from a terminal command line:

$ keygen -genkey -alias tomcat -keyalg RSA

This command will create a new file, in the home directory of our user named “.keystore”. To specify a different location or filename, add the -keystore parameter, followed by the complete path to our keystore file. We will also need to reflect this new location in the server.xml configuration file.

        1. First we are prompted for a password for the keystore and later for a password for the actual key. (Strangely in older tomcat versions they need to be the same) This keystore password will have to be given to our server application so it can access our server side keys. Note – The default password used by Tomcat is “changeit” (all lower case).
        2. After deciding on a password, a number of questions need to be answered as to certify the authenticity of our server.
        3. Finally we are prompted for the password for the actual tomcat alias key. If any issues come up in this step, check if you need to have the same password as for the datastore.

Ok, we have a keystore where Java can find certificate keys to show servers when prompted, with one key with the alias tomcat. Now we need to tell Java to use this key in their trustore. We export the key to a *.cer file and go tell Java that he should incorporate this key in cacerts, Javas’ truststore.

$ keytool -export -alias tomcat -file tomcat.cer -storepass <password> -keypass <password> -keystore .keystore
Certificate stored in file <tomcat.cer>
$ cd $PENTAHO_JAVA_HOME/jre/lib/security/
$ keytool -import -alias tomcat -file ~/tomcat.cer -keystore cacerts -storepass changeit

Configure Tomcat for HTTPS

Now we need to tell Pentaho 8 we are ready to use HTTPS using our key. To do so we need to change our server.xml file in the tomcat folder: tomcat/conf/server.xml.

There we need to uncomment the connector for HTTPS SSL protocol in port 8443 and add the credentials  for the keystore location and password:

<Connector URIEncoding="UTF-8" port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
keystoreFile="<pentaho_user_home>/.keystore" keystorePass=<password>
clientAuth="false" sslProtocol="TLS" />

This configures a new connector at port 8443 using https, however we do not close the http connector at port 8080 yet. This allows both to be used, at the same time (we may close it later if we want to prevent HTTP connections). Do be careful as if you try to login on both and your browser executes cached code it can create a login error. Using a private browsing window helps avoid this issue.

Also we need to tell Pentaho 8 we now have a new port in the server.properties file: pentaho-solutions/system/server.properties


And this should be it. We should have gotten our server to respond in both port 8080 using HTTP and 8443 using HTTPS. Keep in mind what I mentioned regarding the browser cache, so if we simply use a private browsing window all should be ok.

Securing a Pentaho Server

How to customise security and access

This post will show you how you can restrict the access to your Pentaho BA Server, essentially creating a “guest” user role, and cherry picking what that user has access to.

Spring Security – Becoming very selective

We want to restrict the access of visitors to only specific demos, rather than the default permissions granted to the Authenticated users.

In order to do this we first need to adjust our BA Server’s Administration->User/Roles settings. What we want to achieve is a specific role for all company users, and a set of separate roles for visitors.

First we create a role for company users with almost all privileges, and remove all other user roles. After this is done we will need to restrict the privileges of Authenticated users. We restrict their access to a minimum so we only allow them to read and execute content.

Now we add a guest user, without a role associated, which means it will behave as an Authenticated user. Once this is done we need to hard code how access to web services are given to internal roles. Our specific goal in this post will be to allow a guest user access only to a single demo dashboard.

Opening Spring Framework’s “pentaho-solutions/system/applicationContext-spring-security.xml”, you will see that this file contains the configuration of the security filters for various services. Specifically we will be working on the “filterInvocationInterceptorForWS” and “filterInvocationInterceptor” beans, which grant access to specific user roles to webservices which are accessed through URI’s. Each one of these beans lists a set of patterns and the roles that have access to them. The first step is to restrict all access only for company users. Then add patterns that will be regex-matched to cherry picked dashboards and allow any Authenticated user access. 

By default this is what filterInvocationInterceptor looks like:

<bean id="filterInvocationInterceptor" class="org.springframework.security.web.access.intercept.FilterSecurityInterceptor">
<sec:filter-security-metadata-source request-matcher="ciRegex" use-expressions="false">
<sec:intercept-url pattern="\A/content/common-ui/resources/web/(.+/)*.+\.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/.*require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/.*require-js-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/content/common-ui/resources/web/require.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/content/common-ui/resources/web/require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/content/data-access/resources/gwt/.*css\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webcontext.js.*\Z" access="Anonymous,Authenticated" />

A service granted to both Authenticated and Anonymous will refer to a service required for any access (including prior to logging in), we must allow those to remain open as they are essential. We therefore focus on patterns that are granted only to Authenticated users and edit the “access” to list only our role “User” and “Admin”.

<sec:intercept-url pattern="\A/mantle/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/.*\Z" access="User,Admin" />

This security filter manages only basic BA Server components, like for instance the login services. However does not filter access to the home screen. Which means a user that is Authenticated has no access to mantle, however will be shown an empty homepage. In order to avoid this we add a pattern for /home, also restricted to our role “User” and “Admin”. It is important to note that these security beans are applied from top to bottom, which means order matters very much. We add our new pattern immediately after the mantle one. A snippet should look like the one below:

<sec:intercept-url pattern="\A/mantlelogin/.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/mantle/mantleloginservice/*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/mantle/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/home.*\Z" access="User,Admin"/>

<sec:intercept-url pattern="\A/welcome/.*\Z" access="Anonymous, Authenticated" />
<sec:intercept-url pattern="\A/public/.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/login.*\Z" access="Anonymous,Authenticated" />

Now we turn to the filterInvocationInterceptorForWS and follow the same first step, restrict the access to all patterns granted to Authenticated changing them to only to “User, Admin”.

<sec:filter-security-metadata-source request-matcher="ciRegex" use-expressions="false">
<sec:intercept-url pattern="\A/webservices/unifiedrepository\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userrolelistservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userroleservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/authorizationpolicy\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/rolebindingdao\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/scheduler\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/repositorysync\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/datasourcemgmtservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/.*require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*require-js-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/.*\Z" access="User,Admin" />

So at this stage we have restricted all access to almost all components and services to our company users using the role “User”. And we are now able to cherry pick the exact services we want to open. In order to allow access by visitors to our dashboard and nothing else, we add a URI pattern and allow access to “Authenticated”. We then add it on the TOP, which means it is the first applied filter and therefore the most constricting.

<sec:intercept-url pattern="\A/api/repos/:public:ubiquis:world_population:world_population.wcdf/generatedContent.*\Z" access="Authenticated"/>

At this stage if you try to open the URI the dashboard will fail to render. This happens because the visitor is trying to render the dashboard without access to any other services, so through trial and error use the browser console and check which other services are required to render that particular dashboard. Once you identify which ones are required you open those services too.

<sec:intercept-url pattern="\A/api/repos/:public:ubiquis:world_population:world_population.wcdf/generatedContent.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/api/repos/:public:ubiquis:world_population:.*\.png*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/plugin/cda/api/doQuery\?.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/api/repos/pentaho-cdf/.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/api/repos/pentaho-cdf-dd/.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/plugin/pentaho-cdf-dd/api/resources/public/ubiquis/.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/webservices/unifiedrepository\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userrolelistservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userroleservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/authorizationpolicy\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/rolebindingdao\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/scheduler\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/repositorysync\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/datasourcemgmtservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/.*require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*require-js-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/.*\Z" access="User,Admin" />

As you can see there were images that needed to be accessed in our repository, and we also needed to allow the user to run CDA queries. There are a number of javascript and css files also required for CDF to render our dashboard correctly. Notice that although we allowed access to CDA, we did not grant access to the CDA previewer or editor.

The resulting dashboard can be found clicking this link (you can see that username and password are being passed in the URL).

This principle can be applied to different guest user roles, where a subset of visitors will have access to one part of the server, and others another, allowing for a much more controlled approach to how to grant access to your server.



Filling in the gaps

Today I saw Harris Ward asking a question I’ve seen pop up over and over again. And somehow the solution is never obvious.

The problem is how to fill in the gaps when we have nulls in our data stream. For example, our data set looks like:

And we want to replace all those nulls by the last non-null value we’ve seen in the data set: all non null values stay as is, but Mar, Apr and May get the value 20, Jul gets 50 and Sep, Oct and Nov get 40.

If we’re ingesting the data from a CSV, Excel or XML file, it’s easy to achieve this (ever wondered what the “Repeat” flag means in the fields tab of the Text File input step?).

But suppose our data doesn’t come from files. We need to implement the equivalente of the repeat flag inside a PDI transformation.

Invariably, when any of us encounters this requirement we all go through the same attempts. Let’s use the Analytical query step! Or the Group by with a “pass all rows option”. Or the NVL operation of the calculator step. And then we learn these all fail.

This is perhaps one of the best examples where a Javascript step is the best option. Because it lets us persist variables between rows, and we can control their values at will.

So, to achieve the desired result, here’s our proposed solution: add a Javascript step and use the following code:

Variable tmp is only initialized with the first row of data. From then on it’ll keep the last non null value of sales (or 0).

Sure, we’ve all heard the motto “Don’t use Javascript, it kills PDI performance”. But the occasional simple JS script doesn’t hurt that much in most cases, you’re likely to have a slower step elsewhere in your transformation (a DB read or write operation, for example). And, if performance is vital, you can always re-code this as a User defined Java class. But for all but the most extreme applications Javascript will work just fine, though.

Don’t forget to specify on the configuration of the output fields that you want to overwrite the value of the sales field, though:

And here’s the transformed data:

Exactly as intended.

Download the sample transformation here.

Happy Pi Day!

Today is Pi Day, March 14 or 3.14 (using a rather awkward M.dd format).

So this morning, the following dialog occurred within our team:

– Happy Pi Day!
– You know what we should do? A PDI Monte Carlo simulator to calculate Pi…
– Uuuuh, nice. Yeah, I think I can do that.
– … without scripting steps.
– Challenge accepted!

Well, happy Pi Day everyone, and behold our awesome (though of limited use) Monte Carlo simulator to calculate an approximation of Pi. It uses only the following steps:

  • Generate rows
  • Generate random value
  • Calculator (x2)
  • Add sequence (x2)
  • Filter rows (x2)
  • Write to log

It runs “forever” and gives updated results on the logs for each 1.27 Million rows (approximately).

Remark: forever means until you reach 1b rows and the sequence counters roll over.

Oh, you may want to run it with Minimal Logging only.

Happy Pi Day!

Download here

Status change fact table – Part 5 (Conclusion)

(previous parts of this series are here: part 1, part 2, part 3, part 4)

The fact table described in the previous parts of this article allow us to count how many objects entered or exited a given state in a certain time period.

However, if we want to count how many objects are in a given state on a given point in time, we need to count all In and Out events since the beginning of time.

This is problematic and doesn’t scale well, as more calculations are required as more data is ingested.

However, we can add a Periodic Snapshot fact table to the mix and have those running totals counted for each state in each day. As such, questions such as “How many objects were Out for delivery on day X” can be answered by looking at a specific snapshot date and gouping all rows that have that specific attribute.

Even if we include attributes such as the age of objects or events, as we did in the implementation of the status change fact table, we can still keep track of everything.

In brief, the snapshot could be implemented by doing something like the following:

  • Take all rows for the snapshot on day D;
  • Increment snapshot date to D+1; increment all ages by 1;
  • Take all events from the status change table for day D+1
  • Add the two sets of events from steps 2 and 3
  • Insert the data into a new partition of the snapshot table.

The snapshot table will have a very high cardinality, especially because we will most likely need to keep it atomic, without aggregation on any attribute. Keeping the snapshot partitioned lets us maintain the snapshot more easily, allowing us to delete specific partitions when the data is found to be incorrect.
Bear in mind that, as in with any Periodic Snapshot algorithm, it is highly sensitive to data errors. Any incorrect number in one day will be propagated into the future and the only practical solution is to delete all snapshot data from the first day the totals are wrong and reprocess a significant amount of data.

But, with all its flaws, it’s still our best shot at keeping track of all measurable values within our data.

In conclusion, this model implements and extends functionality that used to be achieved by an Accumulating Snapshot. It features the ability to track an arbitrary number of stages of a process, and doesn’t require updates of past data, making it easy to implement in a Hadoop environment. It can be used to track data pertaining to shipments and deliveries, bug tracking software, or any other business process where objects go from one state to the next and we need to track its path.

Status change fact table – Part 4 (A PDI implementation)

(previous articles of this series can be found here, here and here)

In this series of posts we’ve been discussing a data model to handle status changes of objects and a way to track them and perform analytics.

The previous article describes the data model and in this article we show an implementation using PDI.

The idea is as follows: after reading a data slice and sorting it by order_id (our primary key) and event_date, we determine which event is the first (of the batch) and which one is the last. We then clone all events but the last one, and apply some logic:

  • The clone of any event is a “Out” event, that means the event_count should be set to -1;
  • The “Out” event will have as timestamp the timestamp of the next “In” event (the moment the event is in fact closed);
  • On the “Out” events we determine the event’s age (which is by definition 0 at the “In” events)
  • We calculate the cumulative sum of all event ages to determine the age of the order itself (this will be valid in both “In” and “Out” events, one should beware of this when using this measure in reports;
  • All “last” events (by order) shall be assigned a different output partition. The output partitions will be called “Open” for all “In” events that don’t yet have a matching “Out” event; and “Closed” for all matched “In/Out” pairs.
  • The output is appended to “Closed” partition, but it overwrites the contents of the “Open” partition.
  • On the subsequent run, all new input data plus all unmatched “In” events previously processed will be fed into the ETL. Those “In” events that eventually get matched with an “Out” event will move to the “Closed” partition

The ETL can thus run incrementally against an input data set that is fed new data periodically. If at any time the ETL runs without new input data, nothing will be done: no new data is appended to the “Closed” partition, and the contents of the “Open” partition are read, nothing is done to them (as there are no other events to process) and re-written to the “Open” partition.

At the end of this article you’ll find a download link to a ZIP file that has all necessary files to run the ETL described above.

Inside the ZIP file you’ll find the following folders:

  • data: all the input data files;
  • inbox: folder that will be processed by the transformation
  • output: destination of the processed data
  • kettle: location of the actual PDI transformation

To run the transformation just move the first few files from data into inbox (remark: you need to respect the dates; copy older files first); run the transformation. The output folder will now have two files under output/open/data.csv and output/closed/data.csv. These are the two files that constitute our output. To run it again, remove the files from the inbox folder and move the next batch of files from data. The ETL will run incrementally.

Download PDI Implementation

Read the last part of this series of articles here

Status change fact table – Part 3 (The model)

(parts 1 and 2 of this series can be viewed here and here)

The classical approach to solving this problem is through an accumulating snapshot fact table. That table would have the order date, shipment date, delivery date and would be updated at every milestone of each order. We may add a return date, or re-delivery date to keep track of returns or failed deliveries, but we have to decide the structure beforehand.

A possible design of an accumulating snapshot to track the data would be


There are significant limitations to that design, namely the fact that we can’t track more than a fixed number of delivery attempts or product returns. In the example above, we chose to track the dates of the first shipment, the first delivery attempt, the first return to the warehouse (whether due to a failed delivery or return by customer, a field showing the reason for the first return to warehouse, and the successful delivery date.

We can expand this design, for example to also track the last return date as well as the last return reason, to we can track both the 1st failed delivery and an eventual return by the customer.

We can also add a few more intermediate milestones.

But we have to decide before we start collecting and processing data which milestones are worth keeping and which ones are not. And when we add more milestones to the list of interesting events we want to track we have to revisit the table structures, which may require reprocessing large amounts of data.

The approach we propose is different.

Instead of being order-centric, as in the accumulating snapshot design, the fact table should be event-centric. We consider each event as the actual fact we want to track, not the order itself. And we add another twist: we duplicate the event rows, one to track the moment the order entered a given state and another to track when the order left that state into another.

What we’re doing is considering a state change to be made of two distinct parts: if an order moves from state A to state B, then:

  1. The order LEAVES state A
  2. Then the order ENTERS state B.

Each of those two instances are treated as if they were actual events. But we add a new column, which we call Direction, which takes values of +1 and -1 depending on whether it refers to a new state or an old one.

So, the data above would look like this:

order_id order_date customer_name event_date event_name warehouse carrier event_count
1 01/01/16 Acme Industries 02/01/16 08:00 Preparing shipment London Deliveries express 1
1 01/01/16 Acme Industries 03/01/16 10:00 Preparing shipment London Deliveries express -1
1 01/01/16 Acme Industries 03/01/16 10:00 Shipped from warehouse London Deliveries express 1
1 01/01/16 Acme Industries 03/01/16 14:30 Shipped from warehouse London Deliveries express -1
1 01/01/16 Acme Industries 03/01/16 14:30 Delivery failed London Deliveries express 1

What we did here is separate the event into two different facts: one with an event_count of +1 marking the entry into a given state and another with an event_count of -1 marking its exit from that state. We shall call these the In and Out events, respectively.

Notice that the timestamp of the Out event should always match the timestamp of the next In event. All other attributes of the Out event are kept from the corresponding In event.

A few things about this model are worth noting:

  • sum(event_count) yields the same result as count(disctinct order_id), as within each order_id all rows are paired up and cancel each other out, except the most recent one;
  • This remains true if we limit our query to a given event_date interval;
  • It still remains true if we limit further our query to a specific attribute (e.g., customer_name);
  • It’s still true if we filter further on a specific event_name, with a caveat: it will return the difference between the number of objects in that state at the end of the interval and the number of objects in that state at the beginning, giving us a net count of objects in/out any given state;
  • It aggregates nicely and includes only additive measures;
  • It can be enriched, for example including an event_age measure in all Out events (defined as zero for all In events), which allow us to calculate average ages of objects in any given state; this measure also allows drilling and filtering;
  • Furthermore, it allow us to calculate the cumulative age of objects, as long as we’re careful, when calculating it, to only take cumulative ages of Out events;
  • We can view the event_count column as a measure, counting how many objects entered minus how many objects left a given state, but also as a dimension, counting In and Out events separately (with the caveat that counts of Out events are negative numbers)
  • As we only insert new rows and never update records, this can be implemented in a Hadoop environment without any changes;
  • And the algorithm can be implemented in a MapReduce type of algorithm, furthering its applicability to Hadoop data stores and providing scalability.

What this table achieves is in fact a generalisation of the Accumulating Snapshot table as described by Kimball, by means of a trade off between the number of columns and the number of rows. If we were to capture events in a process of fixed length, let’s say N milestones, Kimball’s approach requires 1 row and N additional columns, whereas our approach requires 1 additional column and a total of 2N-1 rows.

However, where Kimball’s approach reaches its limit is in the ability to change the length of the process being analysed; adding a new milestone requires changing the table’s structure and filling in missing data, whereas in our approach only new rows have to be added and for all those objects where new events being tracked don’t exist, they are simply not there. In processes where the length varies significantly (multiple delivery attempts, for example), Kimball’s approach will result in many empty cells, whereas in this approach those rows would simply be missing.

This model doesn’t come without its challenges, of course:

  • There’s no practical way to determine how many objects had an event of type X in a time period if there are repeated events. We can only count how many such events occurred, but if an object had multiple events within that time period, it’ll be overcounted;
  • During the ETL it’s necessary to read all “most recent events” for all objects, so we can determine their Out event correctly and then filter them out from the final output; one way to achieve this is to apply partitioning: all In events that don’t yet have a matching Out event are stored in a separate partition which is overwritten on each ETL run;
  • The implementation algorithm will need to sort the input data to properly merge most recent events coming from the target fact table with new incoming events from the source system; if the input data is very small when compared to the global population of objects already being tracked this adds a significant overhead to the ETL process;
  • The algorithm is highly sensitive to late arriving data; if the events arrive in the wrong order and we process the 1st, 3rd and 4th events of an object in one ETL run and only later we read the 2nd event of its lifecycle, there’s no way to insert it into the target table without compromising consistency. A reload of data will be necessary.

In the next article we’re going to see a practical implementation of this fact table using a set of PDI transformations that read new data in CSV files from an inbox folder and append its output data to another CSV file, which we can then open in the tool of our choice to do some analytics with it.

(part 4 of the series is here)