The classical approach to solving this problem is through an accumulating snapshot fact table. That table would have the order date, shipment date, and delivery date, and would be updated at every milestone of each order. We may add a return date or a re-delivery date to keep track of returns or failed deliveries, but we have to decide on the structure beforehand.
A possible design of an accumulating snapshot to track this data would be:
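For illustration, one possible row layout for such a snapshot might look like the following sketch (the column names here are our own, chosen to match the milestones discussed below, not a prescribed schema):

```python
# Hypothetical accumulating-snapshot row: one row per order,
# updated in place each time the order reaches a milestone.
order_snapshot = {
    "order_id": 1,
    "order_date": "01/01/16",
    "customer_name": "Acme Industries",
    "first_shipment_date": "03/01/16 10:00",
    "first_delivery_attempt_date": "03/01/16 14:30",
    "first_return_to_warehouse_date": None,  # filled in if/when it happens
    "first_return_reason": None,             # e.g. "Delivery failed"
    "successful_delivery_date": None,
}
```

Every milestone column must exist up front, even when it never applies to a given order, which is precisely the limitation discussed next.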
There are significant limitations to that design, chief among them that we can’t track more than a fixed number of delivery attempts or product returns. In the example above, we chose to track the dates of the first shipment, the first delivery attempt, and the first return to the warehouse (whether due to a failed delivery or a return by the customer), a field showing the reason for that first return, and the successful delivery date.
We can expand this design, for example to also track the last return date as well as the last return reason, so we can track both the first failed delivery and an eventual return by the customer.
We can also add a few more intermediate milestones.
But we have to decide, before we start collecting and processing data, which milestones are worth keeping and which are not. And whenever we add more milestones to the list of events we want to track, we have to revisit the table structure, which may require reprocessing large amounts of data.
The approach we propose is different.
Instead of being order-centric, as in the accumulating snapshot design, the fact table should be event-centric. We consider each event as the actual fact we want to track, not the order itself. And we add another twist: we duplicate the event rows, one to track the moment the order entered a given state and another to track when the order left that state for another.
What we’re doing is considering a state change to be made of two distinct parts: if an order moves from state A to state B, then:
- The order LEAVES state A
- Then the order ENTERS state B.
Each of those two instances is treated as an actual event. But we add a new column, which we call event_count, taking the value +1 or -1 depending on whether the row marks the entry into a new state or the exit from an old one.
So, the data above would look like this:
| order_id | order_date | customer_name | event_date | event_name | city | carrier | event_count |
|----------|------------|---------------|------------|------------|------|---------|-------------|
| 1 | 01/01/16 | Acme Industries | 02/01/16 08:00 | Preparing shipment | London | Deliveries express | +1 |
| 1 | 01/01/16 | Acme Industries | 03/01/16 10:00 | Preparing shipment | London | Deliveries express | -1 |
| 1 | 01/01/16 | Acme Industries | 03/01/16 10:00 | Shipped from warehouse | London | Deliveries express | +1 |
| 1 | 01/01/16 | Acme Industries | 03/01/16 14:30 | Shipped from warehouse | London | Deliveries express | -1 |
| 1 | 01/01/16 | Acme Industries | 03/01/16 14:30 | Delivery failed | London | Deliveries express | +1 |
What we did here is separate the event into two different facts: one with an event_count of +1 marking the entry into a given state and another with an event_count of -1 marking its exit from that state. We shall call these the In and Out events, respectively.
Notice that the timestamp of the Out event should always match the timestamp of the next In event. All other attributes of the Out event are kept from the corresponding In event.
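The pairing rule above can be sketched in a few lines of Python (a hypothetical sketch; the function name and field names are our own, following the example table):

```python
def expand_in_out(changes):
    """Given one order's state changes in chronological order, emit the
    Out (-1) and In (+1) row pairs for every state transition."""
    rows = []
    for prev, curr in zip(changes, changes[1:]):
        # The Out row keeps all attributes of the state it closes,
        # except its timestamp, which matches the next In event.
        rows.append(dict(prev, event_date=curr["event_date"], event_count=-1))
        rows.append(dict(curr, event_count=+1))
    return rows

changes = [
    {"order_id": 1, "event_date": "02/01/16 08:00", "event_name": "Preparing shipment"},
    {"order_id": 1, "event_date": "03/01/16 10:00", "event_name": "Shipped from warehouse"},
    {"order_id": 1, "event_date": "03/01/16 14:30", "event_name": "Delivery failed"},
]

# The very first state change only produces an In row.
rows = [dict(changes[0], event_count=+1)] + expand_in_out(changes)
```

Three state changes produce the five rows of the example table: three In rows plus two Out rows, the final state still being open.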
A few things about this model are worth noting:
- sum(event_count) yields the same result as count(distinct order_id), as within each order_id all rows are paired up and cancel each other out, except the most recent one;
- This remains true if we limit our query to a given event_date interval;
- It still remains true if we limit further our query to a specific attribute (e.g., customer_name);
- It’s still true if we filter further on a specific event_name, with a caveat: the sum then gives the difference between the number of objects in that state at the end of the interval and the number in that state at its beginning, i.e. a net count of objects moving in/out of that state;
- It aggregates nicely and includes only additive measures;
- It can be enriched, for example by including an event_age measure in all Out events (defined as zero for all In events), which allows us to calculate average ages of objects in any given state; this measure also supports drilling and filtering;
- Furthermore, it allows us to calculate the cumulative age of objects, as long as we’re careful, when calculating it, to sum only the ages of Out events;
- We can view the event_count column as a measure, counting how many objects entered minus how many objects left a given state, but also as a dimension, counting In and Out events separately (with the caveat that counts of Out events are negative numbers)
- As we only insert new rows and never update records, this can be implemented in a Hadoop environment without any changes;
- And the processing can be expressed as a MapReduce-style algorithm, furthering its applicability to Hadoop data stores and providing scalability.
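A few of these counting properties can be checked with a minimal sketch, reusing the five rows of the example above (reconstructed here so the snippet is self-contained):

```python
rows = [
    # (order_id, event_date, event_name, event_count)
    (1, "02/01/16 08:00", "Preparing shipment", +1),
    (1, "03/01/16 10:00", "Preparing shipment", -1),
    (1, "03/01/16 10:00", "Shipped from warehouse", +1),
    (1, "03/01/16 14:30", "Shipped from warehouse", -1),
    (1, "03/01/16 14:30", "Delivery failed", +1),
]

# sum(event_count) == count(distinct order_id): In/Out pairs cancel out,
# leaving exactly one +1 per order (its current, still-open state).
total = sum(c for (_, _, _, c) in rows)

# Filtering on a specific event_name gives the net flow in/out of that
# state: "Preparing shipment" was entered once and left once (net 0),
# while "Delivery failed" was entered and not yet left (net 1).
net_preparing = sum(c for (_, _, name, c) in rows if name == "Preparing shipment")
net_failed = sum(c for (_, _, name, c) in rows if name == "Delivery failed")
```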
What this table achieves is in fact a generalisation of the Accumulating Snapshot table as described by Kimball, by means of a trade-off between the number of columns and the number of rows. If we were to capture events in a process of fixed length, say N milestones, Kimball’s approach requires 1 row and N additional columns, whereas our approach requires 1 additional column and a total of 2N-1 rows (N In events plus N-1 Out events, since the final state has not yet been left).
However, Kimball’s approach reaches its limit in the ability to change the length of the process being analysed: adding a new milestone requires changing the table’s structure and backfilling missing data, whereas in our approach only new rows have to be added, and for objects where the newly tracked events never occurred, the corresponding rows are simply absent. In processes whose length varies significantly (multiple delivery attempts, for example), Kimball’s approach results in many empty cells, whereas in this approach those rows are simply missing.
This model doesn’t come without its challenges, of course:
- There’s no practical way to determine how many objects had an event of type X in a time period if there are repeated events. We can only count how many such events occurred, but if an object had multiple events within that time period, it’ll be overcounted;
- During the ETL it’s necessary to read all “most recent events” for all objects, so we can determine their Out event correctly and then filter them out from the final output; one way to achieve this is to apply partitioning: all In events that don’t yet have a matching Out event are stored in a separate partition which is overwritten on each ETL run;
- The implementation algorithm needs to sort the input data to properly merge the most recent events coming from the target fact table with new incoming events from the source system; if the incoming data is very small compared to the global population of objects already being tracked, this merge adds significant overhead to the ETL process;
- The algorithm is highly sensitive to late-arriving data: if events arrive in the wrong order and we process the 1st, 3rd and 4th events of an object in one ETL run, and only later read the 2nd event of its lifecycle, there’s no way to insert it into the target table without compromising consistency. A reload of the data will be necessary.
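The pairing step of such an ETL run can be sketched as follows (a hypothetical sketch, not the PDI implementation of the next article: function and field names are our own, all events for the affected orders are assumed to fit in memory, and event_date values are assumed to sort chronologically, which real timestamps would):

```python
from collections import defaultdict

def etl_run(open_in_events, new_events):
    """One ETL run for the In/Out fact table.

    open_in_events: the still-open In event per order, read from the
    overwritable partition produced by the previous run.
    new_events: newly arrived state changes from the source system.
    Returns (appended, new_open): closed In/Out row pairs to append to
    the immutable part of the table, and the new open partition.
    Late-arriving events older than rows already appended in a previous
    run cannot be fixed here and would require a reload (see above).
    """
    by_order = defaultdict(list)
    for ev in open_in_events + new_events:
        by_order[ev["order_id"]].append(ev)

    appended, new_open = [], []
    for order_id, events in by_order.items():
        # Sorting merges the previously open event with the new arrivals.
        events.sort(key=lambda ev: ev["event_date"])
        for state, nxt in zip(events, events[1:]):
            appended.append(dict(state, event_count=+1))  # In row
            appended.append(dict(state, event_date=nxt["event_date"],
                                 event_count=-1))         # Out row
        new_open.append(dict(events[-1], event_count=+1))  # stays open
    return appended, new_open

open_partition = [
    {"order_id": 1, "event_date": "03/01/16 14:30", "event_name": "Delivery failed"},
]
incoming = [
    {"order_id": 1, "event_date": "04/01/16 09:00", "event_name": "Delivered"},
]
appended, new_open = etl_run(open_partition, incoming)
```

Orders with no new events simply pass through: their open In event is written back to the overwritable partition unchanged.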
In the next article we’re going to see a practical implementation of this fact table using a set of PDI transformations that read new data from CSV files in an inbox folder and append their output to another CSV file, which we can then open in the tool of our choice to do some analytics with it.
(part 4 of the series is here)