Status change fact table – Part 5 (Conclusion)

(previous parts of this series are here: part 1part 2part 3part 4)

The fact table described in the previous parts of this article allow us to count how many objects entered or exited a given state in a certain time period.

However, if we want to count how many objects are in a given state on a given point in time, we need to count all In and Out events since the beginning of time.

This is problematic and doesn’t scale well, as more calculations are required as more data is ingested.

However, we can add a Periodic Snapshot fact table to the mix and have those running totals counted for each state in each day. As such, questions such as “How many objects were Out for delivery on day X” can be answered by looking at a specific snapshot date and gouping all rows that have that specific attribute.

Even if we include attributes such as the age of objects or events, as we did in the implementation of the status change fact table, we can still keep track of everything.

In brief, the snapshot could be implemented by doing something like the following:

  • Take all rows for the snapshot on day D;
  • Increment snapshot date to D+1; increment all ages by 1;
  • Take all events from the status change table for day D+1
  • Add the two sets of events from steps 2 and 3
  • Insert the data into a new partition of the snapshot table.

The snapshot table will have a very high cardinality, especially because we will most likely need to keep it atomic, without aggregation on any attribute. Keeping the snapshot partitioned lets us maintain the snapshot more easily, allowing us to delete specific partitions when the data is found to be incorrect.
Bear in mind that, as in with any Periodic Snapshot algorithm, it is highly sensitive to data errors. Any incorrect number in one day will be propagated into the future and the only practical solution is to delete all snapshot data from the first day the totals are wrong and reprocess a significant amount of data.

But, with all its flaws, it’s still our best shot at keeping track of all measurable values within our data.

In conclusion, this model implements and extends functionality that used to be achieved by an Accumulating Snapshot. It features the ability to track an arbitrary number of stages of a process, and doesn’t require updates of past data, making it easy to implement in a Hadoop environment. It can be used to track data pertaining to shipments and deliveries, bug tracking software, or any other business process where objects go from one state to the next and we need to track its path.

Back to blog

You May Also Like

  • Graph Databases: Loading Data with Neo4j

    Graph databases are becoming more popular as a way of storing and analysing large connected datasets. Neo4j is a...

    Read more
  • Date Dimension Revisited: Bank Holidays

    Everyone familiar with data warehousing knows about the date dimension. It’s one of the first steps in the creation...

    Read more
  • Setup Pentaho BA Server to use SSL Certificates

    SSL Certificate Java Truststore Setup SSL, or Secure Socket Layer, is a technology which allows web browsers and web...

    Read more