(previous parts of this series are here: part 1, part 2, part 3, part 4)
The fact table described in the previous parts of this article allows us to count how many objects entered or exited a given state in a certain time period.
However, if we want to count how many objects are in a given state at a given point in time, we need to count all In and Out events since the beginning of time.
This is problematic and doesn’t scale well, because every query has to scan more history as more data is ingested.
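To make that cost concrete, here is a minimal Python sketch of the full-history count. The row layout (event date, state, and a count of +1 for an In event and -1 for an Out event) and the sample data are assumptions made for illustration, not the exact schema from the earlier parts.

```python
from datetime import date

# Hypothetical status change rows: (event_date, state, count).
# count is +1 for an In event and -1 for an Out event.
events = [
    (date(2024, 1, 1), "Packed", +1),
    (date(2024, 1, 2), "Packed", -1),
    (date(2024, 1, 2), "Out for delivery", +1),
    (date(2024, 1, 2), "Packed", +1),
    (date(2024, 1, 3), "Out for delivery", -1),
    (date(2024, 1, 3), "Delivered", +1),
]


def objects_in_state(events, state, as_of):
    """Sum every In/Out event for `state` up to and including `as_of`."""
    return sum(count for day, s, count in events if s == state and day <= as_of)


packed_on_jan_2 = objects_in_state(events, "Packed", date(2024, 1, 2))
```

Every call walks the whole event history up to the requested date, which is the scaling problem described above.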
However, we can add a Periodic Snapshot fact table to the mix and have those running totals counted for each state on each day. As such, questions such as “How many objects were Out for delivery on day X” can be answered by looking at a specific snapshot date and grouping all rows that have that specific attribute.
Even if we include attributes such as the age of objects or events, as we did in the implementation of the status change fact table, we can still keep track of everything.
In brief, the snapshot could be implemented by doing something like the following:
- Take all rows for the snapshot on day D;
- Increment the snapshot date to D+1 and increment all ages by 1;
- Take all events from the status change table for day D+1;
- Combine the rows from the previous two steps;
- Insert the data into a new partition of the snapshot table.
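The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the row fields (`snapshot_date`, `state`, `age`, `count`), the function names, and the convention that an Out event carries the age the object had reached when it left the state are all hypothetical, and a dictionary keyed by date stands in for the partitioned table.

```python
from datetime import date, timedelta


def build_next_partition(previous_rows, status_changes, day):
    """Return the snapshot rows for day + 1, given the rows for `day`."""
    next_day = day + timedelta(days=1)
    # Steps 1 and 2: carry yesterday's rows forward, one day older.
    carried = [
        {**row, "snapshot_date": next_day, "age": row["age"] + 1}
        for row in previous_rows
    ]
    # Step 3: pick up the In (+1) and Out (-1) events recorded for day + 1.
    new_rows = [
        {
            "snapshot_date": next_day,
            "state": event["state"],
            "age": event["age"],
            "count": event["count"],
        }
        for event in status_changes
        if event["event_date"] == next_day
    ]
    # Step 4: keep the rows atomic; totals come from summing `count` later.
    return carried + new_rows


def total_in_state(rows, state):
    """Running total for one state within a single snapshot partition."""
    return sum(row["count"] for row in rows if row["state"] == state)


d1 = date(2024, 1, 1)
d2 = date(2024, 1, 2)
partition_d1 = [{"snapshot_date": d1, "state": "Packed", "age": 0, "count": 1}]
status_changes = [
    {"event_date": d2, "state": "Packed", "age": 1, "count": -1},
    {"event_date": d2, "state": "Out for delivery", "age": 0, "count": 1},
]

# Step 5: store the result as a new partition, keyed by snapshot date.
snapshot = {d1: partition_d1}
snapshot[d2] = build_next_partition(snapshot[d1], status_changes, d1)
```

In this toy run the aged Packed row and its matching Out event cancel when summed, so the partition for the second day reports no objects in Packed and one in Out for delivery, without reading any earlier history.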
The snapshot table will have a very high cardinality, especially because we will most likely need to keep it atomic, without aggregation on any attribute. Keeping the snapshot partitioned lets us maintain the snapshot more easily, allowing us to delete specific partitions when the data is found to be incorrect.
Bear in mind that this approach, as with any Periodic Snapshot algorithm, is highly sensitive to data errors. Any incorrect number in one day will be propagated into the future, and the only practical solution is to delete all snapshot data from the first day the totals are wrong onward and reprocess a significant amount of data.
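A rough Python sketch of that recovery procedure is shown below. It assumes the snapshot is held as a dictionary of partitions keyed by date and that some function can rebuild one day from the previous one; `rebuild_from` and `build_partition` are hypothetical names, and the toy partitions are plain integer totals to keep the example short.

```python
from datetime import date, timedelta


def rebuild_from(snapshot, first_bad_day, last_day, build_partition):
    """Drop partitions from `first_bad_day` onward and replay them in order."""
    # Delete every partition that may have inherited the error.
    for day in [d for d in snapshot if d >= first_bad_day]:
        del snapshot[day]
    # Replay one day at a time, starting from the last trusted partition.
    day = first_bad_day
    while day <= last_day:
        previous = snapshot[day - timedelta(days=1)]
        snapshot[day] = build_partition(previous, day)
        day += timedelta(days=1)
    return snapshot


d1, d2, d3 = date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)
# Toy data: each partition is a single running total; d2 and d3 are wrong.
snapshot = {d1: 5, d2: 99, d3: 98}
# Corrected net change (In minus Out) for each day being reprocessed.
daily_change = {d2: 2, d3: -1}

rebuild_from(snapshot, d2, d3, lambda previous, day: previous + daily_change[day])
```

The partition for the last trusted day is left untouched, while every later day is recomputed in sequence, which is why a single early error can force a long reprocessing run.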
But, with all its flaws, it’s still our best shot at keeping track of all measurable values within our data.
In conclusion, this model implements and extends functionality that used to be achieved by an Accumulating Snapshot. It features the ability to track an arbitrary number of stages of a process, and doesn’t require updates of past data, making it easy to implement in a Hadoop environment. It can be used to track data pertaining to shipments and deliveries, bug tracking software, or any other business process where objects go from one state to the next and we need to track their path.