In this series of articles we’re going to address a business problem that is traditionally solved, although only partially, by an accumulating snapshot fact table. We will propose a new type of fact table that not only generalizes the accumulating snapshot concept, as described by Ralph Kimball, but also admits an implementation that requires no updates, which makes it Hadoop friendly and implementable by a MapReduce algorithm.
So, here’s the description of the business problem we got from our (fictional) client:
We need to analyse data pertaining to orders and order fulfilment. We have a database where order events are stored, indicating when an order is placed by a client, when it’s prepared for shipment at our warehouses, when it’s dispatched and when it’s delivered. In some cases shipments are sent back by the client, if we shipped the wrong products or if some item is defective; in that case the order may be shipped again, cancelled altogether, modified, etc. Also, in some cases the delivery attempt fails and it may need to be retried one or more times by our courier.
We need to create reports to answer, at least, the following questions:
- How many orders were placed/shipped/delivered in a given time period?
- What’s the number of orders returned to the warehouse for any reason other than “could not deliver” (defective or broken products; wrong products shipped; delivered past its due date; etc.)?
- How many orders are in our backlog waiting to be dispatched? How does that compare to the backlog last week/month/year?
- How many orders are delivered but haven’t been paid yet? How many payments are overdue, given our standard 30-day payment terms? How does that compare with the same indicator last week/month/year?
- Which warehouses are more likely to have orders returned due to defective items?
- Which carriers are more likely to break items during delivery?
- Of course, for all these questions, we may need to drill down by customer, carrier, warehouse, etc.
All the data for our analytics solution must be hosted in a Hadoop cluster and data processing should be done directly in the cluster as much as possible (e.g. in MapReduce jobs).
From the business questions we can see how an accumulating snapshot fact table could address some of them: we insert a new fact when an order is placed and update it at each milestone in the order’s life cycle: prepared, shipped, delivered, paid. We could also add a few more milestones, namely for shipment returns, delivery retries, etc. But this approach has several limitations.
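To make the mechanics concrete, here is a minimal sketch of an accumulating snapshot row, assuming a hypothetical schema with one date column per milestone (a real design would also carry dimension keys for customer, warehouse, carrier, and lag measures between milestones):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Sketch of one accumulating snapshot row for a single order.
# Column names are hypothetical, for illustration only.
@dataclass
class OrderSnapshot:
    order_id: str
    placed_date: Optional[date] = None
    prepared_date: Optional[date] = None
    shipped_date: Optional[date] = None
    delivered_date: Optional[date] = None
    paid_date: Optional[date] = None

    def record_milestone(self, milestone: str, when: date) -> None:
        # Each milestone event UPDATES the existing row in place;
        # that update-in-place behavior is the defining trait of the
        # accumulating snapshot, and the crux of the problem on HDFS.
        setattr(self, f"{milestone}_date", when)

row = OrderSnapshot("ORD-1")
row.record_milestone("placed", date(2024, 1, 2))
row.record_milestone("shipped", date(2024, 1, 5))
```

Note how the schema hard-codes the set of trackable milestones: a second delivery attempt or a return has no column to land in unless we widen the table.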
One limitation of the accumulating snapshot is that we must define up front which milestones are worth recording. If a delivery is retried more than once we can only keep track of a fixed number of retries; likewise, an order may be returned multiple times, but we must decide beforehand how many returns to track. There’s no way to record a variable, arbitrarily large number of milestones. As time goes by, more variety is expected in the data: although most orders go through 5 or 6 stages in their life cycle, some have well over 20 stages, including a varying number of delivery attempts or product returns. And in a few exotic cases, one particular order may have over 100 milestones.
Another limitation is that the accumulating snapshot isn’t future-proof. Even if we accept tracking only a fixed number of delivery attempts and product returns, should the client later ask us to also track late payments and the actions taken to collect an overdue invoice (emails, letters, phone calls), we would need to add new columns to what is already a very wide fact table.
Finally, this is not Hadoop friendly. HDFS is a “write once, read many times” filesystem and doesn’t offer an easy way to update records. We could try versioning records instead, but that places an extra burden on the query engine: to get the most current state, it must group facts by order ID and retrieve only the latest version of each fact. Those queries will necessarily be slower than a typical OLAP query such as:

```sql
SELECT dim_table.attribute, SUM(fact_table.measure)
FROM fact_table
JOIN dim_table ON fact_table.key = dim_table.key
GROUP BY dim_table.attribute
```
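The extra work that record versioning forces on the query engine can be sketched as a MapReduce-style job: map each fact to its order ID, then reduce each group to the record with the highest version. Field names below are hypothetical:

```python
from collections import defaultdict

# Versioned facts as they might sit in HDFS: every state change is a
# new record, never an update. Field names are hypothetical.
facts = [
    {"order_id": "ORD-1", "version": 1, "status": "placed"},
    {"order_id": "ORD-1", "version": 2, "status": "shipped"},
    {"order_id": "ORD-2", "version": 1, "status": "placed"},
]

# Map phase: key each record by its order ID.
grouped = defaultdict(list)
for fact in facts:
    grouped[fact["order_id"]].append(fact)

# Reduce phase: keep only the highest-version record per order.
current_state = {
    order_id: max(versions, key=lambda f: f["version"])
    for order_id, versions in grouped.items()
}
```

This group-and-pick-latest pass has to run before any aggregation can start, which is exactly why the versioned approach is slower than the plain star-join query above.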
In the next article of this series we’ll see a sample of the data we need to process.