In this series of posts we've been discussing a data model for handling status changes of objects: a way to track them and perform analytics on them.
The previous article describes the data model; in this article we show an implementation using PDI.
The idea is as follows: after reading a data slice and sorting it by order_id (our primary key) and event_date, we determine which event is the first (of the batch) and which one is the last. We then clone all events but the last one, and apply the following logic:
- The clone of any event is an "Out" event, which means its event_count is set to -1;
- The "Out" event takes as its timestamp the timestamp of the next "In" event (the moment the event is in fact closed);
- On the "Out" events we determine the event's age (which is by definition 0 at the "In" events);
- We calculate the cumulative sum of all event ages to determine the age of the order itself (this is valid in both "In" and "Out" events; one should beware of this when using this measure in reports);
- All "last" events (by order) are assigned a different output partition. The output partitions are called "Open" for all "In" events that don't yet have a matching "Out" event, and "Closed" for all matched "In"/"Out" pairs;
- The output is appended to the "Closed" partition, but it overwrites the contents of the "Open" partition;
- On the subsequent run, all new input data plus all unmatched "In" events previously processed are fed into the ETL. Those "In" events that eventually get matched with an "Out" event move to the "Closed" partition.
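The clone-and-match logic above can be sketched in plain Python (this is only an illustration of the algorithm, not the PDI transformation itself; the field names order_id, event_date and event_count come from the article, while age and order_age are assumed names for the derived measures):

```python
from itertools import groupby
from operator import itemgetter

def process(events):
    """Split a batch of "In" events into Closed (matched "In"/"Out" pairs)
    and Open (the last, still unmatched "In" event of each order)."""
    closed, open_ = [], []
    events = sorted(events, key=itemgetter("order_id", "event_date"))
    for _, grp in groupby(events, key=itemgetter("order_id")):
        grp = list(grp)
        order_age = 0
        for ev, nxt in zip(grp, grp[1:]):  # every event but the last
            age = nxt["event_date"] - ev["event_date"]
            # the original "In" event: age 0, cumulative order age so far
            closed.append({**ev, "event_count": 1, "age": 0,
                           "order_age": order_age})
            order_age += age
            # its clone, the "Out" event: timestamped at the next "In"
            closed.append({**ev, "event_date": nxt["event_date"],
                           "event_count": -1, "age": age,
                           "order_age": order_age})
        # last event of the order: still unmatched, goes to "Open"
        open_.append({**grp[-1], "event_count": 1, "age": 0,
                      "order_age": order_age})
    return closed, open_
```

Note that summing event_count over a set of rows yields the number of currently open events, which is what makes the -1 on the "Out" clones useful in reports.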
The ETL can thus run incrementally against an input data set that is fed new data periodically. If at any time the ETL runs without new input data, nothing changes: no new data is appended to the "Closed" partition, and the contents of the "Open" partition are read, left as they are (as there are no other events to process) and re-written to the "Open" partition.
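The partition write semantics ("Closed" is append-only, "Open" is rewritten on every run) can be sketched as follows. This is a simplified stand-in for the PDI output steps; the output/closed/data.csv and output/open/data.csv paths match the article's output layout, but the function itself and its signature are assumptions:

```python
import csv
import os

def write_partitions(closed_rows, open_rows, fields, out_dir="output"):
    """Append matched pairs to Closed; overwrite Open with the unmatched rest."""
    os.makedirs(os.path.join(out_dir, "closed"), exist_ok=True)
    os.makedirs(os.path.join(out_dir, "open"), exist_ok=True)

    closed_path = os.path.join(out_dir, "closed", "data.csv")
    write_header = not os.path.exists(closed_path)
    with open(closed_path, "a", newline="") as f:  # append-only partition
        w = csv.DictWriter(f, fieldnames=fields)
        if write_header:
            w.writeheader()
        w.writerows(closed_rows)

    open_path = os.path.join(out_dir, "open", "data.csv")
    with open(open_path, "w", newline="") as f:  # rewritten on every run
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        w.writerows(open_rows)
```

On a run with no new input, closed_rows is empty (nothing is appended) and open_rows contains exactly what was read from the "Open" partition, so the file is rewritten unchanged.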
At the end of this article you'll find a download link to a ZIP file that has all necessary files to run the ETL described above.
Inside the ZIP file you'll find the following folders:
- data: all the input data files;
- inbox: folder whose contents will be processed by the transformation;
- output: destination of the processed data;
- kettle: location of the actual PDI transformation.
To run the transformation, move the first few files from data into inbox (remark: you need to respect the dates; copy older files first), then run the transformation. The output folder will now have two files, output/open/data.csv and output/closed/data.csv; these are the two files that constitute our output. To run it again, remove the files from the inbox folder and move the next batch of files from data. The ETL will run incrementally.
Read the last part of this series of articles.