Status change fact table – Part 4 (A PDI implementation)

(previous articles of this series can be found here, here and here)

In this series of posts we’ve been discussing a data model to handle status changes of objects, a way to track them, and how to perform analytics on them.

The previous article described the data model; in this article we show an implementation using PDI (Pentaho Data Integration).

The idea is as follows: after reading a data slice and sorting it by order_id (our primary key) and event_date, we determine which event is the first (of the batch) and which one is the last. We then clone all events but the last one, and apply some logic:

  • The clone of any event is an “Out” event, which means its event_count is set to -1;
  • The “Out” event takes as its timestamp the timestamp of the next “In” event (the moment the event is in fact closed);
  • On the “Out” events we determine the event’s age (which is, by definition, 0 for “In” events);
  • We calculate the cumulative sum of all event ages to determine the age of the order itself (this is valid for both “In” and “Out” events; beware of this when using the measure in reports);
  • All “last” events (per order) are assigned a different output partition. The output partitions are called “Open”, for all “In” events that don’t yet have a matching “Out” event, and “Closed”, for all matched “In/Out” pairs;
  • The output is appended to the “Closed” partition, but it overwrites the contents of the “Open” partition;
  • On each subsequent run, all new input data plus all unmatched “In” events previously processed are fed into the ETL. Those “In” events that eventually get matched with an “Out” event move to the “Closed” partition.

The ETL can thus run incrementally against an input data set that is fed new data periodically. If at any time the ETL runs without new input data, nothing changes: no new data is appended to the “Closed” partition, and the contents of the “Open” partition are read, left untouched (as there are no other events to process) and re-written to the “Open” partition.
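
For readers who prefer to see the pairing logic spelled out, here is a conceptual SQL sketch of what the transformation does. It is an illustration only, not part of the PDI package; it assumes the order_events layout from Part 2 of this series, and window-function and date-arithmetic syntax varies by SQL dialect.

-- "In" events: one row per event read, event_count = +1, age 0 by definition
SELECT order_id, order_date, customer_name, event_date, event_name,
       warehouse, carrier,
       1 AS event_count,
       0 AS event_age
FROM order_events

UNION ALL

-- "Out" events: a clone of every event except the last one per order,
-- stamped with the timestamp of the next "In" event and event_count = -1
SELECT order_id, order_date, customer_name,
       next_event_date AS event_date, event_name,
       warehouse, carrier,
       -1 AS event_count,
       next_event_date - event_date AS event_age
FROM (
    SELECT e.*,
           LEAD(event_date) OVER (PARTITION BY order_id
                                  ORDER BY event_date) AS next_event_date
    FROM order_events e
) t
WHERE next_event_date IS NOT NULL;
-- The last event of each order has no "Out" clone yet: in the PDI flow those rows
-- go to the "Open" partition, everything else is appended to "Closed".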

At the end of this article you’ll find a download link to a ZIP file that has all necessary files to run the ETL described above.

Inside the ZIP file you’ll find the following folders:

  • data: all the input data files;
  • inbox: folder that will be processed by the transformation
  • output: destination of the processed data
  • kettle: location of the actual PDI transformation

To run the transformation, move the first few files from data into inbox (remark: you need to respect the dates; copy older files first) and run the transformation. The output folder will now have two files, output/open/data.csv and output/closed/data.csv; these are the two files that constitute our output. To run it again, remove the files from the inbox folder and move in the next batch of files from data. The ETL will run incrementally.

Download PDI Implementation

Read the last part of this series of articles here

Status change fact table – Part 3 (The model)

(parts 1 and 2 of this series can be viewed here and here)

The classical approach to solving this problem is through an accumulating snapshot fact table. That table would have the order date, shipment date, delivery date and would be updated at every milestone of each order. We may add a return date, or re-delivery date to keep track of returns or failed deliveries, but we have to decide the structure beforehand.

A possible design of an accumulating snapshot to track this data would be something like the following:


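As a sketch in SQL DDL (the table and column names are ours, purely for illustration, based on the milestones discussed below):

CREATE TABLE fact_order_fulfilment (
    order_id                        INT,
    order_date                      DATE,
    customer_key                    INT,
    warehouse_key                   INT,
    carrier_key                     INT,
    first_shipment_date             DATE,
    first_delivery_attempt_date     DATE,
    first_return_to_warehouse_date  DATE,
    first_return_reason             VARCHAR(100),
    successful_delivery_date        DATE
);
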
There are significant limitations to that design, namely the fact that we can’t track more than a fixed number of delivery attempts or product returns. In the example above, we chose to track the dates of the first shipment, the first delivery attempt and the first return to the warehouse (whether due to a failed delivery or a return by the customer), a field showing the reason for the first return to the warehouse, and the successful delivery date.

We can expand this design, for example to also track the last return date as well as the last return reason, so that we can track both the 1st failed delivery and an eventual return by the customer.

We can also add a few more intermediate milestones.

But we have to decide before we start collecting and processing data which milestones are worth keeping and which ones are not. And when we add more milestones to the list of interesting events we want to track we have to revisit the table structures, which may require reprocessing large amounts of data.

The approach we propose is different.

Instead of being order-centric, as in the accumulating snapshot design, the fact table should be event-centric. We consider each event as the actual fact we want to track, not the order itself. And we add another twist: we duplicate each event row, one copy to track the moment the order entered a given state and another to track when the order left that state for the next one.

What we’re doing is considering a state change to be made of two distinct parts: if an order moves from state A to state B, then:

  1. The order LEAVES state A
  2. Then the order ENTERS state B.

Each of those two instances is treated as if it were an actual event. But we add a new column, which we call event_count (think of it as a direction flag), which takes values of +1 and -1 depending on whether it refers to a new state or an old one.

So, the data above would look like this:

order_id order_date customer_name event_date event_name warehouse carrier event_count
1 01/01/16 Acme Industries 02/01/16 08:00 Preparing shipment London Deliveries express 1
1 01/01/16 Acme Industries 03/01/16 10:00 Preparing shipment London Deliveries express -1
1 01/01/16 Acme Industries 03/01/16 10:00 Shipped from warehouse London Deliveries express 1
1 01/01/16 Acme Industries 03/01/16 14:30 Shipped from warehouse London Deliveries express -1
1 01/01/16 Acme Industries 03/01/16 14:30 Delivery failed London Deliveries express 1

What we did here is separate the event into two different facts: one with an event_count of +1 marking the entry into a given state and another with an event_count of -1 marking its exit from that state. We shall call these the In and Out events, respectively.

Notice that the timestamp of the Out event should always match the timestamp of the next In event. All other attributes of the Out event are kept from the corresponding In event.

A few things about this model are worth noting:

  • sum(event_count) yields the same result as count(distinct order_id), as within each order_id all rows are paired up and cancel each other out, except the most recent one (see the sample queries after this list);
  • This remains true if we limit our query to a given event_date interval;
  • It still remains true if we limit our query further to a specific attribute (e.g., customer_name);
  • It’s still true if we filter further on a specific event_name, with a caveat: it will return the difference between the number of objects in that state at the end of the interval and the number of objects in that state at the beginning, giving us the net count of objects moving into/out of any given state;
  • It aggregates nicely and includes only additive measures;
  • It can be enriched, for example by including an event_age measure in all Out events (defined as zero for all In events), which allows us to calculate average ages of objects in any given state; this measure also allows drilling and filtering;
  • Furthermore, it allows us to calculate the cumulative age of objects, as long as we’re careful, when calculating it, to only take the cumulative ages of Out events;
  • We can view the event_count column as a measure, counting how many objects entered minus how many objects left a given state, but also as a dimension, counting In and Out events separately (with the caveat that counts of Out events are negative numbers);
  • As we only insert new rows and never update records, this can be implemented in a Hadoop environment without any changes;
  • And it can be implemented as a MapReduce-style algorithm, furthering its applicability to Hadoop data stores and providing scalability.
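
To make the first few points concrete, here are a couple of illustrative queries. These are sketches only: they assume the fact table is a single flat table named status_change_fact with the columns shown above, and the exact date-literal syntax depends on the SQL dialect.

-- Count of orders: matched In/Out pairs cancel out, leaving +1 per order
-- for its as-yet-unmatched "In" event
SELECT sum(event_count)
FROM status_change_fact;

-- Filtered on one state and a date interval: the net change in the number
-- of orders sitting in that state over the interval
SELECT sum(event_count)
FROM status_change_fact
WHERE event_name = 'Preparing shipment'
  AND event_date BETWEEN DATE '2016-01-01' AND DATE '2016-01-31';

-- Average time spent in each state: event_age is 0 on "In" rows,
-- so only "Out" rows contribute
SELECT event_name,
       sum(event_age) / nullif(sum(CASE WHEN event_count = -1 THEN 1 ELSE 0 END), 0) AS avg_age
FROM status_change_fact
GROUP BY event_name;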

What this table achieves is in fact a generalisation of the Accumulating Snapshot table as described by Kimball, by means of a trade off between the number of columns and the number of rows. If we were to capture events in a process of fixed length, let’s say N milestones, Kimball’s approach requires 1 row and N additional columns, whereas our approach requires 1 additional column and a total of 2N-1 rows.

However, where Kimball’s approach reaches its limit is in the ability to change the length of the process being analysed: adding a new milestone requires changing the table’s structure and filling in missing data, whereas in our approach only new rows have to be added; for objects that never go through the newly tracked events, the rows simply aren’t there. In processes whose length varies significantly (multiple delivery attempts, for example), Kimball’s approach will result in many empty cells, whereas in our approach those rows would simply be missing.

This model doesn’t come without its challenges, of course:

  • There’s no practical way to determine how many objects had an event of type X in a time period if there are repeated events. We can only count how many such events occurred, but if an object had multiple events within that time period, it’ll be overcounted;
  • During the ETL it’s necessary to read all “most recent events” for all objects, so we can determine their Out event correctly and then filter them out from the final output; one way to achieve this is to apply partitioning: all In events that don’t yet have a matching Out event are stored in a separate partition which is overwritten on each ETL run;
  • The implementation algorithm will need to sort the input data to properly merge most recent events coming from the target fact table with new incoming events from the source system; if the input data is very small when compared to the global population of objects already being tracked this adds a significant overhead to the ETL process;
  • The algorithm is highly sensitive to late arriving data; if the events arrive in the wrong order and we process the 1st, 3rd and 4th events of an object in one ETL run and only later we read the 2nd event of its lifecycle, there’s no way to insert it into the target table without compromising consistency. A reload of data will be necessary.

In the next article we’re going to see a practical implementation of this fact table using a set of PDI transformations that read new data in CSV files from an inbox folder and append its output data to another CSV file, which we can then open in the tool of our choice to do some analytics with it.

(part 4 of the series is here)

Status change fact table – Part 2 (The input data)

(See part 1 of this series here)

To make matters simpler we’ll assume that a table with a stream of order events already exists and that it’s denormalised, with each row containing all the relevant data. Either the source system already had that information or it was built in an earlier phase of the ETL.

Table order_events:
Order_id int
Order_date datetime
Customer_name string
Event_date datetime
Event_name string
Warehouse string
Carrier string
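
For reference, a minimal Hive-style DDL sketch of that table (ours, purely for illustration; in a Hadoop setting it could just as well be an external table over CSV files):

CREATE TABLE order_events (
    order_id       INT,
    order_date     TIMESTAMP,
    customer_name  STRING,
    event_date     TIMESTAMP,
    event_name     STRING,
    warehouse      STRING,
    carrier        STRING
);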

Here’s a sample of the data we’ll be dealing with:

order_id order_date customer_name event_date event_name warehouse carrier
1 01/01/16 Acme Industries 02/01/16 08:00 Preparing shipment London Deliveries express
1 01/01/16 Acme Industries 03/01/16 10:00 Shipped from warehouse London Deliveries express
1 01/01/16 Acme Industries 03/01/16 14:30 Delivery failed London Deliveries express
1 01/01/16 Acme Industries 03/01/16 18:00 Returned to warehouse London Deliveries express
1 01/01/16 Acme Industries 04/01/16 09:00 Shipped from warehouse London Deliveries express
1 01/01/16 Acme Industries 04/01/16 11:30 Delivery failed London Deliveries express
1 01/01/16 Acme Industries 04/01/16 18:00 Returned to warehouse London Deliveries express
1 01/01/16 Acme Industries 05/01/16 10:30 Shipped from warehouse London Deliveries express
1 01/01/16 Acme Industries 05/01/16 15:00 Delivery successful London Deliveries express
2 01/01/16 Boggs Corp 02/01/16 09:00 Preparing shipment London Deliveries express
2 01/01/16 Boggs Corp 03/01/16 10:00 Shipped from warehouse London Deliveries express
2 01/01/16 Boggs Corp 03/01/16 11:00 Delivery successful London Deliveries express
3 02/01/16 Boggs Corp 03/01/16 08:00 Preparing shipment Liverpool WeDeliverStuff
3 02/01/16 Boggs Corp 04/01/16 09:30 Shipped from warehouse Liverpool WeDeliverStuff
3 02/01/16 Boggs Corp 04/01/16 15:00 Delivery successful Liverpool WeDeliverStuff
3 02/01/16 Boggs Corp 08/01/16 16:30 Returned by customer Liverpool WeDeliverStuff
3 02/01/16 Boggs Corp 09/01/16 08:00 Preparing shipment Liverpool WeDeliverStuff
3 02/01/16 Boggs Corp 10/01/16 10:00 Shipped from warehouse Liverpool WeDeliverStuff
3 02/01/16 Boggs Corp 10/01/16 14:00 Delivery successful Liverpool WeDeliverStuff

Our data sample has 3 orders, from 2 different customers, shipped from two different warehouses. One of the orders was only delivered after a couple of failed attempts, and in another the products were returned and had to be replaced.

Our goal in the next article is to build a fact table that can track all events that happened to all orders and provide meaningful analytics to the client.

In the next part of this series we’re going to explore a data model that allows analysing this dataset.

Status change fact table – Part 1 (The problem)

In this series of articles we’re going to address a business problem that is traditionally solved, although only partially, by an accumulating snapshot fact table. We will propose a new type of fact table that not only generalizes the concept of the accumulating snapshot, as described by Ralph Kimball, but also admits an implementation that doesn’t require updates, which makes it Hadoop friendly and suitable for a MapReduce algorithm.

So, here’s the description of the business problem we got from our (fictional) client:

We need to analyse data pertaining to orders and order fulfilment. We have a database where order events are stored, indicating when an order is placed by a client, when it’s prepared for shipment at our warehouses, when it’s dispatched and when it’s delivered. In some cases shipments are sent back by the client, if we shipped the wrong products or if some item is defective; in that case the order may be shipped again, cancelled altogether, modified, etc. Also, in some cases the delivery attempt fails and it may need to be retried one or more times by our courier.

We need to create reports to answer, at least, the following questions:

  1. How many orders were placed/shipped/delivered on a given time period?
  2. What’s the number of orders returned to the warehouse for any reason other than “could not deliver” (defective or broken products; wrong products shipped; delivered past their due date; etc.)?
  3. How many orders are in our backlog waiting to be dispatched? How does that compare to the backlog last week/month/year?
  4. How many orders are delivered but haven’t been paid yet? How many payments are overdue, given our standard 30 day payment terms? How does that compare with the same indicator last week/month/year?
  5. Which warehouses are more likely to have orders returned due to defective items?
  6. Which carriers are more likely to break items during delivery?
  7. Of course, for all these questions, we may need to drill down by customer, carrier, warehouse, etc.

All the data for our analytics solution must be hosted in a Hadoop cluster and data processing should be done directly in the cluster as much as possible (e.g. in MapReduce jobs).

From the business questions we can see how an accumulating snapshot fact table would be able to address some of the questions, by inserting new facts when an order is placed and updating them for each milestone in the order’s life cycle: prepared, shipped, delivered, paid. We can also add a couple more milestones, namely for shipment returns, delivery retries, etc. But this has several limitations.

One limitation of the accumulating snapshot is that we have to define which milestones are worth recording beforehand. If a delivery is retried more than once we can only keep track of a fixed number of retries; likewise, an order may be returned multiple times, but we must determine beforehand how many order returns can be tracked. And there’s no way to track a variable and arbitrarily large number of milestones. As time goes by, more variety is expected in the data, and it may happen that although most orders go through 5 or 6 stages in their lifecycle, some have well over 20 stages, including a varying number of delivery attempts or product returns. And in a few exotic examples, there may be over 100 milestones for one particular order.

Another limitation is that the accumulating snapshot isn’t future proof. Even if we accept the limitation of being able to track only a fixed number of delivery attempts and product returns, if the client asks us to also keep track of late payments and the number of actions taken to collect an overdue invoice (emails, letters, phone calls) we need to add new columns to what is already a very wide fact table.

Finally, this is not Hadoop friendly. HDFS is a “write once, read many times” filesystem and doesn’t allow records to be updated easily. We could try versioning records, but that places an extra burden on the query engine to perform the analytics, requiring grouping facts by order ID and retrieving only the latest version of each fact to get the most current state. And those queries will necessarily be slower than your typical OLAP query:

Select dim_table.attribute, sum(fact_table.measure)
from fact_table, dim_table
where fact_table.key = dim_table.key
group by dim_table.attribute

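By contrast, with versioned records the engine first has to isolate the latest version of each fact before aggregating; something along these lines (a sketch only, assuming a version column and window-function support):

select dim_table.attribute, sum(latest.measure)
from (
    select fact_table.*,
           row_number() over (partition by order_id order by version desc) as rn
    from fact_table
) latest, dim_table
where latest.rn = 1
and latest.key = dim_table.key
group by dim_table.attribute
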
In the next article of this series we’ll see a sample of the data we need to process.

(part 2 of the series is here)

Using PDI and C-tools to display real time scores analysis for UEFA Euro 2016

So, Euro 2016 started this past Friday. We’ll have matches almost every day until July 10th.

And here at Ubiquis we decided to start a little project around the Euro. The basic idea is: suppose England gets to half time level with Slovakia. Given both teams’ past history, what’s the most likely scenario? A win for England? A draw? A win for Slovakia? We got score change information for all Euro finals matches since 1960 from Wikipedia and parsed it with PDI. Then, using a real time scores API, we built a couple of queries that answer one basic question: “out of all matches in which team A was leading/trailing by X goals or more, at the same point in time, how many ended up in wins, losses and draws?”.

So, for example, when Germany was leading Ukraine by 1 goal at half time yesterday, the dashboard would give us some idea of how often Germany managed to keep a lead and win the match, or how often Ukraine managed to turn the tables when losing by 1 goal or more and ended up drawing or winning the match.

The dataset is quite small and the data model very simple, but it serves to show how Pentaho C-tools and PDI can be leveraged to create a real time information system, fed by external data sources, in a reliable manner.

Here’s the link: Euro 2016 Analysis dashboard (no longer available).

Hopefully we’ll use the learnings from this project to expand its scope and reliability and use it for other sports competitions. Stay tuned.

Fizz buzz

So, today is the day everybody should blog about something really stupid…

Ok, so here’s ours: Fizz Buzz in MDX.

Fizz Buzz, for those that don’t know it, is one of those “tests” interviewers love to use to check whether a candidate can code or not. It’s useless, as pretty much everybody knows it by now, but that’s beside the point.

There are a number of examples of “how to implement Fizz Buzz in language X”. Some are quite lame, others more impressive.

So, in the spirit of April Fools, here’s our Fizz Buzz implementation, in MDX. Run it against SteelWheelsSales, or any cube that has a dimension called [Time].


with
member [Time].[0] as 0
member [Time].[1] as 1
member [Time].[2] as 2
member [Time].[3] as 3
member [Time].[4] as 4
member [Time].[5] as 5
member [Time].[6] as 6
member [Time].[7] as 7
member [Time].[8] as 8
member [Time].[9] as 9
set DIGITS as { [Time].[0], [Time].[1], [Time].[2], [Time].[3], [Time].[4],
[Time].[5], [Time].[6], [Time].[7], [Time].[8], [Time].[9] }

set NUMBERS as Generate( DIGITS, Generate( DIGITS, DIGITS, ALL), ALL)

member [Measures].[Ordinal] as NUMBERS.CurrentOrdinal

member [Measures].[Fizz] as Iif( [Measures].[Ordinal] = Round( [Measures].[Ordinal]/3 ) *3, "fizz", "")
member [Measures].[Buzz] as Iif( [Measures].[Ordinal] = Round( [Measures].[Ordinal]/5 ) *5, "buzz", "")
member [Measures].[FB] as [Measures].[Fizz] || [Measures].[Buzz]

member [Measures].[FizzBuzz] as Iif( [Measures].[FB] = "", [Measures].[Ordinal], [Measures].[FB] )

select
NUMBERS on Rows,
{ [Measures].[Ordinal], [Measures].[FizzBuzz] } on Columns
from [SteelWheelsSales]


Although a trivial and quite futile exercise, it does have one feature worth mentioning: the ALL argument of the Generate function. Generate applies each element of the first set to each element of the second set. However, without the ALL argument, duplicates are removed from the resulting set, which would leave us with only a 10 element set.

Non-identifiable data: filtered measures in Mondrian

Here’s an interesting requirement (fictional, but inspired by a real world scenario):

We provide a mobile phone based service. People sign in, we know what phone, OS and browser they use, and we provide them some uber feature they all love. It works great and we now have a few million users. We want to publish some service statistics, open to our customers, such as: how many people use the service in a given town? What’s the distribution of phone brands/models? What percentage of users use a certain combination of mobile phone, operating system and browser? The problem resides in the following: we must respect users’ privacy and therefore cannot publish information that allows any given user to be identifiable. As such, we can only publish metrics if there’s a minimum of 10 users within any given category.

It’s easy to understand why this is relevant. If in a 500 people town only 1 person has a 1st generation iPhone, anybody with access to the data could identify that person’s usage pattern. However, if there’s a large number of users, then one individual’s actions are not easily identifiable.

The problem is: how can we implement that behaviour in a Mondrian cube?

Well, as it happens, there’s a way. To make matters easier to replicate, we’ll be using SteelWheelsSales, and will be limiting access to Sales information based on the Markets dimension, trying to provide aggregated Sales values for any given Market, Country or City, but without allowing a single invoice to be identifiable (well, at least not easily).

Part 1: conditionally display measures, based on fact counts

This is the core of the solution. The actual criteria to be implemented can be changed later, but first we need to prove we can even do this.

The idea is the following: given two measures, [Sales] and [Count], define a CalculatedMember that only displays safe values; then, remove access to some of the measures.

So we start by adding the measure definitions to our SteelWheelsSales cube:

  <Measure name="Count" column="TOTALPRICE" formatString="#,###" aggregator="count">
  <CalculatedMember name="Curated Sales" dimension="Measures">
        Iif( [Measures].[Count] < 10, null, [Measures].[Sales] )

The [Curated Sales] measure will only show a value if it’s supported by at least 10 facts from the fact table.

We now enable security by adding

  <Role name="Authenticated">
    <SchemaGrant access="none">
  <Role name="Administrator">
    <SchemaGrant access="all">
  <Role name="restricted">
    <SchemaGrant access="none">
      <CubeGrant cube="SteelWheelsSales" access="all">
        <HierarchyGrant hierarchy="[Measures]" access="custom">
          <MemberGrant member="[Measures].[Curated Sales]" access="all">
          <MemberGrant member="[Measures].[Quantity]" access="none">
          <MemberGrant member="[Measures].[Sales]" access="none">
          <MemberGrant member="[Measures].[Count]" access="none">

With this we explicitly deny access to everybody, and only grant conditional access to users with the restricted role.

A quick remark: when setting the access to [Measures] as custom, access to any measure that is not explicitly granted is denied. Which means, even if we don’t explicitly deny access to the [Sales], [Quantity] and [Count] measures, they won’t be accessible to the user. However, as a good practice, I prefer to explicitly grant/deny access to every measure, which may become a bit of a headache if the cube has a lot of measures.

The important bit here is that even though the formula references measures that are not accessible, it still works (much to my surprise; I was fearing a Mondrian error in the logs, which would make this approach a dead end).

With these new definitions in our cube, the query

SELECT
  Descendants( [Markets].[EMEA].[Belgium], 2) on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

will display only data for Brussels, whereas Charleroi will be displayed as null. This is because the count of rows that make up Charleroi is only 8.

So we’re done, right? Well, not quite.

Part 2: partial information can be just as dangerous

In the scenario above, Belgium has 2 cities. One is deemed unsafe and is therefore hidden, but the other is safe, so it’s visible.

However, the total for the country will still be considered safe and as such the query

SELECT
  [Markets].[EMEA].[Belgium] on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

displays the correct total for Belgium. Subtracting the value for Brussels from the total for Belgium gives us the correct value for Charleroi, which should be hidden. So we’re not in the clear yet.

To add to the confusion, we need a solution that scales without damaging performance a lot. So, we want to avoid querying the lowest level of the dimension if we’re only querying at the country level.

Here’s a possible set of criteria:
– A member of the dimension is deemed unsafe if it has a fact count below 10;
– If any sibling of a member is unsafe, then the member is also considered unsafe.

That way, because Charleroi’s underlying dataset is too small, Brussels would also be considered unsafe to display, thus preventing simple arithmetic from cracking the numbers:

  <CalculatedMember name="Curated Sales" dimension="Measures">
     Iif( Min( [Markets].CurrentMember.Siblings, [Measures].[Count] ) < 10, null, [Measures].[Sales] )

We can now query the value for Belgium, and it’s correctly displayed, but we don’t show the values for any of its cities, as at least one of them is considered unsafe.

Ok, but maybe we’re throwing the baby out with the bathwater. For example, if we try to see the values for all the cities in California, the query

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales] WHERE [Time].[2004]

will display nothing, as there are unsafe cities; however, there are 5 of them. Hiding only those few unsafe cities would still be safe, as no single city value could be determined. Maybe we can find a better algorithm that achieves the same result without hiding so much.

Part 3: safety in numbers; unsafe data can be made safe if there’s a lot of it.

Let’s refine our criteria:
– A member is deemed unsafe if it has a low count, of less than 10;
– If no sibling has a low count, the member is safe;
– But if the member has many unsafe siblings (more than 3) it may still be considered safe;

So, we define the measure as

  <CalculatedMember name="Curated Sales" dimension="Measures">
        [Measures].[Count] > 10
       AND (
          Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) = 0
        OR Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) > 3 
     , [Measures].[Sales], null )

With this definition, the query

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales]

will not display the value for sales for any city of California, as there’s only one unsafe city; however,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales] WHERE [Time].[2004]

will display the value for most cities, as there’s a significant number of unsafe cities and therefore their real values are obscured by their numbers. Even if we subtract the known city values from the CA total we only get the sum of those small cities, not each one individually.


Part 4: caveat emptor

Let’s not go overboard with this approach. It’s not perfect. In fact, it’s perfectly feasible that a given city is considered safe for [Time].[All Years], [Time].[2003] and [Time].[2004], but not for [Time].[2005], in which case it would still be possible to determine the unsafe value by asking Mondrian the right questions.

Determining the right set of criteria is very tough; you should probably ask your on-duty statistician to help define them, and spend a significant amount of time testing the algorithm.

But it’s a start.

Secret Santa Generator

It’s Christmas! And with Christmas comes Secret Santa! We replicated the “names in a hat” process which is normally used for this kind of thing: write all the names on pieces of paper and put them in a hat; each person takes one and, if they draw their own name, they put it back and draw another. If the last person draws their own name, the whole process is repeated. We know it’s not the most efficient way to do it, but it’s an interesting challenge to implement in PDI. You can find it here.

We start with a list of names and emails in a CSV file. PDI reads them, generates two sets of random numbers, builds the pairings and checks if in any pair there’s a repeated name. If so, it repeats the process again, until we have a valid list of pairs of names and then it emails everyone with the names of the recipient they were assigned.

Using loops in PDI is a powerful pattern, but a dangerous one. One should always guarantee that the job will not loop indefinitely, eventually blowing up with an out of memory error. In our case this would happen if the list provided has only one name on it, but we trust the user not to misuse the program (besides, what’s the point of having a Secret Santa if there’s only one Santa to assign, anyway?).

Ctools table component tweaks

(tl;dr: jump straight to section 3 below if you already know about scrolling tables)

Tables in Ctools are implemented by DataTables, a jQuery plugin to… well, display tables. However, as is normally the case with Ctools, not all features are exposed via the CDE interface. And how could they be, given the astounding number of features available plus the ever growing list of DataTables plugins out there. Implementing that extra bit of behaviour the users really love ends up being more an exercise in googling and finding the right plugin or trick for the task than exploring the list of available properties in the Advanced column of CDE.

And that is the subject of today’s post. A specific tweak to the DataTables component that may be quite handy to know about.

1. To paginate or not to paginate.
By default tables are paginated in DataTables and, therefore, in CDE. However, especially if the result set is not very large, users may prefer a single page table to be displayed. Ok, that’s easy, just go to your Advanced properties tab and set pagination to False. Done, easy.

2. Don’t paginate, but scroll
However, single page tables are really only useful if your table is displaying a handful of rows. If you’re displaying some 20-30 records, for example, they may not be enough to justify pagination, but they’re way too many to display in a single page. One way around it is via CSS: just set its parent div height to be of a fixed size and set its overflow property to scroll. Again, done, easy.

Ah, except that now the table headers, filter box, etc. disappear when you scroll. And you can’t just set the scroll on the tbody element (ok, maybe you can with some really clever CSS hack). Fortunately, there’s an easy way around it. A quick Google search brings us to an example showing a scrollable table. The relevant property is scrollY, which can be set to any meaningful unit (e.g. 200px; vh and vw are units proportional to the viewport). To use it in CDE all we need is to add a new Extra option on the Advanced properties panel and set the property scrollY to whatever value we want.

Ok, but this is all relatively old news. If you have some experience with CDE and tables you may already know all this. So let’s move on to something cooler.

3. But when the table refreshes, the scrollable area goes back to the top
Ah, so now your table scrolls up and down, but it has a major issue: the user wants the table to remember where it was. Namely, when the table refreshes because some parameter changed, the table should keep the search string and the scroll position and go back to the previous state.

For that, unfortunately, there’s no option in DataTables we can use, at least none we know about. So we need to implement it. Fortunately, it’s not that complex and the idea can be generalized for other properties we want to keep across component updates.

Step 1: add the following to the table’s preExecution function:

if( typeof this.tableState == "undefined"){
this.tableState = {
search: '',
scroll: 0
return true;

This creates the tableState property where we can store the two values we want to keep.

Step 2: add the following on the preChange of any selector that changes a parameter the table listens to:

var t = render_customerTable;
if( typeof t.tableState != "undefined"){ = $("#" + t.htmlObject + " div.dataTables_filter input").val();
    t.tableState.scroll = $("#" + t.htmlObject + " .dataTables_scrollBody").scrollTop();
}

(we need the condition just in case the selector renders before the table itself, which is the case in the sample provided)

This retrieves the two values we want to keep and updates the tableState to reflect them.

Step 3: add this to the postExecution of the table:

var e = this.placeholder().find("div.dataTables_filter input");
this.placeholder().find(" .dataTables_scrollBody" ).scrollTop(this.tableState.scroll);

This will restore the contents of the search box, force the table to draw itself again (by calling the keyup event) and then scroll the table body.

And that’s it. The table now does what we needed with relatively low effort.

Note that the code was written in the pre-require framework. If your dashboard has support for require.js the code should still work, but there are probably better ways to interact with the table, other than jQuerying its HTML target.

If you want to try it, download the sample we created.

Join the team!

We’re looking for a new junior member to join our team of Business Analytics and Big Data consultants.

If you’re passionate about data and have strong analytical skills this is a good opportunity to join a dynamic team developing high quality solutions using Pentaho and Hadoop.

Previous experience in programming (Java or Javascript preferred) and in SQL is desired, but not essential.

Apply now