Using PDI to build a Crosstabs Report
Recently we were in the midst of a migration from an older version to a more recent version of Pentaho Report Designer (PRD), and we were asked to make some prpt reports produce the same results in PRD 7.1 as they did in 3.9.1. You might not know this, but PRD has been know to be “volatile/creative” when you render reports in different versions, especially when these reports include crosstabs.
Crosstabs are an eternal promise of PRD, it was made available at some point but deprecated shortly after. It allowed you to build tables where you did not know how many rows or columns you expected to have.
As an example: what if we want to build a table with all our employee information, plus for each one a column for the products they sold with the revenue they generated. However I don’t know which products they are tasked to sell. Below a simple example:
|Employee Name||Single Bed||Table||Chair||Stool|
If you want the report to be produced weekly, you could create a nice crosstabs report that every week would automatically query my sales data to show me a column for all the products each employee is selling. This is a cool feature, but faced with a new version where crosstabs are deprecated, and all of the reports are rendering incorrectly, we decided to create the reports using Pentaho Data Integration (PDI) instead.
The reports in question were mainly tables written in Excel (xlsx), so there is no reason to stick to PRD, we can move to PDI using the Excel writer step, and ETL Metadata injection to query and inject the products each week.
Build a template
Take the following sample query output:
What we need to do is denormalise our data, creating a column per product with the revenue as the aggregate. This will be quite a simple template:
We use the key “product”, since we want a column per product, and using the value “revenue” grouping on the employee information we should get the total revenue per product for each user:
So as you can see above we will need to inject a list of products as targets, a value fieldname and a datatype. This is simple enough but we also need to write this to an xlsx file.
Depending on the final report, you might need to add different styles in the Excel Writer for header and total cells, or even for different sub tables. In our case, our report had specific styles and sheet names we needed to preserve. This means we need to use a template to define styles for each cell. Which means our output file will be created as a copy of our template file, and then overwritten with specific styles already placed in the template.
Looking through the options in the writer:
- If output file exists: We want to replace any file with the same name, since this is the main writer of the report.
- Wait for first row before creating file: If this is a part of a larger job, there could be use cases where you do not want to create your report file as soon as the transformation starts, but rather wait to see if there are any rows coming from a query, or from a previous process.
- So we select “Use template when creating new files”, and add the path for the template. We also add the sheet name we want from the template file, and make the sheet active, writing to the copy created.
- For each field we need to provide a name, type, format, style and header_style, in case they are different.
In this example we need to inject a list of all field names, including user fields; plus a type, format, style for each cell, and header style.
Build an Injector
So now we need to think about how to build an injector. First we want to avoid running the query twice, once for reading the metadata structure and the other for the data itself. So rather than the injector transformation running the query and then the template doing the same, we chose to stream the rows with the ETL Metadata Injection step.
We need to split our stream into three, one to treat and stream the data into the injector, one to inject the fields to the row denormaliser which will denormalise the data, and finally one to inject the fields into the excel writer, including stylings etc…
The first stream is where we might need to do some data treatment before streaming, in our simple example there is no such need. We also add a blocking step to prevent deadlocks, since we are breaking a cardinal rule by splitting and rejoining a stream. We need to make sure any large data sources don’t create a jam.
Streaming the data is quite easy you directly output rows from a step in your injector transformation, into a specific step in you injected one.
Note: Small tip, when streaming data into the injected transformation we suggest temporarily persisting your data so that you can easily preview the template after injection.
Now lets move on to the second data stream where we prepare the metadata by grouping it on the “products” field.
We use a Memory Group By step in order to avoid having a separate sort steps. First we add the group field “Products”.
And add the value fieldname and the type to each product row
This is injected into our Denormalise Rows step in our template
Finally we focus on the stream that will be injected into the Excelwriter step.
First we decided to rename Product as “output_field_name” for consistency and to avoid confusion with the field “product” coming from above.
Then we need inject the cell references for the styles we want for each type of cell. We want our report to look like this:
Our template looks like this:
The style for our body cells are assigned to cell A2, our header to cell A1 for the first 3 fields and D1 thereafter.
With an add constants we add to the configuration for the products:
Then using a data grid step we add the metadata for the fields User, First Name and Last Name with
So now we have a list of fields to inject to our writer, with formats, types and styles for header and field cell.
Since our template row 1 and 2 are styled, and since we chose to overwrite existing cells in our excel writer options, we write the header while preserving the style of A1, and therefore allow for all other cells to be able to use that same style.
The result is our report.
This of course is the most simple of cases, and we can always build from here. In our real world case we needed to have a totals row with its specific formatting, which we added in a separate Excel writer step.
With these techniques we were able to avoid PRD altogether to reproduce the reports we needed.
Download the files
You can find the code and dummy data here.