Using PDI and C-tools to display real-time score analysis for UEFA Euro 2016

So, Euro 2016 started this past Friday. We’ll have matches almost every day until July 10th.

And here at Ubiquis we decided to start a little project around the Euro. The basic idea is: suppose England gets to half time level with Slovakia. Given both teams’ past history, what’s the most likely scenario? A win for England? A draw? A win for Slovakia? We took score change information for all Euro finals matches since 1960 from Wikipedia and parsed it with PDI. Then, using a real-time scores API called xmlscores.com, we built a couple of queries that answer one basic question: “out of all matches in which team A was leading/trailing by X goals or more at the same point in time, how many ended up as wins, losses and draws?”.

So, for example, when Germany was leading Ukraine by 1 goal at half time yesterday, the dashboard would give us some idea of how often Germany managed to keep a lead and win the match, or how often Ukraine managed to turn the tables when losing by 1 goal or more and ended up drawing or winning the match.
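For the curious, here is a minimal sketch of the kind of aggregation behind that question, in plain JavaScript. The field names are made up for illustration and are not the actual data model; each match record is assumed to carry the goal difference (team A minus team B) per minute and at full time, derived from the parsed score changes.

// Illustrative sketch only -- field names are hypothetical, not the real data model.
function outcomeCounts(matches, minute, minLead) {
  var tally = { wins: 0, draws: 0, losses: 0 };
  matches.forEach(function (m) {
    if (m.diffAtMinute[minute] >= minLead) {    // team A leading by minLead or more at that minute
      if (m.finalDiff > 0)        tally.wins++;
      else if (m.finalDiff === 0) tally.draws++;
      else                        tally.losses++;
    }
  });
  return tally;
}

// e.g. all matches where a team led by 1+ goals at half time (minute 45):
// outcomeCounts(allEuroFinalsMatches, 45, 1); // -> { wins: ..., draws: ..., losses: ... }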

The dataset is quite small and the data model very simple, but it serves to show how Pentaho C-tools and PDI can be leveraged to create a real-time information system, fed by external data sources, in a reliable manner.

Here’s the link: Euro 2016 Analysis dashboard (no login required).

Hopefully we’ll use the learnings from this project to expand its scope and reliability and use it for other sports competitions. Stay tuned.

Fizz buzz

So, today is the day everybody should blog about something really stupid…

Ok, so here’s ours: Fizz Buzz in MDX.

Fizz Buzz, for those who don’t know it, is one of those “tests” interviewers love to use to check whether a candidate can code or not. It’s useless, as pretty much everybody knows it by now, but that’s beside the point.

There are a number of examples of “how to implement Fizz Buzz in language X”. Some are quite lame, others more impressive.

So, in the spirit of April Fools, here’s our Fizz Buzz implementation, in MDX. Run it against SteelWheelsSales, or any cube that has a dimension called [Time].


WITH

member [Time].[0] as 0
member [Time].[1] as 1
member [Time].[2] as 2
member [Time].[3] as 3
member [Time].[4] as 4
member [Time].[5] as 5
member [Time].[6] as 6
member [Time].[7] as 7
member [Time].[8] as 8
member [Time].[9] as 9
set DIGITS as { [Time].[0], [Time].[1], [Time].[2], [Time].[3], [Time].[4],
[Time].[5], [Time].[6], [Time].[7], [Time].[8], [Time].[9] }

set NUMBERS as Generate( DIGITS, Generate( DIGITS, DIGITS, ALL), ALL)

member [Measures].[Ordinal] as NUMBERS.CurrentOrdinal

member [Measures].[Fizz] as Iif( [Measures].[Ordinal] = Round( [Measures].[Ordinal]/3 ) *3, "fizz", "")
member [Measures].[Buzz] as Iif( [Measures].[Ordinal] = Round( [Measures].[Ordinal]/5 ) *5, "buzz", "")
member [Measures].[FB] as [Measures].[Fizz] || [Measures].[Buzz]

member [Measures].[FizzBuzz] as Iif( [Measures].[FB] = "", [Measures].[Ordinal], [Measures].[FB] )

SELECT
NUMBERS on Rows,
{ [Measures].[Ordinal], [Measures].[FizzBuzz] } on Columns

FROM
[SteelWheelsSales]

Although a trivial and quite futile exercise, it does have one feature worth mentioning: the ALL argument in the Generate function. Generate applies each element of the first set to each element of the second set. However, without the ALL argument duplicates are removed from the resulting set, which would leave us with only a 10-element set instead of the 1,000 rows we need.

Non-identifiable data: filtered measures in Mondrian

Here’s an interesting requirement (fictional, but inspired by a real world scenario):

We provide a mobile phone based service. People sign in, we know what phone, OS and browser they use, and we provide them some uber feature they all love. It works great and we now have a few million users. We want to publish some service statistics, open to our customers: How many people use the service in a given town? What’s the distribution of phone brands/models? What percentage of users uses a certain combination of mobile phone, operating system and browser? The problem is the following: we must respect our users’ privacy and therefore cannot publish information that allows any given user to be identified. As such, we can only publish metrics if there is a minimum of 10 users within any given category.

It’s easy to understand why this is relevant. If, in a town of 500 people, only 1 person has a 1st generation iPhone, anybody with access to the data could identify that person’s usage pattern. However, if there’s a large number of users, then one individual’s actions are not easily identifiable.

The problem is: how can we implement that behaviour in a Mondrian cube?

Well, as it happens, there’s a way. To make matters easier to replicate, we’ll be using SteelWheelsSales, and will be limiting access to Sales information based on the Markets dimension, trying to provide aggregated Sales values for any given Market, Country or City, but without allowing a single invoice to be identifiable (well, at least not easily).

Part 1: conditionally display measures, based on fact counts

This is the core of the solution. The actual criteria to be implemented can be changed later, but first we need to prove we can even do this.

The idea is the following: given two measures, [Sales] and [Count], define a CalculatedMember that only displays safe values; then, remove access to the raw measures.

So we start by adding the measure definitions to our SteelWheelsSales cube:

  <Measure name="Count" column="TOTALPRICE" formatString="#,###" aggregator="count">
  </Measure>
  <CalculatedMember name="Curated Sales" dimension="Measures">
    <Formula>
      <![CDATA[
        Iif( [Measures].[Count] < 10, null, [Measures].[Sales] )
      ]]>
    </Formula>
  </CalculatedMember>

The [Curated Sales] measure will only display a value if it’s supported by at least 10 facts from the fact table.

We now enable security by adding

  <Role name="Authenticated">
    <SchemaGrant access="none">
    </SchemaGrant>
  </Role>
  <Role name="Administrator">
    <SchemaGrant access="all">
    </SchemaGrant>
  </Role>
  <Role name="restricted">
    <SchemaGrant access="none">
      <CubeGrant cube="SteelWheelsSales" access="all">
        <HierarchyGrant hierarchy="[Measures]" access="custom">
          <MemberGrant member="[Measures].[Curated Sales]" access="all">
          </MemberGrant>
          <MemberGrant member="[Measures].[Quantity]" access="none">
          </MemberGrant>
          <MemberGrant member="[Measures].[Sales]" access="none">
          </MemberGrant>
          <MemberGrant member="[Measures].[Count]" access="none">
          </MemberGrant>
        </HierarchyGrant>
      </CubeGrant>
    </SchemaGrant>
  </Role>

With this we deny access by default to authenticated users, grant full access to administrators, and grant conditional access to users with the restricted role.

A quick remark: when setting the access to [Measures] as custom, access to any measure that is not explicitly granted is denied. Which means, even if we don’t explicitly deny access to the [Sales], [Quantity] and [Count] measures, they won’t be accessible to the user. However, as a good practice, I prefer to explicitly grant/deny access to every measure, which may become a bit of a headache if the cube has a lot of measures.

The important bit here is that even though the formula references measures that are not accessible, it still works (much to my surprise; I was fearing a Mondrian error in the logs, which would make this approach a dead end).

With these new definitions in our cube, the query

SELECT
  Descendants( [Markets].[EMEA].[Belgium], 2) on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

will display only data for Brussels, whereas Charleroi will be displayed as null. This is because the Count of rows that make up Charleroi is only 8.

So we’re done, right? Well, not quite.

Part 2: partial information can be just as dangerous

In the scenario above, Belgium has 2 cities. One is deemed unsafe and is therefore hidden, but the other is safe, so it’s visible.

However, the total for the country will still be considered safe and as such the query

SELECT
  [Markets].[EMEA].[Belgium] on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

displays the correct total for Belgium. By subtracting the value for Brussels from the total for Belgium we get the correct value for Charleroi, which should be hidden. So we’re not in the clear yet.
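To spell out the arithmetic of the leak (the figures below are made up for illustration):

// Hypothetical figures, for illustration only.
var belgiumTotal   = 33000;  // visible: the country total is deemed safe
var brusselsSales  = 26000;  // visible: Brussels has 10 or more facts
var charleroiSales = belgiumTotal - brusselsSales; // 7000 -- the "hidden" value, recovered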

To complicate matters further, we need a solution that scales without hurting performance too much. In particular, we want to avoid querying the lowest level of the dimension if we’re only querying at the country level.

Here’s a possible set of criteria:
– A member of the dimension is deemed unsafe if it has a fact count of less than 10;
– If any sibling of a member is unsafe, then the member is also considered unsafe.

That way, because Charleroi has too small an underlying dataset, Brussels is also considered unsafe to display, preventing simple arithmetic from cracking the numbers:

  <CalculatedMember name="Curated Sales" dimension="Measures">
   <Formula>
    <![CDATA[
     Iif( Min( [Markets].CurrentMember.Siblings, [Measures].[Count] ) < 10, null, [Measures].[Sales] )
    ]]>
   </Formula>
  </CalculatedMember>

We can now query the value for Belgium, and it’s correctly displayed, but we don’t show the values for any of its cities, as at least one of them is considered unsafe.

Ok, but maybe we’re throwing the baby out with the bathwater. For example, if we try to see the value for all cities in California,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  {[Measures].[Curated Sales], [Measures].[Count]} on Columns
FROM [SteelWheelsSales] WHERE [Time].[2004]

will display nothing, as there are unsafe cities; however, there are 5 of them. Hiding only those unsafe cities would still be safe, as no single city’s value could be determined. Maybe we can find a better algorithm that achieves the same result without hiding so much.

Part 3: safety in numbers; unsafe data can be made safe if there’s a lot of it.

Let’s refine our criteria:
– A member is deemed unsafe if it has a low count, of less than 10;
– If no sibling has a low count, the member is safe;
– But if the member has many unsafe siblings (more than 3), it may still be considered safe.

So, we define the measure as

  <CalculatedMember name="Curated Sales" dimension="Measures">
   <Formula>
    <![CDATA[
     Iif(
        [Measures].[Count] >= 10
       AND (
          Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) = 0
        OR Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) > 3 
       )
     , [Measures].[Sales], null )
    ]]>
   </Formula>
  </CalculatedMember>

With this definition, the query

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales]

will not display the value for sales for any city of California, as there’s only one unsafe city; however,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales] WHERE [Time].[2004]

will display the value for most cities, as there’s a significant number of unsafe cities and therefore their real values are obscured by their numbers. Even if we subtract the known city values from the CA total we only get the sum of those small cities, not each one individually.
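Again with made-up numbers, here is why this is acceptable:

// Hypothetical figures, for illustration only.
var caTotal       = 100000;                 // visible state total
var visibleCities = [40000, 25000, 15000];  // the safe, individually visible cities
var knownSum = visibleCities.reduce(function (a, b) { return a + b; }, 0);
var hiddenCombined = caTotal - knownSum;    // 20000, shared among 5 hidden cities
// An attacker only learns the combined value of the hidden cities, not any single one.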

However…

Part 4: caveat emptor

Let’s not go overboard with this approach. It’s not perfect. In fact, it’s perfectly feasible that a given city is considered safe for [Time].[All Years], [Time].[2003] and [Time].[2004], but not for [Time].[2005], in which case it would still be possible to determine the unsafe value by asking Mondrian the right questions.

Determining the right set of criteria is very tough; you should probably ask your on-duty statistician to help define it, and spend a significant amount of time testing the algorithm.

But it’s a start.

Secret Santa Generator

It’s Christmas! And with Christmas comes Secret Santa! We replicated the names-in-a-hat process normally used for this kind of thing: write all the names on pieces of paper and put them in a hat; each person draws one, and if they draw their own name, they put it back and draw another. If the last person draws their own name, the whole process is repeated. We know it’s not the most efficient way to do it, but it’s an interesting challenge to implement in PDI. You can find it here.

We start with a list of names and emails in a CSV file. PDI reads them, generates two sets of random numbers, builds the pairings and checks whether any pair contains a repeated name. If so, it repeats the process until we have a valid list of pairs of names, and then it emails everyone with the name of the recipient they were assigned.
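For reference, here is the same names-in-a-hat logic sketched in plain JavaScript. This is illustrative only; the actual implementation is a PDI job built from steps, not code.

// Illustrative sketch only -- the real implementation is a PDI job, not JavaScript.
function shuffle(list) {
  // Fisher-Yates shuffle of a copy of the list (the "hat")
  var a = list.slice();
  for (var i = a.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = a[i]; a[i] = a[j]; a[j] = tmp;
  }
  return a;
}

function drawSecretSanta(names) {
  var hat, valid;
  do {
    hat = shuffle(names);
    // the draw is only valid if nobody picked their own name
    valid = names.every(function (name, i) { return name !== hat[i]; });
  } while (!valid); // otherwise everything goes back in the hat and we draw again
  return names.map(function (name, i) {
    return { santa: name, recipient: hat[i] };
  });
}

// drawSecretSanta(["Ana", "Bob", "Carla"]) -> e.g. [{santa: "Ana", recipient: "Carla"}, ...]
// Note: with a single name this loops forever -- the same caveat discussed below for the PDI job.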

Using loops in PDI is a powerful pattern but a dangerous one. One should always guarantee that the job will not loop indefinitely, eventually blowing up with an out-of-memory error. In our case this would happen if the list provided has only one name on it, but we trust the user not to misuse the program (besides, what’s the point of having a Secret Santa if there’s only one Santa to assign, anyway?).

Ctools table component tweaks

(tl;dr: jump straight to section 3 below if you already know about scrolling tables)

Tables in Ctools are implemented with DataTables.net, a jQuery plugin to… well, display tables. However, as is normally the case with Ctools, not all features are exposed via the CDE interface. And how could they be, given the astounding number of features available plus the ever-growing list of DataTables plugins out there? Implementing that extra bit of behaviour the users really love ends up being more an exercise in googling and finding the right plugin or trick than in exploring the list of available properties in the Advanced column of CDE.

And that is the subject of today’s post. A specific tweak to the DataTables component that may be quite handy to know about.

1. To paginate or not to paginate.
By default tables are paginated in DataTables and, therefore, in CDE. However, especially if the result set is not very large, users may prefer a single page table to be displayed. Ok, that’s easy, just go to your Advanced properties tab and set pagination to False. Done, easy.

2. Don’t paginate, but scroll
However, single-page tables are really only useful if your table is displaying a handful of rows. If you’re displaying some 20-30 records, for example, there may not be enough to justify pagination, but there are too many to display comfortably on a single page. One way around it is via CSS: just set the parent div’s height to a fixed size and its overflow property to scroll. Again, done, easy.

Ah, except that now the table headers, filter box, etc. disappear when you scroll. And you can’t just set the scroll on the tbody element (ok, maybe you can with some really clever CSS hack). Fortunately, there’s an easy way around it. A quick Google search brings us to an example showing a scrollable table. The relevant property is scrollY, which can be set to any meaningful CSS unit (e.g. 200px; vh and vw are units proportional to the viewport). To use it in CDE, all we need is to add a new Extra option on the Advanced properties panel and set the property scrollY to whatever value we want.
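For reference, in raw DataTables terms the two settings discussed above amount to something like this (a sketch; in CDE you would set them as Extra Options on the table component rather than calling DataTables yourself, and older bundled versions may use the legacy names bPaginate/sScrollY instead):

// Plain DataTables equivalent of the CDE settings discussed above (sketch only).
$('#myTable').DataTable({
  paging: false,     // single-page table (the "pagination" property in CDE)
  scrollY: '200px'   // fixed-height scrollable body; any CSS length works (px, vh, ...)
});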

Ok, but this is all relatively old news. If you have some experience with CDE and tables you may already know all this. So let’s move on to something cooler.

3. But when the table refreshes, the scrollable area goes back to the top
Ah, so now your table scrolls up and down, but it has a major issue: the user wants the table to remember where it was. Namely, when the table refreshes because some parameter changed, the table should keep the search string and the scroll position and go back to the previous state.

For that, unfortunately, there’s no option in DataTables we can use, at least none we know about. So we need to implement it. Fortunately, it’s not that complex and the idea can be generalized for other properties we want to keep across component updates.

Step 1: add the following to the table’s preExecution function:

function(){
  if( typeof this.tableState == "undefined" ){
    this.tableState = {
      search: '',
      scroll: 0
    };
  }
  return true;
}

This creates the tableState property where we can store the two values we want to keep.

Step 2: add the following on the preChange of any selector that changes a parameter the table listens to:

function(){
  var t = render_customerTable;
  if( typeof t.tableState != "undefined" ){
    t.tableState.search = $("#" + t.htmlObject + " div.dataTables_filter input").val();
    t.tableState.scroll = $("#" + t.htmlObject + " .dataTables_scrollBody").scrollTop();
  }
}

(we need the condition just in case the selector renders before the table itself, which is the case in the sample provided)

This retrieves the two values we want to keep and updates the tableState to reflect them.

Step 3: add this to the postExecution of the table:

function(){
  var e = this.placeholder().find("div.dataTables_filter input");
  e.val(this.tableState.search);
  e.keyup();
  this.placeholder().find(".dataTables_scrollBody").scrollTop(this.tableState.scroll);
}

This restores the contents of the search box, forces the table to redraw itself (by triggering the keyup event) and then scrolls the table body back to the stored position.

And that’s it. The table now does what we needed with relatively low effort.

Note that the code was written in the pre-require framework. If your dashboard has support for require.js the code should still work, but there are probably better ways to interact with the table, other than jQuerying its HTML target.

If you want to try it, download the sample we created.

Join the team!

We’re looking for a new junior member to join our team of Business Analytics and Big Data consultants.

If you’re passionate about data and have strong analytical skills this is a good opportunity to join a dynamic team developing high quality solutions using Pentaho and Hadoop.

Previous experience in programming (Java or Javascript preferred) and in SQL is desired, but not essential.

Apply now

Ctools now supports Require.js – But… what about jQuery plugins?

Great news from the Pentaho Community Dashboards front. As of the current version (15.06.30), there’s support for Require.js, which greatly streamlines the JavaScript downloaded when rendering a dashboard, cleans up the global scope and provides proper management of modules and dependencies. I won’t bother you with the details; you can read more about it in Pedro Alves’s post.

This is truly a game changer when it comes to developing CDF/CDE dashboards. We have to write our code better, with a proper architecture in place to make all the pieces talk to each other; we need to use the proper methods to set or get parameter values, to run custom functions in pre-execution or post-execution code snippets, and so on.
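For instance, parameter access now goes through the dashboard object rather than the old global helpers; something along these lines (a sketch based on the CDF API as we understand it; check the docs for your exact version):

// Inside a component's preExecution/postExecution, `this.dashboard` is available.
var current = this.dashboard.getParameterValue('myParam'); // read a parameter
this.dashboard.setParameter('myParam', 'newValue');        // set it without notifying listeners
this.dashboard.fireChange('myParam', 'newValue');          // set it and trigger dependent components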

This is not a bad thing, far from it. But it doesn’t come without its challenges. We’ve started playing with the new require.js framework here at Ubiquis and we’ve learned useful tricks already.

Here’s one: remember how easy it was to upload that myAwesomeJqueryPlugin that you found on some GitHub repository and that allowed you to implement that super-duper feature on your dashboard? Well, it won’t work anymore. Not without a bit of work, at least. Here’s why: your typical jQuery plugin is a function of the type

(function ($) {
// your awesome code goes here
})(jQuery);

But now it will fail. The reason is that $ is no longer defined in the global scope. It’s still exposed and you can access it as usual from within any component’s internal methods, but not when you’re loading your jQuery plugins.

So, after some reading, we found a very interesting Stack Overflow post that pretty much solves the problem. We just need a few tweaks to the plugin code to make it require-friendly:


(function (factory) {
  if (typeof define === 'function' && define.amd) {
    // AMD. Register as an anonymous module depending on jQuery.
    define(['jquery'], factory);
  } else {
    // No AMD. Register plugin with global jQuery object.
    factory(jQuery);
  }
}(function ($) {
  // your awesome code goes here
}));

(note that the whole jQuery plugin is now passed as the factory argument to the wrapper function)

And there you have it: you can continue to use all the jQuery plugins you want, which are now loaded as require.js modules.

PDI: getting the previous value of a field, but only sometimes

In PDI we can easily retrieve the previous (or next) value of a field by using the Analytical Query transformation step. It allows us to fetch the value of a given field or fields N rows back or N rows forward. Quite useful for things like cumulative sums (although for that we could just use a Group by step), or getting the previous attribute values of a SCD to map a type II SCD to a type III SCD, etc.

However, this is a rather limited step:

  • The number of rows back or forth is fixed; it can be a variable, but can’t be dynamically changed;
  • We can only get a field’s value, not anything more complex, like operations on fields;
  • We can’t get values based on a condition (e.g., if customer number is null, take the previous customer number; otherwise, build it using some algorithm)

We were facing a problem where the answer seemed to be the Analytical Query step… only it wasn’t. Here’s the problem: in an event tracking database we have a category_raw field; as a general rule, this is the value we want to use as the true category of the object, let’s call it category_new. However, if the category_raw field value is ‘X’, we should ignore it and instead use the previous category_new value. Sounds simple, right?

Here’s a sample of the data and the desired output:

category_raw;category_new
A;A
B;B
X;B
D;D
E;E

From here it seems quite obvious: we use the Analytical Query step, fetch the category_raw value of the previous row and with a simple condition we evaluate

category_raw == "X" ? prev_category_raw : category_raw

However, if we have various consecutive exceptions,

category_raw;category_new
A;A
X;A
X;A
D;D
E;E

This approach doesn’t work: the 3rd row would fetch “X”, the value of the previous row, not “A” which occurred 2 rows before.

We tried to trick the Analytical Query step into fetching the value of the same field it’s creating, but that doesn’t work either. The step has to read values from one field and write a new field, but they must be different.

In the end we decided to go with a simple JavaScript step, not so much because it’s the only way to go (you can quite easily fetch the objects you need using a Java class, Java expression, etc.), but because it’s simple.

Here’s the JavaScript code:

var category_new;
if( category_raw == "X" ){
  category_new = getVariable( "prev_category_new", "" );
}else{
  category_new = category_raw;
}
setVariable( "prev_category_new", category_new, "s" );

We know we’re setting a variable and using it in the same transformation, which is something we were always told is wrong. And usually, it is. However, as setting and getting the variable both happen inside the same step, it’s actually harmless: the variable is set for row 1 before it’s called on row 2, and so on…

Here’s a sample of the result:

category_raw;category_new
A;A
B;B
X;B
X;B
X;B
F;F
X;F
H;H

Each row picks up the correct category_new value, taking into account the handling of the exceptional “X” values.

Hello world, again

Well, we finally got round to upgrading our Pentaho installation to 5.2 and decided to revamp our World Population dashboard in the process.

First thing you may notice: it’s called World Data now, not World Population. That’s because we plugged in data available from Gapminder and added a few selectors to allow you to choose different metrics. Anything from overall population to GDP per capita in inflation-adjusted PPP dollars, to the percentage of renewable energies: you name it and Gapminder probably has it.

The best part: there’s time series data, sometimes dating as far back as the Middle Ages, which allows us to show how those indicators evolved for any particular country. Just click on a country and you’ll see.

We also added a switch that lets you change the colour scale from linear to log (the algorithm decides automatically which one to use, but sometimes it may choose unwisely).
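Conceptually, the automatic choice is a heuristic along these lines (illustrative only; not necessarily the exact rule the dashboard uses):

// Illustrative heuristic only -- not necessarily what the dashboard actually does.
// When values span several orders of magnitude, a log scale usually reads better.
function pickColorScale(values) {
  var positive = values.filter(function (v) { return v > 0; });
  if (positive.length === 0) { return 'linear'; }  // nothing to take a log of
  var min = Math.min.apply(null, positive);
  var max = Math.max.apply(null, positive);
  return (max / min > 100) ? 'log' : 'linear';
}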

Here, enjoy it: World Data demo dashboard.

Reading/Writing files from a Kettle CDA datasource in Pentaho 5.x

Kettle datasources in CDA are the most versatile way of getting data from disparate systems into a dashboard or report. Both C-tools and PRD expose Kettle as a datasource, so we can call a transformation that queries several different systems (databases, web services, etc.), do some Kettle magic on the results and then ship the final resultset back to the server.

However… there’s no out-of-the-box way to reference sub-transformations in Pentaho 5.x. Solution files are now hosted in Jackrabbit, not on the file system, and ${Internal.Transformation.Filename.Directory} just doesn’t work. If you have a transformation calling a sub-transformation and upload both to the repository, when you run it you get an error message saying that it cannot find the file.

Although this usually isn’t a problem, as most Kettle transformations used by dashboards are relatively simple, for more complex tasks you may want to use mappings (sub-transformations), call a job or transformation executor or, perhaps the best use case, use metadata injection. In these scenarios you _really_ need to be able to call ktr files from the BA server context, and as it happens you can’t.

Well, sort of. As it turns out, you can.

The problem: let’s say you have your server installed in /opt/pentaho/biserver-ce (or biserver-ee, we at Ubiquis don’t discriminate), and your Kettle file is in your repository at /public/myDashboard/files/kettle/myTransformation.ktr. You want to use metadata injection on a transformation called myTemplate.ktr, so you reference it as ${Internal.Transformation.Filename.Directory}/myTemplate.ktr in your Metadata Injection step. When you run your datasource, the error message will say something like

Cannot read from file /opt/pentaho/biserver-ce/pentaho-solutions/public/myDashboard/files/kettle/myTemplate.ktr because it’s not a file.

So what seems to be happening is that Kettle is using your pentaho-solutions folder as the root folder and it just appends the JCR path to it. Which obviously fails, as there’s no physical counterpart in your filesystem for the JCR files.

But if you use the Pentaho Repository Synchronizer plugin to copy the contents of JCR to the filesystem and vice versa (which is perhaps the simplest way to keep your solution files under Git or SVN, at least until the Ivy GS plugin becomes fully functional), the synchronisation process will create a folder called repositorySynchronizer inside your pentaho-solutions folder and use it to store copies of every JCR file.

So, after synchronising JCR with the file system, your template transformation can be found in /opt/pentaho/biserver-ce/pentaho-solutions/repositorySynchronizer/public/myDashboard/files/kettle.

And now, the funny bit: to get it working, just create a symbolic link that removes the extra repositorySynchronizer element from the path: go to your pentaho-solutions folder and type ln -s repositorySynchronizer/public public.

Now the template transformation is found and you can reference it from a transformation called by a dashboard.

Credit for the discovery goes to Miguel Cunhal, our junior consultant, who had the tough task of getting our metadata injection transformation running inside a demo dashboard. But more on that in another post.