Reading/Writing files from a Kettle CDA datasource in Pentaho 5.x

Kettle datasources in CDA are the most versatile way of getting data from disparate systems into a dashboard or report. Both C-tools and PRD expose Kettle as a datasource, so we can call a transformation that queries several different systems (databases, web services, etc.), do some Kettle magic on the results and then ship the final resultset back to the server.

However… there’s no out-of-the-box way to reference sub-transformations in Pentaho 5.x. Solution files are now hosted in Jackrabbit (the JCR repository), not on the file system, and ${Internal.Transformation.Filename.Directory} just doesn’t work. If you have a transformation calling a sub-transformation and upload both to the repository, when you run it you get an error message saying that the file cannot be found.
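For reference, this is the kind of path you’d normally put in a Mapping (sub-transformation) step when both files sit in the same folder (mySubTransformation.ktr is just a placeholder name):

    ${Internal.Transformation.Filename.Directory}/mySubTransformation.ktr

With a file-based setup this resolves to the folder of the parent transformation; in 5.x it resolves to a JCR path that has no counterpart on disk, which is where the error comes from.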

Although this usually isn’t a problem, as most Kettle transformations used by dashboards are relatively simple, for more complex tasks you may want to use mappings (sub-transformations), use a Job or Transformation Executor step or, perhaps the best use case of all, use metadata injection. In these scenarios you _really_ need to be able to call ktr files from the BA server context, and as it happens you can’t.

Well, sort of. As it turns out, you can.

The problem: let’s say you have your server installed in /opt/pentaho/biserver-ce (or biserver-ee, we at Ubiquis don’t discriminate), and your Kettle file is in your repository in /public/myDashboard/files/kettle/myTransformation.ktr. You want to use metadata injection on a transformation called myTemplate.ktr, so you reference it as ${Internal.Transformation.Filename.Directory}/myTemplate.ktr in your Metadata Injection step. When you run your datasource, the error message will say something like

    Cannot read from file /opt/pentaho/biserver-ce/pentaho-solutions/public/myDashboard/files/kettle/myTemplate.ktr because it’s not a file.

So what seems to be happening is that Kettle uses your pentaho-solutions folder as the root folder and just appends the JCR path to it. This obviously fails, as the JCR files have no physical counterpart on your filesystem.
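You can see this for yourself: with the example paths above, the file Kettle tries to open simply isn’t there (a quick check, assuming the install location from the example):

    # Kettle builds this path: pentaho-solutions + the JCR path of the template
    ls /opt/pentaho/biserver-ce/pentaho-solutions/public/myDashboard/files/kettle/myTemplate.ktr
    # ls: cannot access ...: No such file or directory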

But if you use the Pentaho Repository Synchronizer plugin to copy the contents of JCR to the filesystem and vice-versa (perhaps the simplest way to keep your solution files under Git or SVN, at least until the Ivy GS plugin becomes fully functional), the synchronisation process creates a folder called repositorySynchronizer inside your pentaho-solutions folder and uses it to store a copy of every JCR file.

So, after synchronising JCR with the file system, your template transformation can be found in /opt/pentaho/biserver-ce/pentaho-solutions/repositorySynchronizer/public/myDashboard/files/kettle.
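Side by side, the two paths look like this (the repositorySynchronizer folder simply mirrors the JCR tree):

    # Where Kettle looks (nothing there):
    /opt/pentaho/biserver-ce/pentaho-solutions/public/myDashboard/files/kettle/myTemplate.ktr

    # Where the Repository Synchronizer actually wrote the copy:
    /opt/pentaho/biserver-ce/pentaho-solutions/repositorySynchronizer/public/myDashboard/files/kettle/myTemplate.ktr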

And now, the funny bit: to get it working, create a symbolic link that removes the extra repositorySynchronizer element from the path: go to your pentaho-solutions folder and type ln -s repositorySynchronizer/public public.
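In full, assuming the install location from the example above (a sketch; adjust the paths to your own setup before running anything):

    cd /opt/pentaho/biserver-ce/pentaho-solutions
    # Link /public inside pentaho-solutions to the synchronised copy,
    # so the path Kettle builds now points at real files:
    ln -s repositorySynchronizer/public public
    # Verify the template is now reachable where Kettle expects it:
    ls public/myDashboard/files/kettle/myTemplate.ktr

If your solution files live elsewhere in the repository (under /home, say), you’d presumably need an equivalent link for that folder too.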

Now the template transformation is found and you can reference it from a transformation called by a dashboard.

Credit for the discovery goes to Miguel Cunhal, our junior consultant, who had the tough task of getting our metadata injection transformation running inside a demo dashboard. But more on that in another post.


2 Comments

  • Andi says:

    Hi, so I am trying this out… but biserver does not seem to be able to access the data from the second ktr. The first ktr prepares the data to be injected and calls ktr2 and injects the data, but CDE does not seem to be able to obtain the data from ktr2. If I reference a Kettle Step in ktr1 I get the data returned but not when I reference any step in ktr2. Have you been able to solve this? Thank you so much, Andi

    • Nelson Sousa says:

      You don’t return the data from ktr2 directly to CDE. Instead, when you metadata inject ktr2 you set it to stream the data back to ktr1. All the data will then be available as the output stream of the Metadata Injection step in ktr1; you can put a Dummy step after it and use that step to take the data back to CDE.
