Archive for “March, 2019”

PDI and XML: Beyond the native steps

Motivation

In day to day development we are sometimes faced with having to not only parse data delivered in XML, but a. lso generate complex XML files containing said data. It is nowadays common to exchange database extracts in XML format, and, in many other cases, results to API calls are provided in XML. Moreover, in many occasions one needs to construct either simple or complex XML, either to transmit data or to notify other systems that an action is required.

All of this requires a data engineer to create processes that receive, parse, and construct XML in near real time.

Parsing XML in PDI

Constructing a process to parse XML in PDI can be a frustrating ordeal. The developer creates a transformation to read XML using the native steps, optimises it for large throughput to a database only to discover that, once the source files increase in size, the real bottleneck is in the reading process. Moreover, when splitting the original file into several streams, issues with recursion or complex structures, memory and CPU resources increase almost exponentially.

To overcome this problem we suggest reading XML using the StAX parser step and processing its output. All of the parsing needs to be done manually, of course, but now all of the data is being streamed before being interpreted. We also suggest staging data using one or several “Serialize to file” steps before doing any merges or joins before finally saving to a database. Storing the data locally during processing is considerably faster than storing it in a staging database over the network, especially for large files.

Constructing XML

Creating an XML to represent a complex relational database structure can be a daunting task, especially when several hierarchical levels need to be represented. We proposed that the best way to do this is to template the output corresponding to the data in the leaves of the hierarchies.

This, in turn, will become the data to be represented by the higher levels. This recursion can be easily constructed using PDI transformations.

Ordering requirements, such as XML sequences, can easily be catered for and joins between parts of the XML are now reduced to a sorting of snippets according to a pre-defined priority.

Not only does this reduce the memory requirements when generating large XML files, it also significantly reduces the number of queries one makes to the database where the data is stored, thus considerably speeding up XML generation.

Conclusion

PDI is a generic tool that allows not only the integration of data sources but also its export. By its visual nature, it allows the creation of complex pipelines without forcing the developers to do any type of advanced coding. XML, however, has traditionally been in the domain of library utilization and custom code. We present a number of patterns that bring both the parsing and construction of complex and large XML files to the domain of visual programming used in PDI.

You can download the code and example data here.

Setup Pentaho BA Server to use SSL Certificates

SSL Certificate Java Truststore Setup

SSL, or Secure Socket Layer, is a technology which allows web browsers and web servers to communicate over a secured connection. This means that during our initial attempt to communicate with a web server it will present our web browser with a set of credentials, in the form of a “Certificate”, as proof the site is who and what it claims to be. In certain cases, the server may also request a Certificate from our web browser, asking for proof that we are who we claim to be.

This post will go through the process required to have our Pentaho Server using SSL certificates.

Create your SSL key/certificate

First thing to do is to create the SSL certificate key for our tomcat server to use. And then we will need to tell our tomcat server to allow HTTPS connections. To do this in the Pentaho Server we edit a file named “server.xml”, referenced here:

And here :

For this exercise we will be using the Java KeyStore (JKS). A keystore manages the provision of the client private keys/certificates we use when we try to access a server, and a truststore (cacerts in Java) manages the verification of those keys/certificates. 

In order to do create the key we need to create a new JKS keystore, containing a single self-signed key/certificate.  Execute the following from a terminal command line:

$ keygen -genkey -alias tomcat -keyalg RSA

This command will create a new file, in the home directory of our user named “.keystore”. To specify a different location or filename, add the -keystore parameter, followed by the complete path to our keystore file. We will also need to reflect this new location in the server.xml configuration file.

        1. First we are prompted for a password for the keystore and later for a password for the actual key. (Strangely in older tomcat versions they need to be the same) This keystore password will have to be given to our server application so it can access our server side keys. Note – The default password used by Tomcat is “changeit” (all lower case).
        2. After deciding on a password, a number of questions need to be answered as to certify the authenticity of our server.
        3. Finally we are prompted for the password for the actual tomcat alias key. If any issues come up in this step, check if you need to have the same password as for the datastore.

Ok, we have a keystore where Java can find certificate keys to show servers when prompted, with one key with the alias tomcat. Now we need to tell Java to use this key in their trustore. We export the key to a *.cer file and go tell Java that he should incorporate this key in cacerts, Javas’ truststore.

$ keytool -export -alias tomcat -file tomcat.cer -storepass <password> -keypass <password> -keystore .keystore
Certificate stored in file <tomcat.cer>
$ cd $PENTAHO_JAVA_HOME/jre/lib/security/
$ keytool -import -alias tomcat -file ~/tomcat.cer -keystore cacerts -storepass changeit

Configure Tomcat for HTTPS

Now we need to tell Pentaho 8 we are ready to use HTTPS using our key. To do so we need to change our server.xml file in the tomcat folder: tomcat/conf/server.xml.

There we need to uncomment the connector for HTTPS SSL protocol in port 8443 and add the credentials  for the keystore location and password:

<Connector URIEncoding="UTF-8" port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
keystoreFile="<pentaho_user_home>/.keystore" keystorePass=<password>
clientAuth="false" sslProtocol="TLS" />

This configures a new connector at port 8443 using https, however we do not close the http connector at port 8080 yet. This allows both to be used, at the same time (we may close it later if we want to prevent HTTP connections). Do be careful as if you try to login on both and your browser executes cached code it can create a login error. Using a private browsing window helps avoid this issue.

Also we need to tell Pentaho 8 we now have a new port in the server.properties file: pentaho-solutions/system/server.properties

fully-qualified-server-url=http://localhost:8443/pentaho/

And this should be it. We should have gotten our server to respond in both port 8080 using HTTP and 8443 using HTTPS. Keep in mind what I mentioned regarding the browser cache, so if we simply use a private browsing window all should be ok.