PDI and XML: Beyond the native steps

Motivation

In day to day development we are sometimes faced with having to not only parse data delivered in XML, but a. lso generate complex XML files containing said data. It is nowadays common to exchange database extracts in XML format, and, in many other cases, results to API calls are provided in XML. Moreover, in many occasions one needs to construct either simple or complex XML, either to transmit data or to notify other systems that an action is required.

All of this requires a data engineer to create processes that receive, parse, and construct XML in near real time.

Parsing XML in PDI

Constructing a process to parse XML in PDI can be a frustrating ordeal. The developer creates a transformation to read XML using the native steps, optimises it for large throughput to a database only to discover that, once the source files increase in size, the real bottleneck is in the reading process. Moreover, when splitting the original file into several streams, issues with recursion or complex structures, memory and CPU resources increase almost exponentially.

To overcome this problem we suggest reading XML using the StAX parser step and processing its output. All of the parsing needs to be done manually, of course, but now all of the data is being streamed before being interpreted. We also suggest staging data using one or several “Serialize to file” steps before doing any merges or joins before finally saving to a database. Storing the data locally during processing is considerably faster than storing it in a staging database over the network, especially for large files.

Constructing XML

Creating an XML to represent a complex relational database structure can be a daunting task, especially when several hierarchical levels need to be represented. We proposed that the best way to do this is to template the output corresponding to the data in the leaves of the hierarchies.

This, in turn, will become the data to be represented by the higher levels. This recursion can be easily constructed using PDI transformations.

Ordering requirements, such as XML sequences, can easily be catered for and joins between parts of the XML are now reduced to a sorting of snippets according to a pre-defined priority.

Not only does this reduce the memory requirements when generating large XML files, it also significantly reduces the number of queries one makes to the database where the data is stored, thus considerably speeding up XML generation.

Conclusion

PDI is a generic tool that allows not only the integration of data sources but also its export. By its visual nature, it allows the creation of complex pipelines without forcing the developers to do any type of advanced coding. XML, however, has traditionally been in the domain of library utilization and custom code. We present a number of patterns that bring both the parsing and construction of complex and large XML files to the domain of visual programming used in PDI.

You can download the code and example data here.

1 Comment

  1. Dan Keeley

    There is actually a marginally better way of building XML in the javascript step – you can use E4X

    http://sandbox.kettle.be/wordpress/index.php/2009/01/19/e4x-xml-tweaking/

    That example isnt the best however – there’s actually a mode where you just write out the XML in the step without ” and concatenation, and it’s really easy to read!

Leave a Reply