It’s finally here. After a few months of work, a lot of bugs fixed and a last minute refactor of all database connections, we are proud to release the first version of GEM, Generic ETL Machine.
It’s an ETL framework based on Pentaho Data Integration, to automate most of the common tasks that take time to develop, are essential to ensure a proper maintainable ETL, but that are hard to explain when the project’s stakeholders want to see their data. Things like enabling a solid logging mechanism with email alerts; data lineage to allow us to trace the data from the reports and dashboards, to the cube, then the fact table, then the staging tables, then the source files or databases, complete with line numbers, date the record was extracted, etc; a way (still very archaic) to define the order in which several tasks should run; a way to quickly deploy from the development laptop to a server and switch environments, from dev to test, to prod; and all this, of course, configurable.
This is still the first version and a lot of work is left to do. For the time being there’s no user interface to display the status of each ETL run or the results, all rollbacks must be performed manually in the database and there’s no way to schedule different tasks at different intervals. That’s all on the roadmap, time permitting, together with increasing the number of available samples (CSV files and web services to name a couple), support for PDI clusters, output and input directly into Hadoop in its multiple shapes and forms.
But it’s usable as it is (well, sort of – we only support a couple databases on each stage for the time being) and it’s been allowing us to spend a significant larger amount of time dealing with what really matters when we have to develop an ETL from scratch: the data analysis and the design of the data model. Not enabling file or DB logging, or setting up the email step for the 500th time.
Clone it, fork it, commit to it, enjoy!