Neo4j Northwind Traders Database Schema

Graph Databases: Loading Data with Neo4j

Graph databases are becoming more popular as a way of storing and analysing large connected datasets.

Neo4j is a popular Graph DBMS because of its powerful query language, Cypher, its growing community and its excellent supporting tools.

A new paradigm comes with a new set of challenges. In this case we are focused on the challenge of creating a data pipeline to load data into Neo4j, thinking about how we might design our schema and how we might query it.

Today, we’ll take you through an ETL of The Northwind Traders Sample Database and some of the things we can do with Cypher that make it special and worth a look.

We’ll be using Kettle to orchestrate our ETL and making use of the Neo4j Output Plugin to help us interact with our Neo4j server. We’ll also load the same dataset into PostgreSQL, to see how the two technologies compare.


Environment Setup

The installation of Kettle, the Neo4j Output Plugin and the Neo4j server is outside the scope of this post, but it is important to note that the Neo4j connection is not stored in the transformations as connections usually are. It’s stored in the metastore, which is found in your home folder. For that reason, you’ll have to create a new connection before you can load data into Neo4j.

Create connection is available in the Neo4j menu of Kettle
Neo4j Connection dialog allows you to add a connection to your metastore

Once you can test your connection you are good to go; just make sure the connection name is local-hardcoded so that you don’t need to change any of the transformations provided at the end of the post.

To set up your PostgreSQL connection you’ll need to edit the conf/kettle.properties file with the correct details for your connection. This will change the connection for each of the transformations and the main job, as we have parameterised the connection for your convenience. If you do not have a password you can remove everything after DATABASE_PASS from the file and it will work fine.

With that out of the way, let’s get started with Northwind.


Northwind Traders Database

The Northwind Traders database is a sample database that comes with Microsoft Access. The database contains sales data for a fictitious company called Northwind Traders.

As you can see, it represents an OLTP model focused on orders and their part orders.

We will be using a version of Northwind compiled by Neo4j as csv files, available on GitHub, which is more or less the same as the original and the image above.


Loading Data

The ETL we have created is relatively simple; there is a single job, RUN.kjb, that runs, in sequence, the transformations that load nodes and relationships into Neo4j from the csv files.

Similarly, there is a single job for loading into Postgres; one important difference is that for Postgres we must create our dimension and fact schemas beforehand using SQL create statements. In Neo4j this is not necessary because graph databases are schema-less.

We’ll be using the Neo4j Output step to create nodes and relationships and the Neo4j Cypher step for lookups.

The sequence of steps below shows a common pattern to ensure that poorly delimited and enclosed csv files produce the expected columns and rows:

Handling poorly delimited files in Kettle

Address Nodes

Addresses occur often within Northwind: suppliers, customers, shipment receivers and employees all have addresses. For that reason we have tried to create a generic way to link an address to an actor, one that allows for performant geographic queries.

Rather than have addresses and their components (country, region, city, postcode, street and building name) stored as properties on each customer, supplier, etc., each address is stored as its own node. Each address node is linked to a city node, each city to a region and so on; this creates a hierarchy, using composite keys to make sure the same region isn’t linked to two countries. The result of this is easier indexing, which speeds up queries because searches on nodes are faster than searches on properties.
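As an illustration of those composite keys, a region’s key can include its country and a city’s key its region, so that two regions that happen to share a name in different countries resolve to different nodes. The field names below are assumptions, not necessarily the exact ones used in the transformation:

// Illustrative composite keys built in a Modified JavaScript value step;
// field names are assumptions.
var regionKey = country + "|" + region;   // the same region name under two countries yields two keys
var cityKey   = regionKey + "|" + city;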

Producing nodes and relationships from a table of addresses in Kettle

After we have collected all the addresses from all the files, we remove duplicates and replace any null regions (some countries have only one region, which is null) with the country name. We create the region nodes and link them to their appropriate country node, do the same for cities -> regions and addresses -> city nodes. 

We now have a more efficient way to query locations and slice our data. 

Example address hierarchy in Neo4j Browser

Relational Addresses

In Postgres we need to do little to get addresses into our dimensions. If it wasn’t for other complications such as poorly delimited and enclosed csv files and replacing null regions it would be as simple as table input -> table output. This is because each address is stored in the same dimension as the other information for that actor, i.e. customer addresses are stored in the customer dimension.


Date Nodes and Date Dimension

Creating date nodes in Kettle

Creating date nodes is identical to any date dimension that you have created in the past except that you are creating nodes instead of rows. The result is a node with many properties that allow you to query in a variety of ways without having to do on the fly date operations:

Date node properties in Neo4j Browser

In PostgreSQL, we populate our date dimension with the same fields as in Neo4j but each date is a row not a node.


Order Nodes and Relationships 

In our graph, order nodes are the most connected nodes; you could compare this node to a fact table that contains no additive fields (also known as a factless fact table).

Joining orders.csv and order-details.csv in Kettle

We start by joining order-details to orders so that we can create all the nodes and relationships we want in one go.

Date and shipment receiver node lookups in Kettle

Next, we do some lookups on previously created nodes so that we can link the order nodes to date nodes and shipment receivers.

Creating order nodes and the relationship between products before grouping in Kettle

Calculate totals from unitPrice, discount and quantity; this reduces query time because values are precalculated. Then create our order nodes and their relationships to products (remembering that the CONTAINS_PRODUCT relationship uses part orders coming from order-details.csv).
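The exact arithmetic isn’t shown here, but with the standard Northwind fields (discount stored as a fraction in order-details.csv) the precalculation amounts to something like this:

// Illustrative calculation; the transformation may use a Calculator step
// rather than JavaScript. discount is a fraction, e.g. 0.15 for 15%.
var grossAmount    = unitPrice * quantity;
var discountAmount = grossAmount * discount;
var netAmount      = grossAmount - discountAmount;  // stored on CONTAINS_PRODUCT as netAmount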

Define the CONTAINS_PRODUCT relationship and its properties in Kettle

All the additive fields apart from freight are stored in our CONTAINS_PRODUCT relationship between Order and Product nodes. This is the most logical location to store these properties unless we wanted to create a Part Order which would only increase traversal and reduce performance of queries.

Creating relationships between orders and many other nodes in Kettle

Finally, we create all the relationships between our order nodes and the other nodes we created before using a sequence of Neo4j Output steps.

Define the SHIPPED_TO relationship and its properties in Kettle

It’s worth noting that we store our freight costs, another additive field, in the relationship between the Order and ShipmentReceiver: SHIPPED_TO. This allows us to maintain additivity without introducing complications surrounding the freight field as you will see later on.

The schema we have created looks like this:

“call db.schema()” in Neo4j Browser

Part Fact Orders

In PostgreSQL, we still join orders to order details to get part orders; however, we must isolate a single freight value for each order so that freight is additive. Some of the options here are:

  • Split the freight evenly between the products contained in an order.
    • This is misleading as packaging and shipping costs are usually dependent on the size and/or weight of the package, so we want to avoid splitting evenly.
  • Split the freight proportionately between the products contained in an order.
    • This is the ideal scenario but it is not possible because we do not have weight or size information for the products, unfortunately.
  • Store freight with only a single part order.
    • This is the compromise we chose as it maintains the additivity of the freight field and is less misleading.

As you can see, all options mean we cannot do analysis of freight costs per product. 

We accomplish this using a changing sequence and a JavaScript calculation, as sketched below.

Remove freight from all but the first row of an order in Kettle
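A sketch of the idea, assuming an “Add value fields changing sequence” step upstream that restarts a counter at 1 for every new order (the field names are illustrative, not the exact ones in the transformation):

// part_order_seq restarts at 1 for each order, so the freight value
// is kept only on the first part order of that order.
if (part_order_seq > 1) {
  freight = 0;
}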

Relational data warehouses depend heavily on surrogate keys to join facts to dimensions. For each dimension we have created a sequence for this purpose. 

When we create the fact table we lookup these sequences so we can add them to the fact table. This is a distinct difference between Neo4j and relational databases as Neo4j manages its own keys to identify which relationships are connected to a given node.

Dimension lookups in Kettle

Finally, before loading to the table we create a sequence to be the primary key for the fact table.

Add a primary key sequence before table output in Kettle

Querying

Let’s look at how we can query our newly created databases.

Value of sales for each year from customers in the USA

In Cypher:

MATCH (p:Product)<-[r]-(o:Order)--(:Customer)--()--()--()--(c:Country)
WHERE toLower(c.country) = "usa"
WITH o AS order, c.country AS country, r AS rel
MATCH (order)-[:ORDERED_ON_DATE]->(d)
RETURN 
  country, d.calendarYear AS year
, count(DISTINCT order) AS number_of_orders
, apoc.number.format(sum(rel.netAmount), '$#,##0.00', 'en') AS 
  value_in_dollars
ORDER BY year ASC

In PostgreSQL:

SELECT
  customers.country AS country
, dates.calendar_year AS year
, count(DISTINCT orders.order_nk) AS number_of_orders
, cast(sum(orders.net_amount) AS money) AS value_in_dollars
FROM
  public.fact_part_orders AS orders
, public.dim_customers AS customers
, public.dim_date AS dates
WHERE orders.customer_id = customers.customer_id
AND   orders.order_date_id = dates.date_id
AND   lower(customers.country) = 'usa'
GROUP BY country, year
order BY year ASC;

We can refactor our schema to include direct relationships between orders and cities, orders and regions, and orders and countries, giving us a quicker way to retrieve the same results. After doing this, the matching pattern changes from (p:Product)<-[r]-(o:Order)--(:Customer)--()--()--()--(c:Country) to (p:Product)<-[r]-(o:Order)--(c:Country), and the performance boost is significant as there are fewer hops to traverse and fewer searches to complete.

Products most likely to be bought together

In Cypher:

MATCH 
  p=(original:Product)--(:Order)--(related:Product)
WHERE
  toLower(original.productName) = "teatime chocolate biscuits"
RETURN
  DISTINCT original.productName AS product
, related.productName AS most_likely_to_be_bought_with
, count(p) AS popularity 
ORDER BY
  popularity DESC
, most_likely_to_be_bought_with DESC
LIMIT 5

In PostgreSQL:

SELECT
  original.product_name AS product
, related.product_name AS most_likely_to_be_bought_with
, count(r_orders.order_nk) AS popularity
FROM
  public.dim_products AS original
, public.dim_products AS related
, public.fact_orders AS o_orders
, public.fact_orders AS r_orders
WHERE 
    original.product_id = o_orders.product_id
AND o_orders.order_nk =  r_orders.order_nk
AND r_orders.product_id = related.product_id
AND lower(original.product_name) = 'teatime chocolate biscuits'
AND lower(related.product_name) <> 'teatime chocolate biscuits'
GROUP BY
  original.product_name
, related.product_name
ORDER BY
  popularity DESC
, most_likely_to_be_bought_with DESC
LIMIT 5;

Isn’t that a mouthful.

This type of query has become common in online shopping; the shop will recommend products based on what you are looking at or what you have in your cart.

As you can see, the Cypher version of the products-most-likely-to-be-bought-together query is far more compact. Importantly, this makes querying far less error prone; accidentally running a cross join because you forgot a join condition can go unnoticed and be very costly.

In SQL, fewer joins generally lead to better performance, especially when your fact table has several billion rows (or you’re joining the fact table to itself, as we are here). Neo4j does not have the concept of joins because there are no tables. Graph queries are easier to write, read and modify, which is why recommendation queries work well in graph databases.


Improvement Strategies 

This schema is a good start and allows us to think about how to use Neo4j to analyse our data. In loading the data into Neo4j, we have come up with new ideas that have not yet been implemented.

Firstly, aggregation nodes could be a useful way to query old data quickly by storing pre-calculated values for later; these nodes play the same role as aggregation tables do in a relational database. In a schema-less model, we can add new nodes easily without building new tables, making aggregation a valuable strategy.

The simplest version of this is to calculate the total value of an order and store it on the order node; this should improve query time.

We can create these nodes on several aggregation levels, e.g. Yearly Sales, Monthly Sales, Daily Sales, etc.

There is also a possibility of creating geographic aggregation nodes, e.g. USA Sales, London Sales etc.

Separating date nodes into year, month and day nodes is another strategy; this should allow performant querying for specific years and months as a search through all properties of date nodes is not necessary.

Finally, creating a LinkedList between nodes of the same type may prove to be valuable. For example, (:Year)-[:NEXT_YEAR]->(:Year) allows you to compare one year’s sales to the previous year’s sales; the same can be done for previous and next month or previous and next day. Thus we can make use of reduced hop traversal to improve query performance when interested in sequences of dates. This is quite difficult to implement in a relational model, as each comparison to a previous or future period requires an additional column on the date dimension.


Conclusion

  • Cypher queries are less error-prone because it’s more difficult to miss join conditions when SQL-style FROM and JOIN are expressed through a single pattern.
  • Kettle has a nice plugin to visualise and perform your output to Neo4j.
  • We can optimise our graph for a number of different queries without impacting overall performance.
  • The flexible, schema-less nature means changes can be made without refactoring the whole ETL.
  • Graphs produce efficient recommendation queries.
  • There are many improvements yet to explore.

Next Steps

  • Compare performance on large, connected datasets between relational and graph databases.
  • Load and query databases built from the ground up for connected use cases: social media, map navigation, city planning.
  • Explore hybrid schemas (relational when needed, graph when appropriate) with a virtualisation layer.
  • Optimise Date nodes for different use cases.

Where to get the code

The Neo4j ETL can be downloaded here: loading_northwind_neo4j.zip

The PostgreSQL ETL can be downloaded here: loading_northwind_postgres.zip

Date Dimension Revisited: Bank Holidays

Everyone familiar with data warehousing knows about the date dimension. It’s one of the first steps in the creation of an ETL and exists in almost every data warehouse.

Despite how ubiquitous it is, many still fall for some common pitfalls such as:

  • Weeks shouldn’t be children of months
  • Missing or wrong week of year

These and other common pitfalls are described in a previous post: http://ubiquis.co.uk/dwh/date-dimensions/

Another commonly faced pitfall is that the date dimension often requires a field containing the relevant national public holidays to make it easy to correlate with an increase or decrease in sales. However it is not trivial to add these to your date dimension.

Here, we tackle this issue by creating an ETL that demonstrates the calculation of public holidays for France with particular focus on Easter and related holidays without neglecting fixed date holidays.

What’s a Holiday?

For the purpose of this exercise, we divide holidays into three categories: fixed date, variable date and Easter based date holidays. We consider fixed date holidays to be those that occur on the same date every year; variable dates are those that do not have a set date, like Thanksgiving, which occurs on the 4th Thursday in November; Easter based dates are observed a fixed number of days before or after Easter.

Fixed Date Holidays

A naive approach to generating fixed date holidays in PDI uses a Modified JavaScript value step containing a condition to set the holiday field value if the day and month match, e.g.,

is_holiday = 0;
holiday_description = "";
if (month_number == 12 && day_of_month == 25) {
  is_holiday = 1;
  holiday_description = "Christmas Day";
}

This is good if you only have a few fixed holidays but the code quickly becomes unwieldy as their number grows. Ideally this will be generalised into a single condition using variables for month_number and day_of_month.
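A minimal sketch of that generalisation, using a lookup object keyed by month and day (the key format and the holiday list are illustrative):

// Generalised fixed date holidays; add entries instead of adding if-blocks.
var fixedHolidays = {
  "1-1":   "New Year's Day",
  "5-1":   "Labour Day",
  "12-25": "Christmas Day"
};

var is_holiday = 0;
var holiday_description = "";
var key = month_number + "-" + day_of_month;
if (fixedHolidays[key] != null) {
  is_holiday = 1;
  holiday_description = fixedHolidays[key];
}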

Variable Date Holidays

Holidays defined as “the first Monday in May”, as is the case with May Day Bank Holiday in the UK, can be calculated in a similar way to fixed date holidays with an added condition week_of_month == 1.

Memorial Day, which occurs on the last Monday in May, will require you to calculate whether it is the last week of the month; in JavaScript it may look like this:

if (month_number == 5 && day_of_month >= 25 && day_of_week == 1) {
  is_holiday = 1;
  holiday_description = "Memorial Day";
}

This only works for months with 31 days; 30-day months require day_of_month >= 24 instead, and some tweaking would also be necessary in February, which doesn’t have a fixed number of days.
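A sketch of a month-length-independent check: the last Monday always falls within the last seven days of the month. Here days_in_month is a hypothetical helper field, not one from the original transformation:

// day_of_week == 1 follows the convention used in the snippet above.
if (month_number == 5 && day_of_week == 1 && day_of_month > days_in_month - 7) {
  is_holiday = 1;
  holiday_description = "Memorial Day";
}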

Easter based holidays

Some holidays rely on the day that Easter falls in a given year, e.g. Good Friday, Easter Monday, Whit Sunday.

The complication with calculating these dates is that Easter is not a fixed date.

Easter falls on the Sunday following the full moon on or after the northern hemisphere spring (vernal) equinox. However, the vernal equinox and the full moon are not determined by astronomical observation. The vernal equinox is fixed to fall on 21 March.

Luckily there is a calculation for Easter Sunday called Computus (Latin for “computation”). The name has been used for this procedure since the early Middle Ages, as it was considered the most important computation of the age. We used the version found on Wikipedia: https://en.wikipedia.org/wiki/Computus#Gauss’_Easter_algorithm
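A sketch of Gauss’s Easter algorithm as it could be written in a Modified JavaScript value step; year is assumed to be a field on the stream, and the month returned runs from 1 to 12 (see the note on 0-based months in the Common Pitfalls section below):

// Gauss's Easter algorithm; returns Easter Sunday as { month: 1-12, day: 1-31 }.
function gaussEaster(year) {
  var a = year % 19;
  var b = year % 4;
  var c = year % 7;
  var k = Math.floor(year / 100);
  var p = Math.floor((13 + 8 * k) / 25);
  var q = Math.floor(k / 4);
  var M = (15 - p + k - q) % 30;
  var N = (4 + k - q) % 7;
  var d = (19 * a + M) % 30;
  var e = (2 * b + 4 * c + 6 * d + N) % 7;
  var month = 3;
  var day = 22 + d + e;              // counted from 1 March
  if (day > 31) {                    // Easter falls in April
    month = 4;
    day = d + e - 9;
  }
  // Exceptions defined by the algorithm:
  if (d == 29 && e == 6) {                                     // 26 April -> 19 April
    month = 4; day = 19;
  } else if (d == 28 && e == 6 && (11 * M + 11) % 30 < 19) {   // 25 April -> 18 April
    month = 4; day = 18;
  }
  return { month: month, day: day };
}

var easter = gaussEaster(year);   // e.g. for year = 2018: month 4, day 1 (1 April)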

Solution

We have produced a generic solution to calculate fixed date and Easter based holidays for France. We have not included variable date holidays as there are none in the French calendar.

Additionally, to get a better understanding of how different techniques could be implemented and how they perform, we created a transformation using both a Modified JavaScript value and a User defined Java class in PDI.

The following is the general algorithm we wrote for the purpose of generating a field containing the holidays for our date dimension (a simplified sketch follows the list):

  1. On the first day of the year
    1. Store all of the fixed date holiday descriptions and their dates in a dedicated object for later reuse.
    2. Calculate the date that Easter will fall on and store it for later use.
    3. As we are interested in Easter Monday, rather than Easter Sunday, we add a day to the previously calculated date and store it in our object.
  2. If the current date is in our object, set the flag is_bank_holiday to 1, bank_holiday_desc_en to the name of that holiday in English and bank_holiday_desc_fr to the name of that holiday in French.
  3. Output the three new fields: is_bank_holiday, bank_holiday_desc_en, bank_holiday_desc_fr
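To make this concrete, here is a simplified sketch of the per-row logic, reusing the gaussEaster function sketched earlier; the field names, key format and holiday descriptions are illustrative rather than the exact ones in the transformation:

if (typeof holidays == "undefined" || day_of_year == 1) {
  // (Re)build the lookup once per year; "holidays" persists between rows.
  holidays = {};
  holidays["1-1"] = ["New Year's Day", "Jour de l'an"];
  holidays["5-1"] = ["Labour Day", "Fête du Travail"];
  holidays["5-8"] = ["Victory Day 1945", "Victoire 1945"];
  // ... remaining fixed dates ...

  var easter = gaussEaster(year);
  // JavaScript Date months run 0-11, hence the -1; the Date constructor
  // rolls any day overflow into the next month for us.
  var easterMonday = new Date(year, easter.month - 1, easter.day + 1);
  holidays[(easterMonday.getMonth() + 1) + "-" + easterMonday.getDate()] =
    ["Easter Monday", "Lundi de Pâques"];
  // ... Ascension (Easter + 39 days), Whit Monday (Easter + 50 days), etc. ...
}

var is_bank_holiday = 0;
var bank_holiday_desc_en = "";
var bank_holiday_desc_fr = "";
var key = month_number + "-" + day_of_month;
if (holidays[key] != null) {
  is_bank_holiday = 1;
  bank_holiday_desc_en = holidays[key][0];
  bank_holiday_desc_fr = holidays[key][1];
}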

The Result

The algorithm stated above has been introduced to our standard date dimension to give a transformation that looks like:

The effect of calculating holidays as we have is shown in the table snippet below (some columns have been omitted and others have been renamed to make it easier to see the result).

From the table, we can see that Good Friday has been successfully calculated as two days before Easter Sunday and Easter Monday occurs the day after Easter, as expected. 

The fixed date holidays, Labour Day and Victory Day 1945, are also present and correct.

Common Pitfalls

In the Java and JavaScript date APIs, month numbers do not go from 1 to 12 like the output of Gauss’s Easter algorithm does. Instead the months run from 0 to 11, where 0 is January and 11 is December. You’ll need to add or subtract 1 to convert from or to Java / JavaScript dates.

What Next

If two holidays coincide, only the last occurrence will be written to the stream. One solution to this would be to concatenate the strings of holidays that occur on the same day.
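One way to do that, sketched against the holidays lookup object used earlier; addHoliday is a hypothetical helper, not part of the original transformation:

// Concatenate descriptions instead of overwriting when two holidays share a date.
function addHoliday(holidays, key, descEn, descFr) {
  if (holidays[key] != null) {
    holidays[key] = [holidays[key][0] + " / " + descEn,
                     holidays[key][1] + " / " + descFr];
  } else {
    holidays[key] = [descEn, descFr];
  }
}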

If you want to include holidays for multiple countries you’ll need an additional column for each country to avoid confusing days marked as holidays in France when analysing sales data from Germany, for example.

If you need many countries, you’re probably better off having a holidays dimension that accounts for all countries and has a key on the fact table.

In some countries, like the UK, if a bank holiday is on a weekend, a ‘substitute’ weekday becomes a bank holiday, normally the following Monday. Accounting for this would be valuable in the UK and USA.

Where Can I Get the Transformation?

Our implementation of a date dimension with Easter based holidays and fixed date holidays is available to download as a PDI transformation here: dim_date_holidays.ktr

PDI and XML: Beyond the native steps

Motivation

In day to day development we are sometimes faced with having to not only parse data delivered in XML, but also generate complex XML files containing said data. It is nowadays common to exchange database extracts in XML format and, in many other cases, results of API calls are provided in XML. Moreover, on many occasions one needs to construct either simple or complex XML, either to transmit data or to notify other systems that an action is required.

All of this requires a data engineer to create processes that receive, parse, and construct XML in near real time.

Parsing XML in PDI

Constructing a process to parse XML in PDI can be a frustrating ordeal. The developer creates a transformation to read XML using the native steps and optimises it for large throughput to a database, only to discover that, once the source files increase in size, the real bottleneck is in the reading process. Moreover, when splitting the original file into several streams, or when dealing with recursion or complex structures, memory and CPU usage increase almost exponentially.

To overcome this problem we suggest reading XML using the StAX parser step and processing its output. All of the parsing needs to be done manually, of course, but now all of the data is being streamed before being interpreted. We also suggest staging data using one or several “Serialize to file” steps before doing any merges or joins and before finally saving to a database. Storing the data locally during processing is considerably faster than storing it in a staging database over the network, especially for large files.

Constructing XML

Creating an XML file to represent a complex relational database structure can be a daunting task, especially when several hierarchical levels need to be represented. We propose that the best way to do this is to template the output corresponding to the data in the leaves of the hierarchies.

This, in turn, will become the data to be represented by the higher levels. This recursion can be easily constructed using PDI transformations.

Ordering requirements, such as XML sequences, can easily be catered for and joins between parts of the XML are now reduced to a sorting of snippets according to a pre-defined priority.

Not only does this reduce the memory requirements when generating large XML files, it also significantly reduces the number of queries one makes to the database where the data is stored, thus considerably speeding up XML generation.

Conclusion

PDI is a generic tool that allows not only the integration of data sources but also its export. By its visual nature, it allows the creation of complex pipelines without forcing the developers to do any type of advanced coding. XML, however, has traditionally been in the domain of library utilization and custom code. We present a number of patterns that bring both the parsing and construction of complex and large XML files to the domain of visual programming used in PDI.

You can download the code and example data here.

Setup Pentaho BA Server to use SSL Certificates

SSL Certificate Java Truststore Setup

SSL, or Secure Socket Layer, is a technology which allows web browsers and web servers to communicate over a secured connection. This means that during our initial attempt to communicate with a web server it will present our web browser with a set of credentials, in the form of a “Certificate”, as proof the site is who and what it claims to be. In certain cases, the server may also request a Certificate from our web browser, asking for proof that we are who we claim to be.

This post will go through the process required to have our Pentaho Server use SSL certificates.

Create your SSL key/certificate

First thing to do is to create the SSL certificate key for our Tomcat server to use, and then tell Tomcat to allow HTTPS connections. To do this in the Pentaho Server we edit a file named “server.xml”, as described below.

For this exercise we will be using the Java KeyStore (JKS). A keystore holds the private keys/certificates we present to prove our identity, and a truststore (cacerts in Java) holds the certificates we trust when verifying the keys/certificates presented by the other side.

In order to create the key we need to create a new JKS keystore, containing a single self-signed key/certificate. Execute the following from a terminal command line:

$ keytool -genkey -alias tomcat -keyalg RSA

This command will create a new file, in the home directory of our user named “.keystore”. To specify a different location or filename, add the -keystore parameter, followed by the complete path to our keystore file. We will also need to reflect this new location in the server.xml configuration file.

  1. First we are prompted for a password for the keystore and later for a password for the actual key (strangely, in older Tomcat versions they need to be the same). This keystore password will have to be given to our server application so it can access our server side keys. Note – the default password used by Tomcat is “changeit” (all lower case).
  2. After deciding on a password, a number of questions need to be answered to certify the authenticity of our server.
  3. Finally we are prompted for the password for the actual tomcat alias key. If any issues come up in this step, check if you need to use the same password as for the keystore.

Ok, we now have a keystore where Java can find certificate keys to present when prompted, with one key under the alias tomcat. Next we need to tell Java to trust this key. We export the key to a *.cer file and import it into cacerts, Java’s truststore.

$ keytool -export -alias tomcat -file tomcat.cer -storepass <password> -keypass <password> -keystore .keystore
Certificate stored in file <tomcat.cer>
$ cd $PENTAHO_JAVA_HOME/jre/lib/security/
$ keytool -import -alias tomcat -file ~/tomcat.cer -keystore cacerts -storepass changeit

Configure Tomcat for HTTPS

Now we need to tell Pentaho 8 we are ready to use HTTPS using our key. To do so we need to change our server.xml file in the tomcat folder: tomcat/conf/server.xml.

There we need to uncomment the connector for HTTPS SSL protocol in port 8443 and add the credentials  for the keystore location and password:

<Connector URIEncoding="UTF-8" port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
keystoreFile="<pentaho_user_home>/.keystore" keystorePass="<password>"
clientAuth="false" sslProtocol="TLS" />

This configures a new connector on port 8443 using HTTPS; however, we do not close the HTTP connector on port 8080 yet. This allows both to be used at the same time (we may close it later if we want to prevent HTTP connections). Do be careful: if you try to log in on both and your browser executes cached code it can create a login error. Using a private browsing window helps avoid this issue.

Also we need to tell Pentaho 8 we now have a new port in the server.properties file: pentaho-solutions/system/server.properties

fully-qualified-server-url=https://localhost:8443/pentaho/

And this should be it: our server should now respond on both port 8080 using HTTP and port 8443 using HTTPS. Keep in mind what I mentioned regarding the browser cache; if we simply use a private browsing window all should be ok.

Securing a Pentaho Server

How to customise security and access

This post will show you how you can restrict the access to your Pentaho BA Server, essentially creating a “guest” user role, and cherry picking what that user has access to.

Spring Security – Becoming very selective

We want to restrict the access of visitors to only specific demos, rather than the default permissions granted to the Authenticated users.

In order to do this we first need to adjust our BA Server’s Administration->User/Roles settings. What we want to achieve is a specific role for all company users, and a set of separate roles for visitors.

First we create a role for company users with almost all privileges, and remove all other user roles. After this is done we will need to restrict the privileges of Authenticated users. We restrict their access to a minimum so we only allow them to read and execute content.

Now we add a guest user, without an associated role, which means it will behave as an Authenticated user. Once this is done we need to hard-code how access to web services is given to internal roles. Our specific goal in this post will be to allow a guest user access to a single demo dashboard only.

Opening Spring Framework’s “pentaho-solutions/system/applicationContext-spring-security.xml”, you will see that this file contains the configuration of the security filters for various services. Specifically, we will be working on the “filterInvocationInterceptorForWS” and “filterInvocationInterceptor” beans, which grant specific user roles access to web services addressed through URIs. Each of these beans lists a set of patterns and the roles that have access to them. The first step is to restrict all access to company users only. Then we add patterns that will be regex-matched to cherry-picked dashboards and allow any Authenticated user access.

By default this is what filterInvocationInterceptor looks like:

<bean id="filterInvocationInterceptor" class="org.springframework.security.web.access.intercept.FilterSecurityInterceptor">
<sec:filter-security-metadata-source request-matcher="ciRegex" use-expressions="false">
<sec:intercept-url pattern="\A/content/common-ui/resources/web/(.+/)*.+\.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/.*require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/.*require-js-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/content/common-ui/resources/web/require.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/content/common-ui/resources/web/require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/content/data-access/resources/gwt/.*css\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webcontext.js.*\Z" access="Anonymous,Authenticated" />
(...)

A service granted to both Authenticated and Anonymous is a service required for any access (including prior to logging in); we must allow those to remain open as they are essential. We therefore focus on patterns that are granted only to Authenticated users and edit the “access” attribute to list only our roles “User” and “Admin”.

<sec:intercept-url pattern="\A/mantle/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/.*\Z" access="User,Admin" />

This security filter manages only basic BA Server components, such as the login services; however, it does not filter access to the home screen. This means a user that is Authenticated has no access to mantle but will still be shown an empty homepage. In order to avoid this we add a pattern for /home, also restricted to our roles “User” and “Admin”. It is important to note that these security beans are applied from top to bottom, which means order matters very much. We add our new pattern immediately after the mantle one. A snippet should look like the one below:

<sec:intercept-url pattern="\A/mantlelogin/.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/mantle/mantleloginservice/*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/mantle/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/home.*\Z" access="User,Admin"/>

<sec:intercept-url pattern="\A/welcome/.*\Z" access="Anonymous, Authenticated" />
<sec:intercept-url pattern="\A/public/.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/login.*\Z" access="Anonymous,Authenticated" />

Now we turn to the filterInvocationInterceptorForWS bean and follow the same first step: restrict access to all patterns granted to Authenticated users, changing them to “User,Admin” only.

<sec:filter-security-metadata-source request-matcher="ciRegex" use-expressions="false">
<sec:intercept-url pattern="\A/webservices/unifiedrepository\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userrolelistservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userroleservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/authorizationpolicy\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/rolebindingdao\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/scheduler\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/repositorysync\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/datasourcemgmtservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/.*require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*require-js-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/.*\Z" access="User,Admin" />

So at this stage we have restricted access to almost all components and services to our company users via the role “User”, and we are now able to cherry-pick the exact services we want to open. In order to allow visitors access to our dashboard and nothing else, we add a URI pattern and grant access to “Authenticated”. We add it at the TOP, which means it is the first filter applied and therefore takes precedence over the more restrictive patterns below it.

<sec:intercept-url pattern="\A/api/repos/:public:ubiquis:world_population:world_population.wcdf/generatedContent.*\Z" access="Authenticated"/>

At this stage, if you try to open the URI the dashboard will fail to render. This happens because the visitor is trying to render the dashboard without access to any other services. Through trial and error, use the browser console to check which other services are required to render that particular dashboard; once you identify them, open those services too.

<sec:intercept-url pattern="\A/api/repos/:public:ubiquis:world_population:world_population.wcdf/generatedContent.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/api/repos/:public:ubiquis:world_population:.*\.png*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/plugin/cda/api/doQuery\?.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/api/repos/pentaho-cdf/.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/api/repos/pentaho-cdf-dd/.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/plugin/pentaho-cdf-dd/api/resources/public/ubiquis/.*\Z" access="Authenticated"/>
<sec:intercept-url pattern="\A/webservices/unifiedrepository\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userrolelistservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/userroleservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/authorizationpolicy\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/rolebindingdao\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/scheduler\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/repositorysync\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/datasourcemgmtservice\?wsdl.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/webservices/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/plugin/reporting/api/jobs/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="OPTIONS" />
<sec:intercept-url pattern="\A/api/repos/.*\Z" access="Anonymous,Authenticated" method="HEAD" />
<sec:intercept-url pattern="\A/api/.*require-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*require-js-cfg.js.*\Z" access="Anonymous,Authenticated" />
<sec:intercept-url pattern="\A/api/.*\Z" access="User,Admin" />
<sec:intercept-url pattern="\A/plugin/.*\Z" access="User,Admin" />

As you can see there were images that needed to be accessed in our repository, and we also needed to allow the user to run CDA queries. There are a number of javascript and css files also required for CDF to render our dashboard correctly. Notice that although we allowed access to CDA, we did not grant access to the CDA previewer or editor.

The resulting dashboard can be found by clicking this link (you can see that the username and password are being passed in the URL).

This principle can be applied to different guest user roles, where a subset of visitors will have access to one part of the server, and others another, allowing for a much more controlled approach to how to grant access to your server.

Filling in the gaps

Today I saw Harris Ward asking a question I’ve seen pop up over and over again. And somehow the solution is never obvious.

The problem is how to fill in the gaps when we have nulls in our data stream. For example, our data set looks like:

And we want to replace all those nulls by the last non-null value we’ve seen in the data set: all non null values stay as is, but Mar, Apr and May get the value 20, Jul gets 50 and Sep, Oct and Nov get 40.

If we’re ingesting the data from a CSV, Excel or XML file, it’s easy to achieve this (ever wondered what the “Repeat” flag means in the fields tab of the Text File input step?).

But suppose our data doesn’t come from files. We need to implement the equivalent of the repeat flag inside a PDI transformation.

Invariably, when any of us encounters this requirement we all go through the same attempts. Let’s use the Analytical query step! Or the Group by step with the “pass all rows” option. Or the NVL operation of the Calculator step. And then we learn these all fail.

This is perhaps one of the best examples of a case where a JavaScript step is the best option, because it lets us persist variables between rows and control their values at will.

So, to achieve the desired result, here’s our proposed solution: add a JavaScript step and use the following code:
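A minimal sketch of such a script, assuming the value field is called sales as in the example data:

// tmp persists between rows in the Modified JavaScript step.
var tmp;

if (tmp == null) {
  tmp = 0;            // first row: start from 0 in case sales is null
}
if (sales != null) {
  tmp = sales;        // remember the last non-null value seen
}
sales = tmp;          // overwrite sales (see the output fields note below)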

Variable tmp is only initialized with the first row of data. From then on it’ll keep the last non null value of sales (or 0).

Sure, we’ve all heard the motto “Don’t use Javascript, it kills PDI performance”. But the occasional simple JS script doesn’t hurt that much in most cases; you’re likely to have a slower step elsewhere in your transformation (a DB read or write operation, for example). And, if performance is vital, you can always re-code this as a User defined Java class. But for all but the most extreme applications Javascript will work just fine.

Don’t forget to specify on the configuration of the output fields that you want to overwrite the value of the sales field, though:

And here’s the transformed data:

Exactly as intended.

Download the sample transformation here.

Happy Pi Day!

Today is Pi Day, March 14 or 3.14 (using a rather awkward M.dd format).

So this morning, the following dialog occurred within our team:

– Happy Pi Day!
– You know what we should do? A PDI Monte Carlo simulator to calculate Pi…
– Uuuuh, nice. Yeah, I think I can do that.
– … without scripting steps.
– Challenge accepted!

Well, happy Pi Day everyone, and behold our awesome (though of limited use) Monte Carlo simulator to calculate an approximation of Pi. It uses only the following steps:

  • Generate rows
  • Generate random value
  • Calculator (x2)
  • Add sequence (x2)
  • Filter rows (x2)
  • Write to log

It runs “forever” and gives updated results on the logs for each 1.27 Million rows (approximately).

Remark: forever means until you reach 1b rows and the sequence counters roll over.

Oh, you may want to run it with Minimal Logging only.

Happy Pi Day!

Download here

Status change fact table – Part 5 (Conclusion)

(previous parts of this series are here: part 1, part 2, part 3, part 4)

The fact table described in the previous parts of this article allows us to count how many objects entered or exited a given state in a certain time period.

However, if we want to count how many objects are in a given state on a given point in time, we need to count all In and Out events since the beginning of time.

This is problematic and doesn’t scale well, as more calculations are required as more data is ingested.

However, we can add a Periodic Snapshot fact table to the mix and have those running totals counted for each state on each day. As such, questions such as “How many objects were Out for delivery on day X” can be answered by looking at a specific snapshot date and grouping all rows that have that specific attribute.

Even if we include attributes such as the age of objects or events, as we did in the implementation of the status change fact table, we can still keep track of everything.

In brief, the snapshot could be implemented by doing something like the following:

  • Take all rows for the snapshot on day D;
  • Increment snapshot date to D+1; increment all ages by 1;
  • Take all events from the status change table for day D+1
  • Add the two sets of events from steps 2 and 3
  • Insert the data into a new partition of the snapshot table.

The snapshot table will have a very high cardinality, especially because we will most likely need to keep it atomic, without aggregation on any attribute. Keeping the snapshot partitioned lets us maintain the snapshot more easily, allowing us to delete specific partitions when the data is found to be incorrect.

Bear in mind that, as with any Periodic Snapshot algorithm, it is highly sensitive to data errors. Any incorrect number in one day will be propagated into the future, and the only practical solution is to delete all snapshot data from the first day the totals are wrong and reprocess a significant amount of data.

But, with all its flaws, it’s still our best shot at keeping track of all measurable values within our data.

In conclusion, this model implements and extends functionality that used to be achieved by an Accumulating Snapshot. It features the ability to track an arbitrary number of stages of a process, and doesn’t require updates of past data, making it easy to implement in a Hadoop environment. It can be used to track data pertaining to shipments and deliveries, bug tracking software, or any other business process where objects go from one state to the next and we need to track their path.