Non-identifiable data: filtered measures in Mondrian

Here’s an interesting requirement (fictional, but inspired by a real world scenario):

We provide a mobile phone based service. People sign in, we know what phone, OS and browser they use, and provide them some uber feature they all love. It works great and we now have a few million users. We want to publish some service statistics, open to our customers. Such as how many people use the service in a given town? what’s the distribution of phone brands/models? what percentage of users uses a certain combination of mobile phone, operating system and browser? The problem resides in the following: we must respect users’ privacy and therefore cannot publish information that allows any given user to be identifiable. As such, we can only publish metrics if there’s a minimum of 10 users within any given category.

It’s easy to understand why this is relevant. If in a 500 people town only 1 person has a 1st generation iPhone, anybody with access to the data could identify that person’s usage pattern. However, if there’s a large number of users, then one individual’s actions are not easily identifiable.

The problem is: how can we implement that behaviour in a Mondrian cube?

Well, as it happens, there’s a way. To make matters easier to replicate, we’ll be using SteelWheelsSales, and will be limiting access to Sales information based on the Markets dimension, trying to provide aggregated Sales values for any given Market, Country or City, but without allowing a single invoice to be identifiable (well, at least not easily).

Part 1: conditionally display measures, based on fact counts

This is the core of the solution. The actual criteria to be implemented can be changed later, but first we need to prove we can even do this.

The idea is the following: Given two measures, [Sales] and [Count], define a CalculatedMember that only displays safe values; then, remove access to some of the measures

So we start by adding the measure definitions to our SteelWheelsSales cube:

  <Measure name="Count" column="TOTALPRICE" formatString="#,###" aggregator="count">
  </Measure>
  <CalculatedMember name="Curated Sales" dimension="Measures">
    <Formula>
      <![CDATA[
        Iif( [Measures].[Count] < 10, null, [Measures].[Sales] )
      ]]>
    </Formula>
  </CalculatedMember>

The [Curated Sales] measuge will only be visible if it’s supported by at least 10 facts from the fact table.

We now enable security by adding

  <Role name="Authenticated">
    <SchemaGrant access="none">
    </SchemaGrant>
  </Role>
  <Role name="Administrator">
    <SchemaGrant access="all">
    </SchemaGrant>
  </Role>
  <Role name="restricted">
    <SchemaGrant access="none">
      <CubeGrant cube="SteelWheelsSales" access="all">
        <HierarchyGrant hierarchy="[Measures]" access="custom">
          <MemberGrant member="[Measures].[Curated Sales]" access="all">
          </MemberGrant>
          <MemberGrant member="[Measures].[Quantity]" access="none">
          </MemberGrant>
          <MemberGrant member="[Measures].[Sales]" access="none">
          </MemberGrant>
          <MemberGrant member="[Measures].[Count]" access="none">
          </MemberGrant>
        </HierarchyGrant>
      </CubeGrant>
    </SchemaGrant>
  </Role>

With this we explicitly deny access to everybody, and only grant conditional access to users with the restricted role.

A quick remark: when setting the access to [Measures] as custom, access to any measure that is not explicitly granted is denied. Which means, even if we don’t explicitly deny access to the [Sales], [Quantity] and [Count] measures, they won’t be accessible to the user. However, as a good practice, I prefer to explicitly grant/deny access to every measure, which may become a bit of a headache if the cube has a lot of measures.

The important bit here is the fact that even though the formula references measures are not accessible, it still works (much to my surprise, I was fearing a Mondrian error in the logs which would make this approach a dead-end).

With these new definitions in our cube, the query

SELECT
  Descendants( [Markets].[EMEA].[Belgium], 2) on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

will display only data for Brussels, whereas Charleroi will be displayed as null. This is because the Count of rows that make up Charleroi is 8 only.

So we’re done, right? Well, not quite.

Part 2: partial information can be as dangerous

In the scenario above, Belgium has 2 cities. One is deemed unsafe and is therefore hidden, but the other is safe, so it’s visible.

However, the total for the country will still be considered safe and as such the query

SELECT
  [Markets].[EMEA].[Belgium] on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

displays the correct total for Belgium. Subtracting the value for Brussels from the total for Belgium we get the correct value for Charleroi, which should be hidden. So we’re not clear yet.

To add to the confusion, we need a solution that scales without damaging performance a lot. So, we want to avoid querying the lowest level of the dimension if we’re only querying at the country level.

Here’s a possible set of criteria:
– A member of the dimension is deemed unsafe if it has a fact count of at least 10;
– If any sibling of a member is unsafe, then the member is considered unsafe.

That way, because Charleroi has a too small underlying dataset, Brussels would also be considered unsafe to display, therefore preventing simple arithmetic to crack the numbers:

  <CalculatedMember name="Curated Sales" dimension="Measures">
   <Formula>
    <![CDATA[
     Iif( Min( [Markets].CurrentMember.Siblings, [Measures].[Count] ) < 10, null, [Measures].[Sales] )
    ]]>
   </Formula>
  </CalculatedMember>

We can now query the value for Belgium, and it’s correctly displayed, but we don’t show the values for any of its cities, as at least one of them is considered unsafe.

Ok, but maybe we’re throwing away the baby with the bath water. For example, if we try to see the value for all cities in California,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  {[Measures].[Curated Sales], [Measures].[Count]} on Columns
FROM [SteelWheelsSales]
WHERE [Time].[2004]

will display nothing, as there are unsafe cities; however, there are 5 of those. Hiding only these few unsafe cities is still safe, as no single city value can be determined. Maybe we can get a better algorithm that still achieves the same result without limiting too much.

Part 3: safety in numbers; unsafe data can be made safe if there’s a lot of them.

Let’s refine our criteria:
– A member is deemed unsafe if it has a low count, of less than 10;
– If no sibling has a low count, the member is safe;
– But if the member has many unsafe siblings (more than 3) it may still be considered safe;

So, we define the measure as

  <CalculatedMember name="Curated Sales" dimension="Measures">
   <Formula>
    <![CDATA[
     Iif(
        [Measures].[Count] > 10
       AND (
          Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) = 0
        OR Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) > 3 
       )
     , [Measures].[Sales], null )
    ]]>
   </Formula>
  </CalculatedMember>

With this definition, the query

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales]

will not display the value for sales for any city of California, as there’s only one unsafe city; however,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales]
WHERE [Time].[2004]

will display the value for most cities, as there’s a significant number of unsafe cities and therefore their real values are obscured by their numbers. Even if we subtract the known city values from the CA total we only get the sum of those small cities, not each one individually.

However…

Part 4: caveat emptor

Lets not go overboard with this approach. It’s not perfect. In fact, it’s perfectly feasible that a given city is considered safe for [Time].[All Years], [Time].[2003] and [Time].[2004][Time].[2005], in which case it would still be possible to determine the unsafe value by asking the right questions to Mondrian.

Determining the right set of criteria is very tough and you should probably ask your on duty statistician to help defining them and spend a significant amount of time testing the algorithm.

But it’s a start.

Back to blog

2 Comments

  • Alexander Schurman says:

    Great Post!!!,

    I liked how the ROLE did the trick so the advance users can not access the underlying original measure.

    Just as an extension or added comment to the post, [Measures].[Count] is directly related to Fact data.
    In some cases, we maybe have thousands of FACT rows under a member and still should not be shown based on the model restrictions.
    Following the same example explained in in post, same idea can be applied based on the Dimension information like [DimensionX].CurrenMember.Children.Count where showing the value can be linked to the number of Children below (Please note that based on the ROLE, the view user may not have access to the Children data as it is DETAILS).

  • Nelson Sousa says:

    The criteria can be fine tuned, of course. As long as you can write them in proper MDX, it won’t be a problem.

    One thing I tried to achieve here was preventing Mondrian from querying a lower level of the hierarchy unless explictly requested. By setting the criteria at the Siblings level, we ensure Mondrian only queries the City level when the query is at the City level. If we were to use .Children, we could have Mondrian firing queries one level below and cardinality of the levels could hamper performance.

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

  • Graph Databases: Loading Data with Neo4j

    Graph databases are becoming more popular as a way of storing and analysing large connected datasets. Neo4j is a...

    Read more
  • Date Dimension Revisited: Bank Holidays

    Everyone familiar with data warehousing knows about the date dimension. It’s one of the first steps in the creation...

    Read more
  • Setup Pentaho BA Server to use SSL Certificates

    SSL Certificate Java Truststore Setup SSL, or Secure Socket Layer, is a technology which allows web browsers and web...

    Read more