All posts in “mdx”

Fizz buzz

So, today is the day everybody should blog about something really stupid…

Ok, so here’s ours: Fizz Buzz in MDX.

Fizz Buzz, for those that don’t know it, is one of those “tests” interviewers love to see whether a candidate can code or not. It’s useless as pretty much everybody knows it by now, but that’s besides the point.

There are a number of examples of “how to implement Fizz Buzz in language X”. Some are quite lame, others more impressive.

So, in the spirit of April Fools, here’s our Fizz Buzz implementation, in MDX. Run it against SteelWheelsSales, or any cube that has a dimension called [Time].


WITH

member [Time].[0] as 0
member [Time].[1] as 1
member [Time].[2] as 2
member [Time].[3] as 3
member [Time].[4] as 4
member [Time].[5] as 5
member [Time].[6] as 6
member [Time].[7] as 7
member [Time].[8] as 8
member [Time].[9] as 9
set DIGITS as { [Time].[0], [Time].[1], [Time].[2], [Time].[3], [Time].[4],
[Time].[5], [Time].[6], [Time].[7], [Time].[8], [Time].[9] }

set NUMBERS as Generate( DIGITS, Generate( DIGITS, DIGITS, ALL), ALL)

member [Measures].[Ordinal] as NUMBERS.CurrentOrdinal

member [Measures].[Fizz] as Iif( [Measures].[Ordinal] = Round( [Measures].[Ordinal]/3 ) *3, "fizz", "")
member [Measures].[Buzz] as Iif( [Measures].[Ordinal] = Round( [Measures].[Ordinal]/5 ) *5, "buzz", "")
member [Measures].[FB] as [Measures].[Fizz] || [Measures].[Buzz]

member [Measures].[FizzBuzz] as Iif( [Measures].[FB] = "", [Measures].[Ordinal], [Measures].[FB] )

SELECT
NUMBERS on Rows,
{ [Measures].[Ordinal], [Measures].[FizzBuzz] } on Columns

FROM
[SteelWheelsSales]

Although a trivial and quite futile exercise, it does have one feature that is worth mentioning: the All argument in the Generate function. Generate applies each element of the first set to each element of the second set. However, without the ALL argument, duplicates from the resulting set are removed, which results in us getting only a 10 element set.

Non-identifiable data: filtered measures in Mondrian

Here’s an interesting requirement (fictional, but inspired by a real world scenario):

We provide a mobile phone based service. People sign in, we know what phone, OS and browser they use, and provide them some uber feature they all love. It works great and we now have a few million users. We want to publish some service statistics, open to our customers. Such as how many people use the service in a given town? what’s the distribution of phone brands/models? what percentage of users uses a certain combination of mobile phone, operating system and browser? The problem resides in the following: we must respect users’ privacy and therefore cannot publish information that allows any given user to be identifiable. As such, we can only publish metrics if there’s a minimum of 10 users within any given category.

It’s easy to understand why this is relevant. If in a 500 people town only 1 person has a 1st generation iPhone, anybody with access to the data could identify that person’s usage pattern. However, if there’s a large number of users, then one individual’s actions are not easily identifiable.

The problem is: how can we implement that behaviour in a Mondrian cube?

Well, as it happens, there’s a way. To make matters easier to replicate, we’ll be using SteelWheelsSales, and will be limiting access to Sales information based on the Markets dimension, trying to provide aggregated Sales values for any given Market, Country or City, but without allowing a single invoice to be identifiable (well, at least not easily).

Part 1: conditionally display measures, based on fact counts

This is the core of the solution. The actual criteria to be implemented can be changed later, but first we need to prove we can even do this.

The idea is the following: Given two measures, [Sales] and [Count], define a CalculatedMember that only displays safe values; then, remove access to some of the measures

So we start by adding the measure definitions to our SteelWheelsSales cube:

  <Measure name="Count" column="TOTALPRICE" formatString="#,###" aggregator="count">
  </Measure>
  <CalculatedMember name="Curated Sales" dimension="Measures">
    <Formula>
      <![CDATA[
        Iif( [Measures].[Count] < 10, null, [Measures].[Sales] )
      ]]>
    </Formula>
  </CalculatedMember>

The [Curated Sales] measuge will only be visible if it’s supported by at least 10 facts from the fact table.

We now enable security by adding

  <Role name="Authenticated">
    <SchemaGrant access="none">
    </SchemaGrant>
  </Role>
  <Role name="Administrator">
    <SchemaGrant access="all">
    </SchemaGrant>
  </Role>
  <Role name="restricted">
    <SchemaGrant access="none">
      <CubeGrant cube="SteelWheelsSales" access="all">
        <HierarchyGrant hierarchy="[Measures]" access="custom">
          <MemberGrant member="[Measures].[Curated Sales]" access="all">
          </MemberGrant>
          <MemberGrant member="[Measures].[Quantity]" access="none">
          </MemberGrant>
          <MemberGrant member="[Measures].[Sales]" access="none">
          </MemberGrant>
          <MemberGrant member="[Measures].[Count]" access="none">
          </MemberGrant>
        </HierarchyGrant>
      </CubeGrant>
    </SchemaGrant>
  </Role>

With this we explicitly deny access to everybody, and only grant conditional access to users with the restricted role.

A quick remark: when setting the access to [Measures] as custom, access to any measure that is not explicitly granted is denied. Which means, even if we don’t explicitly deny access to the [Sales], [Quantity] and [Count] measures, they won’t be accessible to the user. However, as a good practice, I prefer to explicitly grant/deny access to every measure, which may become a bit of a headache if the cube has a lot of measures.

The important bit here is the fact that even though the formula references measures are not accessible, it still works (much to my surprise, I was fearing a Mondrian error in the logs which would make this approach a dead-end).

With these new definitions in our cube, the query

SELECT
  Descendants( [Markets].[EMEA].[Belgium], 2) on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

will display only data for Brussels, whereas Charleroi will be displayed as null. This is because the Count of rows that make up Charleroi is 8 only.

So we’re done, right? Well, not quite.

Part 2: partial information can be as dangerous

In the scenario above, Belgium has 2 cities. One is deemed unsafe and is therefore hidden, but the other is safe, so it’s visible.

However, the total for the country will still be considered safe and as such the query

SELECT
  [Markets].[EMEA].[Belgium] on Rows,
  [Measures].[Curated Sales] on Columns
FROM [SteelWheelsSales]

displays the correct total for Belgium. Subtracting the value for Brussels from the total for Belgium we get the correct value for Charleroi, which should be hidden. So we’re not clear yet.

To add to the confusion, we need a solution that scales without damaging performance a lot. So, we want to avoid querying the lowest level of the dimension if we’re only querying at the country level.

Here’s a possible set of criteria:
– A member of the dimension is deemed unsafe if it has a fact count of at least 10;
– If any sibling of a member is unsafe, then the member is considered unsafe.

That way, because Charleroi has a too small underlying dataset, Brussels would also be considered unsafe to display, therefore preventing simple arithmetic to crack the numbers:

  <CalculatedMember name="Curated Sales" dimension="Measures">
   <Formula>
    <![CDATA[
     Iif( Min( [Markets].CurrentMember.Siblings, [Measures].[Count] ) < 10, null, [Measures].[Sales] )
    ]]>
   </Formula>
  </CalculatedMember>

We can now query the value for Belgium, and it’s correctly displayed, but we don’t show the values for any of its cities, as at least one of them is considered unsafe.

Ok, but maybe we’re throwing away the baby with the bath water. For example, if we try to see the value for all cities in California,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  {[Measures].[Curated Sales], [Measures].[Count]} on Columns
FROM [SteelWheelsSales] WHERE [Time].[2004]

will display nothing, as there are unsafe cities; however, there are 5 of those. Hiding only these few unsafe cities is still safe, as no single city value can be determined. Maybe we can get a better algorithm that still achieves the same result without limiting too much.

Part 3: safety in numbers; unsafe data can be made safe if there’s a lot of them.

Let’s refine our criteria:
– A member is deemed unsafe if it has a low count, of less than 10;
– If no sibling has a low count, the member is safe;
– But if the member has many unsafe siblings (more than 3) it may still be considered safe;

So, we define the measure as

  <CalculatedMember name="Curated Sales" dimension="Measures">
   <Formula>
    <![CDATA[
     Iif(
        [Measures].[Count] > 10
       AND (
          Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) = 0
        OR Count( Filter( [Markets].CurrentMember.Siblings, [Measures].[Count] < 10 ) ) > 3 
       )
     , [Measures].[Sales], null )
    ]]>
   </Formula>
  </CalculatedMember>

With this definition, the query

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales]

will not display the value for sales for any city of California, as there’s only one unsafe city; however,

SELECT
  [Markets].[NA].[USA].[CA].Children on Rows,
  { [Measures].[Curated Sales], [Measures].[Count] } on Columns
FROM [SteelWheelsSales] WHERE [Time].[2004]

will display the value for most cities, as there’s a significant number of unsafe cities and therefore their real values are obscured by their numbers. Even if we subtract the known city values from the CA total we only get the sum of those small cities, not each one individually.

However…

Part 4: caveat emptor

Lets not go overboard with this approach. It’s not perfect. In fact, it’s perfectly feasible that a given city is considered safe for [Time].[All Years], [Time].[2003] and [Time].[2004][Time].[2005], in which case it would still be possible to determine the unsafe value by asking the right questions to Mondrian.

Determining the right set of criteria is very tough and you should probably ask your on duty statistician to help defining them and spend a significant amount of time testing the algorithm.

But it’s a start.