Everyone familiar with data warehousing knows about the date dimension. It’s one of the first steps in the creation of an ETL and exists in almost every data warehouse.
Despite how ubiquitous it is, many still fall for some common pitfalls such as:
- Weeks shouldn’t be children of months
- Missing or wrong week of year
These and other common pitfalls are described in a previous post: https://ubiquis.co.uk/dwh/date-dimensions/
Another commonly faced pitfall is that the date dimension often requires a field containing the relevant national public holidays to make it easy to correlate with an increase or decrease in sales. However it is not trivial to add these to your date dimension.
Here, we tackle this issue by creating an ETL that demonstrates the calculation of public holidays for France with particular focus on Easter and related holidays without neglecting fixed date holidays.
What’s a Holiday?
For the purpose of this exercise, we divide holidays into three categories: fixed date, variable date and Easter based date holidays. We consider fixed date holidays to be those that occur on the same date every year; variable dates are those that do not have a set date, like Thanksgiving, which occurs on the 4th Thursday in November; Easter based dates are observed a fixed number of days before or after Easter.
Fixed Date Holidays
A naive approach to generating fixed date holidays in PDI uses a Modified JavaScript value containing a condition to set the holiday field value if the day and month matches, e.g.,
is_holiday = 0;
holiday_description = "";
if (month_number == 12 && day_of_month == 25) {
is_holiday = 1;
holiday_description = "Christmas Day";
}
This is good if you only have a few fixed holidays but the code quickly becomes unwieldy as their numbers grow. Ideally this will be generalised into a single condition using variables for month_number
and day_of_month
.
Variable Date Holidays
Holidays defined as “the first Monday in May”, as is the case with May Day Bank Holiday in the UK, can be calculated in a similar way to fixed date holidays with an added condition week_of_month == 1
.
Memorial Day, which occurs on the last Monday in May will require you to calculate if it is the last week of the month, in JavaScript it may look like this:
if (month_number == 5 && day_of_month >= 25 && day_of_week = 1){
is_holiday = 1;
holiday_description = "Memorial Day";
}
This only works for months with 31 days, 30 day months require day_of_month >= 24
instead; some tweaking would also be necessary in February, which doesn’t have a fixed number of days.
Easter based holidays
Some holidays rely on the day that Easter falls that year, eg. Good Friday, Easter Monday, Whit Sunday.
The complication with calculating these dates is that Easter is not a fixed date.
Easter falls on the Sunday following the full moon on or after the northern hemisphere spring (vernal) equinox. However, the vernal equinox and the full moon are not determined by astronomical observation. The vernal equinox is fixed to fall on 21 March.
Luckily there is a calculation for Easter Sunday called Computus (Latin for “computation”) The name has been used for this procedure since the early Middle Ages, as it was considered the most important computation of the age. We used the version found on wikipedia: https://en.wikipedia.org/wiki/Computus#Gauss’_Easter_algorithm
Solution
We have produced a generic solution to calculate fixed date and Easter based holidays for France. We have not included variable date holidays as there are none in the French calendar.
Additionally, to get a better understanding of how different techniques could be implemented and how they perform; we created a transformation using both a Modified JavaScript value and a User defined Java class in PDI.
The following is the general algorithm we wrote for the purpose of generating a field containing the holidays for our date dimension:
- On the first day of the year
- Store all of the fixed date holiday descriptions and their dates in a dedicated object for later reuse.
- calculate the date that Easter will fall on and store it for later use.
- As we are interested in Easter Monday, rather than Easter Sunday, we add a day to the previously calculated date and store it in our object.
- If the holiday is in our object, set the flag
is_bank_holiday
to 1,bank_holiday_desc_en
to the name of that holiday in English andbank_holiday_desc_fr
to the name of that holiday in French. - Output the three new fields:
is_bank_holiday
,bank_holiday_desc_en
,bank_holiday_desc_fr
The Result
The algorithm stated above has been introduced to our standard date dimension to give a transformation that looks like:
The effect of calculating holidays as we have is shown in the table snippet below (some columns have been omitted and others have been renamed to make it easier to see the result).
From the table, we can see that Good Friday has been successfully calculated as two days before Easter Sunday and Easter Monday occurs the day after Easter, as expected.
The fixed date holidays: Labour Day and Victory Day 1945 are also present and correct.
Common Pitfalls
In Java and JavaScript (and almost all other languages) month numbers do not go from 1 to 12 like the output of Gauss’s Easter algorithm does. Instead the months run from 0 to 11, where 0 is January and 11 is December. You’ll need to add or subtract 1 to convert from or to Java / Javascript dates.
What Next
If two holidays coincide, only the last occurrence will be written to the stream. One solution to this would be to concatenate the strings of holidays that occur on the same day.
If you want to include holidays for multiple countries you’ll need an additional column for each country to avoid confusing days marked as holidays in France when analysing sales data from Germany, for example.
If you need many countries, you’re probably better off by having a holidays dimension that accounts for all countries and has a key on the fact table.
In some countries, like the UK, if a bank holiday is on a weekend, a ‘substitute’ weekday becomes a bank holiday, normally the following Monday. Accounting for this would be valuable in the UK and USA.
Where Can I Get the Transformation?
Our implementation of a date dimension with Easter based holidays and fixed date holidays is available to download as a PDI transformation here: dim_date_holidays.ktr
2 Comments
Amazing post. For this problem i use this API https://holidayapi.com/ about the holiday per country do you think is enough with one column for country?
Thanks a lot! Using an api is also a good option, especially if you need the holidays for many countries and to account for countries changing their public holidays in the future. However you will have to adhere to their terms of use. One column per country can work if you concatenate when there are multiple holidays on the same day. The best option for multiple countries would be to have a holiday dimension with a key on the fact table, which means there’s no need to concatenate holidays together. It will always depend on your use case and what the client needs.